Our goal is to explore, analyze and model a dataset containing information on approximately 12,000 commercially available wines.
A large wine manufacturer is studying the data in order to predict the number of wine cases ordered based upon the wine characteristics. If the wine manufacturer can predict the number of cases, then that manufacturer will be able to adjust their wine offering to maximize sales.
Use a “count regression model for your predictions”
Below we’ll display a few basic EDA techniques to gain insight into our wine dataset.
We see that most variables are somewhat normally distributed.
The distribution profiles show right skew in variables ‘AcidIndex’, and ‘STARS’.
Also we notice that some of these variables like STARS, Target, LabelAppeal etc. have discrete values, meaning they are categorical.
There are 12,795 rows and 15 columns (features). Of all 15 columns, 0 are discrete, 15 are continuous, and 0 are all missing. There are 8,200 missing values out of 191,925 data points.
There are some wine quality measures that are negative that should not be. I will take take the absolute value of these. Another good reason to take the absolute value, is that I don’t know the methodology of how this data was collected and put together.
train$FixedAcidity <- abs(train$FixedAcidity)
test$FixedAcidity <- abs(test$FixedAcidity)
train$VolatileAcidity <- abs(train$VolatileAcidity)
test$VolatileAcidity <- abs(test$VolatileAcidity)
train$CitricAcid <- abs(train$CitricAcid)
test$CitricAcid <- abs(test$CitricAcid)
train$ResidualSugar <- abs(train$ResidualSugar)
test$ResidualSugar <- abs(test$ResidualSugar)
train$Chlorides <- abs(train$Chlorides)
test$Chlorides <- abs(test$Chlorides)
train$FreeSulfurDioxide <- abs(train$FreeSulfurDioxide)
test$FreeSulfurDioxide <- abs(test$FreeSulfurDioxide)
train$TotalSulfurDioxide <- abs(train$TotalSulfurDioxide)
test$TotalSulfurDioxide <- abs(test$TotalSulfurDioxide)
train$Sulphates <- abs(train$Sulphates)
test$Sulphates <- abs(test$Sulphates)
For LabelAppeal, we will add the min.
train$LabelAppeal <- train$LabelAppeal + abs(min(train$LabelAppeal))
test$LabelAppeal <- test$LabelAppeal + abs(min(test$LabelAppeal))
plot_missing(train)
We need to make decisions about pH, ResidualSugar, Chlorides, Free SulfurDioxide, Alcohol, TotalSulfurDioxide, Sulphates, and STARS. After reviewing the test set, we are missing the same variables at about the same rate.
With 26% missing it is possible that this is a reult of simply critics not getting around to reviewing, or it potentially being subpar, lets just assign N/A to 0.
train$STARS[is.na(train$STARS)] <- 0
test$STARS[is.na(test$STARS)] <- 0
Elsewhere, let’s look at using MICE for imputation.
mice_imputes <- mice(train, m = 2, maxit = 2, print = FALSE)
densityplot(mice_imputes)
We can see that each of the remaining variables with missing values seem to be MAR, as the mice imputation distributions roughly match the existing.
We’ll also run the mice imputation again on both the train and test set. Instead of using it for our models, however, we’ll simplify our run and fill in our data. It doesn’t account for variability, but it should do fine for the sake of this exercise.
mice_train <- mice(train, m = 1, maxit = 1, print = FALSE)
train <- complete(mice_train)
mice_test <- mice(test, m = 1, maxit = 1, print = FALSE)
test <- complete(mice_test)
plot_correlation(train)
And finally, after our analysis, (and so we can use it in in our model), we’ll update STARS to become a factor variable.
train$STARS <- as.factor(train$STARS)
test$STARS <- as.factor(test$STARS)
set.seed(123)
split <- sample(1:nrow(train), .8*nrow(train))
holdout <- train[-split,]
train <- train[split,]
p1 <- glm(TARGET ~ . , data=train, family="poisson")
#summary(p1)
p2 <- glm(TARGET ~ VolatileAcidity + Chlorides + TotalSulfurDioxide + Sulphates + Alcohol + LabelAppeal + AcidIndex + STARS, data=train, family="poisson")
#summary(p2)
nb1 <- glm.nb(TARGET ~ . , data=train)
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
#summary(nb1)
nb2 <- glm.nb(TARGET ~ VolatileAcidity + Chlorides + TotalSulfurDioxide + Sulphates + Alcohol + LabelAppeal + AcidIndex + STARS, data=train)
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
#summary(nb2)
We see that there are an inflated number of 0s in our counts target, so let’s try this model across the negative binomial Distribution.
zMod1 <- zeroinfl(TARGET ~ . |STARS, data=train, dist="negbin")
#summary(zMod1)
zMod2 <- zeroinfl(TARGET ~ VolatileAcidity + Chlorides + Density + Alcohol + LabelAppeal + AcidIndex | STARS, data=train, dist="negbin")
#summary(zMod2)
mlr1 <- lm(TARGET ~ ., data = train)
summary(mlr1)
##
## Call:
## lm(formula = TARGET ~ ., data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.7920 -0.8635 0.0247 0.8380 5.6943
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.842e+00 4.951e-01 5.741 9.68e-09 ***
## ï..INDEX 1.604e-06 2.776e-06 0.578 0.56347
## FixedAcidity 1.625e-04 2.616e-03 0.062 0.95048
## VolatileAcidity -8.947e-02 2.325e-02 -3.849 0.00012 ***
## CitricAcid 3.784e-02 2.127e-02 1.779 0.07521 .
## ResidualSugar -5.529e-04 5.150e-04 -1.074 0.28298
## Chlorides -5.326e-02 5.516e-02 -0.966 0.33431
## FreeSulfurDioxide 2.561e-04 1.206e-04 2.123 0.03377 *
## TotalSulfurDioxide 2.403e-04 7.886e-05 3.047 0.00231 **
## Density -8.643e-01 4.848e-01 -1.783 0.07468 .
## pH -3.532e-02 1.917e-02 -1.842 0.06545 .
## Sulphates -3.029e-02 1.953e-02 -1.551 0.12091
## Alcohol 9.438e-03 3.486e-03 2.708 0.00678 **
## LabelAppeal 4.641e-01 1.516e-02 30.616 < 2e-16 ***
## AcidIndex -2.012e-01 1.005e-02 -20.017 < 2e-16 ***
## STARS1 1.369e+00 3.680e-02 37.209 < 2e-16 ***
## STARS2 2.404e+00 3.573e-02 67.269 < 2e-16 ***
## STARS3 2.993e+00 4.130e-02 72.485 < 2e-16 ***
## STARS4 3.683e+00 6.564e-02 56.109 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.305 on 10217 degrees of freedom
## Multiple R-squared: 0.542, Adjusted R-squared: 0.5412
## F-statistic: 671.6 on 18 and 10217 DF, p-value: < 2.2e-16
mlr2 <- lm(TARGET ~ . -CitricAcid -FixedAcidity -Chlorides - ResidualSugar -Density - TotalSulfurDioxide - FreeSulfurDioxide - Alcohol -pH -Sulphates, data = train)
summary(mlr2)
##
## Call:
## lm(formula = TARGET ~ . - CitricAcid - FixedAcidity - Chlorides -
## ResidualSugar - Density - TotalSulfurDioxide - FreeSulfurDioxide -
## Alcohol - pH - Sulphates, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.8697 -0.8713 0.0385 0.8455 5.7465
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.029e+00 9.077e-02 22.356 < 2e-16 ***
## ï..INDEX 1.643e-06 2.778e-06 0.591 0.554
## VolatileAcidity -9.158e-02 2.326e-02 -3.937 8.29e-05 ***
## LabelAppeal 4.635e-01 1.517e-02 30.555 < 2e-16 ***
## AcidIndex -2.027e-01 9.841e-03 -20.600 < 2e-16 ***
## STARS1 1.375e+00 3.681e-02 37.359 < 2e-16 ***
## STARS2 2.412e+00 3.573e-02 67.504 < 2e-16 ***
## STARS3 3.006e+00 4.126e-02 72.838 < 2e-16 ***
## STARS4 3.694e+00 6.562e-02 56.296 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.307 on 10227 degrees of freedom
## Multiple R-squared: 0.5404, Adjusted R-squared: 0.54
## F-statistic: 1503 on 8 and 10227 DF, p-value: < 2.2e-16
let’s test each of our models agains the validation set.
getMAE <- function(x) {
mean(abs(holdout$TARGET - x))
}
getRMSE <- function(x) {
sqrt(mean((holdout$TARGET - x)^2))
}
results <- data.frame(Model = c("Poisson1",
"Poisson2",
"NegBinom1",
"NegBinom2",
"ZeroCounts1",
"ZeroCounts2"),
MAE = c(getMAE(predict.lm(p1, holdout)),
getMAE(predict.lm(p2, holdout)),
getMAE(predict.lm(nb1, holdout)),
getMAE(predict.lm(nb2, holdout)),
getMAE(predict(zMod1, holdout,
type="response")),
getMAE(predict(zMod2, holdout,
type="response"))
),
RMSE = c(getRMSE(predict.lm(p1, holdout)),
getRMSE(predict.lm(p2, holdout)),
getRMSE(predict.lm(nb1, holdout)),
getRMSE(predict.lm(nb2, holdout)),
getRMSE(predict(zMod1, holdout,
type="response")),
getRMSE(predict(zMod2, holdout,
type="response"))
)
)
knitr::kable(results)
| Model | MAE | RMSE |
|---|---|---|
| Poisson1 | 2.228444 | 2.597023 |
| Poisson2 | 2.229372 | 2.597655 |
| NegBinom1 | 2.228444 | 2.597023 |
| NegBinom2 | 2.229372 | 2.597655 |
| ZeroCounts1 | 1.025186 | 1.315584 |
| ZeroCounts2 | 1.047941 | 1.361338 |
Zmod1 looks like the best model as far as performance goes but we were asked to use the “Poisson count regression model”
summary(zMod1)
##
## Call:
## zeroinfl(formula = TARGET ~ . | STARS, data = train, dist = "negbin")
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## -2.41561 -0.54954 0.01781 0.43257 2.54872
##
## Count model coefficients (negbin with log link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.171e+00 NA NA NA
## ï..INDEX 1.433e-08 NA NA NA
## FixedAcidity 1.060e-04 NA NA NA
## VolatileAcidity -1.619e-02 NA NA NA
## CitricAcid 4.441e-04 NA NA NA
## ResidualSugar -3.793e-05 NA NA NA
## Chlorides -1.073e-02 NA NA NA
## FreeSulfurDioxide 2.553e-05 NA NA NA
## TotalSulfurDioxide -8.436e-06 NA NA NA
## Density -3.190e-01 NA NA NA
## pH 8.940e-04 NA NA NA
## Sulphates 1.835e-03 NA NA NA
## Alcohol 6.254e-03 NA NA NA
## LabelAppeal 2.215e-01 NA NA NA
## AcidIndex -2.991e-02 NA NA NA
## STARS1 5.914e-02 NA NA NA
## STARS2 1.872e-01 NA NA NA
## STARS3 2.884e-01 NA NA NA
## STARS4 3.831e-01 NA NA NA
## Log(theta) 1.784e+01 NA NA NA
##
## Zero-inflation model coefficients (binomial with logit link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.3434 NA NA NA
## STARS1 -1.9887 NA NA NA
## STARS2 -5.8652 NA NA NA
## STARS3 -20.0042 NA NA NA
## STARS4 -20.0033 NA NA NA
##
## Theta = 56015999.7204
## Number of iterations in BFGS optimization: 38
## Log-likelihood: -1.664e+04 on 25 Df
We make our final predictions and create a dataframe with the prediction.
### Prediction
# Predict...
predictions <- predict(p2, train)
train$TARGET <- predictions
write.csv(test, 'wine_predictions.csv', row.names=FALSE)
head(train)
```