Problem

Our goal is to explore, analyze and model a dataset containing information on approximately 12,000 commercially available wines.

A large wine manufacturer is studying the data in order to predict the number of wine cases ordered based upon the wine characteristics. If the wine manufacturer can predict the number of cases, then it will be able to adjust its wine offering to maximize sales.

We are asked to use a count regression model for our predictions.

1. DATA EXPLORATION

Below we’ll display a few basic EDA techniques to gain insight into our wine dataset.

We see that most variables are roughly normally distributed.

The distribution profiles show right skew in ‘AcidIndex’ and ‘STARS’.

We also notice that some variables, such as STARS, TARGET, and LabelAppeal, take only a small set of discrete values, suggesting they are better treated as categorical (or ordinal) rather than continuous.

Basic Statistics

There are 12,795 rows and 15 columns (features). All 15 columns are read in as continuous, none as discrete, and none are entirely missing. There are 8,200 missing values out of 191,925 data points.
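Summary counts like these can come from a single profiling call; a minimal sketch, assuming the DataExplorer package (the same package that supplies the plot_missing() and plot_correlation() calls used later):

library(DataExplorer)

# One-line profile: row/column counts, discrete vs. continuous columns,
# and total missing values
introduce(train)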

Histogram of Variables
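A sketch of the call that generates these histograms, again assuming DataExplorer:

# One histogram per continuous variable in the training set
plot_histogram(train)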

2. DATA PREPARATION

Negative Values

There are some wine quality measures that are negative when they should not be. I will take the absolute value of these. Another good reason to take the absolute value is that I don’t know the methodology of how this data was collected and assembled, so the sign may simply be an artifact.

# These measures should be non-negative; replace negatives with their absolute value
abs_cols <- c("FixedAcidity", "VolatileAcidity", "CitricAcid", "ResidualSugar",
              "Chlorides", "FreeSulfurDioxide", "TotalSulfurDioxide", "Sulphates")

train[abs_cols] <- lapply(train[abs_cols], abs)
test[abs_cols] <- lapply(test[abs_cols], abs)

LabelAppeal is centered around zero and takes negative values by design, so rather than an absolute value we shift it to start at zero by adding the absolute value of its minimum.

train$LabelAppeal <- train$LabelAppeal + abs(min(train$LabelAppeal))
test$LabelAppeal <- test$LabelAppeal + abs(min(test$LabelAppeal))

Plot and Review Missing

plot_missing(train)

We need to make decisions about pH, ResidualSugar, Chlorides, FreeSulfurDioxide, Alcohol, TotalSulfurDioxide, Sulphates, and STARS. After reviewing the test set, we find it is missing the same variables at about the same rates.

With 26% of STARS missing, it is possible that this simply reflects critics not getting around to reviewing a wine, or the wine being subpar and not worth rating; let’s just assign the NA values to 0.

train$STARS[is.na(train$STARS)] <- 0
test$STARS[is.na(test$STARS)] <- 0

Elsewhere, let’s look at using MICE for imputation.

mice_imputes <- mice(train, m = 2, maxit = 2, print = FALSE)
densityplot(mice_imputes)

We can see that each of the remaining variables with missing values appears to be missing at random (MAR), as the mice imputation densities roughly match the observed distributions.

We’ll then run the mice imputation again on both the train and test sets. Rather than carrying multiple imputed datasets through our models, however, we’ll simplify and fill in the data with a single completed copy. This discards the between-imputation variability that multiple imputation is designed to capture, but it should do fine for the sake of this exercise.

mice_train <-  mice(train, m = 1, maxit = 1, print = FALSE)
train <- complete(mice_train)

mice_test <- mice(test, m = 1, maxit = 1, print = FALSE)
test <- complete(mice_test)
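For reference, the fuller multiple-imputation workflow that does carry imputation uncertainty through would fit a model within each imputed dataset and pool the estimates; a sketch with an illustrative formula, assuming the standard mice API:

# Fit the same model on each imputed copy, then pool with Rubin's rules
fits <- with(mice_imputes, glm(TARGET ~ LabelAppeal + AcidIndex, family = "poisson"))
summary(pool(fits))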

Correlation Review

plot_correlation(train)

And finally, after our analysis (and so we can use it in our models), we’ll convert STARS to a factor variable.

train$STARS <- as.factor(train$STARS)
test$STARS <- as.factor(test$STARS)

3. BUILD MODELS

Create Holdout

set.seed(123)
split <- sample(1:nrow(train), .8*nrow(train))
  
holdout <- train[-split,]
train <- train[split,]

Model 1: Poisson

p1 <- glm(TARGET ~ . , data=train, family="poisson")
#summary(p1)

Model 2: Poisson Reduced

p2 <- glm(TARGET ~ VolatileAcidity + Chlorides + TotalSulfurDioxide + Sulphates + Alcohol + LabelAppeal + AcidIndex + STARS, data=train, family="poisson")
#summary(p2)
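A quick sanity check before moving on: if the Poisson fits well, the residual deviance should be close to its degrees of freedom, and a formal score test can confirm; a sketch, assuming the AER package is available:

# Rule of thumb: a ratio well above 1 suggests overdispersion
summary(p1)$deviance / summary(p1)$df.residual

# Formal overdispersion test (AER package)
library(AER)
dispersiontest(p1)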

Model 3: Negative Binomial

nb1 <- glm.nb(TARGET ~ . , data=train)
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached

## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
#summary(nb1)

Model 4: Negative Binomial Reduced

nb2 <- glm.nb(TARGET ~ VolatileAcidity + Chlorides + TotalSulfurDioxide + Sulphates + Alcohol + LabelAppeal + AcidIndex + STARS, data=train)
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached

## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
#summary(nb2)

Model 5: Zero-Inflated Negative Binomial

We see an inflated number of 0s in our count target, so let’s try a zero-inflated model with a negative binomial count distribution.
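We can check this directly by comparing the observed share of zeros to the share the fitted Poisson model expects; a minimal sketch:

# Observed proportion of zeros vs. the proportion implied by the Poisson fit
mean(train$TARGET == 0)
mean(dpois(0, fitted(p1)))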

zMod1 <- zeroinfl(TARGET ~ . |STARS, data=train, dist="negbin")
#summary(zMod1)

Model 6: Zero-Inflated Negative Binomial, Reduced

zMod2 <- zeroinfl(TARGET ~ VolatileAcidity + Chlorides + Density + Alcohol + LabelAppeal + AcidIndex | STARS, data=train, dist="negbin")
#summary(zMod2)
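To verify that the zero-inflation component earns its keep, pscl (the package behind zeroinfl()) provides a Vuong test for non-nested models; a sketch:

# A positive, significant statistic favors the first (zero-inflated) model
vuong(zMod1, nb1)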

Model 7: Multiple Linear Regression

mlr1 <- lm(TARGET ~ ., data = train)
summary(mlr1)
## 
## Call:
## lm(formula = TARGET ~ ., data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.7920 -0.8635  0.0247  0.8380  5.6943 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         2.842e+00  4.951e-01   5.741 9.68e-09 ***
## INDEX               1.604e-06  2.776e-06   0.578  0.56347    
## FixedAcidity        1.625e-04  2.616e-03   0.062  0.95048    
## VolatileAcidity    -8.947e-02  2.325e-02  -3.849  0.00012 ***
## CitricAcid          3.784e-02  2.127e-02   1.779  0.07521 .  
## ResidualSugar      -5.529e-04  5.150e-04  -1.074  0.28298    
## Chlorides          -5.326e-02  5.516e-02  -0.966  0.33431    
## FreeSulfurDioxide   2.561e-04  1.206e-04   2.123  0.03377 *  
## TotalSulfurDioxide  2.403e-04  7.886e-05   3.047  0.00231 ** 
## Density            -8.643e-01  4.848e-01  -1.783  0.07468 .  
## pH                 -3.532e-02  1.917e-02  -1.842  0.06545 .  
## Sulphates          -3.029e-02  1.953e-02  -1.551  0.12091    
## Alcohol             9.438e-03  3.486e-03   2.708  0.00678 ** 
## LabelAppeal         4.641e-01  1.516e-02  30.616  < 2e-16 ***
## AcidIndex          -2.012e-01  1.005e-02 -20.017  < 2e-16 ***
## STARS1              1.369e+00  3.680e-02  37.209  < 2e-16 ***
## STARS2              2.404e+00  3.573e-02  67.269  < 2e-16 ***
## STARS3              2.993e+00  4.130e-02  72.485  < 2e-16 ***
## STARS4              3.683e+00  6.564e-02  56.109  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.305 on 10217 degrees of freedom
## Multiple R-squared:  0.542,  Adjusted R-squared:  0.5412 
## F-statistic: 671.6 on 18 and 10217 DF,  p-value: < 2.2e-16

Model 8: Multiple Linear Regression, Reduced: keeping only the significant variables from Model 7

mlr2 <- lm(TARGET ~ . -CitricAcid -FixedAcidity -Chlorides - ResidualSugar -Density - TotalSulfurDioxide - FreeSulfurDioxide - Alcohol -pH -Sulphates, data = train)
summary(mlr2)
## 
## Call:
## lm(formula = TARGET ~ . - CitricAcid - FixedAcidity - Chlorides - 
##     ResidualSugar - Density - TotalSulfurDioxide - FreeSulfurDioxide - 
##     Alcohol - pH - Sulphates, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.8697 -0.8713  0.0385  0.8455  5.7465 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      2.029e+00  9.077e-02  22.356  < 2e-16 ***
## INDEX            1.643e-06  2.778e-06   0.591    0.554    
## VolatileAcidity -9.158e-02  2.326e-02  -3.937 8.29e-05 ***
## LabelAppeal      4.635e-01  1.517e-02  30.555  < 2e-16 ***
## AcidIndex       -2.027e-01  9.841e-03 -20.600  < 2e-16 ***
## STARS1           1.375e+00  3.681e-02  37.359  < 2e-16 ***
## STARS2           2.412e+00  3.573e-02  67.504  < 2e-16 ***
## STARS3           3.006e+00  4.126e-02  72.838  < 2e-16 ***
## STARS4           3.694e+00  6.562e-02  56.296  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.307 on 10227 degrees of freedom
## Multiple R-squared:  0.5404, Adjusted R-squared:   0.54 
## F-statistic:  1503 on 8 and 10227 DF,  p-value: < 2.2e-16

4. SELECT MODELS

Let’s test each of our models against the holdout set.

getMAE <- function(x) {
    mean(abs(holdout$TARGET - x))
}

getRMSE <- function(x) {
    sqrt(mean((holdout$TARGET - x)^2))
}


results <- data.frame(Model = c("Poisson1",
                                "Poisson2",
                                "NegBinom1",
                                "NegBinom2",
                                "ZeroCounts1",
                                "ZeroCounts2"), 
                     MAE = c(getMAE(predict(p1, holdout, type="response")),
                             getMAE(predict(p2, holdout, type="response")),
                             getMAE(predict(nb1, holdout, type="response")),
                             getMAE(predict(nb2, holdout, type="response")),
                             getMAE(predict(zMod1, holdout,
                                            type="response")),
                             getMAE(predict(zMod2, holdout,
                                            type="response"))
                             ),
                     RMSE = c(getRMSE(predict(p1, holdout, type="response")),
                              getRMSE(predict(p2, holdout, type="response")),
                              getRMSE(predict(nb1, holdout, type="response")),
                              getRMSE(predict(nb2, holdout, type="response")),
                              getRMSE(predict(zMod1, holdout,
                                            type="response")),
                              getRMSE(predict(zMod2, holdout,
                                            type="response"))
                              )
)

knitr::kable(results)
|Model       |      MAE|     RMSE|
|:-----------|--------:|--------:|
|Poisson1    | 2.228444| 2.597023|
|Poisson2    | 2.229372| 2.597655|
|NegBinom1   | 2.228444| 2.597023|
|NegBinom2   | 2.229372| 2.597655|
|ZeroCounts1 | 1.025186| 1.315584|
|ZeroCounts2 | 1.047941| 1.361338|
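In-sample information criteria tell a similar story; since all six models have log-likelihoods, AIC() can compare them in one call (a sketch; cross-family AIC comparisons should be read loosely):

# Lower AIC is better
AIC(p1, p2, nb1, nb2, zMod1, zMod2)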

zMod1 looks like the best model as far as performance goes, but we were asked to use a Poisson count regression model for our final predictions. Note also that the negative binomial models score identically to their Poisson counterparts: the estimated theta is enormous (hence the iteration-limit warnings above), meaning there is essentially no overdispersion and the negative binomial collapses to the Poisson.

summary(zMod1)
## 
## Call:
## zeroinfl(formula = TARGET ~ . | STARS, data = train, dist = "negbin")
## 
## Pearson residuals:
##      Min       1Q   Median       3Q      Max 
## -2.41561 -0.54954  0.01781  0.43257  2.54872 
## 
## Count model coefficients (negbin with log link):
##                      Estimate Std. Error z value Pr(>|z|)
## (Intercept)         1.171e+00         NA      NA       NA
## INDEX               1.433e-08         NA      NA       NA
## FixedAcidity        1.060e-04         NA      NA       NA
## VolatileAcidity    -1.619e-02         NA      NA       NA
## CitricAcid          4.441e-04         NA      NA       NA
## ResidualSugar      -3.793e-05         NA      NA       NA
## Chlorides          -1.073e-02         NA      NA       NA
## FreeSulfurDioxide   2.553e-05         NA      NA       NA
## TotalSulfurDioxide -8.436e-06         NA      NA       NA
## Density            -3.190e-01         NA      NA       NA
## pH                  8.940e-04         NA      NA       NA
## Sulphates           1.835e-03         NA      NA       NA
## Alcohol             6.254e-03         NA      NA       NA
## LabelAppeal         2.215e-01         NA      NA       NA
## AcidIndex          -2.991e-02         NA      NA       NA
## STARS1              5.914e-02         NA      NA       NA
## STARS2              1.872e-01         NA      NA       NA
## STARS3              2.884e-01         NA      NA       NA
## STARS4              3.831e-01         NA      NA       NA
## Log(theta)          1.784e+01         NA      NA       NA
## 
## Zero-inflation model coefficients (binomial with logit link):
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)   0.3434         NA      NA       NA
## STARS1       -1.9887         NA      NA       NA
## STARS2       -5.8652         NA      NA       NA
## STARS3      -20.0042         NA      NA       NA
## STARS4      -20.0033         NA      NA       NA
## 
## Theta = 56015999.7204 
## Number of iterations in BFGS optimization: 38 
## Log-likelihood: -1.664e+04 on 25 Df

What really stands out to me after looking at all of these models is the variation in the estimated STARS coefficients. I’m not entirely sure why, but a first guess is that some models exaggerate the effects of certain variables; in the zero-inflated models, the STARS effect is also split between the count component and the zero component, which shrinks the count-model coefficients. You can also see that the second multiple linear regression has one of the larger median residuals. That is usually a signal to check whether your residuals are symmetric, but these numbers are small enough that I’m not sure it matters much.

Make Predictions

We make our final predictions on the test set and write them out to a CSV.

### Prediction
# Predict expected case counts for the test set on the response scale

predictions <- predict(p2, test, type = "response")

test$TARGET <- predictions

write.csv(test, 'wine_predictions.csv', row.names = FALSE)

head(test)
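One last note: type = "response" gives the expected number of cases, which is continuous. If whole case counts are required (an assumption about the deliverable, not something the assignment states), rounding is a reasonable final step:

# Hypothetical post-processing: convert expected counts to whole cases
test$TARGET <- round(test$TARGET)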
