Overview

In this homework assignment, you will explore, analyze, and model a data set containing information on approximately 12,000 commercially available wines. The variables are mostly related to the chemical properties of the wine being sold. The response variable is the number of sample cases of wine that were purchased by wine distribution companies after sampling a wine. These cases would be used to provide tasting samples to restaurants and wine stores around the United States. The more sample cases purchased, the more likely a wine is to be sold at a high-end restaurant. A large wine manufacturer is studying the data in order to predict the number of wine cases ordered based upon the wine's characteristics. If the wine manufacturer can predict the number of cases, then that manufacturer will be able to adjust its wine offering to maximize sales.

Your objective is to build a count regression model to predict the number of cases of wine that will be sold given certain properties of the wine.

HINT: Sometimes, the fact that a variable is missing is actually predictive of the target.

You can only use the variables given to you (or variables that you derive from the variables provided).

The variables of interest in the data set are examined in the data exploration below.

Data Exploration

dim(train_df)
## [1] 12795    16

The training dataframe has 12,795 rows and 16 columns: an index column (INDEX), the target variable (TARGET), and 14 features.

str(train_df)
## Classes 'data.table' and 'data.frame':   12795 obs. of  16 variables:
##  $ INDEX             : int  1 2 4 5 6 7 8 11 12 13 ...
##  $ TARGET            : int  3 3 5 3 4 0 0 4 3 6 ...
##  $ FixedAcidity      : num  3.2 4.5 7.1 5.7 8 11.3 7.7 6.5 14.8 5.5 ...
##  $ VolatileAcidity   : num  1.16 0.16 2.64 0.385 0.33 0.32 0.29 -1.22 0.27 -0.22 ...
##  $ CitricAcid        : num  -0.98 -0.81 -0.88 0.04 -1.26 0.59 -0.4 0.34 1.05 0.39 ...
##  $ ResidualSugar     : num  54.2 26.1 14.8 18.8 9.4 ...
##  $ Chlorides         : num  -0.567 -0.425 0.037 -0.425 NA 0.556 0.06 0.04 -0.007 -0.277 ...
##  $ FreeSulfurDioxide : num  NA 15 214 22 -167 -37 287 523 -213 62 ...
##  $ TotalSulfurDioxide: num  268 -327 142 115 108 15 156 551 NA 180 ...
##  $ Density           : num  0.993 1.028 0.995 0.996 0.995 ...
##  $ pH                : num  3.33 3.38 3.12 2.24 3.12 3.2 3.49 3.2 4.93 3.09 ...
##  $ Sulphates         : num  -0.59 0.7 0.48 1.83 1.77 1.29 1.21 NA 0.26 0.75 ...
##  $ Alcohol           : num  9.9 NA 22 6.2 13.7 15.4 10.3 11.6 15 12.6 ...
##  $ LabelAppeal       : int  0 -1 -1 -1 0 0 0 1 0 0 ...
##  $ AcidIndex         : int  8 7 8 6 9 11 8 7 6 8 ...
##  $ STARS             : int  2 3 3 1 2 NA NA 3 NA 4 ...
##  - attr(*, ".internal.selfref")=<externalptr>
# Print a compact summary of a dataframe: dimensions, column names,
# column classes, and the count of missing values per column
show_summary <- function(df) {
    cat(rep("+", 50), "\n")
    cat(paste("DIMENSIONS : (", nrow(df), ", ", ncol(df), ")\n", sep = ""), "\n")
    cat(rep("+", 50), "\n")
    cat("COLUMNS:\n", "\n")
    col_names <- names(df)
    cat(paste(col_names, ", "))
    cat(rep("+", 50), "\n")
    cat("DATA INFO:\n", "\n")
    cat(sapply(df, class), "\n")
    cat(rep("+", 50), "\n")
    cat("MISSING VALUES:\n", "\n")
    missing_values <- colSums(is.na(df))
    cat(paste(col_names, ": ", missing_values, "\n"))
}

show_summary(train_df)
## + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + 
## DIMENSIONS : (12795, 16)
##  
## + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + 
## COLUMNS:
##  
## INDEX ,  TARGET ,  FixedAcidity ,  VolatileAcidity ,  CitricAcid ,  ResidualSugar ,  Chlorides ,  FreeSulfurDioxide ,  TotalSulfurDioxide ,  Density ,  pH ,  Sulphates ,  Alcohol ,  LabelAppeal ,  AcidIndex ,  STARS , + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + 
## DATA INFO:
##  
## integer integer numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric integer integer integer 
## + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + 
## MISSING VALUES:
##  
## INDEX :  0 
##  TARGET :  0 
##  FixedAcidity :  0 
##  VolatileAcidity :  0 
##  CitricAcid :  0 
##  ResidualSugar :  616 
##  Chlorides :  638 
##  FreeSulfurDioxide :  647 
##  TotalSulfurDioxide :  682 
##  Density :  0 
##  pH :  395 
##  Sulphates :  1210 
##  Alcohol :  653 
##  LabelAppeal :  0 
##  AcidIndex :  0 
##  STARS :  3359

The features consist of the wine's chemical components together with its STARS expert rating, where more stars indicate a higher-quality wine.
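STARS is also the variable with by far the most missing values (3,359). A quick tabulation, sketched below, would show how the ratings and the missing entries are distributed:

# Frequency of each STARS rating, with missing values shown as their own category
table(train_df$STARS, useNA = "ifany")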

# Histograms of the target and each numeric feature (INDEX excluded)
par(mfrow = c(4, 4), mar = c(3, 3, 1, 1))

for (col_name in names(train_df)[-1]) {
    if (is.numeric(train_df[[col_name]])) {
        hist(train_df[[col_name]], main = paste(col_name), xlab = "Value")
    }
}

par(mfrow = c(1, 1))

# Boxplots of the same variables to check spread and outliers
par(mfrow = c(4, 4), mar = c(3, 3, 1, 1))

for (col_name in names(train_df)[-1]) {
    if (is.numeric(train_df[[col_name]])) {
        boxplot(train_df[[col_name]], main = paste(col_name), horizontal = TRUE,
            ylab = "Value")
    }
}

par(mfrow = c(1, 1))

The feature distributions are generally centered around their means, which is what we want; the boxplots further support that the distributions look approximately normal.
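As a quick numeric check of this claim, one could compare each variable's mean and median; a large gap between them would indicate skew. A minimal sketch:

# Mean vs. median for every numeric column (NAs ignored); large differences
# would point to skewed distributions
num_vars <- names(train_df)[sapply(train_df, is.numeric)]
round(t(sapply(num_vars, function(v) {
    x <- train_df[[v]]
    c(mean = mean(x, na.rm = TRUE), median = median(x, na.rm = TRUE))
})), 3)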

Data Preparation

First, we remove the index column (named INDEX in the training data and IN in the evaluation data) from both dataframes.

train_df <- train_df |>
    dplyr::select(-INDEX)
test_df <- test_df |>
    dplyr::select(-IN)

Because the evaluation data is unlabeled, it cannot be used to assess model performance, so we later split the training data into our own train and test sets. We also use the MICE imputation method (predictive mean matching) to fill in the missing values.
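Per the hint in the overview, the fact that a value is missing (especially for STARS, with 3,359 missing entries) may itself be predictive, so missingness indicator flags could be derived before imputation. A minimal sketch is below; the na_flags dataframe is illustrative only and is not used in the models that follow.

# Build a 0/1 missingness flag for every column that contains NAs,
# kept in a separate dataframe so the pipeline below is unchanged
na_flags <- as.data.frame(lapply(train_df, function(x) as.integer(is.na(x))))
names(na_flags) <- paste0(names(train_df), "_missing")
na_flags <- na_flags[, colSums(na_flags) > 0, drop = FALSE]
head(na_flags)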

# Impute missing data using mice
set.seed(123123)
imputed_data <- mice(train_df, m = 5, method = "pmm")
## 
##  iter imp variable
##   1   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   1   2  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   1   3  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   1   4  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   1   5  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   2   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   2   2  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   2   3  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   2   4  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   2   5  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   3   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   3   2  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   3   3  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   3   4  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   3   5  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   4   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   4   2  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   4   3  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   4   4  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   4   5  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   5   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   5   2  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   5   3  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   5   4  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   5   5  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
# Use the first of the five imputed datasets for modeling
complete_train_df <- complete(imputed_data, 1)
# Keep only the 14 predictor columns of the evaluation data before imputing it as well
test_df <- test_df[, 2:15]
test_imputed <- mice(test_df, m = 5, method = "pmm")
## 
##  iter imp variable
##   1   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   1   2  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   1   3  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   1   4  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   1   5  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   2   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   2   2  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   2   3  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   2   4  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   2   5  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   3   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   3   2  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   3   3  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   3   4  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   3   5  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   4   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   4   2  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   4   3  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   4   4  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   4   5  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   5   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   5   2  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   5   3  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   5   4  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   5   5  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
complete_test_df <- complete(test_imputed, 1)
correlation_matrix <- cor(complete_train_df[, 1:14])
corrplot(correlation_matrix, method = "color", type = "upper", addCoef.col = "black",
    number.cex = 0.5)

The chemical features have little to no correlation with the target or with one another.
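Note that complete_train_df[, 1:14] omits the STARS column. To see how every predictor, STARS included, relates to the target, the full correlation matrix could be computed; a minimal sketch:

# Correlation of each predictor (including STARS) with TARGET,
# sorted by absolute magnitude; row 1 (TARGET itself) is dropped
full_cor <- cor(complete_train_df)
sort(abs(full_cor[-1, "TARGET"]), decreasing = TRUE)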

# Log-transform AcidIndex to reduce skew; the histogram below shows the result
complete_train_df$AcidIndex <- log(complete_train_df$AcidIndex)
hist(complete_train_df$AcidIndex)

set.seed(123123)
train_indices <- createDataPartition(complete_train_df$TARGET, p = 0.8, list = FALSE)
train_model_df <- complete_train_df[train_indices, ]
test_model_df <- complete_train_df[-train_indices, ]

# Build the predictor matrix (columns 2:15 = the 14 features) and numeric
# response vector for glmnet
x_train <- train_model_df[, 2:15] |>
    as.matrix()
y_train <- train_model_df$TARGET |>
    as.matrix() |>
    as.numeric()

Model Building

Poisson Models

Poisson Model 1 with MICE Imputed Data

# Collect fit statistics from the fitted model (deviance, AIC, BIC) together
# with prediction errors (MSE, RMSE, MAE) computed on the held-out data
performance_metrics <- function(model, true_values, predicted_values) {
    deviance <- deviance(model)
    aic <- AIC(model)
    bic <- BIC(model)
    mse_val <- mean((true_values - predicted_values)^2)
    rmse_val <- sqrt(mse_val)
    mae_val <- mean(abs(predicted_values - true_values))

    metrics_df <- data.frame(Deviance = deviance, AIC = aic, BIC = bic, MSE = mse_val,
        RMSE = rmse_val, MAE = mae_val)
    return(metrics_df)
}
pm_1 <- glm(TARGET ~ ., data = train_model_df, family = poisson)
summary(pm_1)
## 
## Call:
## glm(formula = TARGET ~ ., family = poisson, data = train_model_df)
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         2.198e+00  2.288e-01   9.603  < 2e-16 ***
## FixedAcidity       -1.222e-03  9.214e-04  -1.326 0.184803    
## VolatileAcidity    -4.087e-02  7.321e-03  -5.582 2.37e-08 ***
## CitricAcid          8.012e-03  6.564e-03   1.221 0.222197    
## ResidualSugar       6.622e-05  1.684e-04   0.393 0.694146    
## Chlorides          -5.183e-02  1.788e-02  -2.898 0.003751 ** 
## FreeSulfurDioxide   1.356e-04  3.821e-05   3.550 0.000386 ***
## TotalSulfurDioxide  9.716e-05  2.480e-05   3.917 8.96e-05 ***
## Density            -3.637e-01  2.143e-01  -1.697 0.089646 .  
## pH                 -1.551e-02  8.406e-03  -1.845 0.065039 .  
## Sulphates          -1.257e-02  6.021e-03  -2.088 0.036835 *  
## Alcohol             3.353e-03  1.539e-03   2.179 0.029358 *  
## LabelAppeal         1.492e-01  6.780e-03  22.005  < 2e-16 ***
## AcidIndex          -6.954e-01  3.974e-02 -17.499  < 2e-16 ***
## STARS               3.394e-01  6.275e-03  54.087  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 18374  on 10237  degrees of freedom
## Residual deviance: 12937  on 10223  degrees of freedom
## AIC: 38498
## 
## Number of Fisher Scoring iterations: 5
pm_1_predictions <- predict(pm_1, newdata = test_model_df)
performance_metrics(pm_1, test_model_df$TARGET, pm_1_predictions)
##   Deviance     AIC     BIC     MSE     RMSE      MAE
## 1 12936.87 38497.9 38606.4 6.81127 2.609841 2.304567
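One caveat: predict() for a glm object returns predictions on the link (log) scale by default, so the MSE, RMSE, and MAE above compare log-scale predictions with the raw counts. If count-scale errors are wanted, the predictions can be requested on the response scale; a minimal sketch (the table values would change accordingly):

# Count-scale predictions and RMSE for the held-out split
pm_1_counts <- predict(pm_1, newdata = test_model_df, type = "response")
sqrt(mean((test_model_df$TARGET - pm_1_counts)^2))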

Poisson Model 2 with Lasso Shrinkage Variable Selection

lasso_model_poisson <- cv.glmnet(x_train, y_train, family = "poisson", alpha = 1,
    nfolds = 5)

best_lambda <- lasso_model_poisson$lambda.min

# Indices of the non-zero coefficients (including the intercept) at the optimal lambda
selected_features_lasso <- which(coef(lasso_model_poisson, s = best_lambda) != 0)
print(paste("Selected features by LASSO:", selected_features_lasso))
##  [1] "Selected features by LASSO: 1"  "Selected features by LASSO: 2" 
##  [3] "Selected features by LASSO: 3"  "Selected features by LASSO: 4" 
##  [5] "Selected features by LASSO: 5"  "Selected features by LASSO: 6" 
##  [7] "Selected features by LASSO: 7"  "Selected features by LASSO: 8" 
##  [9] "Selected features by LASSO: 9"  "Selected features by LASSO: 10"
## [11] "Selected features by LASSO: 11" "Selected features by LASSO: 12"
## [13] "Selected features by LASSO: 13" "Selected features by LASSO: 14"
## [15] "Selected features by LASSO: 15"

Since the lasso keeps all predictors at the optimal lambda, the resulting model would be the same as Model 1. Therefore, for this model we manually keep only the features that were most significant in Model 1.
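The printed indices refer to rows of the coefficient vector, with index 1 being the intercept. To report the selected variables by name instead, the indices can be mapped back to the coefficient names; a minimal sketch using the objects defined above:

# Map the non-zero coefficient indices back to their names and drop the intercept
lasso_coefs <- coef(lasso_model_poisson, s = best_lambda)
setdiff(rownames(lasso_coefs)[selected_features_lasso], "(Intercept)")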

pm_lasso <- glm(TARGET ~ VolatileAcidity + FreeSulfurDioxide + TotalSulfurDioxide +
    LabelAppeal + AcidIndex + STARS, data = train_model_df, family = poisson)
summary(pm_lasso)
## 
## Call:
## glm(formula = TARGET ~ VolatileAcidity + FreeSulfurDioxide + 
##     TotalSulfurDioxide + LabelAppeal + AcidIndex + STARS, family = poisson, 
##     data = train_model_df)
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         1.825e+00  8.189e-02  22.285  < 2e-16 ***
## VolatileAcidity    -4.127e-02  7.319e-03  -5.639 1.71e-08 ***
## FreeSulfurDioxide   1.333e-04  3.820e-05   3.490 0.000482 ***
## TotalSulfurDioxide  9.567e-05  2.477e-05   3.863 0.000112 ***
## LabelAppeal         1.487e-01  6.777e-03  21.938  < 2e-16 ***
## AcidIndex          -7.062e-01  3.903e-02 -18.095  < 2e-16 ***
## STARS               3.414e-01  6.256e-03  54.579  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 18374  on 10237  degrees of freedom
## Residual deviance: 12965  on 10231  degrees of freedom
## AIC: 38510
## 
## Number of Fisher Scoring iterations: 5
m2_predictions <- predict(pm_lasso, test_model_df)
performance_metrics(pm_lasso, test_model_df$TARGET, m2_predictions)
##   Deviance   AIC      BIC     MSE     RMSE      MAE
## 1 12964.97 38510 38560.63 6.81459 2.610477 2.305362

Poisson Model 3 Tree Based Feature Selection

require(randomForest)

rf_model <- randomForest(x_train, y_train, ntree = 100, importance = TRUE)
print(rf_model)
## 
## Call:
##  randomForest(x = x_train, y = y_train, ntree = 100, importance = TRUE) 
##                Type of random forest: regression
##                      Number of trees: 100
## No. of variables tried at each split: 4
## 
##           Mean of squared residuals: 1.810821
##                     % Var explained: 51.41
feature_importance <- importance(rf_model)
sorted_features <- sort(feature_importance[, 1], decreasing = TRUE)

top_features <- names(sorted_features)[1:5]

print(top_features)
## [1] "STARS"              "LabelAppeal"        "AcidIndex"         
## [4] "VolatileAcidity"    "TotalSulfurDioxide"
pm_3 <- glm(TARGET ~ STARS + LabelAppeal + AcidIndex + VolatileAcidity + Chlorides,
    data = train_model_df, family = poisson)
summary(pm_3)
## 
## Call:
## glm(formula = TARGET ~ STARS + LabelAppeal + AcidIndex + VolatileAcidity + 
##     Chlorides, family = poisson, data = train_model_df)
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      1.855998   0.081576  22.752  < 2e-16 ***
## STARS            0.341355   0.006252  54.599  < 2e-16 ***
## LabelAppeal      0.148702   0.006780  21.932  < 2e-16 ***
## AcidIndex       -0.711879   0.038961 -18.271  < 2e-16 ***
## VolatileAcidity -0.042063   0.007314  -5.751 8.86e-09 ***
## Chlorides       -0.052673   0.017861  -2.949  0.00319 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 18374  on 10237  degrees of freedom
## Residual deviance: 12984  on 10232  degrees of freedom
## AIC: 38527
## 
## Number of Fisher Scoring iterations: 5
pm3_predictions <- predict(pm_3, newdata = test_model_df)
performance_metrics(pm_3, test_model_df$TARGET, pm3_predictions)
##   Deviance      AIC      BIC      MSE     RMSE     MAE
## 1 12983.61 38526.64 38570.04 6.813023 2.610177 2.30607

Negative Binomial

NB Model 1 without Imputed Data

nb_1 <- glm.nb(TARGET ~ ., data = train_df)
summary(nb_1)
## 
## Call:
## glm.nb(formula = TARGET ~ ., data = train_df, init.theta = 140198.2884, 
##     link = log)
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         1.593e+00  2.506e-01   6.359 2.03e-10 ***
## FixedAcidity        3.293e-04  1.053e-03   0.313  0.75446    
## VolatileAcidity    -2.560e-02  8.353e-03  -3.065  0.00218 ** 
## CitricAcid         -7.259e-04  7.575e-03  -0.096  0.92365    
## ResidualSugar      -6.141e-05  1.941e-04  -0.316  0.75166    
## Chlorides          -3.007e-02  2.056e-02  -1.463  0.14347    
## FreeSulfurDioxide   6.734e-05  4.404e-05   1.529  0.12621    
## TotalSulfurDioxide  2.081e-05  2.855e-05   0.729  0.46618    
## Density            -3.725e-01  2.462e-01  -1.513  0.13026    
## pH                 -4.661e-03  9.598e-03  -0.486  0.62722    
## Sulphates          -5.164e-03  7.052e-03  -0.732  0.46398    
## Alcohol             3.948e-03  1.771e-03   2.229  0.02579 *  
## LabelAppeal         1.771e-01  7.954e-03  22.271  < 2e-16 ***
## AcidIndex          -4.870e-02  5.903e-03  -8.251  < 2e-16 ***
## STARS               1.871e-01  7.487e-03  24.992  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(140198.3) family taken to be 1)
## 
##     Null deviance: 5843.9  on 6435  degrees of freedom
## Residual deviance: 4009.1  on 6421  degrees of freedom
##   (6359 observations deleted due to missingness)
## AIC: 23174
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  140198 
##           Std. Err.:  234984 
## Warning while fitting theta: iteration limit reached 
## 
##  2 x log-likelihood:  -23141.85
nb_1_predictions <- predict(nb_1, test_model_df)
performance_metrics(nb_1, test_model_df$TARGET, nb_1_predictions)
##   Deviance      AIC      BIC      MSE     RMSE      MAE
## 1 4009.079 23173.85 23282.17 5.474343 2.339731 2.111443
par(mfrow = c(2, 2))
plot(nb_1)
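The fitted theta is extremely large (about 140,000) and the theta iteration limit was reached, which suggests little to no overdispersion, so the negative binomial fit is collapsing toward the Poisson. A quick way to check dispersion directly is the ratio of the Pearson chi-square statistic to the residual degrees of freedom for the Poisson fit; a minimal sketch (values near 1 indicate the Poisson variance assumption is adequate):

# Pearson dispersion statistic for the Poisson model fit earlier
sum(residuals(pm_1, type = "pearson")^2) / df.residual(pm_1)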

NB Model 2 Stepwise Feature Selection

full_model <- glm.nb(TARGET ~ ., data = train_model_df)

nb_2 <- stepAIC(full_model, direction = "both")
## Start:  AIC=38498.07
## TARGET ~ FixedAcidity + VolatileAcidity + CitricAcid + ResidualSugar + 
##     Chlorides + FreeSulfurDioxide + TotalSulfurDioxide + Density + 
##     pH + Sulphates + Alcohol + LabelAppeal + AcidIndex + STARS
## 
##                      Df   AIC
## - ResidualSugar       1 38496
## - CitricAcid          1 38498
## - FixedAcidity        1 38498
## <none>                  38498
## - Density             1 38499
## - pH                  1 38499
## - Sulphates           1 38500
## - Alcohol             1 38501
## - Chlorides           1 38504
## - FreeSulfurDioxide   1 38509
## - TotalSulfurDioxide  1 38511
## - VolatileAcidity     1 38527
## - AcidIndex           1 38808
## - LabelAppeal         1 38980
## - STARS               1 41355
## 
## Step:  AIC=38496.23
## TARGET ~ FixedAcidity + VolatileAcidity + CitricAcid + Chlorides + 
##     FreeSulfurDioxide + TotalSulfurDioxide + Density + pH + Sulphates + 
##     Alcohol + LabelAppeal + AcidIndex + STARS
## 
##                      Df   AIC
## - CitricAcid          1 38496
## - FixedAcidity        1 38496
## <none>                  38496
## - Density             1 38497
## - pH                  1 38498
## + ResidualSugar       1 38498
## - Sulphates           1 38499
## - Alcohol             1 38499
## - Chlorides           1 38503
## - FreeSulfurDioxide   1 38507
## - TotalSulfurDioxide  1 38510
## - VolatileAcidity     1 38525
## - AcidIndex           1 38807
## - LabelAppeal         1 38978
## - STARS               1 41355
## 
## Step:  AIC=38495.7
## TARGET ~ FixedAcidity + VolatileAcidity + Chlorides + FreeSulfurDioxide + 
##     TotalSulfurDioxide + Density + pH + Sulphates + Alcohol + 
##     LabelAppeal + AcidIndex + STARS
## 
##                      Df   AIC
## - FixedAcidity        1 38495
## <none>                  38496
## + CitricAcid          1 38496
## - Density             1 38497
## - pH                  1 38497
## + ResidualSugar       1 38498
## - Sulphates           1 38498
## - Alcohol             1 38498
## - Chlorides           1 38502
## - FreeSulfurDioxide   1 38506
## - TotalSulfurDioxide  1 38509
## - VolatileAcidity     1 38525
## - AcidIndex           1 38805
## - LabelAppeal         1 38978
## - STARS               1 41357
## 
## Step:  AIC=38495.47
## TARGET ~ VolatileAcidity + Chlorides + FreeSulfurDioxide + TotalSulfurDioxide + 
##     Density + pH + Sulphates + Alcohol + LabelAppeal + AcidIndex + 
##     STARS
## 
##                      Df   AIC
## <none>                  38495
## + FixedAcidity        1 38496
## + CitricAcid          1 38496
## - Density             1 38496
## - pH                  1 38497
## + ResidualSugar       1 38497
## - Sulphates           1 38498
## - Alcohol             1 38498
## - Chlorides           1 38502
## - FreeSulfurDioxide   1 38506
## - TotalSulfurDioxide  1 38509
## - VolatileAcidity     1 38525
## - AcidIndex           1 38820
## - LabelAppeal         1 38979
## - STARS               1 41356
summary(nb_2)
## 
## Call:
## glm.nb(formula = TARGET ~ VolatileAcidity + Chlorides + FreeSulfurDioxide + 
##     TotalSulfurDioxide + Density + pH + Sulphates + Alcohol + 
##     LabelAppeal + AcidIndex + STARS, data = train_model_df, init.theta = 48525.14821, 
##     link = log)
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         2.207e+00  2.287e-01   9.650  < 2e-16 ***
## VolatileAcidity    -4.107e-02  7.318e-03  -5.612 2.00e-08 ***
## Chlorides          -5.186e-02  1.789e-02  -2.899 0.003742 ** 
## FreeSulfurDioxide   1.357e-04  3.821e-05   3.552 0.000383 ***
## TotalSulfurDioxide  9.797e-05  2.479e-05   3.952 7.75e-05 ***
## Density            -3.686e-01  2.143e-01  -1.720 0.085438 .  
## pH                 -1.549e-02  8.405e-03  -1.843 0.065356 .  
## Sulphates          -1.285e-02  6.020e-03  -2.135 0.032780 *  
## Alcohol             3.369e-03  1.539e-03   2.190 0.028556 *  
## LabelAppeal         1.493e-01  6.780e-03  22.025  < 2e-16 ***
## AcidIndex          -7.005e-01  3.920e-02 -17.871  < 2e-16 ***
## STARS               3.395e-01  6.273e-03  54.120  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(48525.15) family taken to be 1)
## 
##     Null deviance: 18373  on 10237  degrees of freedom
## Residual deviance: 12940  on 10226  degrees of freedom
## AIC: 38497
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  48525 
##           Std. Err.:  63351 
## Warning while fitting theta: iteration limit reached 
## 
##  2 x log-likelihood:  -38471.47
nb_2_predictions <- predict(nb_2, test_model_df)
performance_metrics(nb_2, test_model_df$TARGET, nb_2_predictions)
##   Deviance      AIC      BIC      MSE     RMSE      MAE
## 1  12939.8 38497.47 38591.51 6.809707 2.609541 2.304512
par(mfrow = c(2, 2))
plot(nb_2)

NB Model 3 Tree Based Feature Selection

nb_3 <- glm.nb(TARGET ~ STARS + LabelAppeal + AcidIndex + VolatileAcidity + Chlorides,
    data = train_model_df)
summary(nb_3)
## 
## Call:
## glm.nb(formula = TARGET ~ STARS + LabelAppeal + AcidIndex + VolatileAcidity + 
##     Chlorides, data = train_model_df, init.theta = 48454.10433, 
##     link = log)
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      1.856021   0.081579  22.751  < 2e-16 ***
## STARS            0.341359   0.006252  54.598  < 2e-16 ***
## LabelAppeal      0.148701   0.006780  21.931  < 2e-16 ***
## AcidIndex       -0.711894   0.038963 -18.271  < 2e-16 ***
## VolatileAcidity -0.042064   0.007314  -5.751 8.86e-09 ***
## Chlorides       -0.052675   0.017862  -2.949  0.00319 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(48454.1) family taken to be 1)
## 
##     Null deviance: 18373  on 10237  degrees of freedom
## Residual deviance: 12983  on 10232  degrees of freedom
## AIC: 38529
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  48454 
##           Std. Err.:  63478 
## Warning while fitting theta: iteration limit reached 
## 
##  2 x log-likelihood:  -38514.81
nb_3_predictions <- predict(nb_3, test_model_df)
performance_metrics(nb_3, test_model_df$TARGET, nb_3_predictions)
##   Deviance      AIC      BIC      MSE     RMSE      MAE
## 1 12983.14 38528.81 38579.44 6.813019 2.610176 2.306069

Model Selection

model_names <- c("Poisson Model 1", "Poisson Model 2", "Poisson Model 3", "NB Model 1",
    "NB Model 2", "NB Model 3")

pm1_metrics <- performance_metrics(pm_1, test_model_df$TARGET, pm_1_predictions)

pm2_metrics <- performance_metrics(pm_lasso, test_model_df$TARGET, m2_predictions)

pm3_metrics <- performance_metrics(pm_3, test_model_df$TARGET, pm3_predictions)

nb1_metrics <- performance_metrics(nb_1, test_model_df$TARGET, nb_1_predictions)

nb2_metrics <- performance_metrics(nb_2, test_model_df$TARGET, nb_2_predictions)

nb3_metrics <- performance_metrics(nb_3, test_model_df$TARGET, nb_3_predictions)

all_models_metrics <- rbind(pm1_metrics, pm2_metrics, pm3_metrics, nb1_metrics, nb2_metrics,
    nb3_metrics)

final_table <- cbind(Model = model_names, all_models_metrics)

data.table(final_table)
##              Model  Deviance      AIC      BIC      MSE     RMSE      MAE
## 1: Poisson Model 1 12936.868 38497.90 38606.40 6.811270 2.609841 2.304567
## 2: Poisson Model 2 12964.969 38510.00 38560.63 6.814590 2.610477 2.305362
## 3: Poisson Model 3 12983.607 38526.64 38570.04 6.813023 2.610177 2.306070
## 4:      NB Model 1  4009.079 23173.85 23282.17 5.474343 2.339731 2.111443
## 5:      NB Model 2 12939.800 38497.47 38591.51 6.809707 2.609541 2.304512
## 6:      NB Model 3 12983.140 38528.81 38579.44 6.813019 2.610176 2.306069

All of the models performed almost identically, which we did not expect. One reason is that the estimated theta values for the negative binomial fits are extremely large (with iteration-limit warnings), indicating little to no overdispersion, so the negative binomial models are essentially equivalent to their Poisson counterparts. Note also that NB Model 1 was fit on the non-imputed data, dropping 6,359 incomplete observations, so its lower deviance, AIC, and BIC are not directly comparable to the other models. We will use Negative Binomial Model 1, which had the lowest error metrics on the held-out split, to predict the evaluation dataframe.
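For a like-for-like comparison, NB Model 1 could also be refit on the same imputed training split used by the other models; a minimal sketch (its metrics would then be directly comparable to the table above):

# Refit the full negative binomial model on the imputed training split
nb_1_imp <- glm.nb(TARGET ~ ., data = train_model_df)
performance_metrics(nb_1_imp, test_model_df$TARGET,
    predict(nb_1_imp, newdata = test_model_df))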

Predictions

predictions_df <- predict(nb_1, newdata = complete_test_df, type = "response")
predicted_test_df <- cbind(TARGET = round(predictions_df), test_df)
data.table(predicted_test_df)
##       TARGET FixedAcidity VolatileAcidity CitricAcid ResidualSugar Chlorides
##    1:      3          5.4          -0.860       0.27         -10.7     0.092
##    2:      4         12.4           0.385      -0.76         -19.7     1.169
##    3:      3          7.2           1.750       0.17         -33.0     0.065
##    4:      2          6.2           0.100       1.80           1.0    -0.179
##    5:      3         11.4           0.210       0.28           1.2     0.038
##   ---                                                                       
## 3331:      3          7.8           1.180       0.24            NA     0.034
## 3332:      6         -0.4          -0.980       0.34           1.3     0.226
## 3333:      4          6.6           0.410       0.22            NA     0.035
## 3334:      3          6.1          -0.220       1.15           1.1     0.041
## 3335:      4          7.1           0.210       0.31          14.6     0.021
##       FreeSulfurDioxide TotalSulfurDioxide Density   pH Sulphates Alcohol
##    1:                23                398 0.98527 5.02      0.64   12.30
##    2:               -37                 68 0.99048 3.37      1.09   16.00
##    3:                 9                 76 1.04641 4.61      0.68    8.55
##    4:               104                 89 0.98877 3.20      2.11   12.30
##    5:                70                 53 1.02899 2.54     -0.07    4.80
##   ---                                                                    
## 3331:                29               -120 0.99031 3.10      0.40    6.10
## 3332:               136                  0 0.99176 3.07     -1.22   14.70
## 3333:                23                117 1.00962 2.84      0.39    8.44
## 3334:                32                 92 1.01541 3.26        NA   17.20
## 3335:               281                142 0.99215 3.17     -0.37    9.70
##       LabelAppeal AcidIndex STARS
##    1:          -1         6    NA
##    2:           0         6     2
##    3:           0         8     1
##    4:          -1         8     1
##    5:           0        10    NA
##   ---                            
## 3331:           0         7     2
## 3332:           1         9     4
## 3333:           0         7     2
## 3334:           0         7     1
## 3335:           0         8     3
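Finally, if the scored evaluation set needs to be saved or submitted, it could be exported to CSV; a minimal sketch (the file name is arbitrary):

# Write the predictions alongside the original evaluation features
write.csv(predicted_test_df, "wine_eval_predictions.csv", row.names = FALSE)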