#Libraries
library(DHARMa)
library(kableExtra)
library(knitr)
library(readr)
library(tidyverse)
library(corrplot)
library(dplyr)
library(GGally)
library(caret)
library(pROC)
library(glmnet)
library(MASS)
library(car)
library(correlationfunnel)
library(faraway)
library(arm)
library(performance)
library(see)
library(reshape2)
library(readr)
#library(tidymodels)
#library(rms)
library(smotefamily)
library(themis)
library(skimr)
library(DataExplorer)
library(naniar)
library(mice)
library(corrr)
library(FactoMineR)
library(ggcorrplot)
library(factoextra)

1 Background

In this homework assignment, you will explore, analyze and model a data set containing information on approximately 12,000 commercially available wines. The variables are mostly related to the chemical properties of the wine being sold. The response variable is the number of sample cases of wine that were purchased by wine distribution companies after sampling a wine. These cases would be used to provide tasting samples to restaurants and wine stores around the United States. The more sample cases purchased, the more likely is a wine to be sold at a high end restaurant. A large wine manufacturer is studying the data in order to predict the number of wine cases ordered based upon the wine characteristics. If the wine manufacturer can predict the number of cases, then that manufacturer will be able to adjust their wine offering to maximize sales.

Your objective is to build a count regression model to predict the number of cases of wine that will be sold given certain properties of the wine. HINT: Sometimes, the fact that a variable is missing is actually predictive of the target. You can only use the variables given to you (or variables that you derive from the variables provided). Below is a short description of the variables of interest in the data set:

INDEX: Identification Variable (do not use) AcidIndex: Proprietary method of testing total acidity of wine by using a weighted average Alcohol: Alcohol Content Chlorides: Chloride content of wine CitricAcid: Citric Acid Content Density: Density of Wine FixedAcidity: Fixed Acidity of Wine FreeSulfurDioxide: Sulfur Dioxide content of wine LabelAppeal: Marketing Score indicating the appeal of label design for consumers. High numbers suggest customers like the label design. Negative numbers suggest customes don’t like the design. Many consumers purchase based on the visual appeal of the wine label design. Higher numbers suggest better sales. ResidualSugar: Residual Sugar of wine STARS: Wine rating by a team of experts. 4 Stars = Excellent, 1 Star = Poor. A high number of stars suggests high sales Sulphates: Sulfate content of wine TotalSulfurDioxide: Total Sulfur Dioxide of Wine VolatileAcidity: Volatile Acid content of wine pH: pH of wine

2 Data Exploration

2.1 Data Import

raw_data <- read.csv("https://raw.githubusercontent.com/gillianmcgovern0/cuny-data-621/refs/heads/main/wine-training-data.csv")
eval_data <- read.csv("https://raw.githubusercontent.com/gillianmcgovern0/cuny-data-621/refs/heads/main/wine-evaluation-data.csv")

2.2 Data Summary

The wine training dataset contains 12,795 observations and 16 variables, where each observation is an available wine at the wine company. All the features are numerical - LabelAppeal and STARS is an ordinal variable (scores).

Missing data:

Only 50.3% of all rows are not completely missing. Given the 50.3% complete rows, there are only 4% total missing observations.

The missing data breakdown bar chart shows that pH, ResidualSugar, Chlorides, FreeSulfurDioxide, Alcohol, TotalSulfurDioxide, Sulphates and STARS have missing values. STARS has the most missing values, yet there could be a reason this value is blank given it’s a review. Looking at the missing factor plot, we can see there is a clear pattern in the heatmap for STARS - missing data strongly relates to a lower TARGET value. People who do not like the wine might also choose to not give a review at all. The missing upset plot does not show any interesting missing data patterns between predictors.

There are no duplicate observations.

Summary statistics of the predictor variables in the training dataset are included in the table below. Key metrics such as minimum, maximum, mean, median, and standard deviation (SD) help us understand the range, central tendency, and variability of each variable. The predictors have a wide range of spread. Many show a spread with negative values which does not makes sense. That will need to be addressed later.

Model Specific Checks:

Variance-to-mean ratio: Ratios between 0.5 and 1.5 are generally acceptable for a Poisson distribution, so a value of 1.2 means we can use the poisson regression model. Yet, greater than 1 means there is overdispersion, in which case the poisson regression model might not be the best. Negative binomial regression will also be tested which should help with overdispersion.

# Visualize data types
plot_str(raw_data)

# Visualize EDA
plot_intro(raw_data)

# Visualize Missing Data
plot_missing(raw_data)

# Check how missingness relates to other variables
gg_miss_upset(raw_data, nsets = 10)

# Check how missingness affects TARGET
gg_miss_fct(x = raw_data, fct = TARGET)

# Check mean vs variance (should be similar for Poisson Regression Model)
mean(raw_data$TARGET)
## [1] 3.029074
var(raw_data$TARGET)
## [1] 3.710895
var(raw_data$TARGET) / mean(raw_data$TARGET) # Should be close to 1
## [1] 1.225092
# Glimpse of the data
head(raw_data)
##   INDEX TARGET FixedAcidity VolatileAcidity CitricAcid ResidualSugar Chlorides
## 1     1      3          3.2           1.160      -0.98          54.2    -0.567
## 2     2      3          4.5           0.160      -0.81          26.1    -0.425
## 3     4      5          7.1           2.640      -0.88          14.8     0.037
## 4     5      3          5.7           0.385       0.04          18.8    -0.425
## 5     6      4          8.0           0.330      -1.26           9.4        NA
## 6     7      0         11.3           0.320       0.59           2.2     0.556
##   FreeSulfurDioxide TotalSulfurDioxide Density   pH Sulphates Alcohol
## 1                NA                268 0.99280 3.33     -0.59     9.9
## 2                15               -327 1.02792 3.38      0.70      NA
## 3               214                142 0.99518 3.12      0.48    22.0
## 4                22                115 0.99640 2.24      1.83     6.2
## 5              -167                108 0.99457 3.12      1.77    13.7
## 6               -37                 15 0.99940 3.20      1.29    15.4
##   LabelAppeal AcidIndex STARS
## 1           0         8     2
## 2          -1         7     3
## 3          -1         8     3
## 4          -1         6     1
## 5           0         9     2
## 6           0        11    NA
#the Index column had no impact on the target variable, it was dropped 
raw_data <- raw_data[ , -1]

# Check for duplicates
duplicates <- duplicated(raw_data)

# Print the duplicates
print(raw_data[duplicates, ])
##  [1] TARGET             FixedAcidity       VolatileAcidity    CitricAcid        
##  [5] ResidualSugar      Chlorides          FreeSulfurDioxide  TotalSulfurDioxide
##  [9] Density            pH                 Sulphates          Alcohol           
## [13] LabelAppeal        AcidIndex          STARS             
## <0 rows> (or 0-length row.names)

2.2 Data Cleaning

FixedAcidity, VolatileAcidity, CitricAcid, ResidualSugar, Chlorides, FreeSulfurDioxide, TotalSulfurDioxide, Sulphates and Alcohol contain negative values which, according to online research (and common knowledge), does not make sense. We will treat these as invalid, and therefore convert these values to missing/NA. LabelAppeal is a score and can be negative.

Approximately 21.37% of the TARGET values are zeros, which supports the appropriateness of checking for Zero-Inflated Poisson (ZIP) or Negative Binomial models for this data.

#Converting ordinal variables as factor data type.
raw_data$LabelAppeal <- as.factor(raw_data$LabelAppeal)
raw_data$AcidIndex <- as.factor(raw_data$AcidIndex) 
raw_data$STARS <- as.factor(raw_data$STARS)
# raw_data$STARS <- factor(raw_data$STARS,
#                          ordered = TRUE,
#                          levels = c(1, 2, 3, 4))

# If value is negative (most likely inputted incorrectly), update to NA
raw_data$FixedAcidity =  ifelse(raw_data$FixedAcidity < 0, NA, raw_data$FixedAcidity)
raw_data$VolatileAcidity =  ifelse(raw_data$VolatileAcidity < 0, NA, raw_data$VolatileAcidity)
raw_data$CitricAcid =  ifelse(raw_data$CitricAcid < 0, NA, raw_data$CitricAcid)
raw_data$ResidualSugar =  ifelse(raw_data$ResidualSugar < 0, NA, raw_data$ResidualSugar)
raw_data$Chlorides =  ifelse(raw_data$Chlorides < 0, NA, raw_data$Chlorides)
raw_data$FreeSulfurDioxide =  ifelse(raw_data$FreeSulfurDioxide < 0, NA, raw_data$FreeSulfurDioxide)
raw_data$TotalSulfurDioxide =  ifelse(raw_data$TotalSulfurDioxide < 0, NA, raw_data$TotalSulfurDioxide)
raw_data$Sulphates =  ifelse(raw_data$Sulphates < 0, NA, raw_data$Sulphates)
raw_data$Alcohol =  ifelse(raw_data$Alcohol < 0, NA, raw_data$Alcohol)

# Summary statistics
raw_data %>%
  summary() %>%
  kable(caption = "Descriptive Statistics of Predictor Variables") %>%
  kable_styling()
Descriptive Statistics of Predictor Variables
TARGET FixedAcidity VolatileAcidity CitricAcid ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide Density pH Sulphates Alcohol LabelAppeal AcidIndex STARS
Min. :0.000 Min. : 0.000 Min. :0.0000 Min. :0.0000 Min. : 0.00 Min. :0.000 Min. : 0.00 Min. : 0.0 Min. :0.8881 Min. :0.480 Min. :0.000 Min. : 0.00 -2: 504 7 :4878 1 :3042
1st Qu.:2.000 1st Qu.: 6.200 1st Qu.:0.2400 1st Qu.:0.2800 1st Qu.: 2.40 1st Qu.:0.042 1st Qu.: 25.00 1st Qu.: 100.0 1st Qu.:0.9877 1st Qu.:2.960 1st Qu.:0.440 1st Qu.: 9.00 -1:3136 8 :4142 2 :3570
Median :3.000 Median : 7.200 Median :0.3600 Median :0.3900 Median : 8.70 Median :0.063 Median : 43.00 Median : 150.0 Median :0.9945 Median :3.200 Median :0.575 Median :10.40 0 :5617 9 :1427 3 :2212
Mean :3.029 Mean : 8.668 Mean :0.6195 Mean :0.6474 Mean : 19.39 Mean :0.188 Mean : 91.67 Mean : 204.9 Mean :0.9942 Mean :3.208 Mean :0.863 Mean :10.61 1 :3048 6 :1197 4 : 612
3rd Qu.:4.000 3rd Qu.:10.400 3rd Qu.:0.8500 3rd Qu.:0.8500 3rd Qu.: 29.50 3rd Qu.:0.289 3rd Qu.:131.00 3rd Qu.: 258.0 3rd Qu.:1.0005 3rd Qu.:3.470 3rd Qu.:1.140 3rd Qu.:12.40 2 : 490 10 : 551 NA’s:3359
Max. :8.000 Max. :34.400 Max. :3.6800 Max. :3.8600 Max. :141.15 Max. :1.351 Max. :623.00 Max. :1057.0 Max. :1.0992 Max. :6.130 Max. :4.240 Max. :26.50 NA 11 : 258 NA
NA NA’s :1621 NA’s :2827 NA’s :2966 NA’s :3752 NA’s :3835 NA’s :3683 NA’s :3186 NA NA’s :395 NA’s :3571 NA’s :771 NA (Other): 342 NA

2.3 Data Distribution

TARGET is bimodal with a peak at 4 and 0 cases purchased - frequently either people purchase a couple of cases, or none at all. This could indicate a potential excess of zeros issue with our models.

Many numeric predictors are unimodal with heavy tails. These include Chlorides, CitricAcid, Density, FixedAcidity, FreeSulfurDioxide, pH, ResidualSugar, Sulphates, TotalSulfurDioxide, and VolatileAcidity. AcidIndex and Alochol is right skewed, and LabelAppeal looks normal. For the heavy tailed and skewed predictors transformation will need to be added.

As we saw earlier, predictors have a wide range of spread. For example, Sulphates has a very small spread compared to FreeSulfurDioxide which has a much larger spread.

The QQ plots and the box plots show the heavy tails shown in the histograms (many extreme values). However, based on the feature meanings and provided information, there is no reason to believe that any of these extreme values are mistakes or data errors. As such, we will not remove the extreme values, as they could be predictive of the target. For a higher TARGET (6-8 cases purchased), we can see that the spread for many of the predictors is smaller, but that could just be due to the fact there is not as many observations. For most of the predictors, there is not a clear value for higher cases purchased as it is around the same for each “bucket” of cases purchased.

Unsurprisingly, a lower AcidIndex and higher amount of STARS and LabelAppeal have higher TARGET values.

For the ordinal features, the most frequent LabelAppeal rating is 0 meaning a neutral feeling about label, and -1 and 1 rating frequencies are almost equal. Interestingly, -2 and 2 for LabelAppeal also have almost equal frequencies. The most frequent AcidIndex values are 7 and 8 (mid-range) and the most frequent values for STARS is 2 with NA being a close second, which means ratings are on the lower side. A STARS rating of 4 is the least frequent rating, which means highly rated wines are harder to come by.

# Plot histograms
plot_histogram(raw_data)

# Categorical Variables
data_raw_categorical <- raw_data |>
  dplyr::select(where(is.factor) | where(is.character))
categorical_predictors <- names(data_raw_categorical)[names(data_raw_categorical) != "TARGET"]
plot_bar(data_raw_categorical)

data_raw_numeric <- raw_data |>
  dplyr::select(where(is.numeric))
numerical_predictors <- names(data_raw_numeric)

# Closer look at target distribution (# of orders)
hist(as.numeric(raw_data$TARGET), main="TARGET Distribution",
     xlab="Cases Purchased", col="lightblue", breaks=100)

# QQ Plot
plot_qq(raw_data, sampled_rows = 1000L)

# Box plots - visualize outliers
ggplot(stack(data_raw_numeric), aes(x = ind, y = values)) + 
  geom_boxplot(color = 'skyblue', outlier.color = 'red') +
  coord_cartesian(ylim = c(0, 1250)) +
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 45, hjust = 1),
        panel.background = element_rect(fill = 'grey96')) +
  labs(title = "Boxplots of Predictor Variables", x="Predictors")

# Box plots by TARGET
plot_boxplot(raw_data, by = "TARGET", ggtheme = theme_light())

# Scatter plots by TARGET
plot_scatterplot(raw_data, by = "TARGET", sampled_rows = 1000L)

2.4 Identifying Correlations

Not too many features have high correlation with the TARGET and with each other.

Positive correlations with TARGET (Cases purchased): 1. STARS: 0.56 (a positive review for the wine indicates someone is more likely purchase the wine) 2. LabelAppeal: 0.36 (Better bottle design, more likely to purchase the wine) 3. Alcohol: 0.06 (Higher alcohol content leads to more purchases) 4. AcidIndex: -0.25 (Higher total acidity wines are less likely to be purchased)

Negative correlations with TARGET: 1. AcidIndex: -0.18 (Higher acid index is less likely to be purchased) 2. VolatileAcidity: -0.10 (Higher volatile acidity is less likely to be purchased) 3. FixedAcidity: -0.06 (Higher fixed acidity is less likely to be purchased)

Correlation Among Predictors:We see that the features have very low correlations with each other, meaning that there is not much multicollinearity present in the dataset. This means that the assumptions of linear regression are more likely to be met. 1. LabelAppeal <-> STARS: 0.33 (Better bottle design, more likely for a positive review) 2. FixedAcidity <-> AcidIndex: 0.19 (Higher fixed acidity, higher acid index) 3. AcidIndex <-> STARS: -0.09 (Higher acid index, lower review) 4. CitricAcid <-> AcidIndex: 0.06 (Higher citric acid, higher acid index) 5. Alcohol <-> STARS: 0.06 (Higher alcohol content, more likely for a positive review)

Initial business insights:

The correlation funnel shows that the following values for the predictors correspond to a high TARGET (4 cases purchased and above):

  • LabelAppeal 1 and 2 (Better bottle design, more likely to purchase the wine)
  • Chlorides -Inf - 0.044 (Lower chloride content of wine)
  • Density 0.98868-0.994715 (Average for density)
  • AcidIndex -Inf - 7 (Lower AcidIndex)

Two-way Cross-Tabulations - Categorical Variables: Based on the p-value, all the categorical variables are statistically significant.

Lastly, the relationship scatter plots did not show any interesting relationships between predictors.

#Correlation matrix with target

numeric_vars <- raw_data %>% select_if(is.numeric) 
cor_matrix <- cor(numeric_vars, use="pairwise.complete.obs")
corr_target <- cor_matrix[,"TARGET"]
corr_target_sorted <- sort(corr_target, decreasing = TRUE)

kable(as.data.frame(corr_target_sorted), col.names = c("Correlation with Target"), digits = 2)
Correlation with Target
TARGET 1.00
Alcohol 0.07
TotalSulfurDioxide 0.06
FreeSulfurDioxide 0.05
CitricAcid 0.02
ResidualSugar 0.01
pH -0.01
Density -0.04
Sulphates -0.04
Chlorides -0.05
FixedAcidity -0.06
VolatileAcidity -0.10
#Correlation heatmap
corrplot::corrplot(cor_matrix, method = "color", type = "upper", 
         tl.col = "black", tl.srt = 45, addCoef.col = "black",
         number.cex = 0.7, diag = FALSE)

# Correlation top 5 positive and negative
corr_target <- cor_matrix[, "TARGET"]
corr_target <- corr_target[names(corr_target) != "TARGET"]  # Remove self-correlation

top_pos <- sort(corr_target, decreasing = TRUE)[1:3]
top_neg <- sort(corr_target, decreasing = FALSE)[1:3]

combined_df <- data.frame(
  Feature = c(names(top_pos), names(top_neg)),
  Correlation = c(top_pos, top_neg)) %>%
  mutate(Direction = ifelse(Correlation > 0, "Positive", "Negative"))

ggplot(combined_df, aes(x = reorder(Feature, Correlation), y = Correlation, fill = Direction)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  scale_fill_manual(values = c("Positive" = "skyblue", "Negative" = "salmon")) +
  labs(title = "Top Features Correlated with TARGET",
       x = "Feature",
       y = "Correlation with Target",
       fill = "Direction") +
  theme_minimal()

# Correlation among predictors
melt_cor <- melt(cor_matrix)
melt_cor <- melt_cor |>
  filter(Var1 != "TARGET") |>
  filter(Var2 != "TARGET")
filtered_cor <- melt_cor[melt_cor$Var1 != melt_cor$Var2 & as.numeric(melt_cor$Var1) < as.numeric(melt_cor$Var2), ]
sorted_cor <- filtered_cor[order(abs(filtered_cor$value), decreasing = TRUE), ]
head(sorted_cor)
##                   Var1               Var2       value
## 68     VolatileAcidity TotalSulfurDioxide -0.04959490
## 117 TotalSulfurDioxide            Alcohol -0.04003116
## 70       ResidualSugar TotalSulfurDioxide  0.03744668
## 100       FixedAcidity          Sulphates  0.03204314
## 72   FreeSulfurDioxide TotalSulfurDioxide  0.03082415
## 104          Chlorides          Sulphates  0.02345348
# Correlation Funnel
set.seed(123)

raw_data_v2 <- raw_data |>
  dplyr::select(-STARS)
raw_data_v2 <- na.omit(raw_data_v2)
raw_data_binarized_tbl <- raw_data_v2 %>%
    binarize(n_bins = 4, thresh_infreq = 0.01)
raw_data_correlated_tbl <- raw_data_binarized_tbl %>%
    correlationfunnel::correlate(target = TARGET__4_Inf)
raw_data_correlated_tbl %>%
    correlationfunnel::plot_correlation_funnel(interactive = FALSE)
## Warning: The `size` argument of `element_rect()` is deprecated as of ggplot2 3.4.0.
## ℹ Please use the `linewidth` argument instead.
## ℹ The deprecated feature was likely used in the correlationfunnel package.
##   Please report the issue at
##   <https://github.com/business-science/correlationfunnel/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: ggrepel: 5 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

ggpairs(data_raw_numeric, progress = FALSE)

Two-way Cross-Tabulations - Categorical Variables:

# Comparing categorical features with the target variable - Chi-square

# got this piece of code from the internet
# Loop through categorical variables, and apply chi square test for significance
results <- map_dfr(categorical_predictors, function(var) {
  tibble(
    variable = var,
    chisq_p_value = chisq.test(table(raw_data[[var]], raw_data$TARGET))$p.value
  )
})
results %>%
  arrange(desc(chisq_p_value)) %>%
  kbl() %>%
  kable_styling(full_width = TRUE)
variable chisq_p_value
AcidIndex 0
LabelAppeal 0
STARS 0

3 Data Preparation

3.1 Fix Missing Values

Besides STARS, let’s replace the missing values with the median. MICE imputation is complex, slow, and can create unrealistic synthetic values. For Poisson regression, it often does not outperform simpler methods unless you know the missingness mechanism.

To handle missing data for STARS, we’ll test out 2 options:

  • Since we know missing STARS indicates a lower TARGET, add a new rating value/category 0 for STARS which will mean very poor
  • Impute STARS missing values with median, but add a new logical indicator “is_stars_missing” variable
# Fill in NA with median
data_imputed <- raw_data %>%
   mutate_at(vars(c("FixedAcidity", "VolatileAcidity", "CitricAcid", "ResidualSugar", "Chlorides", "FreeSulfurDioxide", "TotalSulfurDioxide", "pH", "Sulphates", "Alcohol")), ~ifelse(is.na(.), median(., na.rm = TRUE), .))

# Check for missing values again
plot_missing(data_imputed )

# Handling STARS

# Option 1: STARS_new
# Create a new rating value of 0
data_imputed$STARS_new =  ifelse(is.na(data_imputed$STARS), 0, data_imputed$STARS)

# Option 2: is_stars_missing & STARS_imputed
# Create new missing indicator variable
data_imputed  <- data_imputed  %>%
  mutate(is_stars_missing = is.na(STARS))


# Convert STARS to numeric as it is a factor
data_imputed$STARS <- as.numeric(as.character(data_imputed$STARS))


# impute STARS missing values using the median
data_imputed  <- data_imputed %>%
  mutate(STARS_imputed = coalesce(STARS, median(STARS, na.rm = TRUE)))

# Convert missing indicator and STARS variables to factor
data_imputed$is_stars_missing <- as.factor(data_imputed$is_stars_missing)
data_imputed$STARS <- as.factor(data_imputed$STARS)
# data_imputed$STARS <- factor(data_imputed$STARS,
#                          ordered = TRUE,
#                          levels = c(1, 2, 3, 4))
data_imputed$STARS_new <- as.factor(data_imputed$STARS_new)
# data_imputed$STARS_new <- factor(data_imputed$STARS_new,
#                          ordered = TRUE,
#                          levels = c(0, 1, 2, 3, 4))
data_imputed$STARS_imputed <- as.factor(data_imputed$STARS_imputed)


# Check for missing values again
plot_missing(data_imputed )

3.2 Double Check Correlations and Distributions

ResidualSugar’s distribution is the only one that looks different, but that should be okay for this case. All the STARS related variables show correlation, but that is to be expected. Nothing else looks like it has changed too drastically after the transformations.

# Plot histograms
plot_histogram(data_imputed)

# Plot correlations
plot_correlation(data_imputed)

# Comparing categorical features with the target variable - Chi-square

# got this piece of code from the internet
# Loop through categorical variables, and apply chi square test for significance

data_imputed_categorical <- data_imputed |>
  dplyr::select(where(is.factor) | where(is.character))
data_imputed_categorical_predictors <- names(data_imputed_categorical)[names(data_imputed_categorical) != "TARGET"]

results <- map_dfr(data_imputed_categorical_predictors, function(var) {
  tibble(
    variable = var,
    chisq_p_value = chisq.test(table(data_imputed[[var]], data_imputed$TARGET))$p.value
  )
})
## Warning in chisq.test(table(data_imputed[[var]], data_imputed$TARGET)):
## Chi-squared approximation may be incorrect
## Warning in chisq.test(table(data_imputed[[var]], data_imputed$TARGET)):
## Chi-squared approximation may be incorrect
## Warning in chisq.test(table(data_imputed[[var]], data_imputed$TARGET)):
## Chi-squared approximation may be incorrect
## Warning in chisq.test(table(data_imputed[[var]], data_imputed$TARGET)):
## Chi-squared approximation may be incorrect
## Warning in chisq.test(table(data_imputed[[var]], data_imputed$TARGET)):
## Chi-squared approximation may be incorrect
## Warning in chisq.test(table(data_imputed[[var]], data_imputed$TARGET)):
## Chi-squared approximation may be incorrect
results %>%
  arrange(desc(chisq_p_value)) %>%
  kbl() %>%
  kable_styling(full_width = TRUE)
variable chisq_p_value
AcidIndex 0
LabelAppeal 0
STARS 0
STARS_new 0
is_stars_missing 0
STARS_imputed 0

3.3 Transform skewed predictors

Use log, square root, and box cox transformations for the skewed predictors. The histograms and QQ plots definitely show an improvement.

data_transform <- data_imputed 

# Log and sqrt transformation
data_transform <- data_transform%>%
  mutate(
#    LOG_AcidIndex = log(data_transform$AcidIndex + 1),
    SQRT_Alcohol = sqrt(data_transform$Alcohol +1),
     LOG_Chlorides = log(data_transform$Chlorides + 1),
    # ASINH_Chlorides = asinh(data_transform$Chlorides + 1),
     LOG_CitricAcid = log(data_transform$CitricAcid + 1),
    SQRT_FixedAcidity = sqrt(data_transform$FixedAcidity +1),
    LOG_FreeSulfurDioxide = log(data_transform$FreeSulfurDioxide + 1),
    LOG_ResidualSugar = log(data_transform$ResidualSugar + 1),
     LOG_Sulphates = log(data_transform$Sulphates + 1),
    LOG_TotalSulfurDioxide = log(data_transform$TotalSulfurDioxide + 1),
     LOG_VolatileAcidity  = log(data_transform$VolatileAcidity + 1)
  )

# Box Cox transformation also being checked if its any favorable
bc_model_1 <- BoxCoxTrans(data_transform$CitricAcid + 1)
bc_model_2 <- BoxCoxTrans(data_transform$Chlorides + 1)
bc_model_3 <- BoxCoxTrans(data_transform$VolatileAcidity + 1)
bc_model_4 <- BoxCoxTrans(data_transform$Sulphates + 1)
data_transform$BC_CitricAcid <- predict(bc_model_1, data_transform$CitricAcid + 1)
data_transform$BC_Chlorides <- predict(bc_model_2, data_transform$Chlorides + 1)
data_transform$BC_VolatileAcidity <- predict(bc_model_3, data_transform$VolatileAcidity + 1)
data_transform$BC_Sulphates <- predict(bc_model_4, data_transform$Sulphates + 1)

data_transform |>
  dplyr::select(where(is.numeric)) |>
  pivot_longer(cols = everything(), names_to = "Feature", values_to = "Value") |>
  filter(!is.na(Value)) |>
  ggplot(aes(x = Value)) +
  geom_histogram(bins = 30, fill = "skyblue", color = "black") +
  facet_wrap(~Feature, scales = "free") +
  labs(title = "Histograms of Numerical Features", x = NULL, y = "Frequency") +
  theme_minimal()

# QQ Plot
plot_qq(data_transform, sampled_rows = 1000L)

3.4 Adding new features

To limit the amount of data/predictors, create new feature variables which combines the existing features.

# 4. Create new features
  # total acidity -- combine all acidity predictors
  data_transform$CombinedAcidity <- data_transform$VolatileAcidity + data_transform$FixedAcidity + data_transform$CitricAcid

3.5 Check for Class Imbalance

YES, there is imbalance—but this is NORMAL for count data. Hence we will not be changing anything.

The distribution is not dominated by zeros as it is not > 50% . Hence looks like it is not Zero Inflated.

table(data_transform$TARGET)
## 
##    0    1    2    3    4    5    6    7    8 
## 2734  244 1091 2611 3177 2014  765  142   17
prop.table(table(data_transform$TARGET))
## 
##           0           1           2           3           4           5 
## 0.213677218 0.019069949 0.085267683 0.204064088 0.248300117 0.157405236 
##           6           7           8 
## 0.059788980 0.011098085 0.001328644

3.5 Split Data for Validation

To ensure objective model evaluation and prevent overfitting, we split each dataset into training (80%) and testing (20%) sets using a consistent random seed (set.seed(123)) for reproducibility.

# Split data for validation
set.seed(123)

train_idx <- createDataPartition(y = raw_data$TARGET, p = 0.8, list = FALSE)

# Create train/test splits for each dataset
train <- raw_data[train_idx, ]
test <- raw_data[-train_idx, ]

# Imputed predictors dataset
train_imputed <- data_imputed[train_idx, ]
test_imputed <- data_imputed[-train_idx, ]

# Transformed predictors dataset
train_transformed <- data_transform[train_idx, ]
test_transformed <- data_transform[-train_idx, ]

4 Build Models

4.1 Poisson Regression Models

4.1 Model 1A: All original predictors w/ STARS_new

This model is the base model, but since poisson regression models can’t have missing data, we test out the first option we created for handling STARS.

set.seed(123)
# Model 1 - All original predictors except for STARS_new since poisson regression models can't have missing values

# Poisson Regression Model
pr_model1a <- glm(TARGET ~ AcidIndex + Alcohol + Chlorides +
             CitricAcid + Density + FixedAcidity + FreeSulfurDioxide +
             LabelAppeal + ResidualSugar + STARS_new + Sulphates + TotalSulfurDioxide +
             VolatileAcidity + pH,
             family = poisson, data = train_transformed)
summary(pr_model1a)
## 
## Call:
## glm(formula = TARGET ~ AcidIndex + Alcohol + Chlorides + CitricAcid + 
##     Density + FixedAcidity + FreeSulfurDioxide + LabelAppeal + 
##     ResidualSugar + STARS_new + Sulphates + TotalSulfurDioxide + 
##     VolatileAcidity + pH, family = poisson, data = train_transformed)
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         2.320e-01  3.832e-01   0.606  0.54481    
## AcidIndex5         -1.011e-01  3.244e-01  -0.312  0.75540    
## AcidIndex6         -9.765e-02  3.176e-01  -0.307  0.75846    
## AcidIndex7         -1.287e-01  3.173e-01  -0.406  0.68494    
## AcidIndex8         -1.659e-01  3.173e-01  -0.523  0.60104    
## AcidIndex9         -2.679e-01  3.178e-01  -0.843  0.39912    
## AcidIndex10        -4.304e-01  3.191e-01  -1.349  0.17742    
## AcidIndex11        -7.457e-01  3.232e-01  -2.307  0.02106 *  
## AcidIndex12        -7.536e-01  3.294e-01  -2.288  0.02216 *  
## AcidIndex13        -5.981e-01  3.321e-01  -1.801  0.07171 .  
## AcidIndex14        -7.065e-01  3.495e-01  -2.022  0.04322 *  
## AcidIndex15        -2.394e-01  4.040e-01  -0.593  0.55349    
## AcidIndex16        -9.554e-01  5.488e-01  -1.741  0.08171 .  
## AcidIndex17        -1.276e+01  1.269e+02  -0.101  0.91991    
## Alcohol             5.128e-03  1.662e-03   3.085  0.00204 ** 
## Chlorides          -6.947e-02  2.961e-02  -2.346  0.01898 *  
## CitricAcid          1.179e-02  1.046e-02   1.128  0.25942    
## Density            -2.576e-01  2.128e-01  -1.210  0.22618    
## FixedAcidity       -3.026e-04  1.282e-03  -0.236  0.81341    
## FreeSulfurDioxide   1.511e-04  6.097e-05   2.479  0.01318 *  
## LabelAppeal-1       2.308e-01  4.175e-02   5.528 3.23e-08 ***
## LabelAppeal0        4.200e-01  4.066e-02  10.331  < 2e-16 ***
## LabelAppeal1        5.583e-01  4.143e-02  13.477  < 2e-16 ***
## LabelAppeal2        6.883e-01  4.690e-02  14.676  < 2e-16 ***
## ResidualSugar       1.908e-04  2.750e-04   0.694  0.48795    
## STARS_new1          7.757e-01  2.194e-02  35.360  < 2e-16 ***
## STARS_new2          1.088e+00  2.055e-02  52.936  < 2e-16 ***
## STARS_new3          1.209e+00  2.162e-02  55.908  < 2e-16 ***
## STARS_new4          1.316e+00  2.738e-02  48.070  < 2e-16 ***
## Sulphates          -1.033e-02  9.858e-03  -1.048  0.29449    
## TotalSulfurDioxide  1.065e-04  3.880e-05   2.745  0.00605 ** 
## VolatileAcidity    -5.714e-02  1.173e-02  -4.873 1.10e-06 ***
## pH                 -7.310e-03  8.525e-03  -0.858  0.39116    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 18252  on 10237  degrees of freedom
## Residual deviance: 10742  on 10205  degrees of freedom
## AIC: 36376
## 
## Number of Fisher Scoring iterations: 10

4.1 Model 1B: All original predictors w/ STARS indicator variable

  • This model is the base model, but since poisson regression models cant have missing data, we need to do something with STARS. This tests out the second option we created for handling STARS, which imputes STARS, but adds a logical indicator variable for “is missing”.

  • The results show that Wines without a STARS rating sell about 66% less than similar wines with a rating(exp(-1.088) ≈ 0.337). Wines with more stars sell more. The model reduces deviance by ~41% → very solid improvement.

  • AIC is good for a Poisson model at this scale.

  • Dispersion = 1 → no overdispersion → Poisson GLM is appropriate.

  • Strong predictors: Alcohol (positive) Chlorides (negative) FreeSulfurDioxide (slightly positive) LabelAppeal (very strong — all levels) VolatileAcidity (strong negative) TotalSulfurDioxide (small positive)

  • Weak / noise-like predictors: (These are not statistically significant) CitricAcid Density FixedAcidity ResidualSugar Sulphates pH Many AcidIndex levels

set.seed(123)
# Model 1B - All original predictors w/ STARS_imputed + is_stars_missing

# Poisson Regression Model
pr_model1b <- glm(TARGET ~ AcidIndex + Alcohol + Chlorides +
             CitricAcid + Density + FixedAcidity + FreeSulfurDioxide +
             LabelAppeal + ResidualSugar + STARS_imputed + is_stars_missing + Sulphates + TotalSulfurDioxide +
             VolatileAcidity + pH,
             family = poisson, data = train_transformed)

# View key model results
summary(pr_model1b)
## 
## Call:
## glm(formula = TARGET ~ AcidIndex + Alcohol + Chlorides + CitricAcid + 
##     Density + FixedAcidity + FreeSulfurDioxide + LabelAppeal + 
##     ResidualSugar + STARS_imputed + is_stars_missing + Sulphates + 
##     TotalSulfurDioxide + VolatileAcidity + pH, family = poisson, 
##     data = train_transformed)
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           1.008e+00  3.832e-01   2.630  0.00854 ** 
## AcidIndex5           -1.011e-01  3.244e-01  -0.312  0.75540    
## AcidIndex6           -9.765e-02  3.176e-01  -0.307  0.75846    
## AcidIndex7           -1.287e-01  3.173e-01  -0.406  0.68494    
## AcidIndex8           -1.659e-01  3.173e-01  -0.523  0.60104    
## AcidIndex9           -2.679e-01  3.178e-01  -0.843  0.39912    
## AcidIndex10          -4.304e-01  3.191e-01  -1.349  0.17742    
## AcidIndex11          -7.457e-01  3.232e-01  -2.307  0.02106 *  
## AcidIndex12          -7.536e-01  3.294e-01  -2.288  0.02216 *  
## AcidIndex13          -5.981e-01  3.321e-01  -1.801  0.07171 .  
## AcidIndex14          -7.065e-01  3.495e-01  -2.022  0.04322 *  
## AcidIndex15          -2.394e-01  4.040e-01  -0.593  0.55349    
## AcidIndex16          -9.554e-01  5.488e-01  -1.741  0.08171 .  
## AcidIndex17          -1.276e+01  1.269e+02  -0.101  0.91991    
## Alcohol               5.128e-03  1.662e-03   3.085  0.00204 ** 
## Chlorides            -6.947e-02  2.961e-02  -2.346  0.01898 *  
## CitricAcid            1.179e-02  1.046e-02   1.128  0.25942    
## Density              -2.576e-01  2.128e-01  -1.210  0.22618    
## FixedAcidity         -3.026e-04  1.282e-03  -0.236  0.81341    
## FreeSulfurDioxide     1.511e-04  6.097e-05   2.479  0.01318 *  
## LabelAppeal-1         2.308e-01  4.175e-02   5.528 3.23e-08 ***
## LabelAppeal0          4.200e-01  4.066e-02  10.331  < 2e-16 ***
## LabelAppeal1          5.583e-01  4.143e-02  13.477  < 2e-16 ***
## LabelAppeal2          6.883e-01  4.690e-02  14.676  < 2e-16 ***
## ResidualSugar         1.908e-04  2.750e-04   0.694  0.48795    
## STARS_imputed2        3.122e-01  1.602e-02  19.492  < 2e-16 ***
## STARS_imputed3        4.329e-01  1.741e-02  24.859  < 2e-16 ***
## STARS_imputed4        5.405e-01  2.431e-02  22.234  < 2e-16 ***
## is_stars_missingTRUE -1.088e+00  2.055e-02 -52.936  < 2e-16 ***
## Sulphates            -1.033e-02  9.858e-03  -1.048  0.29449    
## TotalSulfurDioxide    1.065e-04  3.880e-05   2.745  0.00605 ** 
## VolatileAcidity      -5.714e-02  1.173e-02  -4.873 1.10e-06 ***
## pH                   -7.310e-03  8.525e-03  -0.858  0.39116    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 18252  on 10237  degrees of freedom
## Residual deviance: 10742  on 10205  degrees of freedom
## AIC: 36376
## 
## Number of Fisher Scoring iterations: 10

4.1 Model 1C: Stepwise on Model 1A

step_model1a <- stats::step(pr_model1a, direction = "both")
## Start:  AIC=36376.47
## TARGET ~ AcidIndex + Alcohol + Chlorides + CitricAcid + Density + 
##     FixedAcidity + FreeSulfurDioxide + LabelAppeal + ResidualSugar + 
##     STARS_new + Sulphates + TotalSulfurDioxide + VolatileAcidity + 
##     pH
## 
##                      Df Deviance   AIC
## - FixedAcidity        1    10743 36375
## - ResidualSugar       1    10743 36375
## - pH                  1    10743 36375
## - Sulphates           1    10744 36376
## - CitricAcid          1    10744 36376
## - Density             1    10744 36376
## <none>                     10742 36376
## - Chlorides           1    10748 36380
## - FreeSulfurDioxide   1    10749 36381
## - TotalSulfurDioxide  1    10750 36382
## - Alcohol             1    10752 36384
## - VolatileAcidity     1    10767 36399
## - AcidIndex          13    11066 36674
## - LabelAppeal         4    11299 36925
## - STARS_new           4    15378 41004
## 
## Step:  AIC=36374.52
## TARGET ~ AcidIndex + Alcohol + Chlorides + CitricAcid + Density + 
##     FreeSulfurDioxide + LabelAppeal + ResidualSugar + STARS_new + 
##     Sulphates + TotalSulfurDioxide + VolatileAcidity + pH
## 
##                      Df Deviance   AIC
## - ResidualSugar       1    10743 36373
## - pH                  1    10743 36373
## - Sulphates           1    10744 36374
## - CitricAcid          1    10744 36374
## - Density             1    10744 36374
## <none>                     10743 36375
## + FixedAcidity        1    10742 36376
## - Chlorides           1    10748 36378
## - FreeSulfurDioxide   1    10749 36379
## - TotalSulfurDioxide  1    10750 36380
## - Alcohol             1    10752 36382
## - VolatileAcidity     1    10767 36397
## - AcidIndex          13    11075 36681
## - LabelAppeal         4    11299 36923
## - STARS_new           4    15379 41003
## 
## Step:  AIC=36373.01
## TARGET ~ AcidIndex + Alcohol + Chlorides + CitricAcid + Density + 
##     FreeSulfurDioxide + LabelAppeal + STARS_new + Sulphates + 
##     TotalSulfurDioxide + VolatileAcidity + pH
## 
##                      Df Deviance   AIC
## - pH                  1    10744 36372
## - Sulphates           1    10744 36372
## - CitricAcid          1    10744 36372
## - Density             1    10744 36372
## <none>                     10743 36373
## + ResidualSugar       1    10743 36375
## + FixedAcidity        1    10743 36375
## - Chlorides           1    10749 36377
## - FreeSulfurDioxide   1    10749 36377
## - TotalSulfurDioxide  1    10751 36379
## - Alcohol             1    10752 36380
## - VolatileAcidity     1    10767 36395
## - AcidIndex          13    11076 36680
## - LabelAppeal         4    11300 36922
## - STARS_new           4    15381 41003
## 
## Step:  AIC=36371.73
## TARGET ~ AcidIndex + Alcohol + Chlorides + CitricAcid + Density + 
##     FreeSulfurDioxide + LabelAppeal + STARS_new + Sulphates + 
##     TotalSulfurDioxide + VolatileAcidity
## 
##                      Df Deviance   AIC
## - Sulphates           1    10745 36371
## - CitricAcid          1    10745 36371
## - Density             1    10745 36371
## <none>                     10744 36372
## + pH                  1    10743 36373
## + ResidualSugar       1    10743 36373
## + FixedAcidity        1    10744 36374
## - Chlorides           1    10749 36375
## - FreeSulfurDioxide   1    10750 36376
## - TotalSulfurDioxide  1    10751 36377
## - Alcohol             1    10753 36379
## - VolatileAcidity     1    10768 36394
## - AcidIndex          13    11076 36678
## - LabelAppeal         4    11300 36920
## - STARS_new           4    15385 41005
## 
## Step:  AIC=36370.87
## TARGET ~ AcidIndex + Alcohol + Chlorides + CitricAcid + Density + 
##     FreeSulfurDioxide + LabelAppeal + STARS_new + TotalSulfurDioxide + 
##     VolatileAcidity
## 
##                      Df Deviance   AIC
## - CitricAcid          1    10746 36370
## - Density             1    10746 36370
## <none>                     10745 36371
## + Sulphates           1    10744 36372
## + pH                  1    10744 36372
## + ResidualSugar       1    10744 36372
## + FixedAcidity        1    10745 36373
## - Chlorides           1    10750 36375
## - FreeSulfurDioxide   1    10751 36375
## - TotalSulfurDioxide  1    10752 36376
## - Alcohol             1    10754 36378
## - VolatileAcidity     1    10769 36393
## - AcidIndex          13    11078 36679
## - LabelAppeal         4    11301 36919
## - STARS_new           4    15388 41007
## 
## Step:  AIC=36370.07
## TARGET ~ AcidIndex + Alcohol + Chlorides + Density + FreeSulfurDioxide + 
##     LabelAppeal + STARS_new + TotalSulfurDioxide + VolatileAcidity
## 
##                      Df Deviance   AIC
## - Density             1    10748 36370
## <none>                     10746 36370
## + CitricAcid          1    10745 36371
## + Sulphates           1    10745 36371
## + pH                  1    10745 36371
## + ResidualSugar       1    10746 36372
## + FixedAcidity        1    10746 36372
## - Chlorides           1    10752 36374
## - FreeSulfurDioxide   1    10752 36374
## - TotalSulfurDioxide  1    10754 36376
## - Alcohol             1    10756 36378
## - VolatileAcidity     1    10771 36393
## - AcidIndex          13    11079 36677
## - LabelAppeal         4    11303 36919
## - STARS_new           4    15395 41011
## 
## Step:  AIC=36369.59
## TARGET ~ AcidIndex + Alcohol + Chlorides + FreeSulfurDioxide + 
##     LabelAppeal + STARS_new + TotalSulfurDioxide + VolatileAcidity
## 
##                      Df Deviance   AIC
## <none>                     10748 36370
## + Density             1    10746 36370
## + CitricAcid          1    10746 36370
## + Sulphates           1    10746 36370
## + pH                  1    10747 36371
## + ResidualSugar       1    10747 36371
## + FixedAcidity        1    10748 36372
## - Chlorides           1    10754 36373
## - FreeSulfurDioxide   1    10754 36374
## - TotalSulfurDioxide  1    10755 36375
## - Alcohol             1    10757 36377
## - VolatileAcidity     1    10772 36392
## - AcidIndex          13    11082 36678
## - LabelAppeal         4    11304 36918
## - STARS_new           4    15400 41014
summary(step_model1a)
## 
## Call:
## glm(formula = TARGET ~ AcidIndex + Alcohol + Chlorides + FreeSulfurDioxide + 
##     LabelAppeal + STARS_new + TotalSulfurDioxide + VolatileAcidity, 
##     family = poisson, data = train_transformed)
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        -5.230e-02  3.200e-01  -0.163  0.87017    
## AcidIndex5         -1.015e-01  3.240e-01  -0.313  0.75414    
## AcidIndex6         -9.536e-02  3.171e-01  -0.301  0.76364    
## AcidIndex7         -1.260e-01  3.167e-01  -0.398  0.69082    
## AcidIndex8         -1.626e-01  3.168e-01  -0.513  0.60773    
## AcidIndex9         -2.648e-01  3.172e-01  -0.835  0.40387    
## AcidIndex10        -4.284e-01  3.185e-01  -1.345  0.17863    
## AcidIndex11        -7.458e-01  3.226e-01  -2.312  0.02079 *  
## AcidIndex12        -7.501e-01  3.288e-01  -2.281  0.02253 *  
## AcidIndex13        -5.940e-01  3.314e-01  -1.792  0.07309 .  
## AcidIndex14        -7.042e-01  3.487e-01  -2.019  0.04346 *  
## AcidIndex15        -2.255e-01  4.035e-01  -0.559  0.57622    
## AcidIndex16        -9.523e-01  5.481e-01  -1.738  0.08230 .  
## AcidIndex17        -1.275e+01  1.269e+02  -0.100  0.91995    
## Alcohol             5.179e-03  1.662e-03   3.117  0.00183 ** 
## Chlorides          -7.114e-02  2.960e-02  -2.404  0.01623 *  
## FreeSulfurDioxide   1.517e-04  6.094e-05   2.489  0.01282 *  
## LabelAppeal-1       2.315e-01  4.174e-02   5.547 2.91e-08 ***
## LabelAppeal0        4.203e-01  4.066e-02  10.337  < 2e-16 ***
## LabelAppeal1        5.591e-01  4.143e-02  13.497  < 2e-16 ***
## LabelAppeal2        6.879e-01  4.690e-02  14.667  < 2e-16 ***
## STARS_new1          7.763e-01  2.193e-02  35.395  < 2e-16 ***
## STARS_new2          1.089e+00  2.054e-02  53.008  < 2e-16 ***
## STARS_new3          1.210e+00  2.161e-02  55.999  < 2e-16 ***
## STARS_new4          1.317e+00  2.737e-02  48.125  < 2e-16 ***
## TotalSulfurDioxide  1.063e-04  3.876e-05   2.742  0.00611 ** 
## VolatileAcidity    -5.727e-02  1.172e-02  -4.885 1.04e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 18252  on 10237  degrees of freedom
## Residual deviance: 10748  on 10211  degrees of freedom
## AIC: 36370
## 
## Number of Fisher Scoring iterations: 10

4.1 Model 1D: Stepwise on Model 1B

Stepwise selected a model with these predictors:

Important: AcidIndex Alcohol Chlorides FreeSulfurDioxide LabelAppeal STARS_imputed is_stars_missing TotalSulfurDioxide VolatileAcidity

Dropped/Not Important: FixedAcidity ResidualSugar pH Sulphates CitricAcid Density

AIC: AIC=36369.59

step_model1b <- stepAIC(pr_model1b, direction = "both",Trace = 0)
## Start:  AIC=36376.47
## TARGET ~ AcidIndex + Alcohol + Chlorides + CitricAcid + Density + 
##     FixedAcidity + FreeSulfurDioxide + LabelAppeal + ResidualSugar + 
##     STARS_imputed + is_stars_missing + Sulphates + TotalSulfurDioxide + 
##     VolatileAcidity + pH
## 
##                      Df Deviance   AIC
## - FixedAcidity        1    10743 36375
## - ResidualSugar       1    10743 36375
## - pH                  1    10743 36375
## - Sulphates           1    10744 36376
## - CitricAcid          1    10744 36376
## - Density             1    10744 36376
## <none>                     10742 36376
## - Chlorides           1    10748 36380
## - FreeSulfurDioxide   1    10749 36381
## - TotalSulfurDioxide  1    10750 36382
## - Alcohol             1    10752 36384
## - VolatileAcidity     1    10767 36399
## - AcidIndex          13    11066 36674
## - LabelAppeal         4    11299 36925
## - STARS_imputed       3    11551 37179
## - is_stars_missing    1    14033 39665
## 
## Step:  AIC=36374.52
## TARGET ~ AcidIndex + Alcohol + Chlorides + CitricAcid + Density + 
##     FreeSulfurDioxide + LabelAppeal + ResidualSugar + STARS_imputed + 
##     is_stars_missing + Sulphates + TotalSulfurDioxide + VolatileAcidity + 
##     pH
## 
##                      Df Deviance   AIC
## - ResidualSugar       1    10743 36373
## - pH                  1    10743 36373
## - Sulphates           1    10744 36374
## - CitricAcid          1    10744 36374
## - Density             1    10744 36374
## <none>                     10743 36375
## + FixedAcidity        1    10742 36376
## - Chlorides           1    10748 36378
## - FreeSulfurDioxide   1    10749 36379
## - TotalSulfurDioxide  1    10750 36380
## - Alcohol             1    10752 36382
## - VolatileAcidity     1    10767 36397
## - AcidIndex          13    11075 36681
## - LabelAppeal         4    11299 36923
## - STARS_imputed       3    11552 37178
## - is_stars_missing    1    14034 39664
## 
## Step:  AIC=36373.01
## TARGET ~ AcidIndex + Alcohol + Chlorides + CitricAcid + Density + 
##     FreeSulfurDioxide + LabelAppeal + STARS_imputed + is_stars_missing + 
##     Sulphates + TotalSulfurDioxide + VolatileAcidity + pH
## 
##                      Df Deviance   AIC
## - pH                  1    10744 36372
## - Sulphates           1    10744 36372
## - CitricAcid          1    10744 36372
## - Density             1    10744 36372
## <none>                     10743 36373
## + ResidualSugar       1    10743 36375
## + FixedAcidity        1    10743 36375
## - Chlorides           1    10749 36377
## - FreeSulfurDioxide   1    10749 36377
## - TotalSulfurDioxide  1    10751 36379
## - Alcohol             1    10752 36380
## - VolatileAcidity     1    10767 36395
## - AcidIndex          13    11076 36680
## - LabelAppeal         4    11300 36922
## - STARS_imputed       3    11553 37177
## - is_stars_missing    1    14034 39662
## 
## Step:  AIC=36371.73
## TARGET ~ AcidIndex + Alcohol + Chlorides + CitricAcid + Density + 
##     FreeSulfurDioxide + LabelAppeal + STARS_imputed + is_stars_missing + 
##     Sulphates + TotalSulfurDioxide + VolatileAcidity
## 
##                      Df Deviance   AIC
## - Sulphates           1    10745 36371
## - CitricAcid          1    10745 36371
## - Density             1    10745 36371
## <none>                     10744 36372
## + pH                  1    10743 36373
## + ResidualSugar       1    10743 36373
## + FixedAcidity        1    10744 36374
## - Chlorides           1    10749 36375
## - FreeSulfurDioxide   1    10750 36376
## - TotalSulfurDioxide  1    10751 36377
## - Alcohol             1    10753 36379
## - VolatileAcidity     1    10768 36394
## - AcidIndex          13    11076 36678
## - LabelAppeal         4    11300 36920
## - STARS_imputed       3    11554 37176
## - is_stars_missing    1    14038 39664
## 
## Step:  AIC=36370.87
## TARGET ~ AcidIndex + Alcohol + Chlorides + CitricAcid + Density + 
##     FreeSulfurDioxide + LabelAppeal + STARS_imputed + is_stars_missing + 
##     TotalSulfurDioxide + VolatileAcidity
## 
##                      Df Deviance   AIC
## - CitricAcid          1    10746 36370
## - Density             1    10746 36370
## <none>                     10745 36371
## + Sulphates           1    10744 36372
## + pH                  1    10744 36372
## + ResidualSugar       1    10744 36372
## + FixedAcidity        1    10745 36373
## - Chlorides           1    10750 36375
## - FreeSulfurDioxide   1    10751 36375
## - TotalSulfurDioxide  1    10752 36376
## - Alcohol             1    10754 36378
## - VolatileAcidity     1    10769 36393
## - AcidIndex          13    11078 36679
## - LabelAppeal         4    11301 36919
## - STARS_imputed       3    11556 37175
## - is_stars_missing    1    14039 39663
## 
## Step:  AIC=36370.07
## TARGET ~ AcidIndex + Alcohol + Chlorides + Density + FreeSulfurDioxide + 
##     LabelAppeal + STARS_imputed + is_stars_missing + TotalSulfurDioxide + 
##     VolatileAcidity
## 
##                      Df Deviance   AIC
## - Density             1    10748 36370
## <none>                     10746 36370
## + CitricAcid          1    10745 36371
## + Sulphates           1    10745 36371
## + pH                  1    10745 36371
## + ResidualSugar       1    10746 36372
## + FixedAcidity        1    10746 36372
## - Chlorides           1    10752 36374
## - FreeSulfurDioxide   1    10752 36374
## - TotalSulfurDioxide  1    10754 36376
## - Alcohol             1    10756 36378
## - VolatileAcidity     1    10771 36393
## - AcidIndex          13    11079 36677
## - LabelAppeal         4    11303 36919
## - STARS_imputed       3    11557 37175
## - is_stars_missing    1    14044 39666
## 
## Step:  AIC=36369.59
## TARGET ~ AcidIndex + Alcohol + Chlorides + FreeSulfurDioxide + 
##     LabelAppeal + STARS_imputed + is_stars_missing + TotalSulfurDioxide + 
##     VolatileAcidity
## 
##                      Df Deviance   AIC
## <none>                     10748 36370
## + Density             1    10746 36370
## + CitricAcid          1    10746 36370
## + Sulphates           1    10746 36370
## + pH                  1    10747 36371
## + ResidualSugar       1    10747 36371
## + FixedAcidity        1    10748 36372
## - Chlorides           1    10754 36373
## - FreeSulfurDioxide   1    10754 36374
## - TotalSulfurDioxide  1    10755 36375
## - Alcohol             1    10757 36377
## - VolatileAcidity     1    10772 36392
## - AcidIndex          13    11082 36678
## - LabelAppeal         4    11304 36918
## - STARS_imputed       3    11559 37175
## - is_stars_missing    1    14048 39668
summary(step_model1b)
## 
## Call:
## glm(formula = TARGET ~ AcidIndex + Alcohol + Chlorides + FreeSulfurDioxide + 
##     LabelAppeal + STARS_imputed + is_stars_missing + TotalSulfurDioxide + 
##     VolatileAcidity, family = poisson, data = train_transformed)
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           7.240e-01  3.200e-01   2.262  0.02367 *  
## AcidIndex5           -1.015e-01  3.240e-01  -0.313  0.75414    
## AcidIndex6           -9.536e-02  3.171e-01  -0.301  0.76364    
## AcidIndex7           -1.260e-01  3.167e-01  -0.398  0.69082    
## AcidIndex8           -1.626e-01  3.168e-01  -0.513  0.60773    
## AcidIndex9           -2.648e-01  3.172e-01  -0.835  0.40387    
## AcidIndex10          -4.284e-01  3.185e-01  -1.345  0.17863    
## AcidIndex11          -7.458e-01  3.226e-01  -2.312  0.02079 *  
## AcidIndex12          -7.501e-01  3.288e-01  -2.281  0.02253 *  
## AcidIndex13          -5.940e-01  3.314e-01  -1.792  0.07309 .  
## AcidIndex14          -7.042e-01  3.487e-01  -2.019  0.04346 *  
## AcidIndex15          -2.255e-01  4.035e-01  -0.559  0.57622    
## AcidIndex16          -9.523e-01  5.481e-01  -1.738  0.08230 .  
## AcidIndex17          -1.275e+01  1.269e+02  -0.100  0.91995    
## Alcohol               5.179e-03  1.662e-03   3.117  0.00183 ** 
## Chlorides            -7.114e-02  2.960e-02  -2.404  0.01623 *  
## FreeSulfurDioxide     1.517e-04  6.094e-05   2.489  0.01282 *  
## LabelAppeal-1         2.315e-01  4.174e-02   5.547 2.91e-08 ***
## LabelAppeal0          4.203e-01  4.066e-02  10.337  < 2e-16 ***
## LabelAppeal1          5.591e-01  4.143e-02  13.497  < 2e-16 ***
## LabelAppeal2          6.879e-01  4.690e-02  14.667  < 2e-16 ***
## STARS_imputed2        3.127e-01  1.601e-02  19.529  < 2e-16 ***
## STARS_imputed3        4.337e-01  1.741e-02  24.914  < 2e-16 ***
## STARS_imputed4        5.411e-01  2.431e-02  22.258  < 2e-16 ***
## is_stars_missingTRUE -1.089e+00  2.054e-02 -53.008  < 2e-16 ***
## TotalSulfurDioxide    1.063e-04  3.876e-05   2.742  0.00611 ** 
## VolatileAcidity      -5.727e-02  1.172e-02  -4.885 1.04e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 18252  on 10237  degrees of freedom
## Residual deviance: 10748  on 10211  degrees of freedom
## AIC: 36370
## 
## Number of Fisher Scoring iterations: 10

4.1 Model 2A: Transformed predictors w/ STARS indicator variable

Box-cox is not necessarily needed for poisson, since the predictors do not need to be normally distributed (only the residuals should ideally be normally distributed).

Poisson regression assumes log link on the mean, variance = mean (or NB variance > mean) and no normality requirement. Null deviance: 18252 on 10237 degrees of freedom Residual deviance: 10701 on 10205 degrees of freedom

This means the model explains ~40% of the deviance, which is respectable for sales-count prediction.

AIC: 36335

set.seed(123)
# Model 2A - All transformed predictors except Box-cox w/ STARS_imputed + is_stars_missing

# Poisson Regression Model
pr_model2a <- glm(TARGET ~ AcidIndex + SQRT_Alcohol + LOG_Chlorides +
             LOG_CitricAcid + Density + SQRT_FixedAcidity + LOG_FreeSulfurDioxide +
             LabelAppeal + LOG_ResidualSugar + STARS_imputed + is_stars_missing + LOG_Sulphates + LOG_TotalSulfurDioxide + LOG_VolatileAcidity + pH,
             family = poisson, data = train_transformed)

# View key model results
summary(pr_model2a)
## 
## Call:
## glm(formula = TARGET ~ AcidIndex + SQRT_Alcohol + LOG_Chlorides + 
##     LOG_CitricAcid + Density + SQRT_FixedAcidity + LOG_FreeSulfurDioxide + 
##     LabelAppeal + LOG_ResidualSugar + STARS_imputed + is_stars_missing + 
##     LOG_Sulphates + LOG_TotalSulfurDioxide + LOG_VolatileAcidity + 
##     pH, family = poisson, data = train_transformed)
## 
## Coefficients:
##                          Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              0.685627   0.388030   1.767  0.07724 .  
## AcidIndex5              -0.095278   0.324422  -0.294  0.76900    
## AcidIndex6              -0.096417   0.317604  -0.304  0.76145    
## AcidIndex7              -0.130118   0.317281  -0.410  0.68173    
## AcidIndex8              -0.165732   0.317361  -0.522  0.60152    
## AcidIndex9              -0.263934   0.317803  -0.830  0.40626    
## AcidIndex10             -0.415322   0.319194  -1.301  0.19320    
## AcidIndex11             -0.726621   0.323303  -2.247  0.02461 *  
## AcidIndex12             -0.722677   0.329529  -2.193  0.02830 *  
## AcidIndex13             -0.573343   0.332169  -1.726  0.08434 .  
## AcidIndex14             -0.683363   0.349551  -1.955  0.05059 .  
## AcidIndex15             -0.236369   0.404091  -0.585  0.55859    
## AcidIndex16             -0.938628   0.548728  -1.711  0.08716 .  
## AcidIndex17            -12.746968 126.889324  -0.100  0.91998    
## SQRT_Alcohol             0.034064   0.010984   3.101  0.00193 ** 
## LOG_Chlorides           -0.094544   0.039468  -2.395  0.01660 *  
## LOG_CitricAcid           0.037606   0.020680   1.818  0.06899 .  
## Density                 -0.266816   0.212829  -1.254  0.20997    
## SQRT_FixedAcidity       -0.002293   0.008169  -0.281  0.77897    
## LOG_FreeSulfurDioxide    0.028216   0.006364   4.434 9.26e-06 ***
## LabelAppeal-1            0.232349   0.041745   5.566 2.61e-08 ***
## LabelAppeal0             0.421596   0.040660  10.369  < 2e-16 ***
## LabelAppeal1             0.558933   0.041430  13.491  < 2e-16 ***
## LabelAppeal2             0.687593   0.046905  14.659  < 2e-16 ***
## LOG_ResidualSugar        0.008681   0.005862   1.481  0.13866    
## STARS_imputed2           0.309103   0.016025  19.289  < 2e-16 ***
## STARS_imputed3           0.430572   0.017422  24.715  < 2e-16 ***
## STARS_imputed4           0.539338   0.024309  22.187  < 2e-16 ***
## is_stars_missingTRUE    -1.080252   0.020584 -52.481  < 2e-16 ***
## LOG_Sulphates           -0.021862   0.021542  -1.015  0.31017    
## LOG_TotalSulfurDioxide   0.037061   0.007933   4.671 2.99e-06 ***
## LOG_VolatileAcidity     -0.116069   0.022305  -5.204 1.95e-07 ***
## pH                      -0.006466   0.008523  -0.759  0.44805    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 18252  on 10237  degrees of freedom
## Residual deviance: 10701  on 10205  degrees of freedom
## AIC: 36335
## 
## Number of Fisher Scoring iterations: 10

4.1 Model 2B: Transformed predictors w/ STARS indicator and reduced variables

Density, FixedAcidity (sqrt),Sulphates (log), pH, ResidualSugar (log), CitricAcid are removed as its p value is high and as observed in stepwise selection.

Null deviance: 18252 on 10237 degrees of freedom Residual deviance: 10710 on 10211 degrees of freedom AIC: 36332

pr_model2b <- glm(TARGET ~ AcidIndex + SQRT_Alcohol + LOG_Chlorides +
              LOG_FreeSulfurDioxide + LabelAppeal + STARS_imputed + is_stars_missing  + LOG_TotalSulfurDioxide + LOG_VolatileAcidity,
             family = poisson, data = train_transformed)


# View key model results
summary(pr_model2b)
## 
## Call:
## glm(formula = TARGET ~ AcidIndex + SQRT_Alcohol + LOG_Chlorides + 
##     LOG_FreeSulfurDioxide + LabelAppeal + STARS_imputed + is_stars_missing + 
##     LOG_TotalSulfurDioxide + LOG_VolatileAcidity, family = poisson, 
##     data = train_transformed)
## 
## Coefficients:
##                          Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              0.408632   0.325098   1.257  0.20877    
## AcidIndex5              -0.099774   0.324005  -0.308  0.75813    
## AcidIndex6              -0.094697   0.317111  -0.299  0.76523    
## AcidIndex7              -0.127005   0.316750  -0.401  0.68845    
## AcidIndex8              -0.161721   0.316798  -0.510  0.60971    
## AcidIndex9              -0.259867   0.317179  -0.819  0.41261    
## AcidIndex10             -0.413412   0.318545  -1.298  0.19435    
## AcidIndex11             -0.726702   0.322612  -2.253  0.02429 *  
## AcidIndex12             -0.717667   0.328826  -2.183  0.02907 *  
## AcidIndex13             -0.567891   0.331442  -1.713  0.08664 .  
## AcidIndex14             -0.679543   0.348730  -1.949  0.05134 .  
## AcidIndex15             -0.220163   0.403494  -0.546  0.58531    
## AcidIndex16             -0.933608   0.548093  -1.703  0.08850 .  
## AcidIndex17            -12.732403 126.889876  -0.100  0.92007    
## SQRT_Alcohol             0.033951   0.010974   3.094  0.00198 ** 
## LOG_Chlorides           -0.097672   0.039446  -2.476  0.01328 *  
## LOG_FreeSulfurDioxide    0.028718   0.006356   4.518 6.23e-06 ***
## LabelAppeal-1            0.233074   0.041741   5.584 2.35e-08 ***
## LabelAppeal0             0.421678   0.040659  10.371  < 2e-16 ***
## LabelAppeal1             0.559917   0.041426  13.516  < 2e-16 ***
## LabelAppeal2             0.686947   0.046902  14.646  < 2e-16 ***
## STARS_imputed2           0.309782   0.016020  19.337  < 2e-16 ***
## STARS_imputed3           0.431544   0.017416  24.779  < 2e-16 ***
## STARS_imputed4           0.540208   0.024308  22.224  < 2e-16 ***
## is_stars_missingTRUE    -1.082050   0.020575 -52.592  < 2e-16 ***
## LOG_TotalSulfurDioxide   0.037833   0.007917   4.779 1.76e-06 ***
## LOG_VolatileAcidity     -0.116714   0.022301  -5.234 1.66e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 18252  on 10237  degrees of freedom
## Residual deviance: 10710  on 10211  degrees of freedom
## AIC: 36332
## 
## Number of Fisher Scoring iterations: 10

4.1 Model 3: Combined predictors w/ STARS indicator variable

Uses CombinedAcidity

Null deviance: 18252 on 10237 degrees of freedom Residual deviance: 10737 on 10211 degrees of freedom AIC: 36359

pr_model3 <- glm(TARGET ~ AcidIndex + SQRT_Alcohol + LOG_Chlorides +
              LOG_FreeSulfurDioxide + LabelAppeal + STARS_imputed + is_stars_missing  + LOG_TotalSulfurDioxide +CombinedAcidity,
             family = poisson, data = train_transformed)


# View key model results
summary(pr_model3)
## 
## Call:
## glm(formula = TARGET ~ AcidIndex + SQRT_Alcohol + LOG_Chlorides + 
##     LOG_FreeSulfurDioxide + LabelAppeal + STARS_imputed + is_stars_missing + 
##     LOG_TotalSulfurDioxide + CombinedAcidity, family = poisson, 
##     data = train_transformed)
## 
## Coefficients:
##                          Estimate Std. Error z value Pr(>|z|)    
## (Intercept)             3.730e-01  3.251e-01   1.147  0.25124    
## AcidIndex5             -1.261e-01  3.240e-01  -0.389  0.69713    
## AcidIndex6             -1.148e-01  3.171e-01  -0.362  0.71727    
## AcidIndex7             -1.453e-01  3.168e-01  -0.459  0.64647    
## AcidIndex8             -1.806e-01  3.168e-01  -0.570  0.56876    
## AcidIndex9             -2.797e-01  3.172e-01  -0.882  0.37799    
## AcidIndex10            -4.335e-01  3.186e-01  -1.361  0.17363    
## AcidIndex11            -7.501e-01  3.227e-01  -2.324  0.02010 *  
## AcidIndex12            -7.362e-01  3.289e-01  -2.238  0.02521 *  
## AcidIndex13            -5.908e-01  3.315e-01  -1.782  0.07473 .  
## AcidIndex14            -7.102e-01  3.489e-01  -2.036  0.04178 *  
## AcidIndex15            -2.395e-01  4.036e-01  -0.593  0.55285    
## AcidIndex16            -9.174e-01  5.486e-01  -1.672  0.09447 .  
## AcidIndex17            -1.280e+01  1.268e+02  -0.101  0.91962    
## SQRT_Alcohol            3.355e-02  1.097e-02   3.059  0.00222 ** 
## LOG_Chlorides          -9.968e-02  3.946e-02  -2.526  0.01154 *  
## LOG_FreeSulfurDioxide   2.955e-02  6.350e-03   4.654 3.26e-06 ***
## LabelAppeal-1           2.381e-01  4.173e-02   5.705 1.16e-08 ***
## LabelAppeal0            4.265e-01  4.065e-02  10.493  < 2e-16 ***
## LabelAppeal1            5.652e-01  4.141e-02  13.648  < 2e-16 ***
## LabelAppeal2            6.926e-01  4.689e-02  14.771  < 2e-16 ***
## STARS_imputed2          3.123e-01  1.601e-02  19.502  < 2e-16 ***
## STARS_imputed3          4.336e-01  1.741e-02  24.907  < 2e-16 ***
## STARS_imputed4          5.441e-01  2.429e-02  22.397  < 2e-16 ***
## is_stars_missingTRUE   -1.086e+00  2.056e-02 -52.851  < 2e-16 ***
## LOG_TotalSulfurDioxide  3.947e-02  7.913e-03   4.988 6.09e-07 ***
## CombinedAcidity        -7.925e-04  1.267e-03  -0.626  0.53145    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 18252  on 10237  degrees of freedom
## Residual deviance: 10737  on 10211  degrees of freedom
## AIC: 36359
## 
## Number of Fisher Scoring iterations: 10

Dispersion = 0.8698125 indicates mild underdispersion, but is not a concern as it often indicates: Good model fit, Residuals are not excessively variable, No need to switch models.

disp <- sum(residuals(pr_model2b, type="pearson")^2) / pr_model1b$df.residual
disp
## [1] 0.8703239

4.2 Negative Binomial Regression Models

4.2 Model 1: Original variables with STAR imputed

The predictors for the following model are: FixedAcidity, VolatileAcidity, CitricAcid, ResidualSugar, Chlorides, FreeSulfurDioxide, TotalSulfurDioxide, Density, pH, Sulphates, Alcohol, LabelAppeal, AcidIndex, STARS

Theta (dispersion parameter) is extremely large 40,995. The NB algorithm struggled to estimate theta accurately, thus the warning. This typically means the model is too close to Poisson for NB to add value. Overdispersion is not present or is very small.

AIC = 36,379 — comparable to Poisson AIC.

# Negative Binomial Model 1
nb_model1 <- glm.nb(TARGET ~ AcidIndex + Alcohol + Chlorides +
             CitricAcid + Density + FixedAcidity + FreeSulfurDioxide +
             LabelAppeal + ResidualSugar + STARS_imputed + is_stars_missing + Sulphates + TotalSulfurDioxide +
             VolatileAcidity + pH, data = train_transformed, init.theta = 40957.00204, link = log)
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in glm.nb(TARGET ~ AcidIndex + Alcohol + Chlorides + CitricAcid + :
## alternation limit reached
summary(nb_model1)
## 
## Call:
## glm.nb(formula = TARGET ~ AcidIndex + Alcohol + Chlorides + CitricAcid + 
##     Density + FixedAcidity + FreeSulfurDioxide + LabelAppeal + 
##     ResidualSugar + STARS_imputed + is_stars_missing + Sulphates + 
##     TotalSulfurDioxide + VolatileAcidity + pH, data = train_transformed, 
##     init.theta = 40994.61229, link = log)
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           1.008e+00  3.832e-01   2.630  0.00854 ** 
## AcidIndex5           -1.011e-01  3.244e-01  -0.312  0.75537    
## AcidIndex6           -9.767e-02  3.176e-01  -0.308  0.75843    
## AcidIndex7           -1.287e-01  3.173e-01  -0.406  0.68491    
## AcidIndex8           -1.660e-01  3.173e-01  -0.523  0.60101    
## AcidIndex9           -2.680e-01  3.178e-01  -0.843  0.39911    
## AcidIndex10          -4.305e-01  3.192e-01  -1.349  0.17741    
## AcidIndex11          -7.457e-01  3.233e-01  -2.307  0.02106 *  
## AcidIndex12          -7.536e-01  3.294e-01  -2.288  0.02216 *  
## AcidIndex13          -5.981e-01  3.321e-01  -1.801  0.07171 .  
## AcidIndex14          -7.065e-01  3.495e-01  -2.022  0.04322 *  
## AcidIndex15          -2.394e-01  4.040e-01  -0.593  0.55347    
## AcidIndex16          -9.554e-01  5.488e-01  -1.741  0.08171 .  
## AcidIndex17          -3.750e+01  3.001e+07   0.000  1.00000    
## Alcohol               5.128e-03  1.662e-03   3.084  0.00204 ** 
## Chlorides            -6.947e-02  2.961e-02  -2.346  0.01898 *  
## CitricAcid            1.179e-02  1.046e-02   1.128  0.25942    
## Density              -2.576e-01  2.128e-01  -1.210  0.22619    
## FixedAcidity         -3.026e-04  1.282e-03  -0.236  0.81342    
## FreeSulfurDioxide     1.511e-04  6.097e-05   2.479  0.01318 *  
## LabelAppeal-1         2.308e-01  4.175e-02   5.528 3.23e-08 ***
## LabelAppeal0          4.200e-01  4.066e-02  10.331  < 2e-16 ***
## LabelAppeal1          5.583e-01  4.143e-02  13.477  < 2e-16 ***
## LabelAppeal2          6.883e-01  4.690e-02  14.675  < 2e-16 ***
## ResidualSugar         1.908e-04  2.751e-04   0.694  0.48794    
## STARS_imputed2        3.122e-01  1.602e-02  19.491  < 2e-16 ***
## STARS_imputed3        4.329e-01  1.742e-02  24.858  < 2e-16 ***
## STARS_imputed4        5.405e-01  2.431e-02  22.233  < 2e-16 ***
## is_stars_missingTRUE -1.088e+00  2.055e-02 -52.935  < 2e-16 ***
## Sulphates            -1.033e-02  9.858e-03  -1.048  0.29449    
## TotalSulfurDioxide    1.065e-04  3.880e-05   2.745  0.00605 ** 
## VolatileAcidity      -5.714e-02  1.173e-02  -4.873 1.10e-06 ***
## pH                   -7.311e-03  8.525e-03  -0.858  0.39113    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(40994.61) family taken to be 1)
## 
##     Null deviance: 18251  on 10237  degrees of freedom
## Residual deviance: 10742  on 10205  degrees of freedom
## AIC: 36379
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  40995 
##           Std. Err.:  38244 
## Warning while fitting theta: alternation limit reached 
## 
##  2 x log-likelihood:  -36310.81

4.2 Model 2: Transformed variables with STAR imputed

Theta (dispersion parameter) is extremely large 41,187. The NB algorithm struggled to estimate theta accurately, thus the warning. This typically means the model is too close to Poisson for NB to add value. Overdispersion is very small.

AcidIndex is messy espacially AcidIndex17 as it has estimate = -37.48
Looks like it is Very rare category (almost no records),Complete or quasi-complete separation and the Model cannot estimate its effect → numerical instability. AIC = 36,334 — comparable to Poisson AIC.

nb_model2  <- glm.nb(TARGET ~ AcidIndex + SQRT_Alcohol + LOG_Chlorides +
              LOG_FreeSulfurDioxide + LabelAppeal + STARS_imputed + is_stars_missing  + LOG_TotalSulfurDioxide + LOG_VolatileAcidity,
             data = train_transformed)
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in glm.nb(TARGET ~ AcidIndex + SQRT_Alcohol + LOG_Chlorides +
## LOG_FreeSulfurDioxide + : alternation limit reached
summary(nb_model2)
## 
## Call:
## glm.nb(formula = TARGET ~ AcidIndex + SQRT_Alcohol + LOG_Chlorides + 
##     LOG_FreeSulfurDioxide + LabelAppeal + STARS_imputed + is_stars_missing + 
##     LOG_TotalSulfurDioxide + LOG_VolatileAcidity, data = train_transformed, 
##     init.theta = 41187.11289, link = log)
## 
## Coefficients:
##                          Estimate Std. Error z value Pr(>|z|)    
## (Intercept)             4.086e-01  3.251e-01   1.257  0.20878    
## AcidIndex5             -9.979e-02  3.240e-01  -0.308  0.75810    
## AcidIndex6             -9.471e-02  3.171e-01  -0.299  0.76519    
## AcidIndex7             -1.270e-01  3.168e-01  -0.401  0.68842    
## AcidIndex8             -1.617e-01  3.168e-01  -0.511  0.60969    
## AcidIndex9             -2.599e-01  3.172e-01  -0.819  0.41259    
## AcidIndex10            -4.134e-01  3.186e-01  -1.298  0.19435    
## AcidIndex11            -7.267e-01  3.226e-01  -2.253  0.02429 *  
## AcidIndex12            -7.177e-01  3.288e-01  -2.182  0.02907 *  
## AcidIndex13            -5.679e-01  3.315e-01  -1.713  0.08664 .  
## AcidIndex14            -6.796e-01  3.487e-01  -1.949  0.05134 .  
## AcidIndex15            -2.202e-01  4.035e-01  -0.546  0.58529    
## AcidIndex16            -9.336e-01  5.481e-01  -1.703  0.08850 .  
## AcidIndex17            -3.748e+01  3.001e+07   0.000  1.00000    
## SQRT_Alcohol            3.395e-02  1.098e-02   3.093  0.00198 ** 
## LOG_Chlorides          -9.767e-02  3.945e-02  -2.476  0.01329 *  
## LOG_FreeSulfurDioxide   2.872e-02  6.356e-03   4.518 6.23e-06 ***
## LabelAppeal-1           2.331e-01  4.174e-02   5.584 2.35e-08 ***
## LabelAppeal0            4.217e-01  4.066e-02  10.371  < 2e-16 ***
## LabelAppeal1            5.599e-01  4.143e-02  13.516  < 2e-16 ***
## LabelAppeal2            6.869e-01  4.690e-02  14.646  < 2e-16 ***
## STARS_imputed2          3.098e-01  1.602e-02  19.336  < 2e-16 ***
## STARS_imputed3          4.315e-01  1.742e-02  24.778  < 2e-16 ***
## STARS_imputed4          5.402e-01  2.431e-02  22.223  < 2e-16 ***
## is_stars_missingTRUE   -1.082e+00  2.057e-02 -52.591  < 2e-16 ***
## LOG_TotalSulfurDioxide  3.784e-02  7.917e-03   4.779 1.76e-06 ***
## LOG_VolatileAcidity    -1.167e-01  2.230e-02  -5.233 1.66e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(41187.11) family taken to be 1)
## 
##     Null deviance: 18251  on 10237  degrees of freedom
## Residual deviance: 10709  on 10211  degrees of freedom
## AIC: 36334
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  41187 
##           Std. Err.:  38484 
## Warning while fitting theta: alternation limit reached 
## 
##  2 x log-likelihood:  -36278.1

4.3 Zero-Inflated Model (extra)

4.3 Model 1: Zero-Inflated Poisson (ZIP) Model

As there are too many predictors via AcidIndex categories, LabelAppeal levels, STARS indicators etc, ZIP cannot estimate 17+ dummy variables in zero-inflation any it may produce NA standard errors and unreliable coefficient estimates. Hence using only 2–4 predictors that logically explain excess zeros, not the count value.

The results show that these variables increase predicted TARGET: SQRT_Alcohol (positive, highly significant) FreeSulfurDioxide LabelAppeal (very strong and monotonic) STARS levels (strong positive) AcidIndex moderately negative at high levels LOG_VolatileAcidity (negative)

These variables do not influence count much: TotalSulfurDioxide (non-significant) Many AcidIndex categories (flat)

The vuong Test shows that the Poisson model is way better as the Vuong z-statistic < - 1.96

library(pscl)   # for zeroinfl()


zip_model1<- zeroinfl(
  TARGET ~ AcidIndex + SQRT_Alcohol + LOG_Chlorides +
    LOG_FreeSulfurDioxide + LabelAppeal + STARS_imputed + is_stars_missing +
    LOG_TotalSulfurDioxide + LOG_VolatileAcidity |
    LabelAppeal + is_stars_missing,
  data = train_transformed,
  dist = "poisson"
)

summary(zip_model1)
## 
## Call:
## zeroinfl(formula = TARGET ~ AcidIndex + SQRT_Alcohol + LOG_Chlorides + 
##     LOG_FreeSulfurDioxide + LabelAppeal + STARS_imputed + is_stars_missing + 
##     LOG_TotalSulfurDioxide + LOG_VolatileAcidity | LabelAppeal + is_stars_missing, 
##     data = train_transformed, dist = "poisson")
## 
## Pearson residuals:
##       Min        1Q    Median        3Q       Max 
## -2.016342 -0.460885  0.004954  0.420080  3.829268 
## 
## Count model coefficients (poisson with log link):
##                          Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              0.255397   0.330177   0.774 0.439217    
## AcidIndex5               0.018491   0.328226   0.056 0.955073    
## AcidIndex6               0.033527   0.321246   0.104 0.916880    
## AcidIndex7               0.007095   0.320884   0.022 0.982359    
## AcidIndex8              -0.014261   0.320938  -0.044 0.964557    
## AcidIndex9              -0.069162   0.321361  -0.215 0.829600    
## AcidIndex10             -0.173341   0.322902  -0.537 0.591390    
## AcidIndex11             -0.380668   0.328052  -1.160 0.245890    
## AcidIndex12             -0.285785   0.336223  -0.850 0.395333    
## AcidIndex13             -0.205579   0.338612  -0.607 0.543769    
## AcidIndex14             -0.369095   0.360221  -1.025 0.305536    
## AcidIndex15             -0.028509   0.421429  -0.068 0.946066    
## AcidIndex16             -0.379628   0.603070  -0.629 0.529027    
## AcidIndex17            -12.733210 267.007828  -0.048 0.961964    
## SQRT_Alcohol             0.044272   0.011187   3.957 7.58e-05 ***
## LOG_Chlorides           -0.068288   0.040352  -1.692 0.090589 .  
## LOG_FreeSulfurDioxide    0.014236   0.006580   2.163 0.030507 *  
## LabelAppeal-1            0.398920   0.045012   8.862  < 2e-16 ***
## LabelAppeal0             0.665417   0.044120  15.082  < 2e-16 ***
## LabelAppeal1             0.858689   0.045032  19.068  < 2e-16 ***
## LabelAppeal2             1.013235   0.050552  20.043  < 2e-16 ***
## STARS_imputed2           0.241131   0.017728  13.602  < 2e-16 ***
## STARS_imputed3           0.336338   0.019366  17.368  < 2e-16 ***
## STARS_imputed4           0.423897   0.025954  16.333  < 2e-16 ***
## is_stars_missingTRUE    -0.201249   0.022380  -8.992  < 2e-16 ***
## LOG_TotalSulfurDioxide   0.006696   0.008321   0.805 0.421014    
## LOG_VolatileAcidity     -0.077351   0.022838  -3.387 0.000707 ***
## 
## Zero-inflation model coefficients (binomial with logit link):
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           -6.2824     0.4111 -15.282  < 2e-16 ***
## LabelAppeal-1          1.5251     0.3917   3.894 9.87e-05 ***
## LabelAppeal0           2.2396     0.3895   5.750 8.94e-09 ***
## LabelAppeal1           2.9723     0.3961   7.503 6.24e-14 ***
## LabelAppeal2           3.3821     0.4460   7.583 3.38e-14 ***
## is_stars_missingTRUE   4.5215     0.1577  28.666  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Number of iterations in BFGS optimization: 47 
## Log-likelihood: -1.673e+04 on 33 Df
# ZIP vs Poisson
vuong(pr_model2b, zip_model1)
## Vuong Non-Nested Hypothesis Test-Statistic: 
## (test-statistic is asymptotically distributed N(0,1) under the
##  null that the models are indistinguishible)
## -------------------------------------------------------------
##               Vuong z-statistic             H_A    p-value
## Raw                   -35.05579 model2 > model1 < 2.22e-16
## AIC-corrected         -34.90612 model2 > model1 < 2.22e-16
## BIC-corrected         -34.36480 model2 > model1 < 2.22e-16

4.4 Multiple linear regression models

4.4 Model 1: Basic Model

Highly significant (p < 0.001, ***): VolatileAcidity, FreeSulfurDioxide, TotalSulfurDioxide, Alcohol, LabelAppeal levels, STARS_imputed3, STARS_imputed4

Significant (p < 0.05, *): Density, pH, Sulphates, STARS_imputed2, CitricAcid, Chlorides

Not significant (p > 0.05): FixedAcidity, ResidualSugar, most AcidIndex levels

R squared ~ 0.35 → Model explains ~35% of the variance in TARGET. Residual SE ~1.56 → Typical prediction error magnitude. F-statistic highly significant → Model as a whole is statistically meaningful.

cat("\n=== LM1: Basic Model  ===\n")
## 
## === LM1: Basic Model  ===
 lm_model1 <- lm(formula = TARGET ~ FixedAcidity + VolatileAcidity + CitricAcid + 
     ResidualSugar + Chlorides + FreeSulfurDioxide + TotalSulfurDioxide + 
     Density + pH + Sulphates + Alcohol + LabelAppeal + 
     AcidIndex + STARS_imputed, data = data_transform)

summary(lm_model1)
## 
## Call:
## lm(formula = TARGET ~ FixedAcidity + VolatileAcidity + CitricAcid + 
##     ResidualSugar + Chlorides + FreeSulfurDioxide + TotalSulfurDioxide + 
##     Density + pH + Sulphates + Alcohol + LabelAppeal + AcidIndex + 
##     STARS_imputed, data = data_transform)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.2451 -0.8678  0.2754  1.0801  4.7328 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         2.335e+00  1.040e+00   2.244 0.024824 *  
## FixedAcidity       -1.958e-03  3.069e-03  -0.638 0.523390    
## VolatileAcidity    -2.442e-01  2.716e-02  -8.991  < 2e-16 ***
## CitricAcid          7.584e-02  2.547e-02   2.978 0.002907 ** 
## ResidualSugar       4.528e-04  6.701e-04   0.676 0.499211    
## Chlorides          -2.482e-01  6.929e-02  -3.582 0.000342 ***
## FreeSulfurDioxide   6.210e-04  1.508e-04   4.117 3.86e-05 ***
## TotalSulfurDioxide  4.575e-04  9.446e-05   4.844 1.29e-06 ***
## Density            -1.248e+00  5.203e-01  -2.398 0.016504 *  
## pH                 -5.147e-02  2.065e-02  -2.493 0.012695 *  
## Sulphates          -5.672e-02  2.363e-02  -2.401 0.016384 *  
## Alcohol             1.986e-02  4.036e-03   4.921 8.70e-07 ***
## LabelAppeal-1       6.469e-01  7.490e-02   8.637  < 2e-16 ***
## LabelAppeal0        1.237e+00  7.287e-02  16.974  < 2e-16 ***
## LabelAppeal1        1.753e+00  7.609e-02  23.043  < 2e-16 ***
## LabelAppeal2        2.301e+00  1.004e-01  22.903  < 2e-16 ***
## AcidIndex5          7.958e-01  9.187e-01   0.866 0.386391    
## AcidIndex6          7.660e-01  9.021e-01   0.849 0.395831    
## AcidIndex7          6.631e-01  9.014e-01   0.736 0.461944    
## AcidIndex8          5.002e-01  9.015e-01   0.555 0.578975    
## AcidIndex9          9.078e-02  9.022e-01   0.101 0.919853    
## AcidIndex10        -4.801e-01  9.037e-01  -0.531 0.595222    
## AcidIndex11        -1.065e+00  9.066e-01  -1.175 0.240041    
## AcidIndex12        -1.276e+00  9.119e-01  -1.400 0.161609    
## AcidIndex13        -9.033e-01  9.206e-01  -0.981 0.326536    
## AcidIndex14        -1.025e+00  9.301e-01  -1.102 0.270367    
## AcidIndex15        -1.500e-01  1.057e+00  -0.142 0.887117    
## AcidIndex16        -1.613e+00  1.140e+00  -1.416 0.156872    
## AcidIndex17        -1.985e+00  1.078e+00  -1.841 0.065579 .  
## STARS_imputed2     -1.277e-01  3.414e-02  -3.742 0.000184 ***
## STARS_imputed3      1.502e+00  4.492e-02  33.443  < 2e-16 ***
## STARS_imputed4      2.169e+00  7.128e-02  30.424  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.558 on 12763 degrees of freedom
## Multiple R-squared:  0.3476, Adjusted R-squared:  0.346 
## F-statistic: 219.4 on 31 and 12763 DF,  p-value: < 2.2e-16

4.4 Model 2: Transformed variables

R squared ~ 0.55 indicates that model explains ~55% of the variance, a substantial improvement over 0.35 from the previous model. Residual SE smaller indicates that predictions are closer to observed values. Highly significant F-statistic indicates that model is meaningful overall.

Hence lm_model2 is clearly better. It improves fit, reduces residual error, and focuses on the most important predictors.

cat("\n=== LM1: Log-transformed response (recommended for skewed target)  ===\n")
## 
## === LM1: Log-transformed response (recommended for skewed target)  ===
 lm_model2 <- lm(formula = TARGET ~ AcidIndex + SQRT_Alcohol + LOG_Chlorides +
              LOG_FreeSulfurDioxide + LabelAppeal + STARS_imputed + is_stars_missing  + LOG_TotalSulfurDioxide + LOG_VolatileAcidity, data = data_transform)

summary(lm_model2)
## 
## Call:
## lm(formula = TARGET ~ AcidIndex + SQRT_Alcohol + LOG_Chlorides + 
##     LOG_FreeSulfurDioxide + LabelAppeal + STARS_imputed + is_stars_missing + 
##     LOG_TotalSulfurDioxide + LOG_VolatileAcidity, data = data_transform)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -5.103 -0.855  0.028  0.842  6.048 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             1.25810    0.76189   1.651 0.098703 .  
## AcidIndex5             -0.25387    0.76484  -0.332 0.739948    
## AcidIndex6             -0.15846    0.75096  -0.211 0.832879    
## AcidIndex7             -0.26134    0.75025  -0.348 0.727592    
## AcidIndex8             -0.35866    0.75029  -0.478 0.632632    
## AcidIndex9             -0.64678    0.75078  -0.861 0.388989    
## AcidIndex10            -0.92769    0.75200  -1.234 0.217361    
## AcidIndex11            -1.39437    0.75430  -1.849 0.064544 .  
## AcidIndex12            -1.39316    0.75866  -1.836 0.066329 .  
## AcidIndex13            -1.41262    0.76601  -1.844 0.065190 .  
## AcidIndex14            -1.23344    0.77361  -1.594 0.110872    
## AcidIndex15            -0.57621    0.87930  -0.655 0.512287    
## AcidIndex16            -1.63670    0.94851  -1.726 0.084453 .  
## AcidIndex17            -1.74480    0.89662  -1.946 0.051681 .  
## SQRT_Alcohol            0.10383    0.02199   4.723 2.35e-06 ***
## LOG_Chlorides          -0.27313    0.07749  -3.525 0.000425 ***
## LOG_FreeSulfurDioxide   0.07391    0.01249   5.918 3.34e-09 ***
## LabelAppeal-1           0.37044    0.06253   5.925 3.21e-09 ***
## LabelAppeal0            0.84005    0.06097  13.778  < 2e-16 ***
## LabelAppeal1            1.30534    0.06370  20.493  < 2e-16 ***
## LabelAppeal2            1.88433    0.08389  22.461  < 2e-16 ***
## STARS_imputed2          1.02458    0.03243  31.591  < 2e-16 ***
## STARS_imputed3          1.59047    0.03746  42.460  < 2e-16 ***
## STARS_imputed4          2.28275    0.05941  38.422  < 2e-16 ***
## is_stars_missingTRUE   -2.36369    0.03197 -73.932  < 2e-16 ***
## LOG_TotalSulfurDioxide  0.10835    0.01510   7.175 7.62e-13 ***
## LOG_VolatileAcidity    -0.33201    0.04353  -7.628 2.56e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.298 on 12768 degrees of freedom
## Multiple R-squared:  0.5467, Adjusted R-squared:  0.5458 
## F-statistic: 592.4 on 26 and 12768 DF,  p-value: < 2.2e-16

4.5 Check for multicollinearity

We know we have some strongly correlated features, so let’s address this potential issue by looking at the VIF for the models and removing predictors.

For the count regression models, luckily no predictor has a VIF > 5 (for Df > 1 (categorical), must look at GVIF^(1/(2*Df)) value), so we don’t need to remove any predictors.

cat("=== VIF for Count Regression Model 2a (Transformed Predictions with imputed values) ===\n")
## === VIF for Count Regression Model 2a (Transformed Predictions with imputed values) ===
car::vif(pr_model2a)
##                            GVIF Df GVIF^(1/(2*Df))
## AcidIndex              1.113178 13        1.004132
## SQRT_Alcohol           1.015515  1        1.007728
## LOG_Chlorides          1.004816  1        1.002405
## LOG_CitricAcid         1.011555  1        1.005761
## Density                1.005432  1        1.002712
## SQRT_FixedAcidity      1.027941  1        1.013874
## LOG_FreeSulfurDioxide  1.017914  1        1.008917
## LabelAppeal            1.147837  4        1.017384
## LOG_ResidualSugar      1.011349  1        1.005658
## STARS_imputed          1.331621  3        1.048890
## is_stars_missing       1.191941  1        1.091760
## LOG_Sulphates          1.003924  1        1.001960
## LOG_TotalSulfurDioxide 1.025321  1        1.012581
## LOG_VolatileAcidity    1.011188  1        1.005578
## pH                     1.008212  1        1.004098
cat("=== VIF for Count Regression Model 2b (Transformed Predictions with imputed values and reduced variables) ===\n")
## === VIF for Count Regression Model 2b (Transformed Predictions with imputed values and reduced variables) ===
car::vif(pr_model2b)
##                            GVIF Df GVIF^(1/(2*Df))
## AcidIndex              1.061706 13        1.002306
## SQRT_Alcohol           1.013544  1        1.006749
## LOG_Chlorides          1.003526  1        1.001762
## LOG_FreeSulfurDioxide  1.016303  1        1.008119
## LabelAppeal            1.145305  4        1.017104
## STARS_imputed          1.330172  3        1.048700
## is_stars_missing       1.190888  1        1.091278
## LOG_TotalSulfurDioxide 1.021614  1        1.010749
## LOG_VolatileAcidity    1.010478  1        1.005225
cat("=== VIF for Negative Binomial Regression Model 1 (Original variables with STAR imputed) ===\n")
## === VIF for Negative Binomial Regression Model 1 (Original variables with STAR imputed) ===
car::vif(nb_model1)
##                        GVIF Df GVIF^(1/(2*Df))
## AcidIndex          1.093487 13        1.003443
## Alcohol            1.014259  1        1.007104
## Chlorides          1.004569  1        1.002282
## CitricAcid         1.007985  1        1.003985
## Density            1.005802  1        1.002897
## FixedAcidity       1.028534  1        1.014167
## FreeSulfurDioxide  1.005196  1        1.002595
## LabelAppeal        1.147204  4        1.017314
## ResidualSugar      1.006058  1        1.003025
## STARS_imputed      1.331197  3        1.048835
## is_stars_missing   1.188221  1        1.090055
## Sulphates          1.003359  1        1.001678
## TotalSulfurDioxide 1.008743  1        1.004362
## VolatileAcidity    1.008293  1        1.004138
## pH                 1.008587  1        1.004284
cat("=== VIF for Negative Binomial Regression Model 1 (Transformed variables with STAR imputed) ===\n")
## === VIF for Negative Binomial Regression Model 1 (Transformed variables with STAR imputed) ===
car::vif(nb_model2)
##                            GVIF Df GVIF^(1/(2*Df))
## AcidIndex              1.061706 13        1.002306
## SQRT_Alcohol           1.013544  1        1.006749
## LOG_Chlorides          1.003526  1        1.001762
## LOG_FreeSulfurDioxide  1.016303  1        1.008119
## LabelAppeal            1.145303  4        1.017103
## STARS_imputed          1.330179  3        1.048701
## is_stars_missing       1.190897  1        1.091282
## LOG_TotalSulfurDioxide 1.021614  1        1.010749
## LOG_VolatileAcidity    1.010478  1        1.005225
cat("=== VIF for Multiple linear regression model 1 (Basic Model) ===\n")
## === VIF for Multiple linear regression model 1 (Basic Model) ===
car::vif(lm_model1)
##                        GVIF Df GVIF^(1/(2*Df))
## FixedAcidity       1.040478  1        1.020038
## VolatileAcidity    1.008520  1        1.004251
## CitricAcid         1.006970  1        1.003479
## ResidualSugar      1.005799  1        1.002895
## Chlorides          1.004192  1        1.002094
## FreeSulfurDioxide  1.004541  1        1.002268
## TotalSulfurDioxide 1.010929  1        1.005450
## Density            1.004997  1        1.002495
## pH                 1.006637  1        1.003313
## Sulphates          1.004492  1        1.002244
## Alcohol            1.011061  1        1.005515
## LabelAppeal        1.112231  4        1.013385
## AcidIndex          1.106704 13        1.003907
## STARS_imputed      1.128292  3        1.020321
cat("=== VIF for Multiple linear regression model 2 (Transformed variables with STAR imputed) ===\n")
## === VIF for Multiple linear regression model 2 (Transformed variables with STAR imputed) ===
car::vif(lm_model2)
##                            GVIF Df GVIF^(1/(2*Df))
## AcidIndex              1.096031 13        1.003533
## SQRT_Alcohol           1.010943  1        1.005457
## LOG_Chlorides          1.003974  1        1.001985
## LOG_FreeSulfurDioxide  1.024492  1        1.012172
## LabelAppeal            1.124884  4        1.014819
## STARS_imputed          1.595534  3        1.080980
## is_stars_missing       1.502366  1        1.225710
## LOG_TotalSulfurDioxide 1.034608  1        1.017157
## LOG_VolatileAcidity    1.014201  1        1.007075

4.5 Interpretation of coefficients

The LabelAppeal effect increases monotonically with LabelAppeal level. The magnitude is very similar across all models. There is strong, consistent positive impact: better label appeal strongly increases expected TARGET.

STARS_imputed has a strong positive effect for imputed stars 2–4 indicating that higher stars increase expected TARGET.

AcidIndex variables coefficients gradually decrease from AcidIndex5 → AcidIndex17.Hence smaller AcidIndex tends to slightly increase expected count (TARGET).

Alcohol, VolatileAcidity also have meaningful effects.

pH has smaller or even slightly negative effects in Poisson/NB vs linear regression. We are still preferring Poisson model as pH is not a significant predictor.

Model Comparison: Poisson vs NB have coefficients nearly identical for most predictors; NB may handle overdispersion better but doesn’t change effect direction/magnitude much.

Poisson 2a vs 2b have slight differences (~0.001–0.01), negligible practically. 2b seems slightly refined with log transformations.

Linear regression vs Poisson/NB comparison indicates that some variables flip direction (e.g., pH, Density), chemical variables are often more interpretable on log scale in Poisson/NB.

# Poisson regression

# Interpretation of coefficients (Model 2a)
cat("\n=== Coefficient Interpretation (Poisson regression Model 2a) ===\n")
## 
## === Coefficient Interpretation (Poisson regression Model 2a) ===
coef_exp <- exp(coef(pr_model2a))
print(round(coef_exp, 4))
##            (Intercept)             AcidIndex5             AcidIndex6 
##                 1.9850                 0.9091                 0.9081 
##             AcidIndex7             AcidIndex8             AcidIndex9 
##                 0.8780                 0.8473                 0.7680 
##            AcidIndex10            AcidIndex11            AcidIndex12 
##                 0.6601                 0.4835                 0.4855 
##            AcidIndex13            AcidIndex14            AcidIndex15 
##                 0.5636                 0.5049                 0.7895 
##            AcidIndex16            AcidIndex17           SQRT_Alcohol 
##                 0.3912                 0.0000                 1.0347 
##          LOG_Chlorides         LOG_CitricAcid                Density 
##                 0.9098                 1.0383                 0.7658 
##      SQRT_FixedAcidity  LOG_FreeSulfurDioxide          LabelAppeal-1 
##                 0.9977                 1.0286                 1.2616 
##           LabelAppeal0           LabelAppeal1           LabelAppeal2 
##                 1.5244                 1.7488                 1.9889 
##      LOG_ResidualSugar         STARS_imputed2         STARS_imputed3 
##                 1.0087                 1.3622                 1.5381 
##         STARS_imputed4   is_stars_missingTRUE          LOG_Sulphates 
##                 1.7149                 0.3395                 0.9784 
## LOG_TotalSulfurDioxide    LOG_VolatileAcidity                     pH 
##                 1.0378                 0.8904                 0.9936
# Interpretation of coefficients (Model 2b)
cat("\n=== Coefficient Interpretation (Poisson regression Model 2b) ===\n")
## 
## === Coefficient Interpretation (Poisson regression Model 2b) ===
coef_exp <- exp(coef(pr_model2b))
print(round(coef_exp, 4))
##            (Intercept)             AcidIndex5             AcidIndex6 
##                 1.5048                 0.9050                 0.9096 
##             AcidIndex7             AcidIndex8             AcidIndex9 
##                 0.8807                 0.8507                 0.7712 
##            AcidIndex10            AcidIndex11            AcidIndex12 
##                 0.6614                 0.4835                 0.4879 
##            AcidIndex13            AcidIndex14            AcidIndex15 
##                 0.5667                 0.5068                 0.8024 
##            AcidIndex16            AcidIndex17           SQRT_Alcohol 
##                 0.3931                 0.0000                 1.0345 
##          LOG_Chlorides  LOG_FreeSulfurDioxide          LabelAppeal-1 
##                 0.9069                 1.0291                 1.2625 
##           LabelAppeal0           LabelAppeal1           LabelAppeal2 
##                 1.5245                 1.7505                 1.9876 
##         STARS_imputed2         STARS_imputed3         STARS_imputed4 
##                 1.3631                 1.5396                 1.7164 
##   is_stars_missingTRUE LOG_TotalSulfurDioxide    LOG_VolatileAcidity 
##                 0.3389                 1.0386                 0.8898
# Negative Binomial Regression Models

# Interpretation of coefficients (Negative Binomial Regression Model 1)
cat("\n=== Coefficient Interpretation (Negative Binomial Regression Model 1) ===\n")
## 
## === Coefficient Interpretation (Negative Binomial Regression Model 1) ===
coef_exp <- exp(coef(nb_model1))
print(round(coef_exp, 4))
##          (Intercept)           AcidIndex5           AcidIndex6 
##               2.7395               0.9039               0.9069 
##           AcidIndex7           AcidIndex8           AcidIndex9 
##               0.8792               0.8471               0.7649 
##          AcidIndex10          AcidIndex11          AcidIndex12 
##               0.6502               0.4744               0.4707 
##          AcidIndex13          AcidIndex14          AcidIndex15 
##               0.5499               0.4934               0.7871 
##          AcidIndex16          AcidIndex17              Alcohol 
##               0.3847               0.0000               1.0051 
##            Chlorides           CitricAcid              Density 
##               0.9329               1.0119               0.7729 
##         FixedAcidity    FreeSulfurDioxide        LabelAppeal-1 
##               0.9997               1.0002               1.2596 
##         LabelAppeal0         LabelAppeal1         LabelAppeal2 
##               1.5220               1.7477               1.9904 
##        ResidualSugar       STARS_imputed2       STARS_imputed3 
##               1.0002               1.3664               1.5417 
##       STARS_imputed4 is_stars_missingTRUE            Sulphates 
##               1.7169               0.3369               0.9897 
##   TotalSulfurDioxide      VolatileAcidity                   pH 
##               1.0001               0.9445               0.9927
# Interpretation of coefficients (Negative Binomial Regression Model 1)
cat("\n=== Coefficient Interpretation (Negative Binomial Regression Model 2) ===\n")
## 
## === Coefficient Interpretation (Negative Binomial Regression Model 2) ===
coef_exp <- exp(coef(nb_model2))
print(round(coef_exp, 4))
##            (Intercept)             AcidIndex5             AcidIndex6 
##                 1.5048                 0.9050                 0.9096 
##             AcidIndex7             AcidIndex8             AcidIndex9 
##                 0.8807                 0.8507                 0.7711 
##            AcidIndex10            AcidIndex11            AcidIndex12 
##                 0.6614                 0.4835                 0.4879 
##            AcidIndex13            AcidIndex14            AcidIndex15 
##                 0.5667                 0.5068                 0.8024 
##            AcidIndex16            AcidIndex17           SQRT_Alcohol 
##                 0.3931                 0.0000                 1.0345 
##          LOG_Chlorides  LOG_FreeSulfurDioxide          LabelAppeal-1 
##                 0.9069                 1.0291                 1.2625 
##           LabelAppeal0           LabelAppeal1           LabelAppeal2 
##                 1.5245                 1.7505                 1.9876 
##         STARS_imputed2         STARS_imputed3         STARS_imputed4 
##                 1.3631                 1.5396                 1.7164 
##   is_stars_missingTRUE LOG_TotalSulfurDioxide    LOG_VolatileAcidity 
##                 0.3389                 1.0386                 0.8898
# Function to convert linear model log-coefficients into percent change
interpret_lm <- function(model) {
  coefs <- coef(model)
  pct   <- (exp(coefs) - 1) * 100
  out   <- data.frame(
    Coefficient = round(coefs, 4),
    Percent_Change = round(pct, 2)
  )
  return(out)
}

# Interpretation for LM1
cat("\n=== Coefficient Interpretation (LM1: Basic Model) ===\n")
## 
## === Coefficient Interpretation (LM1: Basic Model) ===
print(interpret_lm(lm_model1))
##                    Coefficient Percent_Change
## (Intercept)             2.3346         932.49
## FixedAcidity           -0.0020          -0.20
## VolatileAcidity        -0.2442         -21.67
## CitricAcid              0.0758           7.88
## ResidualSugar           0.0005           0.05
## Chlorides              -0.2482         -21.98
## FreeSulfurDioxide       0.0006           0.06
## TotalSulfurDioxide      0.0005           0.05
## Density                -1.2475         -71.28
## pH                     -0.0515          -5.02
## Sulphates              -0.0567          -5.51
## Alcohol                 0.0199           2.01
## LabelAppeal-1           0.6469          90.96
## LabelAppeal0            1.2370         244.53
## LabelAppeal1            1.7534         477.43
## LabelAppeal2            2.3005         897.92
## AcidIndex5              0.7958         121.62
## AcidIndex6              0.7660         115.12
## AcidIndex7              0.6631          94.08
## AcidIndex8              0.5002          64.91
## AcidIndex9              0.0908           9.50
## AcidIndex10            -0.4801         -38.13
## AcidIndex11            -1.0652         -65.53
## AcidIndex12            -1.2764         -72.10
## AcidIndex13            -0.9033         -59.48
## AcidIndex14            -1.0252         -64.13
## AcidIndex15            -0.1500         -13.93
## AcidIndex16            -1.6135         -80.08
## AcidIndex17            -1.9850         -86.26
## STARS_imputed2         -0.1277         -11.99
## STARS_imputed3          1.5023         349.20
## STARS_imputed4          2.1686         774.62
# Interpretation for LM2
cat("\n=== Coefficient Interpretation (LM2: Transformed Model with STARS imputed) ===\n")
## 
## === Coefficient Interpretation (LM2: Transformed Model with STARS imputed) ===
print(interpret_lm(lm_model2))
##                        Coefficient Percent_Change
## (Intercept)                 1.2581         251.87
## AcidIndex5                 -0.2539         -22.42
## AcidIndex6                 -0.1585         -14.65
## AcidIndex7                 -0.2613         -23.00
## AcidIndex8                 -0.3587         -30.14
## AcidIndex9                 -0.6468         -47.63
## AcidIndex10                -0.9277         -60.45
## AcidIndex11                -1.3944         -75.20
## AcidIndex12                -1.3932         -75.17
## AcidIndex13                -1.4126         -75.65
## AcidIndex14                -1.2334         -70.87
## AcidIndex15                -0.5762         -43.80
## AcidIndex16                -1.6367         -80.54
## AcidIndex17                -1.7448         -82.53
## SQRT_Alcohol                0.1038          10.94
## LOG_Chlorides              -0.2731         -23.90
## LOG_FreeSulfurDioxide       0.0739           7.67
## LabelAppeal-1               0.3704          44.84
## LabelAppeal0                0.8401         131.65
## LabelAppeal1                1.3053         268.89
## LabelAppeal2                1.8843         558.20
## STARS_imputed2              1.0246         178.59
## STARS_imputed3              1.5905         390.61
## STARS_imputed4              2.2828         880.36
## is_stars_missingTRUE       -2.3637         -90.59
## LOG_TotalSulfurDioxide      0.1083          11.44
## LOG_VolatileAcidity        -0.3320         -28.25

5 Select Models

5.1 Poisson Regression

# Evaluate poisson regression model
evaluate_pr_model <- function(model, test_data, model_name) {

  preds <- predict(model, newdata = test_data, type = "response")
  actual <- test_data$TARGET

  rmse <- sqrt(mean((preds - actual)^2))
  mae  <- mean(abs(preds - actual))
  r2   <- 1 - sum((preds - actual)^2) / sum((actual - mean(actual))^2)
  aic  <- AIC(model)
  
  # Poisson Regression Overdispersion test
  residual_dev <- model$deviance
  df_residual <- model$df.residual
  dispersion <- residual_dev / df_residual
  
  # Pearson chi-squared goodness of fit statistic
  pearson_chi2_statistic <- sum(residuals(model, type = "pearson")^2)
  degree_freedom_pearson <- df.residual(model)
  p_value <- 1 - pchisq(pearson_chi2_statistic, degree_freedom_pearson)
  
  # standard deviance/Pearson residuals can be unreliable for Poisson - DHARMa can help
  # DHARMa simulates residuals to standardized residuals
  simulationOutput <- simulateResiduals(model)
  # Plot the diagnostics
  plot(simulationOutput)
  testOutliers(model, type = "bootstrap")

  output <- paste(
    "\n=== Model Selection and Evaluation ===\n\n",
    "=== ", model_name, " Evaluation ===\n",
    "RMSE:", round(rmse, 4), 
    "| MAE:", round(mae, 4),
    "| R²:", round(r2, 4), "\n",
    "| AIC:", round(aic, 4), "\n",
    "| Pearson Chi-squared Goodness of Fit Statistic:", round(pearson_chi2_statistic, 4), "\n",
    "| Degrees of Freedom:", round(degree_freedom_pearson, 4), "\n",
    "| P-value:", round(p_value, 4), "\n",
    "| Dispersion Statistic:", round(dispersion, 4), "\n\n",
    sep = " "
  )

  cat(output)
}


pr_model_diagnostic_graphs <- function(model) {
  par(mfrow = c(2,3))
  plot(model, which = 1:6)
}

# === PR1 Evaluation ===
evaluate_pr_model(pr_model2a, test_transformed, "PR2a: Transformed variables with Stars Imputed")
## DHARMa:testOutliers with type = binomial may have inflated Type I error rates for integer-valued distributions. To get a more exact result, it is recommended to re-run testOutliers with type = 'bootstrap'. See ?testOutliers for details

## 
## === Model Selection and Evaluation ===
## 
##  ===  PR2a: Transformed variables with Stars Imputed  Evaluation ===
##  RMSE: 1.3176 | MAE: 1.0247 | R²: 0.5372 
##  | AIC: 36334.9878 
##  | Pearson Chi-squared Goodness of Fit Statistic: 8874.9655 
##  | Degrees of Freedom: 10205 
##  | P-value: 1 
##  | Dispersion Statistic: 1.0486
pr_model_diagnostic_graphs(pr_model2a)

# Model 2 Evaluation

# === PR2 Evaluation ===
evaluate_pr_model(pr_model2b, test_transformed, "PR2b: Transformed variables with Stars Imputed")
## DHARMa:testOutliers with type = binomial may have inflated Type I error rates for integer-valued distributions. To get a more exact result, it is recommended to re-run testOutliers with type = 'bootstrap'. See ?testOutliers for details

## 
## === Model Selection and Evaluation ===
## 
##  ===  PR2b: Transformed variables with Stars Imputed  Evaluation ===
##  RMSE: 1.3189 | MAE: 1.0257 | R²: 0.5363 
##  | AIC: 36331.7641 
##  | Pearson Chi-squared Goodness of Fit Statistic: 8881.655 
##  | Degrees of Freedom: 10211 
##  | P-value: 1 
##  | Dispersion Statistic: 1.0488
pr_model_diagnostic_graphs(pr_model2b)

5.2 Negative Binomial Regression Models

# === Negative Binomial Regression Evaluation ===
evaluate_pr_model(nb_model1, test_transformed, "NB1: Original variables with Stars Imputed")
## DHARMa:testOutliers with type = binomial may have inflated Type I error rates for integer-valued distributions. To get a more exact result, it is recommended to re-run testOutliers with type = 'bootstrap'. See ?testOutliers for details

## 
## === Model Selection and Evaluation ===
## 
##  ===  NB1: Original variables with Stars Imputed  Evaluation ===
##  RMSE: 1.3178 | MAE: 1.0226 | R²: 0.5371 
##  | AIC: 36378.8095 
##  | Pearson Chi-squared Goodness of Fit Statistic: 8912.2927 
##  | Degrees of Freedom: 10205 
##  | P-value: 1 
##  | Dispersion Statistic: 1.0526
pr_model_diagnostic_graphs(nb_model1)

# === Negative Binomial Regression Evaluation ===
evaluate_pr_model(nb_model2, test_transformed, "NB2: Transformed variables with Stars Imputed")
## DHARMa:testOutliers with type = binomial may have inflated Type I error rates for integer-valued distributions. To get a more exact result, it is recommended to re-run testOutliers with type = 'bootstrap'. See ?testOutliers for details

## 
## === Model Selection and Evaluation ===
## 
##  ===  NB2: Transformed variables with Stars Imputed  Evaluation ===
##  RMSE: 1.3189 | MAE: 1.0257 | R²: 0.5363 
##  | AIC: 36334.1035 
##  | Pearson Chi-squared Goodness of Fit Statistic: 8881.2506 
##  | Degrees of Freedom: 10211 
##  | P-value: 1 
##  | Dispersion Statistic: 1.0488
pr_model_diagnostic_graphs(nb_model2)

5.3 Multiple Linear Regression Models

# === Linear Regression Evaluation ===
evaluate_pr_model(lm_model1, test_transformed, "LM1: Basic variables with Stars Imputed")

## 
## === Model Selection and Evaluation ===
## 
##  ===  LM1: Basic variables with Stars Imputed  Evaluation ===
##  RMSE: 1.5537 | MAE: 1.2403 | R²: 0.3565 
##  | AIC: 47688.0218 
##  | Pearson Chi-squared Goodness of Fit Statistic: 30972.5625 
##  | Degrees of Freedom: 12763 
##  | P-value: 0 
##  | Dispersion Statistic:
pr_model_diagnostic_graphs(lm_model1)

# === Linear Regression Evaluation ===
evaluate_pr_model(lm_model2, test_transformed, "LM1: Basic variables with Stars Imputed")

## 
## === Model Selection and Evaluation ===
## 
##  ===  LM1: Basic variables with Stars Imputed  Evaluation ===
##  RMSE: 1.3184 | MAE: 1.0256 | R²: 0.5366 
##  | AIC: 43018.5555 
##  | Pearson Chi-squared Goodness of Fit Statistic: 21519.0519 
##  | Degrees of Freedom: 12768 
##  | P-value: 0 
##  | Dispersion Statistic:
pr_model_diagnostic_graphs(lm_model2)

5.4 Final model selection

This table summarizes the RMSE, RSQUARED, MAE, AIC, Pvalue and Dispersion for all SIX models.

NegBinom1 has the lowest RMSE (1.3178) and lowest MAE (1.0226) → slightly better predictive accuracy. Diagnostic plots show

NegBinom1 has highest R² (0.5371) → explains slightly more variance in TARGET.

Poisson Model 2a and 2b has slightly lower AIC (~36331.76) than NegBinom1 (36378.81). However Poisson dispersion ~1.0488 → mild overdispersion. NegBinom1 explicitly models overdispersion (dispersion = 1.0526).

Even if AIC is slightly lower for Poisson1, NegBinom1 is more robust because it accounts for overdispersion, which is a common issue in count data.

Diagnostic plots:

  • Poisson1: Goodness of fit shows overdispersion, outliers do exist, residual plots do not show random scatter around zero, nor roughly constant spread across fitted values (excess zeros, yet we do have categorical predictors)
  • Posson2: Goodness of fit shows overdispersion, outliers do exist, residual plots do not show random scatter around zero, nor roughly constant spread across fitted values (excess zeros, yet we do have categorical predictors)
  • NegBinom1: Goodness of fit shows overdispersion, outliers do exist, residual plots do not show random scatter around zero (that’s expected for NB), nor roughly constant spread across fitted values (excess zeros, yet we do have categorical predictors)
  • NegBinom2: Goodness of fit shows overdispersion, outliers do exist, residual plots do not show random scatter around zero (that’s expected for NB), nor roughly constant spread across fitted values (excess zeros, yet we do have categorical predictors)
  • MLM 1: Goodness of fit shows overdispersion, outliers do exist, residual plots do not show random scatter around zero (that’s expected for NB), nor roughly constant spread across fitted values
  • MLM 2: Goodness of fit shows overdispersion, outliers do exist, residual plots do not show random scatter around zero (that’s expected for NB), nor roughly constant spread across fitted values

Hence NegBinom1 is the best count regression model overall because:

  • It has slightly better predictive performance (lowest RMSE/MAE, highest R²).

  • It handles overdispersion, which Poisson models only approximate.

  • Its coefficients are interpretable and consistent with Poisson models.

In the future, one model we could explore could be the ZINB model or the quasi-Poisson model. Additionally, we could test out different models that have outliers removed.

results <- data.frame(Model = c("Poisson Model 2a",
                                "Poisson Model 2b",
                                "NegBinom1",
                                "NegBinom2",
                                "MLM 1",
                                "MLM 2"), 
                     
                     RMSE = c(1.3189,1.3189,1.3178,1.3189,1.5537,1.3184),
                     MAE = c(1.0257,1.0257,1.0226,1.0257,1.2403 ,1.0256 ),
                     R2 = c(0.5363,0.5363, 0.5371,0.5363,0.3565,0.5366   ),
                     AIC = c(36331.7641,36331.7641,36378.8095,36334.1035,47688.0218, 43018.5555),
                     P_value = c(1,1,1,1,0,0),
                    Dispersion_Statistic = c(1.0488,1.0488,1.0526,1.0488,NA,NA  )
)

knitr::kable(results)
Model RMSE MAE R2 AIC P_value Dispersion_Statistic
Poisson Model 2a 1.3189 1.0257 0.5363 36331.76 1 1.0488
Poisson Model 2b 1.3189 1.0257 0.5363 36331.76 1 1.0488
NegBinom1 1.3178 1.0226 0.5371 36378.81 1 1.0526
NegBinom2 1.3189 1.0257 0.5363 36334.10 1 1.0488
MLM 1 1.5537 1.2403 0.3565 47688.02 0 NA
MLM 2 1.3184 1.0256 0.5366 43018.56 0 NA
results <- data.frame(
  Model = c("Poisson1", "Poisson2", "NegBinom1", "NegBinom2", "MLM 1", "MLM 2"), 
  RMSE = c(1.3189, 1.3189, 1.3178, 1.3189, 1.5537, 1.3184),
  MAE = c(1.0257, 1.0257, 1.0226, 1.0257, 1.2403, 1.0256),
  R2 = c(0.5363, 0.5363, 0.5371, 0.5363, 0.3565, 0.5366),
  AIC = c(36331.7641, 36331.7641, 36378.8095, 36334.1035, 47688.0218, 43018.5555),
  P_value = c(1, 1, 1, 1, 0, 0),
  Dispersion_Statistic = c(1.0488, 1.0488, 1.0526, 1.0488, NA, NA)
)

knitr::kable(results)
Model RMSE MAE R2 AIC P_value Dispersion_Statistic
Poisson1 1.3189 1.0257 0.5363 36331.76 1 1.0488
Poisson2 1.3189 1.0257 0.5363 36331.76 1 1.0488
NegBinom1 1.3178 1.0226 0.5371 36378.81 1 1.0526
NegBinom2 1.3189 1.0257 0.5363 36334.10 1 1.0488
MLM 1 1.5537 1.2403 0.3565 47688.02 0 NA
MLM 2 1.3184 1.0256 0.5366 43018.56 0 NA

6 Predictions for Eval data

Data Checks

# Check missing data
eval_data %>%
  summarise_all(~ sum(is.na(.))) %>%
  pivot_longer(cols = everything(), names_to = "variable", values_to = "missing_count") %>%
  filter(missing_count != 0) %>%
  arrange(desc(missing_count))
## # A tibble: 9 × 2
##   variable           missing_count
##   <chr>                      <int>
## 1 TARGET                      3335
## 2 STARS                        841
## 3 Sulphates                    310
## 4 Alcohol                      185
## 5 ResidualSugar                168
## 6 TotalSulfurDioxide           157
## 7 FreeSulfurDioxide            152
## 8 Chlorides                    138
## 9 pH                           104
# Check for duplicates
duplicates <- duplicated(eval_data)

# Print the duplicates
print(eval_data[duplicates, ])
##  [1] IN                 TARGET             FixedAcidity       VolatileAcidity   
##  [5] CitricAcid         ResidualSugar      Chlorides          FreeSulfurDioxide 
##  [9] TotalSulfurDioxide Density            pH                 Sulphates         
## [13] Alcohol            LabelAppeal        AcidIndex          STARS             
## <0 rows> (or 0-length row.names)

Data Cleanup

#Converting ordinal variables as factor data type.
eval_data$LabelAppeal= as.factor(eval_data$LabelAppeal)
eval_data$AcidIndex = as.factor(eval_data$AcidIndex) 




# If value is negative (most likely inputted incorrectly), update to NA
eval_data$FixedAcidity =  ifelse(eval_data$FixedAcidity < 0, NA, eval_data$FixedAcidity)
eval_data$VolatileAcidity =  ifelse(eval_data$VolatileAcidity < 0, NA, eval_data$VolatileAcidity)
eval_data$CitricAcid =  ifelse(eval_data$CitricAcid < 0, NA, eval_data$CitricAcid)
eval_data$ResidualSugar =  ifelse(eval_data$ResidualSugar < 0, NA, eval_data$ResidualSugar)
eval_data$Chlorides =  ifelse(eval_data$Chlorides < 0, NA, eval_data$Chlorides)
eval_data$FreeSulfurDioxide =  ifelse(eval_data$FreeSulfurDioxide < 0, NA, eval_data$FreeSulfurDioxide)
eval_data$TotalSulfurDioxide =  ifelse(eval_data$TotalSulfurDioxide < 0, NA, eval_data$TotalSulfurDioxide)
eval_data$Sulphates =  ifelse(eval_data$Sulphates < 0, NA, eval_data$Sulphates)
eval_data$Alcohol =  ifelse(eval_data$Alcohol < 0, NA, raw_data$Alcohol)


# Fill in NA with median
evaldata_imputed <- eval_data %>%
   mutate_at(vars(c("FixedAcidity", "VolatileAcidity", "CitricAcid", "ResidualSugar", "Chlorides", "FreeSulfurDioxide", "TotalSulfurDioxide", "pH", "Sulphates", "Alcohol")), ~ifelse(is.na(.), median(., na.rm = TRUE), .))

# Create new missing indicator variable
evaldata_imputed  <- evaldata_imputed  %>%
  mutate(is_stars_missing = is.na(STARS))



# Convert STARS to numeric as it is a factor
evaldata_imputed$STARS <- as.numeric(as.character(evaldata_imputed$STARS))


# impute STARS missing values using the median
evaldata_imputed  <- evaldata_imputed %>%
  mutate(STARS_imputed = coalesce(STARS, median(STARS, na.rm = TRUE)))

# Convert missing indicator and STARS variables to factor
evaldata_imputed$is_stars_missing <- as.factor(evaldata_imputed$is_stars_missing)
evaldata_imputed$STARS <- as.factor(evaldata_imputed$STARS)
evaldata_imputed$STARS_imputed <- as.factor(evaldata_imputed$STARS_imputed)

# remove TARGET variable
evaldata_imputed <- evaldata_imputed |> 
  dplyr::select(-TARGET)

# Check for missing values again
plot_missing(evaldata_imputed )

Generate predictions

The histogram shows the TARGET predicted distribution.We see that as 763 observations has STARS NA, the imputation generated a TARGET value of 1 for the same.hence there is a modal at 1.

#evaldata_transform <- evaldata_imputed 
#evaldata_transform <- evaldata_transform%>%
#  mutate(
#    SQRT_Alcohol = sqrt(evaldata_transform$Alcohol +1),
#    LOG_Chlorides = log(evaldata_transform$Chlorides + 1),
#    LOG_FreeSulfurDioxide = log(evaldata_transform$FreeSulfurDioxide + 1),
#    LOG_TotalSulfurDioxide = log(evaldata_transform$TotalSulfurDioxide + 1),
#    LOG_VolatileAcidity  = log(evaldata_transform$VolatileAcidity + 1)
#  )

finalpreds <- predict(nb_model1, evaldata_imputed,type = "response" )
finaldf <- cbind(evaldata_imputed,TARGET=round(finalpreds,0))

hist(finalpreds)

write.csv(finaldf, 'wine-predicted-data.csv', row.names = FALSE)