Abstract

This study analyzes and models wine sales data to predict the number of wine cases sold based on chemical and marketing features. After extensive data exploration, preprocessing (missing value imputation, transformations), and outlier handling, multiple models were built and evaluated, including Linear Regression (MLR), Poisson Regression, and Negative Binomial Regression.

The Negative Binomial Higher-Order Interaction Model emerged as the final model for prediction. It effectively handles overdispersion inherent in the target variable and achieves competitive predictive performance (Test RMSE = 1.584). Comparisons with the Polynomial MLR and Poisson Higher-Order Interaction models highlight the robustness of the Negative Binomial model in balancing residual behavior, accuracy, and suitability for count data.

Residual analyses revealed patterns of heteroscedasticity and non-normality in MLR and Poisson models, reinforcing the superiority of the Negative Binomial approach. Predictions across models were highly correlated, as confirmed by pairwise scatter plots, demonstrating consistency in predictive trends.

For submission, the Negative Binomial Higher-Order Interaction Model predictions will be provided, with visual comparisons to the Polynomial MLR, Higher-Order Interaction MLR, and Poisson Higher-Order Interaction models for further validation and insights.

Introduction

This document outlines the analysis and modeling of the wine dataset using R. The goal is to predict the number of wine cases sold based on chemical and marketing features.

Data Exploration

Load Libraries and Data

# Load required libraries
library(ggplot2)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyr)
library(pscl)

## Classes and Methods for R originally developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University (2002-2015),
## by and under the direction of Simon Jackman.
## hurdle and zeroinfl functions by Achim Zeileis.

library(MASS)

## 
## Attaching package: 'MASS'

## The following object is masked from 'package:dplyr':
## 
##     select

# Load data
train_data <- read.csv("https://raw.githubusercontent.com/hawa1983/DATA-621/refs/heads/main/Homework%205/wine-training-data.csv")
eval_data <- read.csv("https://raw.githubusercontent.com/hawa1983/DATA-621/refs/heads/main/Homework%205/wine-evaluation-data.csv")

cat("\nStructure of the training data\n")

## 
## Structure of the training data

str(train_data)

## 'data.frame':    12795 obs. of  16 variables:
##  $ INDEX             : int  1 2 4 5 6 7 8 11 12 13 ...
##  $ TARGET            : int  3 3 5 3 4 0 0 4 3 6 ...
##  $ FixedAcidity      : num  3.2 4.5 7.1 5.7 8 11.3 7.7 6.5 14.8 5.5 ...
##  $ VolatileAcidity   : num  1.16 0.16 2.64 0.385 0.33 0.32 0.29 -1.22 0.27 -0.22 ...
##  $ CitricAcid        : num  -0.98 -0.81 -0.88 0.04 -1.26 0.59 -0.4 0.34 1.05 0.39 ...
##  $ ResidualSugar     : num  54.2 26.1 14.8 18.8 9.4 ...
##  $ Chlorides         : num  -0.567 -0.425 0.037 -0.425 NA 0.556 0.06 0.04 -0.007 -0.277 ...
##  $ FreeSulfurDioxide : num  NA 15 214 22 -167 -37 287 523 -213 62 ...
##  $ TotalSulfurDioxide: num  268 -327 142 115 108 15 156 551 NA 180 ...
##  $ Density           : num  0.993 1.028 0.995 0.996 0.995 ...
##  $ pH                : num  3.33 3.38 3.12 2.24 3.12 3.2 3.49 3.2 4.93 3.09 ...
##  $ Sulphates         : num  -0.59 0.7 0.48 1.83 1.77 1.29 1.21 NA 0.26 0.75 ...
##  $ Alcohol           : num  9.9 NA 22 6.2 13.7 15.4 10.3 11.6 15 12.6 ...
##  $ LabelAppeal       : int  0 -1 -1 -1 0 0 0 1 0 0 ...
##  $ AcidIndex         : int  8 7 8 6 9 11 8 7 6 8 ...
##  $ STARS             : int  2 3 3 1 2 NA NA 3 NA 4 ...

cat("\nStructure of the evaluation data\n")

## 
## Structure of the evaluation data

str(eval_data)

## 'data.frame':    3335 obs. of  16 variables:
##  $ INDEX             : int  3 9 10 18 21 30 31 37 39 47 ...
##  $ TARGET            : logi  NA NA NA NA NA NA ...
##  $ FixedAcidity      : num  5.4 12.4 7.2 6.2 11.4 17.6 15.5 15.9 11.6 3.8 ...
##  $ VolatileAcidity   : num  -0.86 0.385 1.75 0.1 0.21 0.04 0.53 1.19 0.32 0.22 ...
##  $ CitricAcid        : num  0.27 -0.76 0.17 1.8 0.28 -1.15 -0.53 1.14 0.55 0.31 ...
##  $ ResidualSugar     : num  -10.7 -19.7 -33 1 1.2 1.4 4.6 31.9 -50.9 -7.7 ...
##  $ Chlorides         : num  0.092 1.169 0.065 -0.179 0.038 ...
##  $ FreeSulfurDioxide : num  23 -37 9 104 70 -250 10 115 35 40 ...
##  $ TotalSulfurDioxide: num  398 68 76 89 53 140 17 381 83 129 ...
##  $ Density           : num  0.985 0.99 1.046 0.989 1.029 ...
##  $ pH                : num  5.02 3.37 4.61 3.2 2.54 3.06 3.07 2.99 3.32 4.72 ...
##  $ Sulphates         : num  0.64 1.09 0.68 2.11 -0.07 -0.02 0.75 0.31 2.18 -0.64 ...
##  $ Alcohol           : num  12.3 16 8.55 12.3 4.8 11.4 8.5 11.4 -0.5 10.9 ...
##  $ LabelAppeal       : int  -1 0 0 -1 0 1 0 1 0 0 ...
##  $ AcidIndex         : int  6 6 8 8 10 8 12 7 12 7 ...
##  $ STARS             : int  NA 2 1 1 NA 4 3 NA NA NA ...

Summary Statistics and Missing Data

# Load necessary libraries
library(kableExtra)

## 
## Attaching package: 'kableExtra'

## The following object is masked from 'package:dplyr':
## 
##     group_rows

library(dplyr)
library(tidyr)
library(moments) # For skewness and kurtosis

# Create vectors of numeric and categorical variables
numeric_vars <- names(train_data)[sapply(train_data, is.numeric)]
categorical_vars <- names(train_data)[sapply(train_data, is.factor)]

# Correctly select numerical variables using the predefined numeric_vars
numerical_vars <- train_data %>%
  dplyr::select(all_of(numeric_vars))

# Compute statistical summary including skewness, kurtosis, missing value counts, and percentages
statistical_summary <- numerical_vars %>%
  summarise(across(
    everything(),
    list(
      Min = ~round(min(., na.rm = TRUE), 2),
      Q1 = ~round(quantile(., 0.25, na.rm = TRUE), 2),
      Mean = ~round(mean(., na.rm = TRUE), 2),
      Median = ~round(median(., na.rm = TRUE), 2),
      Q3 = ~round(quantile(., 0.75, na.rm = TRUE), 2),
      Max = ~round(max(., na.rm = TRUE), 2),
      SD = ~round(sd(., na.rm = TRUE), 2),
      Variance = ~round(var(., na.rm = TRUE), 2),
      CV = ~round(sd(., na.rm = TRUE) / mean(., na.rm = TRUE), 2),
      Skewness = ~round(skewness(., na.rm = TRUE), 2),
      Kurtosis = ~round(kurtosis(., na.rm = TRUE), 2),
      ZeroCount = ~sum(. == 0, na.rm = TRUE),
      ZeroProportion = ~round(mean(. == 0, na.rm = TRUE) * 100, 2),
      Missing = ~sum(is.na(.)),
      PercentMissing = ~round(mean(is.na(.)) * 100, 2),
      Overdispersion = ~round(var(., na.rm = TRUE) / mean(., na.rm = TRUE), 2),
      Outliers = ~sum(. > (quantile(., 0.75, na.rm = TRUE) + 1.5 * IQR(., na.rm = TRUE)) | 
                       . < (quantile(., 0.25, na.rm = TRUE) - 1.5 * IQR(., na.rm = TRUE)), na.rm = TRUE)
    ),
    .names = "{.col}_{.fn}"
  )) %>%
  pivot_longer(
    cols = everything(),
    names_to = c("Variable", ".value"),
    names_pattern = "^(.*)_(.*)$"
  )

# Display the resulting summary table
statistical_summary

## # A tibble: 16 × 18
##    Variable          Min      Q1    Mean  Median      Q3    Max      SD Variance
##    <chr>           <dbl>   <dbl>   <dbl>   <dbl>   <dbl>  <dbl>   <dbl>    <dbl>
##  1 INDEX            1    4038.   8070.   8110    1.21e+4 1.61e4 4657.    2.17e+7
##  2 TARGET           0       2       3.03    3    4   e+0 8   e0    1.93  3.71e+0
##  3 FixedAcidity   -18.1     5.2     7.08    6.9  9.5 e+0 3.44e1    6.32  3.99e+1
##  4 VolatileAcid…   -2.79    0.13    0.32    0.28 6.4 e-1 3.68e0    0.78  6.1 e-1
##  5 CitricAcid      -3.24    0.03    0.31    0.31 5.8 e-1 3.86e0    0.86  7.4 e-1
##  6 ResidualSugar -128.     -2       5.42    3.9  1.59e+1 1.41e2   33.8   1.14e+3
##  7 Chlorides       -1.17   -0.03    0.05    0.05 1.5 e-1 1.35e0    0.32  1   e-1
##  8 FreeSulfurDi… -555       0      30.8    30    7   e+1 6.23e2  149.    2.21e+4
##  9 TotalSulfurD… -823      27     121.    123    2.08e+2 1.06e3  232.    5.38e+4
## 10 Density          0.89    0.99    0.99    0.99 1   e+0 1.1 e0    0.03  0      
## 11 pH               0.48    2.96    3.21    3.2  3.47e+0 6.13e0    0.68  4.6 e-1
## 12 Sulphates       -3.13    0.28    0.53    0.5  8.6 e-1 4.24e0    0.93  8.7 e-1
## 13 Alcohol         -4.7     9      10.5    10.4  1.24e+1 2.65e1    3.73  1.39e+1
## 14 LabelAppeal     -2      -1      -0.01    0    1   e+0 2   e0    0.89  7.9 e-1
## 15 AcidIndex        4       7       7.77    8    8   e+0 1.7 e1    1.32  1.75e+0
## 16 STARS            1       1       2.04    2    3   e+0 4   e0    0.9   8.1 e-1
## # ℹ 9 more variables: CV <dbl>, Skewness <dbl>, Kurtosis <dbl>,
## #   ZeroCount <int>, ZeroProportion <dbl>, Missing <int>, PercentMissing <dbl>,
## #   Overdispersion <dbl>, Outliers <int>

The summary statistics highlights key metrics for numerical variables in the dataset, including central tendencies (mean, median), variability (standard deviation, variance), distribution shape (skewness, kurtosis), and data integrity measures (missing values, outliers). Here’s a statement tailored for this modeling context:

analysis of the Summary Statistics:

Target Variable (TARGET):
- The mean (3.03) and median (3.00) indicate a relatively symmetrical distribution with low skewness (-0.33).
- However, the kurtosis (2.12) suggests a slightly heavier-tailed distribution than normal.
- Approximately 21.37% of the TARGET values are zeros, which supports the appropriateness of Zero-Inflated Poisson (ZIP) or Negative Binomial models for this data.
Variables with Missing Data:
- ResidualSugar, Chlorides, FreeSulfurDioxide, TotalSulfurDioxide, Sulphates, and Alcohol have notable missing percentages (ranging from 4.81% to 9.46%).
- Addressing missing values or incorporating missing value flags will be essential for modeling.
Overdispersion:
- Many variables, such as FreeSulfurDioxide (Overdispersion: 716.99) and ResidualSugar (210.20), exhibit significant overdispersion (variance > mean), which further justifies using Negative Binomial models.
Variables with Zeros:
- Some variables, such as LabelAppeal (43.90% zeros) and TARGET (21.37% zeros), have a significant proportion of zeros. This reinforces the potential need for Zero-Inflated Models to account for excess zeros.
Outliers:
- Variables such as ResidualSugar (3298 outliers) and Chlorides (3021 outliers) exhibit a high number of outliers, which could influence regression models. Techniques like transformations or robust regressions may help mitigate their impact.
Highly Skewed Variables:
- Variables such as LabelAppeal (Skewness: 0.01) and AcidIndex (1.65) indicate noticeable deviations from symmetry.
- Highly skewed variables may benefit from transformations (e.g., Box-Cox or log transformations).

Suitability of Models Based on Summary Statistics

Given the observed distribution of the TARGET variable and the characteristics of predictor variables:

Poisson Regression: Suitable for count data, but check for overdispersion in residuals.
Negative Binomial Regression: Preferable for variables like ResidualSugar and FreeSulfurDioxide, where overdispersion is evident.
Zero-Inflated Models: Particularly useful for handling excess zeros in variables like LabelAppeal and TARGET.
Multiple Linear Regression: Can be explored for a broader perspective on relationships but may require transformations to address non-normality and outliers.

This summary emphasizes the appropriate models and preprocessing steps based on the data’s statistical characteristics.

Visualizations

Key Variables for Visualization Analysis:

Based on the summary statistics and modeling context, here are the key variables and visuals. These visualizations will provide critical insights into the data distribution, relationships, and model suitability.

Target Variable (TARGET)

The histogram and bar plot reveal a multimodal distribution of the TARGET variable, with prominent peaks at 0, 3, and 4, and very few occurrences at higher values. The excess of zeros in the data suggests potential zero-inflation, which makes the TARGET variable suitable for models like Zero-Inflated Poisson and Zero-Inflated Negative Binomial, in addition to standard Poisson and Negative Binomial models. The non-uniformity and zero-inflation also indicate that Multiple Regression models will need to account for the skewed and count-based nature of the TARGET variable to ensure accurate predictions. These visuals underscore the importance of considering overdispersion and zero-inflation when selecting statistical models for analysis.

library(ggplot2)
library(gridExtra)

## 
## Attaching package: 'gridExtra'

## The following object is masked from 'package:dplyr':
## 
##     combine

# Remove NAs from ResidualSugar
numerical_vars_clean <- numerical_vars %>%
  dplyr::filter(!is.na(ResidualSugar))

# Create the histogram of TARGET
histogram_plot <- ggplot(numerical_vars_clean, aes(x = TARGET)) +
  geom_histogram(binwidth = 1, fill = "black", color = "black", alpha = 0.6) +
  labs(title = "Histogram of TARGET", x = "TARGET", y = "Frequency") +
  theme_minimal()

# Create the bar plot of TARGET value counts
bar_plot <- ggplot(numerical_vars_clean, aes(x = factor(TARGET))) +
  geom_bar(fill = "black", color = "black", alpha = 0.6) +
  labs(title = "Bar Plot of TARGET Value Counts", x = "TARGET", y = "Count") +
  theme_minimal()

# Use gridExtra to arrange the plots
gridExtra::grid.arrange(histogram_plot, bar_plot, ncol = 1)

The histogram and bar plot reveal a multimodal distribution of the TARGET variable, with prominent peaks around 0, 3, and 4, and fewer occurrences at higher values. This suggests that TARGET is not evenly distributed, making it suitable for count-based regression models such as Poisson, Negative Binomial, and Zero-Inflated models. The observed non-uniformity highlights the importance of selecting appropriate statistical models to account for overdispersion and the excess zero counts. Multiple regression models may also need to address the variable’s skewed distribution for accurate prediction.

Highly Skewed Variables (e.g., ResidualSugar, FreeSulfurDioxide)

The ResidualSugar variable shows heavy skewness and numerous outliers, as seen in the density and box plots. These issues suggest the need for transformations (e.g., log or Box-Cox) to stabilize variance and normalize the distribution. For Poisson and Negative Binomial models, overdispersion and extreme values may require robust handling, such as Zero-Inflated models. For multiple regression, preprocessing the variable will likely improve model performance and interpretability.

R Code:

library(ggplot2)
library(gridExtra)

# Remove NAs from ResidualSugar
numerical_vars_clean <- numerical_vars %>%
  dplyr::filter(!is.na(ResidualSugar))

# Density plot of ResidualSugar
density_plot <- ggplot(numerical_vars_clean, aes(x = ResidualSugar)) +
  geom_density(fill = "black", alpha = 0.6) +
  labs(title = "Density Plot of ResidualSugar", x = "ResidualSugar", y = "Density") +
  theme_minimal()

# Box plot of ResidualSugar
box_plot <- ggplot(numerical_vars_clean, aes(y = ResidualSugar)) +
  geom_boxplot(fill = "black", color = "black", alpha = 0.6) +
  labs(title = "Box Plot of ResidualSugar", y = "ResidualSugar") +
  theme_minimal()

# Use gridExtra to arrange the plots
gridExtra::grid.arrange(density_plot, box_plot, ncol = 2)

LabelAppeal and STARS

The visualizations highlight the relationships between the key predictors (STARS and LabelAppeal) and TARGET. The box plot demonstrates that TARGET increases with higher values of STARS, consistent with the theoretical expectation that better wine ratings lead to higher sales. Similarly, the bar plot shows a clear positive trend between LabelAppeal and the mean TARGET, aligning with the notion that visually appealing labels drive consumer purchases. These predictors are significant in the models and should be emphasized for their strong predictive power in explaining TARGET.

library(ggplot2)
library(gridExtra)

# Box plot of TARGET by STARS
box_plot <- ggplot(numerical_vars, aes(x = factor(STARS), y = TARGET)) +
  geom_boxplot(fill = "black", color = "black", alpha = 0.6) +
  labs(title = "Box Plot of TARGET by STARS", x = "STARS", y = "TARGET") +
  scale_x_discrete(labels = function(x) sprintf("%.4f", as.numeric(x))) +
  theme_minimal()

# Bar plot of TARGET by LabelAppeal
bar_plot <- ggplot(numerical_vars, aes(x = factor(LabelAppeal), y = TARGET)) +
  geom_bar(stat = "summary", fun = "mean", fill = "black", color = "black", alpha = 0.6) +
  labs(title = "Bar Plot of TARGET by LabelAppeal", x = "LabelAppeal", y = "Mean TARGET") +
  scale_x_discrete(labels = function(x) sprintf("%.4f", as.numeric(x))) +
  theme_minimal()

# Use gridExtra to arrange the plots
gridExtra::grid.arrange(box_plot, bar_plot, ncol = 1)

Correlations Between Numeric Predictors and TARGET

The scatter plot reveals a positive but weak linear relationship between Alcohol and TARGET, suggesting Alcohol may have a small predictive effect on the number of cases purchased. The correlation heatmap highlights stronger positive correlations for variables such as STARS and LabelAppeal with TARGET, aligning with their theoretical importance in the models. These findings guide variable selection and emphasize the need for both simple and complex model types (e.g., Poisson, Negative Binomial, Zero-Inflated, and Multiple Regression) to capture various relationships effectively.

# Install the ggcorrplot package if not already installed
if (!requireNamespace("ggcorrplot", quietly = TRUE)) {
  install.packages("ggcorrplot")
}

# Load necessary libraries
library(ggplot2)
library(GGally)

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

library(ggcorrplot)
library(gridExtra)

# Scatter plot: TARGET vs. Alcohol
scatter_plot <- ggplot(numerical_vars, aes(x = Alcohol, y = TARGET)) +
  geom_point(alpha = 0.6, color = 'blue') +
  geom_smooth(method = "lm", color = 'red', se = FALSE) +
  labs(title = "Scatter Plot of TARGET vs Alcohol", x = "Alcohol", y = "TARGET") +
  theme_minimal()

# Correlation heatmap for numeric variables
numeric_vars <- dplyr::select(numerical_vars, where(is.numeric))
cor_matrix <- cor(numeric_vars, use = "complete.obs")
heatmap <- ggcorrplot(cor_matrix, lab = TRUE, digits = 1, title = "Correlation Heatmap")

# Display scatter plot and heatmap
scatter_plot

## `geom_smooth()` using formula = 'y ~ x'

## Warning: Removed 653 rows containing non-finite outside the scale range
## (`stat_smooth()`).

## Warning: Removed 653 rows containing missing values or values outside the scale range
## (`geom_point()`).

heatmap

Zero-Inflation and Overdispersion

The bar plot illustrates the proportion of zeros across numeric variables, with the TARGET variable exhibiting a notable degree of zero inflation. This suggests the appropriateness of employing Zero-Inflated Poisson or Negative Binomial regression to account for the excess zeros effectively. In contrast, variables such as LabelAppeal show a higher presence of non-zero values, indicating their suitability for traditional regression models without zero-inflation considerations.

# Bar plot of zero proportions
zero_counts <- numerical_vars %>%
 summarise(across(where(is.numeric), ~ mean(. == 0, na.rm = TRUE))) %>%
 pivot_longer(everything(), names_to = "Variable", values_to = "ZeroProportion")

ggplot(zero_counts, aes(x = reorder(Variable, -ZeroProportion), y = ZeroProportion)) +
 geom_bar(stat = "identity", fill = "black", color = "black", alpha = 0.6) +
 coord_flip() +
 labs(title = "Proportion of Zeros in Numeric Variables", x = "Variable", y = "Proportion of Zeros") +
 theme_minimal()

Data Preparation

Impute Missing Values

Imputation Strategy
1. Median Imputation: Missing values in numerical features with <20% missingness were imputed using the median, ensuring robustness to outliers.
2. Predictive Imputation: Missing STARS values were imputed using a decision tree (rpart) trained on complete cases.
3. Missingness Flags: Binary indicators were created for all variables with missing data to retain information about missing patterns.

# Load necessary libraries
library(dplyr)
library(rpart)

# Combine training and evaluation datasets for consistent imputation
combined_data <- bind_rows(
  train_data %>% mutate(dataset = "train"),
  eval_data %>% mutate(dataset = "eval")
)

# Step 1: Impute missing values with median for variables with < 20% missingness
combined_data_imputed <- combined_data %>%
  group_by(dataset) %>%
  mutate(
    ResidualSugar = ifelse(is.na(ResidualSugar), median(ResidualSugar, na.rm = TRUE), ResidualSugar),
    Chlorides = ifelse(is.na(Chlorides), median(Chlorides, na.rm = TRUE), Chlorides),
    FreeSulfurDioxide = ifelse(is.na(FreeSulfurDioxide), median(FreeSulfurDioxide, na.rm = TRUE), FreeSulfurDioxide),
    TotalSulfurDioxide = ifelse(is.na(TotalSulfurDioxide), median(TotalSulfurDioxide, na.rm = TRUE), TotalSulfurDioxide),
    pH = ifelse(is.na(pH), median(pH, na.rm = TRUE), pH),
    Sulphates = ifelse(is.na(Sulphates), median(Sulphates, na.rm = TRUE), Sulphates),
    Alcohol = ifelse(is.na(Alcohol), median(Alcohol, na.rm = TRUE), Alcohol)
  ) %>%
  ungroup()

# Step 2: Predictive imputation for STARS using rpart
# Separate combined data into rows with and without missing STARS
complete_stars <- combined_data_imputed %>% filter(!is.na(STARS))
missing_stars <- combined_data_imputed %>% filter(is.na(STARS))

# Check column names and exclude STARS from predictors
predictor_columns <- setdiff(names(combined_data_imputed), c("STARS", "dataset"))

# Train an rpart model to predict STARS
rpart_model <- rpart(
  STARS ~ .,
  data = complete_stars,
  method = "anova"
)

# Predict missing STARS values
predicted_stars <- predict(rpart_model, missing_stars)

# Impute the predicted values back into the combined dataset
combined_data_imputed <- combined_data_imputed %>%
  mutate(
    STARS = ifelse(is.na(STARS), predicted_stars, STARS)
  )

# Step 3: Create missingness flags for all variables with missing data
combined_data_imputed <- combined_data_imputed %>%
  mutate(
    ResidualSugar_missing = is.na(ResidualSugar),
    Chlorides_missing = is.na(Chlorides),
    FreeSulfurDioxide_missing = is.na(FreeSulfurDioxide),
    TotalSulfurDioxide_missing = is.na(TotalSulfurDioxide),
    pH_missing = is.na(pH),
    Sulphates_missing = is.na(Sulphates),
    Alcohol_missing = is.na(Alcohol),
    STARS_missing = is.na(STARS)
  )

# Step 4: Split back into train_data and eval_data
train_data_imputed <- combined_data_imputed %>% filter(dataset == "train") %>% dplyr::select(-dataset)
eval_data_imputed <- combined_data_imputed %>% filter(dataset == "eval") %>% dplyr::select(-dataset)

# Display the structure of the imputed datasets
cat("\nImputed Training Data\n\n")

## 
## Imputed Training Data

str(train_data_imputed)

## tibble [12,795 × 24] (S3: tbl_df/tbl/data.frame)
##  $ INDEX                     : int [1:12795] 1 2 4 5 6 7 8 11 12 13 ...
##  $ TARGET                    : int [1:12795] 3 3 5 3 4 0 0 4 3 6 ...
##  $ FixedAcidity              : num [1:12795] 3.2 4.5 7.1 5.7 8 11.3 7.7 6.5 14.8 5.5 ...
##  $ VolatileAcidity           : num [1:12795] 1.16 0.16 2.64 0.385 0.33 0.32 0.29 -1.22 0.27 -0.22 ...
##  $ CitricAcid                : num [1:12795] -0.98 -0.81 -0.88 0.04 -1.26 0.59 -0.4 0.34 1.05 0.39 ...
##  $ ResidualSugar             : num [1:12795] 54.2 26.1 14.8 18.8 9.4 ...
##  $ Chlorides                 : num [1:12795] -0.567 -0.425 0.037 -0.425 0.046 0.556 0.06 0.04 -0.007 -0.277 ...
##  $ FreeSulfurDioxide         : num [1:12795] 30 15 214 22 -167 -37 287 523 -213 62 ...
##  $ TotalSulfurDioxide        : num [1:12795] 268 -327 142 115 108 15 156 551 123 180 ...
##  $ Density                   : num [1:12795] 0.993 1.028 0.995 0.996 0.995 ...
##  $ pH                        : num [1:12795] 3.33 3.38 3.12 2.24 3.12 3.2 3.49 3.2 4.93 3.09 ...
##  $ Sulphates                 : num [1:12795] -0.59 0.7 0.48 1.83 1.77 1.29 1.21 0.5 0.26 0.75 ...
##  $ Alcohol                   : num [1:12795] 9.9 10.4 22 6.2 13.7 15.4 10.3 11.6 15 12.6 ...
##  $ LabelAppeal               : int [1:12795] 0 -1 -1 -1 0 0 0 1 0 0 ...
##  $ AcidIndex                 : int [1:12795] 8 7 8 6 9 11 8 7 6 8 ...
##  $ STARS                     : num [1:12795] 2 3 3 1 2 ...
##  $ ResidualSugar_missing     : logi [1:12795] FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ Chlorides_missing         : logi [1:12795] FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ FreeSulfurDioxide_missing : logi [1:12795] FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ TotalSulfurDioxide_missing: logi [1:12795] FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ pH_missing                : logi [1:12795] FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ Sulphates_missing         : logi [1:12795] FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ Alcohol_missing           : logi [1:12795] FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ STARS_missing             : logi [1:12795] FALSE FALSE FALSE FALSE FALSE FALSE ...

cat("\nImputed Evaluation Data\n\n")

## 
## Imputed Evaluation Data

str(eval_data_imputed)

## tibble [3,335 × 24] (S3: tbl_df/tbl/data.frame)
##  $ INDEX                     : int [1:3335] 3 9 10 18 21 30 31 37 39 47 ...
##  $ TARGET                    : int [1:3335] NA NA NA NA NA NA NA NA NA NA ...
##  $ FixedAcidity              : num [1:3335] 5.4 12.4 7.2 6.2 11.4 17.6 15.5 15.9 11.6 3.8 ...
##  $ VolatileAcidity           : num [1:3335] -0.86 0.385 1.75 0.1 0.21 0.04 0.53 1.19 0.32 0.22 ...
##  $ CitricAcid                : num [1:3335] 0.27 -0.76 0.17 1.8 0.28 -1.15 -0.53 1.14 0.55 0.31 ...
##  $ ResidualSugar             : num [1:3335] -10.7 -19.7 -33 1 1.2 1.4 4.6 31.9 -50.9 -7.7 ...
##  $ Chlorides                 : num [1:3335] 0.092 1.169 0.065 -0.179 0.038 ...
##  $ FreeSulfurDioxide         : num [1:3335] 23 -37 9 104 70 -250 10 115 35 40 ...
##  $ TotalSulfurDioxide        : num [1:3335] 398 68 76 89 53 140 17 381 83 129 ...
##  $ Density                   : num [1:3335] 0.985 0.99 1.046 0.989 1.029 ...
##  $ pH                        : num [1:3335] 5.02 3.37 4.61 3.2 2.54 3.06 3.07 2.99 3.32 4.72 ...
##  $ Sulphates                 : num [1:3335] 0.64 1.09 0.68 2.11 -0.07 -0.02 0.75 0.31 2.18 -0.64 ...
##  $ Alcohol                   : num [1:3335] 12.3 16 8.55 12.3 4.8 11.4 8.5 11.4 -0.5 10.9 ...
##  $ LabelAppeal               : int [1:3335] -1 0 0 -1 0 1 0 1 0 0 ...
##  $ AcidIndex                 : int [1:3335] 6 6 8 8 10 8 12 7 12 7 ...
##  $ STARS                     : num [1:3335] 1.29 2 1 1 1.29 ...
##  $ ResidualSugar_missing     : logi [1:3335] FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ Chlorides_missing         : logi [1:3335] FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ FreeSulfurDioxide_missing : logi [1:3335] FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ TotalSulfurDioxide_missing: logi [1:3335] FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ pH_missing                : logi [1:3335] FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ Sulphates_missing         : logi [1:3335] FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ Alcohol_missing           : logi [1:3335] FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ STARS_missing             : logi [1:3335] FALSE FALSE FALSE FALSE FALSE FALSE ...

Transform Data with Box-Cox and Yeo-Johnson

Transformation Approach:
- Box-Cox: Applied to numeric features adjusted for positivity; optimal lambda determined using training data.
- Yeo-Johnson: Handles all numeric features without positivity constraints; lambda optimized via car::powerTransform.
- Transformations were performed on the combined dataset but optimized using training data only.
- Final outputs: Transformed train and evaluation datasets for both methods.

# Combine train and evaluation datasets for consistent transformation
combined_data <- bind_rows(
  train_data_imputed %>% mutate(dataset = "train"),
  eval_data_imputed %>% mutate(dataset = "eval")
)

# Exclude columns from transformations
exclude_cols <- c("INDEX", "TARGET", "dataset")

# Identify numeric columns for transformation
numeric_cols <- setdiff(names(combined_data)[sapply(combined_data, is.numeric)], exclude_cols)

# Perform Box-Cox transformation
boxcox_data_combined <- combined_data

for (col in numeric_cols) {
  if (col %in% c("INDEX", "TARGET") && all(is.na(combined_data[[col]][combined_data$dataset == "eval"]))) {
    next # Skip transformation for INDEX and TARGET in eval_data if all values are NA
  }
  
  tryCatch({
    # Extract the column as a vector
    column_data <- combined_data[[col]]
    
    # Check for non-positive values and adjust
    adjustment <- 0
    if (min(column_data, na.rm = TRUE) <= 0) {
      adjustment <- abs(min(column_data, na.rm = TRUE)) + 0.001
      column_data <- column_data + adjustment
    }
    
    # Fit a simple linear model on training data only
    model <- lm(column_data[combined_data$dataset == "train"] ~ 1)
    
    # Perform Box-Cox transformation
    bc <- boxcox(model, lambda = seq(-2, 2, by = 0.1), plotit = FALSE)
    optimal_lambda <- bc$x[which.max(bc$y)]
    
    # Apply the transformation
    if (!is.na(optimal_lambda)) {
      if (optimal_lambda == 0) {
        transformed <- log(column_data)
      } else {
        transformed <- ((column_data)^optimal_lambda - 1) / optimal_lambda
      }
      if (adjustment > 0) {
        transformed <- transformed - adjustment
      }
      boxcox_data_combined[[col]] <- transformed
    }
  }, error = function(e) {
    cat(paste("Error processing column", col, ":", e$message, "\n"))
  })
}

# Perform Yeo-Johnson transformation
yeojohnson_data_combined <- combined_data

for (col in numeric_cols) {
  if (col %in% c("INDEX", "TARGET") && all(is.na(combined_data[[col]][combined_data$dataset == "eval"]))) {
    next # Skip transformation for INDEX and TARGET in eval_data if all values are NA
  }
  
  tryCatch({
    # Extract the column as a vector
    column_data <- combined_data[[col]]
    
    # Compute optimal lambda using training data
    lambda <- car::powerTransform(column_data[combined_data$dataset == "train"], family = "yjPower")$lambda
    
    # Apply the Yeo-Johnson transformation
    transformed <- car::yjPower(column_data, lambda)
    
    # Save the transformed data
    yeojohnson_data_combined[[col]] <- transformed
  }, error = function(e) {
    cat(paste("Error processing column", col, ":", e$message, "\n"))
  })
}


# Split transformed dataset back into train and evaluation datasets
boxcox_train_data <- boxcox_data_combined %>% 
  filter(dataset == "train") %>% 
  dplyr::select(-c(dataset))

boxcox_eval_data <- boxcox_data_combined %>% 
  filter(dataset == "eval") %>% 
  dplyr::select(-c(dataset))

yeojohnson_train_data <- yeojohnson_data_combined %>% 
  filter(dataset == "train") %>% 
  dplyr::select(-c(dataset))

yeojohnson_eval_data <- yeojohnson_data_combined %>% 
  filter(dataset == "eval") %>% 
  dplyr::select(-c(dataset))

# Display structures of transformed datasets
cat("\nBox-Cox Transformed Training Dataset:\n\n")

## 
## Box-Cox Transformed Training Dataset:

str(boxcox_train_data)

## tibble [12,795 × 24] (S3: tbl_df/tbl/data.frame)
##  $ INDEX                     : int [1:12795] 1 2 4 5 6 7 8 11 12 13 ...
##  $ TARGET                    : int [1:12795] 3 3 5 3 4 0 0 4 3 6 ...
##  $ FixedAcidity              : num [1:12795] 7.32 9.09 12.66 10.73 13.91 ...
##  $ VolatileAcidity           : num [1:12795] 0.427 -0.706 2.155 -0.454 -0.516 ...
##  $ CitricAcid                : num [1:12795] -1.92 -1.735 -1.811 -0.791 -2.222 ...
##  $ ResidualSugar             : num [1:12795] 150 103.1 84.5 91.1 75.6 ...
##  $ Chlorides                 : num [1:12795] -1.558 -1.422 -0.961 -1.422 -0.952 ...
##  $ FreeSulfurDioxide         : num [1:12795] 457 428.6 810.4 441.8 90.8 ...
##  $ TotalSulfurDioxide        : num [1:12795] 1172.3 14.9 920.3 866.7 852.8 ...
##  $ Density                   : num [1:12795] -0.00719 0.02804 -0.00482 -0.0036 -0.00543 ...
##  $ pH                        : num [1:12795] 2.51 2.56 2.27 1.3 2.27 ...
##  $ Sulphates                 : num [1:12795] -1.5043 -0.0567 -0.3076 1.2533 1.1829 ...
##  $ Alcohol                   : num [1:12795] 11.75 12.4 28.1 6.97 16.77 ...
##  $ LabelAppeal               : num [1:12795] -1.07 -2 -2 -2 -1.07 ...
##  $ AcidIndex                 : num [1:12795] 0.765 0.753 0.765 0.736 0.774 ...
##  $ STARS                     : num [1:12795] 0.647 0.986 0.986 0 0.647 ...
##  $ ResidualSugar_missing     : logi [1:12795] FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ Chlorides_missing         : logi [1:12795] FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ FreeSulfurDioxide_missing : logi [1:12795] FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ TotalSulfurDioxide_missing: logi [1:12795] FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ pH_missing                : logi [1:12795] FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ Sulphates_missing         : logi [1:12795] FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ Alcohol_missing           : logi [1:12795] FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ STARS_missing             : logi [1:12795] FALSE FALSE FALSE FALSE FALSE FALSE ...

cat("\nBox-Cox Transformed Evaluation Dataset:\n\n")

## 
## Box-Cox Transformed Evaluation Dataset:

str(boxcox_eval_data)

## tibble [3,335 × 24] (S3: tbl_df/tbl/data.frame)
##  $ INDEX                     : int [1:3335] 3 9 10 18 21 30 31 37 39 47 ...
##  $ TARGET                    : int [1:3335] NA NA NA NA NA NA NA NA NA NA ...
##  $ FixedAcidity              : num [1:3335] 10.3 20.1 12.8 11.4 18.7 ...
##  $ VolatileAcidity           : num [1:3335] -1.822 -0.454 1.109 -0.773 -0.65 ...
##  $ CitricAcid                : num [1:3335] -0.531 -1.68 -0.644 1.237 -0.52 ...
##  $ ResidualSugar             : num [1:3335] 43 28.56 7.44 61.94 62.26 ...
##  $ Chlorides                 : num [1:3335] -0.905 0.236 -0.932 -1.179 -0.96 ...
##  $ FreeSulfurDioxide         : num [1:3335] 444 331 417 598 533 ...
##  $ TotalSulfurDioxide        : num [1:3335] 1435 774 789 815 744 ...
##  $ Density                   : num [1:3335] -0.0147 -0.00951 0.04673 -0.01121 0.02912 ...
##  $ pH                        : num [1:3335] 4.45 2.55 3.97 2.36 1.63 ...
##  $ Sulphates                 : num [1:3335] -0.1253 0.3915 -0.0796 1.5828 -0.928 ...
##  $ Alcohol                   : num [1:3335] 14.91 19.87 9.99 14.91 5.21 ...
##  $ LabelAppeal               : num [1:3335] -2 -1.07 -1.07 -2 -1.07 ...
##  $ AcidIndex                 : num [1:3335] 0.736 0.736 0.765 0.765 0.781 ...
##  $ STARS                     : num [1:3335] 0.246 0.647 0 0 0.246 ...
##  $ ResidualSugar_missing     : logi [1:3335] FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ Chlorides_missing         : logi [1:3335] FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ FreeSulfurDioxide_missing : logi [1:3335] FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ TotalSulfurDioxide_missing: logi [1:3335] FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ pH_missing                : logi [1:3335] FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ Sulphates_missing         : logi [1:3335] FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ Alcohol_missing           : logi [1:3335] FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ STARS_missing             : logi [1:3335] FALSE FALSE FALSE FALSE FALSE FALSE ...

Visualize Skewness Before and After Transformations

Effectiveness of Transformations:
- Both Box-Cox and Yeo-Johnson reduced skewness significantly for most variables, especially highly skewed ones like AcidIndex, Sulphates, and STARS.
Comparison:
- Box-Cox: More effective for highly skewed variables like AcidIndex and Sulphates.
- Yeo-Johnson: Effective but slightly less impactful for extreme skewness; better suited for variables with zero or negative values.
Minimal Changes:
- Variables with low skewness, such as Density and Chlorides, showed little change as they were already near symmetric.

Both transformations improve symmetry, with Box-Cox being more effective overall, while Yeo-Johnson offers versatility for handling zero or negative values. For our modeling homework, we will use the Box-Cox transformed data

# Required Libraries
library(ggplot2)
library(dplyr)
library(tidyr)

# Compute skewness for original, Box-Cox, and Yeo-Johnson transformed datasets
compute_skewness <- function(data, numeric_cols) {
  sapply(data %>% dplyr::select(all_of(numeric_cols)), function(x) {
    moments::skewness(x, na.rm = TRUE)
  })
}

# Calculate skewness for all datasets
original_skewness <- compute_skewness(train_data_imputed, numeric_cols)
boxcox_skewness <- compute_skewness(boxcox_train_data, numeric_cols)
yeojohnson_skewness <- compute_skewness(yeojohnson_train_data, numeric_cols)

# Combine skewness data into a single dataframe for plotting
skewness_df <- data.frame(
  Variable = rep(numeric_cols, 3),
  Transformation = rep(c("Original", "Box-Cox", "Yeo-Johnson"), each = length(numeric_cols)),
  Skewness = c(original_skewness, boxcox_skewness, yeojohnson_skewness)
)

# Create a bar chart to compare skewness
ggplot(skewness_df, aes(x = Variable, y = Skewness, fill = Transformation)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(
    title = "Skewness Comparison Before and After Transformations",
    x = "Variable",
    y = "Skewness"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_fill_brewer(palette = "Set2")

Identify Highly Correlated Variables and Plot Correlations

No Highly Correlated Variable Pairs (Threshold > 0.9):
- The high_corr_df data frame is empty, indicating that no variable pairs have a correlation above the threshold of 0.9 (or below -0.9).
- This suggests that the dataset does not have multicollinearity issues due to highly correlated variables.
Correlation Matrix Plot:
- The corrplot visualization shows the pairwise correlations between variables.
- Most correlations are weak to moderate, with no strong patterns of redundancy among variables.
- The diagonal of the matrix (autocorrelations of variables with themselves) is 1, as expected.

# Load necessary libraries
library(corrplot)

## corrplot 0.95 loaded

# Compute the correlation matrix (only for numerical predictors)
cor_matrix <- cor(boxcox_train_data %>% select_if(is.numeric), use = "pairwise.complete.obs")

# Identify highly correlated variable pairs (e.g., correlation > 0.9 or < -0.9)
high_corr_pairs <- which(abs(cor_matrix) > 0.9 & abs(cor_matrix) < 1, arr.ind = TRUE)
high_corr_df <- data.frame(
  Variable1 = rownames(cor_matrix)[high_corr_pairs[, 1]],
  Variable2 = colnames(cor_matrix)[high_corr_pairs[, 2]],
  Correlation = cor_matrix[high_corr_pairs]
)

# Print highly correlated variable pairs
print(high_corr_df)

## [1] Variable1   Variable2   Correlation
## <0 rows> (or 0-length row.names)

# Plot the correlation matrix using corrplot
corrplot(cor_matrix, method = "color", type = "upper", tl.col = "black", tl.cex = 0.8)

Confirmation of No Multicollinearity with VIF Before Model Fitting

The output indicates that multicollinearity is not a concern among the predictors in the dataset. Here are the key points:

Variance Inflation Factor (VIF):
- All VIF values are close to 1, with the highest being 1.12 (for STARS), far below the common threshold of 10.
- This suggests that the predictors are largely independent of each other.
No High Multicollinearity Detected:
- The high_vif object is empty (named numeric(0)), confirming that no variables exhibit problematic multicollinearity.
Conclusion:
- You can safely proceed with model fitting using all predictors without any adjustments for multicollinearity.

To address the task comprehensively using concepts from Chapter 5 of the uploaded document, I propose the following steps to build and compare models across Poisson, Negative Binomial, and Linear Regression approaches, incorporating different techniques for variable selection, transformation, and model diagnostics. Here’s the strategy:

# Load necessary libraries
library(car)

## Loading required package: carData

## 
## Attaching package: 'car'

## The following object is masked from 'package:dplyr':
## 
##     recode

library(carData)

#### Confirmation of No Multicollinearity with VIF Before Model Fitting

# Fit an initial model with all predictors (ignoring TARGET for this step)
initial_model <- lm(
  FixedAcidity ~ VolatileAcidity + CitricAcid + ResidualSugar +
    Chlorides + FreeSulfurDioxide + TotalSulfurDioxide + Density + pH +
    Sulphates + Alcohol + LabelAppeal + AcidIndex + STARS,
  data = boxcox_train_data
)

# Check VIF for all predictors
vif_values <- vif(initial_model)

# Convert VIF values to a data frame
vif_df <- data.frame(
  Variable = names(vif_values),
  VIF = vif_values
)

# Print the VIF data frame
print(vif_df)

##                              Variable      VIF
## VolatileAcidity       VolatileAcidity 1.004032
## CitricAcid                 CitricAcid 1.005589
## ResidualSugar           ResidualSugar 1.001635
## Chlorides                   Chlorides 1.002502
## FreeSulfurDioxide   FreeSulfurDioxide 1.002859
## TotalSulfurDioxide TotalSulfurDioxide 1.003562
## Density                       Density 1.003313
## pH                                 pH 1.005586
## Sulphates                   Sulphates 1.001448
## Alcohol                       Alcohol 1.006642
## LabelAppeal               LabelAppeal 1.080330
## AcidIndex                   AcidIndex 1.027973
## STARS                           STARS 1.092914

# Identify variables with high multicollinearity (VIF > 10)
high_vif <- vif_values[vif_values > 10]

# Create a data frame for high VIF values
high_vif_df <- data.frame(
  Variable = names(high_vif),
  VIF = high_vif
)

# Print the high VIF data frame
print(high_vif_df)

## [1] Variable VIF     
## <0 rows> (or 0-length row.names)

Scale the data

# Scale only numeric columns excluding TARGET
boxcox_train_data <- boxcox_train_data %>%
  mutate(across(where(is.numeric) & !starts_with(c("TARGET", "INDEX")), scale))

boxcox_eval_data <- boxcox_eval_data %>%
  mutate(across(where(is.numeric) & !starts_with(c("TARGET", "INDEX")), scale))

Build Models

Data Splitting, and Outlier Analysis Before Model Training**

In initial model is fit to the data to analyze Cook’s Distance, detect and remove outliers. This step ensures the dataset is clean, free from irrelevant or redundant features, and balanced before fitting models. It removes unnecessary columns (e.g., missing value indicators) because they have been found to not contribute to model performance, partitions the data for training and evaluation, and identifies influential points across multiple models to enhance data quality and improve model performance.

# Load necessary libraries
library(MASS)
library(pscl)
library(car)
library(dplyr)
library(caret)

## Loading required package: lattice

library(lattice)
library(MASS)  # For Negative Binomial Model
library(car)   # For diagnostic tools


# Remove unnecessary columns (e.g., "_missing" flags)
boxcox_train_data <- boxcox_train_data %>%
  dplyr::select(-ends_with("_missing"), -INDEX)

boxcox_eval_data <- boxcox_eval_data %>%
  dplyr::select(-ends_with("_missing"))

# Split data into training and testing sets
set.seed(123)  # For reproducibility
train_index <- createDataPartition(boxcox_train_data$TARGET, p = 0.8, list = FALSE)
train_data <- boxcox_train_data[train_index, ]
test_data <- boxcox_train_data[-train_index, ]

# Convert matrix columns to numeric vectors for train and test data
train_data <- train_data %>%
  mutate(across(where(is.matrix), ~ as.numeric(.)))

test_data <- test_data %>%
  mutate(across(where(is.matrix), ~ as.numeric(.)))

# Split data into training and testing sets
set.seed(123)  # For reproducibility
train_index <- createDataPartition(boxcox_train_data$TARGET, p = 0.8, list = FALSE)
train_data <- boxcox_train_data[train_index, ]
test_data <- boxcox_train_data[-train_index, ]

# Fit the Poisson Model using all variables
poisson_model_all <- glm(TARGET ~ ., data = train_data, family = poisson(link = "log"))

# Fit the Negative Binomial Model using all variables
nb_model <- glm.nb(TARGET ~ ., data = train_data)

## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached

# Fit the Multiple Linear Regression Model using all variables
mlr_model <- lm(TARGET ~ ., data = train_data)

# Calculate Cook's Distance for Poisson Model
cooks_dist_poisson <- cooks.distance(poisson_model_all)
influential_poisson <- which(cooks_dist_poisson > (4 / nrow(train_data)))

# Calculate Cook's Distance for Negative Binomial Model
cooks_dist_nb <- cooks.distance(nb_model)
influential_nb <- which(cooks_dist_nb > (4 / nrow(train_data)))

# Calculate Cook's Distance for MLR Model
cooks_dist_mlr <- cooks.distance(mlr_model)
influential_mlr <- which(cooks_dist_mlr > (4 / nrow(train_data)))

# Union of Influential Points Across All Models
influential_points <- union(union(influential_poisson, influential_nb), influential_mlr)

# Refine Dataset by Removing Influential Points
train_data_refined <- train_data[-influential_points, ]

# Output Number of Removed Points
cat("Total Influential Points Removed Across Models:", length(influential_points), "\n")

## Total Influential Points Removed Across Models: 547

# Convert matrix columns to numeric vectors for train and test data
train_data_refined <- train_data_refined %>%
  mutate(across(where(is.matrix), ~ as.numeric(.)))

test_data <- test_data %>%
  mutate(across(where(is.matrix), ~ as.numeric(.)))

Plot Decision Tree to identify interactions

The tree model identifies STARS, LabelAppeal, and AcidIndex as key predictors, with interactions evident where the effect of LabelAppeal and AcidIndex depends on the range of STARS. Specifically, the effect of LabelAppeal varies within different ranges of STARS, and AcidIndex plays a role when STARS is less than 0.16. Possible interaction terms to include in a model are STARS × LabelAppeal and STARS × AcidIndex to capture these relationships.

# Load necessary libraries
library(rpart)
library(rpart.plot)

# Train a regression tree model with increased depth
set.seed(456)
treeModel <- rpart(
  TARGET ~ .,
  data = train_data_refined,
  method = "anova",
  control = rpart.control(maxdepth = 10)  # Increase the maximum depth
)

# Visualize the tree
rpart.plot(treeModel,
           box.palette = "RdBu",
           shadow.col = "gray",
           nn = TRUE)

# Extract split details using base R
split_details <- as.data.frame(treeModel$frame)
split_details <- split_details[split_details$var != "<leaf>", c("var", "n", "dev", "yval")]

# Output split details
split_details

##            var    n       dev     yval
## 1        STARS 9691 33050.204 3.136209
## 2        STARS 4171 12204.959 2.003117
## 4    AcidIndex 1821  4952.293 1.266886
## 3  LabelAppeal 5520 11443.680 3.992391
## 6  LabelAppeal 3687  6333.720 3.502848
## 13       STARS 2509  3572.616 3.859705
## 26       STARS 1541  2496.550 3.541856
## 7        STARS 1833  2449.038 4.977087
## 14       STARS  811  1371.366 4.541307

Examine variable importatance for modeling

The Random Forest model confirms the importance of STARS and LabelAppeal as the most influential predictors, with AcidIndex, Chlorides, and TotalSulfurDioxide also showing moderate significance. Other variables, such as VolatileAcidity and Alcohol, have a smaller but notable impact, while the remaining predictors contribute minimally. This supports the focus on STARS × LabelAppeal and potentially STARS × AcidIndex for interaction modeling.

# Load necessary libraries
library(randomForest)

## randomForest 4.7-1.2

## Type rfNews() to see new features/changes/bug fixes.

## 
## Attaching package: 'randomForest'

## The following object is masked from 'package:gridExtra':
## 
##     combine

## The following object is masked from 'package:dplyr':
## 
##     combine

## The following object is masked from 'package:ggplot2':
## 
##     margin

library(doParallel)

## Loading required package: foreach

## Loading required package: iterators

## Loading required package: parallel

# Reduce computation time strategies

# Subset the dataset (if dataset is large)
set.seed(123)  # For reproducibility
sample_data <- train_data_refined[sample(1:nrow(train_data_refined), size = 5000), ]

# Set up parallel processing
cl <- makeCluster(detectCores() - 1)  # Use all but one core
registerDoParallel(cl)

# Train the Random Forest model
set.seed(456)  # For reproducibility
rf_model <- randomForest(
  TARGET ~ ., 
  data = sample_data, 
  importance = TRUE,  # Compute variable importance
  ntree = 100,        # Reduce the number of trees
  mtry = 2,           # Reduce the number of predictors per split
  nodesize = 10       # Increase node size to reduce splits
)

# Stop parallel processing
stopCluster(cl)

# Display variable importance
print(importance(rf_model))

##                       %IncMSE IncNodePurity
## FixedAcidity        1.3144050      615.5860
## VolatileAcidity     4.7946415      750.5674
## CitricAcid          0.9824440      577.4464
## ResidualSugar       3.0250795      628.7444
## Chlorides           5.0115482      774.0460
## FreeSulfurDioxide   5.2555645      712.3400
## TotalSulfurDioxide  2.0808506      737.3759
## Density             2.4141392      698.5284
## pH                 -0.9662872      593.9370
## Sulphates          -0.4988596      581.0680
## Alcohol             3.3110223      662.1983
## LabelAppeal        30.2845082     1894.0864
## AcidIndex          18.9442301      932.2876
## STARS              33.7295596     4122.4252

# Plot variable importance
varImpPlot(rf_model, main = "Variable Importance Plot")

## Poisson Regression Models

The Higher-Order Interaction Model is the best-performing Poisson model, based on the lowest Test RMSE (1.584489).

Key Metrics Comparison:

HigherOrderInteraction achieves the lowest Test RMSE (1.584489) and balances performance across Train/Test R² values.
Other models (Stepwise, Interaction, Additive, Simplified) exhibit higher Test RMSE, indicating relatively poorer generalization performance.

This confirms the robustness of the Higher-Order Interaction model in the Poisson regression context.

# Load necessary libraries
library(Metrics)

## 
## Attaching package: 'Metrics'

## The following objects are masked from 'package:caret':
## 
##     precision, recall

library(knitr)
library(MASS)  # For stepAIC function
library(caret) # For train-test split

train_data_refined <- train_data_refined %>% drop_na(TARGET)
test_data <- test_data %>% drop_na(TARGET)

# Step 2: Fit Multiple Poisson Models Using the Training Data
# 1. Basic Additive Model
poisson_model_additive <- glm(TARGET ~ STARS + LabelAppeal + AcidIndex + TotalSulfurDioxide + VolatileAcidity, 
                              family = poisson(link = "log"), data = train_data_refined)

# 2. Interaction Model
poisson_model_interaction <- glm(TARGET ~ STARS * LabelAppeal + STARS * AcidIndex + AcidIndex * LabelAppeal + 
                                   TotalSulfurDioxide + VolatileAcidity, 
                                 family = poisson(link = "log"), data = train_data_refined)

# 3. Higher-Order Interaction Model
poisson_model_higher_interaction <- glm(TARGET ~ STARS * LabelAppeal + STARS * AcidIndex + AcidIndex * LabelAppeal + 
                                          TotalSulfurDioxide * VolatileAcidity + I(STARS^2) + I(LabelAppeal^2), 
                                        family = poisson(link = "log"), data = train_data_refined)

# 4. Simplified Model for Parsimony
poisson_model_simplified <- glm(TARGET ~ STARS * AcidIndex, 
                                family = poisson(link = "log"), data = train_data_refined)

# 5. Stepwise Regression Model
# Start with the full model and perform stepwise selection based on AIC
poisson_full_model <- glm(TARGET ~ ., family = poisson(link = "log"), data = train_data_refined)  # Use all predictors
poisson_model_stepwise <- stepAIC(poisson_full_model, direction = "both", trace = FALSE)  # Stepwise selection

# Step 3: Evaluate Models on Both Training and Test Data
# Define a helper function to calculate metrics for each Poisson model
calculate_poisson_metrics <- function(model, train_data_refined, test_data) {
  # Predictions
  train_preds <- predict(model, newdata = train_data_refined, type = "response")
  test_preds <- predict(model, newdata = test_data, type = "response")
  
  # Calculate metrics
  metrics <- data.frame(
    Train_R2 = 1 - (sum((train_data_refined$TARGET - train_preds)^2) / sum((train_data_refined$TARGET - mean(train_data_refined$TARGET))^2)),
    Test_R2 = 1 - (sum((test_data$TARGET - test_preds)^2) / sum((test_data$TARGET - mean(test_data$TARGET))^2)),
    Train_RMSE = rmse(train_data_refined$TARGET, train_preds),
    Test_RMSE = rmse(test_data$TARGET, test_preds),
    AIC = AIC(model),
    Null_Deviance = model$null.deviance,  # Null deviance for the model
    Residual_Deviance = model$deviance   # Residual deviance for the model
  )
  
  return(metrics)
}

# Apply the helper function to all Poisson models
poisson_model_metrics <- rbind(
  Additive = calculate_poisson_metrics(poisson_model_additive, train_data_refined, test_data),
  Interaction = calculate_poisson_metrics(poisson_model_interaction, train_data_refined, test_data),
  HigherOrderInteraction = calculate_poisson_metrics(poisson_model_higher_interaction, train_data_refined, test_data),
  Simplified = calculate_poisson_metrics(poisson_model_simplified, train_data_refined, test_data),
  Stepwise = calculate_poisson_metrics(poisson_model_stepwise, train_data_refined, test_data)
)

# Step 4: Sort Models by Test RMSE
# Sort by Test_RMSE (ascending order) to identify the best model
poisson_model_metrics_sorted <- poisson_model_metrics[order(poisson_model_metrics$Test_RMSE), ]

# Step 5: Display the Sorted Metrics in a Table
cat("\nPoisson Model Comparison Metrics Sorted by Test RMSE\n")

## 
## Poisson Model Comparison Metrics Sorted by Test RMSE

poisson_model_metrics_sorted

##                         Train_R2   Test_R2 Train_RMSE Test_RMSE      AIC
## HigherOrderInteraction 0.4223409 0.3307138   1.403585  1.584489 36020.06
## Stepwise               0.4061486 0.3205099   1.423120  1.596522 36294.70
## Interaction            0.4216400 0.3190610   1.404436  1.598223 36120.49
## Additive               0.4025244 0.3148473   1.427456  1.603161 36338.88
## Simplified             0.3181138 0.2692662   1.524961  1.655629 37232.56
##                        Null_Deviance Residual_Deviance
## HigherOrderInteraction      15416.13          10838.14
## Stepwise                    15416.13          11110.79
## Interaction                 15416.13          10944.58
## Additive                    15416.13          11168.96
## Simplified                  15416.13          12066.65

# Step 6: Identify and Output the Best Poisson Model
# Ensure the best model name is extracted properly
best_poisson_model_index <- which.min(poisson_model_metrics_sorted$Test_RMSE)  # Get the index of the lowest Test RMSE
best_poisson_model_name <- rownames(poisson_model_metrics_sorted)[best_poisson_model_index]  # Extract the model name
best_poisson_model_type <- paste("Poisson Best Model:", best_poisson_model_name)  # Include the model type

# Output the best Poisson model name
cat("Best model based on lowest Test RMSE:", best_poisson_model_type, "\n")

## Best model based on lowest Test RMSE: Poisson Best Model: HigherOrderInteraction

Zero-inflated Poisson Regression Models - Use the best Poisson Model’s Formula

The Vuong test failed to produce valid results (NULL), likely due to a mismatch in data dimensions. However, the Zero-Inflated Poisson model appears well-fitted, with significant predictors (e.g., STARS, LabelAppeal, AcidIndex) and a log-likelihood of -1.743e+04. Further checks on data alignment are needed for accurate comparison.

test_data <- test_data %>% drop_na()
train_data_refined <- train_data_refined %>% drop_na()

# Load necessary libraries
library(pscl)
library(Metrics)  # For RMSE
library(MASS)     # For Negative Binomial model
library(knitr)    # For pretty table output

test_data <- test_data %>% drop_na()
train_data_refined <- train_data_refined %>% drop_na()

# Step 1: Fit Zero-Inflated Poisson Model
zip_model_higher_interaction <- zeroinfl(
  TARGET ~ STARS * LabelAppeal + STARS * AcidIndex + AcidIndex * LabelAppeal +
    TotalSulfurDioxide * VolatileAcidity + I(STARS^2) + I(LabelAppeal^2) | 1,
  data = train_data_refined,
  dist = "poisson"
)

# Step 2: Define Function to Evaluate Model Metrics
evaluate_zip_metrics <- function(model, train_data_refined, test_data) {
  tryCatch({
    # Predictions
    train_preds <- predict(model, newdata = train_data_refined, type = "response")
    test_preds <- predict(model, newdata = test_data, type = "response")
    
    # RMSE Calculation
   
    train_rmse <- sqrt(mean((train_data_refined$TARGET - train_preds)^2))
    test_rmse <- sqrt(mean((test_data$TARGET - test_preds)^2))

    # R-squared Calculation
    train_r2 <- 1 - sum((train_data_refined$TARGET - train_preds)^2) / sum((train_data_refined$TARGET - mean(train_data_refined$TARGET))^2)
    test_r2 <- 1 - sum((test_data$TARGET - test_preds)^2) / sum((test_data$TARGET - mean(test_data$TARGET))^2)
    
    # Deviance Metrics
    null_deviance <- model$null
    residual_deviance <- model$deviance
    
    # Model Metrics Table
    metrics <- data.frame(
      Train_R2 = train_r2,
      Test_R2 = test_r2,
      Train_RMSE = train_rmse,
      Test_RMSE = test_rmse,
      AIC = AIC(model),
      Null_Deviance = null_deviance,
      Residual_Deviance = residual_deviance
    )
    
    return(metrics)
  }, error = function(e) {
    cat("Error evaluating model metrics:", e$message, "\n")
    return(NULL)
  })
}

# Step 3: Apply Metrics Function
zip_metrics <- evaluate_zip_metrics(zip_model_higher_interaction, train_data_refined, test_data)

## Error evaluating model metrics: arguments imply differing number of rows: 1, 0

# Step 4: Model Comparison (Vuong Test with Negative Binomial)
vuong_test <- vuong(zip_model_higher_interaction, poisson_model_higher_interaction)

## Vuong Non-Nested Hypothesis Test-Statistic: 
## (test-statistic is asymptotically distributed N(0,1) under the
##  null that the models are indistinguishible)
## -------------------------------------------------------------
##               Vuong z-statistic             H_A    p-value
## Raw                    16.50134 model1 > model2 < 2.22e-16
## AIC-corrected          16.47250 model1 > model2 < 2.22e-16
## BIC-corrected          16.36899 model1 > model2 < 2.22e-16

# Step 5: Display Results
# Metrics Table
if (!is.null(zip_metrics)) {
  cat("Model Performance Metrics:\n")
  print(kable(zip_metrics, digits = 3, caption = "Zero-Inflated Poisson Model Metrics"))
}


# Vuong Test Results
cat("\nVuong Test Results:\n")

## 
## Vuong Test Results:

print(vuong_test)

## NULL

# Model Summary
cat("\nSummary of the Zero-Inflated Poisson Model:\n")

## 
## Summary of the Zero-Inflated Poisson Model:

print(summary(zip_model_higher_interaction))

## 
## Call:
## zeroinfl(formula = TARGET ~ STARS * LabelAppeal + STARS * AcidIndex + 
##     AcidIndex * LabelAppeal + TotalSulfurDioxide * VolatileAcidity + 
##     I(STARS^2) + I(LabelAppeal^2) | 1, data = train_data_refined, dist = "poisson")
## 
## Pearson residuals:
##     Min      1Q  Median      3Q     Max 
## -1.6546 -0.2143  0.1805  0.5062  2.2075 
## 
## Count model coefficients (poisson with log link):
##                                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                         1.200994   0.010704 112.200  < 2e-16 ***
## STARS                               0.132690   0.006888  19.263  < 2e-16 ***
## LabelAppeal                         0.228233   0.007231  31.565  < 2e-16 ***
## AcidIndex                          -0.082892   0.007238 -11.453  < 2e-16 ***
## TotalSulfurDioxide                  0.006764   0.006072   1.114  0.26528    
## VolatileAcidity                    -0.022319   0.006111  -3.652  0.00026 ***
## I(STARS^2)                          0.041713   0.006501   6.417 1.39e-10 ***
## I(LabelAppeal^2)                   -0.010813   0.005122  -2.111  0.03477 *  
## STARS:LabelAppeal                  -0.032960   0.007480  -4.406 1.05e-05 ***
## STARS:AcidIndex                     0.048403   0.007010   6.905 5.03e-12 ***
## LabelAppeal:AcidIndex               0.004627   0.007478   0.619  0.53606    
## TotalSulfurDioxide:VolatileAcidity  0.006799   0.006387   1.065  0.28707    
## 
## Zero-inflation model coefficients (binomial with logit link):
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.92098    0.03964  -48.46   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Number of iterations in BFGS optimization: 19 
## Log-likelihood: -1.743e+04 on 13 Df

Neative Binomial Regression Models

The Higher-Order Interaction model outperformed all other Negative Binomial models based on Test RMSE and Test R²:

Test RMSE: 1.584490 (lowest among all models)
Test R²: 0.3307130 (highest among all models)
AIC: 36022.24 (better than others except Stepwise)
Residual Deviance: 10837.83

While the Stepwise model has a slightly lower AIC, the Higher-Order Interaction model achieves the best balance of predictive accuracy and generalization performance, making it the best model.

# Load necessary libraries
library(Metrics)
library(knitr)
library(MASS)  # For glm.nb() and stepAIC
library(caret) # For train-test split

# Step 2: Fit Multiple Negative Binomial Models Using the Training Data
# 1. Basic Additive Model
nb_model_additive <- suppressWarnings(glm.nb(TARGET ~ STARS + LabelAppeal + AcidIndex + TotalSulfurDioxide + VolatileAcidity, 
                            data = train_data_refined))

# 2. Interaction Model
nb_model_interaction <- suppressWarnings(glm.nb(TARGET ~ STARS * LabelAppeal + STARS * AcidIndex + AcidIndex * LabelAppeal + 
                                 TotalSulfurDioxide + VolatileAcidity, 
                               data = train_data_refined))

# 3. Higher-Order Interaction Model
nb_model_higher_interaction <- suppressWarnings(glm.nb(TARGET ~ STARS * LabelAppeal + STARS * AcidIndex + AcidIndex * LabelAppeal + 
                                        TotalSulfurDioxide * VolatileAcidity + I(STARS^2) + I(LabelAppeal^2), 
                                      data = train_data_refined))

# 4. Simplified Model for Parsimony
nb_model_simplified <- suppressWarnings(glm.nb(TARGET ~ STARS * AcidIndex, 
                              data = train_data_refined))

# 5. Stepwise Regression Model
# Start with the full model and perform stepwise selection based on AIC
nb_full_model <- suppressWarnings(glm.nb(TARGET ~ ., data = train_data_refined))  # Use all predictors
nb_model_stepwise <- suppressWarnings(stepAIC(nb_full_model, direction = "both", trace = FALSE))  # Stepwise selection

# Step 3: Evaluate Models on Both Training and Test Data
# Define a helper function to calculate metrics for each Negative Binomial model
calculate_nb_metrics <- function(model, train_data_refined, test_data) {
  # Predictions
  train_preds <- predict(model, newdata = train_data_refined, type = "response")
  test_preds <- predict(model, newdata = test_data, type = "response")
  
  # Extract theta (dispersion parameter)
  theta <- model$theta
  
  # Calculate metrics
  metrics <- data.frame(
    Train_R2 = 1 - (sum((train_data_refined$TARGET - train_preds)^2) / sum((train_data_refined$TARGET - mean(train_data_refined$TARGET))^2)),
    Test_R2 = 1 - (sum((test_data$TARGET - test_preds)^2) / sum((test_data$TARGET - mean(test_data$TARGET))^2)),
    Train_RMSE = rmse(train_data_refined$TARGET, train_preds),
    Test_RMSE = rmse(test_data$TARGET, test_preds),
    AIC = AIC(model),
    Null_Deviance = model$null.deviance,  # Null deviance for the model
    Residual_Deviance = model$deviance,   # Residual deviance for the model
    Theta = theta                         # Dispersion parameter
  )
  
  return(metrics)
}

# Apply the helper function to all Negative Binomial models
nb_model_metrics <- rbind(
  Additive = calculate_nb_metrics(nb_model_additive, train_data_refined, test_data),
  Interaction = calculate_nb_metrics(nb_model_interaction, train_data_refined, test_data),
  HigherOrderInteraction = calculate_nb_metrics(nb_model_higher_interaction, train_data_refined, test_data),
  Simplified = calculate_nb_metrics(nb_model_simplified, train_data_refined, test_data),
  Stepwise = calculate_nb_metrics(nb_model_stepwise, train_data_refined, test_data)
)

# Step 4: Sort Models by Test RMSE
# Sort by Test_RMSE (ascending order) to identify the best model
nb_model_metrics_sorted <- nb_model_metrics[order(nb_model_metrics$Test_RMSE), ]

# Step 5: Display the Sorted Metrics in a Table
cat("\nNegative Binomial Model Comparison Metrics Sorted by Test RMSE\n")

## 
## Negative Binomial Model Comparison Metrics Sorted by Test RMSE

nb_model_metrics_sorted

##                         Train_R2   Test_R2 Train_RMSE Test_RMSE      AIC
## HigherOrderInteraction 0.4223398 0.3307130   1.403586  1.584490 36022.24
## Stepwise               0.4061480 0.3205090   1.423121  1.596523 36296.88
## Interaction            0.4216393 0.3190600   1.404437  1.598225 36122.68
## Additive               0.4025238 0.3148464   1.427457  1.603162 36341.05
## Simplified             0.3181135 0.2692658   1.524961  1.655629 37234.73
##                        Null_Deviance Residual_Deviance    Theta
## HigherOrderInteraction      15415.58          10837.83 60781.92
## Stepwise                    15415.59          11110.47 61340.98
## Interaction                 15415.57          10944.26 59020.49
## Additive                    15415.59          11168.64 61039.95
## Simplified                  15415.42          12066.17 46873.72

# Step 6: Identify and Output the Best Negative Binomial Model
# Ensure the best model name is extracted properly
best_nb_model_index <- which.min(nb_model_metrics_sorted$Test_RMSE)  # Get the index of the lowest Test RMSE
best_nb_model_name <- rownames(nb_model_metrics_sorted)[best_nb_model_index]  # Extract the model name
best_nb_model_type <- paste("Negative Binomial Best Model:", best_nb_model_name)  # Include the model type

# Output the best Negative Binomial model name
cat("Best model based on lowest Test RMSE:", best_nb_model_type, "\n")

## Best model based on lowest Test RMSE: Negative Binomial Best Model: HigherOrderInteraction

Zero-Inflated Negative Binomial (ZINB)

The Zero-Inflated Negative Binomial (ZINB) model and the Negative Binomial Higher-Order Interaction (NB_HigherOrder) model were compared:

Performance Metrics:
- NB_HigherOrder: Lower Test RMSE (1.584) and better Test R² (0.331) indicate superior predictive performance.
- ZINB: Higher Test RMSE (1.625) and lower Test R² (0.296) suggest slightly worse predictions but lower AIC (34879.68 vs. 36022.24).
Vuong Test Results:
- The Vuong test strongly favors NB_HigherOrder over ZINB (z-statistics > 16, p-value < 2.22e-16), indicating that the NB_HigherOrder model significantly outperforms ZINB.

In summary, while the ZINB model improves AIC, the NB_HigherOrder model is superior in predictive accuracy and overall performance.

library(pscl)  # For zeroinfl function

# Fit the Zero-Inflated Negative Binomial model
zinb_model <- zeroinfl(
  TARGET ~ STARS * LabelAppeal + STARS * AcidIndex + AcidIndex * LabelAppeal + 
    TotalSulfurDioxide * VolatileAcidity + I(STARS^2) + I(LabelAppeal^2) | 1,
  data = train_data_refined,
  dist = "negbin"
)

# Function to compute model metrics
calculate_model_metrics <- function(model, train_data, test_data) {
  # Predictions
  train_preds <- predict(model, newdata = train_data, type = "response")
  test_preds <- predict(model, newdata = test_data, type = "response")
  
  # Metrics
  metrics <- data.frame(
    Train_R2 = 1 - (sum((train_data$TARGET - train_preds)^2) / sum((train_data$TARGET - mean(train_data$TARGET))^2)),
    Test_R2 = 1 - (sum((test_data$TARGET - test_preds)^2) / sum((test_data$TARGET - mean(test_data$TARGET))^2)),
    Train_RMSE = sqrt(mean((train_data$TARGET - train_preds)^2)),
    Test_RMSE = sqrt(mean((test_data$TARGET - test_preds)^2)),
    AIC = AIC(model),
    LogLikelihood = as.numeric(logLik(model))  # Log-Likelihood
  )
  return(metrics)
}

# Compute metrics for both models
zinb_metrics <- calculate_model_metrics(zinb_model, train_data_refined, test_data)
nb_metrics <- calculate_model_metrics(nb_model_higher_interaction, train_data_refined, test_data)

# Combine metrics for comparison
model_comparison <- rbind(
  ZINB = zinb_metrics,
  NB_HigherOrder = nb_metrics
)

# Display comparison table
print(model_comparison)

##                 Train_R2   Test_R2 Train_RMSE Test_RMSE      AIC LogLikelihood
## ZINB           0.3869180 0.2962219   1.445979  1.624805 34879.68     -17425.84
## NB_HigherOrder 0.4223398 0.3307130   1.403586  1.584490 36022.24     -17998.12

# Vuong Test for model comparison
vuong_test <- vuong(zinb_model, nb_model_higher_interaction)

## Vuong Non-Nested Hypothesis Test-Statistic: 
## (test-statistic is asymptotically distributed N(0,1) under the
##  null that the models are indistinguishible)
## -------------------------------------------------------------
##               Vuong z-statistic             H_A    p-value
## Raw                    16.50382 model1 > model2 < 2.22e-16
## AIC-corrected          16.47499 model1 > model2 < 2.22e-16
## BIC-corrected          16.37147 model1 > model2 < 2.22e-16

# Display Vuong Test results
cat("Vuong Test Results:\n")

## Vuong Test Results:

print(vuong_test)

## NULL

Multiple Linear Regression Models

The Polynomial model outperformed other multiple linear regression (MLR) models with the lowest Test RMSE of 1.550 and the highest Test R² of 0.360, indicating it captures nonlinear relationships effectively. The Higher-Order Interaction model followed closely but with slightly higher error metrics.

# Load necessary libraries
library(Metrics)
library(knitr)
library(MASS)  # For stepAIC function
library(caret) # For train-test split

# Step 2: Fit Multiple MLR Models Using the Training Data
# 1. Basic Additive Model
model_additive <- lm(TARGET ~ STARS + LabelAppeal + AcidIndex + TotalSulfurDioxide + VolatileAcidity, 
                     data = train_data_refined)

# 2. Interaction Model
model_interaction <- lm(TARGET ~ STARS * LabelAppeal + STARS * AcidIndex + AcidIndex * LabelAppeal + 
                          TotalSulfurDioxide + VolatileAcidity, 
                        data = train_data_refined)

# 3. Higher-Order Interaction Model
model_higher_interaction <- lm(TARGET ~ STARS * LabelAppeal + STARS * AcidIndex + AcidIndex * LabelAppeal + 
                                 TotalSulfurDioxide * VolatileAcidity + I(STARS^2) + I(LabelAppeal^2), 
                               data = train_data_refined)

# 4. Simplified Model for Parsimony
model_simplified <- lm(TARGET ~ STARS * AcidIndex, 
                       data = train_data_refined)

# 5. Polynomial Model for Nonlinear Effects
model_polynomial <- lm(TARGET ~ STARS + LabelAppeal + AcidIndex + I(STARS^2) + I(LabelAppeal^2) + 
                         I(AcidIndex^2), 
                       data = train_data_refined)

# 6. Stepwise Regression Model
# Start with the full model and perform stepwise selection based on AIC
full_model <- lm(TARGET ~ ., data = train_data_refined)  # Use all predictors
model_stepwise <- stepAIC(full_model, direction = "both", trace = FALSE)  # Stepwise selection

# Step 3: Evaluate Models on Both Training and Test Data
# Define a helper function to calculate metrics for each model
calculate_metrics <- function(model, train_data_refined, test_data) {
  # Predictions
  train_preds <- predict(model, newdata = train_data_refined)
  test_preds <- predict(model, newdata = test_data)
  
  # Calculate metrics
  metrics <- data.frame(
    Train_R2 = summary(model)$adj.r.squared,
    Test_R2 = 1 - (sum((test_data$TARGET - test_preds)^2) / sum((test_data$TARGET - mean(test_data$TARGET))^2)),
    Train_RMSE = rmse(train_data_refined$TARGET, train_preds),
    Test_RMSE = rmse(test_data$TARGET, test_preds),
    AIC = AIC(model)
  )
  
  return(metrics)
}

# Apply the helper function to all models
model_metrics <- rbind(
  Additive = calculate_metrics(model_additive, train_data_refined, test_data),
  Interaction = calculate_metrics(model_interaction, train_data_refined, test_data),
  HigherOrderInteraction = calculate_metrics(model_higher_interaction, train_data_refined, test_data),
  Simplified = calculate_metrics(model_simplified, train_data_refined, test_data),
  Polynomial = calculate_metrics(model_polynomial, train_data_refined, test_data),
  Stepwise = calculate_metrics(model_stepwise, train_data_refined, test_data)
)

# Step 4: Sort Models by Test RMSE
# Sort by Test_RMSE (ascending order) to identify the best model
model_metrics_sorted <- model_metrics[order(model_metrics$Test_RMSE), ]

# Step 5: Display the Sorted Metrics in a Table
kable(model_metrics_sorted, digits = 3, caption = "Model Comparison Metrics Sorted by Test RMSE")

Model Comparison Metrics Sorted by Test RMSE
	Train_R2	Test_R2	Train_RMSE	Test_RMSE	AIC
Polynomial	0.434	0.360	1.389	1.550	33891.43
HigherOrderInteraction	0.447	0.354	1.373	1.557	33670.02
Interaction	0.426	0.328	1.398	1.587	34022.10
Stepwise	0.408	0.312	1.420	1.606	34333.21
Additive	0.402	0.305	1.428	1.615	34416.15
Simplified	0.302	0.252	1.543	1.675	35914.16

# Step 6: Identify and Output the Best Model with Model Type
# Ensure the best model name is extracted properly
best_model_index <- which.min(model_metrics_sorted$Test_RMSE)  # Get the index of the lowest Test RMSE
best_model_name <- rownames(model_metrics_sorted)[best_model_index]  # Extract the model name
best_model_type <- paste("MLR Best Model:", best_model_name)  # Include the model type

# Output the best model name
cat("Best model based on lowest Test RMSE:", best_model_type, "\n")

## Best model based on lowest Test RMSE: MLR Best Model: Polynomial

Model Comparisons

Best Model Based on Test RMSE

The best-performing model is Linear Regression (Polynomial), achieving the lowest Test RMSE (1.5498) and the highest Test R² (0.3597). This indicates that the Polynomial model captures the underlying patterns in the data more effectively than the other models.

Model Selection for Interpretability

While the Polynomial Based performs best, I may consider Linear Regression (Higher-Order Interaction) or Poisson (Higher-Order Interaction) for interpretability: - Linear Regression (Higher-Order Interaction) has slightly higher Test RMSE (1.5570) but provides a simpler linear interpretation with interaction terms. - Poisson (Higher-Order Interaction) also has a competitive performance (Test RMSE = 1.5845) and aligns well with count data assumptions.

These models balance performance and interpretability, making them easier to explain to stakeholders while still maintaining strong predictive accuracy.

Conclusion

If predictive performance is the sole objective, the Polynomial model is the best. However, for easier interpretation, Linear Regression (Higher-Order Interaction) or Poisson (Higher-Order Interaction) can be selected with minimal trade-offs in performance.

# Initialize storage for metrics across all models
final_metrics <- data.frame(
  Model_Type = character(),
  Model = character(),
  Train_RMSE = numeric(),
  Test_RMSE = numeric(),
  Train_R2 = numeric(),
  Test_R2 = numeric()
)

# Function to evaluate and store model metrics
evaluate_and_store <- function(model, model_name, model_type, train_data, test_data) {
  tryCatch({
    train_predictions <- predict(model, newdata = train_data, type = "response")
    test_predictions <- predict(model, newdata = test_data, type = "response")
    
    # Calculate metrics
    train_rmse <- sqrt(mean((train_data$TARGET - train_predictions)^2))
    test_rmse <- sqrt(mean((test_data$TARGET - test_predictions)^2))
    
    train_r2 <- 1 - (sum((train_data$TARGET - train_predictions)^2) / 
                     sum((train_data$TARGET - mean(train_data$TARGET))^2))
    test_r2 <- 1 - (sum((test_data$TARGET - test_predictions)^2) / 
                    sum((test_data$TARGET - mean(test_data$TARGET))^2))
    
    # Store metrics in the final table
    final_metrics <<- rbind(
      final_metrics, 
      data.frame(
        Model_Type = model_type,
        Model = model_name,
        Train_RMSE = round(train_rmse, 4),
        Test_RMSE = round(test_rmse, 4),
        Train_R2 = round(train_r2, 4),
        Test_R2 = round(test_r2, 4)
      )
    )
  }, error = function(e) {
    # Add NA values if any issue occurs
    final_metrics <<- rbind(
      final_metrics,
      data.frame(
        Model_Type = model_type,
        Model = model_name,
        Train_RMSE = NA, Test_RMSE = NA, Train_R2 = NA, Test_R2 = NA
      )
    )
  })
}

# List all models for comparison
model_list <- list(
  list(name = "Poisson (Higher-Order Interaction)", type = "Poisson", model = poisson_model_higher_interaction),
  list(name = "Zero-Inflated Poisson", type = "ZIP", model = zip_model_higher_interaction),
  list(name = "Negative Binomial (Higher-Order Interaction)", type = "Negative Binomial", model = nb_model_higher_interaction),
  list(name = "Zero-Inflated Negative Binomial", type = "ZINB", model = zinb_model),
  list(name = "Linear Regression (Polynomial)", type = "MLR", model = model_polynomial),
  list(name = "Linear Regression (Higher-Order Interaction)", type = "MLR", model = model_higher_interaction),
  list(name = "Linear Regression (Stepwise)", type = "MLR", model = model_stepwise),
  list(name = "Linear Regression (Interaction)", type = "MLR", model = model_interaction),
  list(name = "Linear Regression (Simplified)", type = "MLR", model = model_simplified)
)

# Evaluate all models
for (entry in model_list) {
  evaluate_and_store(entry$model, entry$name, entry$type, train_data_refined, test_data)
}

# Sort the data frame by Test RMSE
final_metrics_sorted <- final_metrics[order(final_metrics$Test_RMSE), ]

# Display the final consolidated metrics table
print(final_metrics_sorted)

##          Model_Type                                        Model Train_RMSE
## 5               MLR               Linear Regression (Polynomial)     1.3894
## 6               MLR Linear Regression (Higher-Order Interaction)     1.3729
## 1           Poisson           Poisson (Higher-Order Interaction)     1.4036
## 3 Negative Binomial Negative Binomial (Higher-Order Interaction)     1.4036
## 8               MLR              Linear Regression (Interaction)     1.3985
## 7               MLR                 Linear Regression (Stepwise)     1.4204
## 2               ZIP                        Zero-Inflated Poisson     1.4460
## 4              ZINB              Zero-Inflated Negative Binomial     1.4460
## 9               MLR               Linear Regression (Simplified)     1.5427
##   Test_RMSE Train_R2 Test_R2
## 5    1.5498   0.4340  0.3597
## 6    1.5570   0.4474  0.3537
## 1    1.5845   0.4223  0.3307
## 3    1.5845   0.4223  0.3307
## 8    1.5873   0.4265  0.3283
## 7    1.6064   0.4085  0.3120
## 2    1.6248   0.3869  0.2962
## 4    1.6248   0.3869  0.2962
## 9    1.6748   0.3022  0.2522

# Visualize Test RMSE across models
library(ggplot2)
ggplot(final_metrics_sorted, aes(x = reorder(Model, Test_RMSE), y = Test_RMSE, fill = Model_Type)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(title = "Test RMSE Comparison Across Models", x = "Models", y = "Test RMSE") +
  theme_minimal()

Residual Analysis for the Best Model (Polynomial) and Interpretability Models (Higher-Order Interaction - Linear Regression and Poisson)

Residual Analysis for the Polynomial Model

Residuals vs Fitted: Clear pattern indicates nonlinearity or heteroscedasticity.
Histogram: Residuals are slightly skewed, deviating from normality.
QQ-Plot: Residuals deviate from theoretical quantiles, especially at the tails.
Scale-Location: Fan-shaped pattern confirms heteroscedasticity.

Conclusion:

The Polynomial model captures the data reasonably well, but patterns in the residuals and non-constant variance suggest it could be improved.
Consider transformations of predictors, weighted regression, or alternative models like Generalized Additive Models (GAMs) to address these issues.

# Residual Analysis for Polynomial Model
par(mfrow = c(2, 2))  # Set 2x2 plot layout

# Residuals
residuals_poly <- residuals(model_polynomial)
fitted_poly <- predict(model_polynomial, newdata = train_data_refined)

# Plot 1: Residuals vs Fitted Values
plot(fitted_poly, residuals_poly, 
     main = "Polynomial Model: Residuals vs Fitted",
     xlab = "Fitted Values", ylab = "Residuals", pch = 20)
abline(h = 0, col = "red")

# Plot 2: Histogram of Residuals
hist(residuals_poly, breaks = 30, main = "Histogram of Residuals", xlab = "Residuals", col = "gray")

# Plot 3: QQ-Plot of Residuals
qqnorm(residuals_poly, main = "QQ-Plot of Residuals")
qqline(residuals_poly, col = "red")

# Plot 4: Scale-Location Plot
plot(fitted_poly, abs(residuals_poly)^0.5,
     main = "Scale-Location Plot",
     xlab = "Fitted Values", ylab = "Sqrt(|Residuals|)", pch = 20)
abline(h = 0, col = "red")

Residual Analysis for Linear Regression (Higher-Order Interaction)

Residuals vs Fitted: Patterned trend suggests nonlinearity and heteroscedasticity.
Histogram: Residuals appear approximately symmetric but show slight skewness.
QQ-Plot: Deviations from the line, especially at tails, indicate non-normal residuals.
Scale-Location: Fan-shaped spread confirms heteroscedasticity.

Conclusion: Similar to the polynomial model, this model exhibits non-constant variance and residual non-normality, signaling a need for further refinements or transformations.

# Residual Analysis for Higher-Order Interaction Linear Model
par(mfrow = c(2, 2))  # Set 2x2 plot layout

# Residuals
residuals_lm <- residuals(model_higher_interaction)
fitted_lm <- predict(model_higher_interaction, newdata = train_data_refined)

# Plot 1: Residuals vs Fitted Values
plot(fitted_lm, residuals_lm, 
     main = "Linear Model: Residuals vs Fitted",
     xlab = "Fitted Values", ylab = "Residuals", pch = 20)
abline(h = 0, col = "red")

# Plot 2: Histogram of Residuals
hist(residuals_lm, breaks = 30, main = "Histogram of Residuals", xlab = "Residuals", col = "gray")

# Plot 3: QQ-Plot of Residuals
qqnorm(residuals_lm, main = "QQ-Plot of Residuals")
qqline(residuals_lm, col = "red")

# Plot 4: Scale-Location Plot
plot(fitted_lm, abs(residuals_lm)^0.5,
     main = "Scale-Location Plot",
     xlab = "Fitted Values", ylab = "Sqrt(|Residuals|)", pch = 20)
abline(h = 0, col = "red")

Residual Analysis for Poisson Higher-Order Interaction Model

For Poisson models, deviance residuals are used instead of raw residuals.

Residuals vs Fitted: Clear nonlinear pattern and funnel shape suggest heteroscedasticity and model misspecification.
Histogram: Deviance residuals appear skewed to the right, indicating potential overdispersion.
QQ-Plot: Residuals deviate from the line, especially at tails, signaling non-normality.
Scale-Location: Fan-shaped spread confirms heteroscedasticity, with increasing variance at higher fitted values.

Conclusion: The Poisson model struggles with overdispersion and heteroscedasticity, suggesting a need for a Negative Binomial or Zero-Inflated model to better capture the variance structure.

# Residual Analysis for Poisson Higher-Order Interaction Model
par(mfrow = c(2, 2))  # Set 2x2 plot layout

# Deviance Residuals
deviance_resid_poisson <- residuals(poisson_model_higher_interaction, type = "deviance")
fitted_poisson <- predict(poisson_model_higher_interaction, newdata = train_data_refined, type = "response")

# Plot 1: Residuals vs Fitted Values
plot(fitted_poisson, deviance_resid_poisson, 
     main = "Poisson Model: Residuals vs Fitted",
     xlab = "Fitted Values", ylab = "Deviance Residuals", pch = 20)
abline(h = 0, col = "red")

# Plot 2: Histogram of Residuals
hist(deviance_resid_poisson, breaks = 30, main = "Histogram of Deviance Residuals", xlab = "Residuals", col = "gray")

# Plot 3: QQ-Plot of Residuals
qqnorm(deviance_resid_poisson, main = "QQ-Plot of Deviance Residuals")
qqline(deviance_resid_poisson, col = "red")

# Plot 4: Scale-Location Plot
plot(fitted_poisson, abs(deviance_resid_poisson)^0.5,
     main = "Scale-Location Plot",
     xlab = "Fitted Values", ylab = "Sqrt(|Deviance Residuals|)", pch = 20)
abline(h = 0, col = "red")

Residual Analysis for the Negative Binomial Higher-Order Interaction Model

Residuals vs Fitted Plot:
- Residuals appear centered around zero with no major curvature, suggesting the model captures the relationship reasonably well.
- However, some slight funneling is observed, indicating mild heteroscedasticity at higher fitted values.
Histogram of Pearson Residuals:
- The residuals exhibit a near-symmetric distribution centered around zero, with slight skewness.
- No extreme outliers are visible, indicating reasonable model fit.
QQ-Plot of Residuals:
- Residuals follow the theoretical quantiles closely, with some deviation at the tails.
- This suggests that while the residuals approximate normality, they may still contain mild non-normal patterns.
Scale-Location Plot:
- Residual variance appears relatively stable across fitted values, with only minor increases at higher values.
- This confirms that heteroscedasticity is reduced compared to the Poisson model.

Conclusion: The Negative Binomial model effectively handles overdispersion and demonstrates better residual behavior than the Poisson model. There are minor issues with residual tails and mild heteroscedasticity, but overall, the model is a robust choice for the given data.

# Residual Analysis for Negative Binomial Higher-Order Interaction Model
par(mfrow = c(2, 2))  # Set plot layout to 2x2

# Plot 1: Residuals vs Fitted
plot(nb_model_higher_interaction$fitted.values, 
     residuals(nb_model_higher_interaction, type = "pearson"), 
     xlab = "Fitted Values", ylab = "Pearson Residuals",
     main = "Negative Binomial: Residuals vs Fitted")
abline(h = 0, col = "red", lwd = 2)

# Plot 2: Histogram of Residuals
hist(residuals(nb_model_higher_interaction, type = "pearson"), 
     main = "Histogram of Pearson Residuals", 
     xlab = "Residuals", breaks = 50)

# Plot 3: QQ-Plot of Residuals
qqnorm(residuals(nb_model_higher_interaction, type = "pearson"), 
       main = "QQ-Plot of Residuals")
qqline(residuals(nb_model_higher_interaction, type = "pearson"), col = "red", lwd = 2)

# Plot 4: Scale-Location Plot
plot(nb_model_higher_interaction$fitted.values, 
     sqrt(abs(residuals(nb_model_higher_interaction, type = "pearson"))), 
     xlab = "Fitted Values", ylab = "Sqrt(|Pearson Residuals|)", 
     main = "Scale-Location Plot")
abline(h = 0, col = "red", lwd = 2)

par(mfrow = c(1, 1))  # Reset plot layout

Final Model Selection: Negative Binomial Higher-Order Interaction Model

Rationale for Selection:

Test RMSE Comparison:
- Polynomial MLR: 1.549 (lowest Test RMSE)
- Higher-Order Interaction MLR: 1.557
- Poisson Higher-Order Interaction: 1.584
- Negative Binomial Higher-Order Interaction: 1.584
While the Polynomial MLR has the lowest Test RMSE, the difference is minimal compared to the Higher-Order MLR and Poisson/Negative Binomial models.
Suitability for Count Data:
- The TARGET variable is count data, making Poisson and Negative Binomial models more appropriate. Linear regression models (MLR) assume normally distributed residuals, which is violated here.
Residual Analysis:
- Polynomial and Higher-Order MLR: Residuals show clear heteroscedasticity, indicating issues with constant variance and potential model misspecification.
- Poisson Higher-Order Interaction: Deviance residuals show some improvement but still exhibit heteroscedasticity.
- Negative Binomial Higher-Order Interaction: Pearson residuals are better distributed with less heteroscedasticity, indicating improved handling of variability.
Overdispersion:
- The Negative Binomial model explicitly accounts for overdispersion (variance > mean), a common issue with count data that the Poisson model does not handle as well.

Final Decision:

The Negative Binomial Higher-Order Interaction Model is the best choice for prediction. It balances predictive accuracy, handles overdispersion effectively, and demonstrates better residual behavior. While the Polynomial MLR has a slightly lower Test RMSE, its assumptions are not appropriate for count data, and the Negative Binomial model is more robust and reliable.

Predict and save the predicted values

For prediction, we will submit the predicted values using the Negative Binomial Higher-Order Interaction Model, as it is the most appropriate and robust choice. However, we will also generate predictions for the Polynomial MLR, Higher-Order Interaction MLR, and Poisson Higher-Order Interaction Model to visually compare their predictions for further analysis and insights.

Prediction Using the Negative Binomial Higher-Order Interaction Model

# Prepare the evaluation dataset and ensure consistent row order
eval_data_prepared <- boxcox_eval_data %>%
  dplyr::select(-TARGET, -INDEX) %>%  # Drop TARGET and INDEX columns
  dplyr::mutate(ID = boxcox_eval_data$INDEX)  # Retain INDEX for prediction alignment

# Convert all matrix columns to numeric vectors in the evaluation dataset
eval_data_clean <- eval_data_prepared %>%
  dplyr::mutate(across(where(is.matrix), ~ as.numeric(.)))  # Convert matrices to numeric

# Predict using the Negative Binomial Higher-Order Interaction model
nb_higher_interaction_predictions <- predict(nb_model_higher_interaction, 
                                             newdata = eval_data_clean, 
                                             type = "response")

# Create a DataFrame with predictions
nb_predictions_df <- data.frame(
  ID = eval_data_clean$ID,  # Add back the ID column
  Predictions = nb_higher_interaction_predictions
)

# Save predictions to CSV
write.csv(nb_predictions_df, "nb_higher_interaction_predictions.csv", row.names = FALSE)
cat("Negative Binomial (Higher-Order Interaction) predictions saved to 'nb_higher_interaction_predictions.csv'\n")

## Negative Binomial (Higher-Order Interaction) predictions saved to 'nb_higher_interaction_predictions.csv'

cat("\nPreview the prediction\n")

## 
## Preview the prediction

glimpse(nb_predictions_df)

## Rows: 3,335
## Columns: 2
## $ ID          <int> 3, 9, 10, 18, 21, 30, 31, 37, 39, 47, 60, 62, 63, 64, 68, …
## $ Predictions <dbl> 2.886945, 3.722064, 1.958305, 1.680948, 1.803885, 6.319411…

Prediction Using the using the Polynomial MLR model

# Convert all matrix columns to numeric vectors in the evaluation dataset
eval_data_clean <- eval_data_prepared %>%
  dplyr::mutate(across(where(is.matrix), ~ as.numeric(.)))  # Convert matrices to numeric

# Predict using the Polynomial MLR model
mlr_polynomial_predictions <- predict(model_polynomial, 
                                      newdata = eval_data_clean)

# Create a DataFrame with predictions
mlr_polynomial_predictions_df <- data.frame(
  ID = eval_data_clean$ID,  # Add back the ID column
  Predictions = mlr_polynomial_predictions
)

# Save predictions to CSV
write.csv(mlr_polynomial_predictions_df, "mlr_polynomial_predictions.csv", row.names = FALSE)
cat("\nMLR (Polynomial) predictions saved to 'mlr_polynomial_predictions.csv'\n")

## 
## MLR (Polynomial) predictions saved to 'mlr_polynomial_predictions.csv'

cat("\nPreview the prediction\n")

## 
## Preview the prediction

glimpse(mlr_polynomial_predictions_df)

## Rows: 3,335
## Columns: 2
## $ ID          <int> 3, 9, 10, 18, 21, 30, 31, 37, 39, 47, 60, 62, 63, 64, 68, …
## $ Predictions <dbl> 2.347799, 3.603266, 2.370396, 1.735872, 1.658376, 5.786728…

Prediction using the Higher-Order Interaction MLR model

# Convert all matrix columns to numeric vectors in the evaluation dataset
eval_data_clean <- eval_data_prepared %>%
  dplyr::mutate(across(where(is.matrix), ~ as.numeric(.)))  # Convert matrices to numeric


# Predict using the Higher-Order Interaction MLR model
mlr_higher_interaction_predictions <- predict(model_higher_interaction, 
                                              newdata = eval_data_clean)

# Create a DataFrame with predictions
mlr_predictions_df <- data.frame(
  ID = eval_data_clean$ID,  # Add back the ID column
  Predictions = mlr_higher_interaction_predictions
)

# Save predictions to CSV
write.csv(mlr_predictions_df, "mlr_higher_interaction_predictions.csv", row.names = FALSE)
cat("MLR (Higher-Order Interaction) predictions saved to 'mlr_higher_interaction_predictions.csv'\n")

## MLR (Higher-Order Interaction) predictions saved to 'mlr_higher_interaction_predictions.csv'

cat("\nPreview the prediction\n")

## 
## Preview the prediction

glimpse(mlr_predictions_df)

## Rows: 3,335
## Columns: 2
## $ ID          <int> 3, 9, 10, 18, 21, 30, 31, 37, 39, 47, 60, 62, 63, 64, 68, …
## $ Predictions <dbl> 2.748037, 3.651464, 1.942896, 1.760075, 1.658487, 5.983839…

Predictions Using the Poisson Higher-Order Interaction Model

# Convert all matrix columns to numeric vectors in the evaluation dataset
eval_data_clean <- eval_data_prepared %>%
  dplyr::mutate(across(where(is.matrix), ~ as.numeric(.)))  # Convert matrices to numeric

# Predict using the Poisson Higher-Order Interaction model
poisson_higher_interaction_predictions <- predict(poisson_model_higher_interaction, 
                                                  newdata = eval_data_clean, 
                                                  type = "response")

# Create a DataFrame with predictions
poisson_predictions_df <- data.frame(
  ID = eval_data_clean$ID,  # Add back the ID column
  Predictions = poisson_higher_interaction_predictions
)

# Save predictions to CSV
write.csv(poisson_predictions_df, "poisson_higher_interaction_predictions.csv", row.names = FALSE)
cat("Poisson (Higher-Order Interaction) predictions saved to 'poisson_higher_interaction_predictions.csv'\n")

## Poisson (Higher-Order Interaction) predictions saved to 'poisson_higher_interaction_predictions.csv'

cat("\nPreview the prediction\n")

## 
## Preview the prediction

glimpse(poisson_predictions_df)

## Rows: 3,335
## Columns: 2
## $ ID          <int> 3, 9, 10, 18, 21, 30, 31, 37, 39, 47, 60, 62, 63, 64, 68, …
## $ Predictions <dbl> 2.886941, 3.722066, 1.958307, 1.680945, 1.803893, 6.319380…

Comparison and Evaluation of Model Predictions Across Selected Models

The pairwise scatter plot shows a strong alignment between predictions across all four models:
- Negative Binomial Higher-Order, MLR Polynomial, MLR Higher-Order, and Poisson Higher-Order Interaction.

Key Observations:

Linear Trends:
- Predictions from all models exhibit high correlation with clear linear relationships.
- This suggests consistency in predictive patterns across the models.
Differences in Spread:
- MLR Polynomial and MLR Higher-Order predictions display slightly more variability in the mid-range values compared to the Poisson and Negative Binomial models.
Alignment Between Poisson and Negative Binomial:
- The Poisson Higher-Order and Negative Binomial Higher-Order predictions are almost identical, reinforcing their comparable Test RMSE values and suitability for count data.

Conclusion:

While predictions from all models are highly correlated, the Negative Binomial Higher-Order Interaction Model remains the preferred model for final prediction due to its robustness in handling overdispersion and suitability for the data.

# Combine predictions into a single data frame
final_predictions_df <- data.frame(
  ID = nb_predictions_df$ID,  # Ensure alignment with ID
  NegativeBinomial_HigherOrder = nb_predictions_df$Predictions,
  MLR_Polynomial = mlr_polynomial_predictions_df$Predictions,
  MLR_HigherOrder = mlr_predictions_df$Predictions,
  Poisson_HigherOrder = poisson_predictions_df$Predictions
)

# Display summary of predictions
cat("Summary of Predictions for the Four Models:\n")

## Summary of Predictions for the Four Models:

print(summary(final_predictions_df[, -1]))  # Exclude ID column for summary

##  NegativeBinomial_HigherOrder MLR_Polynomial    MLR_HigherOrder
##  Min.   :0.9045               Min.   :-0.5212   Min.   :0.506  
##  1st Qu.:2.2338               1st Qu.: 2.3704   1st Qu.:2.265  
##  Median :2.8890               Median : 3.0215   Median :2.914  
##  Mean   :3.1507               Mean   : 3.1454   Mean   :3.147  
##  3rd Qu.:3.8618               3rd Qu.: 3.8219   3rd Qu.:3.903  
##  Max.   :9.2831               Max.   : 7.1249   Max.   :8.043  
##  Poisson_HigherOrder
##  Min.   :0.9045     
##  1st Qu.:2.2338     
##  Median :2.8890     
##  Mean   :3.1506     
##  3rd Qu.:3.8618     
##  Max.   :9.2830

# Generate pairwise scatter plots
pairs(
  final_predictions_df[, -1],  # Exclude ID column for scatter plots
  main = "Pairwise Scatter Plot of Model Predictions",
  pch = 21, 
  bg = c("blue", "green", "red", "purple")
)

# Save combined predictions to a CSV
write.csv(final_predictions_df, "combined_predictions.csv", row.names = FALSE)
cat("Combined predictions saved to 'combined_predictions.csv'\n")

## Combined predictions saved to 'combined_predictions.csv'

Wine Data Analysis and Modeling

Warner Alexis, Saloua Daouki, Souleymane Doumbia, Fomba Kassoh, Lewris Mota Sanchez

2024-10-08

Abstract

Introduction

Data Exploration

Load Libraries and Data

Summary Statistics and Missing Data

analysis of the Summary Statistics:

Suitability of Models Based on Summary Statistics

Visualizations

Key Variables for Visualization Analysis:

Target Variable (TARGET)

Highly Skewed Variables (e.g., ResidualSugar, FreeSulfurDioxide)

LabelAppeal and STARS

Correlations Between Numeric Predictors and TARGET

Zero-Inflation and Overdispersion

Data Preparation

Impute Missing Values

Transform Data with Box-Cox and Yeo-Johnson

Visualize Skewness Before and After Transformations

Identify Highly Correlated Variables and Plot Correlations

Confirmation of No Multicollinearity with VIF Before Model Fitting

Scale the data

Build Models

Data Splitting, and Outlier Analysis Before Model Training**

Plot Decision Tree to identify interactions

Examine variable importatance for modeling

Key Metrics Comparison:

Zero-inflated Poisson Regression Models - Use the best Poisson Model’s Formula

Neative Binomial Regression Models

Zero-Inflated Negative Binomial (ZINB)

Multiple Linear Regression Models

Model Comparisons

Best Model Based on Test RMSE

Model Selection for Interpretability

Conclusion

Residual Analysis for the Best Model (Polynomial) and Interpretability Models (Higher-Order Interaction - Linear Regression and Poisson)

Residual Analysis for the Polynomial Model

Conclusion:

Residual Analysis for Linear Regression (Higher-Order Interaction)

Residual Analysis for Poisson Higher-Order Interaction Model

Residual Analysis for the Negative Binomial Higher-Order Interaction Model

Final Model Selection: Negative Binomial Higher-Order Interaction Model

Rationale for Selection:

Final Decision:

Predict and save the predicted values

Prediction Using the Negative Binomial Higher-Order Interaction Model

Prediction Using the using the Polynomial MLR model

Prediction using the Higher-Order Interaction MLR model

Predictions Using the Poisson Higher-Order Interaction Model

Comparison and Evaluation of Model Predictions Across Selected Models

Key Observations:

Conclusion: