This study analyzes and models wine sales data to predict the number of wine cases sold based on chemical and marketing features. After extensive data exploration, preprocessing (missing value imputation, transformations), and outlier handling, multiple models were built and evaluated, including Linear Regression (MLR), Poisson Regression, and Negative Binomial Regression.
The Negative Binomial Higher-Order Interaction Model emerged as the final model for prediction. It effectively handles overdispersion inherent in the target variable and achieves competitive predictive performance (Test RMSE = 1.584). Comparisons with the Polynomial MLR and Poisson Higher-Order Interaction models highlight the robustness of the Negative Binomial model in balancing residual behavior, accuracy, and suitability for count data.
Residual analyses revealed patterns of heteroscedasticity and non-normality in MLR and Poisson models, reinforcing the superiority of the Negative Binomial approach. Predictions across models were highly correlated, as confirmed by pairwise scatter plots, demonstrating consistency in predictive trends.
For submission, the Negative Binomial Higher-Order Interaction Model predictions will be provided, with visual comparisons to the Polynomial MLR, Higher-Order Interaction MLR, and Poisson Higher-Order Interaction models for further validation and insights.
This document outlines the analysis and modeling of the wine dataset using R. The goal is to predict the number of wine cases sold based on chemical and marketing features.
# Load required libraries
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
library(pscl)
## Classes and Methods for R originally developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University (2002-2015),
## by and under the direction of Simon Jackman.
## hurdle and zeroinfl functions by Achim Zeileis.
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
# Load data
train_data <- read.csv("https://raw.githubusercontent.com/hawa1983/DATA-621/refs/heads/main/Homework%205/wine-training-data.csv")
eval_data <- read.csv("https://raw.githubusercontent.com/hawa1983/DATA-621/refs/heads/main/Homework%205/wine-evaluation-data.csv")
cat("\nStructure of the training data\n")
##
## Structure of the training data
str(train_data)
## 'data.frame': 12795 obs. of 16 variables:
## $ INDEX : int 1 2 4 5 6 7 8 11 12 13 ...
## $ TARGET : int 3 3 5 3 4 0 0 4 3 6 ...
## $ FixedAcidity : num 3.2 4.5 7.1 5.7 8 11.3 7.7 6.5 14.8 5.5 ...
## $ VolatileAcidity : num 1.16 0.16 2.64 0.385 0.33 0.32 0.29 -1.22 0.27 -0.22 ...
## $ CitricAcid : num -0.98 -0.81 -0.88 0.04 -1.26 0.59 -0.4 0.34 1.05 0.39 ...
## $ ResidualSugar : num 54.2 26.1 14.8 18.8 9.4 ...
## $ Chlorides : num -0.567 -0.425 0.037 -0.425 NA 0.556 0.06 0.04 -0.007 -0.277 ...
## $ FreeSulfurDioxide : num NA 15 214 22 -167 -37 287 523 -213 62 ...
## $ TotalSulfurDioxide: num 268 -327 142 115 108 15 156 551 NA 180 ...
## $ Density : num 0.993 1.028 0.995 0.996 0.995 ...
## $ pH : num 3.33 3.38 3.12 2.24 3.12 3.2 3.49 3.2 4.93 3.09 ...
## $ Sulphates : num -0.59 0.7 0.48 1.83 1.77 1.29 1.21 NA 0.26 0.75 ...
## $ Alcohol : num 9.9 NA 22 6.2 13.7 15.4 10.3 11.6 15 12.6 ...
## $ LabelAppeal : int 0 -1 -1 -1 0 0 0 1 0 0 ...
## $ AcidIndex : int 8 7 8 6 9 11 8 7 6 8 ...
## $ STARS : int 2 3 3 1 2 NA NA 3 NA 4 ...
cat("\nStructure of the evaluation data\n")
##
## Structure of the evaluation data
str(eval_data)
## 'data.frame': 3335 obs. of 16 variables:
## $ INDEX : int 3 9 10 18 21 30 31 37 39 47 ...
## $ TARGET : logi NA NA NA NA NA NA ...
## $ FixedAcidity : num 5.4 12.4 7.2 6.2 11.4 17.6 15.5 15.9 11.6 3.8 ...
## $ VolatileAcidity : num -0.86 0.385 1.75 0.1 0.21 0.04 0.53 1.19 0.32 0.22 ...
## $ CitricAcid : num 0.27 -0.76 0.17 1.8 0.28 -1.15 -0.53 1.14 0.55 0.31 ...
## $ ResidualSugar : num -10.7 -19.7 -33 1 1.2 1.4 4.6 31.9 -50.9 -7.7 ...
## $ Chlorides : num 0.092 1.169 0.065 -0.179 0.038 ...
## $ FreeSulfurDioxide : num 23 -37 9 104 70 -250 10 115 35 40 ...
## $ TotalSulfurDioxide: num 398 68 76 89 53 140 17 381 83 129 ...
## $ Density : num 0.985 0.99 1.046 0.989 1.029 ...
## $ pH : num 5.02 3.37 4.61 3.2 2.54 3.06 3.07 2.99 3.32 4.72 ...
## $ Sulphates : num 0.64 1.09 0.68 2.11 -0.07 -0.02 0.75 0.31 2.18 -0.64 ...
## $ Alcohol : num 12.3 16 8.55 12.3 4.8 11.4 8.5 11.4 -0.5 10.9 ...
## $ LabelAppeal : int -1 0 0 -1 0 1 0 1 0 0 ...
## $ AcidIndex : int 6 6 8 8 10 8 12 7 12 7 ...
## $ STARS : int NA 2 1 1 NA 4 3 NA NA NA ...
# Load necessary libraries
library(kableExtra)
##
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
##
## group_rows
library(dplyr)
library(tidyr)
library(moments) # For skewness and kurtosis
# Create vectors of numeric and categorical variables
numeric_vars <- names(train_data)[sapply(train_data, is.numeric)]
categorical_vars <- names(train_data)[sapply(train_data, is.factor)]
# Correctly select numerical variables using the predefined numeric_vars
numerical_vars <- train_data %>%
dplyr::select(all_of(numeric_vars))
# Compute statistical summary including skewness, kurtosis, missing value counts, and percentages
statistical_summary <- numerical_vars %>%
summarise(across(
everything(),
list(
Min = ~round(min(., na.rm = TRUE), 2),
Q1 = ~round(quantile(., 0.25, na.rm = TRUE), 2),
Mean = ~round(mean(., na.rm = TRUE), 2),
Median = ~round(median(., na.rm = TRUE), 2),
Q3 = ~round(quantile(., 0.75, na.rm = TRUE), 2),
Max = ~round(max(., na.rm = TRUE), 2),
SD = ~round(sd(., na.rm = TRUE), 2),
Variance = ~round(var(., na.rm = TRUE), 2),
CV = ~round(sd(., na.rm = TRUE) / mean(., na.rm = TRUE), 2),
Skewness = ~round(skewness(., na.rm = TRUE), 2),
Kurtosis = ~round(kurtosis(., na.rm = TRUE), 2),
ZeroCount = ~sum(. == 0, na.rm = TRUE),
ZeroProportion = ~round(mean(. == 0, na.rm = TRUE) * 100, 2),
Missing = ~sum(is.na(.)),
PercentMissing = ~round(mean(is.na(.)) * 100, 2),
Overdispersion = ~round(var(., na.rm = TRUE) / mean(., na.rm = TRUE), 2),
Outliers = ~sum(. > (quantile(., 0.75, na.rm = TRUE) + 1.5 * IQR(., na.rm = TRUE)) |
. < (quantile(., 0.25, na.rm = TRUE) - 1.5 * IQR(., na.rm = TRUE)), na.rm = TRUE)
),
.names = "{.col}_{.fn}"
)) %>%
pivot_longer(
cols = everything(),
names_to = c("Variable", ".value"),
names_pattern = "^(.*)_(.*)$"
)
# Display the resulting summary table
statistical_summary
## # A tibble: 16 × 18
## Variable Min Q1 Mean Median Q3 Max SD Variance
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 INDEX 1 4038. 8070. 8110 1.21e+4 1.61e4 4657. 2.17e+7
## 2 TARGET 0 2 3.03 3 4 e+0 8 e0 1.93 3.71e+0
## 3 FixedAcidity -18.1 5.2 7.08 6.9 9.5 e+0 3.44e1 6.32 3.99e+1
## 4 VolatileAcid… -2.79 0.13 0.32 0.28 6.4 e-1 3.68e0 0.78 6.1 e-1
## 5 CitricAcid -3.24 0.03 0.31 0.31 5.8 e-1 3.86e0 0.86 7.4 e-1
## 6 ResidualSugar -128. -2 5.42 3.9 1.59e+1 1.41e2 33.8 1.14e+3
## 7 Chlorides -1.17 -0.03 0.05 0.05 1.5 e-1 1.35e0 0.32 1 e-1
## 8 FreeSulfurDi… -555 0 30.8 30 7 e+1 6.23e2 149. 2.21e+4
## 9 TotalSulfurD… -823 27 121. 123 2.08e+2 1.06e3 232. 5.38e+4
## 10 Density 0.89 0.99 0.99 0.99 1 e+0 1.1 e0 0.03 0
## 11 pH 0.48 2.96 3.21 3.2 3.47e+0 6.13e0 0.68 4.6 e-1
## 12 Sulphates -3.13 0.28 0.53 0.5 8.6 e-1 4.24e0 0.93 8.7 e-1
## 13 Alcohol -4.7 9 10.5 10.4 1.24e+1 2.65e1 3.73 1.39e+1
## 14 LabelAppeal -2 -1 -0.01 0 1 e+0 2 e0 0.89 7.9 e-1
## 15 AcidIndex 4 7 7.77 8 8 e+0 1.7 e1 1.32 1.75e+0
## 16 STARS 1 1 2.04 2 3 e+0 4 e0 0.9 8.1 e-1
## # ℹ 9 more variables: CV <dbl>, Skewness <dbl>, Kurtosis <dbl>,
## # ZeroCount <int>, ZeroProportion <dbl>, Missing <int>, PercentMissing <dbl>,
## # Overdispersion <dbl>, Outliers <int>
The summary statistics highlights key metrics for numerical variables in the dataset, including central tendencies (mean, median), variability (standard deviation, variance), distribution shape (skewness, kurtosis), and data integrity measures (missing values, outliers). Here’s a statement tailored for this modeling context:
Given the observed distribution of the TARGET variable and the characteristics of predictor variables:
This summary emphasizes the appropriate models and preprocessing steps based on the data’s statistical characteristics.
Based on the summary statistics and modeling context, here are the key variables and visuals. These visualizations will provide critical insights into the data distribution, relationships, and model suitability.
The histogram and bar plot reveal a multimodal distribution of the
TARGET
variable, with prominent peaks at 0, 3, and 4, and
very few occurrences at higher values. The excess of zeros in the data
suggests potential zero-inflation, which makes the TARGET
variable suitable for models like Zero-Inflated Poisson and
Zero-Inflated Negative Binomial, in addition to standard Poisson and
Negative Binomial models. The non-uniformity and zero-inflation also
indicate that Multiple Regression models will need to account for the
skewed and count-based nature of the TARGET
variable to
ensure accurate predictions. These visuals underscore the importance of
considering overdispersion and zero-inflation when selecting statistical
models for analysis.
library(ggplot2)
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
# Remove NAs from ResidualSugar
numerical_vars_clean <- numerical_vars %>%
dplyr::filter(!is.na(ResidualSugar))
# Create the histogram of TARGET
histogram_plot <- ggplot(numerical_vars_clean, aes(x = TARGET)) +
geom_histogram(binwidth = 1, fill = "black", color = "black", alpha = 0.6) +
labs(title = "Histogram of TARGET", x = "TARGET", y = "Frequency") +
theme_minimal()
# Create the bar plot of TARGET value counts
bar_plot <- ggplot(numerical_vars_clean, aes(x = factor(TARGET))) +
geom_bar(fill = "black", color = "black", alpha = 0.6) +
labs(title = "Bar Plot of TARGET Value Counts", x = "TARGET", y = "Count") +
theme_minimal()
# Use gridExtra to arrange the plots
gridExtra::grid.arrange(histogram_plot, bar_plot, ncol = 1)
The histogram and bar plot reveal a multimodal distribution of the
TARGET
variable, with prominent peaks around 0, 3, and 4,
and fewer occurrences at higher values. This suggests that
TARGET
is not evenly distributed, making it suitable for
count-based regression models such as Poisson, Negative Binomial, and
Zero-Inflated models. The observed non-uniformity highlights the
importance of selecting appropriate statistical models to account for
overdispersion and the excess zero counts. Multiple regression models
may also need to address the variable’s skewed distribution for accurate
prediction.
The ResidualSugar
variable shows heavy skewness and
numerous outliers, as seen in the density and box plots. These issues
suggest the need for transformations (e.g., log or Box-Cox) to stabilize
variance and normalize the distribution. For Poisson
and Negative Binomial models, overdispersion and
extreme values may require robust handling, such as Zero-Inflated
models. For multiple regression, preprocessing the
variable will likely improve model performance and interpretability.
R Code:
library(ggplot2)
library(gridExtra)
# Remove NAs from ResidualSugar
numerical_vars_clean <- numerical_vars %>%
dplyr::filter(!is.na(ResidualSugar))
# Density plot of ResidualSugar
density_plot <- ggplot(numerical_vars_clean, aes(x = ResidualSugar)) +
geom_density(fill = "black", alpha = 0.6) +
labs(title = "Density Plot of ResidualSugar", x = "ResidualSugar", y = "Density") +
theme_minimal()
# Box plot of ResidualSugar
box_plot <- ggplot(numerical_vars_clean, aes(y = ResidualSugar)) +
geom_boxplot(fill = "black", color = "black", alpha = 0.6) +
labs(title = "Box Plot of ResidualSugar", y = "ResidualSugar") +
theme_minimal()
# Use gridExtra to arrange the plots
gridExtra::grid.arrange(density_plot, box_plot, ncol = 2)
The visualizations highlight the relationships between the key
predictors (STARS
and LabelAppeal
) and
TARGET
. The box plot demonstrates that TARGET
increases with higher values of STARS
, consistent with the
theoretical expectation that better wine ratings lead to higher sales.
Similarly, the bar plot shows a clear positive trend between
LabelAppeal
and the mean TARGET
, aligning with
the notion that visually appealing labels drive consumer purchases.
These predictors are significant in the models and should be emphasized
for their strong predictive power in explaining TARGET
.
library(ggplot2)
library(gridExtra)
# Box plot of TARGET by STARS
box_plot <- ggplot(numerical_vars, aes(x = factor(STARS), y = TARGET)) +
geom_boxplot(fill = "black", color = "black", alpha = 0.6) +
labs(title = "Box Plot of TARGET by STARS", x = "STARS", y = "TARGET") +
scale_x_discrete(labels = function(x) sprintf("%.4f", as.numeric(x))) +
theme_minimal()
# Bar plot of TARGET by LabelAppeal
bar_plot <- ggplot(numerical_vars, aes(x = factor(LabelAppeal), y = TARGET)) +
geom_bar(stat = "summary", fun = "mean", fill = "black", color = "black", alpha = 0.6) +
labs(title = "Bar Plot of TARGET by LabelAppeal", x = "LabelAppeal", y = "Mean TARGET") +
scale_x_discrete(labels = function(x) sprintf("%.4f", as.numeric(x))) +
theme_minimal()
# Use gridExtra to arrange the plots
gridExtra::grid.arrange(box_plot, bar_plot, ncol = 1)
The scatter plot reveals a positive but weak linear relationship
between Alcohol and TARGET, suggesting Alcohol may have a small
predictive effect on the number of cases purchased. The correlation
heatmap highlights stronger positive correlations for variables such as
STARS
and LabelAppeal
with TARGET, aligning
with their theoretical importance in the models. These findings guide
variable selection and emphasize the need for both simple and complex
model types (e.g., Poisson, Negative Binomial, Zero-Inflated, and
Multiple Regression) to capture various relationships effectively.
# Install the ggcorrplot package if not already installed
if (!requireNamespace("ggcorrplot", quietly = TRUE)) {
install.packages("ggcorrplot")
}
# Load necessary libraries
library(ggplot2)
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(ggcorrplot)
library(gridExtra)
# Scatter plot: TARGET vs. Alcohol
scatter_plot <- ggplot(numerical_vars, aes(x = Alcohol, y = TARGET)) +
geom_point(alpha = 0.6, color = 'blue') +
geom_smooth(method = "lm", color = 'red', se = FALSE) +
labs(title = "Scatter Plot of TARGET vs Alcohol", x = "Alcohol", y = "TARGET") +
theme_minimal()
# Correlation heatmap for numeric variables
numeric_vars <- dplyr::select(numerical_vars, where(is.numeric))
cor_matrix <- cor(numeric_vars, use = "complete.obs")
heatmap <- ggcorrplot(cor_matrix, lab = TRUE, digits = 1, title = "Correlation Heatmap")
# Display scatter plot and heatmap
scatter_plot
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 653 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 653 rows containing missing values or values outside the scale range
## (`geom_point()`).
heatmap
The bar plot illustrates the proportion of zeros across numeric
variables, with the TARGET
variable exhibiting a notable
degree of zero inflation. This suggests the appropriateness of employing
Zero-Inflated Poisson or Negative Binomial regression to account for the
excess zeros effectively. In contrast, variables such as
LabelAppeal
show a higher presence of non-zero values,
indicating their suitability for traditional regression models without
zero-inflation considerations.
# Bar plot of zero proportions
zero_counts <- numerical_vars %>%
summarise(across(where(is.numeric), ~ mean(. == 0, na.rm = TRUE))) %>%
pivot_longer(everything(), names_to = "Variable", values_to = "ZeroProportion")
ggplot(zero_counts, aes(x = reorder(Variable, -ZeroProportion), y = ZeroProportion)) +
geom_bar(stat = "identity", fill = "black", color = "black", alpha = 0.6) +
coord_flip() +
labs(title = "Proportion of Zeros in Numeric Variables", x = "Variable", y = "Proportion of Zeros") +
theme_minimal()
Imputation Strategy
1. Median Imputation: Missing values in numerical
features with <20% missingness were imputed using the
median, ensuring robustness to outliers.
2. Predictive Imputation: Missing STARS values
were imputed using a decision tree (rpart) trained on
complete cases.
3. Missingness Flags: Binary indicators were created
for all variables with missing data to retain information about missing
patterns.
# Load necessary libraries
library(dplyr)
library(rpart)
# Combine training and evaluation datasets for consistent imputation
combined_data <- bind_rows(
train_data %>% mutate(dataset = "train"),
eval_data %>% mutate(dataset = "eval")
)
# Step 1: Impute missing values with median for variables with < 20% missingness
combined_data_imputed <- combined_data %>%
group_by(dataset) %>%
mutate(
ResidualSugar = ifelse(is.na(ResidualSugar), median(ResidualSugar, na.rm = TRUE), ResidualSugar),
Chlorides = ifelse(is.na(Chlorides), median(Chlorides, na.rm = TRUE), Chlorides),
FreeSulfurDioxide = ifelse(is.na(FreeSulfurDioxide), median(FreeSulfurDioxide, na.rm = TRUE), FreeSulfurDioxide),
TotalSulfurDioxide = ifelse(is.na(TotalSulfurDioxide), median(TotalSulfurDioxide, na.rm = TRUE), TotalSulfurDioxide),
pH = ifelse(is.na(pH), median(pH, na.rm = TRUE), pH),
Sulphates = ifelse(is.na(Sulphates), median(Sulphates, na.rm = TRUE), Sulphates),
Alcohol = ifelse(is.na(Alcohol), median(Alcohol, na.rm = TRUE), Alcohol)
) %>%
ungroup()
# Step 2: Predictive imputation for STARS using rpart
# Separate combined data into rows with and without missing STARS
complete_stars <- combined_data_imputed %>% filter(!is.na(STARS))
missing_stars <- combined_data_imputed %>% filter(is.na(STARS))
# Check column names and exclude STARS from predictors
predictor_columns <- setdiff(names(combined_data_imputed), c("STARS", "dataset"))
# Train an rpart model to predict STARS
rpart_model <- rpart(
STARS ~ .,
data = complete_stars,
method = "anova"
)
# Predict missing STARS values
predicted_stars <- predict(rpart_model, missing_stars)
# Impute the predicted values back into the combined dataset
combined_data_imputed <- combined_data_imputed %>%
mutate(
STARS = ifelse(is.na(STARS), predicted_stars, STARS)
)
# Step 3: Create missingness flags for all variables with missing data
combined_data_imputed <- combined_data_imputed %>%
mutate(
ResidualSugar_missing = is.na(ResidualSugar),
Chlorides_missing = is.na(Chlorides),
FreeSulfurDioxide_missing = is.na(FreeSulfurDioxide),
TotalSulfurDioxide_missing = is.na(TotalSulfurDioxide),
pH_missing = is.na(pH),
Sulphates_missing = is.na(Sulphates),
Alcohol_missing = is.na(Alcohol),
STARS_missing = is.na(STARS)
)
# Step 4: Split back into train_data and eval_data
train_data_imputed <- combined_data_imputed %>% filter(dataset == "train") %>% dplyr::select(-dataset)
eval_data_imputed <- combined_data_imputed %>% filter(dataset == "eval") %>% dplyr::select(-dataset)
# Display the structure of the imputed datasets
cat("\nImputed Training Data\n\n")
##
## Imputed Training Data
str(train_data_imputed)
## tibble [12,795 × 24] (S3: tbl_df/tbl/data.frame)
## $ INDEX : int [1:12795] 1 2 4 5 6 7 8 11 12 13 ...
## $ TARGET : int [1:12795] 3 3 5 3 4 0 0 4 3 6 ...
## $ FixedAcidity : num [1:12795] 3.2 4.5 7.1 5.7 8 11.3 7.7 6.5 14.8 5.5 ...
## $ VolatileAcidity : num [1:12795] 1.16 0.16 2.64 0.385 0.33 0.32 0.29 -1.22 0.27 -0.22 ...
## $ CitricAcid : num [1:12795] -0.98 -0.81 -0.88 0.04 -1.26 0.59 -0.4 0.34 1.05 0.39 ...
## $ ResidualSugar : num [1:12795] 54.2 26.1 14.8 18.8 9.4 ...
## $ Chlorides : num [1:12795] -0.567 -0.425 0.037 -0.425 0.046 0.556 0.06 0.04 -0.007 -0.277 ...
## $ FreeSulfurDioxide : num [1:12795] 30 15 214 22 -167 -37 287 523 -213 62 ...
## $ TotalSulfurDioxide : num [1:12795] 268 -327 142 115 108 15 156 551 123 180 ...
## $ Density : num [1:12795] 0.993 1.028 0.995 0.996 0.995 ...
## $ pH : num [1:12795] 3.33 3.38 3.12 2.24 3.12 3.2 3.49 3.2 4.93 3.09 ...
## $ Sulphates : num [1:12795] -0.59 0.7 0.48 1.83 1.77 1.29 1.21 0.5 0.26 0.75 ...
## $ Alcohol : num [1:12795] 9.9 10.4 22 6.2 13.7 15.4 10.3 11.6 15 12.6 ...
## $ LabelAppeal : int [1:12795] 0 -1 -1 -1 0 0 0 1 0 0 ...
## $ AcidIndex : int [1:12795] 8 7 8 6 9 11 8 7 6 8 ...
## $ STARS : num [1:12795] 2 3 3 1 2 ...
## $ ResidualSugar_missing : logi [1:12795] FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ Chlorides_missing : logi [1:12795] FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ FreeSulfurDioxide_missing : logi [1:12795] FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ TotalSulfurDioxide_missing: logi [1:12795] FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ pH_missing : logi [1:12795] FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ Sulphates_missing : logi [1:12795] FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ Alcohol_missing : logi [1:12795] FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ STARS_missing : logi [1:12795] FALSE FALSE FALSE FALSE FALSE FALSE ...
cat("\nImputed Evaluation Data\n\n")
##
## Imputed Evaluation Data
str(eval_data_imputed)
## tibble [3,335 × 24] (S3: tbl_df/tbl/data.frame)
## $ INDEX : int [1:3335] 3 9 10 18 21 30 31 37 39 47 ...
## $ TARGET : int [1:3335] NA NA NA NA NA NA NA NA NA NA ...
## $ FixedAcidity : num [1:3335] 5.4 12.4 7.2 6.2 11.4 17.6 15.5 15.9 11.6 3.8 ...
## $ VolatileAcidity : num [1:3335] -0.86 0.385 1.75 0.1 0.21 0.04 0.53 1.19 0.32 0.22 ...
## $ CitricAcid : num [1:3335] 0.27 -0.76 0.17 1.8 0.28 -1.15 -0.53 1.14 0.55 0.31 ...
## $ ResidualSugar : num [1:3335] -10.7 -19.7 -33 1 1.2 1.4 4.6 31.9 -50.9 -7.7 ...
## $ Chlorides : num [1:3335] 0.092 1.169 0.065 -0.179 0.038 ...
## $ FreeSulfurDioxide : num [1:3335] 23 -37 9 104 70 -250 10 115 35 40 ...
## $ TotalSulfurDioxide : num [1:3335] 398 68 76 89 53 140 17 381 83 129 ...
## $ Density : num [1:3335] 0.985 0.99 1.046 0.989 1.029 ...
## $ pH : num [1:3335] 5.02 3.37 4.61 3.2 2.54 3.06 3.07 2.99 3.32 4.72 ...
## $ Sulphates : num [1:3335] 0.64 1.09 0.68 2.11 -0.07 -0.02 0.75 0.31 2.18 -0.64 ...
## $ Alcohol : num [1:3335] 12.3 16 8.55 12.3 4.8 11.4 8.5 11.4 -0.5 10.9 ...
## $ LabelAppeal : int [1:3335] -1 0 0 -1 0 1 0 1 0 0 ...
## $ AcidIndex : int [1:3335] 6 6 8 8 10 8 12 7 12 7 ...
## $ STARS : num [1:3335] 1.29 2 1 1 1.29 ...
## $ ResidualSugar_missing : logi [1:3335] FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ Chlorides_missing : logi [1:3335] FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ FreeSulfurDioxide_missing : logi [1:3335] FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ TotalSulfurDioxide_missing: logi [1:3335] FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ pH_missing : logi [1:3335] FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ Sulphates_missing : logi [1:3335] FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ Alcohol_missing : logi [1:3335] FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ STARS_missing : logi [1:3335] FALSE FALSE FALSE FALSE FALSE FALSE ...
Transformation Approach:
- Box-Cox: Applied to numeric features adjusted for
positivity; optimal lambda determined using training data.
- Yeo-Johnson: Handles all numeric features without
positivity constraints; lambda optimized via
car::powerTransform
.
- Transformations were performed on the combined dataset but optimized
using training data only.
- Final outputs: Transformed train and
evaluation datasets for both methods.
# Combine train and evaluation datasets for consistent transformation
combined_data <- bind_rows(
train_data_imputed %>% mutate(dataset = "train"),
eval_data_imputed %>% mutate(dataset = "eval")
)
# Exclude columns from transformations
exclude_cols <- c("INDEX", "TARGET", "dataset")
# Identify numeric columns for transformation
numeric_cols <- setdiff(names(combined_data)[sapply(combined_data, is.numeric)], exclude_cols)
# Perform Box-Cox transformation
boxcox_data_combined <- combined_data
for (col in numeric_cols) {
if (col %in% c("INDEX", "TARGET") && all(is.na(combined_data[[col]][combined_data$dataset == "eval"]))) {
next # Skip transformation for INDEX and TARGET in eval_data if all values are NA
}
tryCatch({
# Extract the column as a vector
column_data <- combined_data[[col]]
# Check for non-positive values and adjust
adjustment <- 0
if (min(column_data, na.rm = TRUE) <= 0) {
adjustment <- abs(min(column_data, na.rm = TRUE)) + 0.001
column_data <- column_data + adjustment
}
# Fit a simple linear model on training data only
model <- lm(column_data[combined_data$dataset == "train"] ~ 1)
# Perform Box-Cox transformation
bc <- boxcox(model, lambda = seq(-2, 2, by = 0.1), plotit = FALSE)
optimal_lambda <- bc$x[which.max(bc$y)]
# Apply the transformation
if (!is.na(optimal_lambda)) {
if (optimal_lambda == 0) {
transformed <- log(column_data)
} else {
transformed <- ((column_data)^optimal_lambda - 1) / optimal_lambda
}
if (adjustment > 0) {
transformed <- transformed - adjustment
}
boxcox_data_combined[[col]] <- transformed
}
}, error = function(e) {
cat(paste("Error processing column", col, ":", e$message, "\n"))
})
}
# Perform Yeo-Johnson transformation
yeojohnson_data_combined <- combined_data
for (col in numeric_cols) {
if (col %in% c("INDEX", "TARGET") && all(is.na(combined_data[[col]][combined_data$dataset == "eval"]))) {
next # Skip transformation for INDEX and TARGET in eval_data if all values are NA
}
tryCatch({
# Extract the column as a vector
column_data <- combined_data[[col]]
# Compute optimal lambda using training data
lambda <- car::powerTransform(column_data[combined_data$dataset == "train"], family = "yjPower")$lambda
# Apply the Yeo-Johnson transformation
transformed <- car::yjPower(column_data, lambda)
# Save the transformed data
yeojohnson_data_combined[[col]] <- transformed
}, error = function(e) {
cat(paste("Error processing column", col, ":", e$message, "\n"))
})
}
# Split transformed dataset back into train and evaluation datasets
boxcox_train_data <- boxcox_data_combined %>%
filter(dataset == "train") %>%
dplyr::select(-c(dataset))
boxcox_eval_data <- boxcox_data_combined %>%
filter(dataset == "eval") %>%
dplyr::select(-c(dataset))
yeojohnson_train_data <- yeojohnson_data_combined %>%
filter(dataset == "train") %>%
dplyr::select(-c(dataset))
yeojohnson_eval_data <- yeojohnson_data_combined %>%
filter(dataset == "eval") %>%
dplyr::select(-c(dataset))
# Display structures of transformed datasets
cat("\nBox-Cox Transformed Training Dataset:\n\n")
##
## Box-Cox Transformed Training Dataset:
str(boxcox_train_data)
## tibble [12,795 × 24] (S3: tbl_df/tbl/data.frame)
## $ INDEX : int [1:12795] 1 2 4 5 6 7 8 11 12 13 ...
## $ TARGET : int [1:12795] 3 3 5 3 4 0 0 4 3 6 ...
## $ FixedAcidity : num [1:12795] 7.32 9.09 12.66 10.73 13.91 ...
## $ VolatileAcidity : num [1:12795] 0.427 -0.706 2.155 -0.454 -0.516 ...
## $ CitricAcid : num [1:12795] -1.92 -1.735 -1.811 -0.791 -2.222 ...
## $ ResidualSugar : num [1:12795] 150 103.1 84.5 91.1 75.6 ...
## $ Chlorides : num [1:12795] -1.558 -1.422 -0.961 -1.422 -0.952 ...
## $ FreeSulfurDioxide : num [1:12795] 457 428.6 810.4 441.8 90.8 ...
## $ TotalSulfurDioxide : num [1:12795] 1172.3 14.9 920.3 866.7 852.8 ...
## $ Density : num [1:12795] -0.00719 0.02804 -0.00482 -0.0036 -0.00543 ...
## $ pH : num [1:12795] 2.51 2.56 2.27 1.3 2.27 ...
## $ Sulphates : num [1:12795] -1.5043 -0.0567 -0.3076 1.2533 1.1829 ...
## $ Alcohol : num [1:12795] 11.75 12.4 28.1 6.97 16.77 ...
## $ LabelAppeal : num [1:12795] -1.07 -2 -2 -2 -1.07 ...
## $ AcidIndex : num [1:12795] 0.765 0.753 0.765 0.736 0.774 ...
## $ STARS : num [1:12795] 0.647 0.986 0.986 0 0.647 ...
## $ ResidualSugar_missing : logi [1:12795] FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ Chlorides_missing : logi [1:12795] FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ FreeSulfurDioxide_missing : logi [1:12795] FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ TotalSulfurDioxide_missing: logi [1:12795] FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ pH_missing : logi [1:12795] FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ Sulphates_missing : logi [1:12795] FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ Alcohol_missing : logi [1:12795] FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ STARS_missing : logi [1:12795] FALSE FALSE FALSE FALSE FALSE FALSE ...
cat("\nBox-Cox Transformed Evaluation Dataset:\n\n")
##
## Box-Cox Transformed Evaluation Dataset:
str(boxcox_eval_data)
## tibble [3,335 × 24] (S3: tbl_df/tbl/data.frame)
## $ INDEX : int [1:3335] 3 9 10 18 21 30 31 37 39 47 ...
## $ TARGET : int [1:3335] NA NA NA NA NA NA NA NA NA NA ...
## $ FixedAcidity : num [1:3335] 10.3 20.1 12.8 11.4 18.7 ...
## $ VolatileAcidity : num [1:3335] -1.822 -0.454 1.109 -0.773 -0.65 ...
## $ CitricAcid : num [1:3335] -0.531 -1.68 -0.644 1.237 -0.52 ...
## $ ResidualSugar : num [1:3335] 43 28.56 7.44 61.94 62.26 ...
## $ Chlorides : num [1:3335] -0.905 0.236 -0.932 -1.179 -0.96 ...
## $ FreeSulfurDioxide : num [1:3335] 444 331 417 598 533 ...
## $ TotalSulfurDioxide : num [1:3335] 1435 774 789 815 744 ...
## $ Density : num [1:3335] -0.0147 -0.00951 0.04673 -0.01121 0.02912 ...
## $ pH : num [1:3335] 4.45 2.55 3.97 2.36 1.63 ...
## $ Sulphates : num [1:3335] -0.1253 0.3915 -0.0796 1.5828 -0.928 ...
## $ Alcohol : num [1:3335] 14.91 19.87 9.99 14.91 5.21 ...
## $ LabelAppeal : num [1:3335] -2 -1.07 -1.07 -2 -1.07 ...
## $ AcidIndex : num [1:3335] 0.736 0.736 0.765 0.765 0.781 ...
## $ STARS : num [1:3335] 0.246 0.647 0 0 0.246 ...
## $ ResidualSugar_missing : logi [1:3335] FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ Chlorides_missing : logi [1:3335] FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ FreeSulfurDioxide_missing : logi [1:3335] FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ TotalSulfurDioxide_missing: logi [1:3335] FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ pH_missing : logi [1:3335] FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ Sulphates_missing : logi [1:3335] FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ Alcohol_missing : logi [1:3335] FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ STARS_missing : logi [1:3335] FALSE FALSE FALSE FALSE FALSE FALSE ...
AcidIndex
, Sulphates
, and
STARS
.AcidIndex
and Sulphates
.Density
and
Chlorides
, showed little change as they were already near
symmetric.Both transformations improve symmetry, with Box-Cox being more effective overall, while Yeo-Johnson offers versatility for handling zero or negative values. For our modeling homework, we will use the Box-Cox transformed data
# Required Libraries
library(ggplot2)
library(dplyr)
library(tidyr)
# Compute skewness for original, Box-Cox, and Yeo-Johnson transformed datasets
compute_skewness <- function(data, numeric_cols) {
sapply(data %>% dplyr::select(all_of(numeric_cols)), function(x) {
moments::skewness(x, na.rm = TRUE)
})
}
# Calculate skewness for all datasets
original_skewness <- compute_skewness(train_data_imputed, numeric_cols)
boxcox_skewness <- compute_skewness(boxcox_train_data, numeric_cols)
yeojohnson_skewness <- compute_skewness(yeojohnson_train_data, numeric_cols)
# Combine skewness data into a single dataframe for plotting
skewness_df <- data.frame(
Variable = rep(numeric_cols, 3),
Transformation = rep(c("Original", "Box-Cox", "Yeo-Johnson"), each = length(numeric_cols)),
Skewness = c(original_skewness, boxcox_skewness, yeojohnson_skewness)
)
# Create a bar chart to compare skewness
ggplot(skewness_df, aes(x = Variable, y = Skewness, fill = Transformation)) +
geom_bar(stat = "identity", position = "dodge") +
labs(
title = "Skewness Comparison Before and After Transformations",
x = "Variable",
y = "Skewness"
) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
scale_fill_brewer(palette = "Set2")
# Scale only numeric columns excluding TARGET
boxcox_train_data <- boxcox_train_data %>%
mutate(across(where(is.numeric) & !starts_with(c("TARGET", "INDEX")), scale))
boxcox_eval_data <- boxcox_eval_data %>%
mutate(across(where(is.numeric) & !starts_with(c("TARGET", "INDEX")), scale))
In initial model is fit to the data to analyze Cook’s Distance, detect and remove outliers. This step ensures the dataset is clean, free from irrelevant or redundant features, and balanced before fitting models. It removes unnecessary columns (e.g., missing value indicators) because they have been found to not contribute to model performance, partitions the data for training and evaluation, and identifies influential points across multiple models to enhance data quality and improve model performance.
# Load necessary libraries
library(MASS)
library(pscl)
library(car)
library(dplyr)
library(caret)
## Loading required package: lattice
library(lattice)
library(MASS) # For Negative Binomial Model
library(car) # For diagnostic tools
# Remove unnecessary columns (e.g., "_missing" flags)
boxcox_train_data <- boxcox_train_data %>%
dplyr::select(-ends_with("_missing"), -INDEX)
boxcox_eval_data <- boxcox_eval_data %>%
dplyr::select(-ends_with("_missing"))
# Split data into training and testing sets
set.seed(123) # For reproducibility
train_index <- createDataPartition(boxcox_train_data$TARGET, p = 0.8, list = FALSE)
train_data <- boxcox_train_data[train_index, ]
test_data <- boxcox_train_data[-train_index, ]
# Convert matrix columns to numeric vectors for train and test data
train_data <- train_data %>%
mutate(across(where(is.matrix), ~ as.numeric(.)))
test_data <- test_data %>%
mutate(across(where(is.matrix), ~ as.numeric(.)))
# Split data into training and testing sets
set.seed(123) # For reproducibility
train_index <- createDataPartition(boxcox_train_data$TARGET, p = 0.8, list = FALSE)
train_data <- boxcox_train_data[train_index, ]
test_data <- boxcox_train_data[-train_index, ]
# Fit the Poisson Model using all variables
poisson_model_all <- glm(TARGET ~ ., data = train_data, family = poisson(link = "log"))
# Fit the Negative Binomial Model using all variables
nb_model <- glm.nb(TARGET ~ ., data = train_data)
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
# Fit the Multiple Linear Regression Model using all variables
mlr_model <- lm(TARGET ~ ., data = train_data)
# Calculate Cook's Distance for Poisson Model
cooks_dist_poisson <- cooks.distance(poisson_model_all)
influential_poisson <- which(cooks_dist_poisson > (4 / nrow(train_data)))
# Calculate Cook's Distance for Negative Binomial Model
cooks_dist_nb <- cooks.distance(nb_model)
influential_nb <- which(cooks_dist_nb > (4 / nrow(train_data)))
# Calculate Cook's Distance for MLR Model
cooks_dist_mlr <- cooks.distance(mlr_model)
influential_mlr <- which(cooks_dist_mlr > (4 / nrow(train_data)))
# Union of Influential Points Across All Models
influential_points <- union(union(influential_poisson, influential_nb), influential_mlr)
# Refine Dataset by Removing Influential Points
train_data_refined <- train_data[-influential_points, ]
# Output Number of Removed Points
cat("Total Influential Points Removed Across Models:", length(influential_points), "\n")
## Total Influential Points Removed Across Models: 547
# Convert matrix columns to numeric vectors for train and test data
train_data_refined <- train_data_refined %>%
mutate(across(where(is.matrix), ~ as.numeric(.)))
test_data <- test_data %>%
mutate(across(where(is.matrix), ~ as.numeric(.)))
The tree model identifies STARS
,
LabelAppeal
, and AcidIndex
as key predictors,
with interactions evident where the effect of LabelAppeal
and AcidIndex
depends on the range of STARS
.
Specifically, the effect of LabelAppeal
varies within
different ranges of STARS
, and AcidIndex
plays
a role when STARS
is less than 0.16. Possible interaction
terms to include in a model are STARS × LabelAppeal
and
STARS × AcidIndex
to capture these relationships.
# Load necessary libraries
library(rpart)
library(rpart.plot)
# Train a regression tree model with increased depth
set.seed(456)
treeModel <- rpart(
TARGET ~ .,
data = train_data_refined,
method = "anova",
control = rpart.control(maxdepth = 10) # Increase the maximum depth
)
# Visualize the tree
rpart.plot(treeModel,
box.palette = "RdBu",
shadow.col = "gray",
nn = TRUE)
# Extract split details using base R
split_details <- as.data.frame(treeModel$frame)
split_details <- split_details[split_details$var != "<leaf>", c("var", "n", "dev", "yval")]
# Output split details
split_details
## var n dev yval
## 1 STARS 9691 33050.204 3.136209
## 2 STARS 4171 12204.959 2.003117
## 4 AcidIndex 1821 4952.293 1.266886
## 3 LabelAppeal 5520 11443.680 3.992391
## 6 LabelAppeal 3687 6333.720 3.502848
## 13 STARS 2509 3572.616 3.859705
## 26 STARS 1541 2496.550 3.541856
## 7 STARS 1833 2449.038 4.977087
## 14 STARS 811 1371.366 4.541307
The Random Forest model confirms the importance of STARS
and LabelAppeal
as the most influential predictors, with
AcidIndex
, Chlorides
, and
TotalSulfurDioxide
also showing moderate significance.
Other variables, such as VolatileAcidity
and
Alcohol
, have a smaller but notable impact, while the
remaining predictors contribute minimally. This supports the focus on
STARS × LabelAppeal
and potentially
STARS × AcidIndex
for interaction modeling.
# Load necessary libraries
library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:gridExtra':
##
## combine
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
library(doParallel)
## Loading required package: foreach
## Loading required package: iterators
## Loading required package: parallel
# Reduce computation time strategies
# Subset the dataset (if dataset is large)
set.seed(123) # For reproducibility
sample_data <- train_data_refined[sample(1:nrow(train_data_refined), size = 5000), ]
# Set up parallel processing
cl <- makeCluster(detectCores() - 1) # Use all but one core
registerDoParallel(cl)
# Train the Random Forest model
set.seed(456) # For reproducibility
rf_model <- randomForest(
TARGET ~ .,
data = sample_data,
importance = TRUE, # Compute variable importance
ntree = 100, # Reduce the number of trees
mtry = 2, # Reduce the number of predictors per split
nodesize = 10 # Increase node size to reduce splits
)
# Stop parallel processing
stopCluster(cl)
# Display variable importance
print(importance(rf_model))
## %IncMSE IncNodePurity
## FixedAcidity 1.3144050 615.5860
## VolatileAcidity 4.7946415 750.5674
## CitricAcid 0.9824440 577.4464
## ResidualSugar 3.0250795 628.7444
## Chlorides 5.0115482 774.0460
## FreeSulfurDioxide 5.2555645 712.3400
## TotalSulfurDioxide 2.0808506 737.3759
## Density 2.4141392 698.5284
## pH -0.9662872 593.9370
## Sulphates -0.4988596 581.0680
## Alcohol 3.3110223 662.1983
## LabelAppeal 30.2845082 1894.0864
## AcidIndex 18.9442301 932.2876
## STARS 33.7295596 4122.4252
# Plot variable importance
varImpPlot(rf_model, main = "Variable Importance Plot")
## Poisson Regression Models
The Higher-Order Interaction Model is the best-performing Poisson model, based on the lowest Test RMSE (1.584489).
This confirms the robustness of the Higher-Order Interaction model in the Poisson regression context.
# Load necessary libraries
library(Metrics)
##
## Attaching package: 'Metrics'
## The following objects are masked from 'package:caret':
##
## precision, recall
library(knitr)
library(MASS) # For stepAIC function
library(caret) # For train-test split
train_data_refined <- train_data_refined %>% drop_na(TARGET)
test_data <- test_data %>% drop_na(TARGET)
# Step 2: Fit Multiple Poisson Models Using the Training Data
# 1. Basic Additive Model
poisson_model_additive <- glm(TARGET ~ STARS + LabelAppeal + AcidIndex + TotalSulfurDioxide + VolatileAcidity,
family = poisson(link = "log"), data = train_data_refined)
# 2. Interaction Model
poisson_model_interaction <- glm(TARGET ~ STARS * LabelAppeal + STARS * AcidIndex + AcidIndex * LabelAppeal +
TotalSulfurDioxide + VolatileAcidity,
family = poisson(link = "log"), data = train_data_refined)
# 3. Higher-Order Interaction Model
poisson_model_higher_interaction <- glm(TARGET ~ STARS * LabelAppeal + STARS * AcidIndex + AcidIndex * LabelAppeal +
TotalSulfurDioxide * VolatileAcidity + I(STARS^2) + I(LabelAppeal^2),
family = poisson(link = "log"), data = train_data_refined)
# 4. Simplified Model for Parsimony
poisson_model_simplified <- glm(TARGET ~ STARS * AcidIndex,
family = poisson(link = "log"), data = train_data_refined)
# 5. Stepwise Regression Model
# Start with the full model and perform stepwise selection based on AIC
poisson_full_model <- glm(TARGET ~ ., family = poisson(link = "log"), data = train_data_refined) # Use all predictors
poisson_model_stepwise <- stepAIC(poisson_full_model, direction = "both", trace = FALSE) # Stepwise selection
# Step 3: Evaluate Models on Both Training and Test Data
# Define a helper function to calculate metrics for each Poisson model
calculate_poisson_metrics <- function(model, train_data_refined, test_data) {
# Predictions
train_preds <- predict(model, newdata = train_data_refined, type = "response")
test_preds <- predict(model, newdata = test_data, type = "response")
# Calculate metrics
metrics <- data.frame(
Train_R2 = 1 - (sum((train_data_refined$TARGET - train_preds)^2) / sum((train_data_refined$TARGET - mean(train_data_refined$TARGET))^2)),
Test_R2 = 1 - (sum((test_data$TARGET - test_preds)^2) / sum((test_data$TARGET - mean(test_data$TARGET))^2)),
Train_RMSE = rmse(train_data_refined$TARGET, train_preds),
Test_RMSE = rmse(test_data$TARGET, test_preds),
AIC = AIC(model),
Null_Deviance = model$null.deviance, # Null deviance for the model
Residual_Deviance = model$deviance # Residual deviance for the model
)
return(metrics)
}
# Apply the helper function to all Poisson models
poisson_model_metrics <- rbind(
Additive = calculate_poisson_metrics(poisson_model_additive, train_data_refined, test_data),
Interaction = calculate_poisson_metrics(poisson_model_interaction, train_data_refined, test_data),
HigherOrderInteraction = calculate_poisson_metrics(poisson_model_higher_interaction, train_data_refined, test_data),
Simplified = calculate_poisson_metrics(poisson_model_simplified, train_data_refined, test_data),
Stepwise = calculate_poisson_metrics(poisson_model_stepwise, train_data_refined, test_data)
)
# Step 4: Sort Models by Test RMSE
# Sort by Test_RMSE (ascending order) to identify the best model
poisson_model_metrics_sorted <- poisson_model_metrics[order(poisson_model_metrics$Test_RMSE), ]
# Step 5: Display the Sorted Metrics in a Table
cat("\nPoisson Model Comparison Metrics Sorted by Test RMSE\n")
##
## Poisson Model Comparison Metrics Sorted by Test RMSE
poisson_model_metrics_sorted
## Train_R2 Test_R2 Train_RMSE Test_RMSE AIC
## HigherOrderInteraction 0.4223409 0.3307138 1.403585 1.584489 36020.06
## Stepwise 0.4061486 0.3205099 1.423120 1.596522 36294.70
## Interaction 0.4216400 0.3190610 1.404436 1.598223 36120.49
## Additive 0.4025244 0.3148473 1.427456 1.603161 36338.88
## Simplified 0.3181138 0.2692662 1.524961 1.655629 37232.56
## Null_Deviance Residual_Deviance
## HigherOrderInteraction 15416.13 10838.14
## Stepwise 15416.13 11110.79
## Interaction 15416.13 10944.58
## Additive 15416.13 11168.96
## Simplified 15416.13 12066.65
# Step 6: Identify and Output the Best Poisson Model
# Ensure the best model name is extracted properly
best_poisson_model_index <- which.min(poisson_model_metrics_sorted$Test_RMSE) # Get the index of the lowest Test RMSE
best_poisson_model_name <- rownames(poisson_model_metrics_sorted)[best_poisson_model_index] # Extract the model name
best_poisson_model_type <- paste("Poisson Best Model:", best_poisson_model_name) # Include the model type
# Output the best Poisson model name
cat("Best model based on lowest Test RMSE:", best_poisson_model_type, "\n")
## Best model based on lowest Test RMSE: Poisson Best Model: HigherOrderInteraction
The Vuong test failed to produce valid results
(NULL
), likely due to a mismatch in data dimensions.
However, the Zero-Inflated Poisson model appears
well-fitted, with significant predictors (e.g., STARS
,
LabelAppeal
, AcidIndex
) and a log-likelihood
of -1.743e+04. Further checks on data alignment are
needed for accurate comparison.
test_data <- test_data %>% drop_na()
train_data_refined <- train_data_refined %>% drop_na()
# Load necessary libraries
library(pscl)
library(Metrics) # For RMSE
library(MASS) # For Negative Binomial model
library(knitr) # For pretty table output
test_data <- test_data %>% drop_na()
train_data_refined <- train_data_refined %>% drop_na()
# Step 1: Fit Zero-Inflated Poisson Model
zip_model_higher_interaction <- zeroinfl(
TARGET ~ STARS * LabelAppeal + STARS * AcidIndex + AcidIndex * LabelAppeal +
TotalSulfurDioxide * VolatileAcidity + I(STARS^2) + I(LabelAppeal^2) | 1,
data = train_data_refined,
dist = "poisson"
)
# Step 2: Define Function to Evaluate Model Metrics
evaluate_zip_metrics <- function(model, train_data_refined, test_data) {
tryCatch({
# Predictions
train_preds <- predict(model, newdata = train_data_refined, type = "response")
test_preds <- predict(model, newdata = test_data, type = "response")
# RMSE Calculation
train_rmse <- sqrt(mean((train_data_refined$TARGET - train_preds)^2))
test_rmse <- sqrt(mean((test_data$TARGET - test_preds)^2))
# R-squared Calculation
train_r2 <- 1 - sum((train_data_refined$TARGET - train_preds)^2) / sum((train_data_refined$TARGET - mean(train_data_refined$TARGET))^2)
test_r2 <- 1 - sum((test_data$TARGET - test_preds)^2) / sum((test_data$TARGET - mean(test_data$TARGET))^2)
# Deviance Metrics
null_deviance <- model$null
residual_deviance <- model$deviance
# Model Metrics Table
metrics <- data.frame(
Train_R2 = train_r2,
Test_R2 = test_r2,
Train_RMSE = train_rmse,
Test_RMSE = test_rmse,
AIC = AIC(model),
Null_Deviance = null_deviance,
Residual_Deviance = residual_deviance
)
return(metrics)
}, error = function(e) {
cat("Error evaluating model metrics:", e$message, "\n")
return(NULL)
})
}
# Step 3: Apply Metrics Function
zip_metrics <- evaluate_zip_metrics(zip_model_higher_interaction, train_data_refined, test_data)
## Error evaluating model metrics: arguments imply differing number of rows: 1, 0
# Step 4: Model Comparison (Vuong Test with Negative Binomial)
vuong_test <- vuong(zip_model_higher_interaction, poisson_model_higher_interaction)
## Vuong Non-Nested Hypothesis Test-Statistic:
## (test-statistic is asymptotically distributed N(0,1) under the
## null that the models are indistinguishible)
## -------------------------------------------------------------
## Vuong z-statistic H_A p-value
## Raw 16.50134 model1 > model2 < 2.22e-16
## AIC-corrected 16.47250 model1 > model2 < 2.22e-16
## BIC-corrected 16.36899 model1 > model2 < 2.22e-16
# Step 5: Display Results
# Metrics Table
if (!is.null(zip_metrics)) {
cat("Model Performance Metrics:\n")
print(kable(zip_metrics, digits = 3, caption = "Zero-Inflated Poisson Model Metrics"))
}
# Vuong Test Results
cat("\nVuong Test Results:\n")
##
## Vuong Test Results:
print(vuong_test)
## NULL
# Model Summary
cat("\nSummary of the Zero-Inflated Poisson Model:\n")
##
## Summary of the Zero-Inflated Poisson Model:
print(summary(zip_model_higher_interaction))
##
## Call:
## zeroinfl(formula = TARGET ~ STARS * LabelAppeal + STARS * AcidIndex +
## AcidIndex * LabelAppeal + TotalSulfurDioxide * VolatileAcidity +
## I(STARS^2) + I(LabelAppeal^2) | 1, data = train_data_refined, dist = "poisson")
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## -1.6546 -0.2143 0.1805 0.5062 2.2075
##
## Count model coefficients (poisson with log link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.200994 0.010704 112.200 < 2e-16 ***
## STARS 0.132690 0.006888 19.263 < 2e-16 ***
## LabelAppeal 0.228233 0.007231 31.565 < 2e-16 ***
## AcidIndex -0.082892 0.007238 -11.453 < 2e-16 ***
## TotalSulfurDioxide 0.006764 0.006072 1.114 0.26528
## VolatileAcidity -0.022319 0.006111 -3.652 0.00026 ***
## I(STARS^2) 0.041713 0.006501 6.417 1.39e-10 ***
## I(LabelAppeal^2) -0.010813 0.005122 -2.111 0.03477 *
## STARS:LabelAppeal -0.032960 0.007480 -4.406 1.05e-05 ***
## STARS:AcidIndex 0.048403 0.007010 6.905 5.03e-12 ***
## LabelAppeal:AcidIndex 0.004627 0.007478 0.619 0.53606
## TotalSulfurDioxide:VolatileAcidity 0.006799 0.006387 1.065 0.28707
##
## Zero-inflation model coefficients (binomial with logit link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.92098 0.03964 -48.46 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Number of iterations in BFGS optimization: 19
## Log-likelihood: -1.743e+04 on 13 Df
The Higher-Order Interaction model outperformed all other Negative Binomial models based on Test RMSE and Test R²:
While the Stepwise model has a slightly lower AIC, the Higher-Order Interaction model achieves the best balance of predictive accuracy and generalization performance, making it the best model.
# Load necessary libraries
library(Metrics)
library(knitr)
library(MASS) # For glm.nb() and stepAIC
library(caret) # For train-test split
# Step 2: Fit Multiple Negative Binomial Models Using the Training Data
# 1. Basic Additive Model
nb_model_additive <- suppressWarnings(glm.nb(TARGET ~ STARS + LabelAppeal + AcidIndex + TotalSulfurDioxide + VolatileAcidity,
data = train_data_refined))
# 2. Interaction Model
nb_model_interaction <- suppressWarnings(glm.nb(TARGET ~ STARS * LabelAppeal + STARS * AcidIndex + AcidIndex * LabelAppeal +
TotalSulfurDioxide + VolatileAcidity,
data = train_data_refined))
# 3. Higher-Order Interaction Model
nb_model_higher_interaction <- suppressWarnings(glm.nb(TARGET ~ STARS * LabelAppeal + STARS * AcidIndex + AcidIndex * LabelAppeal +
TotalSulfurDioxide * VolatileAcidity + I(STARS^2) + I(LabelAppeal^2),
data = train_data_refined))
# 4. Simplified Model for Parsimony
nb_model_simplified <- suppressWarnings(glm.nb(TARGET ~ STARS * AcidIndex,
data = train_data_refined))
# 5. Stepwise Regression Model
# Start with the full model and perform stepwise selection based on AIC
nb_full_model <- suppressWarnings(glm.nb(TARGET ~ ., data = train_data_refined)) # Use all predictors
nb_model_stepwise <- suppressWarnings(stepAIC(nb_full_model, direction = "both", trace = FALSE)) # Stepwise selection
# Step 3: Evaluate Models on Both Training and Test Data
# Define a helper function to calculate metrics for each Negative Binomial model
calculate_nb_metrics <- function(model, train_data_refined, test_data) {
# Predictions
train_preds <- predict(model, newdata = train_data_refined, type = "response")
test_preds <- predict(model, newdata = test_data, type = "response")
# Extract theta (dispersion parameter)
theta <- model$theta
# Calculate metrics
metrics <- data.frame(
Train_R2 = 1 - (sum((train_data_refined$TARGET - train_preds)^2) / sum((train_data_refined$TARGET - mean(train_data_refined$TARGET))^2)),
Test_R2 = 1 - (sum((test_data$TARGET - test_preds)^2) / sum((test_data$TARGET - mean(test_data$TARGET))^2)),
Train_RMSE = rmse(train_data_refined$TARGET, train_preds),
Test_RMSE = rmse(test_data$TARGET, test_preds),
AIC = AIC(model),
Null_Deviance = model$null.deviance, # Null deviance for the model
Residual_Deviance = model$deviance, # Residual deviance for the model
Theta = theta # Dispersion parameter
)
return(metrics)
}
# Apply the helper function to all Negative Binomial models
nb_model_metrics <- rbind(
Additive = calculate_nb_metrics(nb_model_additive, train_data_refined, test_data),
Interaction = calculate_nb_metrics(nb_model_interaction, train_data_refined, test_data),
HigherOrderInteraction = calculate_nb_metrics(nb_model_higher_interaction, train_data_refined, test_data),
Simplified = calculate_nb_metrics(nb_model_simplified, train_data_refined, test_data),
Stepwise = calculate_nb_metrics(nb_model_stepwise, train_data_refined, test_data)
)
# Step 4: Sort Models by Test RMSE
# Sort by Test_RMSE (ascending order) to identify the best model
nb_model_metrics_sorted <- nb_model_metrics[order(nb_model_metrics$Test_RMSE), ]
# Step 5: Display the Sorted Metrics in a Table
cat("\nNegative Binomial Model Comparison Metrics Sorted by Test RMSE\n")
##
## Negative Binomial Model Comparison Metrics Sorted by Test RMSE
nb_model_metrics_sorted
## Train_R2 Test_R2 Train_RMSE Test_RMSE AIC
## HigherOrderInteraction 0.4223398 0.3307130 1.403586 1.584490 36022.24
## Stepwise 0.4061480 0.3205090 1.423121 1.596523 36296.88
## Interaction 0.4216393 0.3190600 1.404437 1.598225 36122.68
## Additive 0.4025238 0.3148464 1.427457 1.603162 36341.05
## Simplified 0.3181135 0.2692658 1.524961 1.655629 37234.73
## Null_Deviance Residual_Deviance Theta
## HigherOrderInteraction 15415.58 10837.83 60781.92
## Stepwise 15415.59 11110.47 61340.98
## Interaction 15415.57 10944.26 59020.49
## Additive 15415.59 11168.64 61039.95
## Simplified 15415.42 12066.17 46873.72
# Step 6: Identify and Output the Best Negative Binomial Model
# Ensure the best model name is extracted properly
best_nb_model_index <- which.min(nb_model_metrics_sorted$Test_RMSE) # Get the index of the lowest Test RMSE
best_nb_model_name <- rownames(nb_model_metrics_sorted)[best_nb_model_index] # Extract the model name
best_nb_model_type <- paste("Negative Binomial Best Model:", best_nb_model_name) # Include the model type
# Output the best Negative Binomial model name
cat("Best model based on lowest Test RMSE:", best_nb_model_type, "\n")
## Best model based on lowest Test RMSE: Negative Binomial Best Model: HigherOrderInteraction
The Zero-Inflated Negative Binomial (ZINB) model and the Negative Binomial Higher-Order Interaction (NB_HigherOrder) model were compared:
In summary, while the ZINB model improves AIC, the NB_HigherOrder model is superior in predictive accuracy and overall performance.
library(pscl) # For zeroinfl function
# Fit the Zero-Inflated Negative Binomial model
zinb_model <- zeroinfl(
TARGET ~ STARS * LabelAppeal + STARS * AcidIndex + AcidIndex * LabelAppeal +
TotalSulfurDioxide * VolatileAcidity + I(STARS^2) + I(LabelAppeal^2) | 1,
data = train_data_refined,
dist = "negbin"
)
# Function to compute model metrics
calculate_model_metrics <- function(model, train_data, test_data) {
# Predictions
train_preds <- predict(model, newdata = train_data, type = "response")
test_preds <- predict(model, newdata = test_data, type = "response")
# Metrics
metrics <- data.frame(
Train_R2 = 1 - (sum((train_data$TARGET - train_preds)^2) / sum((train_data$TARGET - mean(train_data$TARGET))^2)),
Test_R2 = 1 - (sum((test_data$TARGET - test_preds)^2) / sum((test_data$TARGET - mean(test_data$TARGET))^2)),
Train_RMSE = sqrt(mean((train_data$TARGET - train_preds)^2)),
Test_RMSE = sqrt(mean((test_data$TARGET - test_preds)^2)),
AIC = AIC(model),
LogLikelihood = as.numeric(logLik(model)) # Log-Likelihood
)
return(metrics)
}
# Compute metrics for both models
zinb_metrics <- calculate_model_metrics(zinb_model, train_data_refined, test_data)
nb_metrics <- calculate_model_metrics(nb_model_higher_interaction, train_data_refined, test_data)
# Combine metrics for comparison
model_comparison <- rbind(
ZINB = zinb_metrics,
NB_HigherOrder = nb_metrics
)
# Display comparison table
print(model_comparison)
## Train_R2 Test_R2 Train_RMSE Test_RMSE AIC LogLikelihood
## ZINB 0.3869180 0.2962219 1.445979 1.624805 34879.68 -17425.84
## NB_HigherOrder 0.4223398 0.3307130 1.403586 1.584490 36022.24 -17998.12
# Vuong Test for model comparison
vuong_test <- vuong(zinb_model, nb_model_higher_interaction)
## Vuong Non-Nested Hypothesis Test-Statistic:
## (test-statistic is asymptotically distributed N(0,1) under the
## null that the models are indistinguishible)
## -------------------------------------------------------------
## Vuong z-statistic H_A p-value
## Raw 16.50382 model1 > model2 < 2.22e-16
## AIC-corrected 16.47499 model1 > model2 < 2.22e-16
## BIC-corrected 16.37147 model1 > model2 < 2.22e-16
# Display Vuong Test results
cat("Vuong Test Results:\n")
## Vuong Test Results:
print(vuong_test)
## NULL
The Polynomial model outperformed other multiple linear regression (MLR) models with the lowest Test RMSE of 1.550 and the highest Test R² of 0.360, indicating it captures nonlinear relationships effectively. The Higher-Order Interaction model followed closely but with slightly higher error metrics.
# Load necessary libraries
library(Metrics)
library(knitr)
library(MASS) # For stepAIC function
library(caret) # For train-test split
# Step 2: Fit Multiple MLR Models Using the Training Data
# 1. Basic Additive Model
model_additive <- lm(TARGET ~ STARS + LabelAppeal + AcidIndex + TotalSulfurDioxide + VolatileAcidity,
data = train_data_refined)
# 2. Interaction Model
model_interaction <- lm(TARGET ~ STARS * LabelAppeal + STARS * AcidIndex + AcidIndex * LabelAppeal +
TotalSulfurDioxide + VolatileAcidity,
data = train_data_refined)
# 3. Higher-Order Interaction Model
model_higher_interaction <- lm(TARGET ~ STARS * LabelAppeal + STARS * AcidIndex + AcidIndex * LabelAppeal +
TotalSulfurDioxide * VolatileAcidity + I(STARS^2) + I(LabelAppeal^2),
data = train_data_refined)
# 4. Simplified Model for Parsimony
model_simplified <- lm(TARGET ~ STARS * AcidIndex,
data = train_data_refined)
# 5. Polynomial Model for Nonlinear Effects
model_polynomial <- lm(TARGET ~ STARS + LabelAppeal + AcidIndex + I(STARS^2) + I(LabelAppeal^2) +
I(AcidIndex^2),
data = train_data_refined)
# 6. Stepwise Regression Model
# Start with the full model and perform stepwise selection based on AIC
full_model <- lm(TARGET ~ ., data = train_data_refined) # Use all predictors
model_stepwise <- stepAIC(full_model, direction = "both", trace = FALSE) # Stepwise selection
# Step 3: Evaluate Models on Both Training and Test Data
# Define a helper function to calculate metrics for each model
calculate_metrics <- function(model, train_data_refined, test_data) {
# Predictions
train_preds <- predict(model, newdata = train_data_refined)
test_preds <- predict(model, newdata = test_data)
# Calculate metrics
metrics <- data.frame(
Train_R2 = summary(model)$adj.r.squared,
Test_R2 = 1 - (sum((test_data$TARGET - test_preds)^2) / sum((test_data$TARGET - mean(test_data$TARGET))^2)),
Train_RMSE = rmse(train_data_refined$TARGET, train_preds),
Test_RMSE = rmse(test_data$TARGET, test_preds),
AIC = AIC(model)
)
return(metrics)
}
# Apply the helper function to all models
model_metrics <- rbind(
Additive = calculate_metrics(model_additive, train_data_refined, test_data),
Interaction = calculate_metrics(model_interaction, train_data_refined, test_data),
HigherOrderInteraction = calculate_metrics(model_higher_interaction, train_data_refined, test_data),
Simplified = calculate_metrics(model_simplified, train_data_refined, test_data),
Polynomial = calculate_metrics(model_polynomial, train_data_refined, test_data),
Stepwise = calculate_metrics(model_stepwise, train_data_refined, test_data)
)
# Step 4: Sort Models by Test RMSE
# Sort by Test_RMSE (ascending order) to identify the best model
model_metrics_sorted <- model_metrics[order(model_metrics$Test_RMSE), ]
# Step 5: Display the Sorted Metrics in a Table
kable(model_metrics_sorted, digits = 3, caption = "Model Comparison Metrics Sorted by Test RMSE")
Train_R2 | Test_R2 | Train_RMSE | Test_RMSE | AIC | |
---|---|---|---|---|---|
Polynomial | 0.434 | 0.360 | 1.389 | 1.550 | 33891.43 |
HigherOrderInteraction | 0.447 | 0.354 | 1.373 | 1.557 | 33670.02 |
Interaction | 0.426 | 0.328 | 1.398 | 1.587 | 34022.10 |
Stepwise | 0.408 | 0.312 | 1.420 | 1.606 | 34333.21 |
Additive | 0.402 | 0.305 | 1.428 | 1.615 | 34416.15 |
Simplified | 0.302 | 0.252 | 1.543 | 1.675 | 35914.16 |
# Step 6: Identify and Output the Best Model with Model Type
# Ensure the best model name is extracted properly
best_model_index <- which.min(model_metrics_sorted$Test_RMSE) # Get the index of the lowest Test RMSE
best_model_name <- rownames(model_metrics_sorted)[best_model_index] # Extract the model name
best_model_type <- paste("MLR Best Model:", best_model_name) # Include the model type
# Output the best model name
cat("Best model based on lowest Test RMSE:", best_model_type, "\n")
## Best model based on lowest Test RMSE: MLR Best Model: Polynomial
The best-performing model is Linear Regression (Polynomial), achieving the lowest Test RMSE (1.5498) and the highest Test R² (0.3597). This indicates that the Polynomial model captures the underlying patterns in the data more effectively than the other models.
While the Polynomial Based performs best, I may consider Linear Regression (Higher-Order Interaction) or Poisson (Higher-Order Interaction) for interpretability: - Linear Regression (Higher-Order Interaction) has slightly higher Test RMSE (1.5570) but provides a simpler linear interpretation with interaction terms. - Poisson (Higher-Order Interaction) also has a competitive performance (Test RMSE = 1.5845) and aligns well with count data assumptions.
These models balance performance and interpretability, making them easier to explain to stakeholders while still maintaining strong predictive accuracy.
If predictive performance is the sole objective, the Polynomial model is the best. However, for easier interpretation, Linear Regression (Higher-Order Interaction) or Poisson (Higher-Order Interaction) can be selected with minimal trade-offs in performance.
# Initialize storage for metrics across all models
final_metrics <- data.frame(
Model_Type = character(),
Model = character(),
Train_RMSE = numeric(),
Test_RMSE = numeric(),
Train_R2 = numeric(),
Test_R2 = numeric()
)
# Function to evaluate and store model metrics
evaluate_and_store <- function(model, model_name, model_type, train_data, test_data) {
tryCatch({
train_predictions <- predict(model, newdata = train_data, type = "response")
test_predictions <- predict(model, newdata = test_data, type = "response")
# Calculate metrics
train_rmse <- sqrt(mean((train_data$TARGET - train_predictions)^2))
test_rmse <- sqrt(mean((test_data$TARGET - test_predictions)^2))
train_r2 <- 1 - (sum((train_data$TARGET - train_predictions)^2) /
sum((train_data$TARGET - mean(train_data$TARGET))^2))
test_r2 <- 1 - (sum((test_data$TARGET - test_predictions)^2) /
sum((test_data$TARGET - mean(test_data$TARGET))^2))
# Store metrics in the final table
final_metrics <<- rbind(
final_metrics,
data.frame(
Model_Type = model_type,
Model = model_name,
Train_RMSE = round(train_rmse, 4),
Test_RMSE = round(test_rmse, 4),
Train_R2 = round(train_r2, 4),
Test_R2 = round(test_r2, 4)
)
)
}, error = function(e) {
# Add NA values if any issue occurs
final_metrics <<- rbind(
final_metrics,
data.frame(
Model_Type = model_type,
Model = model_name,
Train_RMSE = NA, Test_RMSE = NA, Train_R2 = NA, Test_R2 = NA
)
)
})
}
# List all models for comparison
model_list <- list(
list(name = "Poisson (Higher-Order Interaction)", type = "Poisson", model = poisson_model_higher_interaction),
list(name = "Zero-Inflated Poisson", type = "ZIP", model = zip_model_higher_interaction),
list(name = "Negative Binomial (Higher-Order Interaction)", type = "Negative Binomial", model = nb_model_higher_interaction),
list(name = "Zero-Inflated Negative Binomial", type = "ZINB", model = zinb_model),
list(name = "Linear Regression (Polynomial)", type = "MLR", model = model_polynomial),
list(name = "Linear Regression (Higher-Order Interaction)", type = "MLR", model = model_higher_interaction),
list(name = "Linear Regression (Stepwise)", type = "MLR", model = model_stepwise),
list(name = "Linear Regression (Interaction)", type = "MLR", model = model_interaction),
list(name = "Linear Regression (Simplified)", type = "MLR", model = model_simplified)
)
# Evaluate all models
for (entry in model_list) {
evaluate_and_store(entry$model, entry$name, entry$type, train_data_refined, test_data)
}
# Sort the data frame by Test RMSE
final_metrics_sorted <- final_metrics[order(final_metrics$Test_RMSE), ]
# Display the final consolidated metrics table
print(final_metrics_sorted)
## Model_Type Model Train_RMSE
## 5 MLR Linear Regression (Polynomial) 1.3894
## 6 MLR Linear Regression (Higher-Order Interaction) 1.3729
## 1 Poisson Poisson (Higher-Order Interaction) 1.4036
## 3 Negative Binomial Negative Binomial (Higher-Order Interaction) 1.4036
## 8 MLR Linear Regression (Interaction) 1.3985
## 7 MLR Linear Regression (Stepwise) 1.4204
## 2 ZIP Zero-Inflated Poisson 1.4460
## 4 ZINB Zero-Inflated Negative Binomial 1.4460
## 9 MLR Linear Regression (Simplified) 1.5427
## Test_RMSE Train_R2 Test_R2
## 5 1.5498 0.4340 0.3597
## 6 1.5570 0.4474 0.3537
## 1 1.5845 0.4223 0.3307
## 3 1.5845 0.4223 0.3307
## 8 1.5873 0.4265 0.3283
## 7 1.6064 0.4085 0.3120
## 2 1.6248 0.3869 0.2962
## 4 1.6248 0.3869 0.2962
## 9 1.6748 0.3022 0.2522
# Visualize Test RMSE across models
library(ggplot2)
ggplot(final_metrics_sorted, aes(x = reorder(Model, Test_RMSE), y = Test_RMSE, fill = Model_Type)) +
geom_bar(stat = "identity") +
coord_flip() +
labs(title = "Test RMSE Comparison Across Models", x = "Models", y = "Test RMSE") +
theme_minimal()
# Residual Analysis for Polynomial Model
par(mfrow = c(2, 2)) # Set 2x2 plot layout
# Residuals
residuals_poly <- residuals(model_polynomial)
fitted_poly <- predict(model_polynomial, newdata = train_data_refined)
# Plot 1: Residuals vs Fitted Values
plot(fitted_poly, residuals_poly,
main = "Polynomial Model: Residuals vs Fitted",
xlab = "Fitted Values", ylab = "Residuals", pch = 20)
abline(h = 0, col = "red")
# Plot 2: Histogram of Residuals
hist(residuals_poly, breaks = 30, main = "Histogram of Residuals", xlab = "Residuals", col = "gray")
# Plot 3: QQ-Plot of Residuals
qqnorm(residuals_poly, main = "QQ-Plot of Residuals")
qqline(residuals_poly, col = "red")
# Plot 4: Scale-Location Plot
plot(fitted_poly, abs(residuals_poly)^0.5,
main = "Scale-Location Plot",
xlab = "Fitted Values", ylab = "Sqrt(|Residuals|)", pch = 20)
abline(h = 0, col = "red")
Conclusion: Similar to the polynomial model, this model exhibits non-constant variance and residual non-normality, signaling a need for further refinements or transformations.
# Residual Analysis for Higher-Order Interaction Linear Model
par(mfrow = c(2, 2)) # Set 2x2 plot layout
# Residuals
residuals_lm <- residuals(model_higher_interaction)
fitted_lm <- predict(model_higher_interaction, newdata = train_data_refined)
# Plot 1: Residuals vs Fitted Values
plot(fitted_lm, residuals_lm,
main = "Linear Model: Residuals vs Fitted",
xlab = "Fitted Values", ylab = "Residuals", pch = 20)
abline(h = 0, col = "red")
# Plot 2: Histogram of Residuals
hist(residuals_lm, breaks = 30, main = "Histogram of Residuals", xlab = "Residuals", col = "gray")
# Plot 3: QQ-Plot of Residuals
qqnorm(residuals_lm, main = "QQ-Plot of Residuals")
qqline(residuals_lm, col = "red")
# Plot 4: Scale-Location Plot
plot(fitted_lm, abs(residuals_lm)^0.5,
main = "Scale-Location Plot",
xlab = "Fitted Values", ylab = "Sqrt(|Residuals|)", pch = 20)
abline(h = 0, col = "red")
For Poisson models, deviance residuals are used instead of raw residuals.
Conclusion: The Poisson model struggles with overdispersion and heteroscedasticity, suggesting a need for a Negative Binomial or Zero-Inflated model to better capture the variance structure.
# Residual Analysis for Poisson Higher-Order Interaction Model
par(mfrow = c(2, 2)) # Set 2x2 plot layout
# Deviance Residuals
deviance_resid_poisson <- residuals(poisson_model_higher_interaction, type = "deviance")
fitted_poisson <- predict(poisson_model_higher_interaction, newdata = train_data_refined, type = "response")
# Plot 1: Residuals vs Fitted Values
plot(fitted_poisson, deviance_resid_poisson,
main = "Poisson Model: Residuals vs Fitted",
xlab = "Fitted Values", ylab = "Deviance Residuals", pch = 20)
abline(h = 0, col = "red")
# Plot 2: Histogram of Residuals
hist(deviance_resid_poisson, breaks = 30, main = "Histogram of Deviance Residuals", xlab = "Residuals", col = "gray")
# Plot 3: QQ-Plot of Residuals
qqnorm(deviance_resid_poisson, main = "QQ-Plot of Deviance Residuals")
qqline(deviance_resid_poisson, col = "red")
# Plot 4: Scale-Location Plot
plot(fitted_poisson, abs(deviance_resid_poisson)^0.5,
main = "Scale-Location Plot",
xlab = "Fitted Values", ylab = "Sqrt(|Deviance Residuals|)", pch = 20)
abline(h = 0, col = "red")
Conclusion: The Negative Binomial model effectively handles overdispersion and demonstrates better residual behavior than the Poisson model. There are minor issues with residual tails and mild heteroscedasticity, but overall, the model is a robust choice for the given data.
# Residual Analysis for Negative Binomial Higher-Order Interaction Model
par(mfrow = c(2, 2)) # Set plot layout to 2x2
# Plot 1: Residuals vs Fitted
plot(nb_model_higher_interaction$fitted.values,
residuals(nb_model_higher_interaction, type = "pearson"),
xlab = "Fitted Values", ylab = "Pearson Residuals",
main = "Negative Binomial: Residuals vs Fitted")
abline(h = 0, col = "red", lwd = 2)
# Plot 2: Histogram of Residuals
hist(residuals(nb_model_higher_interaction, type = "pearson"),
main = "Histogram of Pearson Residuals",
xlab = "Residuals", breaks = 50)
# Plot 3: QQ-Plot of Residuals
qqnorm(residuals(nb_model_higher_interaction, type = "pearson"),
main = "QQ-Plot of Residuals")
qqline(residuals(nb_model_higher_interaction, type = "pearson"), col = "red", lwd = 2)
# Plot 4: Scale-Location Plot
plot(nb_model_higher_interaction$fitted.values,
sqrt(abs(residuals(nb_model_higher_interaction, type = "pearson"))),
xlab = "Fitted Values", ylab = "Sqrt(|Pearson Residuals|)",
main = "Scale-Location Plot")
abline(h = 0, col = "red", lwd = 2)
par(mfrow = c(1, 1)) # Reset plot layout
Test RMSE Comparison:
While the Polynomial MLR has the lowest Test RMSE, the difference is minimal compared to the Higher-Order MLR and Poisson/Negative Binomial models.
Suitability for Count Data:
Residual Analysis:
Overdispersion:
The Negative Binomial Higher-Order Interaction Model is the best choice for prediction. It balances predictive accuracy, handles overdispersion effectively, and demonstrates better residual behavior. While the Polynomial MLR has a slightly lower Test RMSE, its assumptions are not appropriate for count data, and the Negative Binomial model is more robust and reliable.
For prediction, we will submit the predicted values using the Negative Binomial Higher-Order Interaction Model, as it is the most appropriate and robust choice. However, we will also generate predictions for the Polynomial MLR, Higher-Order Interaction MLR, and Poisson Higher-Order Interaction Model to visually compare their predictions for further analysis and insights.
# Prepare the evaluation dataset and ensure consistent row order
eval_data_prepared <- boxcox_eval_data %>%
dplyr::select(-TARGET, -INDEX) %>% # Drop TARGET and INDEX columns
dplyr::mutate(ID = boxcox_eval_data$INDEX) # Retain INDEX for prediction alignment
# Convert all matrix columns to numeric vectors in the evaluation dataset
eval_data_clean <- eval_data_prepared %>%
dplyr::mutate(across(where(is.matrix), ~ as.numeric(.))) # Convert matrices to numeric
# Predict using the Negative Binomial Higher-Order Interaction model
nb_higher_interaction_predictions <- predict(nb_model_higher_interaction,
newdata = eval_data_clean,
type = "response")
# Create a DataFrame with predictions
nb_predictions_df <- data.frame(
ID = eval_data_clean$ID, # Add back the ID column
Predictions = nb_higher_interaction_predictions
)
# Save predictions to CSV
write.csv(nb_predictions_df, "nb_higher_interaction_predictions.csv", row.names = FALSE)
cat("Negative Binomial (Higher-Order Interaction) predictions saved to 'nb_higher_interaction_predictions.csv'\n")
## Negative Binomial (Higher-Order Interaction) predictions saved to 'nb_higher_interaction_predictions.csv'
cat("\nPreview the prediction\n")
##
## Preview the prediction
glimpse(nb_predictions_df)
## Rows: 3,335
## Columns: 2
## $ ID <int> 3, 9, 10, 18, 21, 30, 31, 37, 39, 47, 60, 62, 63, 64, 68, …
## $ Predictions <dbl> 2.886945, 3.722064, 1.958305, 1.680948, 1.803885, 6.319411…
# Convert all matrix columns to numeric vectors in the evaluation dataset
eval_data_clean <- eval_data_prepared %>%
dplyr::mutate(across(where(is.matrix), ~ as.numeric(.))) # Convert matrices to numeric
# Predict using the Polynomial MLR model
mlr_polynomial_predictions <- predict(model_polynomial,
newdata = eval_data_clean)
# Create a DataFrame with predictions
mlr_polynomial_predictions_df <- data.frame(
ID = eval_data_clean$ID, # Add back the ID column
Predictions = mlr_polynomial_predictions
)
# Save predictions to CSV
write.csv(mlr_polynomial_predictions_df, "mlr_polynomial_predictions.csv", row.names = FALSE)
cat("\nMLR (Polynomial) predictions saved to 'mlr_polynomial_predictions.csv'\n")
##
## MLR (Polynomial) predictions saved to 'mlr_polynomial_predictions.csv'
cat("\nPreview the prediction\n")
##
## Preview the prediction
glimpse(mlr_polynomial_predictions_df)
## Rows: 3,335
## Columns: 2
## $ ID <int> 3, 9, 10, 18, 21, 30, 31, 37, 39, 47, 60, 62, 63, 64, 68, …
## $ Predictions <dbl> 2.347799, 3.603266, 2.370396, 1.735872, 1.658376, 5.786728…
# Convert all matrix columns to numeric vectors in the evaluation dataset
eval_data_clean <- eval_data_prepared %>%
dplyr::mutate(across(where(is.matrix), ~ as.numeric(.))) # Convert matrices to numeric
# Predict using the Higher-Order Interaction MLR model
mlr_higher_interaction_predictions <- predict(model_higher_interaction,
newdata = eval_data_clean)
# Create a DataFrame with predictions
mlr_predictions_df <- data.frame(
ID = eval_data_clean$ID, # Add back the ID column
Predictions = mlr_higher_interaction_predictions
)
# Save predictions to CSV
write.csv(mlr_predictions_df, "mlr_higher_interaction_predictions.csv", row.names = FALSE)
cat("MLR (Higher-Order Interaction) predictions saved to 'mlr_higher_interaction_predictions.csv'\n")
## MLR (Higher-Order Interaction) predictions saved to 'mlr_higher_interaction_predictions.csv'
cat("\nPreview the prediction\n")
##
## Preview the prediction
glimpse(mlr_predictions_df)
## Rows: 3,335
## Columns: 2
## $ ID <int> 3, 9, 10, 18, 21, 30, 31, 37, 39, 47, 60, 62, 63, 64, 68, …
## $ Predictions <dbl> 2.748037, 3.651464, 1.942896, 1.760075, 1.658487, 5.983839…
# Convert all matrix columns to numeric vectors in the evaluation dataset
eval_data_clean <- eval_data_prepared %>%
dplyr::mutate(across(where(is.matrix), ~ as.numeric(.))) # Convert matrices to numeric
# Predict using the Poisson Higher-Order Interaction model
poisson_higher_interaction_predictions <- predict(poisson_model_higher_interaction,
newdata = eval_data_clean,
type = "response")
# Create a DataFrame with predictions
poisson_predictions_df <- data.frame(
ID = eval_data_clean$ID, # Add back the ID column
Predictions = poisson_higher_interaction_predictions
)
# Save predictions to CSV
write.csv(poisson_predictions_df, "poisson_higher_interaction_predictions.csv", row.names = FALSE)
cat("Poisson (Higher-Order Interaction) predictions saved to 'poisson_higher_interaction_predictions.csv'\n")
## Poisson (Higher-Order Interaction) predictions saved to 'poisson_higher_interaction_predictions.csv'
cat("\nPreview the prediction\n")
##
## Preview the prediction
glimpse(poisson_predictions_df)
## Rows: 3,335
## Columns: 2
## $ ID <int> 3, 9, 10, 18, 21, 30, 31, 37, 39, 47, 60, 62, 63, 64, 68, …
## $ Predictions <dbl> 2.886941, 3.722066, 1.958307, 1.680945, 1.803893, 6.319380…
The pairwise scatter plot shows a strong alignment
between predictions across all four models:
- Negative Binomial Higher-Order, MLR
Polynomial, MLR Higher-Order, and
Poisson Higher-Order Interaction.
While predictions from all models are highly correlated, the Negative Binomial Higher-Order Interaction Model remains the preferred model for final prediction due to its robustness in handling overdispersion and suitability for the data.
# Combine predictions into a single data frame
final_predictions_df <- data.frame(
ID = nb_predictions_df$ID, # Ensure alignment with ID
NegativeBinomial_HigherOrder = nb_predictions_df$Predictions,
MLR_Polynomial = mlr_polynomial_predictions_df$Predictions,
MLR_HigherOrder = mlr_predictions_df$Predictions,
Poisson_HigherOrder = poisson_predictions_df$Predictions
)
# Display summary of predictions
cat("Summary of Predictions for the Four Models:\n")
## Summary of Predictions for the Four Models:
print(summary(final_predictions_df[, -1])) # Exclude ID column for summary
## NegativeBinomial_HigherOrder MLR_Polynomial MLR_HigherOrder
## Min. :0.9045 Min. :-0.5212 Min. :0.506
## 1st Qu.:2.2338 1st Qu.: 2.3704 1st Qu.:2.265
## Median :2.8890 Median : 3.0215 Median :2.914
## Mean :3.1507 Mean : 3.1454 Mean :3.147
## 3rd Qu.:3.8618 3rd Qu.: 3.8219 3rd Qu.:3.903
## Max. :9.2831 Max. : 7.1249 Max. :8.043
## Poisson_HigherOrder
## Min. :0.9045
## 1st Qu.:2.2338
## Median :2.8890
## Mean :3.1506
## 3rd Qu.:3.8618
## Max. :9.2830
# Generate pairwise scatter plots
pairs(
final_predictions_df[, -1], # Exclude ID column for scatter plots
main = "Pairwise Scatter Plot of Model Predictions",
pch = 21,
bg = c("blue", "green", "red", "purple")
)
# Save combined predictions to a CSV
write.csv(final_predictions_df, "combined_predictions.csv", row.names = FALSE)
cat("Combined predictions saved to 'combined_predictions.csv'\n")
## Combined predictions saved to 'combined_predictions.csv'