#Libraries
library(DHARMa)
library(kableExtra)
library(knitr)
library(readr)
library(tidyverse)
library(corrplot)
library(dplyr)
library(GGally)
library(caret)
library(pROC)
library(glmnet)
library(MASS)
library(car)
library(correlationfunnel)
library(faraway)
library(arm)
library(performance)
library(see)
library(reshape2)
library(readr)
#library(tidymodels)
#library(rms)
library(smotefamily)
library(themis)
library(skimr)
library(DataExplorer)
library(naniar)
library(mice)
library(corrr)
library(FactoMineR)
library(ggcorrplot)
library(factoextra)
In this homework assignment, you will explore, analyze and model a data set containing information on approximately 12,000 commercially available wines. The variables are mostly related to the chemical properties of the wine being sold. The response variable is the number of sample cases of wine that were purchased by wine distribution companies after sampling a wine. These cases would be used to provide tasting samples to restaurants and wine stores around the United States. The more sample cases purchased, the more likely is a wine to be sold at a high end restaurant. A large wine manufacturer is studying the data in order to predict the number of wine cases ordered based upon the wine characteristics. If the wine manufacturer can predict the number of cases, then that manufacturer will be able to adjust their wine offering to maximize sales.
Your objective is to build a count regression model to predict the number of cases of wine that will be sold given certain properties of the wine. HINT: Sometimes, the fact that a variable is missing is actually predictive of the target. You can only use the variables given to you (or variables that you derive from the variables provided). Below is a short description of the variables of interest in the data set:
INDEX: Identification Variable (do not use)
AcidIndex: Proprietary method of testing total acidity of
wine by using a weighted average Alcohol: Alcohol Content
Chlorides: Chloride content of wine
CitricAcid: Citric Acid Content Density:
Density of Wine FixedAcidity: Fixed Acidity of Wine
FreeSulfurDioxide: Sulfur Dioxide content of wine
LabelAppeal: Marketing Score indicating the appeal of label
design for consumers. High numbers suggest customers like the label
design. Negative numbers suggest customes don’t like the design. Many
consumers purchase based on the visual appeal of the wine label design.
Higher numbers suggest better sales. ResidualSugar:
Residual Sugar of wine STARS: Wine rating by a team of
experts. 4 Stars = Excellent, 1 Star = Poor. A high number of stars
suggests high sales Sulphates: Sulfate content of wine
TotalSulfurDioxide: Total Sulfur Dioxide of Wine
VolatileAcidity: Volatile Acid content of wine
pH: pH of wine
raw_data <- read.csv("https://raw.githubusercontent.com/gillianmcgovern0/cuny-data-621/refs/heads/main/wine-training-data.csv")
eval_data <- read.csv("https://raw.githubusercontent.com/gillianmcgovern0/cuny-data-621/refs/heads/main/wine-evaluation-data.csv")
The wine training dataset contains 12,795 observations and 16 variables, where each observation is an available wine at the wine company. All the features are numerical - LabelAppeal and STARS is an ordinal variable (scores).
Missing data:
Only 50.3% of all rows are not completely missing. Given the 50.3% complete rows, there are only 4% total missing observations.
The missing data breakdown bar chart shows that pH, ResidualSugar, Chlorides, FreeSulfurDioxide, Alcohol, TotalSulfurDioxide, Sulphates and STARS have missing values. STARS has the most missing values, yet there could be a reason this value is blank given it’s a review. Looking at the missing factor plot, we can see there is a clear pattern in the heatmap for STARS - missing data strongly relates to a lower TARGET value. People who do not like the wine might also choose to not give a review at all. The missing upset plot does not show any interesting missing data patterns between predictors.
There are no duplicate observations.
Summary statistics of the predictor variables in the training dataset are included in the table below. Key metrics such as minimum, maximum, mean, median, and standard deviation (SD) help us understand the range, central tendency, and variability of each variable. The predictors have a wide range of spread. Many show a spread with negative values which does not makes sense. That will need to be addressed later.
Model Specific Checks:
Variance-to-mean ratio: Ratios between 0.5 and 1.5 are generally acceptable for a Poisson distribution, so a value of 1.2 means we can use the poisson regression model. Yet, greater than 1 means there is overdispersion, in which case the poisson regression model might not be the best. Negative binomial regression will also be tested which should help with overdispersion.
# Visualize data types
plot_str(raw_data)
# Visualize EDA
plot_intro(raw_data)
# Visualize Missing Data
plot_missing(raw_data)
# Check how missingness relates to other variables
gg_miss_upset(raw_data, nsets = 10)
# Check how missingness affects TARGET
gg_miss_fct(x = raw_data, fct = TARGET)
# Check mean vs variance (should be similar for Poisson Regression Model)
mean(raw_data$TARGET)
## [1] 3.029074
var(raw_data$TARGET)
## [1] 3.710895
var(raw_data$TARGET) / mean(raw_data$TARGET) # Should be close to 1
## [1] 1.225092
# Glimpse of the data
head(raw_data)
## INDEX TARGET FixedAcidity VolatileAcidity CitricAcid ResidualSugar Chlorides
## 1 1 3 3.2 1.160 -0.98 54.2 -0.567
## 2 2 3 4.5 0.160 -0.81 26.1 -0.425
## 3 4 5 7.1 2.640 -0.88 14.8 0.037
## 4 5 3 5.7 0.385 0.04 18.8 -0.425
## 5 6 4 8.0 0.330 -1.26 9.4 NA
## 6 7 0 11.3 0.320 0.59 2.2 0.556
## FreeSulfurDioxide TotalSulfurDioxide Density pH Sulphates Alcohol
## 1 NA 268 0.99280 3.33 -0.59 9.9
## 2 15 -327 1.02792 3.38 0.70 NA
## 3 214 142 0.99518 3.12 0.48 22.0
## 4 22 115 0.99640 2.24 1.83 6.2
## 5 -167 108 0.99457 3.12 1.77 13.7
## 6 -37 15 0.99940 3.20 1.29 15.4
## LabelAppeal AcidIndex STARS
## 1 0 8 2
## 2 -1 7 3
## 3 -1 8 3
## 4 -1 6 1
## 5 0 9 2
## 6 0 11 NA
#the Index column had no impact on the target variable, it was dropped
raw_data <- raw_data[ , -1]
# Check for duplicates
duplicates <- duplicated(raw_data)
# Print the duplicates
print(raw_data[duplicates, ])
## [1] TARGET FixedAcidity VolatileAcidity CitricAcid
## [5] ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide
## [9] Density pH Sulphates Alcohol
## [13] LabelAppeal AcidIndex STARS
## <0 rows> (or 0-length row.names)
FixedAcidity, VolatileAcidity, CitricAcid, ResidualSugar, Chlorides, FreeSulfurDioxide, TotalSulfurDioxide, Sulphates and Alcohol contain negative values which, according to online research (and common knowledge), does not make sense. We will treat these as invalid, and therefore convert these values to missing/NA. LabelAppeal is a score and can be negative.
Approximately 21.37% of the TARGET values are zeros, which supports the appropriateness of checking for Zero-Inflated Poisson (ZIP) or Negative Binomial models for this data.
#Converting ordinal variables as factor data type.
raw_data$LabelAppeal <- as.factor(raw_data$LabelAppeal)
raw_data$AcidIndex <- as.factor(raw_data$AcidIndex)
raw_data$STARS <- as.factor(raw_data$STARS)
# raw_data$STARS <- factor(raw_data$STARS,
# ordered = TRUE,
# levels = c(1, 2, 3, 4))
# If value is negative (most likely inputted incorrectly), update to NA
raw_data$FixedAcidity = ifelse(raw_data$FixedAcidity < 0, NA, raw_data$FixedAcidity)
raw_data$VolatileAcidity = ifelse(raw_data$VolatileAcidity < 0, NA, raw_data$VolatileAcidity)
raw_data$CitricAcid = ifelse(raw_data$CitricAcid < 0, NA, raw_data$CitricAcid)
raw_data$ResidualSugar = ifelse(raw_data$ResidualSugar < 0, NA, raw_data$ResidualSugar)
raw_data$Chlorides = ifelse(raw_data$Chlorides < 0, NA, raw_data$Chlorides)
raw_data$FreeSulfurDioxide = ifelse(raw_data$FreeSulfurDioxide < 0, NA, raw_data$FreeSulfurDioxide)
raw_data$TotalSulfurDioxide = ifelse(raw_data$TotalSulfurDioxide < 0, NA, raw_data$TotalSulfurDioxide)
raw_data$Sulphates = ifelse(raw_data$Sulphates < 0, NA, raw_data$Sulphates)
raw_data$Alcohol = ifelse(raw_data$Alcohol < 0, NA, raw_data$Alcohol)
# Summary statistics
raw_data %>%
summary() %>%
kable(caption = "Descriptive Statistics of Predictor Variables") %>%
kable_styling()
| TARGET | FixedAcidity | VolatileAcidity | CitricAcid | ResidualSugar | Chlorides | FreeSulfurDioxide | TotalSulfurDioxide | Density | pH | Sulphates | Alcohol | LabelAppeal | AcidIndex | STARS | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Min. :0.000 | Min. : 0.000 | Min. :0.0000 | Min. :0.0000 | Min. : 0.00 | Min. :0.000 | Min. : 0.00 | Min. : 0.0 | Min. :0.8881 | Min. :0.480 | Min. :0.000 | Min. : 0.00 | -2: 504 | 7 :4878 | 1 :3042 | |
| 1st Qu.:2.000 | 1st Qu.: 6.200 | 1st Qu.:0.2400 | 1st Qu.:0.2800 | 1st Qu.: 2.40 | 1st Qu.:0.042 | 1st Qu.: 25.00 | 1st Qu.: 100.0 | 1st Qu.:0.9877 | 1st Qu.:2.960 | 1st Qu.:0.440 | 1st Qu.: 9.00 | -1:3136 | 8 :4142 | 2 :3570 | |
| Median :3.000 | Median : 7.200 | Median :0.3600 | Median :0.3900 | Median : 8.70 | Median :0.063 | Median : 43.00 | Median : 150.0 | Median :0.9945 | Median :3.200 | Median :0.575 | Median :10.40 | 0 :5617 | 9 :1427 | 3 :2212 | |
| Mean :3.029 | Mean : 8.668 | Mean :0.6195 | Mean :0.6474 | Mean : 19.39 | Mean :0.188 | Mean : 91.67 | Mean : 204.9 | Mean :0.9942 | Mean :3.208 | Mean :0.863 | Mean :10.61 | 1 :3048 | 6 :1197 | 4 : 612 | |
| 3rd Qu.:4.000 | 3rd Qu.:10.400 | 3rd Qu.:0.8500 | 3rd Qu.:0.8500 | 3rd Qu.: 29.50 | 3rd Qu.:0.289 | 3rd Qu.:131.00 | 3rd Qu.: 258.0 | 3rd Qu.:1.0005 | 3rd Qu.:3.470 | 3rd Qu.:1.140 | 3rd Qu.:12.40 | 2 : 490 | 10 : 551 | NA’s:3359 | |
| Max. :8.000 | Max. :34.400 | Max. :3.6800 | Max. :3.8600 | Max. :141.15 | Max. :1.351 | Max. :623.00 | Max. :1057.0 | Max. :1.0992 | Max. :6.130 | Max. :4.240 | Max. :26.50 | NA | 11 : 258 | NA | |
| NA | NA’s :1621 | NA’s :2827 | NA’s :2966 | NA’s :3752 | NA’s :3835 | NA’s :3683 | NA’s :3186 | NA | NA’s :395 | NA’s :3571 | NA’s :771 | NA | (Other): 342 | NA |
TARGET is bimodal with a peak at 4 and 0 cases purchased - frequently either people purchase a couple of cases, or none at all. This could indicate a potential excess of zeros issue with our models.
Many numeric predictors are unimodal with heavy tails. These include Chlorides, CitricAcid, Density, FixedAcidity, FreeSulfurDioxide, pH, ResidualSugar, Sulphates, TotalSulfurDioxide, and VolatileAcidity. AcidIndex and Alochol is right skewed, and LabelAppeal looks normal. For the heavy tailed and skewed predictors transformation will need to be added.
As we saw earlier, predictors have a wide range of spread. For example, Sulphates has a very small spread compared to FreeSulfurDioxide which has a much larger spread.
The QQ plots and the box plots show the heavy tails shown in the histograms (many extreme values). However, based on the feature meanings and provided information, there is no reason to believe that any of these extreme values are mistakes or data errors. As such, we will not remove the extreme values, as they could be predictive of the target. For a higher TARGET (6-8 cases purchased), we can see that the spread for many of the predictors is smaller, but that could just be due to the fact there is not as many observations. For most of the predictors, there is not a clear value for higher cases purchased as it is around the same for each “bucket” of cases purchased.
Unsurprisingly, a lower AcidIndex and higher amount of STARS and LabelAppeal have higher TARGET values.
For the ordinal features, the most frequent LabelAppeal rating is 0 meaning a neutral feeling about label, and -1 and 1 rating frequencies are almost equal. Interestingly, -2 and 2 for LabelAppeal also have almost equal frequencies. The most frequent AcidIndex values are 7 and 8 (mid-range) and the most frequent values for STARS is 2 with NA being a close second, which means ratings are on the lower side. A STARS rating of 4 is the least frequent rating, which means highly rated wines are harder to come by.
# Plot histograms
plot_histogram(raw_data)
# Categorical Variables
data_raw_categorical <- raw_data |>
dplyr::select(where(is.factor) | where(is.character))
categorical_predictors <- names(data_raw_categorical)[names(data_raw_categorical) != "TARGET"]
plot_bar(data_raw_categorical)
data_raw_numeric <- raw_data |>
dplyr::select(where(is.numeric))
numerical_predictors <- names(data_raw_numeric)
# Closer look at target distribution (# of orders)
hist(as.numeric(raw_data$TARGET), main="TARGET Distribution",
xlab="Cases Purchased", col="lightblue", breaks=100)
# QQ Plot
plot_qq(raw_data, sampled_rows = 1000L)
# Box plots - visualize outliers
ggplot(stack(data_raw_numeric), aes(x = ind, y = values)) +
geom_boxplot(color = 'skyblue', outlier.color = 'red') +
coord_cartesian(ylim = c(0, 1250)) +
theme(legend.position = "none",
axis.text.x = element_text(angle = 45, hjust = 1),
panel.background = element_rect(fill = 'grey96')) +
labs(title = "Boxplots of Predictor Variables", x="Predictors")
# Box plots by TARGET
plot_boxplot(raw_data, by = "TARGET", ggtheme = theme_light())
# Scatter plots by TARGET
plot_scatterplot(raw_data, by = "TARGET", sampled_rows = 1000L)
Not too many features have high correlation with the TARGET and with each other.
Positive correlations with TARGET (Cases purchased): 1. STARS: 0.56 (a positive review for the wine indicates someone is more likely purchase the wine) 2. LabelAppeal: 0.36 (Better bottle design, more likely to purchase the wine) 3. Alcohol: 0.06 (Higher alcohol content leads to more purchases) 4. AcidIndex: -0.25 (Higher total acidity wines are less likely to be purchased)
Negative correlations with TARGET: 1. AcidIndex: -0.18 (Higher acid index is less likely to be purchased) 2. VolatileAcidity: -0.10 (Higher volatile acidity is less likely to be purchased) 3. FixedAcidity: -0.06 (Higher fixed acidity is less likely to be purchased)
Correlation Among Predictors:We see that the features have very low correlations with each other, meaning that there is not much multicollinearity present in the dataset. This means that the assumptions of linear regression are more likely to be met. 1. LabelAppeal <-> STARS: 0.33 (Better bottle design, more likely for a positive review) 2. FixedAcidity <-> AcidIndex: 0.19 (Higher fixed acidity, higher acid index) 3. AcidIndex <-> STARS: -0.09 (Higher acid index, lower review) 4. CitricAcid <-> AcidIndex: 0.06 (Higher citric acid, higher acid index) 5. Alcohol <-> STARS: 0.06 (Higher alcohol content, more likely for a positive review)
Initial business insights:
The correlation funnel shows that the following values for the predictors correspond to a high TARGET (4 cases purchased and above):
Two-way Cross-Tabulations - Categorical Variables: Based on the p-value, all the categorical variables are statistically significant.
Lastly, the relationship scatter plots did not show any interesting relationships between predictors.
#Correlation matrix with target
numeric_vars <- raw_data %>% select_if(is.numeric)
cor_matrix <- cor(numeric_vars, use="pairwise.complete.obs")
corr_target <- cor_matrix[,"TARGET"]
corr_target_sorted <- sort(corr_target, decreasing = TRUE)
kable(as.data.frame(corr_target_sorted), col.names = c("Correlation with Target"), digits = 2)
| Correlation with Target | |
|---|---|
| TARGET | 1.00 |
| Alcohol | 0.07 |
| TotalSulfurDioxide | 0.06 |
| FreeSulfurDioxide | 0.05 |
| CitricAcid | 0.02 |
| ResidualSugar | 0.01 |
| pH | -0.01 |
| Density | -0.04 |
| Sulphates | -0.04 |
| Chlorides | -0.05 |
| FixedAcidity | -0.06 |
| VolatileAcidity | -0.10 |
#Correlation heatmap
corrplot::corrplot(cor_matrix, method = "color", type = "upper",
tl.col = "black", tl.srt = 45, addCoef.col = "black",
number.cex = 0.7, diag = FALSE)
# Correlation top 5 positive and negative
corr_target <- cor_matrix[, "TARGET"]
corr_target <- corr_target[names(corr_target) != "TARGET"] # Remove self-correlation
top_pos <- sort(corr_target, decreasing = TRUE)[1:3]
top_neg <- sort(corr_target, decreasing = FALSE)[1:3]
combined_df <- data.frame(
Feature = c(names(top_pos), names(top_neg)),
Correlation = c(top_pos, top_neg)) %>%
mutate(Direction = ifelse(Correlation > 0, "Positive", "Negative"))
ggplot(combined_df, aes(x = reorder(Feature, Correlation), y = Correlation, fill = Direction)) +
geom_bar(stat = "identity") +
coord_flip() +
scale_fill_manual(values = c("Positive" = "skyblue", "Negative" = "salmon")) +
labs(title = "Top Features Correlated with TARGET",
x = "Feature",
y = "Correlation with Target",
fill = "Direction") +
theme_minimal()
# Correlation among predictors
melt_cor <- melt(cor_matrix)
melt_cor <- melt_cor |>
filter(Var1 != "TARGET") |>
filter(Var2 != "TARGET")
filtered_cor <- melt_cor[melt_cor$Var1 != melt_cor$Var2 & as.numeric(melt_cor$Var1) < as.numeric(melt_cor$Var2), ]
sorted_cor <- filtered_cor[order(abs(filtered_cor$value), decreasing = TRUE), ]
head(sorted_cor)
## Var1 Var2 value
## 68 VolatileAcidity TotalSulfurDioxide -0.04959490
## 117 TotalSulfurDioxide Alcohol -0.04003116
## 70 ResidualSugar TotalSulfurDioxide 0.03744668
## 100 FixedAcidity Sulphates 0.03204314
## 72 FreeSulfurDioxide TotalSulfurDioxide 0.03082415
## 104 Chlorides Sulphates 0.02345348
# Correlation Funnel
set.seed(123)
raw_data_v2 <- raw_data |>
dplyr::select(-STARS)
raw_data_v2 <- na.omit(raw_data_v2)
raw_data_binarized_tbl <- raw_data_v2 %>%
binarize(n_bins = 4, thresh_infreq = 0.01)
raw_data_correlated_tbl <- raw_data_binarized_tbl %>%
correlationfunnel::correlate(target = TARGET__4_Inf)
raw_data_correlated_tbl %>%
correlationfunnel::plot_correlation_funnel(interactive = FALSE)
## Warning: The `size` argument of `element_rect()` is deprecated as of ggplot2 3.4.0.
## ℹ Please use the `linewidth` argument instead.
## ℹ The deprecated feature was likely used in the correlationfunnel package.
## Please report the issue at
## <https://github.com/business-science/correlationfunnel/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: ggrepel: 5 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
ggpairs(data_raw_numeric, progress = FALSE)
Two-way Cross-Tabulations - Categorical Variables:
# Comparing categorical features with the target variable - Chi-square
# got this piece of code from the internet
# Loop through categorical variables, and apply chi square test for significance
results <- map_dfr(categorical_predictors, function(var) {
tibble(
variable = var,
chisq_p_value = chisq.test(table(raw_data[[var]], raw_data$TARGET))$p.value
)
})
results %>%
arrange(desc(chisq_p_value)) %>%
kbl() %>%
kable_styling(full_width = TRUE)
| variable | chisq_p_value |
|---|---|
| AcidIndex | 0 |
| LabelAppeal | 0 |
| STARS | 0 |
Besides STARS, let’s replace the missing values with the median. MICE imputation is complex, slow, and can create unrealistic synthetic values. For Poisson regression, it often does not outperform simpler methods unless you know the missingness mechanism.
To handle missing data for STARS, we’ll test out 2 options:
# Fill in NA with median
data_imputed <- raw_data %>%
mutate_at(vars(c("FixedAcidity", "VolatileAcidity", "CitricAcid", "ResidualSugar", "Chlorides", "FreeSulfurDioxide", "TotalSulfurDioxide", "pH", "Sulphates", "Alcohol")), ~ifelse(is.na(.), median(., na.rm = TRUE), .))
# Check for missing values again
plot_missing(data_imputed )
# Handling STARS
# Option 1: STARS_new
# Create a new rating value of 0
data_imputed$STARS_new = ifelse(is.na(data_imputed$STARS), 0, data_imputed$STARS)
# Option 2: is_stars_missing & STARS_imputed
# Create new missing indicator variable
data_imputed <- data_imputed %>%
mutate(is_stars_missing = is.na(STARS))
# Convert STARS to numeric as it is a factor
data_imputed$STARS <- as.numeric(as.character(data_imputed$STARS))
# impute STARS missing values using the median
data_imputed <- data_imputed %>%
mutate(STARS_imputed = coalesce(STARS, median(STARS, na.rm = TRUE)))
# Convert missing indicator and STARS variables to factor
data_imputed$is_stars_missing <- as.factor(data_imputed$is_stars_missing)
data_imputed$STARS <- as.factor(data_imputed$STARS)
# data_imputed$STARS <- factor(data_imputed$STARS,
# ordered = TRUE,
# levels = c(1, 2, 3, 4))
data_imputed$STARS_new <- as.factor(data_imputed$STARS_new)
# data_imputed$STARS_new <- factor(data_imputed$STARS_new,
# ordered = TRUE,
# levels = c(0, 1, 2, 3, 4))
data_imputed$STARS_imputed <- as.factor(data_imputed$STARS_imputed)
# Check for missing values again
plot_missing(data_imputed )
ResidualSugar’s distribution is the only one that looks different, but that should be okay for this case. All the STARS related variables show correlation, but that is to be expected. Nothing else looks like it has changed too drastically after the transformations.
# Plot histograms
plot_histogram(data_imputed)
# Plot correlations
plot_correlation(data_imputed)
# Comparing categorical features with the target variable - Chi-square
# got this piece of code from the internet
# Loop through categorical variables, and apply chi square test for significance
data_imputed_categorical <- data_imputed |>
dplyr::select(where(is.factor) | where(is.character))
data_imputed_categorical_predictors <- names(data_imputed_categorical)[names(data_imputed_categorical) != "TARGET"]
results <- map_dfr(data_imputed_categorical_predictors, function(var) {
tibble(
variable = var,
chisq_p_value = chisq.test(table(data_imputed[[var]], data_imputed$TARGET))$p.value
)
})
## Warning in chisq.test(table(data_imputed[[var]], data_imputed$TARGET)):
## Chi-squared approximation may be incorrect
## Warning in chisq.test(table(data_imputed[[var]], data_imputed$TARGET)):
## Chi-squared approximation may be incorrect
## Warning in chisq.test(table(data_imputed[[var]], data_imputed$TARGET)):
## Chi-squared approximation may be incorrect
## Warning in chisq.test(table(data_imputed[[var]], data_imputed$TARGET)):
## Chi-squared approximation may be incorrect
## Warning in chisq.test(table(data_imputed[[var]], data_imputed$TARGET)):
## Chi-squared approximation may be incorrect
## Warning in chisq.test(table(data_imputed[[var]], data_imputed$TARGET)):
## Chi-squared approximation may be incorrect
results %>%
arrange(desc(chisq_p_value)) %>%
kbl() %>%
kable_styling(full_width = TRUE)
| variable | chisq_p_value |
|---|---|
| AcidIndex | 0 |
| LabelAppeal | 0 |
| STARS | 0 |
| STARS_new | 0 |
| is_stars_missing | 0 |
| STARS_imputed | 0 |
Use log, square root, and box cox transformations for the skewed predictors. The histograms and QQ plots definitely show an improvement.
data_transform <- data_imputed
# Log and sqrt transformation
data_transform <- data_transform%>%
mutate(
# LOG_AcidIndex = log(data_transform$AcidIndex + 1),
SQRT_Alcohol = sqrt(data_transform$Alcohol +1),
LOG_Chlorides = log(data_transform$Chlorides + 1),
# ASINH_Chlorides = asinh(data_transform$Chlorides + 1),
LOG_CitricAcid = log(data_transform$CitricAcid + 1),
SQRT_FixedAcidity = sqrt(data_transform$FixedAcidity +1),
LOG_FreeSulfurDioxide = log(data_transform$FreeSulfurDioxide + 1),
LOG_ResidualSugar = log(data_transform$ResidualSugar + 1),
LOG_Sulphates = log(data_transform$Sulphates + 1),
LOG_TotalSulfurDioxide = log(data_transform$TotalSulfurDioxide + 1),
LOG_VolatileAcidity = log(data_transform$VolatileAcidity + 1)
)
# Box Cox transformation also being checked if its any favorable
bc_model_1 <- BoxCoxTrans(data_transform$CitricAcid + 1)
bc_model_2 <- BoxCoxTrans(data_transform$Chlorides + 1)
bc_model_3 <- BoxCoxTrans(data_transform$VolatileAcidity + 1)
bc_model_4 <- BoxCoxTrans(data_transform$Sulphates + 1)
data_transform$BC_CitricAcid <- predict(bc_model_1, data_transform$CitricAcid + 1)
data_transform$BC_Chlorides <- predict(bc_model_2, data_transform$Chlorides + 1)
data_transform$BC_VolatileAcidity <- predict(bc_model_3, data_transform$VolatileAcidity + 1)
data_transform$BC_Sulphates <- predict(bc_model_4, data_transform$Sulphates + 1)
data_transform |>
dplyr::select(where(is.numeric)) |>
pivot_longer(cols = everything(), names_to = "Feature", values_to = "Value") |>
filter(!is.na(Value)) |>
ggplot(aes(x = Value)) +
geom_histogram(bins = 30, fill = "skyblue", color = "black") +
facet_wrap(~Feature, scales = "free") +
labs(title = "Histograms of Numerical Features", x = NULL, y = "Frequency") +
theme_minimal()
# QQ Plot
plot_qq(data_transform, sampled_rows = 1000L)
To limit the amount of data/predictors, create new feature variables which combines the existing features.
# 4. Create new features
# total acidity -- combine all acidity predictors
data_transform$CombinedAcidity <- data_transform$VolatileAcidity + data_transform$FixedAcidity + data_transform$CitricAcid
YES, there is imbalance—but this is NORMAL for count data. Hence we will not be changing anything.
The distribution is not dominated by zeros as it is not > 50% . Hence looks like it is not Zero Inflated.
table(data_transform$TARGET)
##
## 0 1 2 3 4 5 6 7 8
## 2734 244 1091 2611 3177 2014 765 142 17
prop.table(table(data_transform$TARGET))
##
## 0 1 2 3 4 5
## 0.213677218 0.019069949 0.085267683 0.204064088 0.248300117 0.157405236
## 6 7 8
## 0.059788980 0.011098085 0.001328644
To ensure objective model evaluation and prevent overfitting, we split each dataset into training (80%) and testing (20%) sets using a consistent random seed (set.seed(123)) for reproducibility.
# Split data for validation
set.seed(123)
train_idx <- createDataPartition(y = raw_data$TARGET, p = 0.8, list = FALSE)
# Create train/test splits for each dataset
train <- raw_data[train_idx, ]
test <- raw_data[-train_idx, ]
# Imputed predictors dataset
train_imputed <- data_imputed[train_idx, ]
test_imputed <- data_imputed[-train_idx, ]
# Transformed predictors dataset
train_transformed <- data_transform[train_idx, ]
test_transformed <- data_transform[-train_idx, ]
This model is the base model, but since poisson regression models can’t have missing data, we test out the first option we created for handling STARS.
set.seed(123)
# Model 1 - All original predictors except for STARS_new since poisson regression models can't have missing values
# Poisson Regression Model
pr_model1a <- glm(TARGET ~ AcidIndex + Alcohol + Chlorides +
CitricAcid + Density + FixedAcidity + FreeSulfurDioxide +
LabelAppeal + ResidualSugar + STARS_new + Sulphates + TotalSulfurDioxide +
VolatileAcidity + pH,
family = poisson, data = train_transformed)
summary(pr_model1a)
##
## Call:
## glm(formula = TARGET ~ AcidIndex + Alcohol + Chlorides + CitricAcid +
## Density + FixedAcidity + FreeSulfurDioxide + LabelAppeal +
## ResidualSugar + STARS_new + Sulphates + TotalSulfurDioxide +
## VolatileAcidity + pH, family = poisson, data = train_transformed)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.320e-01 3.832e-01 0.606 0.54481
## AcidIndex5 -1.011e-01 3.244e-01 -0.312 0.75540
## AcidIndex6 -9.765e-02 3.176e-01 -0.307 0.75846
## AcidIndex7 -1.287e-01 3.173e-01 -0.406 0.68494
## AcidIndex8 -1.659e-01 3.173e-01 -0.523 0.60104
## AcidIndex9 -2.679e-01 3.178e-01 -0.843 0.39912
## AcidIndex10 -4.304e-01 3.191e-01 -1.349 0.17742
## AcidIndex11 -7.457e-01 3.232e-01 -2.307 0.02106 *
## AcidIndex12 -7.536e-01 3.294e-01 -2.288 0.02216 *
## AcidIndex13 -5.981e-01 3.321e-01 -1.801 0.07171 .
## AcidIndex14 -7.065e-01 3.495e-01 -2.022 0.04322 *
## AcidIndex15 -2.394e-01 4.040e-01 -0.593 0.55349
## AcidIndex16 -9.554e-01 5.488e-01 -1.741 0.08171 .
## AcidIndex17 -1.276e+01 1.269e+02 -0.101 0.91991
## Alcohol 5.128e-03 1.662e-03 3.085 0.00204 **
## Chlorides -6.947e-02 2.961e-02 -2.346 0.01898 *
## CitricAcid 1.179e-02 1.046e-02 1.128 0.25942
## Density -2.576e-01 2.128e-01 -1.210 0.22618
## FixedAcidity -3.026e-04 1.282e-03 -0.236 0.81341
## FreeSulfurDioxide 1.511e-04 6.097e-05 2.479 0.01318 *
## LabelAppeal-1 2.308e-01 4.175e-02 5.528 3.23e-08 ***
## LabelAppeal0 4.200e-01 4.066e-02 10.331 < 2e-16 ***
## LabelAppeal1 5.583e-01 4.143e-02 13.477 < 2e-16 ***
## LabelAppeal2 6.883e-01 4.690e-02 14.676 < 2e-16 ***
## ResidualSugar 1.908e-04 2.750e-04 0.694 0.48795
## STARS_new1 7.757e-01 2.194e-02 35.360 < 2e-16 ***
## STARS_new2 1.088e+00 2.055e-02 52.936 < 2e-16 ***
## STARS_new3 1.209e+00 2.162e-02 55.908 < 2e-16 ***
## STARS_new4 1.316e+00 2.738e-02 48.070 < 2e-16 ***
## Sulphates -1.033e-02 9.858e-03 -1.048 0.29449
## TotalSulfurDioxide 1.065e-04 3.880e-05 2.745 0.00605 **
## VolatileAcidity -5.714e-02 1.173e-02 -4.873 1.10e-06 ***
## pH -7.310e-03 8.525e-03 -0.858 0.39116
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 18252 on 10237 degrees of freedom
## Residual deviance: 10742 on 10205 degrees of freedom
## AIC: 36376
##
## Number of Fisher Scoring iterations: 10
This model is the base model, but since poisson regression models cant have missing data, we need to do something with STARS. This tests out the second option we created for handling STARS, which imputes STARS, but adds a logical indicator variable for “is missing”.
The results show that Wines without a STARS rating sell about 66% less than similar wines with a rating(exp(-1.088) ≈ 0.337). Wines with more stars sell more. The model reduces deviance by ~41% → very solid improvement.
AIC is good for a Poisson model at this scale.
Dispersion = 1 → no overdispersion → Poisson GLM is appropriate.
Strong predictors: Alcohol (positive) Chlorides (negative) FreeSulfurDioxide (slightly positive) LabelAppeal (very strong — all levels) VolatileAcidity (strong negative) TotalSulfurDioxide (small positive)
Weak / noise-like predictors: (These are not statistically significant) CitricAcid Density FixedAcidity ResidualSugar Sulphates pH Many AcidIndex levels
set.seed(123)
# Model 1B - All original predictors w/ STARS_imputed + is_stars_missing
# Poisson Regression Model
pr_model1b <- glm(TARGET ~ AcidIndex + Alcohol + Chlorides +
CitricAcid + Density + FixedAcidity + FreeSulfurDioxide +
LabelAppeal + ResidualSugar + STARS_imputed + is_stars_missing + Sulphates + TotalSulfurDioxide +
VolatileAcidity + pH,
family = poisson, data = train_transformed)
# View key model results
summary(pr_model1b)
##
## Call:
## glm(formula = TARGET ~ AcidIndex + Alcohol + Chlorides + CitricAcid +
## Density + FixedAcidity + FreeSulfurDioxide + LabelAppeal +
## ResidualSugar + STARS_imputed + is_stars_missing + Sulphates +
## TotalSulfurDioxide + VolatileAcidity + pH, family = poisson,
## data = train_transformed)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.008e+00 3.832e-01 2.630 0.00854 **
## AcidIndex5 -1.011e-01 3.244e-01 -0.312 0.75540
## AcidIndex6 -9.765e-02 3.176e-01 -0.307 0.75846
## AcidIndex7 -1.287e-01 3.173e-01 -0.406 0.68494
## AcidIndex8 -1.659e-01 3.173e-01 -0.523 0.60104
## AcidIndex9 -2.679e-01 3.178e-01 -0.843 0.39912
## AcidIndex10 -4.304e-01 3.191e-01 -1.349 0.17742
## AcidIndex11 -7.457e-01 3.232e-01 -2.307 0.02106 *
## AcidIndex12 -7.536e-01 3.294e-01 -2.288 0.02216 *
## AcidIndex13 -5.981e-01 3.321e-01 -1.801 0.07171 .
## AcidIndex14 -7.065e-01 3.495e-01 -2.022 0.04322 *
## AcidIndex15 -2.394e-01 4.040e-01 -0.593 0.55349
## AcidIndex16 -9.554e-01 5.488e-01 -1.741 0.08171 .
## AcidIndex17 -1.276e+01 1.269e+02 -0.101 0.91991
## Alcohol 5.128e-03 1.662e-03 3.085 0.00204 **
## Chlorides -6.947e-02 2.961e-02 -2.346 0.01898 *
## CitricAcid 1.179e-02 1.046e-02 1.128 0.25942
## Density -2.576e-01 2.128e-01 -1.210 0.22618
## FixedAcidity -3.026e-04 1.282e-03 -0.236 0.81341
## FreeSulfurDioxide 1.511e-04 6.097e-05 2.479 0.01318 *
## LabelAppeal-1 2.308e-01 4.175e-02 5.528 3.23e-08 ***
## LabelAppeal0 4.200e-01 4.066e-02 10.331 < 2e-16 ***
## LabelAppeal1 5.583e-01 4.143e-02 13.477 < 2e-16 ***
## LabelAppeal2 6.883e-01 4.690e-02 14.676 < 2e-16 ***
## ResidualSugar 1.908e-04 2.750e-04 0.694 0.48795
## STARS_imputed2 3.122e-01 1.602e-02 19.492 < 2e-16 ***
## STARS_imputed3 4.329e-01 1.741e-02 24.859 < 2e-16 ***
## STARS_imputed4 5.405e-01 2.431e-02 22.234 < 2e-16 ***
## is_stars_missingTRUE -1.088e+00 2.055e-02 -52.936 < 2e-16 ***
## Sulphates -1.033e-02 9.858e-03 -1.048 0.29449
## TotalSulfurDioxide 1.065e-04 3.880e-05 2.745 0.00605 **
## VolatileAcidity -5.714e-02 1.173e-02 -4.873 1.10e-06 ***
## pH -7.310e-03 8.525e-03 -0.858 0.39116
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 18252 on 10237 degrees of freedom
## Residual deviance: 10742 on 10205 degrees of freedom
## AIC: 36376
##
## Number of Fisher Scoring iterations: 10
step_model1a <- stats::step(pr_model1a, direction = "both")
## Start: AIC=36376.47
## TARGET ~ AcidIndex + Alcohol + Chlorides + CitricAcid + Density +
## FixedAcidity + FreeSulfurDioxide + LabelAppeal + ResidualSugar +
## STARS_new + Sulphates + TotalSulfurDioxide + VolatileAcidity +
## pH
##
## Df Deviance AIC
## - FixedAcidity 1 10743 36375
## - ResidualSugar 1 10743 36375
## - pH 1 10743 36375
## - Sulphates 1 10744 36376
## - CitricAcid 1 10744 36376
## - Density 1 10744 36376
## <none> 10742 36376
## - Chlorides 1 10748 36380
## - FreeSulfurDioxide 1 10749 36381
## - TotalSulfurDioxide 1 10750 36382
## - Alcohol 1 10752 36384
## - VolatileAcidity 1 10767 36399
## - AcidIndex 13 11066 36674
## - LabelAppeal 4 11299 36925
## - STARS_new 4 15378 41004
##
## Step: AIC=36374.52
## TARGET ~ AcidIndex + Alcohol + Chlorides + CitricAcid + Density +
## FreeSulfurDioxide + LabelAppeal + ResidualSugar + STARS_new +
## Sulphates + TotalSulfurDioxide + VolatileAcidity + pH
##
## Df Deviance AIC
## - ResidualSugar 1 10743 36373
## - pH 1 10743 36373
## - Sulphates 1 10744 36374
## - CitricAcid 1 10744 36374
## - Density 1 10744 36374
## <none> 10743 36375
## + FixedAcidity 1 10742 36376
## - Chlorides 1 10748 36378
## - FreeSulfurDioxide 1 10749 36379
## - TotalSulfurDioxide 1 10750 36380
## - Alcohol 1 10752 36382
## - VolatileAcidity 1 10767 36397
## - AcidIndex 13 11075 36681
## - LabelAppeal 4 11299 36923
## - STARS_new 4 15379 41003
##
## Step: AIC=36373.01
## TARGET ~ AcidIndex + Alcohol + Chlorides + CitricAcid + Density +
## FreeSulfurDioxide + LabelAppeal + STARS_new + Sulphates +
## TotalSulfurDioxide + VolatileAcidity + pH
##
## Df Deviance AIC
## - pH 1 10744 36372
## - Sulphates 1 10744 36372
## - CitricAcid 1 10744 36372
## - Density 1 10744 36372
## <none> 10743 36373
## + ResidualSugar 1 10743 36375
## + FixedAcidity 1 10743 36375
## - Chlorides 1 10749 36377
## - FreeSulfurDioxide 1 10749 36377
## - TotalSulfurDioxide 1 10751 36379
## - Alcohol 1 10752 36380
## - VolatileAcidity 1 10767 36395
## - AcidIndex 13 11076 36680
## - LabelAppeal 4 11300 36922
## - STARS_new 4 15381 41003
##
## Step: AIC=36371.73
## TARGET ~ AcidIndex + Alcohol + Chlorides + CitricAcid + Density +
## FreeSulfurDioxide + LabelAppeal + STARS_new + Sulphates +
## TotalSulfurDioxide + VolatileAcidity
##
## Df Deviance AIC
## - Sulphates 1 10745 36371
## - CitricAcid 1 10745 36371
## - Density 1 10745 36371
## <none> 10744 36372
## + pH 1 10743 36373
## + ResidualSugar 1 10743 36373
## + FixedAcidity 1 10744 36374
## - Chlorides 1 10749 36375
## - FreeSulfurDioxide 1 10750 36376
## - TotalSulfurDioxide 1 10751 36377
## - Alcohol 1 10753 36379
## - VolatileAcidity 1 10768 36394
## - AcidIndex 13 11076 36678
## - LabelAppeal 4 11300 36920
## - STARS_new 4 15385 41005
##
## Step: AIC=36370.87
## TARGET ~ AcidIndex + Alcohol + Chlorides + CitricAcid + Density +
## FreeSulfurDioxide + LabelAppeal + STARS_new + TotalSulfurDioxide +
## VolatileAcidity
##
## Df Deviance AIC
## - CitricAcid 1 10746 36370
## - Density 1 10746 36370
## <none> 10745 36371
## + Sulphates 1 10744 36372
## + pH 1 10744 36372
## + ResidualSugar 1 10744 36372
## + FixedAcidity 1 10745 36373
## - Chlorides 1 10750 36375
## - FreeSulfurDioxide 1 10751 36375
## - TotalSulfurDioxide 1 10752 36376
## - Alcohol 1 10754 36378
## - VolatileAcidity 1 10769 36393
## - AcidIndex 13 11078 36679
## - LabelAppeal 4 11301 36919
## - STARS_new 4 15388 41007
##
## Step: AIC=36370.07
## TARGET ~ AcidIndex + Alcohol + Chlorides + Density + FreeSulfurDioxide +
## LabelAppeal + STARS_new + TotalSulfurDioxide + VolatileAcidity
##
## Df Deviance AIC
## - Density 1 10748 36370
## <none> 10746 36370
## + CitricAcid 1 10745 36371
## + Sulphates 1 10745 36371
## + pH 1 10745 36371
## + ResidualSugar 1 10746 36372
## + FixedAcidity 1 10746 36372
## - Chlorides 1 10752 36374
## - FreeSulfurDioxide 1 10752 36374
## - TotalSulfurDioxide 1 10754 36376
## - Alcohol 1 10756 36378
## - VolatileAcidity 1 10771 36393
## - AcidIndex 13 11079 36677
## - LabelAppeal 4 11303 36919
## - STARS_new 4 15395 41011
##
## Step: AIC=36369.59
## TARGET ~ AcidIndex + Alcohol + Chlorides + FreeSulfurDioxide +
## LabelAppeal + STARS_new + TotalSulfurDioxide + VolatileAcidity
##
## Df Deviance AIC
## <none> 10748 36370
## + Density 1 10746 36370
## + CitricAcid 1 10746 36370
## + Sulphates 1 10746 36370
## + pH 1 10747 36371
## + ResidualSugar 1 10747 36371
## + FixedAcidity 1 10748 36372
## - Chlorides 1 10754 36373
## - FreeSulfurDioxide 1 10754 36374
## - TotalSulfurDioxide 1 10755 36375
## - Alcohol 1 10757 36377
## - VolatileAcidity 1 10772 36392
## - AcidIndex 13 11082 36678
## - LabelAppeal 4 11304 36918
## - STARS_new 4 15400 41014
summary(step_model1a)
##
## Call:
## glm(formula = TARGET ~ AcidIndex + Alcohol + Chlorides + FreeSulfurDioxide +
## LabelAppeal + STARS_new + TotalSulfurDioxide + VolatileAcidity,
## family = poisson, data = train_transformed)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.230e-02 3.200e-01 -0.163 0.87017
## AcidIndex5 -1.015e-01 3.240e-01 -0.313 0.75414
## AcidIndex6 -9.536e-02 3.171e-01 -0.301 0.76364
## AcidIndex7 -1.260e-01 3.167e-01 -0.398 0.69082
## AcidIndex8 -1.626e-01 3.168e-01 -0.513 0.60773
## AcidIndex9 -2.648e-01 3.172e-01 -0.835 0.40387
## AcidIndex10 -4.284e-01 3.185e-01 -1.345 0.17863
## AcidIndex11 -7.458e-01 3.226e-01 -2.312 0.02079 *
## AcidIndex12 -7.501e-01 3.288e-01 -2.281 0.02253 *
## AcidIndex13 -5.940e-01 3.314e-01 -1.792 0.07309 .
## AcidIndex14 -7.042e-01 3.487e-01 -2.019 0.04346 *
## AcidIndex15 -2.255e-01 4.035e-01 -0.559 0.57622
## AcidIndex16 -9.523e-01 5.481e-01 -1.738 0.08230 .
## AcidIndex17 -1.275e+01 1.269e+02 -0.100 0.91995
## Alcohol 5.179e-03 1.662e-03 3.117 0.00183 **
## Chlorides -7.114e-02 2.960e-02 -2.404 0.01623 *
## FreeSulfurDioxide 1.517e-04 6.094e-05 2.489 0.01282 *
## LabelAppeal-1 2.315e-01 4.174e-02 5.547 2.91e-08 ***
## LabelAppeal0 4.203e-01 4.066e-02 10.337 < 2e-16 ***
## LabelAppeal1 5.591e-01 4.143e-02 13.497 < 2e-16 ***
## LabelAppeal2 6.879e-01 4.690e-02 14.667 < 2e-16 ***
## STARS_new1 7.763e-01 2.193e-02 35.395 < 2e-16 ***
## STARS_new2 1.089e+00 2.054e-02 53.008 < 2e-16 ***
## STARS_new3 1.210e+00 2.161e-02 55.999 < 2e-16 ***
## STARS_new4 1.317e+00 2.737e-02 48.125 < 2e-16 ***
## TotalSulfurDioxide 1.063e-04 3.876e-05 2.742 0.00611 **
## VolatileAcidity -5.727e-02 1.172e-02 -4.885 1.04e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 18252 on 10237 degrees of freedom
## Residual deviance: 10748 on 10211 degrees of freedom
## AIC: 36370
##
## Number of Fisher Scoring iterations: 10
Stepwise selected a model with these predictors:
Important: AcidIndex Alcohol Chlorides FreeSulfurDioxide LabelAppeal STARS_imputed is_stars_missing TotalSulfurDioxide VolatileAcidity
Dropped/Not Important: FixedAcidity ResidualSugar pH Sulphates CitricAcid Density
AIC: AIC=36369.59
step_model1b <- stepAIC(pr_model1b, direction = "both",Trace = 0)
## Start: AIC=36376.47
## TARGET ~ AcidIndex + Alcohol + Chlorides + CitricAcid + Density +
## FixedAcidity + FreeSulfurDioxide + LabelAppeal + ResidualSugar +
## STARS_imputed + is_stars_missing + Sulphates + TotalSulfurDioxide +
## VolatileAcidity + pH
##
## Df Deviance AIC
## - FixedAcidity 1 10743 36375
## - ResidualSugar 1 10743 36375
## - pH 1 10743 36375
## - Sulphates 1 10744 36376
## - CitricAcid 1 10744 36376
## - Density 1 10744 36376
## <none> 10742 36376
## - Chlorides 1 10748 36380
## - FreeSulfurDioxide 1 10749 36381
## - TotalSulfurDioxide 1 10750 36382
## - Alcohol 1 10752 36384
## - VolatileAcidity 1 10767 36399
## - AcidIndex 13 11066 36674
## - LabelAppeal 4 11299 36925
## - STARS_imputed 3 11551 37179
## - is_stars_missing 1 14033 39665
##
## Step: AIC=36374.52
## TARGET ~ AcidIndex + Alcohol + Chlorides + CitricAcid + Density +
## FreeSulfurDioxide + LabelAppeal + ResidualSugar + STARS_imputed +
## is_stars_missing + Sulphates + TotalSulfurDioxide + VolatileAcidity +
## pH
##
## Df Deviance AIC
## - ResidualSugar 1 10743 36373
## - pH 1 10743 36373
## - Sulphates 1 10744 36374
## - CitricAcid 1 10744 36374
## - Density 1 10744 36374
## <none> 10743 36375
## + FixedAcidity 1 10742 36376
## - Chlorides 1 10748 36378
## - FreeSulfurDioxide 1 10749 36379
## - TotalSulfurDioxide 1 10750 36380
## - Alcohol 1 10752 36382
## - VolatileAcidity 1 10767 36397
## - AcidIndex 13 11075 36681
## - LabelAppeal 4 11299 36923
## - STARS_imputed 3 11552 37178
## - is_stars_missing 1 14034 39664
##
## Step: AIC=36373.01
## TARGET ~ AcidIndex + Alcohol + Chlorides + CitricAcid + Density +
## FreeSulfurDioxide + LabelAppeal + STARS_imputed + is_stars_missing +
## Sulphates + TotalSulfurDioxide + VolatileAcidity + pH
##
## Df Deviance AIC
## - pH 1 10744 36372
## - Sulphates 1 10744 36372
## - CitricAcid 1 10744 36372
## - Density 1 10744 36372
## <none> 10743 36373
## + ResidualSugar 1 10743 36375
## + FixedAcidity 1 10743 36375
## - Chlorides 1 10749 36377
## - FreeSulfurDioxide 1 10749 36377
## - TotalSulfurDioxide 1 10751 36379
## - Alcohol 1 10752 36380
## - VolatileAcidity 1 10767 36395
## - AcidIndex 13 11076 36680
## - LabelAppeal 4 11300 36922
## - STARS_imputed 3 11553 37177
## - is_stars_missing 1 14034 39662
##
## Step: AIC=36371.73
## TARGET ~ AcidIndex + Alcohol + Chlorides + CitricAcid + Density +
## FreeSulfurDioxide + LabelAppeal + STARS_imputed + is_stars_missing +
## Sulphates + TotalSulfurDioxide + VolatileAcidity
##
## Df Deviance AIC
## - Sulphates 1 10745 36371
## - CitricAcid 1 10745 36371
## - Density 1 10745 36371
## <none> 10744 36372
## + pH 1 10743 36373
## + ResidualSugar 1 10743 36373
## + FixedAcidity 1 10744 36374
## - Chlorides 1 10749 36375
## - FreeSulfurDioxide 1 10750 36376
## - TotalSulfurDioxide 1 10751 36377
## - Alcohol 1 10753 36379
## - VolatileAcidity 1 10768 36394
## - AcidIndex 13 11076 36678
## - LabelAppeal 4 11300 36920
## - STARS_imputed 3 11554 37176
## - is_stars_missing 1 14038 39664
##
## Step: AIC=36370.87
## TARGET ~ AcidIndex + Alcohol + Chlorides + CitricAcid + Density +
## FreeSulfurDioxide + LabelAppeal + STARS_imputed + is_stars_missing +
## TotalSulfurDioxide + VolatileAcidity
##
## Df Deviance AIC
## - CitricAcid 1 10746 36370
## - Density 1 10746 36370
## <none> 10745 36371
## + Sulphates 1 10744 36372
## + pH 1 10744 36372
## + ResidualSugar 1 10744 36372
## + FixedAcidity 1 10745 36373
## - Chlorides 1 10750 36375
## - FreeSulfurDioxide 1 10751 36375
## - TotalSulfurDioxide 1 10752 36376
## - Alcohol 1 10754 36378
## - VolatileAcidity 1 10769 36393
## - AcidIndex 13 11078 36679
## - LabelAppeal 4 11301 36919
## - STARS_imputed 3 11556 37175
## - is_stars_missing 1 14039 39663
##
## Step: AIC=36370.07
## TARGET ~ AcidIndex + Alcohol + Chlorides + Density + FreeSulfurDioxide +
## LabelAppeal + STARS_imputed + is_stars_missing + TotalSulfurDioxide +
## VolatileAcidity
##
## Df Deviance AIC
## - Density 1 10748 36370
## <none> 10746 36370
## + CitricAcid 1 10745 36371
## + Sulphates 1 10745 36371
## + pH 1 10745 36371
## + ResidualSugar 1 10746 36372
## + FixedAcidity 1 10746 36372
## - Chlorides 1 10752 36374
## - FreeSulfurDioxide 1 10752 36374
## - TotalSulfurDioxide 1 10754 36376
## - Alcohol 1 10756 36378
## - VolatileAcidity 1 10771 36393
## - AcidIndex 13 11079 36677
## - LabelAppeal 4 11303 36919
## - STARS_imputed 3 11557 37175
## - is_stars_missing 1 14044 39666
##
## Step: AIC=36369.59
## TARGET ~ AcidIndex + Alcohol + Chlorides + FreeSulfurDioxide +
## LabelAppeal + STARS_imputed + is_stars_missing + TotalSulfurDioxide +
## VolatileAcidity
##
## Df Deviance AIC
## <none> 10748 36370
## + Density 1 10746 36370
## + CitricAcid 1 10746 36370
## + Sulphates 1 10746 36370
## + pH 1 10747 36371
## + ResidualSugar 1 10747 36371
## + FixedAcidity 1 10748 36372
## - Chlorides 1 10754 36373
## - FreeSulfurDioxide 1 10754 36374
## - TotalSulfurDioxide 1 10755 36375
## - Alcohol 1 10757 36377
## - VolatileAcidity 1 10772 36392
## - AcidIndex 13 11082 36678
## - LabelAppeal 4 11304 36918
## - STARS_imputed 3 11559 37175
## - is_stars_missing 1 14048 39668
summary(step_model1b)
##
## Call:
## glm(formula = TARGET ~ AcidIndex + Alcohol + Chlorides + FreeSulfurDioxide +
## LabelAppeal + STARS_imputed + is_stars_missing + TotalSulfurDioxide +
## VolatileAcidity, family = poisson, data = train_transformed)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 7.240e-01 3.200e-01 2.262 0.02367 *
## AcidIndex5 -1.015e-01 3.240e-01 -0.313 0.75414
## AcidIndex6 -9.536e-02 3.171e-01 -0.301 0.76364
## AcidIndex7 -1.260e-01 3.167e-01 -0.398 0.69082
## AcidIndex8 -1.626e-01 3.168e-01 -0.513 0.60773
## AcidIndex9 -2.648e-01 3.172e-01 -0.835 0.40387
## AcidIndex10 -4.284e-01 3.185e-01 -1.345 0.17863
## AcidIndex11 -7.458e-01 3.226e-01 -2.312 0.02079 *
## AcidIndex12 -7.501e-01 3.288e-01 -2.281 0.02253 *
## AcidIndex13 -5.940e-01 3.314e-01 -1.792 0.07309 .
## AcidIndex14 -7.042e-01 3.487e-01 -2.019 0.04346 *
## AcidIndex15 -2.255e-01 4.035e-01 -0.559 0.57622
## AcidIndex16 -9.523e-01 5.481e-01 -1.738 0.08230 .
## AcidIndex17 -1.275e+01 1.269e+02 -0.100 0.91995
## Alcohol 5.179e-03 1.662e-03 3.117 0.00183 **
## Chlorides -7.114e-02 2.960e-02 -2.404 0.01623 *
## FreeSulfurDioxide 1.517e-04 6.094e-05 2.489 0.01282 *
## LabelAppeal-1 2.315e-01 4.174e-02 5.547 2.91e-08 ***
## LabelAppeal0 4.203e-01 4.066e-02 10.337 < 2e-16 ***
## LabelAppeal1 5.591e-01 4.143e-02 13.497 < 2e-16 ***
## LabelAppeal2 6.879e-01 4.690e-02 14.667 < 2e-16 ***
## STARS_imputed2 3.127e-01 1.601e-02 19.529 < 2e-16 ***
## STARS_imputed3 4.337e-01 1.741e-02 24.914 < 2e-16 ***
## STARS_imputed4 5.411e-01 2.431e-02 22.258 < 2e-16 ***
## is_stars_missingTRUE -1.089e+00 2.054e-02 -53.008 < 2e-16 ***
## TotalSulfurDioxide 1.063e-04 3.876e-05 2.742 0.00611 **
## VolatileAcidity -5.727e-02 1.172e-02 -4.885 1.04e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 18252 on 10237 degrees of freedom
## Residual deviance: 10748 on 10211 degrees of freedom
## AIC: 36370
##
## Number of Fisher Scoring iterations: 10
Box-cox is not necessarily needed for poisson, since the predictors do not need to be normally distributed (only the residuals should ideally be normally distributed).
Poisson regression assumes log link on the mean, variance = mean (or NB variance > mean) and no normality requirement. Null deviance: 18252 on 10237 degrees of freedom Residual deviance: 10701 on 10205 degrees of freedom
This means the model explains ~40% of the deviance, which is respectable for sales-count prediction.
AIC: 36335
set.seed(123)
# Model 2A - All transformed predictors except Box-cox w/ STARS_imputed + is_stars_missing
# Poisson Regression Model
pr_model2a <- glm(TARGET ~ AcidIndex + SQRT_Alcohol + LOG_Chlorides +
LOG_CitricAcid + Density + SQRT_FixedAcidity + LOG_FreeSulfurDioxide +
LabelAppeal + LOG_ResidualSugar + STARS_imputed + is_stars_missing + LOG_Sulphates + LOG_TotalSulfurDioxide + LOG_VolatileAcidity + pH,
family = poisson, data = train_transformed)
# View key model results
summary(pr_model2a)
##
## Call:
## glm(formula = TARGET ~ AcidIndex + SQRT_Alcohol + LOG_Chlorides +
## LOG_CitricAcid + Density + SQRT_FixedAcidity + LOG_FreeSulfurDioxide +
## LabelAppeal + LOG_ResidualSugar + STARS_imputed + is_stars_missing +
## LOG_Sulphates + LOG_TotalSulfurDioxide + LOG_VolatileAcidity +
## pH, family = poisson, data = train_transformed)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.685627 0.388030 1.767 0.07724 .
## AcidIndex5 -0.095278 0.324422 -0.294 0.76900
## AcidIndex6 -0.096417 0.317604 -0.304 0.76145
## AcidIndex7 -0.130118 0.317281 -0.410 0.68173
## AcidIndex8 -0.165732 0.317361 -0.522 0.60152
## AcidIndex9 -0.263934 0.317803 -0.830 0.40626
## AcidIndex10 -0.415322 0.319194 -1.301 0.19320
## AcidIndex11 -0.726621 0.323303 -2.247 0.02461 *
## AcidIndex12 -0.722677 0.329529 -2.193 0.02830 *
## AcidIndex13 -0.573343 0.332169 -1.726 0.08434 .
## AcidIndex14 -0.683363 0.349551 -1.955 0.05059 .
## AcidIndex15 -0.236369 0.404091 -0.585 0.55859
## AcidIndex16 -0.938628 0.548728 -1.711 0.08716 .
## AcidIndex17 -12.746968 126.889324 -0.100 0.91998
## SQRT_Alcohol 0.034064 0.010984 3.101 0.00193 **
## LOG_Chlorides -0.094544 0.039468 -2.395 0.01660 *
## LOG_CitricAcid 0.037606 0.020680 1.818 0.06899 .
## Density -0.266816 0.212829 -1.254 0.20997
## SQRT_FixedAcidity -0.002293 0.008169 -0.281 0.77897
## LOG_FreeSulfurDioxide 0.028216 0.006364 4.434 9.26e-06 ***
## LabelAppeal-1 0.232349 0.041745 5.566 2.61e-08 ***
## LabelAppeal0 0.421596 0.040660 10.369 < 2e-16 ***
## LabelAppeal1 0.558933 0.041430 13.491 < 2e-16 ***
## LabelAppeal2 0.687593 0.046905 14.659 < 2e-16 ***
## LOG_ResidualSugar 0.008681 0.005862 1.481 0.13866
## STARS_imputed2 0.309103 0.016025 19.289 < 2e-16 ***
## STARS_imputed3 0.430572 0.017422 24.715 < 2e-16 ***
## STARS_imputed4 0.539338 0.024309 22.187 < 2e-16 ***
## is_stars_missingTRUE -1.080252 0.020584 -52.481 < 2e-16 ***
## LOG_Sulphates -0.021862 0.021542 -1.015 0.31017
## LOG_TotalSulfurDioxide 0.037061 0.007933 4.671 2.99e-06 ***
## LOG_VolatileAcidity -0.116069 0.022305 -5.204 1.95e-07 ***
## pH -0.006466 0.008523 -0.759 0.44805
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 18252 on 10237 degrees of freedom
## Residual deviance: 10701 on 10205 degrees of freedom
## AIC: 36335
##
## Number of Fisher Scoring iterations: 10
Density, FixedAcidity (sqrt),Sulphates (log), pH, ResidualSugar (log), CitricAcid are removed as its p value is high and as observed in stepwise selection.
Null deviance: 18252 on 10237 degrees of freedom Residual deviance: 10710 on 10211 degrees of freedom AIC: 36332
pr_model2b <- glm(TARGET ~ AcidIndex + SQRT_Alcohol + LOG_Chlorides +
LOG_FreeSulfurDioxide + LabelAppeal + STARS_imputed + is_stars_missing + LOG_TotalSulfurDioxide + LOG_VolatileAcidity,
family = poisson, data = train_transformed)
# View key model results
summary(pr_model2b)
##
## Call:
## glm(formula = TARGET ~ AcidIndex + SQRT_Alcohol + LOG_Chlorides +
## LOG_FreeSulfurDioxide + LabelAppeal + STARS_imputed + is_stars_missing +
## LOG_TotalSulfurDioxide + LOG_VolatileAcidity, family = poisson,
## data = train_transformed)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.408632 0.325098 1.257 0.20877
## AcidIndex5 -0.099774 0.324005 -0.308 0.75813
## AcidIndex6 -0.094697 0.317111 -0.299 0.76523
## AcidIndex7 -0.127005 0.316750 -0.401 0.68845
## AcidIndex8 -0.161721 0.316798 -0.510 0.60971
## AcidIndex9 -0.259867 0.317179 -0.819 0.41261
## AcidIndex10 -0.413412 0.318545 -1.298 0.19435
## AcidIndex11 -0.726702 0.322612 -2.253 0.02429 *
## AcidIndex12 -0.717667 0.328826 -2.183 0.02907 *
## AcidIndex13 -0.567891 0.331442 -1.713 0.08664 .
## AcidIndex14 -0.679543 0.348730 -1.949 0.05134 .
## AcidIndex15 -0.220163 0.403494 -0.546 0.58531
## AcidIndex16 -0.933608 0.548093 -1.703 0.08850 .
## AcidIndex17 -12.732403 126.889876 -0.100 0.92007
## SQRT_Alcohol 0.033951 0.010974 3.094 0.00198 **
## LOG_Chlorides -0.097672 0.039446 -2.476 0.01328 *
## LOG_FreeSulfurDioxide 0.028718 0.006356 4.518 6.23e-06 ***
## LabelAppeal-1 0.233074 0.041741 5.584 2.35e-08 ***
## LabelAppeal0 0.421678 0.040659 10.371 < 2e-16 ***
## LabelAppeal1 0.559917 0.041426 13.516 < 2e-16 ***
## LabelAppeal2 0.686947 0.046902 14.646 < 2e-16 ***
## STARS_imputed2 0.309782 0.016020 19.337 < 2e-16 ***
## STARS_imputed3 0.431544 0.017416 24.779 < 2e-16 ***
## STARS_imputed4 0.540208 0.024308 22.224 < 2e-16 ***
## is_stars_missingTRUE -1.082050 0.020575 -52.592 < 2e-16 ***
## LOG_TotalSulfurDioxide 0.037833 0.007917 4.779 1.76e-06 ***
## LOG_VolatileAcidity -0.116714 0.022301 -5.234 1.66e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 18252 on 10237 degrees of freedom
## Residual deviance: 10710 on 10211 degrees of freedom
## AIC: 36332
##
## Number of Fisher Scoring iterations: 10
Uses CombinedAcidity
Null deviance: 18252 on 10237 degrees of freedom Residual deviance: 10737 on 10211 degrees of freedom AIC: 36359
pr_model3 <- glm(TARGET ~ AcidIndex + SQRT_Alcohol + LOG_Chlorides +
LOG_FreeSulfurDioxide + LabelAppeal + STARS_imputed + is_stars_missing + LOG_TotalSulfurDioxide +CombinedAcidity,
family = poisson, data = train_transformed)
# View key model results
summary(pr_model3)
##
## Call:
## glm(formula = TARGET ~ AcidIndex + SQRT_Alcohol + LOG_Chlorides +
## LOG_FreeSulfurDioxide + LabelAppeal + STARS_imputed + is_stars_missing +
## LOG_TotalSulfurDioxide + CombinedAcidity, family = poisson,
## data = train_transformed)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.730e-01 3.251e-01 1.147 0.25124
## AcidIndex5 -1.261e-01 3.240e-01 -0.389 0.69713
## AcidIndex6 -1.148e-01 3.171e-01 -0.362 0.71727
## AcidIndex7 -1.453e-01 3.168e-01 -0.459 0.64647
## AcidIndex8 -1.806e-01 3.168e-01 -0.570 0.56876
## AcidIndex9 -2.797e-01 3.172e-01 -0.882 0.37799
## AcidIndex10 -4.335e-01 3.186e-01 -1.361 0.17363
## AcidIndex11 -7.501e-01 3.227e-01 -2.324 0.02010 *
## AcidIndex12 -7.362e-01 3.289e-01 -2.238 0.02521 *
## AcidIndex13 -5.908e-01 3.315e-01 -1.782 0.07473 .
## AcidIndex14 -7.102e-01 3.489e-01 -2.036 0.04178 *
## AcidIndex15 -2.395e-01 4.036e-01 -0.593 0.55285
## AcidIndex16 -9.174e-01 5.486e-01 -1.672 0.09447 .
## AcidIndex17 -1.280e+01 1.268e+02 -0.101 0.91962
## SQRT_Alcohol 3.355e-02 1.097e-02 3.059 0.00222 **
## LOG_Chlorides -9.968e-02 3.946e-02 -2.526 0.01154 *
## LOG_FreeSulfurDioxide 2.955e-02 6.350e-03 4.654 3.26e-06 ***
## LabelAppeal-1 2.381e-01 4.173e-02 5.705 1.16e-08 ***
## LabelAppeal0 4.265e-01 4.065e-02 10.493 < 2e-16 ***
## LabelAppeal1 5.652e-01 4.141e-02 13.648 < 2e-16 ***
## LabelAppeal2 6.926e-01 4.689e-02 14.771 < 2e-16 ***
## STARS_imputed2 3.123e-01 1.601e-02 19.502 < 2e-16 ***
## STARS_imputed3 4.336e-01 1.741e-02 24.907 < 2e-16 ***
## STARS_imputed4 5.441e-01 2.429e-02 22.397 < 2e-16 ***
## is_stars_missingTRUE -1.086e+00 2.056e-02 -52.851 < 2e-16 ***
## LOG_TotalSulfurDioxide 3.947e-02 7.913e-03 4.988 6.09e-07 ***
## CombinedAcidity -7.925e-04 1.267e-03 -0.626 0.53145
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 18252 on 10237 degrees of freedom
## Residual deviance: 10737 on 10211 degrees of freedom
## AIC: 36359
##
## Number of Fisher Scoring iterations: 10
Dispersion = 0.8698125 indicates mild underdispersion, but is not a concern as it often indicates: Good model fit, Residuals are not excessively variable, No need to switch models.
disp <- sum(residuals(pr_model2b, type="pearson")^2) / pr_model1b$df.residual
disp
## [1] 0.8703239
The predictors for the following model are: FixedAcidity, VolatileAcidity, CitricAcid, ResidualSugar, Chlorides, FreeSulfurDioxide, TotalSulfurDioxide, Density, pH, Sulphates, Alcohol, LabelAppeal, AcidIndex, STARS
Theta (dispersion parameter) is extremely large 40,995. The NB algorithm struggled to estimate theta accurately, thus the warning. This typically means the model is too close to Poisson for NB to add value. Overdispersion is not present or is very small.
AIC = 36,379 — comparable to Poisson AIC.
# Negative Binomial Model 1
nb_model1 <- glm.nb(TARGET ~ AcidIndex + Alcohol + Chlorides +
CitricAcid + Density + FixedAcidity + FreeSulfurDioxide +
LabelAppeal + ResidualSugar + STARS_imputed + is_stars_missing + Sulphates + TotalSulfurDioxide +
VolatileAcidity + pH, data = train_transformed, init.theta = 40957.00204, link = log)
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in glm.nb(TARGET ~ AcidIndex + Alcohol + Chlorides + CitricAcid + :
## alternation limit reached
summary(nb_model1)
##
## Call:
## glm.nb(formula = TARGET ~ AcidIndex + Alcohol + Chlorides + CitricAcid +
## Density + FixedAcidity + FreeSulfurDioxide + LabelAppeal +
## ResidualSugar + STARS_imputed + is_stars_missing + Sulphates +
## TotalSulfurDioxide + VolatileAcidity + pH, data = train_transformed,
## init.theta = 40994.61229, link = log)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.008e+00 3.832e-01 2.630 0.00854 **
## AcidIndex5 -1.011e-01 3.244e-01 -0.312 0.75537
## AcidIndex6 -9.767e-02 3.176e-01 -0.308 0.75843
## AcidIndex7 -1.287e-01 3.173e-01 -0.406 0.68491
## AcidIndex8 -1.660e-01 3.173e-01 -0.523 0.60101
## AcidIndex9 -2.680e-01 3.178e-01 -0.843 0.39911
## AcidIndex10 -4.305e-01 3.192e-01 -1.349 0.17741
## AcidIndex11 -7.457e-01 3.233e-01 -2.307 0.02106 *
## AcidIndex12 -7.536e-01 3.294e-01 -2.288 0.02216 *
## AcidIndex13 -5.981e-01 3.321e-01 -1.801 0.07171 .
## AcidIndex14 -7.065e-01 3.495e-01 -2.022 0.04322 *
## AcidIndex15 -2.394e-01 4.040e-01 -0.593 0.55347
## AcidIndex16 -9.554e-01 5.488e-01 -1.741 0.08171 .
## AcidIndex17 -3.750e+01 3.001e+07 0.000 1.00000
## Alcohol 5.128e-03 1.662e-03 3.084 0.00204 **
## Chlorides -6.947e-02 2.961e-02 -2.346 0.01898 *
## CitricAcid 1.179e-02 1.046e-02 1.128 0.25942
## Density -2.576e-01 2.128e-01 -1.210 0.22619
## FixedAcidity -3.026e-04 1.282e-03 -0.236 0.81342
## FreeSulfurDioxide 1.511e-04 6.097e-05 2.479 0.01318 *
## LabelAppeal-1 2.308e-01 4.175e-02 5.528 3.23e-08 ***
## LabelAppeal0 4.200e-01 4.066e-02 10.331 < 2e-16 ***
## LabelAppeal1 5.583e-01 4.143e-02 13.477 < 2e-16 ***
## LabelAppeal2 6.883e-01 4.690e-02 14.675 < 2e-16 ***
## ResidualSugar 1.908e-04 2.751e-04 0.694 0.48794
## STARS_imputed2 3.122e-01 1.602e-02 19.491 < 2e-16 ***
## STARS_imputed3 4.329e-01 1.742e-02 24.858 < 2e-16 ***
## STARS_imputed4 5.405e-01 2.431e-02 22.233 < 2e-16 ***
## is_stars_missingTRUE -1.088e+00 2.055e-02 -52.935 < 2e-16 ***
## Sulphates -1.033e-02 9.858e-03 -1.048 0.29449
## TotalSulfurDioxide 1.065e-04 3.880e-05 2.745 0.00605 **
## VolatileAcidity -5.714e-02 1.173e-02 -4.873 1.10e-06 ***
## pH -7.311e-03 8.525e-03 -0.858 0.39113
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(40994.61) family taken to be 1)
##
## Null deviance: 18251 on 10237 degrees of freedom
## Residual deviance: 10742 on 10205 degrees of freedom
## AIC: 36379
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 40995
## Std. Err.: 38244
## Warning while fitting theta: alternation limit reached
##
## 2 x log-likelihood: -36310.81
Theta (dispersion parameter) is extremely large 41,187. The NB algorithm struggled to estimate theta accurately, thus the warning. This typically means the model is too close to Poisson for NB to add value. Overdispersion is very small.
AcidIndex is messy espacially AcidIndex17 as it has estimate =
-37.48
Looks like it is Very rare category (almost no records),Complete or
quasi-complete separation and the Model cannot estimate its effect →
numerical instability. AIC = 36,334 — comparable to Poisson AIC.
nb_model2 <- glm.nb(TARGET ~ AcidIndex + SQRT_Alcohol + LOG_Chlorides +
LOG_FreeSulfurDioxide + LabelAppeal + STARS_imputed + is_stars_missing + LOG_TotalSulfurDioxide + LOG_VolatileAcidity,
data = train_transformed)
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in glm.nb(TARGET ~ AcidIndex + SQRT_Alcohol + LOG_Chlorides +
## LOG_FreeSulfurDioxide + : alternation limit reached
summary(nb_model2)
##
## Call:
## glm.nb(formula = TARGET ~ AcidIndex + SQRT_Alcohol + LOG_Chlorides +
## LOG_FreeSulfurDioxide + LabelAppeal + STARS_imputed + is_stars_missing +
## LOG_TotalSulfurDioxide + LOG_VolatileAcidity, data = train_transformed,
## init.theta = 41187.11289, link = log)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.086e-01 3.251e-01 1.257 0.20878
## AcidIndex5 -9.979e-02 3.240e-01 -0.308 0.75810
## AcidIndex6 -9.471e-02 3.171e-01 -0.299 0.76519
## AcidIndex7 -1.270e-01 3.168e-01 -0.401 0.68842
## AcidIndex8 -1.617e-01 3.168e-01 -0.511 0.60969
## AcidIndex9 -2.599e-01 3.172e-01 -0.819 0.41259
## AcidIndex10 -4.134e-01 3.186e-01 -1.298 0.19435
## AcidIndex11 -7.267e-01 3.226e-01 -2.253 0.02429 *
## AcidIndex12 -7.177e-01 3.288e-01 -2.182 0.02907 *
## AcidIndex13 -5.679e-01 3.315e-01 -1.713 0.08664 .
## AcidIndex14 -6.796e-01 3.487e-01 -1.949 0.05134 .
## AcidIndex15 -2.202e-01 4.035e-01 -0.546 0.58529
## AcidIndex16 -9.336e-01 5.481e-01 -1.703 0.08850 .
## AcidIndex17 -3.748e+01 3.001e+07 0.000 1.00000
## SQRT_Alcohol 3.395e-02 1.098e-02 3.093 0.00198 **
## LOG_Chlorides -9.767e-02 3.945e-02 -2.476 0.01329 *
## LOG_FreeSulfurDioxide 2.872e-02 6.356e-03 4.518 6.23e-06 ***
## LabelAppeal-1 2.331e-01 4.174e-02 5.584 2.35e-08 ***
## LabelAppeal0 4.217e-01 4.066e-02 10.371 < 2e-16 ***
## LabelAppeal1 5.599e-01 4.143e-02 13.516 < 2e-16 ***
## LabelAppeal2 6.869e-01 4.690e-02 14.646 < 2e-16 ***
## STARS_imputed2 3.098e-01 1.602e-02 19.336 < 2e-16 ***
## STARS_imputed3 4.315e-01 1.742e-02 24.778 < 2e-16 ***
## STARS_imputed4 5.402e-01 2.431e-02 22.223 < 2e-16 ***
## is_stars_missingTRUE -1.082e+00 2.057e-02 -52.591 < 2e-16 ***
## LOG_TotalSulfurDioxide 3.784e-02 7.917e-03 4.779 1.76e-06 ***
## LOG_VolatileAcidity -1.167e-01 2.230e-02 -5.233 1.66e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(41187.11) family taken to be 1)
##
## Null deviance: 18251 on 10237 degrees of freedom
## Residual deviance: 10709 on 10211 degrees of freedom
## AIC: 36334
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 41187
## Std. Err.: 38484
## Warning while fitting theta: alternation limit reached
##
## 2 x log-likelihood: -36278.1
As there are too many predictors via AcidIndex categories, LabelAppeal levels, STARS indicators etc, ZIP cannot estimate 17+ dummy variables in zero-inflation any it may produce NA standard errors and unreliable coefficient estimates. Hence using only 2–4 predictors that logically explain excess zeros, not the count value.
The results show that these variables increase predicted TARGET: SQRT_Alcohol (positive, highly significant) FreeSulfurDioxide LabelAppeal (very strong and monotonic) STARS levels (strong positive) AcidIndex moderately negative at high levels LOG_VolatileAcidity (negative)
These variables do not influence count much: TotalSulfurDioxide (non-significant) Many AcidIndex categories (flat)
The vuong Test shows that the Poisson model is way better as the Vuong z-statistic < - 1.96
library(pscl) # for zeroinfl()
zip_model1<- zeroinfl(
TARGET ~ AcidIndex + SQRT_Alcohol + LOG_Chlorides +
LOG_FreeSulfurDioxide + LabelAppeal + STARS_imputed + is_stars_missing +
LOG_TotalSulfurDioxide + LOG_VolatileAcidity |
LabelAppeal + is_stars_missing,
data = train_transformed,
dist = "poisson"
)
summary(zip_model1)
##
## Call:
## zeroinfl(formula = TARGET ~ AcidIndex + SQRT_Alcohol + LOG_Chlorides +
## LOG_FreeSulfurDioxide + LabelAppeal + STARS_imputed + is_stars_missing +
## LOG_TotalSulfurDioxide + LOG_VolatileAcidity | LabelAppeal + is_stars_missing,
## data = train_transformed, dist = "poisson")
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## -2.016342 -0.460885 0.004954 0.420080 3.829268
##
## Count model coefficients (poisson with log link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.255397 0.330177 0.774 0.439217
## AcidIndex5 0.018491 0.328226 0.056 0.955073
## AcidIndex6 0.033527 0.321246 0.104 0.916880
## AcidIndex7 0.007095 0.320884 0.022 0.982359
## AcidIndex8 -0.014261 0.320938 -0.044 0.964557
## AcidIndex9 -0.069162 0.321361 -0.215 0.829600
## AcidIndex10 -0.173341 0.322902 -0.537 0.591390
## AcidIndex11 -0.380668 0.328052 -1.160 0.245890
## AcidIndex12 -0.285785 0.336223 -0.850 0.395333
## AcidIndex13 -0.205579 0.338612 -0.607 0.543769
## AcidIndex14 -0.369095 0.360221 -1.025 0.305536
## AcidIndex15 -0.028509 0.421429 -0.068 0.946066
## AcidIndex16 -0.379628 0.603070 -0.629 0.529027
## AcidIndex17 -12.733210 267.007828 -0.048 0.961964
## SQRT_Alcohol 0.044272 0.011187 3.957 7.58e-05 ***
## LOG_Chlorides -0.068288 0.040352 -1.692 0.090589 .
## LOG_FreeSulfurDioxide 0.014236 0.006580 2.163 0.030507 *
## LabelAppeal-1 0.398920 0.045012 8.862 < 2e-16 ***
## LabelAppeal0 0.665417 0.044120 15.082 < 2e-16 ***
## LabelAppeal1 0.858689 0.045032 19.068 < 2e-16 ***
## LabelAppeal2 1.013235 0.050552 20.043 < 2e-16 ***
## STARS_imputed2 0.241131 0.017728 13.602 < 2e-16 ***
## STARS_imputed3 0.336338 0.019366 17.368 < 2e-16 ***
## STARS_imputed4 0.423897 0.025954 16.333 < 2e-16 ***
## is_stars_missingTRUE -0.201249 0.022380 -8.992 < 2e-16 ***
## LOG_TotalSulfurDioxide 0.006696 0.008321 0.805 0.421014
## LOG_VolatileAcidity -0.077351 0.022838 -3.387 0.000707 ***
##
## Zero-inflation model coefficients (binomial with logit link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.2824 0.4111 -15.282 < 2e-16 ***
## LabelAppeal-1 1.5251 0.3917 3.894 9.87e-05 ***
## LabelAppeal0 2.2396 0.3895 5.750 8.94e-09 ***
## LabelAppeal1 2.9723 0.3961 7.503 6.24e-14 ***
## LabelAppeal2 3.3821 0.4460 7.583 3.38e-14 ***
## is_stars_missingTRUE 4.5215 0.1577 28.666 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Number of iterations in BFGS optimization: 47
## Log-likelihood: -1.673e+04 on 33 Df
# ZIP vs Poisson
vuong(pr_model2b, zip_model1)
## Vuong Non-Nested Hypothesis Test-Statistic:
## (test-statistic is asymptotically distributed N(0,1) under the
## null that the models are indistinguishible)
## -------------------------------------------------------------
## Vuong z-statistic H_A p-value
## Raw -35.05579 model2 > model1 < 2.22e-16
## AIC-corrected -34.90612 model2 > model1 < 2.22e-16
## BIC-corrected -34.36480 model2 > model1 < 2.22e-16
Highly significant (p < 0.001, ***): VolatileAcidity, FreeSulfurDioxide, TotalSulfurDioxide, Alcohol, LabelAppeal levels, STARS_imputed3, STARS_imputed4
Significant (p < 0.05, *): Density, pH, Sulphates, STARS_imputed2, CitricAcid, Chlorides
Not significant (p > 0.05): FixedAcidity, ResidualSugar, most AcidIndex levels
R squared ~ 0.35 → Model explains ~35% of the variance in TARGET. Residual SE ~1.56 → Typical prediction error magnitude. F-statistic highly significant → Model as a whole is statistically meaningful.
cat("\n=== LM1: Basic Model ===\n")
##
## === LM1: Basic Model ===
lm_model1 <- lm(formula = TARGET ~ FixedAcidity + VolatileAcidity + CitricAcid +
ResidualSugar + Chlorides + FreeSulfurDioxide + TotalSulfurDioxide +
Density + pH + Sulphates + Alcohol + LabelAppeal +
AcidIndex + STARS_imputed, data = data_transform)
summary(lm_model1)
##
## Call:
## lm(formula = TARGET ~ FixedAcidity + VolatileAcidity + CitricAcid +
## ResidualSugar + Chlorides + FreeSulfurDioxide + TotalSulfurDioxide +
## Density + pH + Sulphates + Alcohol + LabelAppeal + AcidIndex +
## STARS_imputed, data = data_transform)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.2451 -0.8678 0.2754 1.0801 4.7328
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.335e+00 1.040e+00 2.244 0.024824 *
## FixedAcidity -1.958e-03 3.069e-03 -0.638 0.523390
## VolatileAcidity -2.442e-01 2.716e-02 -8.991 < 2e-16 ***
## CitricAcid 7.584e-02 2.547e-02 2.978 0.002907 **
## ResidualSugar 4.528e-04 6.701e-04 0.676 0.499211
## Chlorides -2.482e-01 6.929e-02 -3.582 0.000342 ***
## FreeSulfurDioxide 6.210e-04 1.508e-04 4.117 3.86e-05 ***
## TotalSulfurDioxide 4.575e-04 9.446e-05 4.844 1.29e-06 ***
## Density -1.248e+00 5.203e-01 -2.398 0.016504 *
## pH -5.147e-02 2.065e-02 -2.493 0.012695 *
## Sulphates -5.672e-02 2.363e-02 -2.401 0.016384 *
## Alcohol 1.986e-02 4.036e-03 4.921 8.70e-07 ***
## LabelAppeal-1 6.469e-01 7.490e-02 8.637 < 2e-16 ***
## LabelAppeal0 1.237e+00 7.287e-02 16.974 < 2e-16 ***
## LabelAppeal1 1.753e+00 7.609e-02 23.043 < 2e-16 ***
## LabelAppeal2 2.301e+00 1.004e-01 22.903 < 2e-16 ***
## AcidIndex5 7.958e-01 9.187e-01 0.866 0.386391
## AcidIndex6 7.660e-01 9.021e-01 0.849 0.395831
## AcidIndex7 6.631e-01 9.014e-01 0.736 0.461944
## AcidIndex8 5.002e-01 9.015e-01 0.555 0.578975
## AcidIndex9 9.078e-02 9.022e-01 0.101 0.919853
## AcidIndex10 -4.801e-01 9.037e-01 -0.531 0.595222
## AcidIndex11 -1.065e+00 9.066e-01 -1.175 0.240041
## AcidIndex12 -1.276e+00 9.119e-01 -1.400 0.161609
## AcidIndex13 -9.033e-01 9.206e-01 -0.981 0.326536
## AcidIndex14 -1.025e+00 9.301e-01 -1.102 0.270367
## AcidIndex15 -1.500e-01 1.057e+00 -0.142 0.887117
## AcidIndex16 -1.613e+00 1.140e+00 -1.416 0.156872
## AcidIndex17 -1.985e+00 1.078e+00 -1.841 0.065579 .
## STARS_imputed2 -1.277e-01 3.414e-02 -3.742 0.000184 ***
## STARS_imputed3 1.502e+00 4.492e-02 33.443 < 2e-16 ***
## STARS_imputed4 2.169e+00 7.128e-02 30.424 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.558 on 12763 degrees of freedom
## Multiple R-squared: 0.3476, Adjusted R-squared: 0.346
## F-statistic: 219.4 on 31 and 12763 DF, p-value: < 2.2e-16
R squared ~ 0.55 indicates that model explains ~55% of the variance, a substantial improvement over 0.35 from the previous model. Residual SE smaller indicates that predictions are closer to observed values. Highly significant F-statistic indicates that model is meaningful overall.
Hence lm_model2 is clearly better. It improves fit, reduces residual error, and focuses on the most important predictors.
cat("\n=== LM1: Log-transformed response (recommended for skewed target) ===\n")
##
## === LM1: Log-transformed response (recommended for skewed target) ===
lm_model2 <- lm(formula = TARGET ~ AcidIndex + SQRT_Alcohol + LOG_Chlorides +
LOG_FreeSulfurDioxide + LabelAppeal + STARS_imputed + is_stars_missing + LOG_TotalSulfurDioxide + LOG_VolatileAcidity, data = data_transform)
summary(lm_model2)
##
## Call:
## lm(formula = TARGET ~ AcidIndex + SQRT_Alcohol + LOG_Chlorides +
## LOG_FreeSulfurDioxide + LabelAppeal + STARS_imputed + is_stars_missing +
## LOG_TotalSulfurDioxide + LOG_VolatileAcidity, data = data_transform)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.103 -0.855 0.028 0.842 6.048
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.25810 0.76189 1.651 0.098703 .
## AcidIndex5 -0.25387 0.76484 -0.332 0.739948
## AcidIndex6 -0.15846 0.75096 -0.211 0.832879
## AcidIndex7 -0.26134 0.75025 -0.348 0.727592
## AcidIndex8 -0.35866 0.75029 -0.478 0.632632
## AcidIndex9 -0.64678 0.75078 -0.861 0.388989
## AcidIndex10 -0.92769 0.75200 -1.234 0.217361
## AcidIndex11 -1.39437 0.75430 -1.849 0.064544 .
## AcidIndex12 -1.39316 0.75866 -1.836 0.066329 .
## AcidIndex13 -1.41262 0.76601 -1.844 0.065190 .
## AcidIndex14 -1.23344 0.77361 -1.594 0.110872
## AcidIndex15 -0.57621 0.87930 -0.655 0.512287
## AcidIndex16 -1.63670 0.94851 -1.726 0.084453 .
## AcidIndex17 -1.74480 0.89662 -1.946 0.051681 .
## SQRT_Alcohol 0.10383 0.02199 4.723 2.35e-06 ***
## LOG_Chlorides -0.27313 0.07749 -3.525 0.000425 ***
## LOG_FreeSulfurDioxide 0.07391 0.01249 5.918 3.34e-09 ***
## LabelAppeal-1 0.37044 0.06253 5.925 3.21e-09 ***
## LabelAppeal0 0.84005 0.06097 13.778 < 2e-16 ***
## LabelAppeal1 1.30534 0.06370 20.493 < 2e-16 ***
## LabelAppeal2 1.88433 0.08389 22.461 < 2e-16 ***
## STARS_imputed2 1.02458 0.03243 31.591 < 2e-16 ***
## STARS_imputed3 1.59047 0.03746 42.460 < 2e-16 ***
## STARS_imputed4 2.28275 0.05941 38.422 < 2e-16 ***
## is_stars_missingTRUE -2.36369 0.03197 -73.932 < 2e-16 ***
## LOG_TotalSulfurDioxide 0.10835 0.01510 7.175 7.62e-13 ***
## LOG_VolatileAcidity -0.33201 0.04353 -7.628 2.56e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.298 on 12768 degrees of freedom
## Multiple R-squared: 0.5467, Adjusted R-squared: 0.5458
## F-statistic: 592.4 on 26 and 12768 DF, p-value: < 2.2e-16
We know we have some strongly correlated features, so let’s address this potential issue by looking at the VIF for the models and removing predictors.
For the count regression models, luckily no predictor has a VIF > 5 (for Df > 1 (categorical), must look at GVIF^(1/(2*Df)) value), so we don’t need to remove any predictors.
cat("=== VIF for Count Regression Model 2a (Transformed Predictions with imputed values) ===\n")
## === VIF for Count Regression Model 2a (Transformed Predictions with imputed values) ===
car::vif(pr_model2a)
## GVIF Df GVIF^(1/(2*Df))
## AcidIndex 1.113178 13 1.004132
## SQRT_Alcohol 1.015515 1 1.007728
## LOG_Chlorides 1.004816 1 1.002405
## LOG_CitricAcid 1.011555 1 1.005761
## Density 1.005432 1 1.002712
## SQRT_FixedAcidity 1.027941 1 1.013874
## LOG_FreeSulfurDioxide 1.017914 1 1.008917
## LabelAppeal 1.147837 4 1.017384
## LOG_ResidualSugar 1.011349 1 1.005658
## STARS_imputed 1.331621 3 1.048890
## is_stars_missing 1.191941 1 1.091760
## LOG_Sulphates 1.003924 1 1.001960
## LOG_TotalSulfurDioxide 1.025321 1 1.012581
## LOG_VolatileAcidity 1.011188 1 1.005578
## pH 1.008212 1 1.004098
cat("=== VIF for Count Regression Model 2b (Transformed Predictions with imputed values and reduced variables) ===\n")
## === VIF for Count Regression Model 2b (Transformed Predictions with imputed values and reduced variables) ===
car::vif(pr_model2b)
## GVIF Df GVIF^(1/(2*Df))
## AcidIndex 1.061706 13 1.002306
## SQRT_Alcohol 1.013544 1 1.006749
## LOG_Chlorides 1.003526 1 1.001762
## LOG_FreeSulfurDioxide 1.016303 1 1.008119
## LabelAppeal 1.145305 4 1.017104
## STARS_imputed 1.330172 3 1.048700
## is_stars_missing 1.190888 1 1.091278
## LOG_TotalSulfurDioxide 1.021614 1 1.010749
## LOG_VolatileAcidity 1.010478 1 1.005225
cat("=== VIF for Negative Binomial Regression Model 1 (Original variables with STAR imputed) ===\n")
## === VIF for Negative Binomial Regression Model 1 (Original variables with STAR imputed) ===
car::vif(nb_model1)
## GVIF Df GVIF^(1/(2*Df))
## AcidIndex 1.093487 13 1.003443
## Alcohol 1.014259 1 1.007104
## Chlorides 1.004569 1 1.002282
## CitricAcid 1.007985 1 1.003985
## Density 1.005802 1 1.002897
## FixedAcidity 1.028534 1 1.014167
## FreeSulfurDioxide 1.005196 1 1.002595
## LabelAppeal 1.147204 4 1.017314
## ResidualSugar 1.006058 1 1.003025
## STARS_imputed 1.331197 3 1.048835
## is_stars_missing 1.188221 1 1.090055
## Sulphates 1.003359 1 1.001678
## TotalSulfurDioxide 1.008743 1 1.004362
## VolatileAcidity 1.008293 1 1.004138
## pH 1.008587 1 1.004284
cat("=== VIF for Negative Binomial Regression Model 1 (Transformed variables with STAR imputed) ===\n")
## === VIF for Negative Binomial Regression Model 1 (Transformed variables with STAR imputed) ===
car::vif(nb_model2)
## GVIF Df GVIF^(1/(2*Df))
## AcidIndex 1.061706 13 1.002306
## SQRT_Alcohol 1.013544 1 1.006749
## LOG_Chlorides 1.003526 1 1.001762
## LOG_FreeSulfurDioxide 1.016303 1 1.008119
## LabelAppeal 1.145303 4 1.017103
## STARS_imputed 1.330179 3 1.048701
## is_stars_missing 1.190897 1 1.091282
## LOG_TotalSulfurDioxide 1.021614 1 1.010749
## LOG_VolatileAcidity 1.010478 1 1.005225
cat("=== VIF for Multiple linear regression model 1 (Basic Model) ===\n")
## === VIF for Multiple linear regression model 1 (Basic Model) ===
car::vif(lm_model1)
## GVIF Df GVIF^(1/(2*Df))
## FixedAcidity 1.040478 1 1.020038
## VolatileAcidity 1.008520 1 1.004251
## CitricAcid 1.006970 1 1.003479
## ResidualSugar 1.005799 1 1.002895
## Chlorides 1.004192 1 1.002094
## FreeSulfurDioxide 1.004541 1 1.002268
## TotalSulfurDioxide 1.010929 1 1.005450
## Density 1.004997 1 1.002495
## pH 1.006637 1 1.003313
## Sulphates 1.004492 1 1.002244
## Alcohol 1.011061 1 1.005515
## LabelAppeal 1.112231 4 1.013385
## AcidIndex 1.106704 13 1.003907
## STARS_imputed 1.128292 3 1.020321
cat("=== VIF for Multiple linear regression model 2 (Transformed variables with STAR imputed) ===\n")
## === VIF for Multiple linear regression model 2 (Transformed variables with STAR imputed) ===
car::vif(lm_model2)
## GVIF Df GVIF^(1/(2*Df))
## AcidIndex 1.096031 13 1.003533
## SQRT_Alcohol 1.010943 1 1.005457
## LOG_Chlorides 1.003974 1 1.001985
## LOG_FreeSulfurDioxide 1.024492 1 1.012172
## LabelAppeal 1.124884 4 1.014819
## STARS_imputed 1.595534 3 1.080980
## is_stars_missing 1.502366 1 1.225710
## LOG_TotalSulfurDioxide 1.034608 1 1.017157
## LOG_VolatileAcidity 1.014201 1 1.007075
The LabelAppeal effect increases monotonically with LabelAppeal level. The magnitude is very similar across all models. There is strong, consistent positive impact: better label appeal strongly increases expected TARGET.
STARS_imputed has a strong positive effect for imputed stars 2–4 indicating that higher stars increase expected TARGET.
AcidIndex variables coefficients gradually decrease from AcidIndex5 → AcidIndex17.Hence smaller AcidIndex tends to slightly increase expected count (TARGET).
Alcohol, VolatileAcidity also have meaningful effects.
pH has smaller or even slightly negative effects in Poisson/NB vs linear regression. We are still preferring Poisson model as pH is not a significant predictor.
Model Comparison: Poisson vs NB have coefficients nearly identical for most predictors; NB may handle overdispersion better but doesn’t change effect direction/magnitude much.
Poisson 2a vs 2b have slight differences (~0.001–0.01), negligible practically. 2b seems slightly refined with log transformations.
Linear regression vs Poisson/NB comparison indicates that some variables flip direction (e.g., pH, Density), chemical variables are often more interpretable on log scale in Poisson/NB.
# Poisson regression
# Interpretation of coefficients (Model 2a)
cat("\n=== Coefficient Interpretation (Poisson regression Model 2a) ===\n")
##
## === Coefficient Interpretation (Poisson regression Model 2a) ===
coef_exp <- exp(coef(pr_model2a))
print(round(coef_exp, 4))
## (Intercept) AcidIndex5 AcidIndex6
## 1.9850 0.9091 0.9081
## AcidIndex7 AcidIndex8 AcidIndex9
## 0.8780 0.8473 0.7680
## AcidIndex10 AcidIndex11 AcidIndex12
## 0.6601 0.4835 0.4855
## AcidIndex13 AcidIndex14 AcidIndex15
## 0.5636 0.5049 0.7895
## AcidIndex16 AcidIndex17 SQRT_Alcohol
## 0.3912 0.0000 1.0347
## LOG_Chlorides LOG_CitricAcid Density
## 0.9098 1.0383 0.7658
## SQRT_FixedAcidity LOG_FreeSulfurDioxide LabelAppeal-1
## 0.9977 1.0286 1.2616
## LabelAppeal0 LabelAppeal1 LabelAppeal2
## 1.5244 1.7488 1.9889
## LOG_ResidualSugar STARS_imputed2 STARS_imputed3
## 1.0087 1.3622 1.5381
## STARS_imputed4 is_stars_missingTRUE LOG_Sulphates
## 1.7149 0.3395 0.9784
## LOG_TotalSulfurDioxide LOG_VolatileAcidity pH
## 1.0378 0.8904 0.9936
# Interpretation of coefficients (Model 2b)
cat("\n=== Coefficient Interpretation (Poisson regression Model 2b) ===\n")
##
## === Coefficient Interpretation (Poisson regression Model 2b) ===
coef_exp <- exp(coef(pr_model2b))
print(round(coef_exp, 4))
## (Intercept) AcidIndex5 AcidIndex6
## 1.5048 0.9050 0.9096
## AcidIndex7 AcidIndex8 AcidIndex9
## 0.8807 0.8507 0.7712
## AcidIndex10 AcidIndex11 AcidIndex12
## 0.6614 0.4835 0.4879
## AcidIndex13 AcidIndex14 AcidIndex15
## 0.5667 0.5068 0.8024
## AcidIndex16 AcidIndex17 SQRT_Alcohol
## 0.3931 0.0000 1.0345
## LOG_Chlorides LOG_FreeSulfurDioxide LabelAppeal-1
## 0.9069 1.0291 1.2625
## LabelAppeal0 LabelAppeal1 LabelAppeal2
## 1.5245 1.7505 1.9876
## STARS_imputed2 STARS_imputed3 STARS_imputed4
## 1.3631 1.5396 1.7164
## is_stars_missingTRUE LOG_TotalSulfurDioxide LOG_VolatileAcidity
## 0.3389 1.0386 0.8898
# Negative Binomial Regression Models
# Interpretation of coefficients (Negative Binomial Regression Model 1)
cat("\n=== Coefficient Interpretation (Negative Binomial Regression Model 1) ===\n")
##
## === Coefficient Interpretation (Negative Binomial Regression Model 1) ===
coef_exp <- exp(coef(nb_model1))
print(round(coef_exp, 4))
## (Intercept) AcidIndex5 AcidIndex6
## 2.7395 0.9039 0.9069
## AcidIndex7 AcidIndex8 AcidIndex9
## 0.8792 0.8471 0.7649
## AcidIndex10 AcidIndex11 AcidIndex12
## 0.6502 0.4744 0.4707
## AcidIndex13 AcidIndex14 AcidIndex15
## 0.5499 0.4934 0.7871
## AcidIndex16 AcidIndex17 Alcohol
## 0.3847 0.0000 1.0051
## Chlorides CitricAcid Density
## 0.9329 1.0119 0.7729
## FixedAcidity FreeSulfurDioxide LabelAppeal-1
## 0.9997 1.0002 1.2596
## LabelAppeal0 LabelAppeal1 LabelAppeal2
## 1.5220 1.7477 1.9904
## ResidualSugar STARS_imputed2 STARS_imputed3
## 1.0002 1.3664 1.5417
## STARS_imputed4 is_stars_missingTRUE Sulphates
## 1.7169 0.3369 0.9897
## TotalSulfurDioxide VolatileAcidity pH
## 1.0001 0.9445 0.9927
# Interpretation of coefficients (Negative Binomial Regression Model 1)
cat("\n=== Coefficient Interpretation (Negative Binomial Regression Model 2) ===\n")
##
## === Coefficient Interpretation (Negative Binomial Regression Model 2) ===
coef_exp <- exp(coef(nb_model2))
print(round(coef_exp, 4))
## (Intercept) AcidIndex5 AcidIndex6
## 1.5048 0.9050 0.9096
## AcidIndex7 AcidIndex8 AcidIndex9
## 0.8807 0.8507 0.7711
## AcidIndex10 AcidIndex11 AcidIndex12
## 0.6614 0.4835 0.4879
## AcidIndex13 AcidIndex14 AcidIndex15
## 0.5667 0.5068 0.8024
## AcidIndex16 AcidIndex17 SQRT_Alcohol
## 0.3931 0.0000 1.0345
## LOG_Chlorides LOG_FreeSulfurDioxide LabelAppeal-1
## 0.9069 1.0291 1.2625
## LabelAppeal0 LabelAppeal1 LabelAppeal2
## 1.5245 1.7505 1.9876
## STARS_imputed2 STARS_imputed3 STARS_imputed4
## 1.3631 1.5396 1.7164
## is_stars_missingTRUE LOG_TotalSulfurDioxide LOG_VolatileAcidity
## 0.3389 1.0386 0.8898
# Function to convert linear model log-coefficients into percent change
interpret_lm <- function(model) {
coefs <- coef(model)
pct <- (exp(coefs) - 1) * 100
out <- data.frame(
Coefficient = round(coefs, 4),
Percent_Change = round(pct, 2)
)
return(out)
}
# Interpretation for LM1
cat("\n=== Coefficient Interpretation (LM1: Basic Model) ===\n")
##
## === Coefficient Interpretation (LM1: Basic Model) ===
print(interpret_lm(lm_model1))
## Coefficient Percent_Change
## (Intercept) 2.3346 932.49
## FixedAcidity -0.0020 -0.20
## VolatileAcidity -0.2442 -21.67
## CitricAcid 0.0758 7.88
## ResidualSugar 0.0005 0.05
## Chlorides -0.2482 -21.98
## FreeSulfurDioxide 0.0006 0.06
## TotalSulfurDioxide 0.0005 0.05
## Density -1.2475 -71.28
## pH -0.0515 -5.02
## Sulphates -0.0567 -5.51
## Alcohol 0.0199 2.01
## LabelAppeal-1 0.6469 90.96
## LabelAppeal0 1.2370 244.53
## LabelAppeal1 1.7534 477.43
## LabelAppeal2 2.3005 897.92
## AcidIndex5 0.7958 121.62
## AcidIndex6 0.7660 115.12
## AcidIndex7 0.6631 94.08
## AcidIndex8 0.5002 64.91
## AcidIndex9 0.0908 9.50
## AcidIndex10 -0.4801 -38.13
## AcidIndex11 -1.0652 -65.53
## AcidIndex12 -1.2764 -72.10
## AcidIndex13 -0.9033 -59.48
## AcidIndex14 -1.0252 -64.13
## AcidIndex15 -0.1500 -13.93
## AcidIndex16 -1.6135 -80.08
## AcidIndex17 -1.9850 -86.26
## STARS_imputed2 -0.1277 -11.99
## STARS_imputed3 1.5023 349.20
## STARS_imputed4 2.1686 774.62
# Interpretation for LM2
cat("\n=== Coefficient Interpretation (LM2: Transformed Model with STARS imputed) ===\n")
##
## === Coefficient Interpretation (LM2: Transformed Model with STARS imputed) ===
print(interpret_lm(lm_model2))
## Coefficient Percent_Change
## (Intercept) 1.2581 251.87
## AcidIndex5 -0.2539 -22.42
## AcidIndex6 -0.1585 -14.65
## AcidIndex7 -0.2613 -23.00
## AcidIndex8 -0.3587 -30.14
## AcidIndex9 -0.6468 -47.63
## AcidIndex10 -0.9277 -60.45
## AcidIndex11 -1.3944 -75.20
## AcidIndex12 -1.3932 -75.17
## AcidIndex13 -1.4126 -75.65
## AcidIndex14 -1.2334 -70.87
## AcidIndex15 -0.5762 -43.80
## AcidIndex16 -1.6367 -80.54
## AcidIndex17 -1.7448 -82.53
## SQRT_Alcohol 0.1038 10.94
## LOG_Chlorides -0.2731 -23.90
## LOG_FreeSulfurDioxide 0.0739 7.67
## LabelAppeal-1 0.3704 44.84
## LabelAppeal0 0.8401 131.65
## LabelAppeal1 1.3053 268.89
## LabelAppeal2 1.8843 558.20
## STARS_imputed2 1.0246 178.59
## STARS_imputed3 1.5905 390.61
## STARS_imputed4 2.2828 880.36
## is_stars_missingTRUE -2.3637 -90.59
## LOG_TotalSulfurDioxide 0.1083 11.44
## LOG_VolatileAcidity -0.3320 -28.25
# Evaluate poisson regression model
evaluate_pr_model <- function(model, test_data, model_name) {
preds <- predict(model, newdata = test_data, type = "response")
actual <- test_data$TARGET
rmse <- sqrt(mean((preds - actual)^2))
mae <- mean(abs(preds - actual))
r2 <- 1 - sum((preds - actual)^2) / sum((actual - mean(actual))^2)
aic <- AIC(model)
# Poisson Regression Overdispersion test
residual_dev <- model$deviance
df_residual <- model$df.residual
dispersion <- residual_dev / df_residual
# Pearson chi-squared goodness of fit statistic
pearson_chi2_statistic <- sum(residuals(model, type = "pearson")^2)
degree_freedom_pearson <- df.residual(model)
p_value <- 1 - pchisq(pearson_chi2_statistic, degree_freedom_pearson)
# standard deviance/Pearson residuals can be unreliable for Poisson - DHARMa can help
# DHARMa simulates residuals to standardized residuals
simulationOutput <- simulateResiduals(model)
# Plot the diagnostics
plot(simulationOutput)
testOutliers(model, type = "bootstrap")
output <- paste(
"\n=== Model Selection and Evaluation ===\n\n",
"=== ", model_name, " Evaluation ===\n",
"RMSE:", round(rmse, 4),
"| MAE:", round(mae, 4),
"| R²:", round(r2, 4), "\n",
"| AIC:", round(aic, 4), "\n",
"| Pearson Chi-squared Goodness of Fit Statistic:", round(pearson_chi2_statistic, 4), "\n",
"| Degrees of Freedom:", round(degree_freedom_pearson, 4), "\n",
"| P-value:", round(p_value, 4), "\n",
"| Dispersion Statistic:", round(dispersion, 4), "\n\n",
sep = " "
)
cat(output)
}
pr_model_diagnostic_graphs <- function(model) {
par(mfrow = c(2,3))
plot(model, which = 1:6)
}
# === PR1 Evaluation ===
evaluate_pr_model(pr_model2a, test_transformed, "PR2a: Transformed variables with Stars Imputed")
## DHARMa:testOutliers with type = binomial may have inflated Type I error rates for integer-valued distributions. To get a more exact result, it is recommended to re-run testOutliers with type = 'bootstrap'. See ?testOutliers for details
##
## === Model Selection and Evaluation ===
##
## === PR2a: Transformed variables with Stars Imputed Evaluation ===
## RMSE: 1.3176 | MAE: 1.0247 | R²: 0.5372
## | AIC: 36334.9878
## | Pearson Chi-squared Goodness of Fit Statistic: 8874.9655
## | Degrees of Freedom: 10205
## | P-value: 1
## | Dispersion Statistic: 1.0486
pr_model_diagnostic_graphs(pr_model2a)
# Model 2 Evaluation
# === PR2 Evaluation ===
evaluate_pr_model(pr_model2b, test_transformed, "PR2b: Transformed variables with Stars Imputed")
## DHARMa:testOutliers with type = binomial may have inflated Type I error rates for integer-valued distributions. To get a more exact result, it is recommended to re-run testOutliers with type = 'bootstrap'. See ?testOutliers for details
##
## === Model Selection and Evaluation ===
##
## === PR2b: Transformed variables with Stars Imputed Evaluation ===
## RMSE: 1.3189 | MAE: 1.0257 | R²: 0.5363
## | AIC: 36331.7641
## | Pearson Chi-squared Goodness of Fit Statistic: 8881.655
## | Degrees of Freedom: 10211
## | P-value: 1
## | Dispersion Statistic: 1.0488
pr_model_diagnostic_graphs(pr_model2b)
# === Negative Binomial Regression Evaluation ===
evaluate_pr_model(nb_model1, test_transformed, "NB1: Original variables with Stars Imputed")
## DHARMa:testOutliers with type = binomial may have inflated Type I error rates for integer-valued distributions. To get a more exact result, it is recommended to re-run testOutliers with type = 'bootstrap'. See ?testOutliers for details
##
## === Model Selection and Evaluation ===
##
## === NB1: Original variables with Stars Imputed Evaluation ===
## RMSE: 1.3178 | MAE: 1.0226 | R²: 0.5371
## | AIC: 36378.8095
## | Pearson Chi-squared Goodness of Fit Statistic: 8912.2927
## | Degrees of Freedom: 10205
## | P-value: 1
## | Dispersion Statistic: 1.0526
pr_model_diagnostic_graphs(nb_model1)
# === Negative Binomial Regression Evaluation ===
evaluate_pr_model(nb_model2, test_transformed, "NB2: Transformed variables with Stars Imputed")
## DHARMa:testOutliers with type = binomial may have inflated Type I error rates for integer-valued distributions. To get a more exact result, it is recommended to re-run testOutliers with type = 'bootstrap'. See ?testOutliers for details
##
## === Model Selection and Evaluation ===
##
## === NB2: Transformed variables with Stars Imputed Evaluation ===
## RMSE: 1.3189 | MAE: 1.0257 | R²: 0.5363
## | AIC: 36334.1035
## | Pearson Chi-squared Goodness of Fit Statistic: 8881.2506
## | Degrees of Freedom: 10211
## | P-value: 1
## | Dispersion Statistic: 1.0488
pr_model_diagnostic_graphs(nb_model2)
# === Linear Regression Evaluation ===
evaluate_pr_model(lm_model1, test_transformed, "LM1: Basic variables with Stars Imputed")
##
## === Model Selection and Evaluation ===
##
## === LM1: Basic variables with Stars Imputed Evaluation ===
## RMSE: 1.5537 | MAE: 1.2403 | R²: 0.3565
## | AIC: 47688.0218
## | Pearson Chi-squared Goodness of Fit Statistic: 30972.5625
## | Degrees of Freedom: 12763
## | P-value: 0
## | Dispersion Statistic:
pr_model_diagnostic_graphs(lm_model1)
# === Linear Regression Evaluation ===
evaluate_pr_model(lm_model2, test_transformed, "LM1: Basic variables with Stars Imputed")
##
## === Model Selection and Evaluation ===
##
## === LM1: Basic variables with Stars Imputed Evaluation ===
## RMSE: 1.3184 | MAE: 1.0256 | R²: 0.5366
## | AIC: 43018.5555
## | Pearson Chi-squared Goodness of Fit Statistic: 21519.0519
## | Degrees of Freedom: 12768
## | P-value: 0
## | Dispersion Statistic:
pr_model_diagnostic_graphs(lm_model2)
This table summarizes the RMSE, RSQUARED, MAE, AIC, Pvalue and Dispersion for all SIX models.
NegBinom1 has the lowest RMSE (1.3178) and lowest MAE (1.0226) → slightly better predictive accuracy. Diagnostic plots show
NegBinom1 has highest R² (0.5371) → explains slightly more variance in TARGET.
Poisson Model 2a and 2b has slightly lower AIC (~36331.76) than NegBinom1 (36378.81). However Poisson dispersion ~1.0488 → mild overdispersion. NegBinom1 explicitly models overdispersion (dispersion = 1.0526).
Even if AIC is slightly lower for Poisson1, NegBinom1 is more robust because it accounts for overdispersion, which is a common issue in count data.
Diagnostic plots:
Hence NegBinom1 is the best count regression model overall because:
It has slightly better predictive performance (lowest RMSE/MAE, highest R²).
It handles overdispersion, which Poisson models only approximate.
Its coefficients are interpretable and consistent with Poisson models.
In the future, one model we could explore could be the ZINB model or the quasi-Poisson model. Additionally, we could test out different models that have outliers removed.
results <- data.frame(Model = c("Poisson Model 2a",
"Poisson Model 2b",
"NegBinom1",
"NegBinom2",
"MLM 1",
"MLM 2"),
RMSE = c(1.3189,1.3189,1.3178,1.3189,1.5537,1.3184),
MAE = c(1.0257,1.0257,1.0226,1.0257,1.2403 ,1.0256 ),
R2 = c(0.5363,0.5363, 0.5371,0.5363,0.3565,0.5366 ),
AIC = c(36331.7641,36331.7641,36378.8095,36334.1035,47688.0218, 43018.5555),
P_value = c(1,1,1,1,0,0),
Dispersion_Statistic = c(1.0488,1.0488,1.0526,1.0488,NA,NA )
)
knitr::kable(results)
| Model | RMSE | MAE | R2 | AIC | P_value | Dispersion_Statistic |
|---|---|---|---|---|---|---|
| Poisson Model 2a | 1.3189 | 1.0257 | 0.5363 | 36331.76 | 1 | 1.0488 |
| Poisson Model 2b | 1.3189 | 1.0257 | 0.5363 | 36331.76 | 1 | 1.0488 |
| NegBinom1 | 1.3178 | 1.0226 | 0.5371 | 36378.81 | 1 | 1.0526 |
| NegBinom2 | 1.3189 | 1.0257 | 0.5363 | 36334.10 | 1 | 1.0488 |
| MLM 1 | 1.5537 | 1.2403 | 0.3565 | 47688.02 | 0 | NA |
| MLM 2 | 1.3184 | 1.0256 | 0.5366 | 43018.56 | 0 | NA |
results <- data.frame(
Model = c("Poisson1", "Poisson2", "NegBinom1", "NegBinom2", "MLM 1", "MLM 2"),
RMSE = c(1.3189, 1.3189, 1.3178, 1.3189, 1.5537, 1.3184),
MAE = c(1.0257, 1.0257, 1.0226, 1.0257, 1.2403, 1.0256),
R2 = c(0.5363, 0.5363, 0.5371, 0.5363, 0.3565, 0.5366),
AIC = c(36331.7641, 36331.7641, 36378.8095, 36334.1035, 47688.0218, 43018.5555),
P_value = c(1, 1, 1, 1, 0, 0),
Dispersion_Statistic = c(1.0488, 1.0488, 1.0526, 1.0488, NA, NA)
)
knitr::kable(results)
| Model | RMSE | MAE | R2 | AIC | P_value | Dispersion_Statistic |
|---|---|---|---|---|---|---|
| Poisson1 | 1.3189 | 1.0257 | 0.5363 | 36331.76 | 1 | 1.0488 |
| Poisson2 | 1.3189 | 1.0257 | 0.5363 | 36331.76 | 1 | 1.0488 |
| NegBinom1 | 1.3178 | 1.0226 | 0.5371 | 36378.81 | 1 | 1.0526 |
| NegBinom2 | 1.3189 | 1.0257 | 0.5363 | 36334.10 | 1 | 1.0488 |
| MLM 1 | 1.5537 | 1.2403 | 0.3565 | 47688.02 | 0 | NA |
| MLM 2 | 1.3184 | 1.0256 | 0.5366 | 43018.56 | 0 | NA |
# Check missing data
eval_data %>%
summarise_all(~ sum(is.na(.))) %>%
pivot_longer(cols = everything(), names_to = "variable", values_to = "missing_count") %>%
filter(missing_count != 0) %>%
arrange(desc(missing_count))
## # A tibble: 9 × 2
## variable missing_count
## <chr> <int>
## 1 TARGET 3335
## 2 STARS 841
## 3 Sulphates 310
## 4 Alcohol 185
## 5 ResidualSugar 168
## 6 TotalSulfurDioxide 157
## 7 FreeSulfurDioxide 152
## 8 Chlorides 138
## 9 pH 104
# Check for duplicates
duplicates <- duplicated(eval_data)
# Print the duplicates
print(eval_data[duplicates, ])
## [1] IN TARGET FixedAcidity VolatileAcidity
## [5] CitricAcid ResidualSugar Chlorides FreeSulfurDioxide
## [9] TotalSulfurDioxide Density pH Sulphates
## [13] Alcohol LabelAppeal AcidIndex STARS
## <0 rows> (or 0-length row.names)
#Converting ordinal variables as factor data type.
eval_data$LabelAppeal= as.factor(eval_data$LabelAppeal)
eval_data$AcidIndex = as.factor(eval_data$AcidIndex)
# If value is negative (most likely inputted incorrectly), update to NA
eval_data$FixedAcidity = ifelse(eval_data$FixedAcidity < 0, NA, eval_data$FixedAcidity)
eval_data$VolatileAcidity = ifelse(eval_data$VolatileAcidity < 0, NA, eval_data$VolatileAcidity)
eval_data$CitricAcid = ifelse(eval_data$CitricAcid < 0, NA, eval_data$CitricAcid)
eval_data$ResidualSugar = ifelse(eval_data$ResidualSugar < 0, NA, eval_data$ResidualSugar)
eval_data$Chlorides = ifelse(eval_data$Chlorides < 0, NA, eval_data$Chlorides)
eval_data$FreeSulfurDioxide = ifelse(eval_data$FreeSulfurDioxide < 0, NA, eval_data$FreeSulfurDioxide)
eval_data$TotalSulfurDioxide = ifelse(eval_data$TotalSulfurDioxide < 0, NA, eval_data$TotalSulfurDioxide)
eval_data$Sulphates = ifelse(eval_data$Sulphates < 0, NA, eval_data$Sulphates)
eval_data$Alcohol = ifelse(eval_data$Alcohol < 0, NA, raw_data$Alcohol)
# Fill in NA with median
evaldata_imputed <- eval_data %>%
mutate_at(vars(c("FixedAcidity", "VolatileAcidity", "CitricAcid", "ResidualSugar", "Chlorides", "FreeSulfurDioxide", "TotalSulfurDioxide", "pH", "Sulphates", "Alcohol")), ~ifelse(is.na(.), median(., na.rm = TRUE), .))
# Create new missing indicator variable
evaldata_imputed <- evaldata_imputed %>%
mutate(is_stars_missing = is.na(STARS))
# Convert STARS to numeric as it is a factor
evaldata_imputed$STARS <- as.numeric(as.character(evaldata_imputed$STARS))
# impute STARS missing values using the median
evaldata_imputed <- evaldata_imputed %>%
mutate(STARS_imputed = coalesce(STARS, median(STARS, na.rm = TRUE)))
# Convert missing indicator and STARS variables to factor
evaldata_imputed$is_stars_missing <- as.factor(evaldata_imputed$is_stars_missing)
evaldata_imputed$STARS <- as.factor(evaldata_imputed$STARS)
evaldata_imputed$STARS_imputed <- as.factor(evaldata_imputed$STARS_imputed)
# remove TARGET variable
evaldata_imputed <- evaldata_imputed |>
dplyr::select(-TARGET)
# Check for missing values again
plot_missing(evaldata_imputed )
The histogram shows the TARGET predicted distribution.We see that as 763 observations has STARS NA, the imputation generated a TARGET value of 1 for the same.hence there is a modal at 1.
#evaldata_transform <- evaldata_imputed
#evaldata_transform <- evaldata_transform%>%
# mutate(
# SQRT_Alcohol = sqrt(evaldata_transform$Alcohol +1),
# LOG_Chlorides = log(evaldata_transform$Chlorides + 1),
# LOG_FreeSulfurDioxide = log(evaldata_transform$FreeSulfurDioxide + 1),
# LOG_TotalSulfurDioxide = log(evaldata_transform$TotalSulfurDioxide + 1),
# LOG_VolatileAcidity = log(evaldata_transform$VolatileAcidity + 1)
# )
finalpreds <- predict(nb_model1, evaldata_imputed,type = "response" )
finaldf <- cbind(evaldata_imputed,TARGET=round(finalpreds,0))
hist(finalpreds)
write.csv(finaldf, 'wine-predicted-data.csv', row.names = FALSE)