Data621 - Homework 5

Author

Bikash Bhowmik, Rupendra Shrestha, Anthony Josue Roman, Jerald Melukkaran, Haoming Chen

Published

May 17, 2026

Introduction

Data-driven decision-making is essential for achieving effective business outcomes. In this analysis, we begin by exploring the dataset to identify potential issues such as outliers, missing values, encoding inconsistencies, and multicollinearity. After addressing these challenges through appropriate data preprocessing techniques, we develop and evaluate several predictive models for estimating sales.

The dataset is divided into training and evaluation subsets, allowing us to train models on one portion and validate their performance on unseen data. Multiple modeling approaches are compared, and the final model is selected based on its ability to balance predictive accuracy with model simplicity.

Code
library(summarytools)
library(MASS)
library(rpart.plot)
library(ggplot2)
library(ggfortify)
library(gridExtra)
library(forecast)
library(fpp2)
library(fma)
library(kableExtra)
library(e1071)
library(mlbench)
library(ggcorrplot)
library(DataExplorer)
library(timeDate)
library(caret)
library(GGally)
library(corrplot)
library(RColorBrewer)
library(tibble)
library(tidyr)
library(tidyverse)
library(dplyr)
library(reshape2)
library(mixtools)
library(tidymodels)
library(ggpmisc)
library(regclass)
library(skimr)
library(corrgram)
library(mice)
#' Print a side-by-side Histogram and QQPlot of Residuals
#'
#' @param model A model
#' @examples
#' residPlot(myModel)
#' @return null
#' @export
residPlot <- function(model) {
  # Make sure a model was passed
  if (is.null(model)) {
    return
  }
  
  layout(matrix(c(1,1,2,3), 2, 2, byrow = TRUE))
  plot(residuals(model))
  hist(model[["residuals"]], freq = FALSE, breaks = "fd", main = "Residual Histogram",
       xlab = "Residuals",col="lightgreen")
  lines(density(model[["residuals"]], kernel = "ep"),col="blue", lwd=3)
  curve(dnorm(x,mean=mean(model[["residuals"]]), sd=sd(model[["residuals"]])), col="red", lwd=3, lty="dotted", add=T)
  qqnorm(model[["residuals"]], main = "Residual Q-Q plot")
  qqline(model[["residuals"]],col="red", lwd=3, lty="dotted")
  par(mfrow = c(1, 1))
}
#' Print a Variable Importance Plot for the provided model
#'
#' @param model The model
#' @param chart_title The Title to show on the plot
#' @examples
#' variableImportancePlot(myLinearModel, 'My Title)
#' @return null
#' @export
variableImportancePlot <- function(model=NULL, chart_title='Variable Importance Plot') {
  # Make sure a model was passed
  if (is.null(model)) {
    return
  }
  
  # use caret and gglot to print a variable importance plot
  varImp(model) %>% as.data.frame() %>% 
    ggplot(aes(x = reorder(rownames(.), desc(Overall)), y = Overall)) +
    geom_col(aes(fill = Overall)) +
    theme(panel.background = element_blank(),
          panel.grid = element_blank(),
          axis.text.x = element_text(angle = 90)) +
    scale_fill_gradient() +
    labs(title = chart_title,
         x = "Parameter",
         y = "Relative Importance")
}
#' Print a Facet Chart of histograms
#'
#' @param df Dataset
#' @param box Facet size (rows)
#' @examples
#' histbox(my_df, 3)
#' @return null
#' @export
histbox <- function(df, box) {
    par(mfrow = box)
    ndf <- dimnames(df)[[2]]
    
    for (i in seq_along(ndf)) {
            data <- na.omit(unlist(df[, i]))
            hist(data, breaks = "fd", main = paste("Histogram of", ndf[i]),
                 xlab = ndf[i], freq = FALSE)
            lines(density(data, kernel = "ep"), col = 'red')
    }
    
    par(mfrow = c(1, 1))
}
#' Extract key performance results from a model
#'
#' @param model A linear model of interest
#' @examples
#' model_performance_extraction(my_model)
#' @return data.frame
#' @export
model_performance_extraction <- function(model=NULL) {
  # Make sure a model was passed
  if (is.null(model)) {
    return
  }
  
  data.frame("RSE" = model$sigma,
             "Adj R2" = model$adj.r.squared,
             "F-Statistic" = model$fstatistic[1])
}

Data Exploration

Code
train_df <- read.csv('https://raw.githubusercontent.com/BIKASHBHOWMIK15/Data-621/main/wine-training-data.csv', header=T) %>% as.tibble()
evaluate_df <- read.csv('https://raw.githubusercontent.com/BIKASHBHOWMIK15/Data-621/main/wine-evaluation-data.csv', header=T) %>% as.tibble()
train_df$INDEX <- NULL # Remove index column
evaluate_df$IN <- NULL
#evaluate_df$TARGET <- NULL
str(train_df)
tibble [12,795 × 15] (S3: tbl_df/tbl/data.frame)
 $ TARGET            : int [1:12795] 3 3 5 3 4 0 0 4 3 6 ...
 $ FixedAcidity      : num [1:12795] 3.2 4.5 7.1 5.7 8 11.3 7.7 6.5 14.8 5.5 ...
 $ VolatileAcidity   : num [1:12795] 1.16 0.16 2.64 0.385 0.33 0.32 0.29 -1.22 0.27 -0.22 ...
 $ CitricAcid        : num [1:12795] -0.98 -0.81 -0.88 0.04 -1.26 0.59 -0.4 0.34 1.05 0.39 ...
 $ ResidualSugar     : num [1:12795] 54.2 26.1 14.8 18.8 9.4 ...
 $ Chlorides         : num [1:12795] -0.567 -0.425 0.037 -0.425 NA 0.556 0.06 0.04 -0.007 -0.277 ...
 $ FreeSulfurDioxide : num [1:12795] NA 15 214 22 -167 -37 287 523 -213 62 ...
 $ TotalSulfurDioxide: num [1:12795] 268 -327 142 115 108 15 156 551 NA 180 ...
 $ Density           : num [1:12795] 0.993 1.028 0.995 0.996 0.995 ...
 $ pH                : num [1:12795] 3.33 3.38 3.12 2.24 3.12 3.2 3.49 3.2 4.93 3.09 ...
 $ Sulphates         : num [1:12795] -0.59 0.7 0.48 1.83 1.77 1.29 1.21 NA 0.26 0.75 ...
 $ Alcohol           : num [1:12795] 9.9 NA 22 6.2 13.7 15.4 10.3 11.6 15 12.6 ...
 $ LabelAppeal       : int [1:12795] 0 -1 -1 -1 0 0 0 1 0 0 ...
 $ AcidIndex         : int [1:12795] 8 7 8 6 9 11 8 7 6 8 ...
 $ STARS             : int [1:12795] 2 3 3 1 2 NA NA 3 NA 4 ...
Code
str(evaluate_df)
tibble [3,335 × 15] (S3: tbl_df/tbl/data.frame)
 $ TARGET            : logi [1:3335] NA NA NA NA NA NA ...
 $ FixedAcidity      : num [1:3335] 5.4 12.4 7.2 6.2 11.4 17.6 15.5 15.9 11.6 3.8 ...
 $ VolatileAcidity   : num [1:3335] -0.86 0.385 1.75 0.1 0.21 0.04 0.53 1.19 0.32 0.22 ...
 $ CitricAcid        : num [1:3335] 0.27 -0.76 0.17 1.8 0.28 -1.15 -0.53 1.14 0.55 0.31 ...
 $ ResidualSugar     : num [1:3335] -10.7 -19.7 -33 1 1.2 1.4 4.6 31.9 -50.9 -7.7 ...
 $ Chlorides         : num [1:3335] 0.092 1.169 0.065 -0.179 0.038 ...
 $ FreeSulfurDioxide : num [1:3335] 23 -37 9 104 70 -250 10 115 35 40 ...
 $ TotalSulfurDioxide: num [1:3335] 398 68 76 89 53 140 17 381 83 129 ...
 $ Density           : num [1:3335] 0.985 0.99 1.046 0.989 1.029 ...
 $ pH                : num [1:3335] 5.02 3.37 4.61 3.2 2.54 3.06 3.07 2.99 3.32 4.72 ...
 $ Sulphates         : num [1:3335] 0.64 1.09 0.68 2.11 -0.07 -0.02 0.75 0.31 2.18 -0.64 ...
 $ Alcohol           : num [1:3335] 12.3 16 8.55 12.3 4.8 11.4 8.5 11.4 -0.5 10.9 ...
 $ LabelAppeal       : int [1:3335] -1 0 0 -1 0 1 0 1 0 0 ...
 $ AcidIndex         : int [1:3335] 6 6 8 8 10 8 12 7 12 7 ...
 $ STARS             : int [1:3335] NA 2 1 1 NA 4 3 NA NA NA ...

The training dataset consists of 12,795 observations and 15 variables, including 14 predictors and one response variable (TARGET). The INDEX column serves only as an identifier and is removed prior to analysis. Most predictors represent chemical properties of wine, while LabelAppeal and STARS reflect subjective quality and presentation ratings.

The response variable, TARGET, represents the number of wine cases purchased and takes integer values between 0 and 8. A large proportion of observations contain zero values, indicating many instances with no purchases.

Code
library(vtable)
st(train_df)
Summary Statistics
Variable N Mean Std. Dev. Min Pctl. 25 Pctl. 75 Max
TARGET 12795 3 1.9 0 2 4 8
FixedAcidity 12795 7.1 6.3 -18 5.2 9.5 34
VolatileAcidity 12795 0.32 0.78 -2.8 0.13 0.64 3.7
CitricAcid 12795 0.31 0.86 -3.2 0.03 0.58 3.9
ResidualSugar 12179 5.4 34 -128 -2 16 141
Chlorides 12157 0.055 0.32 -1.2 -0.031 0.15 1.4
FreeSulfurDioxide 12148 31 149 -555 0 70 623
TotalSulfurDioxide 12113 121 232 -823 27 208 1057
Density 12795 0.99 0.027 0.89 0.99 1 1.1
pH 12400 3.2 0.68 0.48 3 3.5 6.1
Sulphates 11585 0.53 0.93 -3.1 0.28 0.86 4.2
Alcohol 12142 10 3.7 -4.7 9 12 26
LabelAppeal 12795 -0.0091 0.89 -2 -1 1 2
AcidIndex 12795 7.8 1.3 4 7 8 17
STARS 9436 2 0.9 1 1 3 4

Predictor Variables

Among the predictor variables, several contain missing values, particularly STARS, Sulphates, and sulfur-related measurements. The dataset includes both continuous and discrete variables, with LabelAppeal, AcidIndex, and STARS treated as categorical features.

Some chemical variables contain negative values, which are not physically meaningful. These values are likely the result of prior transformations such as normalization or scaling. While this inconsistency is noted, the values are retained (with adjustments applied later) due to limited information about the data generation process.

Response Variables

The distribution of the response variable indicates that most observations correspond to low sales volumes, with a high frequency of zero values. For non-zero observations, the distribution appears approximately symmetric, suggesting moderate variability in purchase behavior.

Variables Distributions

Most continuous variables exhibit approximately normal distributions, although some degree of skewness is present. Variables such as AcidIndex and STARS show noticeable right skewness, indicating that higher values occur less frequently.

Boxplot analysis reveals the presence of outliers across several variables, particularly in TotalSulfurDioxide, FreeSulfurDioxide, and ResidualSugar. These variables also display wider ranges compared to others, suggesting greater variability.

We observe that continuous variables exhibit a somewhat normal steep distribution. However, variables such as AcidIndex and STARS display right skewness.

In the box plot, there are not many outliers in the variables. However, TotalSulfurDioxide, FreeSulfurDioxide, and ResidualSugar variables have large ranges compared to other variables.We can tell a high number of variables have numerous outliers.

Relationship Between Categorical and Response Variables

Exploration of categorical predictors shows meaningful relationships with the response variable. Higher values of STARS are generally associated with increased sales, indicating that perceived quality influences purchasing behavior. Similarly, LabelAppeal appears positively related to sales, suggesting that product presentation plays a role in consumer decisions.

The AcidIndex variable shows variation in sales across its levels, indicating that certain acidity levels may be more favorable in the market.

The bar charts compare the three discrete categorical variables to the TARGET variable. AcidIndex shows large quantity of wine were sold with the index number 7 and 8. LabelAppeal shows us generic label does yield higher number of wine samples per order. Lastly, STARS shows high star wine bottles have high price tags. For each of these predictors, there appears to be a significant relationship between the ordered levels and the number of wine cases sold.

Multicolinearity

Correlation analysis reveals that STARS and LabelAppeal have the strongest positive relationships with the response variable. In contrast, AcidIndex shows a mild negative correlation with sales. Overall, while some correlations exist among predictors, no severe multicollinearity issues are observed.

x
TARGET 1.0000000
FixedAcidity -0.0125381
VolatileAcidity -0.0759979
CitricAcid 0.0023450
ResidualSugar 0.0035196
Chlorides -0.0304301
FreeSulfurDioxide 0.0226398
TotalSulfurDioxide 0.0216021
Density -0.0475989
pH 0.0002199
Sulphates -0.0212204
Alcohol 0.0737771
LabelAppeal 0.4979465
AcidIndex -0.1676431
STARS 0.5546857

In the correlation table, we can see that STARS and LabelAppeal are most positively correlated variables with the response variable. Also, we some mild negative correlation between the response variable and AcidIndex variable.

Data Preparation

Negative Values

To address negative values in chemical variables, absolute transformations are applied to ensure all values are non-negative. For LabelAppeal, values are shifted by adding the minimum value to maintain relative differences while eliminating negative entries.

Code
train_df$FixedAcidity <- abs(train_df$FixedAcidity)
evaluate_df$FixedAcidity <- abs(evaluate_df$FixedAcidity)

train_df$VolatileAcidity <- abs(train_df$VolatileAcidity)
evaluate_df$VolatileAcidity <- abs(evaluate_df$VolatileAcidity)

train_df$CitricAcid <- abs(train_df$CitricAcid)
evaluate_df$CitricAcid <- abs(evaluate_df$CitricAcid)

train_df$ResidualSugar <- abs(train_df$ResidualSugar)
evaluate_df$ResidualSugar <- abs(evaluate_df$ResidualSugar)

train_df$Chlorides <- abs(train_df$Chlorides)
evaluate_df$Chlorides <- abs(evaluate_df$Chlorides)

train_df$FreeSulfurDioxide <- abs(train_df$FreeSulfurDioxide)
evaluate_df$FreeSulfurDioxide <- abs(evaluate_df$FreeSulfurDioxide)

train_df$TotalSulfurDioxide <- abs(train_df$TotalSulfurDioxide)
evaluate_df$TotalSulfurDioxide <- abs(evaluate_df$TotalSulfurDioxide)

train_df$Sulphates <- abs(train_df$Sulphates)
evaluate_df$Sulphates <- abs(evaluate_df$Sulphates)

train_df$LabelAppeal <- train_df$LabelAppeal + abs(min(train_df$LabelAppeal))
evaluate_df$LabelAppeal <- evaluate_df$LabelAppeal + abs(min(evaluate_df$LabelAppeal))

Missing Data

Several variables contain missing values, with STARS having the highest proportion. Missing values are assumed to be Missing at Random (MAR) and are imputed using the MICE (Multiple Imputation by Chained Equations) method.

According to the graph, the data set has multiple variables with missing variables. The STARS variable has the most NA values. The Sulphates variable records missing values in roughly 10% of observations, while the remaining six predictors have missing values ranging from 3% to 5%. These missing values will be imputed later on during the data preparation using the MICE package and random forest prediction method.

Code
train_df$STARS[is.na(train_df$STARS)] <- 0
evaluate_df$STARS[is.na(evaluate_df$STARS)] <- 0
# Perform multiple imputation
mice_imputes <- mice(train_df, m = 2, maxit = 2, print = FALSE)
# Visualize the imputed values with density plot
densityplot(mice_imputes)

We can see that each of the remaining variables with missing values seem to be MAR, as the mice imputation distributions roughly match the existing. We’ll also run the mice imputation again on both the train and test set. Instead of using it for our models, however, we’ll simplify our run and fill in our data. Finally, after our analysis, we can use it in in our model, we’ll update STARS to become a factor variable.

Code
mice_train <-  mice(train_df, m = 1, maxit = 1, print = FALSE)
cleaned_train <- complete(mice_train)

mice_evaluate <- mice(evaluate_df, m = 1, maxit = 1, print = FALSE)
cleaned_evaluate <- complete(mice_evaluate)

cleaned_train$STARS <- as.factor(cleaned_train$STARS)
cleaned_evaluate$STARS <- as.factor(cleaned_evaluate$STARS)

Descriptive Summaries and Correlation Review

Following data cleaning and imputation, updated summaries and visualizations confirm that the dataset is well-structured and suitable for modeling. The distributions remain consistent with earlier observations, and correlation patterns are preserved.

Code
# Create a summary table
summary_table <- descr(cleaned_train)

# Transpose the summary table
transposed_summary_table <- t(summary_table)

# Display transposed summary table as an HTML table
knitr::kable(transposed_summary_table, "html", escape = FALSE) %>%
  kableExtra::kable_styling("striped", full_width = FALSE) %>%
  kableExtra::column_spec(1, bold = TRUE) %>%
  kableExtra::scroll_box(height = "500px")
Mean Std.Dev Min Q1 Median Q3 Max MAD IQR CV Skewness SE.Skewness Kurtosis N.Valid N Pct.Valid
AcidIndex 7.7727237 1.3239264 4.00000 7.00000 8.00000 8.00000 17.00000 1.4826000 1.000000 0.1703298 1.6484959 0.0216523 5.1900925 12795 12795 100
Alcohol 10.4993787 3.7221234 -4.70000 9.00000 10.40000 12.40000 26.50000 2.3721600 3.400000 0.3545089 -0.0229672 0.0216523 1.5377444 12795 12795 100
Chlorides 0.2219656 0.2339522 0.00000 0.04600 0.09800 0.36700 1.35100 0.1008168 0.321000 1.0540020 1.4904903 0.0216523 2.2167294 12795 12795 100
CitricAcid 0.6863150 0.6060052 0.00000 0.28000 0.44000 0.97000 3.86000 0.3261720 0.690000 0.8829841 1.6428101 0.0216523 2.9474433 12795 12795 100
Density 0.9942027 0.0265376 0.88809 0.98772 0.99449 1.00052 1.09924 0.0093552 0.012795 0.0266924 -0.0186938 0.0216523 1.8999592 12795 12795 100
FixedAcidity 8.0632513 4.9961186 0.00000 5.60000 7.00000 9.80000 34.40000 2.9652000 4.200000 0.6196159 1.1742806 0.0216523 1.9666235 12795 12795 100
FreeSulfurDioxide 106.3788980 107.9424065 0.00000 28.00000 55.50000 171.00000 623.00000 60.0453000 143.000000 1.0146975 1.5304247 0.0216523 2.4323558 12795 12795 100
LabelAppeal 1.9909340 0.8910892 0.00000 1.00000 2.00000 3.00000 4.00000 1.4826000 2.000000 0.4475735 0.0084295 0.0216523 -0.2622916 12795 12795 100
pH 3.2095756 0.6796052 0.48000 2.96000 3.20000 3.47000 6.13000 0.3854760 0.510000 0.2117430 0.0530955 0.0216523 1.6364090 12795 12795 100
ResidualSugar 23.3332200 24.9046958 0.00000 3.60000 12.90000 38.70000 141.15000 16.3086000 35.050000 1.0673493 1.4648801 0.0216523 2.2121765 12795 12795 100
Sulphates 0.8457593 0.6543182 0.00000 0.43000 0.59000 1.10000 4.24000 0.3261720 0.665000 0.7736459 1.6875669 0.0216523 3.1921092 12795 12795 100
TARGET 3.0290739 1.9263682 0.00000 2.00000 3.00000 4.00000 8.00000 1.4826000 2.000000 0.6359595 -0.3263010 0.0216523 -0.8772457 12795 12795 100
TotalSulfurDioxide 204.4345057 163.3449901 0.00000 99.00000 154.00000 263.00000 1057.00000 102.2994000 164.000000 0.7990089 1.6066196 0.0216523 2.9981944 12795 12795 100
VolatileAcidity 0.6410856 0.5556141 0.00000 0.25000 0.41000 0.91000 3.68000 0.3261720 0.660000 0.8666770 1.6529782 0.0216523 3.0833040 12795 12795 100
Code
corrgram(drop_na(cleaned_train), order=TRUE,
         upper.panel=panel.cor, main="Revised Correlation")

Split the Sample data Set

The dataset is divided into training (80%) and testing (20%) subsets. Models are trained on the training data and evaluated on the testing data to ensure unbiased performance assessment.

[1] "Number of Training Samples:  10238"
[1] "Number of Testing Samples:  2557"

Build Models

Multiple models are developed to predict wine sales, including Poisson regression, Negative Binomial regression, and Multiple Linear Regression. Both full models (using all predictors) and reduced models (using selected predictors) are evaluated.

Poisson Regression Model 1

In this first model, we include all available features: FixedAcidity, VolatileAcidity, CitricAcid, ResidualSugar, Chlorides, FreeSulfurDioxide, TotalSulfurDioxide, Density, pH, Sulphates, Alcohol, LabelAppeal, AcidIndex, STARS


Call:
glm(formula = TARGET ~ ., family = poisson, data = trainingData)

Coefficients:
                      Estimate  Std. Error z value             Pr(>|z|)    
(Intercept)         0.77409205  0.21936126   3.529             0.000417 ***
FixedAcidity       -0.00040551  0.00116853  -0.347             0.728569    
VolatileAcidity    -0.03771453  0.01057979  -3.565             0.000364 ***
CitricAcid          0.00871051  0.00933750   0.933             0.350897    
ResidualSugar      -0.00002156  0.00022846  -0.094             0.924824    
Chlorides          -0.02905990  0.02440929  -1.191             0.233840    
FreeSulfurDioxide   0.00004275  0.00005265   0.812             0.416811    
TotalSulfurDioxide  0.00007030  0.00003499   2.009             0.044508 *  
Density            -0.27194833  0.21359608  -1.273             0.202951    
pH                 -0.00909826  0.00848421  -1.072             0.283551    
Sulphates          -0.01380324  0.00884872  -1.560             0.118780    
Alcohol             0.00340288  0.00154565   2.202             0.027695 *  
LabelAppeal         0.15693239  0.00689540  22.759 < 0.0000000000000002 ***
AcidIndex          -0.07844295  0.00510771 -15.358 < 0.0000000000000002 ***
STARS1              0.78894479  0.02198143  35.891 < 0.0000000000000002 ***
STARS2              1.10700979  0.02051650  53.957 < 0.0000000000000002 ***
STARS3              1.23395885  0.02160440  57.116 < 0.0000000000000002 ***
STARS4              1.34651749  0.02722745  49.454 < 0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 18365  on 10237  degrees of freedom
Residual deviance: 10963  on 10220  degrees of freedom
AIC: 36520

Number of Fisher Scoring iterations: 6
         RMSE      Rsquared           MAE           aic           bic 
    2.5894545     0.5202773     2.2187243 36519.6914823 36649.9009905 

Poisson Regression Model 2

In this Model 2, we only include the most predictive features : VolatileAcidity, TotalSulfurDioxide, Alcohol, LabelAppeal, AcidIndex, STARS


Call:
glm(formula = TARGET ~ VolatileAcidity + TotalSulfurDioxide + 
    Alcohol + LabelAppeal + AcidIndex + STARS, family = poisson, 
    data = trainingData)

Coefficients:
                      Estimate  Std. Error z value             Pr(>|z|)    
(Intercept)         0.46404014  0.05004844   9.272 < 0.0000000000000002 ***
VolatileAcidity    -0.03779725  0.01057921  -3.573             0.000353 ***
TotalSulfurDioxide  0.00007139  0.00003497   2.042             0.041176 *  
Alcohol             0.00346452  0.00154517   2.242             0.024951 *  
LabelAppeal         0.15704568  0.00689219  22.786 < 0.0000000000000002 ***
AcidIndex          -0.07878071  0.00503813 -15.637 < 0.0000000000000002 ***
STARS1              0.79022846  0.02197602  35.959 < 0.0000000000000002 ***
STARS2              1.10830042  0.02050642  54.046 < 0.0000000000000002 ***
STARS3              1.23542281  0.02159526  57.208 < 0.0000000000000002 ***
STARS4              1.34776225  0.02722184  49.510 < 0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 18365  on 10237  degrees of freedom
Residual deviance: 10971  on 10228  degrees of freedom
AIC: 36512

Number of Fisher Scoring iterations: 6
         RMSE      Rsquared           MAE           aic           bic 
    2.5896019     0.5196981     2.2189296 36512.1625275 36584.5011431 

Negative Binomial Regression Model 3

Similar to Poisson Model 1, the predictors for the following model are: FixedAcidity, VolatileAcidity, CitricAcid, ResidualSugar, Chlorides, FreeSulfurDioxide, TotalSulfurDioxide, Density, pH, Sulphates, Alcohol, LabelAppeal, AcidIndex, STARS


Call:
glm.nb(formula = TARGET ~ ., data = trainingData, init.theta = 40385.77048, 
    link = log)

Coefficients:
                      Estimate  Std. Error z value             Pr(>|z|)    
(Intercept)         0.77412034  0.21937119   3.529             0.000417 ***
FixedAcidity       -0.00040553  0.00116858  -0.347             0.728572    
VolatileAcidity    -0.03771592  0.01058025  -3.565             0.000364 ***
CitricAcid          0.00871069  0.00933794   0.933             0.350909    
ResidualSugar      -0.00002156  0.00022847  -0.094             0.924825    
Chlorides          -0.02906023  0.02441040  -1.190             0.233856    
FreeSulfurDioxide   0.00004275  0.00005265   0.812             0.416809    
TotalSulfurDioxide  0.00007031  0.00003499   2.009             0.044505 *  
Density            -0.27195199  0.21360583  -1.273             0.202965    
pH                 -0.00909909  0.00848460  -1.072             0.283530    
Sulphates          -0.01380381  0.00884911  -1.560             0.118781    
Alcohol             0.00340279  0.00154572   2.201             0.027706 *  
LabelAppeal         0.15693140  0.00689571  22.758 < 0.0000000000000002 ***
AcidIndex          -0.07844532  0.00510790 -15.358 < 0.0000000000000002 ***
STARS1              0.78894380  0.02198189  35.891 < 0.0000000000000002 ***
STARS2              1.10700885  0.02051696  53.956 < 0.0000000000000002 ***
STARS3              1.23395871  0.02160498  57.115 < 0.0000000000000002 ***
STARS4              1.34651820  0.02722862  49.452 < 0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Negative Binomial(40385.77) family taken to be 1)

    Null deviance: 18364  on 10237  degrees of freedom
Residual deviance: 10962  on 10220  degrees of freedom
AIC: 36522

Number of Fisher Scoring iterations: 1

              Theta:  40386 
          Std. Err.:  37815 
Warning while fitting theta: iteration limit reached 

 2 x log-likelihood:  -36484.03 
         RMSE      Rsquared           MAE           aic           bic 
    2.5894544     0.5202774     2.2187242 36522.0294485 36659.4728183 

Negative Binomial Regression Model 4

Similar to Poisson Model 2, the predictors for the following model are: VolatileAcidity, FreeSulfurDioxide, TotalSulfurDioxide, Alcohol, LabelAppeal, AcidIndex, STARS


Call:
glm.nb(formula = TARGET ~ VolatileAcidity + FreeSulfurDioxide + 
    TotalSulfurDioxide + Alcohol + LabelAppeal + AcidIndex + 
    STARS, data = cleaned_train, init.theta = 40652.06888, link = log)

Coefficients:
                      Estimate  Std. Error z value             Pr(>|z|)    
(Intercept)         0.47968629  0.04487625  10.689 < 0.0000000000000002 ***
VolatileAcidity    -0.03722498  0.00939627  -3.962            0.0000744 ***
FreeSulfurDioxide   0.00005225  0.00004695   1.113              0.26578    
TotalSulfurDioxide  0.00007202  0.00003124   2.306              0.02112 *  
Alcohol             0.00383886  0.00137502   2.792              0.00524 ** 
LabelAppeal         0.15921448  0.00612586  25.991 < 0.0000000000000002 ***
AcidIndex          -0.08029836  0.00449927 -17.847 < 0.0000000000000002 ***
STARS1              0.77095784  0.01952990  39.476 < 0.0000000000000002 ***
STARS2              1.09248107  0.01820351  60.015 < 0.0000000000000002 ***
STARS3              1.21301989  0.01917442  63.262 < 0.0000000000000002 ***
STARS4              1.32845978  0.02428370  54.706 < 0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Negative Binomial(40652.07) family taken to be 1)

    Null deviance: 22860  on 12794  degrees of freedom
Residual deviance: 13685  on 12784  degrees of freedom
AIC: 45652

Number of Fisher Scoring iterations: 1

              Theta:  40652 
          Std. Err.:  34179 
Warning while fitting theta: iteration limit reached 

 2 x log-likelihood:  -45628.3 
        RMSE     Rsquared          MAE          aic          bic 
    2.587590     0.521018     2.219023 45652.300273 45741.781990 

Multiple Linear Regression Model 5

The predictors for the following model are: FixedAcidity, VolatileAcidity, CitricAcid, ResidualSugar, Chlorides, FreeSulfurDioxide, TotalSulfurDioxide, Density, pH, Sulphates, Alcohol, LabelAppeal, AcidIndex, STARS


Call:
lm(formula = TARGET ~ ., data = trainingData)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.7840 -0.8650  0.0312  0.8399  6.2291 

Coefficients:
                      Estimate  Std. Error t value             Pr(>|t|)    
(Intercept)         2.75422419  0.49566783   5.557         0.0000000282 ***
FixedAcidity       -0.00027954  0.00263634  -0.106              0.91556    
VolatileAcidity    -0.10895369  0.02352111  -4.632         0.0000036633 ***
CitricAcid          0.02956979  0.02137977   1.383              0.16667    
ResidualSugar      -0.00022937  0.00051916  -0.442              0.65864    
Chlorides          -0.07282474  0.05520024  -1.319              0.18710    
FreeSulfurDioxide   0.00013280  0.00011998   1.107              0.26838    
TotalSulfurDioxide  0.00021374  0.00007931   2.695              0.00705 ** 
Density            -0.81604691  0.48503467  -1.682              0.09251 .  
pH                 -0.02156609  0.01915625  -1.126              0.26028    
Sulphates          -0.03754085  0.01986690  -1.890              0.05884 .  
Alcohol             0.01139949  0.00350881   3.249              0.00116 ** 
LabelAppeal         0.45963839  0.01536330  29.918 < 0.0000000000000002 ***
AcidIndex          -0.19857363  0.01020326 -19.462 < 0.0000000000000002 ***
STARS1              1.39777083  0.03686492  37.916 < 0.0000000000000002 ***
STARS2              2.41504952  0.03577240  67.512 < 0.0000000000000002 ***
STARS3              3.01790496  0.04153312  72.663 < 0.0000000000000002 ***
STARS4              3.68540530  0.06599243  55.846 < 0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.308 on 10220 degrees of freedom
Multiple R-squared:  0.5403,    Adjusted R-squared:  0.5396 
F-statistic: 706.7 on 17 and 10220 DF,  p-value: < 0.00000000000000022
         RMSE      Rsquared           MAE           aic           bic 
    1.3073277     0.5403412     1.0267270 34579.4481653 34716.8915351 

Multiple Linear Regression Model 6

For the final Linear Model, we leverage stepAIC on our Linear Model 5 to choose the most important features.


Call:
lm(formula = TARGET ~ VolatileAcidity + TotalSulfurDioxide + 
    Density + Sulphates + Alcohol + LabelAppeal + AcidIndex + 
    STARS, data = trainingData)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.8173 -0.8659  0.0271  0.8387  6.2224 

Coefficients:
                      Estimate  Std. Error t value             Pr(>|t|)    
(Intercept)         2.70580477  0.49038315   5.518         0.0000000352 ***
VolatileAcidity    -0.10935501  0.02351808  -4.650         0.0000033637 ***
TotalSulfurDioxide  0.00021735  0.00007927   2.742              0.00612 ** 
Density            -0.83317901  0.48490585  -1.718              0.08579 .  
Sulphates          -0.03791323  0.01985604  -1.909              0.05624 .  
Alcohol             0.01145759  0.00350852   3.266              0.00110 ** 
LabelAppeal         0.46023187  0.01535940  29.964 < 0.0000000000000002 ***
AcidIndex          -0.19810353  0.01002861 -19.754 < 0.0000000000000002 ***
STARS1              1.39966651  0.03685559  37.977 < 0.0000000000000002 ***
STARS2              2.41704093  0.03574989  67.610 < 0.0000000000000002 ***
STARS3              3.02002624  0.04151441  72.746 < 0.0000000000000002 ***
STARS4              3.68598701  0.06599099  55.856 < 0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.309 on 10226 degrees of freedom
Multiple R-squared:  0.5401,    Adjusted R-squared:  0.5396 
F-statistic:  1092 on 11 and 10226 DF,  p-value: < 0.00000000000000022
         RMSE      Rsquared           MAE           aic           bic 
    1.3077413     0.5400503     1.0268622 34573.9257117 34667.9659121 

Select Models

Model performance is evaluated using metrics such as RMSE, R-squared, MAE, AIC, and BIC. Among all models, the multiple linear regression models demonstrate superior predictive performance.

In particular, the stepwise-selected model (Model 6) achieves the best balance across evaluation metrics, with lower AIC and BIC values indicating improved model efficiency.

RMSE Rsquared MAE aic bic
Poission_Evaluate1 2.589454 0.5202773 2.218724 36519.69 36649.90
Poission_Evaluate2 2.589602 0.5196981 2.218930 36512.16 36584.50
Negative_Binomial_eval3 2.589454 0.5202774 2.218724 36522.03 36659.47
Negative_Binomial_eval4 2.587590 0.5210180 2.219023 45652.30 45741.78
Linear_Regression_eval5 1.307328 0.5403412 1.026727 34579.45 34716.89
Linear_Regression_eval6 1.307741 0.5400503 1.026862 34573.93 34667.97

This table summarizes the RMSE, RSQUARED, MAE, AIC and BIC for all SIX models. The Linear regressions (Linear Model 5 and Linear Model 6) had the overall best performance based on RMSE and RSQUARED; as well as Linear Model 6 had the best performance based on AIC and BIC.

Finally, we chose Multiple Linear Regression Model 6 as our final model since it had a far lower AIC and BIC.

Prediction

The selected model is applied to the evaluation dataset to generate predictions. The resulting distribution of predicted values closely resembles that of the training data, suggesting that the model generalizes well.

Code
eval_data <- cleaned_evaluate %>% select(-TARGET)
predictions <- predict(lm6, eval_data)

eval_data$TARGET <- predictions

hist(predictions)

Code
write.csv(eval_data, 'DATA621_HW5_Predictions.csv', row.names=FALSE)

head(eval_data)
  FixedAcidity VolatileAcidity CitricAcid ResidualSugar Chlorides
1          5.4           0.860       0.27          10.7     0.092
2         12.4           0.385       0.76          19.7     1.169
3          7.2           1.750       0.17          33.0     0.065
4          6.2           0.100       1.80           1.0     0.179
5         11.4           0.210       0.28           1.2     0.038
6         17.6           0.040       1.15           1.4     0.535
  FreeSulfurDioxide TotalSulfurDioxide Density   pH Sulphates Alcohol
1                23                398 0.98527 5.02      0.64   12.30
2                37                 68 0.99048 3.37      1.09   16.00
3                 9                 76 1.04641 4.61      0.68    8.55
4               104                 89 0.98877 3.20      2.11   12.30
5                70                 53 1.02899 2.54      0.07    4.80
6               250                140 0.95028 3.06      0.02   11.40
  LabelAppeal AcidIndex STARS   TARGET
1           1         6     0 1.265634
2           2         6     2 4.144115
3           2         8     1 2.466589
4           1         8     1 2.226393
5           2        10     0 0.828798
6           3         8     4 5.551819

The histogram shows that our predictions have a similar shape to our training Target variable, the means and medians are almost identical, and the kurtosis values are close.

Conclusions

The final selected model, a multiple linear regression with stepwise feature selection, achieves an adjusted R-squared of approximately 0.53. Key predictors such as STARS, Alcohol, and LabelAppeal show positive relationships with sales, while AcidIndex is negatively associated.

Overall, the model aligns well with insights obtained during exploratory analysis and provides a reasonable balance between interpretability and predictive performance.