ECNM HW 3

Author

Bryan Calderon

Preparation

Clear Data

          used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
Ncells  582249 31.1    1328857   71         NA   669422 35.8
Vcells 1088659  8.4    8388608   64      16384  1851968 14.2

Loading Packages

Bringing in Data

Introduction

The wine industry is a dynamic and competitive market influenced by factors such as grape harvest quality, consumer preferences, and global trade trends. From boutique wineries to large scale producers, understanding and predicting sales is critical for managing production, inventory, and distribution.

Seasonal demand, marketing campaigns, and pricing strategies all play significant roles in driving wine sales. By leveraging data-driven models to predict the number of wine cases sold, producers and distributors can optimize operations, reduce waste, and adapt to changing market conditions, ensuring both efficiency and profitability in a rapidly evolving industry.

Variables

  • Free Sulfur dioxide: Helps in the preservation of wine.

  • Sulfates: Chemical compounds that are present in wine as a natural byproduct of fermentation and as an added preservative

  • Total Sulfur dioxide: Combined Free Sulfur dioxide with Bound sulfur dioxide which is chemically attached to other molecules in the substance, like sugars or pigments, and is not readily available for its preservative function. 

  • Chloride: Chloride levels in wine contribute the saltiness but are generally low. The concentration of chloride in wine can affect its taste and quality, and can also impact its market appeal.

  • Fixed Acidity: Amount of natural acids in wine that remain in the liquid when heated.

  • Citric acid: Not commonly found in wine, but it is found in small quantities in grapes, making up about 5% of the total acid content. Mainly used to increase acidity and prevent ferric hazes which can form in wine from metal compounds like iron and copper.

  • Volatile acids: A measure of the gaseous acids in wine, and is typically associated with the smell of vinegar, usually used in small amounts.

  • Acidity levels: Wines with higher acidity are more likely to improve over time and develop deeper flavors and more complex aromas but too much can cause it to taste sour.

  • The pH scale: Measures how acidic a wine is, with lower pH numbers indicating more acidity.

1. Cleaning the Data

Understanding Missing Variables

In the summary table above, there are variables such as ResidualSugar, Alcohol, STARS, and others which have missing data. Before continuing, it is important to have a clean set of data. The graph and table below give us a visual and numeric understanding of the key variables which will need cleaning.

             INDEX             TARGET       FixedAcidity    VolatileAcidity 
                 0                  0                  0                  0 
        CitricAcid      ResidualSugar          Chlorides  FreeSulfurDioxide 
                 0                616                638                647 
TotalSulfurDioxide            Density                 pH          Sulphates 
               682                  0                395               1210 
           Alcohol        LabelAppeal          AcidIndex              STARS 
               653                  0                  0               3359 

Dropping variables

We will be dropping variables which are not essential to the overall model, several of which have a high number of missing rows.

This includes FixedAcidity, VolatileAcidity, Chlorides, FreeSulfurDioxide, and Sulphates. Chlorides, FreeSulfurDioxide, and Sulphates each have more than 600 missing values, and some of the information in these variables is already covered by AcidIndex and TotalSulfurDioxide.

# List of columns to drop
columns_to_drop <- c("FixedAcidity", "VolatileAcidity", "Chlorides", 
                     "FreeSulfurDioxide", "Sulphates")

# Drop the columns from the dataset
dt <- dt[, !(names(dt) %in% columns_to_drop)]

Cleaning the data

Instead of simply replacing NAs with the median or average of the respective variable, we use a machine learning approach that predicts the missing values from the relationships between all variables in the dataset.

  1. Negative numbers to positives: We'll convert negative values to positive values using absolute values. Wine ingredients cannot have a negative quantity, so turning them positive makes the most sense.
  2. Blanks for ingredients: We will impute these using missForest, the random-forest-based imputation method described above.
  3. Blanks for star ratings: These will be replaced with 0s.
  4. Zeros for ingredients: We will leave the ingredient quantities as zeros. A zero quantity could be a meaningful amount of that ingredient in the wine.

# Convert STARS to numeric so missing ratings can be replaced
dt$STARS <- as.numeric(dt$STARS)

# Replace missing values in STARS with 0
dt$STARS[is.na(dt$STARS)] <- 0

# Convert STARS to a factor with 1 as the baseline level and 0 (missing) last
dt$STARS <- factor(dt$STARS, levels = c(1, 2, 3, 4, 0))
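
Steps 1 and 2 of the cleaning list are not shown as code in this report. The following is only a minimal sketch of how they might look, run on a made-up toy data frame rather than the wine data. It assumes the `missForest` package is installed and skips the imputation if it is not; the object names `toy` and `toy_imputed` are hypothetical.

```r
# Toy stand-in for the wine data: column names mirror the report, values are invented.
set.seed(1)
toy <- data.frame(
  CitricAcid    = rnorm(40, 0.3, 0.8),
  ResidualSugar = rnorm(40, 5, 30),
  Alcohol       = rnorm(40, 10, 3)
)
toy$ResidualSugar[c(3, 17)] <- NA   # inject a few blanks
toy$Alcohol[c(8, 25)]       <- NA

# Step 1: negative quantities become positive (NAs stay NA).
toy[] <- lapply(toy, abs)

# Step 2: impute the remaining blanks with missForest, if it is available.
if (requireNamespace("missForest", quietly = TRUE)) {
  toy_imputed <- missForest::missForest(toy)$ximp
} else {
  toy_imputed <- toy  # fallback for this sketch only: NAs are left untouched
}
```

Taking absolute values before imputing means the random forest never learns from negative quantities; whether the report used this order is an assumption.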

There are no longer any missing values

# Summary of missing values
colSums(is.na(dt2) | dt2 == "")
             INDEX             TARGET         CitricAcid      ResidualSugar 
                 0                  0                  0                  0 
TotalSulfurDioxide            Density                 pH            Alcohol 
                 0                  0                  0                  0 
       LabelAppeal          AcidIndex              STARS 
                 0                  0                  0 

2. Data Exploration

Summary table with changes

Below is a summary of the data showing key numerical statistics such as the median, mean, maximum, kurtosis, and standard deviation.

Summary Table of Numeric Variables

                     Mean  Median  Minimum  Maximum  Kurtosis   Skew        SD  NA Count
INDEX               8,070   8,110        1   16,129     -1.20   0.00  4,656.91         0
TARGET                  3       3        0        8     -0.88  -0.33      1.93         0
CitricAcid              1       0        0        4      2.95   1.64      0.61         0
ResidualSugar          23      14        0      141      2.46   1.49     24.36         0
TotalSulfurDioxide    205     160        0    1,057      3.33   1.64    158.89         0
Density                 1       1        1        1      1.90  -0.02      0.03         0
pH                      3       3        0        6      1.78   0.04      0.67         0
Alcohol                11      10        0       26      1.25   0.19      3.54         0
AcidIndex               8       8        4       17      5.19   1.65      1.32         0
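
The code that produced this table is not shown. As a rough sketch, similar statistics can be computed in base R. The moment-based skewness and excess-kurtosis definitions below are an assumption (packages differ slightly in their formulas), and `summarise_numeric` is a hypothetical helper name.

```r
# Moment-based skewness (assumed definition; packages may apply small-sample corrections).
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Excess kurtosis: 0 for a normal distribution (assumed definition).
excess_kurtosis <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^4) / mean((x - m)^2)^2 - 3
}

# One row per numeric column, mirroring the columns of the table above.
summarise_numeric <- function(df) {
  num <- df[vapply(df, is.numeric, logical(1))]
  data.frame(
    Mean     = vapply(num, mean,   numeric(1), na.rm = TRUE),
    Median   = vapply(num, median, numeric(1), na.rm = TRUE),
    Minimum  = vapply(num, min,    numeric(1), na.rm = TRUE),
    Maximum  = vapply(num, max,    numeric(1), na.rm = TRUE),
    Kurtosis = vapply(num, excess_kurtosis, numeric(1)),
    Skew     = vapply(num, skewness,        numeric(1)),
    SD       = vapply(num, sd,     numeric(1), na.rm = TRUE),
    NA_Count = vapply(num, function(x) sum(is.na(x)), integer(1))
  )
}

# Usage on a small made-up data frame.
toy <- data.frame(a = c(1, 2, 3, 4, 5), b = c(2, 2, 3, 10, NA))
summarise_numeric(toy)
```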

Graphs

The following graphs help us better understand outliers and distribution patterns, enabling more effective data analysis.

Distribution of Key Variables

Based on the distribution graphs below, variables such as Volatile acidity, Residual sugar, and chlorides seem to be potential variables which might benefit from a transformation as they currently show skewness.

Findings across wines:

  • The majority of acidity, pH, density, and alcohol levels are moderate.

  • Most wines have lower to moderate sugar levels.

  • People who buy cases of wine generally buy 2-6 cases, 4 being the most common.

Which variables drive cases sold?

Below we have a list of the average amount of each variable across the number of cases purchased in order to observe the relationship between the ingredients used in wine and how it might translate to increased likability by customers.

There are outliers, driven by the low number of observations in cases sold (shown in the density graph above), mainly for 8 cases sold. This would indicate the overall interpretation should be focused on the areas with higher data points (1-7 cases sold).

Below are some interesting findings:

  1. A higher star rating translates well into a higher quantity of cases purchased.
  2. Customers tend to prefer wines with higher alcohol content.
  3. Label appeal intrigues customers but does not necessarily translate to sales.
  4. Customers prefer less acidic wines.
  5. Sweeter wines (higher residual sugar) generally sell better, but there are exceptions.

Trend deep dive

Star Rating

  • Higher star ratings drive a larger number of cases purchased

  • 3-4 star wines primarily sell 4-6 cases, but 2-3 star wines sell the most in total quantity.

  • 8 cases seem to be a rarity even at 4 star wines.

Label Appeal

  • Increasing label appeal doesn't necessarily drive the number of cases purchased.

  • Labels with a rating of 3 show a wide range of amounts sold. They have the highest proportion of 6-8 cases sold, but this doesn't translate into higher overall volume.

Acidity Index

  • Generally lower acidity levels are preferred and sell in higher quantities.

Correlation Analysis

Star rating is the variable most highly correlated with the number of cases sold: generally, the higher the star rating, the higher the chance of a larger number of cases sold. Acidity level is another variable which shows a meaningful level of correlation, where lower acidity levels generally drive sales.

All other variables are not particularly correlated with the number of cases sold.
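
The correlation plot itself is not reproduced here. As an illustration only, the ranking of predictors by their correlation with TARGET could be obtained along these lines. The data below are simulated so that STARS has a positive and AcidIndex a negative effect; they are not the wine data.

```r
# Simulated stand-in data (invented values, same column names as the report).
set.seed(42)
n <- 200
toy <- data.frame(
  STARS     = sample(1:4, n, replace = TRUE),
  AcidIndex = sample(6:12, n, replace = TRUE)
)
toy$TARGET <- pmax(0, round(3 + toy$STARS - 0.4 * toy$AcidIndex + rnorm(n)))

# Pearson correlation of every predictor with TARGET, sorted from highest to lowest.
cors <- cor(toy, use = "complete.obs")[, "TARGET"]
cors <- sort(cors[names(cors) != "TARGET"], decreasing = TRUE)
cors
```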

3. Model Development

Training Data Rows: 10236 
Testing Data Rows: 2559 
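
The row counts above are consistent with an 80/20 split of 12,795 rows. The split code is not shown, so the following is a sketch under that assumption; the seed and the names `train_id` and `test_id` are hypothetical.

```r
set.seed(123)                      # assumed seed, for reproducibility only
n_rows   <- 12795                  # 10236 + 2559, the row counts reported above
train_id <- sample(seq_len(n_rows), size = round(0.8 * n_rows))
test_id  <- setdiff(seq_len(n_rows), train_id)

length(train_id)   # number of training rows
length(test_id)    # number of testing rows
```

With a data frame `dt2` in hand, the two sets would then be `dt2[train_id, ]` and `dt2[test_id, ]`.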

Multi Linear Regression Models

Model 1

The first linear regression model leverages all available variables from the dataset to predict the number of cases purchased. This model includes a range of predictors such as acidity levels, sweetness, label appeal, and star ratings. Incorporating them allows the model to capture as much information as possible on which variables contribute most to the number of cases purchased. Model 1 will use STARS and LabelAppeal as factors in order to keep the 0 star ratings and have all ratings compared to 1 (the baseline).

Model 1 will serve as the baseline for the model development.

# Replace missing STARS values with 0
tr_dt$STARS[is.na(tr_dt$STARS)] <- 0

# Convert STARS to a factor with 1 as the baseline level and 0 (missing) last
tr_dt$STARS <- factor(tr_dt$STARS, levels = c(1, 2, 3, 4, 0))

# Treat LabelAppeal as a categorical predictor
tr_dt$LabelAppeal <- as.factor(tr_dt$LabelAppeal)

Call:
lm(formula = TARGET ~ TotalSulfurDioxide + CitricAcid + Density + 
    pH + Alcohol + LabelAppeal + AcidIndex + STARS, data = tr_dt)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.9843 -0.9611  0.0707  0.9296  6.6723 

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)         4.863e+00  5.151e-01   9.442   <2e-16 ***
TotalSulfurDioxide  2.054e-04  8.516e-05   2.411   0.0159 *  
CitricAcid          4.986e-02  2.223e-02   2.243   0.0249 *  
Density            -9.099e-01  5.070e-01  -1.795   0.0727 .  
pH                 -3.665e-02  2.035e-02  -1.801   0.0718 .  
Alcohol             7.115e-03  3.840e-03   1.853   0.0639 .  
LabelAppeal2       -1.798e-02  2.820e-02  -0.638   0.5236    
LabelAppeal3        8.196e-02  5.230e-02   1.567   0.1172    
AcidIndex          -1.822e-01  1.030e-02 -17.690   <2e-16 ***
STARS2              1.192e+00  3.781e-02  31.517   <2e-16 ***
STARS3              1.931e+00  4.273e-02  45.200   <2e-16 ***
STARS4              2.784e+00  6.748e-02  41.263   <2e-16 ***
STARS0             -1.324e+00  3.841e-02 -34.458   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.365 on 10223 degrees of freedom
Multiple R-squared:  0.4988,    Adjusted R-squared:  0.4982 
F-statistic: 847.8 on 12 and 10223 DF,  p-value: < 2.2e-16

Model 1 Results

  • The Adjusted R-squared is 0.4982, indicating that the model explains about 49.8% of the variance in the number of cases of wine purchased, suggesting that the predictors have moderate explanatory power for this outcome. The F statistic of 847.8 with a p-value below 2.2e-16 indicates that the predictors collectively have a statistically significant relationship with the dependent variable.

  • Star rating has the largest coefficients among the predictors, consistent with the trends we observed in the graphs above. Wines rated with 4 stars have the highest predicted increase in purchases (2.784 cases) compared to wines with 1 star (the reference category). The missing values which were assigned a 0 star rating have a negative coefficient (-1.324), meaning these wines are predicted to sell about 1.3 fewer cases than 1-star wines. AcidIndex also has a sizable effect: a 1-unit increase in AcidIndex is associated with a decrease in the number of cases purchased of 0.182 cases.

  • Both star rating and AcidIndex are highly significant (p < 2e-16). TotalSulfurDioxide and CitricAcid are significant at the 5% level but with small magnitudes (partly related to the units they are measured in). Density, pH, and Alcohol are only marginally significant (at the 10% level), and LabelAppeal is not statistically significant.

  • Directionally, the predictors all align with what is expected.

Checking Multicollinearity

Multicollinearity occurs when two or more predictor (independent) variables are correlated with each other. The results look good: all GVIF values are close to 1, indicating no multicollinearity concerns.

                       GVIF Df GVIF^(1/(2*Df))
TotalSulfurDioxide 1.005509  1        1.002751
CitricAcid         1.002354  1        1.001176
Density            1.002148  1        1.001073
pH                 1.005469  1        1.002731
Alcohol            1.008192  1        1.004088
LabelAppeal        1.008454  2        1.002107
AcidIndex          1.044207  1        1.021864
STARS              1.048251  4        1.005908

Model 2

Model 2 will similarly use most of the variables but will apply logs to variables which show skewness. Instead of treating star rating and label appeal as factors, they will be converted to numeric. As a result, all observations with a star rating of 0 (N/A) will be removed so that the model does not treat 0 as a genuine rating on the same scale as 1-4.
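
The transformation code is not shown in this report, so the following is a sketch under stated assumptions. It uses `log1p()` (that is, log(1 + x)) for CitricAcid because that variable has a minimum of 0, and plain `log()` for AcidIndex, whose minimum is 4; whether the original used these exact functions is unknown. The toy values are invented.

```r
# Toy rows only; STARS is a factor with the level order used earlier in the report.
toy <- data.frame(
  CitricAcid = c(0, 0.3, 1.2, 4),
  AcidIndex  = c(4, 7, 8, 17),
  STARS      = factor(c(1, 0, 3, 4), levels = c(1, 2, 3, 4, 0))
)

# Log transforms for skewed predictors (log1p keeps zeros finite).
toy$log_CitricAcid <- log1p(toy$CitricAcid)
toy$log_AcidIndex  <- log(toy$AcidIndex)

# Convert the factor back to its numeric labels. Going through as.character()
# matters: as.numeric() on a factor returns level positions, not the labels.
toy$STARS <- as.numeric(as.character(toy$STARS))

# Drop rows where the star rating was missing (coded as 0).
toy_filtered <- toy[toy$STARS != 0, ]
```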


Call:
lm(formula = TARGET ~ log_CitricAcid + Alcohol + LabelAppeal + 
    log_AcidIndex + log_TotalSulfurDioxide + log_ResidualSugar + 
    STARS, data = tr_dt2_filtered)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.8029 -0.6898  0.2466  0.7705  3.9526 

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)             3.970601   0.281052  14.128  < 2e-16 ***
log_CitricAcid          0.128166   0.047919   2.675 0.007497 ** 
Alcohol                 0.013892   0.004157   3.342 0.000837 ***
LabelAppeal             0.036643   0.023748   1.543 0.122879    
log_AcidIndex          -1.238379   0.114116 -10.852  < 2e-16 ***
log_TotalSulfurDioxide  0.028793   0.017666   1.630 0.103175    
log_ResidualSugar       0.010172   0.012900   0.789 0.430388    
STARS                   0.946104   0.016325  57.956  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.276 on 7540 degrees of freedom
Multiple R-squared:  0.328, Adjusted R-squared:  0.3274 
F-statistic: 525.8 on 7 and 7540 DF,  p-value: < 2.2e-16

Model 2 Results

  • The Adjusted R-squared is 0.3274, indicating that the model explains about 32.7% of the variance in the number of cases of wine purchased, suggesting that the predictors have moderate explanatory power for this outcome. The F statistic of 525.8 with a p-value below 2.2e-16 indicates that the predictors collectively have a statistically significant relationship with the dependent variable.

  • Compared to model 1, log AcidIndex shows the highest magnitude at -1.24, indicating that a 1-unit increase in log AcidIndex is associated with a decrease in wine cases purchased of 1.24 units. Similarly, for every one-point increase in star rating, wine cases sold increase by 0.95. Other variables such as label appeal remain small in magnitude and statistically insignificant. This suggests customers are more focused on the quality of the wine than on the appearance of the label.

  • While alcohol has a lower magnitude, it, along with log AcidIndex, star rating, and log citric acid, is statistically significant.

  • Directionally, the predictors all align with what is expected.

Checking Multicollinearity

Multicollinearity occurs when two or more predictor (independent) variables are correlated with each other. The results look good: all VIF values are close to 1, indicating no multicollinearity concerns.

                       vif_linear_model
log_CitricAcid                 1.003618
Alcohol                        1.009864
LabelAppeal                    1.000672
log_AcidIndex                  1.015602
log_TotalSulfurDioxide         1.006880
log_ResidualSugar              1.002388
STARS                          1.011588
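
For numeric predictors, the VIF of a predictor equals 1 / (1 - R²), where R² comes from regressing that predictor on all the others. The report's own VIF call is not shown; the base-R sketch below only illustrates the definition on simulated, independent predictors, and `vif_manual` is a hypothetical helper.

```r
# VIF by definition: regress each predictor on the remaining ones.
vif_manual <- function(df, predictors) {
  vapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    r2 <- summary(lm(reformulate(others, response = p), data = df))$r.squared
    1 / (1 - r2)
  }, numeric(1))
}

# Simulated independent predictors, so every VIF should be close to 1.
set.seed(7)
toy <- data.frame(a = rnorm(300), b = rnorm(300), c = rnorm(300))
vifs <- vif_manual(toy, c("a", "b", "c"))
vifs
```

Values near 1, as in the tables in this report, mean a predictor is almost unexplained by the others.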

Poisson Regression Models

Model 1

The first Poisson regression model will leverage the findings from the linear models and focus on the most significant and impactful variables: citric acid, alcohol, acid index, and stars.


Call:
glm(formula = TARGET ~ CitricAcid + Alcohol + AcidIndex + STARS, 
    family = poisson, data = tr_dt2_filtered)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)  1.044411   0.048428  21.566  < 2e-16 ***
CitricAcid   0.014349   0.009854   1.456   0.1453    
Alcohol      0.003431   0.001698   2.021   0.0433 *  
AcidIndex   -0.041518   0.005310  -7.819 5.32e-15 ***
STARS        0.246004   0.006419  38.326  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 6899.5  on 7547  degrees of freedom
Residual deviance: 5314.1  on 7543  degrees of freedom
AIC: 27789

Number of Fisher Scoring iterations: 5

Model 1 Results

The model has an AIC of 27789. The coefficient magnitudes are lower than in the linear model because they are on the log scale. A one-unit increase in Stars increases the log count of cases sold by 0.2460. AcidIndex continues to have one of the largest impacts after star rating. The coefficients all look directionally in line, but only AcidIndex and Stars are highly significant; alcohol is significant at the 5% level but has a low magnitude. Following linear model 2, the data set is filtered to exclude missing stars, given that model performance has improved with this change.

Checking for Overdispersion

Mean of TARGET: 3.684817 
Variance of TARGET: 2.419926 
No overdispersion detected.
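
The check above compares the raw mean and variance of TARGET. A common complementary check, sketched here on simulated Poisson data rather than the wine data, is the Pearson dispersion statistic of the fitted model: values well above 1 suggest overdispersion, values near 1 are consistent with the Poisson assumption.

```r
# Simulated data that truly follow a Poisson model (invented, not the wine data).
set.seed(2024)
toy   <- data.frame(x = rnorm(1000))
toy$y <- rpois(1000, lambda = exp(1 + 0.2 * toy$x))

fit <- glm(y ~ x, family = poisson, data = toy)

# Pearson chi-square divided by residual degrees of freedom.
dispersion <- sum(residuals(fit, type = "pearson")^2) / df.residual(fit)
dispersion
```

Unlike the raw mean and variance comparison, this statistic accounts for the variation already explained by the predictors.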

Checking Multicollinearity

Multicollinearity occurs when two or more predictor (independent) variables are correlated with each other. The results look good: all VIF values are close to 1, indicating no multicollinearity concerns.

           vif_poisson_model
CitricAcid          1.000809
Alcohol             1.009386
AcidIndex           1.009406
STARS               1.011247

Model 2

Poisson model 2 is a zero-inflated Poisson model with all of the variables available. This will give us an indication of how well the zero-inflated model performs with these types of variables. Similar to past models, this data set is filtered to exclude missing stars, given that model performance has improved with this change.


Call:
zeroinfl(formula = TARGET ~ TotalSulfurDioxide + CitricAcid + Density + 
    pH + Alcohol + LabelAppeal + AcidIndex + STARS | TotalSulfurDioxide + 
    CitricAcid + Density + ResidualSugar, data = tr_dt2_filtered, dist = "poisson")

Pearson residuals:
    Min      1Q  Median      3Q     Max 
-1.9443 -0.3305  0.1133  0.4335  2.4248 

Count model coefficients (poisson with log link):
                     Estimate Std. Error z value Pr(>|z|)    
(Intercept)         1.281e+00  2.366e-01   5.415 6.14e-08 ***
TotalSulfurDioxide -6.146e-05  3.817e-05  -1.610   0.1073    
CitricAcid          9.240e-03  1.007e-02   0.918   0.3586    
Density            -2.353e-01  2.317e-01  -1.015   0.3099    
pH                 -2.846e-03  9.177e-03  -0.310   0.7564    
Alcohol             3.781e-03  1.714e-03   2.205   0.0274 *  
LabelAppeal         9.194e-03  9.780e-03   0.940   0.3472    
AcidIndex          -3.500e-02  5.495e-03  -6.370 1.89e-10 ***
STARS               2.312e-01  6.744e-03  34.278  < 2e-16 ***

Zero-inflation model coefficients (binomial with logit link):
                    Estimate Std. Error z value Pr(>|z|)    
(Intercept)        -1.706050   5.423403  -0.315   0.7531    
TotalSulfurDioxide -0.026020   0.004137  -6.289  3.2e-10 ***
CitricAcid         -0.466060   0.338411  -1.377   0.1685    
Density             0.787764   5.431048   0.145   0.8847    
ResidualSugar      -0.013240   0.007246  -1.827   0.0677 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Number of iterations in BFGS optimization: 35 
Log-likelihood: -1.384e+04 on 14 Df

Model 2 Results

In order to understand the model performance better, we can convert the log-likelihood to AIC, where K is the number of estimated parameters (14 here):

\(AIC = 2K - 2\ln(L)\)

\(AIC = 2 \times 14 - 2(-13{,}840) = 27{,}708\)
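
The same arithmetic can be checked in R. This sketch uses the unrounded log-likelihood of -13839.76 from the model comparison table later in the report; the variable names are illustrative.

```r
k       <- 14          # number of estimated parameters (the "Df" in the output)
log_lik <- -13839.76   # log-likelihood of the zero-inflated Poisson model

aic <- 2 * k - 2 * log_lik
aic          # 27707.52
round(aic)   # 27708
```

For a fitted model object, `AIC(model)` or `-2 * as.numeric(logLik(model)) + 2 * k` gives the same quantity when a `logLik` method is available.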

Comparing this with the AIC of 27789 from Model 1, we can see that the zero-inflated Poisson provides a better fit (a lower AIC).

Significance levels in the count model are largely unchanged, and TotalSulfurDioxide is statistically significant in the zero-inflation component. Magnitudes have shifted across the board, but star rating and AcidIndex remain the most significant variables with the highest magnitudes. For every unit increase in star rating, the expected number of wine cases sold increases by approximately 26% (\(e^{0.2312} - 1 \approx 0.26\)).

For total sulfur dioxide, on the other hand, every unit increase decreases the expected number of wine cases sold by approximately 0.006% (\(e^{-0.00006146} - 1 \approx -0.00006\)), an effect that is not statistically significant in the count model. Interestingly, this variable had a positive coefficient in linear model 1, while here it is minimally negative. In the zero-inflation component, the negative coefficient (-0.026) indicates that higher total sulfur dioxide reduces the likelihood that a wine sells zero cases.
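
Because the count model uses a log link, a coefficient β corresponds to a percentage change of (e^β - 1) × 100 in the expected count per unit increase. A small sketch using the STARS and TotalSulfurDioxide count-model coefficients from the output above; `pct_change` is a hypothetical helper name.

```r
# Percentage change in the expected count for a one-unit increase in a predictor.
pct_change <- function(beta) (exp(beta) - 1) * 100

pct_change(0.2312)       # STARS: about +26%
pct_change(-6.146e-05)   # TotalSulfurDioxide: about -0.006%
```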

Negative Binomial regression Models

Model 1

Model 1 uses the same predictors as Poisson model 1 (CitricAcid, Alcohol, AcidIndex, and STARS) without transformations. Residual sugar is excluded as it has performed poorly in earlier models. Similar to past models, this data set is filtered to exclude missing stars, given that model performance has improved with this change.

 Family: nbinom2  ( log )
Formula:          TARGET ~ CitricAcid + Alcohol + AcidIndex + STARS
Data: tr_dt2_filtered

     AIC      BIC   logLik deviance df.resid 
      NA       NA       NA       NA     7542 


Dispersion parameter for nbinom2 family (): 4.22e+07 

Conditional model:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)  1.044411   0.048427   21.57  < 2e-16 ***
CitricAcid   0.014349   0.009854    1.46   0.1453    
Alcohol      0.003431   0.001697    2.02   0.0433 *  
AcidIndex   -0.041518   0.005310   -7.82 5.32e-15 ***
STARS        0.246004   0.006419   38.33  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Model 1 Results

The AIC for model 1 is 27,791, which is above the zero-inflated Poisson model's 27,708, so the zero-inflated Poisson remains the better fit. The very large estimated dispersion parameter (4.22e+07) means the negative binomial effectively reduces to a Poisson model here, which is why the coefficients match Poisson model 1.

Model 1 shows similar trends to past models, with stars and acid index being the most influential variables in determining cases of wine sold. In this model, a one-unit increase in Stars is associated with a 27.89% increase in the expected count of wine cases sold (\(e^{0.246} - 1 \approx 0.2789\)). Stars and AcidIndex are both highly significant; alcohol is also significant in the model but has a much lower magnitude. The directions of the variables align with prior models and are as expected.

Model 2

Model 2 will leverage the variables from model 1 and apply a log to AcidIndex and CitricAcid in an attempt to improve the model. Similar to past models, this data set is filtered to exclude missing stars, given that model performance has improved with this change.

 Family: nbinom2  ( log )
Formula:          TARGET ~ log_CitricAcid + Alcohol + log_AcidIndex + STARS
Data: tr_dt2_filtered

     AIC      BIC   logLik deviance df.resid 
      NA       NA       NA       NA     7542 


Dispersion parameter for nbinom2 family (): 2.99e+07 

Conditional model:
                Estimate Std. Error z value Pr(>|z|)    
(Intercept)     1.492141   0.107084   13.93  < 2e-16 ***
log_CitricAcid  0.035357   0.019428    1.82   0.0688 .  
Alcohol         0.003431   0.001698    2.02   0.0433 *  
log_AcidIndex  -0.359546   0.047886   -7.51 5.99e-14 ***
STARS           0.246281   0.006417   38.38  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Model 2 Results

The AIC for model 2 is 27,796, which is in line with model 1.

The interpretation of model 2 is aligned with model 1 in significance, direction, and overall magnitude. A one-unit increase in Stars is associated with a 27.9% increase in the expected count of wine cases sold (\(e^{0.2463} - 1 \approx 0.279\)). What did shift were the magnitudes of the two logged variables (acid index and citric acid).

4. Choosing the Best Model

Comparing the results

Excluding the linear models, the table compares four statistical models (Poisson, Zero-Inflated Poisson (ZIP), and two Negative Binomial (NB) models) based on the following performance metrics: Akaike Information Criterion (AIC), Log-Likelihood, Root Mean Square Error (RMSE), and Mean Absolute Error (MAE).

The Zero-Inflated Poisson model demonstrates the best performance across all metrics, with the lowest AIC (27,708) and the highest Log-Likelihood (-13,840), indicating the best balance between model fit and complexity. Additionally, the ZIP model achieves the lowest RMSE (1.289423) and MAE (1.011987), reflecting more accurate predictions compared to the other models. These results highlight the importance of accounting for excess zeros in the data, as the ZIP model effectively addresses this characteristic, making it the most suitable choice for predicting the number of wine cases sold in this data set.

Aspect               RMSE                                         MAE
Weighting of Errors  Penalizes large errors more (squares them).  Treats all errors equally.
Sensitivity          Sensitive to outliers.                       Less sensitive to outliers.
Use Case             When large errors are critical.              When all errors are equally important.

Model Comparison Metrics

Model                       AIC  Log-Likelihood      RMSE       MAE
Poisson Model          27789.38       -13889.69  1.291731  1.018105
Zero Inflated Poisson  27708.00       -13839.76  1.289423  1.011987
Negative Binomial            NA              NA  1.291731  1.018105
Negative Binomial 2          NA              NA  1.292190  1.018447
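
For reference, RMSE and MAE can be computed from actual and predicted values as sketched below. The function names and example vectors are illustrative and are not taken from the report's code.

```r
# Root mean square error: squares each error, so large misses weigh more.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Mean absolute error: every unit of error counts the same.
mae <- function(actual, predicted) mean(abs(actual - predicted))

actual    <- c(1, 2, 3)
predicted <- c(1, 2, 5)   # one prediction is off by 2

rmse(actual, predicted)   # sqrt(4/3), about 1.155
mae(actual, predicted)    # 2/3, about 0.667
```

The single large miss raises RMSE more than MAE, which is the sensitivity difference summarized in the first table.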

Making predictions with the Zero Inflated Poisson Model on the evaluation data

Overall, when comparing the predicted wine cases sold with the actual values in the testing data, the model performs very well in the range of 3-6 cases sold but tends to overestimate at the upper end, predicting as many as 10 cases where the actual maximum is 8.

Comparison of Predicted (Evaluation) and Actual Wine Cases Sold

Statistic     Predicted Value  Actual Value
Min                      3.00          0.00
1st Quartile             4.00          2.00
Median                   4.00          3.00
Mean                     5.32          3.04
3rd Quartile             8.00          4.00
Max                     10.00          8.00