Overview
In this homework assignment, you will explore, analyze, and model a data set containing information on approximately 12,000 commercially available wines. The variables are mostly related to the chemical properties of the wine being sold. The response variable is the number of sample cases of wine that were purchased by wine distribution companies after sampling a wine. These cases would be used to provide tasting samples to restaurants and wine stores around the United States. The more sample cases purchased, the more likely a wine is to be sold at a high-end restaurant.
A large wine manufacturer is studying the data in order to predict the number of wine cases ordered based upon the wine characteristics. If the wine manufacturer can predict the number of cases, then that manufacturer will be able to adjust their wine offering to maximize sales.
Your objective is to build a count regression model to predict the number of cases of wine that will be sold given certain properties of the wine. HINT: Sometimes, the fact that a variable is missing is actually predictive of the target. You can only use the variables given to you (or variables that you derive from the variables provided).
| Variable Name | Definition | Theoretical Effect |
|---|---|---|
| INDEX | Identification variable (do not use) | None |
| TARGET | Number of cases purchased | None |
| AcidIndex | Proprietary method of testing total acidity of wine using a weighted average | |
| Alcohol | Alcohol content | |
| Chlorides | Chloride content of wine | |
| CitricAcid | Citric acid content | |
| Density | Density of wine | |
| FixedAcidity | Fixed acidity of wine | |
| FreeSulfurDioxide | Sulfur dioxide content of wine | |
| LabelAppeal | Marketing score indicating label appeal. High: customers like it. Negative: customers dislike it. | Higher label appeal suggests higher sales. |
| ResidualSugar | Residual sugar of wine | |
| STARS | Wine rating by expert panel (1–4 stars) | Higher number of stars suggests higher sales. |
| Sulphates | Sulfate content of wine | |
| TotalSulfurDioxide | Total sulfur dioxide content of wine | |
| VolatileAcidity | Volatile acid content of wine | |
| pH | pH of wine | |
Data Exploration
In this section, describe the structure of the wine training dataset, highlight important summary statistics, identify missing values, examine relationships with the target variable, and show key visualizations.
Dataset Overview
The training dataset contains 12,795 observations, 14 predictor variables, and the target variable (TARGET). The variables include chemical measurements such as acidity, sulfur content, pH levels, and alcohol concentration, as well as marketing-related attributes like LabelAppeal and STARS. The evaluation set holds 3,335 observations. All variables are numeric: LabelAppeal, AcidIndex, and STARS are stored as integers, and the other 11 predictors are of numeric type.
Summary Statistics
Provide mean, median, standard deviation, and distribution shape for key variables.
      IN          FixedAcidity     VolatileAcidity    CitricAcid
Min.   :    3   Min.   :-18.200   Min.   :-2.8300   Min.   :-3.1200
1st Qu.: 4018   1st Qu.:  5.200   1st Qu.: 0.0800   1st Qu.: 0.0000
Median : 7906   Median :  6.900   Median : 0.2800   Median : 0.3100
Mean   : 8048   Mean   :  6.864   Mean   : 0.3103   Mean   : 0.3124
3rd Qu.:12061   3rd Qu.:  9.000   3rd Qu.: 0.6300   3rd Qu.: 0.6050
Max.   :16130   Max.   : 33.500   Max.   : 3.6100   Max.   : 3.7600

ResidualSugar     Chlorides         FreeSulfurDioxide   TotalSulfurDioxide
Min.   :-128.300  Min.   :-1.15000  Min.   :-563.00     Min.   :-769.00
1st Qu.:  -2.600  1st Qu.: 0.01600  1st Qu.:   3.00     1st Qu.:  27.25
Median :   3.600  Median : 0.04700  Median :  30.00     Median : 124.00
Mean   :   5.319  Mean   : 0.06143  Mean   :  34.95     Mean   : 123.41
3rd Qu.:  17.200  3rd Qu.: 0.17100  3rd Qu.:  79.25     3rd Qu.: 210.00
Max.   : 145.400  Max.   : 1.26300  Max.   : 617.00     Max.   :1004.00
NA's   :168       NA's   :138       NA's   :152         NA's   :157

Density         pH             Sulphates        Alcohol
Min.   :0.8898  Min.   :0.600  Min.   :-3.0700  Min.   :-4.20
1st Qu.:0.9883  1st Qu.:2.980  1st Qu.: 0.3300  1st Qu.: 9.00
Median :0.9946  Median :3.210  Median : 0.5000  Median :10.40
Mean   :0.9947  Mean   :3.237  Mean   : 0.5346  Mean   :10.58
3rd Qu.:1.0005  3rd Qu.:3.490  3rd Qu.: 0.8200  3rd Qu.:12.50
Max.   :1.0998  Max.   :6.210  Max.   : 4.1800  Max.   :25.60
                NA's   :104    NA's   :310      NA's   :185

LabelAppeal       AcidIndex       STARS
Min.   :-2.00000  Min.   : 5.000  Min.   :1.00
1st Qu.:-1.00000  1st Qu.: 7.000  1st Qu.:1.00
Median : 0.00000  Median : 8.000  Median :2.00
Mean   : 0.01349  Mean   : 7.748  Mean   :2.04
3rd Qu.: 1.00000  3rd Qu.: 8.000  3rd Qu.:3.00
Max.   : 2.00000  Max.   :17.000  Max.   :4.00
                                  NA's   :841
Code
library(corrplot)
corrplot 0.95 loaded
Code
corr_p <- corrplot(cor_mat, method = "color", type = "upper")
Code
# Histogram of every variable except TARGET, six panels per page
par(mfrow = c(2, 3))
for (v in names(wine_data)[names(wine_data) != "TARGET"]) {
  hist(wine_data[[v]], main = v, col = "lightgrey")
}
Code
hist(wine_data$TARGET)
Most variables in the dataset display fairly balanced, centered distributions, consistent with standardized chemical measures. FixedAcidity, VolatileAcidity, CitricAcid, Density, pH, Alcohol, and similar predictors do not exhibit strong skew and therefore do not require transformation.
AcidIndex is the main exception, showing a noticeably right-skewed distribution. Because of this, we will bucket AcidIndex into categories rather than apply a mathematical transform, as bucketing provides a more stable and interpretable way to handle its nonlinearity.
The STARS rating variable is discrete (1–4) with many missing values. Instead of treating missing STARS as zeros or excluding them, we will recode missing values as “Not Rated”, so wines without an expert rating are not implicitly penalized or interpreted as having low quality.
Other variables, including LabelAppeal, are well-behaved and can be used in their original numeric form. Thus, only AcidIndex and STARS require special treatment in the preparation step.
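The two preparation steps described above can be sketched on a toy data frame. This is a minimal illustration only: the cut points and the "Low" and "Medium" labels are assumptions (the report does not state the actual bucket boundaries), and the values are made up.

```r
# Toy data standing in for the wine data; values are illustrative only.
toy <- data.frame(
  AcidIndex = c(5, 7, 8, 9, 11, 17),
  STARS     = c(1, NA, 3, NA, 4, 2)
)

# Hypothetical cut points: the actual boundaries are not given in the report.
toy$AcidIndex_bucket <- cut(
  toy$AcidIndex,
  breaks = c(-Inf, 7, 8, 10, Inf),
  labels = c("Low", "Medium", "High", "Very High")
)

# Treat STARS as a category and give missing ratings their own level,
# so unrated wines are neither dropped nor scored as zero.
stars_chr <- ifelse(is.na(toy$STARS), "Not Rated", as.character(toy$STARS))
toy$STARS <- factor(stars_chr, levels = c("Not Rated", "1", "2", "3", "4"))

table(toy$AcidIndex_bucket)
table(toy$STARS)
```

Making "Not Rated" the first factor level turns it into the reference category in a regression, so each star level is then compared against unrated wines.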
The predictor variables also show minimal correlation with one another.
The TARGET variable ranges from 0 to 8 and shows a moderately right-skewed distribution. The most common purchase counts fall between 2 and 5 cases, with a noticeable spike at 0 but no severe zero inflation. Very few wines reach the upper end of the range, and the distribution decreases steadily after 4 cases. Overall, the shape of TARGET is consistent with a typical count outcome, supporting the use of Poisson and Negative Binomial regression models.
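# Bar chart of the AcidIndex buckets in the test data
test_acid_plt <- ggplot(wine_test, aes(x = AcidIndex_bucket)) +
  geom_bar(fill = "steelblue") +
  theme_minimal() +
  labs(title = "AcidIndex Buckets in Test Data",
       x = "AcidIndex Bucket",
       y = "Count")

# Bar chart of the STARS categories in the test data
test_stars_plt <- ggplot(wine_test, aes(x = STARS)) +
  geom_bar(fill = "darkred") +
  theme_minimal() +
  labs(title = "STARS Ratings in Test Data",
       x = "STARS Category",
       y = "Count")

# Combining two ggplot objects with `+` requires the patchwork package to be loaded
test_stars_plt + test_acid_plt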
Code
set.seed(123)

# total number of rows
n <- nrow(wine_data)

# randomly sample 75% of the data for training
train_idx <- sample(seq_len(n), size = 0.75 * n)

# create training and validation sets
wine_train <- wine_data[train_idx, ]
wine_valid <- wine_data[-train_idx, ]

cat("Training rows:", nrow(wine_train), "\n")
Training rows: 9596
Code
cat("Validation rows:", nrow(wine_valid), "\n")
Validation rows: 3199
Several chemical variables contained missing values, including ResidualSugar, Chlorides, FreeSulfurDioxide, TotalSulfurDioxide, Sulphates, and Alcohol. For each of these, a binary missingness indicator was created, as missing measurements may carry predictive value. The original numeric values were then imputed using the median, which keeps the center of each distribution intact while preventing the loss of observations. All data preparation was then replicated on the test set. Finally, the training dataset was split 75/25 into training and validation sets for model development.
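A minimal sketch of the indicator-plus-median step on toy data follows. The helper name `flag_and_impute` and the `_missing` suffix are hypothetical, and reusing the training medians for the test set is an assumption about how the preparation was replicated, not something the report states.

```r
# Toy training and test frames; the values are illustrative only.
train <- data.frame(Alcohol   = c(9.0, NA, 10.4, 12.5, NA),
                    Sulphates = c(0.33, 0.50, NA, 0.82, 0.45))
test  <- data.frame(Alcohol   = c(NA, 11.0),
                    Sulphates = c(0.60, NA))

vars_with_na <- c("Alcohol", "Sulphates")

# Medians are computed on the training data only, then reused for the test set.
train_medians <- sapply(train[vars_with_na], median, na.rm = TRUE)

# Hypothetical helper: adds <var>_missing flags, then fills NAs with the median.
flag_and_impute <- function(df, vars, medians) {
  for (v in vars) {
    df[[paste0(v, "_missing")]] <- as.integer(is.na(df[[v]]))
    df[[v]][is.na(df[[v]])] <- medians[[v]]
  }
  df
}

train_prep <- flag_and_impute(train, vars_with_na, train_medians)
test_prep  <- flag_and_impute(test,  vars_with_na, train_medians)
```

The flag is created before the fill so that the information about which rows were missing survives imputation.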
Building Models
In this section we explore different models to predict the target. We begin with a baseline linear model, then a multiple linear regression, followed by Poisson and other count models.
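As a rough illustration of this workflow, the sketch below fits a linear benchmark and a Poisson GLM to simulated data and scores both on a held-out 25%. The data, coefficients, and the helper `valid_metrics` are all made up for the example; only the predictor names echo the wine data.

```r
set.seed(123)

# Simulated stand-in for the wine data; not the real observations.
n <- 1000
sim <- data.frame(LabelAppeal = sample(-2:2, n, replace = TRUE),
                  Alcohol     = rnorm(n, mean = 10.5, sd = 2))
sim$TARGET <- rpois(n, lambda = exp(0.8 + 0.2 * sim$LabelAppeal + 0.01 * sim$Alcohol))

# 75/25 split into training and validation rows.
train_idx <- sample(seq_len(n), size = 0.75 * n)
sim_train <- sim[train_idx, ]
sim_valid <- sim[-train_idx, ]

# Baseline linear model and Poisson GLM on the same predictors.
lm_fit   <- lm(TARGET ~ LabelAppeal + Alcohol, data = sim_train)
pois_fit <- glm(TARGET ~ LabelAppeal + Alcohol, family = poisson, data = sim_train)

# Hypothetical helper returning validation MAE, RMSE, and the model's AIC.
valid_metrics <- function(fit, newdata) {
  pred <- predict(fit, newdata = newdata, type = "response")
  err  <- newdata$TARGET - pred
  c(MAE = mean(abs(err)), RMSE = sqrt(mean(err^2)), AIC = AIC(fit))
}

rbind(Linear  = valid_metrics(lm_fit, sim_valid),
      Poisson = valid_metrics(pois_fit, sim_valid))
```

Scoring every model with the same helper on the same validation rows keeps the comparison table consistent across model families.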
pred_zinb <- predict(zinb1, newdata = wine_valid, type = "response")

# Compute MAE and RMSE for ZINB
mae_zinb <- mean(abs(wine_valid$TARGET - pred_zinb))
rmse_zinb <- sqrt(mean((wine_valid$TARGET - pred_zinb)^2))

# Extract AIC properly
aic_zinb <- AIC(zinb1)

# Add row to results
new_row <- data.frame(
  Model = "ZINB1",
  Type = "ZINB",
  R2 = NA,      # not defined for ZINB
  Adj_R2 = NA,  # not applicable
  MAE = mae_zinb,
  RMSE = rmse_zinb,
  AIC = aic_zinb,
  stringsAsFactors = FALSE
)
model_results <- rbind(model_results, new_row)
Selecting Model
Across all models evaluated, the Zero-Inflated Negative Binomial model (ZINB1) had the lowest AIC (30,923.52) and the lowest validation MAE (1.010). Its RMSE (1.321) was marginally higher than that of Poisson2, NegBin2, and NegBin1 (1.312 to 1.313), so its advantage lies in model fit and absolute error rather than squared error. Although the expanded linear regression model achieved competitive error metrics, linear models violate distributional assumptions for discrete count data and cannot serve as the final deployed model. They are retained only as benchmarks for comparison.
Among the traditional count regression models, the Negative Binomial and Poisson variants performed reasonably well, but none matched the AIC or MAE of the ZINB model. Both Poisson models showed much higher AIC values (34,151.71 and 34,463.51) and higher MAE. The Negative Binomial models added little: NegBin2 and Poisson2 produced nearly identical AIC, MAE, and RMSE, which suggests that overdispersion alone is not the main source of misfit. The larger problem is the probability mass at zero. Roughly 20% of the training data are zero purchases, yet the log link forces strictly positive expected values, so the standard Poisson and NB models systematically underpredicted zeros.
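The zero shortfall can be illustrated with a small simulation. In the sketch below, 20% of observations are forced to zero and the rest are Poisson counts; the 20% rate and the mean of 3 are illustrative choices, not estimates from the wine data. An intercept-only Poisson fit is then asked how many zeros it expects.

```r
set.seed(42)

# Simulated zero-inflated counts: 20% structural zeros, otherwise Poisson(3).
n <- 5000
structural_zero <- rbinom(n, size = 1, prob = 0.2)
y <- ifelse(structural_zero == 1, 0L, rpois(n, lambda = 3))

# Intercept-only Poisson fit; the fitted mean equals the sample mean.
pois_fit <- glm(y ~ 1, family = poisson)
mu_hat   <- fitted(pois_fit)

# Share of zeros in the data versus the share the Poisson model implies.
observed_zero_share <- mean(y == 0)
implied_zero_share  <- mean(dpois(0, lambda = mu_hat))

c(observed = observed_zero_share, implied = implied_zero_share)
```

Under a Poisson model the probability of a zero is exp(-mu), so a fitted mean well above zero leaves little room for zeros regardless of how many appear in the data.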
Introducing a zero-inflation component addressed this deficiency. The ZINB model explicitly separates the data-generating process into (1) a structural zero mechanism and (2) a count process for wines that are actually ordered. This two-part structure yielded the lowest MAE and a substantially better AIC than the NB or Poisson models. It also allowed the model to produce realistic zero predictions, something the standard GLM-based count models could not accomplish.
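The two-part structure can be written out directly in base R. The sketch below defines a zero-inflated negative binomial probability function; the helper name `dzinb`, the argument name `pi_zero`, and the parameter values are hypothetical and chosen only for illustration.

```r
# Zero-inflated negative binomial probability mass function (base R only).
# pi_zero: probability of a structural zero; mu, size: NB mean and dispersion.
dzinb <- function(y, mu, size, pi_zero) {
  nb <- dnbinom(y, mu = mu, size = size)
  ifelse(y == 0, pi_zero + (1 - pi_zero) * nb, (1 - pi_zero) * nb)
}

# Illustrative parameter values, not estimates from the report.
mu <- 3.5
size <- 50
pi_zero <- 0.2

# Probability of a zero with and without the zero-inflation component.
p_zero_nb   <- dnbinom(0, mu = mu, size = size)
p_zero_zinb <- dzinb(0, mu, size, pi_zero)

# The expected count is shrunk by the structural-zero probability.
expected_count <- (1 - pi_zero) * mu

c(NB = p_zero_nb, ZINB = p_zero_zinb, mean = expected_count)
```

A zero can arise from either part, which is why the zero probability is the structural share plus the count model's own zero probability weighted by the remaining share.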
Given its substantially better model fit, lowest MAE, and more realistic handling of zero outcomes, ZINB1 was selected as the final model for generating predictions on the evaluation dataset.
Code
model_results |> arrange(AIC)
         Model    Type        R2    Adj_R2      MAE     RMSE      AIC
1        ZINB1    ZINB        NA        NA 1.010027 1.320891 30923.52
2 LM2_Expanded  Linear 0.5419982 0.5415682 1.036195 1.322580 32343.47
3 LM1_Baseline  Linear 0.5220872 0.5218380 1.075446 1.356140 32743.83
4     Poisson2 Poisson        NA        NA 1.029192 1.313099 34151.71
5      NegBin2  NegBin        NA        NA 1.029193 1.313099 34154.03
6      NegBin1  NegBin        NA        NA 1.024112 1.311981 34176.19
7 Poisson Base Poisson        NA        NA 1.044851 1.334146 34463.51
# Extract residuals
res_raw <- residuals(zinb1, type = "response")
res_pearson <- residuals(zinb1, type = "pearson")

# Extract fitted values
fitted_vals <- predict(zinb1, type = "response")

# --- 1. Residual Distribution ---
hist(res_raw,
     breaks = 30,
     col = "lightgrey",
     border = "white",
     main = "ZINB Residual Distribution",
     xlab = "Raw Residuals")
Code
# --- 2. QQ Plot ---
qqnorm(res_pearson,
       main = "ZINB QQ Plot (Pearson Residuals)",
       pch = 19, col = "#3366AA60")
qqline(res_pearson, col = "red", lwd = 2)
Code
# --- 3. Fitted vs Actual ---
plot(fitted_vals, wine_train$TARGET,
     pch = 19, col = "#3366AA60",
     xlab = "Fitted Values (ZINB)",
     ylab = "Actual TARGET",
     main = "Fitted vs Actual")
abline(0, 1, col = "red", lwd = 2)
# 8. Plot prediction distribution
pred_dis <- ggplot(predsdf, aes(x = PREDICTED_TARGET)) +
  geom_histogram(binwidth = 1, fill = "lightgrey", color = "white") +
  labs(title = "Final TARGET Predictions (ZINB)",
       x = "Predicted Case Sales")

# 9. Save results for submission
write.csv(predsdf, "final_predictions_ZINB.csv", row.names = FALSE)
Conclusion
Code
hist(wine_train$TARGET)
Code
hist(predsdf$PREDICTED_TARGET)
The comparison between the training TARGET distribution and the ZINB model’s predicted distribution shows that the model successfully reproduces the overall shape, center, and spread of the observed case-sales behavior. Both histograms exhibit their highest frequencies in the 2–5 case range, indicating that the model accurately captures the most common purchasing patterns seen in the training data.
The model also preserves the long right tail, with non-zero predictions extending up to 7–8 cases, matching the maximum observed values in the training set. Importantly, unlike the standard Poisson and Negative Binomial models, both of which failed to generate zeros, the ZINB model produces a realistic number of zero predictions driven by its explicit zero-inflation component. Although the height of the zero bar is slightly lower than in the training distribution, the overall magnitude is directionally consistent and represents a substantial improvement over traditional count GLMs.
Across the remaining count values, the predicted distribution aligns closely with the empirical one, with only minor differences in relative frequencies. The slight smoothing around counts 3–5 reflects the expected behavior of a probabilistic regression model producing mean-based predictions rather than exact replications of sample frequencies.
Overall, the ZINB model provides a close approximation of the observed TARGET distribution, capturing both the central mass and the zero-inflation present in the data. This supports using the model to generate predictions for the evaluation dataset.
In the count model, STARS ratings and LabelAppeal are the strongest positive predictors of case sales. Higher STARS levels significantly increase expected purchases, and LabelAppeal has a large positive effect, confirming that expert ratings and marketing appeal drive higher order volumes. Alcohol content also has a small positive effect. In contrast, wines in the High and Very High AcidIndex buckets show significantly lower expected sales, indicating that high acidity reduces demand.
In the zero-inflation model, STARS ratings greatly reduce the likelihood of a structural zero, meaning expert-rated wines are far less likely to receive no orders at all. LabelAppeal also significantly decreases the chance of a zero outcome. Together, these results show that both expert ratings and label strength influence not only how many cases a wine sells, but also whether it is ordered in the first place.