Introduction

Data is the key to making informed decisions and achieving success in modern business. We’ll begin our analysis by examining the dataset for outliers, missing data, potential encoding errors, multicollinearity etc., then, we’ll implement any required data cleaning procedures. Once we’ve prepared a reliable dataset, we’ll construct and assess three distinct linear models to forecast sales. Our dataset comprises both training and evaluation data; we’ll train the models using the primary training dataset and then assess their performance against the separate evaluation dataset. Finally, we’ll choose a best model that strikes the optimal balance between accuracy and simplicity.

Data Exploration

## tibble [12,795 × 15] (S3: tbl_df/tbl/data.frame)
##  $ TARGET            : int [1:12795] 3 3 5 3 4 0 0 4 3 6 ...
##  $ FixedAcidity      : num [1:12795] 3.2 4.5 7.1 5.7 8 11.3 7.7 6.5 14.8 5.5 ...
##  $ VolatileAcidity   : num [1:12795] 1.16 0.16 2.64 0.385 0.33 0.32 0.29 -1.22 0.27 -0.22 ...
##  $ CitricAcid        : num [1:12795] -0.98 -0.81 -0.88 0.04 -1.26 0.59 -0.4 0.34 1.05 0.39 ...
##  $ ResidualSugar     : num [1:12795] 54.2 26.1 14.8 18.8 9.4 ...
##  $ Chlorides         : num [1:12795] -0.567 -0.425 0.037 -0.425 NA 0.556 0.06 0.04 -0.007 -0.277 ...
##  $ FreeSulfurDioxide : num [1:12795] NA 15 214 22 -167 -37 287 523 -213 62 ...
##  $ TotalSulfurDioxide: num [1:12795] 268 -327 142 115 108 15 156 551 NA 180 ...
##  $ Density           : num [1:12795] 0.993 1.028 0.995 0.996 0.995 ...
##  $ pH                : num [1:12795] 3.33 3.38 3.12 2.24 3.12 3.2 3.49 3.2 4.93 3.09 ...
##  $ Sulphates         : num [1:12795] -0.59 0.7 0.48 1.83 1.77 1.29 1.21 NA 0.26 0.75 ...
##  $ Alcohol           : num [1:12795] 9.9 NA 22 6.2 13.7 15.4 10.3 11.6 15 12.6 ...
##  $ LabelAppeal       : int [1:12795] 0 -1 -1 -1 0 0 0 1 0 0 ...
##  $ AcidIndex         : int [1:12795] 8 7 8 6 9 11 8 7 6 8 ...
##  $ STARS             : int [1:12795] 2 3 3 1 2 NA NA 3 NA 4 ...
## tibble [3,335 × 15] (S3: tbl_df/tbl/data.frame)
##  $ TARGET            : logi [1:3335] NA NA NA NA NA NA ...
##  $ FixedAcidity      : num [1:3335] 5.4 12.4 7.2 6.2 11.4 17.6 15.5 15.9 11.6 3.8 ...
##  $ VolatileAcidity   : num [1:3335] -0.86 0.385 1.75 0.1 0.21 0.04 0.53 1.19 0.32 0.22 ...
##  $ CitricAcid        : num [1:3335] 0.27 -0.76 0.17 1.8 0.28 -1.15 -0.53 1.14 0.55 0.31 ...
##  $ ResidualSugar     : num [1:3335] -10.7 -19.7 -33 1 1.2 1.4 4.6 31.9 -50.9 -7.7 ...
##  $ Chlorides         : num [1:3335] 0.092 1.169 0.065 -0.179 0.038 ...
##  $ FreeSulfurDioxide : num [1:3335] 23 -37 9 104 70 -250 10 115 35 40 ...
##  $ TotalSulfurDioxide: num [1:3335] 398 68 76 89 53 140 17 381 83 129 ...
##  $ Density           : num [1:3335] 0.985 0.99 1.046 0.989 1.029 ...
##  $ pH                : num [1:3335] 5.02 3.37 4.61 3.2 2.54 3.06 3.07 2.99 3.32 4.72 ...
##  $ Sulphates         : num [1:3335] 0.64 1.09 0.68 2.11 -0.07 -0.02 0.75 0.31 2.18 -0.64 ...
##  $ Alcohol           : num [1:3335] 12.3 16 8.55 12.3 4.8 11.4 8.5 11.4 -0.5 10.9 ...
##  $ LabelAppeal       : int [1:3335] -1 0 0 -1 0 1 0 1 0 0 ...
##  $ AcidIndex         : int [1:3335] 6 6 8 8 10 8 12 7 12 7 ...
##  $ STARS             : int [1:3335] NA 2 1 1 NA 4 3 NA NA NA ...

The training data set has 12,795 rows and 15 columns: 14 features and 1 response variable, TARGET. The variable INDEX is used for observed identification. Twelve of the features describe the chemical properties of wine. The remaining two predictors are rating variables: LabelAppeal refers to the perceived attractiveness of a wine’s product label, while STARS is an assessment of wine quality. The output variable, TARGET, is a count measure indicating the number of wine case purchases by distributors.

Summary Statistics
Variable N Mean Std. Dev. Min Pctl. 25 Pctl. 75 Max
TARGET 12795 3 1.9 0 2 4 8
FixedAcidity 12795 7.1 6.3 -18 5.2 9.5 34
VolatileAcidity 12795 0.32 0.78 -2.8 0.13 0.64 3.7
CitricAcid 12795 0.31 0.86 -3.2 0.03 0.58 3.9
ResidualSugar 12179 5.4 34 -128 -2 16 141
Chlorides 12157 0.055 0.32 -1.2 -0.031 0.15 1.4
FreeSulfurDioxide 12148 31 149 -555 0 70 623
TotalSulfurDioxide 12113 121 232 -823 27 208 1057
Density 12795 0.99 0.027 0.89 0.99 1 1.1
pH 12400 3.2 0.68 0.48 3 3.5 6.1
Sulphates 11585 0.53 0.93 -3.1 0.28 0.86 4.2
Alcohol 12142 10 3.7 -4.7 9 12 26
LabelAppeal 12795 -0.0091 0.89 -2 -1 1 2
AcidIndex 12795 7.8 1.3 4 7 8 17
STARS 9436 2 0.9 1 1 3 4

Predictor Variables

Of the 14 feature columns, 8 of them such as ResidualSugar, Chlorides, FreeSulfurDioxide, TotalSulfurDioxide, pH, Sulphates, Alcohol, and STARS variables contain several missing values. LabelAppeal, AcidIndex, and STARS are discrete variables (i.e categorical) and the rest are continuous. We also noticed that several numerical features representing chemical quantities in the wine exhibit negative minimum values. We hypothesize that the original chemical measurements might have been normalized (potentially via a log transform), allowing for negative values. However, from a physical standpoint, negative concentrations shouldn’t be possible. Despite this, we’ve opted to retain these values as they are.

Response Variables

We see that the response variable, TARGET value is always between 0 and 8, which makes sense as this is the “Number of Cases of Wine Sold”. In addition, the distribution of wine cases sold, given at least one sale, exhibits symmetry and approximates normality. The target variable, number of cases, is shown below. The data shows a large number of zero values.

Variables Distributions

We observe that continuous variables exhibit a somewhat normal steep distribution. However, variables such as AcidIndex and STARS display right skewness.

In the box plot, there are not many outliers in the variables. However, TotalSulfurDioxide, FreeSulfurDioxide, and ResidualSugar variables have large ranges compared to other variables.We can tell a high number of variables have numerous outliers.

Relationship Between Categorical and Response Variables

The bar charts compare the three discrete categorical variables to the TARGET variable. AcidIndex shows large quantity of wine were sold with the index number 7 and 8. LabelAppeal shows us generic label does yield higher number of wine samples per order. Lastly, STARS shows high star wine bottles have high price tags. For each of these predictors, there appears to be a significant relationship between the ordered levels and the number of wine cases sold.

Multicolinearity

x
TARGET 1.0000000
FixedAcidity -0.0125381
VolatileAcidity -0.0759979
CitricAcid 0.0023450
ResidualSugar 0.0035196
Chlorides -0.0304301
FreeSulfurDioxide 0.0226398
TotalSulfurDioxide 0.0216021
Density -0.0475989
pH 0.0002199
Sulphates -0.0212204
Alcohol 0.0737771
LabelAppeal 0.4979465
AcidIndex -0.1676431
STARS 0.5546857

In the correlation table, we can see that STARS and LabelAppeal are most positively correlated variables with the response variable. Also, we some mild negative correlation between the response variable and AcidIndex variable.

Data Preparation

Negative Values

The data has some wine quality measures that are negative that should not be. We will simply take the absolute value of these for now. The alternative would be to center by adding the min of each variable. Since we are given little information about the source of this dataset, and why these quality measures are so off, it is difficult to ascertain the best overall approach. For LabelAppeal, we will add the min.

Missing Data

According to the graph, the data set has multiple variables with missing variables. The STARS variable has the most NA values. The Sulphates variable records missing values in roughly 10% of observations, while the remaining six predictors have missing values ranging from 3% to 5%. These missing values will be imputed later on during the data preparation using the MICE package and random forest prediction method.

We can see that each of the remaining variables with missing values seem to be MAR, as the mice imputation distributions roughly match the existing. We’ll also run the mice imputation again on both the train and test set. Instead of using it for our models, however, we’ll simplify our run and fill in our data. Finally, after our analysis, we can use it in in our model, we’ll update STARS to become a factor variable.

Descriptive Summaries and Correlation Review

Mean Std.Dev Min Q1 Median Q3 Max MAD IQR CV Skewness SE.Skewness Kurtosis N.Valid Pct.Valid
AcidIndex 7.7727237 1.3239264 4.00000 7.00000 8.00000 8.00000 17.00000 1.4826000 1.000000 0.1703298 1.6484959 0.0216523 5.1900925 12795 100
Alcohol 10.4807704 3.7233165 -4.70000 9.00000 10.40000 12.40000 26.50000 2.3721600 3.400000 0.3552522 -0.0306208 0.0216523 1.5399928 12795 100
Chlorides 0.2227882 0.2342173 0.00000 0.04600 0.09900 0.36900 1.35100 0.1008168 0.323000 1.0513004 1.4798442 0.0216523 2.1837525 12795 100
CitricAcid 0.6863150 0.6060052 0.00000 0.28000 0.44000 0.97000 3.86000 0.3261720 0.690000 0.8829841 1.6428101 0.0216523 2.9474433 12795 100
Density 0.9942027 0.0265376 0.88809 0.98772 0.99449 1.00052 1.09924 0.0093552 0.012795 0.0266924 -0.0186938 0.0216523 1.8999592 12795 100
FixedAcidity 8.0632513 4.9961186 0.00000 5.60000 7.00000 9.80000 34.40000 2.9652000 4.200000 0.6196159 1.1742806 0.0216523 1.9666235 12795 100
FreeSulfurDioxide 106.2892927 108.0521706 0.00000 28.00000 55.00000 171.00000 623.00000 59.3040000 143.000000 1.0165857 1.5351875 0.0216523 2.4592719 12795 100
LabelAppeal 1.9909340 0.8910892 0.00000 1.00000 2.00000 3.00000 4.00000 1.4826000 2.000000 0.4475735 0.0084295 0.0216523 -0.2622916 12795 100
pH 3.2067104 0.6800719 0.48000 2.95000 3.20000 3.47000 6.13000 0.3854760 0.515000 0.2120777 0.0385950 0.0216523 1.6292482 12795 100
ResidualSugar 23.2877061 24.8777541 0.00000 3.60000 12.90000 38.50000 141.15000 16.3086000 34.900000 1.0682784 1.4741280 0.0216523 2.2538118 12795 100
Sulphates 0.8460883 0.6558266 0.00000 0.43000 0.59000 1.10000 4.24000 0.3261720 0.670000 0.7751279 1.6917630 0.0216523 3.2294753 12795 100
TARGET 3.0290739 1.9263682 0.00000 2.00000 3.00000 4.00000 8.00000 1.4826000 2.000000 0.6359595 -0.3263010 0.0216523 -0.8772457 12795 100
TotalSulfurDioxide 204.9099258 163.8645075 0.00000 100.00000 154.00000 263.00000 1057.00000 102.2994000 163.000000 0.7996904 1.6100847 0.0216523 3.0153943 12795 100
VolatileAcidity 0.6410856 0.5556141 0.00000 0.25000 0.41000 0.91000 3.68000 0.3261720 0.660000 0.8666770 1.6529782 0.0216523 3.0833040 12795 100

Split the Sample data Set

With our transformations complete, we can now add these into our cleaned_train dataframe and continue on to build our models. To better measure each model performance, we split our data into a training and testing data set. We will train using the first, then measure model performance again the testing hold out set.

## [1] "Number of Training Samples:  10238"
## [1] "Number of Testing Samples:  2557"

Build Models

Poisson Regression Model 1

In this first model, we include all available features: FixedAcidity, VolatileAcidity, CitricAcid, ResidualSugar, Chlorides, FreeSulfurDioxide, TotalSulfurDioxide, Density, pH, Sulphates, Alcohol, LabelAppeal, AcidIndex, STARS

## 
## Call:
## glm(formula = TARGET ~ ., family = poisson, data = trainingData)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2730  -0.6622  -0.0022   0.4590   3.4596  
## 
## Coefficients:
##                       Estimate  Std. Error z value             Pr(>|z|)    
## (Intercept)         0.88401786  0.22058337   4.008            0.0000613 ***
## FixedAcidity       -0.00019300  0.00116712  -0.165             0.868655    
## VolatileAcidity    -0.03800614  0.01053001  -3.609             0.000307 ***
## CitricAcid          0.00482117  0.00935375   0.515             0.606255    
## ResidualSugar       0.00008185  0.00022848   0.358             0.720174    
## Chlorides          -0.04430243  0.02467192  -1.796             0.072548 .  
## FreeSulfurDioxide   0.00006365  0.00005342   1.192             0.233451    
## TotalSulfurDioxide  0.00008298  0.00003462   2.397             0.016543 *  
## Density            -0.38653351  0.21575673  -1.792             0.073209 .  
## pH                 -0.00883178  0.00848151  -1.041             0.297737    
## Sulphates          -0.00464684  0.00870712  -0.534             0.593561    
## Alcohol             0.00307482  0.00153833   1.999             0.045630 *  
## LabelAppeal         0.15940496  0.00687464  23.187 < 0.0000000000000002 ***
## AcidIndex          -0.07667561  0.00511881 -14.979 < 0.0000000000000002 ***
## STARS1              0.75937762  0.02186553  34.729 < 0.0000000000000002 ***
## STARS2              1.08626089  0.02032575  53.443 < 0.0000000000000002 ***
## STARS3              1.20920722  0.02137683  56.566 < 0.0000000000000002 ***
## STARS4              1.32798791  0.02736105  48.536 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 18279  on 10237  degrees of freedom
## Residual deviance: 10978  on 10220  degrees of freedom
## AIC: 36572
## 
## Number of Fisher Scoring iterations: 6
##          RMSE      Rsquared           MAE           aic           bic 
##     2.5896243     0.5177046     2.2206236 36571.6286867 36701.8381949

Poisson Regression Model 2

In this Model 2, we only include the most predictive features : VolatileAcidity, TotalSulfurDioxide, Alcohol, LabelAppeal, AcidIndex, STARS

## 
## Call:
## glm(formula = TARGET ~ VolatileAcidity + TotalSulfurDioxide + 
##     Alcohol + LabelAppeal + AcidIndex + STARS, family = poisson, 
##     data = trainingData)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.3026  -0.6633  -0.0009   0.4594   3.4943  
## 
## Coefficients:
##                       Estimate  Std. Error z value             Pr(>|z|)    
## (Intercept)         0.46904729  0.04994358   9.392 < 0.0000000000000002 ***
## VolatileAcidity    -0.03830760  0.01052516  -3.640             0.000273 ***
## TotalSulfurDioxide  0.00008305  0.00003461   2.400             0.016413 *  
## Alcohol             0.00314054  0.00153790   2.042             0.041142 *  
## LabelAppeal         0.15938001  0.00686767  23.207 < 0.0000000000000002 ***
## AcidIndex          -0.07702508  0.00504447 -15.269 < 0.0000000000000002 ***
## STARS1              0.76093283  0.02185558  34.816 < 0.0000000000000002 ***
## STARS2              1.08797214  0.02031653  53.551 < 0.0000000000000002 ***
## STARS3              1.21122854  0.02136172  56.701 < 0.0000000000000002 ***
## STARS4              1.32905420  0.02735111  48.592 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 18279  on 10237  degrees of freedom
## Residual deviance: 10988  on 10228  degrees of freedom
## AIC: 36566
## 
## Number of Fisher Scoring iterations: 6
##          RMSE      Rsquared           MAE           aic           bic 
##     2.5897983     0.5170154     2.2208191 36565.5340896 36637.8727052

Negative Binomial Regression Model 3

Similar to Poisson Model 1, the predictors for the following model are: FixedAcidity, VolatileAcidity, CitricAcid, ResidualSugar, Chlorides, FreeSulfurDioxide, TotalSulfurDioxide, Density, pH, Sulphates, Alcohol, LabelAppeal, AcidIndex, STARS

## 
## Call:
## glm.nb(formula = TARGET ~ ., data = trainingData, init.theta = 40727.91209, 
##     link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2729  -0.6622  -0.0022   0.4590   3.4595  
## 
## Coefficients:
##                       Estimate  Std. Error z value             Pr(>|z|)    
## (Intercept)         0.88405271  0.22059333   4.008            0.0000613 ***
## FixedAcidity       -0.00019303  0.00116717  -0.165             0.868642    
## VolatileAcidity    -0.03800764  0.01053047  -3.609             0.000307 ***
## CitricAcid          0.00482133  0.00935418   0.515             0.606260    
## ResidualSugar       0.00008184  0.00022849   0.358             0.720193    
## Chlorides          -0.04430344  0.02467302  -1.796             0.072555 .  
## FreeSulfurDioxide   0.00006365  0.00005342   1.192             0.233443    
## TotalSulfurDioxide  0.00008298  0.00003462   2.397             0.016543 *  
## Density            -0.38654219  0.21576653  -1.791             0.073216 .  
## pH                 -0.00883271  0.00848189  -1.041             0.297708    
## Sulphates          -0.00464712  0.00870751  -0.534             0.593556    
## Alcohol             0.00307470  0.00153840   1.999             0.045648 *  
## LabelAppeal         0.15940399  0.00687495  23.186 < 0.0000000000000002 ***
## AcidIndex          -0.07667809  0.00511900 -14.979 < 0.0000000000000002 ***
## STARS1              0.75937649  0.02186599  34.729 < 0.0000000000000002 ***
## STARS2              1.08625971  0.02032621  53.441 < 0.0000000000000002 ***
## STARS3              1.20920687  0.02137741  56.565 < 0.0000000000000002 ***
## STARS4              1.32798842  0.02736226  48.534 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(40727.91) family taken to be 1)
## 
##     Null deviance: 18278  on 10237  degrees of freedom
## Residual deviance: 10978  on 10220  degrees of freedom
## AIC: 36574
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  40728 
##           Std. Err.:  38350 
## Warning while fitting theta: iteration limit reached 
## 
##  2 x log-likelihood:  -36535.96
##          RMSE      Rsquared           MAE           aic           bic 
##     2.5896242     0.5177046     2.2206234 36573.9628866 36711.4062564

Negative Binomial Regression Model 4

Similar to Poisson Model 2, the predictors for the following model are: VolatileAcidity, FreeSulfurDioxide, TotalSulfurDioxide, Alcohol, LabelAppeal, AcidIndex, STARS

## 
## Call:
## glm.nb(formula = TARGET ~ VolatileAcidity + FreeSulfurDioxide + 
##     TotalSulfurDioxide + Alcohol + LabelAppeal + AcidIndex + 
##     STARS, data = cleaned_train, init.theta = 40659.48234, link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.3090  -0.6536  -0.0031   0.4547   3.7836  
## 
## Coefficients:
##                       Estimate  Std. Error z value             Pr(>|z|)    
## (Intercept)         0.47732250  0.04491006  10.628 < 0.0000000000000002 ***
## VolatileAcidity    -0.03706654  0.00939412  -3.946            0.0000796 ***
## FreeSulfurDioxide   0.00005268  0.00004689   1.123              0.26127    
## TotalSulfurDioxide  0.00008837  0.00003104   2.847              0.00442 ** 
## Alcohol             0.00368970  0.00137441   2.685              0.00726 ** 
## LabelAppeal         0.15942451  0.00612676  26.021 < 0.0000000000000002 ***
## AcidIndex          -0.08028377  0.00449971 -17.842 < 0.0000000000000002 ***
## STARS1              0.77095559  0.01953002  39.475 < 0.0000000000000002 ***
## STARS2              1.09236374  0.01820285  60.011 < 0.0000000000000002 ***
## STARS3              1.21265049  0.01917736  63.233 < 0.0000000000000002 ***
## STARS4              1.32851855  0.02428483  54.706 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(40659.48) family taken to be 1)
## 
##     Null deviance: 22860  on 12794  degrees of freedom
## Residual deviance: 13683  on 12784  degrees of freedom
## AIC: 45650
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  40659 
##           Std. Err.:  34189 
## Warning while fitting theta: iteration limit reached 
## 
##  2 x log-likelihood:  -45626.25
##          RMSE      Rsquared           MAE           aic           bic 
##     2.5893189     0.5158105     2.2199834 45650.2463626 45739.7280796

Multiple Linear Regression Model 5

The predictors for the following model are: FixedAcidity, VolatileAcidity, CitricAcid, ResidualSugar, Chlorides, FreeSulfurDioxide, TotalSulfurDioxide, Density, pH, Sulphates, Alcohol, LabelAppeal, AcidIndex, STARS

## 
## Call:
## lm(formula = TARGET ~ ., data = trainingData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.8414 -0.8737  0.0236  0.8555  5.6826 
## 
## Coefficients:
##                        Estimate   Std. Error t value             Pr(>|t|)    
## (Intercept)         3.020068489  0.500891225   6.029         0.0000000017 ***
## FixedAcidity        0.000469154  0.002635703   0.178              0.85873    
## VolatileAcidity    -0.112649856  0.023435095  -4.807         0.0000015548 ***
## CitricAcid          0.019109205  0.021413858   0.892              0.37221    
## ResidualSugar       0.000007501  0.000521091   0.014              0.98851    
## Chlorides          -0.124471650  0.055517825  -2.242              0.02498 *  
## FreeSulfurDioxide   0.000204177  0.000121857   1.676              0.09386 .  
## TotalSulfurDioxide  0.000249243  0.000079087   3.152              0.00163 ** 
## Density            -1.156597986  0.492177579  -2.350              0.01879 *  
## pH                 -0.018547127  0.019254329  -0.963              0.33543    
## Sulphates          -0.010122588  0.019789678  -0.512              0.60901    
## Alcohol             0.010995114  0.003491237   3.149              0.00164 ** 
## LabelAppeal         0.465246728  0.015346756  30.316 < 0.0000000000000002 ***
## AcidIndex          -0.191380479  0.010214692 -18.736 < 0.0000000000000002 ***
## STARS1              1.350459388  0.037032187  36.467 < 0.0000000000000002 ***
## STARS2              2.395624269  0.035846190  66.831 < 0.0000000000000002 ***
## STARS3              2.974936849  0.041397001  71.864 < 0.0000000000000002 ***
## STARS4              3.688404529  0.067342857  54.771 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.311 on 10220 degrees of freedom
## Multiple R-squared:  0.537,  Adjusted R-squared:  0.5362 
## F-statistic: 697.3 on 17 and 10220 DF,  p-value: < 0.00000000000000022
##         RMSE     Rsquared          MAE          aic          bic 
##     1.309785     0.537013     1.031076 34617.903309 34755.346679

Multiple Linear Regression Model 6

For the final Linear Model, we leverage stepAIC on our Linear Model 5 to choose the most important features.

## 
## Call:
## lm(formula = TARGET ~ VolatileAcidity + Chlorides + FreeSulfurDioxide + 
##     TotalSulfurDioxide + Density + Alcohol + LabelAppeal + AcidIndex + 
##     STARS, data = trainingData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.8773 -0.8745  0.0206  0.8552  5.6964 
## 
## Coefficients:
##                       Estimate  Std. Error t value             Pr(>|t|)    
## (Intercept)         2.97476244  0.49647332   5.992        0.00000000215 ***
## VolatileAcidity    -0.11312314  0.02342445  -4.829        0.00000139018 ***
## Chlorides          -0.12555620  0.05549618  -2.262              0.02369 *  
## FreeSulfurDioxide   0.00020461  0.00012182   1.680              0.09307 .  
## TotalSulfurDioxide  0.00024969  0.00007906   3.158              0.00159 ** 
## Density            -1.17196189  0.49197148  -2.382              0.01723 *  
## Alcohol             0.01103529  0.00348994   3.162              0.00157 ** 
## LabelAppeal         0.46516486  0.01534023  30.323 < 0.0000000000000002 ***
## AcidIndex          -0.19022043  0.01002596 -18.973 < 0.0000000000000002 ***
## STARS1              1.35114025  0.03701656  36.501 < 0.0000000000000002 ***
## STARS2              2.39648117  0.03582750  66.889 < 0.0000000000000002 ***
## STARS3              2.97643406  0.04137313  71.941 < 0.0000000000000002 ***
## STARS4              3.68969165  0.06731839  54.810 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.311 on 10225 degrees of freedom
## Multiple R-squared:  0.5369, Adjusted R-squared:  0.5364 
## F-statistic:   988 on 12 and 10225 DF,  p-value: < 0.00000000000000022
##         RMSE     Rsquared          MAE          aic          bic 
##     1.309912     0.536923     1.030963 34609.892810 34711.166872

Select Models

RMSE Rsquared MAE aic bic
Poission_Evaluate1 2.589624 0.5177046 2.220624 36571.63 36701.84
Poission_Evaluate2 2.589798 0.5170154 2.220819 36565.53 36637.87
Negative_Binomial_eval3 2.589624 0.5177046 2.220623 36573.96 36711.41
Negative_Binomial_eval4 2.589319 0.5158105 2.219983 45650.25 45739.73
Linear_Regression_eval5 1.309785 0.5370130 1.031076 34617.90 34755.35
Linear_Regression_eval6 1.309913 0.5369230 1.030963 34609.89 34711.17

This table summarizes the RMSE, RSQUARED, MAE, AIC and BIC for all SIX models. The Linear regressions (Linear Model 5 and Linear Model 6) had the overall best performance based on RMSE and RSQUARED; as well as Linear Model 6 had the best performance based on AIC and BIC.

Finally, we chose Multiple Linear Regression Model 6 as our final model since it had a far lower AIC and BIC.

Prediction

##   FixedAcidity VolatileAcidity CitricAcid ResidualSugar Chlorides
## 1          5.4           0.860       0.27          10.7     0.092
## 2         12.4           0.385       0.76          19.7     1.169
## 3          7.2           1.750       0.17          33.0     0.065
## 4          6.2           0.100       1.80           1.0     0.179
## 5         11.4           0.210       0.28           1.2     0.038
## 6         17.6           0.040       1.15           1.4     0.535
##   FreeSulfurDioxide TotalSulfurDioxide Density   pH Sulphates Alcohol
## 1                23                398 0.98527 5.02      0.64   12.30
## 2                37                 68 0.99048 3.37      1.09   16.00
## 3                 9                 76 1.04641 4.61      0.68    8.55
## 4               104                 89 0.98877 3.20      2.11   12.30
## 5                70                 53 1.02899 2.54      0.07    4.80
## 6               250                140 0.95028 3.06      0.02   11.40
##   LabelAppeal AcidIndex STARS    TARGET
## 1           1         6     0 1.2748842
## 2           2         6     2 4.0102323
## 3           2         8     1 2.4171591
## 4           1         8     1 2.2559522
## 5           2        10     0 0.8489494
## 6           3         8     4 5.5647071

The histogram shows that our predictions have a similar shape to our training Target variable, the means and medians are almost identical, and the kurtosis values are close.

The predicted file is uploaded to Github: https://github.com/waheeb123/Data-621/blob/main/Homeworks/Homework%205/DATA621_HW5_Predictions.csv

Conclusions

The Linear Regression model that we chose as the best model has an adjusted R2 value of 0.53. All predictors and levels within each categorical predictor are significant at the 5% level. The model coefficients make intuitive sense. Sales of wine decrease with each increase in the AcidIndex value. Having a STARS rating results in more sales compared to no ratings, and higher STARS ratings are associated with higher sales. Finally, increases in alcohol content are associated with higher sales.The interpretation of the coefficients aligns with the insights gained during exploratory data analysis.