Introduction

Data is the key to making informed decisions and achieving success in modern business. We’ll begin our analysis by examining the dataset for outliers, missing data, potential encoding errors, multicollinearity etc., then, we’ll implement any required data cleaning procedures. Once we’ve prepared a reliable dataset, we’ll construct and assess three distinct linear models to forecast sales. Our dataset comprises both training and evaluation data; we’ll train the models using the primary training dataset and then assess their performance against the separate evaluation dataset. Finally, we’ll choose a best model that strikes the optimal balance between accuracy and simplicity.

Data Exploration

## tibble [12,795 × 15] (S3: tbl_df/tbl/data.frame)
##  $ TARGET            : int [1:12795] 3 3 5 3 4 0 0 4 3 6 ...
##  $ FixedAcidity      : num [1:12795] 3.2 4.5 7.1 5.7 8 11.3 7.7 6.5 14.8 5.5 ...
##  $ VolatileAcidity   : num [1:12795] 1.16 0.16 2.64 0.385 0.33 0.32 0.29 -1.22 0.27 -0.22 ...
##  $ CitricAcid        : num [1:12795] -0.98 -0.81 -0.88 0.04 -1.26 0.59 -0.4 0.34 1.05 0.39 ...
##  $ ResidualSugar     : num [1:12795] 54.2 26.1 14.8 18.8 9.4 ...
##  $ Chlorides         : num [1:12795] -0.567 -0.425 0.037 -0.425 NA 0.556 0.06 0.04 -0.007 -0.277 ...
##  $ FreeSulfurDioxide : num [1:12795] NA 15 214 22 -167 -37 287 523 -213 62 ...
##  $ TotalSulfurDioxide: num [1:12795] 268 -327 142 115 108 15 156 551 NA 180 ...
##  $ Density           : num [1:12795] 0.993 1.028 0.995 0.996 0.995 ...
##  $ pH                : num [1:12795] 3.33 3.38 3.12 2.24 3.12 3.2 3.49 3.2 4.93 3.09 ...
##  $ Sulphates         : num [1:12795] -0.59 0.7 0.48 1.83 1.77 1.29 1.21 NA 0.26 0.75 ...
##  $ Alcohol           : num [1:12795] 9.9 NA 22 6.2 13.7 15.4 10.3 11.6 15 12.6 ...
##  $ LabelAppeal       : int [1:12795] 0 -1 -1 -1 0 0 0 1 0 0 ...
##  $ AcidIndex         : int [1:12795] 8 7 8 6 9 11 8 7 6 8 ...
##  $ STARS             : int [1:12795] 2 3 3 1 2 NA NA 3 NA 4 ...
## tibble [3,335 × 15] (S3: tbl_df/tbl/data.frame)
##  $ TARGET            : logi [1:3335] NA NA NA NA NA NA ...
##  $ FixedAcidity      : num [1:3335] 5.4 12.4 7.2 6.2 11.4 17.6 15.5 15.9 11.6 3.8 ...
##  $ VolatileAcidity   : num [1:3335] -0.86 0.385 1.75 0.1 0.21 0.04 0.53 1.19 0.32 0.22 ...
##  $ CitricAcid        : num [1:3335] 0.27 -0.76 0.17 1.8 0.28 -1.15 -0.53 1.14 0.55 0.31 ...
##  $ ResidualSugar     : num [1:3335] -10.7 -19.7 -33 1 1.2 1.4 4.6 31.9 -50.9 -7.7 ...
##  $ Chlorides         : num [1:3335] 0.092 1.169 0.065 -0.179 0.038 ...
##  $ FreeSulfurDioxide : num [1:3335] 23 -37 9 104 70 -250 10 115 35 40 ...
##  $ TotalSulfurDioxide: num [1:3335] 398 68 76 89 53 140 17 381 83 129 ...
##  $ Density           : num [1:3335] 0.985 0.99 1.046 0.989 1.029 ...
##  $ pH                : num [1:3335] 5.02 3.37 4.61 3.2 2.54 3.06 3.07 2.99 3.32 4.72 ...
##  $ Sulphates         : num [1:3335] 0.64 1.09 0.68 2.11 -0.07 -0.02 0.75 0.31 2.18 -0.64 ...
##  $ Alcohol           : num [1:3335] 12.3 16 8.55 12.3 4.8 11.4 8.5 11.4 -0.5 10.9 ...
##  $ LabelAppeal       : int [1:3335] -1 0 0 -1 0 1 0 1 0 0 ...
##  $ AcidIndex         : int [1:3335] 6 6 8 8 10 8 12 7 12 7 ...
##  $ STARS             : int [1:3335] NA 2 1 1 NA 4 3 NA NA NA ...

The training data set has 12,795 rows and 15 columns: 14 features and 1 response variable, TARGET. The variable INDEX is used for observed identification. Twelve of the features describe the chemical properties of wine. The remaining two predictors are rating variables: LabelAppeal refers to the perceived attractiveness of a wine’s product label, while STARS is an assessment of wine quality. The output variable, TARGET, is a count measure indicating the number of wine case purchases by distributors.

Summary Statistics
Variable N Mean Std. Dev. Min Pctl. 25 Pctl. 75 Max
TARGET 12795 3 1.9 0 2 4 8
FixedAcidity 12795 7.1 6.3 -18 5.2 9.5 34
VolatileAcidity 12795 0.32 0.78 -2.8 0.13 0.64 3.7
CitricAcid 12795 0.31 0.86 -3.2 0.03 0.58 3.9
ResidualSugar 12179 5.4 34 -128 -2 16 141
Chlorides 12157 0.055 0.32 -1.2 -0.031 0.15 1.4
FreeSulfurDioxide 12148 31 149 -555 0 70 623
TotalSulfurDioxide 12113 121 232 -823 27 208 1057
Density 12795 0.99 0.027 0.89 0.99 1 1.1
pH 12400 3.2 0.68 0.48 3 3.5 6.1
Sulphates 11585 0.53 0.93 -3.1 0.28 0.86 4.2
Alcohol 12142 10 3.7 -4.7 9 12 26
LabelAppeal 12795 -0.0091 0.89 -2 -1 1 2
AcidIndex 12795 7.8 1.3 4 7 8 17
STARS 9436 2 0.9 1 1 3 4

Predictor Variables

Of the 14 feature columns, 8 of them such as ResidualSugar, Chlorides, FreeSulfurDioxide, TotalSulfurDioxide, pH, Sulphates, Alcohol, and STARS variables contain several missing values. LabelAppeal, AcidIndex, and STARS are discrete variables (i.e categorical) and the rest are continuous. We also noticed that several numerical features representing chemical quantities in the wine exhibit negative minimum values. We hypothesize that the original chemical measurements might have been normalized (potentially via a log transform), allowing for negative values. However, from a physical standpoint, negative concentrations shouldn’t be possible. Despite this, we’ve opted to retain these values as they are.

Response Variables

We see that the response variable, TARGET value is always between 0 and 8, which makes sense as this is the “Number of Cases of Wine Sold”. In addition, the distribution of wine cases sold, given at least one sale, exhibits symmetry and approximates normality. The target variable, number of cases, is shown below. The data shows a large number of zero values.

Variables Distributions

We observe that continuous variables exhibit a somewhat normal steep distribution. However, variables such as AcidIndex and STARS display right skewness.

In the box plot, there are not many outliers in the variables. However, TotalSulfurDioxide, FreeSulfurDioxide, and ResidualSugar variables have large ranges compared to other variables.We can tell a high number of variables have numerous outliers.

Relationship Between Categorical and Response Variables

The bar charts compare the three discrete categorical variables to the TARGET variable. AcidIndex shows large quantity of wine were sold with the index number 7 and 8. LabelAppeal shows us generic label does yield higher number of wine samples per order. Lastly, STARS shows high star wine bottles have high price tags. For each of these predictors, there appears to be a significant relationship between the ordered levels and the number of wine cases sold.

Multicolinearity

x
TARGET 1.0000000
FixedAcidity -0.0125381
VolatileAcidity -0.0759979
CitricAcid 0.0023450
ResidualSugar 0.0035196
Chlorides -0.0304301
FreeSulfurDioxide 0.0226398
TotalSulfurDioxide 0.0216021
Density -0.0475989
pH 0.0002199
Sulphates -0.0212204
Alcohol 0.0737771
LabelAppeal 0.4979465
AcidIndex -0.1676431
STARS 0.5546857

In the correlation table, we can see that STARS and LabelAppeal are most positively correlated variables with the response variable. Also, we some mild negative correlation between the response variable and AcidIndex variable.

Data Preparation

Negative Values

The data has some wine quality measures that are negative that should not be. We will simply take the absolute value of these for now. The alternative would be to center by adding the min of each variable. Since we are given little information about the source of this dataset, and why these quality measures are so off, it is difficult to ascertain the best overall approach. For LabelAppeal, we will add the min.

Missing Data

According to the graph, the data set has multiple variables with missing variables. The STARS variable has the most NA values. The Sulphates variable records missing values in roughly 10% of observations, while the remaining six predictors have missing values ranging from 3% to 5%. These missing values will be imputed later on during the data preparation using the MICE package and random forest prediction method.

We can see that each of the remaining variables with missing values seem to be MAR, as the mice imputation distributions roughly match the existing. We’ll also run the mice imputation again on both the train and test set. Instead of using it for our models, however, we’ll simplify our run and fill in our data. Finally, after our analysis, we can use it in in our model, we’ll update STARS to become a factor variable.

Descriptive Summaries and Correlation Review

Mean Std.Dev Min Q1 Median Q3 Max MAD IQR CV Skewness SE.Skewness Kurtosis N.Valid Pct.Valid
AcidIndex 7.7727237 1.3239264 4.00000 7.00000 8.00000 8.00000 17.00000 1.4826000 1.000000 0.1703298 1.6484959 0.0216523 5.1900925 12795 100
Alcohol 10.4970736 3.7271443 -4.70000 9.00000 10.40000 12.40000 26.50000 2.4462900 3.400000 0.3550651 -0.0203077 0.0216523 1.5448710 12795 100
Chlorides 0.2227006 0.2340474 0.00000 0.04600 0.09900 0.36900 1.35100 0.1008168 0.323000 1.0509511 1.4732261 0.0216523 2.1533220 12795 100
CitricAcid 0.6863150 0.6060052 0.00000 0.28000 0.44000 0.97000 3.86000 0.3261720 0.690000 0.8829841 1.6428101 0.0216523 2.9474433 12795 100
Density 0.9942027 0.0265376 0.88809 0.98772 0.99449 1.00052 1.09924 0.0093552 0.012795 0.0266924 -0.0186938 0.0216523 1.8999592 12795 100
FixedAcidity 8.0632513 4.9961186 0.00000 5.60000 7.00000 9.80000 34.40000 2.9652000 4.200000 0.6196159 1.1742806 0.0216523 1.9666235 12795 100
FreeSulfurDioxide 106.8281360 108.4394316 0.00000 28.00000 56.00000 172.00000 623.00000 60.7866000 144.000000 1.0150831 1.5281310 0.0216523 2.4394290 12795 100
LabelAppeal 1.9909340 0.8910892 0.00000 1.00000 2.00000 3.00000 4.00000 1.4826000 2.000000 0.4475735 0.0084295 0.0216523 -0.2622916 12795 100
pH 3.2069050 0.6796027 0.48000 2.95000 3.20000 3.47000 6.13000 0.3854760 0.520000 0.2119186 0.0466170 0.0216523 1.6349942 12795 100
ResidualSugar 23.4240953 24.9813809 0.00000 3.60000 12.90000 38.80000 141.15000 16.3086000 35.200000 1.0664822 1.4657347 0.0216523 2.2249837 12795 100
Sulphates 0.8446760 0.6542444 0.00000 0.43000 0.59000 1.09000 4.24000 0.3261720 0.660000 0.7745507 1.6908843 0.0216523 3.1944468 12795 100
TARGET 3.0290739 1.9263682 0.00000 2.00000 3.00000 4.00000 8.00000 1.4826000 2.000000 0.6359595 -0.3263010 0.0216523 -0.8772457 12795 100
TotalSulfurDioxide 203.8295819 162.6289281 0.00000 100.00000 154.00000 261.00000 1057.00000 102.2994000 161.000000 0.7978672 1.6287104 0.0216523 3.1504058 12795 100
VolatileAcidity 0.6410856 0.5556141 0.00000 0.25000 0.41000 0.91000 3.68000 0.3261720 0.660000 0.8666770 1.6529782 0.0216523 3.0833040 12795 100

Split the Sample data Set

With our transformations complete, we can now add these into our cleaned_train dataframe and continue on to build our models. To better measure each model performance, we split our data into a training and testing data set. We will train using the first, then measure model performance again the testing hold out set.

## [1] "Number of Training Samples:  10238"
## [1] "Number of Testing Samples:  2557"

Build Models

Poisson Regression Model 1

In this first model, we include all available features: FixedAcidity, VolatileAcidity, CitricAcid, ResidualSugar, Chlorides, FreeSulfurDioxide, TotalSulfurDioxide, Density, pH, Sulphates, Alcohol, LabelAppeal, AcidIndex, STARS

## 
## Call:
## glm(formula = TARGET ~ ., family = poisson, data = trainingData)
## 
## Coefficients:
##                         Estimate    Std. Error z value             Pr(>|z|)    
## (Intercept)         0.7892257665  0.2181427186   3.618             0.000297 ***
## FixedAcidity       -0.0000909220  0.0011721364  -0.078             0.938171    
## VolatileAcidity    -0.0338187850  0.0106085477  -3.188             0.001433 ** 
## CitricAcid          0.0132612453  0.0092961426   1.427             0.153715    
## ResidualSugar       0.0000007492  0.0002263610   0.003             0.997359    
## Chlorides          -0.0110526480  0.0244656468  -0.452             0.651440    
## FreeSulfurDioxide   0.0000887012  0.0000517645   1.714             0.086611 .  
## TotalSulfurDioxide  0.0000829237  0.0000347822   2.384             0.017122 *  
## Density            -0.2817475213  0.2128419603  -1.324             0.185589    
## pH                 -0.0188391082  0.0084099701  -2.240             0.025085 *  
## Sulphates          -0.0145419498  0.0088317235  -1.647             0.099649 .  
## Alcohol             0.0048269578  0.0015330983   3.148             0.001641 ** 
## LabelAppeal         0.1544563349  0.0068400657  22.581 < 0.0000000000000002 ***
## AcidIndex          -0.0772590165  0.0051544902 -14.989 < 0.0000000000000002 ***
## STARS1              0.7771273256  0.0218970731  35.490 < 0.0000000000000002 ***
## STARS2              1.0984642747  0.0204361382  53.751 < 0.0000000000000002 ***
## STARS3              1.2181307614  0.0215291274  56.581 < 0.0000000000000002 ***
## STARS4              1.3319344540  0.0272443113  48.889 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 18214  on 10237  degrees of freedom
## Residual deviance: 10883  on 10220  degrees of freedom
## AIC: 36525
## 
## Number of Fisher Scoring iterations: 6
##          RMSE      Rsquared           MAE           aic           bic 
##     2.5918810     0.5189506     2.2210284 36524.6844093 36654.8939175

Poisson Regression Model 2

In this Model 2, we only include the most predictive features : VolatileAcidity, TotalSulfurDioxide, Alcohol, LabelAppeal, AcidIndex, STARS

## 
## Call:
## glm(formula = TARGET ~ VolatileAcidity + TotalSulfurDioxide + 
##     Alcohol + LabelAppeal + AcidIndex + STARS, family = poisson, 
##     data = trainingData)
## 
## Coefficients:
##                       Estimate  Std. Error z value             Pr(>|z|)    
## (Intercept)         0.44778794  0.04993620   8.967 < 0.0000000000000002 ***
## VolatileAcidity    -0.03437320  0.01060849  -3.240              0.00119 ** 
## TotalSulfurDioxide  0.00008357  0.00003477   2.404              0.01623 *  
## Alcohol             0.00484363  0.00153248   3.161              0.00157 ** 
## LabelAppeal         0.15438417  0.00683771  22.578 < 0.0000000000000002 ***
## AcidIndex          -0.07690721  0.00507587 -15.152 < 0.0000000000000002 ***
## STARS1              0.77900265  0.02188616  35.593 < 0.0000000000000002 ***
## STARS2              1.10114091  0.02042280  53.917 < 0.0000000000000002 ***
## STARS3              1.22080022  0.02151477  56.742 < 0.0000000000000002 ***
## STARS4              1.33463103  0.02722954  49.014 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 18214  on 10237  degrees of freedom
## Residual deviance: 10898  on 10228  degrees of freedom
## AIC: 36523
## 
## Number of Fisher Scoring iterations: 6
##          RMSE      Rsquared           MAE           aic           bic 
##     2.5921425     0.5179013     2.2215012 36523.3146236 36595.6532392

Negative Binomial Regression Model 3

Similar to Poisson Model 1, the predictors for the following model are: FixedAcidity, VolatileAcidity, CitricAcid, ResidualSugar, Chlorides, FreeSulfurDioxide, TotalSulfurDioxide, Density, pH, Sulphates, Alcohol, LabelAppeal, AcidIndex, STARS

## 
## Call:
## glm.nb(formula = TARGET ~ ., data = trainingData, init.theta = 40596.4241, 
##     link = log)
## 
## Coefficients:
##                         Estimate    Std. Error z value             Pr(>|z|)    
## (Intercept)         0.7892550644  0.2181525397   3.618             0.000297 ***
## FixedAcidity       -0.0000909421  0.0011721890  -0.078             0.938160    
## VolatileAcidity    -0.0338199312  0.0106090117  -3.188             0.001433 ** 
## CitricAcid          0.0132615966  0.0092965767   1.427             0.153723    
## ResidualSugar       0.0000007457  0.0002263713   0.003             0.997372    
## Chlorides          -0.0110528191  0.0244667637  -0.452             0.651450    
## FreeSulfurDioxide   0.0000887052  0.0000517669   1.714             0.086611 .  
## TotalSulfurDioxide  0.0000829281  0.0000347838   2.384             0.017121 *  
## Density            -0.2817517962  0.2128516115  -1.324             0.185603    
## pH                 -0.0188403506  0.0084103528  -2.240             0.025082 *  
## Sulphates          -0.0145424359  0.0088321177  -1.647             0.099653 .  
## Alcohol             0.0048269046  0.0015331678   3.148             0.001642 ** 
## LabelAppeal         0.1544552702  0.0068403734  22.580 < 0.0000000000000002 ***
## AcidIndex          -0.0772613586  0.0051546889 -14.989 < 0.0000000000000002 ***
## STARS1              0.7771263073  0.0218975340  35.489 < 0.0000000000000002 ***
## STARS2              1.0984632972  0.0204365988  53.750 < 0.0000000000000002 ***
## STARS3              1.2181305180  0.0215297110  56.579 < 0.0000000000000002 ***
## STARS4              1.3319351604  0.0272454888  48.886 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(40596.42) family taken to be 1)
## 
##     Null deviance: 18213  on 10237  degrees of freedom
## Residual deviance: 10882  on 10220  degrees of freedom
## AIC: 36527
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  40596 
##           Std. Err.:  37984 
## Warning while fitting theta: iteration limit reached 
## 
##  2 x log-likelihood:  -36489.02
##          RMSE      Rsquared           MAE           aic           bic 
##     2.5918809     0.5189507     2.2210281 36527.0228900 36664.4662598

Negative Binomial Regression Model 4

Similar to Poisson Model 2, the predictors for the following model are: VolatileAcidity, FreeSulfurDioxide, TotalSulfurDioxide, Alcohol, LabelAppeal, AcidIndex, STARS

## 
## Call:
## glm.nb(formula = TARGET ~ VolatileAcidity + FreeSulfurDioxide + 
##     TotalSulfurDioxide + Alcohol + LabelAppeal + AcidIndex + 
##     STARS, data = cleaned_train, init.theta = 40655.73652, link = log)
## 
## Coefficients:
##                       Estimate  Std. Error z value             Pr(>|z|)    
## (Intercept)         0.47792499  0.04482061  10.663 < 0.0000000000000002 ***
## VolatileAcidity    -0.03707089  0.00939632  -3.945            0.0000797 ***
## FreeSulfurDioxide   0.00005855  0.00004663   1.256              0.20927    
## TotalSulfurDioxide  0.00008273  0.00003122   2.650              0.00805 ** 
## Alcohol             0.00375994  0.00137412   2.736              0.00621 ** 
## LabelAppeal         0.15910515  0.00612466  25.978 < 0.0000000000000002 ***
## AcidIndex          -0.08030228  0.00449884 -17.850 < 0.0000000000000002 ***
## STARS1              0.77091345  0.01952986  39.474 < 0.0000000000000002 ***
## STARS2              1.09217216  0.01820457  59.994 < 0.0000000000000002 ***
## STARS3              1.21280507  0.01917562  63.247 < 0.0000000000000002 ***
## STARS4              1.32891964  0.02427990  54.733 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(40655.74) family taken to be 1)
## 
##     Null deviance: 22860  on 12794  degrees of freedom
## Residual deviance: 13684  on 12784  degrees of freedom
## AIC: 45651
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  40656 
##           Std. Err.:  34184 
## Warning while fitting theta: iteration limit reached 
## 
##  2 x log-likelihood:  -45626.74
##          RMSE      Rsquared           MAE           aic           bic 
##     2.5916959     0.5191463     2.2224820 45650.7381419 45740.2198589

Multiple Linear Regression Model 5

The predictors for the following model are: FixedAcidity, VolatileAcidity, CitricAcid, ResidualSugar, Chlorides, FreeSulfurDioxide, TotalSulfurDioxide, Density, pH, Sulphates, Alcohol, LabelAppeal, AcidIndex, STARS

## 
## Call:
## lm(formula = TARGET ~ ., data = trainingData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.7500 -0.8682  0.0239  0.8519  6.2500 
## 
## Coefficients:
##                       Estimate  Std. Error t value             Pr(>|t|)    
## (Intercept)         2.76395948  0.49318708   5.604         0.0000000215 ***
## FixedAcidity        0.00046808  0.00263790   0.177              0.85916    
## VolatileAcidity    -0.10072430  0.02355649  -4.276         0.0000192128 ***
## CitricAcid          0.04269888  0.02142398   1.993              0.04628 *  
## ResidualSugar      -0.00023949  0.00051519  -0.465              0.64204    
## Chlorides          -0.02668630  0.05551780  -0.481              0.63075    
## FreeSulfurDioxide   0.00027371  0.00011899   2.300              0.02145 *  
## TotalSulfurDioxide  0.00025870  0.00007956   3.251              0.00115 ** 
## Density            -0.82949711  0.48373374  -1.715              0.08642 .  
## pH                 -0.05238229  0.01912540  -2.739              0.00618 ** 
## Sulphates          -0.03737697  0.01981556  -1.886              0.05929 .  
## Alcohol             0.01574741  0.00347893   4.527         0.0000060642 ***
## LabelAppeal         0.45188116  0.01526953  29.594 < 0.0000000000000002 ***
## AcidIndex          -0.19406714  0.01029532 -18.850 < 0.0000000000000002 ***
## STARS1              1.38165553  0.03688755  37.456 < 0.0000000000000002 ***
## STARS2              2.41499920  0.03585743  67.350 < 0.0000000000000002 ***
## STARS3              2.98625589  0.04153429  71.899 < 0.0000000000000002 ***
## STARS4              3.66886490  0.06626609  55.366 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.309 on 10220 degrees of freedom
## Multiple R-squared:  0.5386, Adjusted R-squared:  0.5378 
## F-statistic: 701.6 on 17 and 10220 DF,  p-value: < 0.00000000000000022
##          RMSE      Rsquared           MAE           aic           bic 
##     1.3075630     0.5385528     1.0276638 34583.1334169 34720.5767867

Multiple Linear Regression Model 6

For the final Linear Model, we leverage stepAIC on our Linear Model 5 to choose the most important features.

## 
## Call:
## lm(formula = TARGET ~ VolatileAcidity + CitricAcid + FreeSulfurDioxide + 
##     TotalSulfurDioxide + Density + pH + Sulphates + Alcohol + 
##     LabelAppeal + AcidIndex + STARS, data = trainingData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.7399 -0.8674  0.0238  0.8526  6.2554 
## 
## Coefficients:
##                       Estimate  Std. Error t value             Pr(>|t|)    
## (Intercept)         2.75688975  0.49276487   5.595         0.0000000227 ***
## VolatileAcidity    -0.10084579  0.02355283  -4.282         0.0000187177 ***
## CitricAcid          0.04273660  0.02141695   1.995              0.04602 *  
## FreeSulfurDioxide   0.00027420  0.00011897   2.305              0.02120 *  
## TotalSulfurDioxide  0.00025796  0.00007954   3.243              0.00119 ** 
## Density            -0.83195073  0.48361572  -1.720              0.08541 .  
## pH                 -0.05240947  0.01912021  -2.741              0.00613 ** 
## Sulphates          -0.03743763  0.01980990  -1.890              0.05881 .  
## Alcohol             0.01575533  0.00347847   4.529         0.0000059825 ***
## LabelAppeal         0.45192875  0.01526602  29.604 < 0.0000000000000002 ***
## AcidIndex          -0.19386468  0.01013673 -19.125 < 0.0000000000000002 ***
## STARS1              1.38228824  0.03686744  37.493 < 0.0000000000000002 ***
## STARS2              2.41529060  0.03584458  67.382 < 0.0000000000000002 ***
## STARS3              2.98644267  0.04152478  71.920 < 0.0000000000000002 ***
## STARS4              3.66914480  0.06625020  55.383 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.309 on 10223 degrees of freedom
## Multiple R-squared:  0.5385, Adjusted R-squared:  0.5379 
## F-statistic: 852.2 on 14 and 10223 DF,  p-value: < 0.00000000000000022
##          RMSE      Rsquared           MAE           aic           bic 
##     1.3075933     0.5385314     1.0277302 34577.6091408 34693.3509259

Select Models

RMSE Rsquared MAE aic bic
Poission_Evaluate1 2.591881 0.5189506 2.221028 36524.68 36654.89
Poission_Evaluate2 2.592143 0.5179013 2.221501 36523.31 36595.65
Negative_Binomial_eval3 2.591881 0.5189507 2.221028 36527.02 36664.47
Negative_Binomial_eval4 2.591696 0.5191463 2.222482 45650.74 45740.22
Linear_Regression_eval5 1.307563 0.5385528 1.027664 34583.13 34720.58
Linear_Regression_eval6 1.307593 0.5385314 1.027730 34577.61 34693.35

This table summarizes the RMSE, RSQUARED, MAE, AIC and BIC for all SIX models. The Linear regressions (Linear Model 5 and Linear Model 6) had the overall best performance based on RMSE and RSQUARED; as well as Linear Model 6 had the best performance based on AIC and BIC.

Finally, we chose Multiple Linear Regression Model 6 as our final model since it had a far lower AIC and BIC.

Prediction

##   FixedAcidity VolatileAcidity CitricAcid ResidualSugar Chlorides
## 1          5.4           0.860       0.27          10.7     0.092
## 2         12.4           0.385       0.76          19.7     1.169
## 3          7.2           1.750       0.17          33.0     0.065
## 4          6.2           0.100       1.80           1.0     0.179
## 5         11.4           0.210       0.28           1.2     0.038
## 6         17.6           0.040       1.15           1.4     0.535
##   FreeSulfurDioxide TotalSulfurDioxide Density   pH Sulphates Alcohol
## 1                23                398 0.98527 5.02      0.64   12.30
## 2                37                 68 0.99048 3.37      1.09   16.00
## 3                 9                 76 1.04641 4.61      0.68    8.55
## 4               104                 89 0.98877 3.20      2.11   12.30
## 5                70                 53 1.02899 2.54      0.07    4.80
## 6               250                140 0.95028 3.06      0.02   11.40
##   LabelAppeal AcidIndex STARS    TARGET
## 1           1         6     0 1.1664538
## 2           2         6     2 4.1448181
## 3           2         8     1 2.3420568
## 4           1         8     1 2.2829844
## 5           2        10     0 0.8295706
## 6           3         8     4 5.6085833

The histogram shows that our predictions have a similar shape to our training Target variable, the means and medians are almost identical, and the kurtosis values are close.

The predicted file is uploaded to Github: https://github.com/waheeb123/Data-621/blob/main/Homeworks/Homework%205/DATA621_HW5_Predictions.csv

Conclusions

The Linear Regression model that we chose as the best model has an adjusted R2 value of 0.53. All predictors and levels within each categorical predictor are significant at the 5% level. The model coefficients make intuitive sense. Sales of wine decrease with each increase in the AcidIndex value. Having a STARS rating results in more sales compared to no ratings, and higher STARS ratings are associated with higher sales. Finally, increases in alcohol content are associated with higher sales.The interpretation of the coefficients aligns with the insights gained during exploratory data analysis.