Data is the key to making informed decisions and achieving success in modern business. We’ll begin our analysis by examining the dataset for outliers, missing data, potential encoding errors, multicollinearity etc., then, we’ll implement any required data cleaning procedures. Once we’ve prepared a reliable dataset, we’ll construct and assess three distinct linear models to forecast sales. Our dataset comprises both training and evaluation data; we’ll train the models using the primary training dataset and then assess their performance against the separate evaluation dataset. Finally, we’ll choose a best model that strikes the optimal balance between accuracy and simplicity.
## tibble [12,795 × 15] (S3: tbl_df/tbl/data.frame)
## $ TARGET : int [1:12795] 3 3 5 3 4 0 0 4 3 6 ...
## $ FixedAcidity : num [1:12795] 3.2 4.5 7.1 5.7 8 11.3 7.7 6.5 14.8 5.5 ...
## $ VolatileAcidity : num [1:12795] 1.16 0.16 2.64 0.385 0.33 0.32 0.29 -1.22 0.27 -0.22 ...
## $ CitricAcid : num [1:12795] -0.98 -0.81 -0.88 0.04 -1.26 0.59 -0.4 0.34 1.05 0.39 ...
## $ ResidualSugar : num [1:12795] 54.2 26.1 14.8 18.8 9.4 ...
## $ Chlorides : num [1:12795] -0.567 -0.425 0.037 -0.425 NA 0.556 0.06 0.04 -0.007 -0.277 ...
## $ FreeSulfurDioxide : num [1:12795] NA 15 214 22 -167 -37 287 523 -213 62 ...
## $ TotalSulfurDioxide: num [1:12795] 268 -327 142 115 108 15 156 551 NA 180 ...
## $ Density : num [1:12795] 0.993 1.028 0.995 0.996 0.995 ...
## $ pH : num [1:12795] 3.33 3.38 3.12 2.24 3.12 3.2 3.49 3.2 4.93 3.09 ...
## $ Sulphates : num [1:12795] -0.59 0.7 0.48 1.83 1.77 1.29 1.21 NA 0.26 0.75 ...
## $ Alcohol : num [1:12795] 9.9 NA 22 6.2 13.7 15.4 10.3 11.6 15 12.6 ...
## $ LabelAppeal : int [1:12795] 0 -1 -1 -1 0 0 0 1 0 0 ...
## $ AcidIndex : int [1:12795] 8 7 8 6 9 11 8 7 6 8 ...
## $ STARS : int [1:12795] 2 3 3 1 2 NA NA 3 NA 4 ...
## tibble [3,335 × 15] (S3: tbl_df/tbl/data.frame)
## $ TARGET : logi [1:3335] NA NA NA NA NA NA ...
## $ FixedAcidity : num [1:3335] 5.4 12.4 7.2 6.2 11.4 17.6 15.5 15.9 11.6 3.8 ...
## $ VolatileAcidity : num [1:3335] -0.86 0.385 1.75 0.1 0.21 0.04 0.53 1.19 0.32 0.22 ...
## $ CitricAcid : num [1:3335] 0.27 -0.76 0.17 1.8 0.28 -1.15 -0.53 1.14 0.55 0.31 ...
## $ ResidualSugar : num [1:3335] -10.7 -19.7 -33 1 1.2 1.4 4.6 31.9 -50.9 -7.7 ...
## $ Chlorides : num [1:3335] 0.092 1.169 0.065 -0.179 0.038 ...
## $ FreeSulfurDioxide : num [1:3335] 23 -37 9 104 70 -250 10 115 35 40 ...
## $ TotalSulfurDioxide: num [1:3335] 398 68 76 89 53 140 17 381 83 129 ...
## $ Density : num [1:3335] 0.985 0.99 1.046 0.989 1.029 ...
## $ pH : num [1:3335] 5.02 3.37 4.61 3.2 2.54 3.06 3.07 2.99 3.32 4.72 ...
## $ Sulphates : num [1:3335] 0.64 1.09 0.68 2.11 -0.07 -0.02 0.75 0.31 2.18 -0.64 ...
## $ Alcohol : num [1:3335] 12.3 16 8.55 12.3 4.8 11.4 8.5 11.4 -0.5 10.9 ...
## $ LabelAppeal : int [1:3335] -1 0 0 -1 0 1 0 1 0 0 ...
## $ AcidIndex : int [1:3335] 6 6 8 8 10 8 12 7 12 7 ...
## $ STARS : int [1:3335] NA 2 1 1 NA 4 3 NA NA NA ...
The training data set has 12,795 rows and 15 columns: 14 features and
1 response variable, TARGET. The variable INDEX is used for
observed identification. Twelve of the features describe the chemical
properties of wine. The remaining two predictors are rating variables:
LabelAppeal refers to the perceived attractiveness of a wine’s product
label, while STARS is an assessment of wine quality. The
output variable, TARGET, is a count measure indicating the
number of wine case purchases by distributors.
| Variable | N | Mean | Std. Dev. | Min | Pctl. 25 | Pctl. 75 | Max |
|---|---|---|---|---|---|---|---|
| TARGET | 12795 | 3 | 1.9 | 0 | 2 | 4 | 8 |
| FixedAcidity | 12795 | 7.1 | 6.3 | -18 | 5.2 | 9.5 | 34 |
| VolatileAcidity | 12795 | 0.32 | 0.78 | -2.8 | 0.13 | 0.64 | 3.7 |
| CitricAcid | 12795 | 0.31 | 0.86 | -3.2 | 0.03 | 0.58 | 3.9 |
| ResidualSugar | 12179 | 5.4 | 34 | -128 | -2 | 16 | 141 |
| Chlorides | 12157 | 0.055 | 0.32 | -1.2 | -0.031 | 0.15 | 1.4 |
| FreeSulfurDioxide | 12148 | 31 | 149 | -555 | 0 | 70 | 623 |
| TotalSulfurDioxide | 12113 | 121 | 232 | -823 | 27 | 208 | 1057 |
| Density | 12795 | 0.99 | 0.027 | 0.89 | 0.99 | 1 | 1.1 |
| pH | 12400 | 3.2 | 0.68 | 0.48 | 3 | 3.5 | 6.1 |
| Sulphates | 11585 | 0.53 | 0.93 | -3.1 | 0.28 | 0.86 | 4.2 |
| Alcohol | 12142 | 10 | 3.7 | -4.7 | 9 | 12 | 26 |
| LabelAppeal | 12795 | -0.0091 | 0.89 | -2 | -1 | 1 | 2 |
| AcidIndex | 12795 | 7.8 | 1.3 | 4 | 7 | 8 | 17 |
| STARS | 9436 | 2 | 0.9 | 1 | 1 | 3 | 4 |
Of the 14 feature columns, 8 of them such as
ResidualSugar, Chlorides,
FreeSulfurDioxide, TotalSulfurDioxide,
pH, Sulphates, Alcohol, and
STARS variables contain several missing values.
LabelAppeal, AcidIndex, and STARS
are discrete variables (i.e categorical) and the rest are continuous. We
also noticed that several numerical features representing chemical
quantities in the wine exhibit negative minimum values. We hypothesize
that the original chemical measurements might have been normalized
(potentially via a log transform), allowing for negative values.
However, from a physical standpoint, negative concentrations shouldn’t
be possible. Despite this, we’ve opted to retain these values as they
are.
We see that the response variable, TARGET value is
always between 0 and 8, which makes sense as this is the “Number of
Cases of Wine Sold”. In addition, the distribution of wine cases sold,
given at least one sale, exhibits symmetry and approximates normality.
The target variable, number of cases, is shown below. The data shows a
large number of zero values.
We observe that continuous variables exhibit a somewhat normal steep
distribution. However, variables such as AcidIndex and
STARS display right skewness.
In the box plot, there are not many outliers in the variables.
However, TotalSulfurDioxide,
FreeSulfurDioxide, and ResidualSugar variables
have large ranges compared to other variables.We can tell a high number
of variables have numerous outliers.
The bar charts compare the three discrete categorical variables to
the TARGET variable. AcidIndex shows large
quantity of wine were sold with the index number 7 and 8.
LabelAppeal shows us generic label does yield higher number
of wine samples per order. Lastly, STARS shows high star
wine bottles have high price tags. For each of these predictors, there
appears to be a significant relationship between the ordered levels and
the number of wine cases sold.
| x | |
|---|---|
| TARGET | 1.0000000 |
| FixedAcidity | -0.0125381 |
| VolatileAcidity | -0.0759979 |
| CitricAcid | 0.0023450 |
| ResidualSugar | 0.0035196 |
| Chlorides | -0.0304301 |
| FreeSulfurDioxide | 0.0226398 |
| TotalSulfurDioxide | 0.0216021 |
| Density | -0.0475989 |
| pH | 0.0002199 |
| Sulphates | -0.0212204 |
| Alcohol | 0.0737771 |
| LabelAppeal | 0.4979465 |
| AcidIndex | -0.1676431 |
| STARS | 0.5546857 |
In the correlation table, we can see that STARS and
LabelAppeal are most positively correlated variables with
the response variable. Also, we some mild negative correlation between
the response variable and AcidIndex variable.
The data has some wine quality measures that are negative that should
not be. We will simply take the absolute value of these for now. The
alternative would be to center by adding the min of each variable. Since
we are given little information about the source of this dataset, and
why these quality measures are so off, it is difficult to ascertain the
best overall approach. For LabelAppeal, we will add the
min.
According to the graph, the data set has multiple variables with
missing variables. The STARS variable has the most NA
values. The Sulphates variable records missing values in
roughly 10% of observations, while the remaining six predictors have
missing values ranging from 3% to 5%. These missing values will be
imputed later on during the data preparation using the MICE package and
random forest prediction method.
We can see that each of the remaining variables with missing values seem to be MAR, as the mice imputation distributions roughly match the existing. We’ll also run the mice imputation again on both the train and test set. Instead of using it for our models, however, we’ll simplify our run and fill in our data. Finally, after our analysis, we can use it in in our model, we’ll update STARS to become a factor variable.
| Mean | Std.Dev | Min | Q1 | Median | Q3 | Max | MAD | IQR | CV | Skewness | SE.Skewness | Kurtosis | N.Valid | Pct.Valid | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AcidIndex | 7.7727237 | 1.3239264 | 4.00000 | 7.00000 | 8.00000 | 8.00000 | 17.00000 | 1.4826000 | 1.000000 | 0.1703298 | 1.6484959 | 0.0216523 | 5.1900925 | 12795 | 100 |
| Alcohol | 10.4807704 | 3.7233165 | -4.70000 | 9.00000 | 10.40000 | 12.40000 | 26.50000 | 2.3721600 | 3.400000 | 0.3552522 | -0.0306208 | 0.0216523 | 1.5399928 | 12795 | 100 |
| Chlorides | 0.2227882 | 0.2342173 | 0.00000 | 0.04600 | 0.09900 | 0.36900 | 1.35100 | 0.1008168 | 0.323000 | 1.0513004 | 1.4798442 | 0.0216523 | 2.1837525 | 12795 | 100 |
| CitricAcid | 0.6863150 | 0.6060052 | 0.00000 | 0.28000 | 0.44000 | 0.97000 | 3.86000 | 0.3261720 | 0.690000 | 0.8829841 | 1.6428101 | 0.0216523 | 2.9474433 | 12795 | 100 |
| Density | 0.9942027 | 0.0265376 | 0.88809 | 0.98772 | 0.99449 | 1.00052 | 1.09924 | 0.0093552 | 0.012795 | 0.0266924 | -0.0186938 | 0.0216523 | 1.8999592 | 12795 | 100 |
| FixedAcidity | 8.0632513 | 4.9961186 | 0.00000 | 5.60000 | 7.00000 | 9.80000 | 34.40000 | 2.9652000 | 4.200000 | 0.6196159 | 1.1742806 | 0.0216523 | 1.9666235 | 12795 | 100 |
| FreeSulfurDioxide | 106.2892927 | 108.0521706 | 0.00000 | 28.00000 | 55.00000 | 171.00000 | 623.00000 | 59.3040000 | 143.000000 | 1.0165857 | 1.5351875 | 0.0216523 | 2.4592719 | 12795 | 100 |
| LabelAppeal | 1.9909340 | 0.8910892 | 0.00000 | 1.00000 | 2.00000 | 3.00000 | 4.00000 | 1.4826000 | 2.000000 | 0.4475735 | 0.0084295 | 0.0216523 | -0.2622916 | 12795 | 100 |
| pH | 3.2067104 | 0.6800719 | 0.48000 | 2.95000 | 3.20000 | 3.47000 | 6.13000 | 0.3854760 | 0.515000 | 0.2120777 | 0.0385950 | 0.0216523 | 1.6292482 | 12795 | 100 |
| ResidualSugar | 23.2877061 | 24.8777541 | 0.00000 | 3.60000 | 12.90000 | 38.50000 | 141.15000 | 16.3086000 | 34.900000 | 1.0682784 | 1.4741280 | 0.0216523 | 2.2538118 | 12795 | 100 |
| Sulphates | 0.8460883 | 0.6558266 | 0.00000 | 0.43000 | 0.59000 | 1.10000 | 4.24000 | 0.3261720 | 0.670000 | 0.7751279 | 1.6917630 | 0.0216523 | 3.2294753 | 12795 | 100 |
| TARGET | 3.0290739 | 1.9263682 | 0.00000 | 2.00000 | 3.00000 | 4.00000 | 8.00000 | 1.4826000 | 2.000000 | 0.6359595 | -0.3263010 | 0.0216523 | -0.8772457 | 12795 | 100 |
| TotalSulfurDioxide | 204.9099258 | 163.8645075 | 0.00000 | 100.00000 | 154.00000 | 263.00000 | 1057.00000 | 102.2994000 | 163.000000 | 0.7996904 | 1.6100847 | 0.0216523 | 3.0153943 | 12795 | 100 |
| VolatileAcidity | 0.6410856 | 0.5556141 | 0.00000 | 0.25000 | 0.41000 | 0.91000 | 3.68000 | 0.3261720 | 0.660000 | 0.8666770 | 1.6529782 | 0.0216523 | 3.0833040 | 12795 | 100 |
With our transformations complete, we can now add these into our
cleaned_train dataframe and continue on to build our
models. To better measure each model performance, we split our data into
a training and testing data set. We will train using the first, then
measure model performance again the testing hold out set.
## [1] "Number of Training Samples: 10238"
## [1] "Number of Testing Samples: 2557"
In this first model, we include all available features:
FixedAcidity, VolatileAcidity,
CitricAcid, ResidualSugar,
Chlorides, FreeSulfurDioxide,
TotalSulfurDioxide, Density, pH,
Sulphates, Alcohol, LabelAppeal,
AcidIndex, STARS
##
## Call:
## glm(formula = TARGET ~ ., family = poisson, data = trainingData)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.2730 -0.6622 -0.0022 0.4590 3.4596
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.88401786 0.22058337 4.008 0.0000613 ***
## FixedAcidity -0.00019300 0.00116712 -0.165 0.868655
## VolatileAcidity -0.03800614 0.01053001 -3.609 0.000307 ***
## CitricAcid 0.00482117 0.00935375 0.515 0.606255
## ResidualSugar 0.00008185 0.00022848 0.358 0.720174
## Chlorides -0.04430243 0.02467192 -1.796 0.072548 .
## FreeSulfurDioxide 0.00006365 0.00005342 1.192 0.233451
## TotalSulfurDioxide 0.00008298 0.00003462 2.397 0.016543 *
## Density -0.38653351 0.21575673 -1.792 0.073209 .
## pH -0.00883178 0.00848151 -1.041 0.297737
## Sulphates -0.00464684 0.00870712 -0.534 0.593561
## Alcohol 0.00307482 0.00153833 1.999 0.045630 *
## LabelAppeal 0.15940496 0.00687464 23.187 < 0.0000000000000002 ***
## AcidIndex -0.07667561 0.00511881 -14.979 < 0.0000000000000002 ***
## STARS1 0.75937762 0.02186553 34.729 < 0.0000000000000002 ***
## STARS2 1.08626089 0.02032575 53.443 < 0.0000000000000002 ***
## STARS3 1.20920722 0.02137683 56.566 < 0.0000000000000002 ***
## STARS4 1.32798791 0.02736105 48.536 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 18279 on 10237 degrees of freedom
## Residual deviance: 10978 on 10220 degrees of freedom
## AIC: 36572
##
## Number of Fisher Scoring iterations: 6
## RMSE Rsquared MAE aic bic
## 2.5896243 0.5177046 2.2206236 36571.6286867 36701.8381949
In this Model 2, we only include the most predictive features :
VolatileAcidity, TotalSulfurDioxide,
Alcohol, LabelAppeal, AcidIndex,
STARS
##
## Call:
## glm(formula = TARGET ~ VolatileAcidity + TotalSulfurDioxide +
## Alcohol + LabelAppeal + AcidIndex + STARS, family = poisson,
## data = trainingData)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.3026 -0.6633 -0.0009 0.4594 3.4943
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.46904729 0.04994358 9.392 < 0.0000000000000002 ***
## VolatileAcidity -0.03830760 0.01052516 -3.640 0.000273 ***
## TotalSulfurDioxide 0.00008305 0.00003461 2.400 0.016413 *
## Alcohol 0.00314054 0.00153790 2.042 0.041142 *
## LabelAppeal 0.15938001 0.00686767 23.207 < 0.0000000000000002 ***
## AcidIndex -0.07702508 0.00504447 -15.269 < 0.0000000000000002 ***
## STARS1 0.76093283 0.02185558 34.816 < 0.0000000000000002 ***
## STARS2 1.08797214 0.02031653 53.551 < 0.0000000000000002 ***
## STARS3 1.21122854 0.02136172 56.701 < 0.0000000000000002 ***
## STARS4 1.32905420 0.02735111 48.592 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 18279 on 10237 degrees of freedom
## Residual deviance: 10988 on 10228 degrees of freedom
## AIC: 36566
##
## Number of Fisher Scoring iterations: 6
## RMSE Rsquared MAE aic bic
## 2.5897983 0.5170154 2.2208191 36565.5340896 36637.8727052
Similar to Poisson Model 1, the predictors for the following model
are: FixedAcidity, VolatileAcidity,
CitricAcid, ResidualSugar,
Chlorides, FreeSulfurDioxide,
TotalSulfurDioxide, Density, pH,
Sulphates, Alcohol, LabelAppeal,
AcidIndex, STARS
##
## Call:
## glm.nb(formula = TARGET ~ ., data = trainingData, init.theta = 40727.91209,
## link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.2729 -0.6622 -0.0022 0.4590 3.4595
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.88405271 0.22059333 4.008 0.0000613 ***
## FixedAcidity -0.00019303 0.00116717 -0.165 0.868642
## VolatileAcidity -0.03800764 0.01053047 -3.609 0.000307 ***
## CitricAcid 0.00482133 0.00935418 0.515 0.606260
## ResidualSugar 0.00008184 0.00022849 0.358 0.720193
## Chlorides -0.04430344 0.02467302 -1.796 0.072555 .
## FreeSulfurDioxide 0.00006365 0.00005342 1.192 0.233443
## TotalSulfurDioxide 0.00008298 0.00003462 2.397 0.016543 *
## Density -0.38654219 0.21576653 -1.791 0.073216 .
## pH -0.00883271 0.00848189 -1.041 0.297708
## Sulphates -0.00464712 0.00870751 -0.534 0.593556
## Alcohol 0.00307470 0.00153840 1.999 0.045648 *
## LabelAppeal 0.15940399 0.00687495 23.186 < 0.0000000000000002 ***
## AcidIndex -0.07667809 0.00511900 -14.979 < 0.0000000000000002 ***
## STARS1 0.75937649 0.02186599 34.729 < 0.0000000000000002 ***
## STARS2 1.08625971 0.02032621 53.441 < 0.0000000000000002 ***
## STARS3 1.20920687 0.02137741 56.565 < 0.0000000000000002 ***
## STARS4 1.32798842 0.02736226 48.534 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(40727.91) family taken to be 1)
##
## Null deviance: 18278 on 10237 degrees of freedom
## Residual deviance: 10978 on 10220 degrees of freedom
## AIC: 36574
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 40728
## Std. Err.: 38350
## Warning while fitting theta: iteration limit reached
##
## 2 x log-likelihood: -36535.96
## RMSE Rsquared MAE aic bic
## 2.5896242 0.5177046 2.2206234 36573.9628866 36711.4062564
Similar to Poisson Model 2, the predictors for the following model
are: VolatileAcidity, FreeSulfurDioxide,
TotalSulfurDioxide, Alcohol,
LabelAppeal, AcidIndex, STARS
##
## Call:
## glm.nb(formula = TARGET ~ VolatileAcidity + FreeSulfurDioxide +
## TotalSulfurDioxide + Alcohol + LabelAppeal + AcidIndex +
## STARS, data = cleaned_train, init.theta = 40659.48234, link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.3090 -0.6536 -0.0031 0.4547 3.7836
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.47732250 0.04491006 10.628 < 0.0000000000000002 ***
## VolatileAcidity -0.03706654 0.00939412 -3.946 0.0000796 ***
## FreeSulfurDioxide 0.00005268 0.00004689 1.123 0.26127
## TotalSulfurDioxide 0.00008837 0.00003104 2.847 0.00442 **
## Alcohol 0.00368970 0.00137441 2.685 0.00726 **
## LabelAppeal 0.15942451 0.00612676 26.021 < 0.0000000000000002 ***
## AcidIndex -0.08028377 0.00449971 -17.842 < 0.0000000000000002 ***
## STARS1 0.77095559 0.01953002 39.475 < 0.0000000000000002 ***
## STARS2 1.09236374 0.01820285 60.011 < 0.0000000000000002 ***
## STARS3 1.21265049 0.01917736 63.233 < 0.0000000000000002 ***
## STARS4 1.32851855 0.02428483 54.706 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(40659.48) family taken to be 1)
##
## Null deviance: 22860 on 12794 degrees of freedom
## Residual deviance: 13683 on 12784 degrees of freedom
## AIC: 45650
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 40659
## Std. Err.: 34189
## Warning while fitting theta: iteration limit reached
##
## 2 x log-likelihood: -45626.25
## RMSE Rsquared MAE aic bic
## 2.5893189 0.5158105 2.2199834 45650.2463626 45739.7280796
The predictors for the following model are:
FixedAcidity, VolatileAcidity,
CitricAcid, ResidualSugar,
Chlorides, FreeSulfurDioxide,
TotalSulfurDioxide, Density, pH,
Sulphates, Alcohol, LabelAppeal,
AcidIndex, STARS
##
## Call:
## lm(formula = TARGET ~ ., data = trainingData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.8414 -0.8737 0.0236 0.8555 5.6826
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.020068489 0.500891225 6.029 0.0000000017 ***
## FixedAcidity 0.000469154 0.002635703 0.178 0.85873
## VolatileAcidity -0.112649856 0.023435095 -4.807 0.0000015548 ***
## CitricAcid 0.019109205 0.021413858 0.892 0.37221
## ResidualSugar 0.000007501 0.000521091 0.014 0.98851
## Chlorides -0.124471650 0.055517825 -2.242 0.02498 *
## FreeSulfurDioxide 0.000204177 0.000121857 1.676 0.09386 .
## TotalSulfurDioxide 0.000249243 0.000079087 3.152 0.00163 **
## Density -1.156597986 0.492177579 -2.350 0.01879 *
## pH -0.018547127 0.019254329 -0.963 0.33543
## Sulphates -0.010122588 0.019789678 -0.512 0.60901
## Alcohol 0.010995114 0.003491237 3.149 0.00164 **
## LabelAppeal 0.465246728 0.015346756 30.316 < 0.0000000000000002 ***
## AcidIndex -0.191380479 0.010214692 -18.736 < 0.0000000000000002 ***
## STARS1 1.350459388 0.037032187 36.467 < 0.0000000000000002 ***
## STARS2 2.395624269 0.035846190 66.831 < 0.0000000000000002 ***
## STARS3 2.974936849 0.041397001 71.864 < 0.0000000000000002 ***
## STARS4 3.688404529 0.067342857 54.771 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.311 on 10220 degrees of freedom
## Multiple R-squared: 0.537, Adjusted R-squared: 0.5362
## F-statistic: 697.3 on 17 and 10220 DF, p-value: < 0.00000000000000022
## RMSE Rsquared MAE aic bic
## 1.309785 0.537013 1.031076 34617.903309 34755.346679
For the final Linear Model, we leverage stepAIC on our
Linear Model 5 to choose the most important features.
##
## Call:
## lm(formula = TARGET ~ VolatileAcidity + Chlorides + FreeSulfurDioxide +
## TotalSulfurDioxide + Density + Alcohol + LabelAppeal + AcidIndex +
## STARS, data = trainingData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.8773 -0.8745 0.0206 0.8552 5.6964
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.97476244 0.49647332 5.992 0.00000000215 ***
## VolatileAcidity -0.11312314 0.02342445 -4.829 0.00000139018 ***
## Chlorides -0.12555620 0.05549618 -2.262 0.02369 *
## FreeSulfurDioxide 0.00020461 0.00012182 1.680 0.09307 .
## TotalSulfurDioxide 0.00024969 0.00007906 3.158 0.00159 **
## Density -1.17196189 0.49197148 -2.382 0.01723 *
## Alcohol 0.01103529 0.00348994 3.162 0.00157 **
## LabelAppeal 0.46516486 0.01534023 30.323 < 0.0000000000000002 ***
## AcidIndex -0.19022043 0.01002596 -18.973 < 0.0000000000000002 ***
## STARS1 1.35114025 0.03701656 36.501 < 0.0000000000000002 ***
## STARS2 2.39648117 0.03582750 66.889 < 0.0000000000000002 ***
## STARS3 2.97643406 0.04137313 71.941 < 0.0000000000000002 ***
## STARS4 3.68969165 0.06731839 54.810 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.311 on 10225 degrees of freedom
## Multiple R-squared: 0.5369, Adjusted R-squared: 0.5364
## F-statistic: 988 on 12 and 10225 DF, p-value: < 0.00000000000000022
## RMSE Rsquared MAE aic bic
## 1.309912 0.536923 1.030963 34609.892810 34711.166872
| RMSE | Rsquared | MAE | aic | bic | |
|---|---|---|---|---|---|
| Poission_Evaluate1 | 2.589624 | 0.5177046 | 2.220624 | 36571.63 | 36701.84 |
| Poission_Evaluate2 | 2.589798 | 0.5170154 | 2.220819 | 36565.53 | 36637.87 |
| Negative_Binomial_eval3 | 2.589624 | 0.5177046 | 2.220623 | 36573.96 | 36711.41 |
| Negative_Binomial_eval4 | 2.589319 | 0.5158105 | 2.219983 | 45650.25 | 45739.73 |
| Linear_Regression_eval5 | 1.309785 | 0.5370130 | 1.031076 | 34617.90 | 34755.35 |
| Linear_Regression_eval6 | 1.309913 | 0.5369230 | 1.030963 | 34609.89 | 34711.17 |
This table summarizes the RMSE, RSQUARED, MAE, AIC and BIC for all SIX models. The Linear regressions (Linear Model 5 and Linear Model 6) had the overall best performance based on RMSE and RSQUARED; as well as Linear Model 6 had the best performance based on AIC and BIC.
Finally, we chose Multiple Linear Regression Model 6 as our final model since it had a far lower AIC and BIC.
## FixedAcidity VolatileAcidity CitricAcid ResidualSugar Chlorides
## 1 5.4 0.860 0.27 10.7 0.092
## 2 12.4 0.385 0.76 19.7 1.169
## 3 7.2 1.750 0.17 33.0 0.065
## 4 6.2 0.100 1.80 1.0 0.179
## 5 11.4 0.210 0.28 1.2 0.038
## 6 17.6 0.040 1.15 1.4 0.535
## FreeSulfurDioxide TotalSulfurDioxide Density pH Sulphates Alcohol
## 1 23 398 0.98527 5.02 0.64 12.30
## 2 37 68 0.99048 3.37 1.09 16.00
## 3 9 76 1.04641 4.61 0.68 8.55
## 4 104 89 0.98877 3.20 2.11 12.30
## 5 70 53 1.02899 2.54 0.07 4.80
## 6 250 140 0.95028 3.06 0.02 11.40
## LabelAppeal AcidIndex STARS TARGET
## 1 1 6 0 1.2748842
## 2 2 6 2 4.0102323
## 3 2 8 1 2.4171591
## 4 1 8 1 2.2559522
## 5 2 10 0 0.8489494
## 6 3 8 4 5.5647071
The histogram shows that our predictions have a similar shape to our training Target variable, the means and medians are almost identical, and the kurtosis values are close.
The predicted file is uploaded to Github: https://github.com/waheeb123/Data-621/blob/main/Homeworks/Homework%205/DATA621_HW5_Predictions.csv
The Linear Regression model that we chose as the best model has an
adjusted R2 value of 0.53. All predictors and levels within each
categorical predictor are significant at the 5% level. The model
coefficients make intuitive sense. Sales of wine decrease with each
increase in the AcidIndex value. Having a
STARS rating results in more sales compared to no ratings,
and higher STARS ratings are associated with higher sales. Finally,
increases in alcohol content are associated with higher sales.The
interpretation of the coefficients aligns with the insights gained
during exploratory data analysis.