Data is the key to making informed decisions and achieving success in modern business. We’ll begin our analysis by examining the dataset for outliers, missing data, potential encoding errors, multicollinearity etc., then, we’ll implement any required data cleaning procedures. Once we’ve prepared a reliable dataset, we’ll construct and assess three distinct linear models to forecast sales. Our dataset comprises both training and evaluation data; we’ll train the models using the primary training dataset and then assess their performance against the separate evaluation dataset. Finally, we’ll choose a best model that strikes the optimal balance between accuracy and simplicity.
## tibble [12,795 × 15] (S3: tbl_df/tbl/data.frame)
## $ TARGET : int [1:12795] 3 3 5 3 4 0 0 4 3 6 ...
## $ FixedAcidity : num [1:12795] 3.2 4.5 7.1 5.7 8 11.3 7.7 6.5 14.8 5.5 ...
## $ VolatileAcidity : num [1:12795] 1.16 0.16 2.64 0.385 0.33 0.32 0.29 -1.22 0.27 -0.22 ...
## $ CitricAcid : num [1:12795] -0.98 -0.81 -0.88 0.04 -1.26 0.59 -0.4 0.34 1.05 0.39 ...
## $ ResidualSugar : num [1:12795] 54.2 26.1 14.8 18.8 9.4 ...
## $ Chlorides : num [1:12795] -0.567 -0.425 0.037 -0.425 NA 0.556 0.06 0.04 -0.007 -0.277 ...
## $ FreeSulfurDioxide : num [1:12795] NA 15 214 22 -167 -37 287 523 -213 62 ...
## $ TotalSulfurDioxide: num [1:12795] 268 -327 142 115 108 15 156 551 NA 180 ...
## $ Density : num [1:12795] 0.993 1.028 0.995 0.996 0.995 ...
## $ pH : num [1:12795] 3.33 3.38 3.12 2.24 3.12 3.2 3.49 3.2 4.93 3.09 ...
## $ Sulphates : num [1:12795] -0.59 0.7 0.48 1.83 1.77 1.29 1.21 NA 0.26 0.75 ...
## $ Alcohol : num [1:12795] 9.9 NA 22 6.2 13.7 15.4 10.3 11.6 15 12.6 ...
## $ LabelAppeal : int [1:12795] 0 -1 -1 -1 0 0 0 1 0 0 ...
## $ AcidIndex : int [1:12795] 8 7 8 6 9 11 8 7 6 8 ...
## $ STARS : int [1:12795] 2 3 3 1 2 NA NA 3 NA 4 ...
## tibble [3,335 × 15] (S3: tbl_df/tbl/data.frame)
## $ TARGET : logi [1:3335] NA NA NA NA NA NA ...
## $ FixedAcidity : num [1:3335] 5.4 12.4 7.2 6.2 11.4 17.6 15.5 15.9 11.6 3.8 ...
## $ VolatileAcidity : num [1:3335] -0.86 0.385 1.75 0.1 0.21 0.04 0.53 1.19 0.32 0.22 ...
## $ CitricAcid : num [1:3335] 0.27 -0.76 0.17 1.8 0.28 -1.15 -0.53 1.14 0.55 0.31 ...
## $ ResidualSugar : num [1:3335] -10.7 -19.7 -33 1 1.2 1.4 4.6 31.9 -50.9 -7.7 ...
## $ Chlorides : num [1:3335] 0.092 1.169 0.065 -0.179 0.038 ...
## $ FreeSulfurDioxide : num [1:3335] 23 -37 9 104 70 -250 10 115 35 40 ...
## $ TotalSulfurDioxide: num [1:3335] 398 68 76 89 53 140 17 381 83 129 ...
## $ Density : num [1:3335] 0.985 0.99 1.046 0.989 1.029 ...
## $ pH : num [1:3335] 5.02 3.37 4.61 3.2 2.54 3.06 3.07 2.99 3.32 4.72 ...
## $ Sulphates : num [1:3335] 0.64 1.09 0.68 2.11 -0.07 -0.02 0.75 0.31 2.18 -0.64 ...
## $ Alcohol : num [1:3335] 12.3 16 8.55 12.3 4.8 11.4 8.5 11.4 -0.5 10.9 ...
## $ LabelAppeal : int [1:3335] -1 0 0 -1 0 1 0 1 0 0 ...
## $ AcidIndex : int [1:3335] 6 6 8 8 10 8 12 7 12 7 ...
## $ STARS : int [1:3335] NA 2 1 1 NA 4 3 NA NA NA ...
The training data set has 12,795 rows and 15 columns: 14 features and
1 response variable, TARGET. The variable INDEX is used for
observed identification. Twelve of the features describe the chemical
properties of wine. The remaining two predictors are rating variables:
LabelAppeal refers to the perceived attractiveness of a wine’s product
label, while STARS is an assessment of wine quality. The
output variable, TARGET, is a count measure indicating the
number of wine case purchases by distributors.
| Variable | N | Mean | Std. Dev. | Min | Pctl. 25 | Pctl. 75 | Max |
|---|---|---|---|---|---|---|---|
| TARGET | 12795 | 3 | 1.9 | 0 | 2 | 4 | 8 |
| FixedAcidity | 12795 | 7.1 | 6.3 | -18 | 5.2 | 9.5 | 34 |
| VolatileAcidity | 12795 | 0.32 | 0.78 | -2.8 | 0.13 | 0.64 | 3.7 |
| CitricAcid | 12795 | 0.31 | 0.86 | -3.2 | 0.03 | 0.58 | 3.9 |
| ResidualSugar | 12179 | 5.4 | 34 | -128 | -2 | 16 | 141 |
| Chlorides | 12157 | 0.055 | 0.32 | -1.2 | -0.031 | 0.15 | 1.4 |
| FreeSulfurDioxide | 12148 | 31 | 149 | -555 | 0 | 70 | 623 |
| TotalSulfurDioxide | 12113 | 121 | 232 | -823 | 27 | 208 | 1057 |
| Density | 12795 | 0.99 | 0.027 | 0.89 | 0.99 | 1 | 1.1 |
| pH | 12400 | 3.2 | 0.68 | 0.48 | 3 | 3.5 | 6.1 |
| Sulphates | 11585 | 0.53 | 0.93 | -3.1 | 0.28 | 0.86 | 4.2 |
| Alcohol | 12142 | 10 | 3.7 | -4.7 | 9 | 12 | 26 |
| LabelAppeal | 12795 | -0.0091 | 0.89 | -2 | -1 | 1 | 2 |
| AcidIndex | 12795 | 7.8 | 1.3 | 4 | 7 | 8 | 17 |
| STARS | 9436 | 2 | 0.9 | 1 | 1 | 3 | 4 |
Of the 14 feature columns, 8 of them such as
ResidualSugar, Chlorides,
FreeSulfurDioxide, TotalSulfurDioxide,
pH, Sulphates, Alcohol, and
STARS variables contain several missing values.
LabelAppeal, AcidIndex, and STARS
are discrete variables (i.e categorical) and the rest are continuous. We
also noticed that several numerical features representing chemical
quantities in the wine exhibit negative minimum values. We hypothesize
that the original chemical measurements might have been normalized
(potentially via a log transform), allowing for negative values.
However, from a physical standpoint, negative concentrations shouldn’t
be possible. Despite this, we’ve opted to retain these values as they
are.
We see that the response variable, TARGET value is
always between 0 and 8, which makes sense as this is the “Number of
Cases of Wine Sold”. In addition, the distribution of wine cases sold,
given at least one sale, exhibits symmetry and approximates normality.
The target variable, number of cases, is shown below. The data shows a
large number of zero values.
We observe that continuous variables exhibit a somewhat normal steep
distribution. However, variables such as AcidIndex and
STARS display right skewness.
In the box plot, there are not many outliers in the variables.
However, TotalSulfurDioxide,
FreeSulfurDioxide, and ResidualSugar variables
have large ranges compared to other variables.We can tell a high number
of variables have numerous outliers.
The bar charts compare the three discrete categorical variables to
the TARGET variable. AcidIndex shows large
quantity of wine were sold with the index number 7 and 8.
LabelAppeal shows us generic label does yield higher number
of wine samples per order. Lastly, STARS shows high star
wine bottles have high price tags. For each of these predictors, there
appears to be a significant relationship between the ordered levels and
the number of wine cases sold.
| x | |
|---|---|
| TARGET | 1.0000000 |
| FixedAcidity | -0.0125381 |
| VolatileAcidity | -0.0759979 |
| CitricAcid | 0.0023450 |
| ResidualSugar | 0.0035196 |
| Chlorides | -0.0304301 |
| FreeSulfurDioxide | 0.0226398 |
| TotalSulfurDioxide | 0.0216021 |
| Density | -0.0475989 |
| pH | 0.0002199 |
| Sulphates | -0.0212204 |
| Alcohol | 0.0737771 |
| LabelAppeal | 0.4979465 |
| AcidIndex | -0.1676431 |
| STARS | 0.5546857 |
In the correlation table, we can see that STARS and
LabelAppeal are most positively correlated variables with
the response variable. Also, we some mild negative correlation between
the response variable and AcidIndex variable.
The data has some wine quality measures that are negative that should
not be. We will simply take the absolute value of these for now. The
alternative would be to center by adding the min of each variable. Since
we are given little information about the source of this dataset, and
why these quality measures are so off, it is difficult to ascertain the
best overall approach. For LabelAppeal, we will add the
min.
According to the graph, the data set has multiple variables with
missing variables. The STARS variable has the most NA
values. The Sulphates variable records missing values in
roughly 10% of observations, while the remaining six predictors have
missing values ranging from 3% to 5%. These missing values will be
imputed later on during the data preparation using the MICE package and
random forest prediction method.
We can see that each of the remaining variables with missing values seem to be MAR, as the mice imputation distributions roughly match the existing. We’ll also run the mice imputation again on both the train and test set. Instead of using it for our models, however, we’ll simplify our run and fill in our data. Finally, after our analysis, we can use it in in our model, we’ll update STARS to become a factor variable.
| Mean | Std.Dev | Min | Q1 | Median | Q3 | Max | MAD | IQR | CV | Skewness | SE.Skewness | Kurtosis | N.Valid | Pct.Valid | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AcidIndex | 7.7727237 | 1.3239264 | 4.00000 | 7.00000 | 8.00000 | 8.00000 | 17.00000 | 1.4826000 | 1.000000 | 0.1703298 | 1.6484959 | 0.0216523 | 5.1900925 | 12795 | 100 |
| Alcohol | 10.4970736 | 3.7271443 | -4.70000 | 9.00000 | 10.40000 | 12.40000 | 26.50000 | 2.4462900 | 3.400000 | 0.3550651 | -0.0203077 | 0.0216523 | 1.5448710 | 12795 | 100 |
| Chlorides | 0.2227006 | 0.2340474 | 0.00000 | 0.04600 | 0.09900 | 0.36900 | 1.35100 | 0.1008168 | 0.323000 | 1.0509511 | 1.4732261 | 0.0216523 | 2.1533220 | 12795 | 100 |
| CitricAcid | 0.6863150 | 0.6060052 | 0.00000 | 0.28000 | 0.44000 | 0.97000 | 3.86000 | 0.3261720 | 0.690000 | 0.8829841 | 1.6428101 | 0.0216523 | 2.9474433 | 12795 | 100 |
| Density | 0.9942027 | 0.0265376 | 0.88809 | 0.98772 | 0.99449 | 1.00052 | 1.09924 | 0.0093552 | 0.012795 | 0.0266924 | -0.0186938 | 0.0216523 | 1.8999592 | 12795 | 100 |
| FixedAcidity | 8.0632513 | 4.9961186 | 0.00000 | 5.60000 | 7.00000 | 9.80000 | 34.40000 | 2.9652000 | 4.200000 | 0.6196159 | 1.1742806 | 0.0216523 | 1.9666235 | 12795 | 100 |
| FreeSulfurDioxide | 106.8281360 | 108.4394316 | 0.00000 | 28.00000 | 56.00000 | 172.00000 | 623.00000 | 60.7866000 | 144.000000 | 1.0150831 | 1.5281310 | 0.0216523 | 2.4394290 | 12795 | 100 |
| LabelAppeal | 1.9909340 | 0.8910892 | 0.00000 | 1.00000 | 2.00000 | 3.00000 | 4.00000 | 1.4826000 | 2.000000 | 0.4475735 | 0.0084295 | 0.0216523 | -0.2622916 | 12795 | 100 |
| pH | 3.2069050 | 0.6796027 | 0.48000 | 2.95000 | 3.20000 | 3.47000 | 6.13000 | 0.3854760 | 0.520000 | 0.2119186 | 0.0466170 | 0.0216523 | 1.6349942 | 12795 | 100 |
| ResidualSugar | 23.4240953 | 24.9813809 | 0.00000 | 3.60000 | 12.90000 | 38.80000 | 141.15000 | 16.3086000 | 35.200000 | 1.0664822 | 1.4657347 | 0.0216523 | 2.2249837 | 12795 | 100 |
| Sulphates | 0.8446760 | 0.6542444 | 0.00000 | 0.43000 | 0.59000 | 1.09000 | 4.24000 | 0.3261720 | 0.660000 | 0.7745507 | 1.6908843 | 0.0216523 | 3.1944468 | 12795 | 100 |
| TARGET | 3.0290739 | 1.9263682 | 0.00000 | 2.00000 | 3.00000 | 4.00000 | 8.00000 | 1.4826000 | 2.000000 | 0.6359595 | -0.3263010 | 0.0216523 | -0.8772457 | 12795 | 100 |
| TotalSulfurDioxide | 203.8295819 | 162.6289281 | 0.00000 | 100.00000 | 154.00000 | 261.00000 | 1057.00000 | 102.2994000 | 161.000000 | 0.7978672 | 1.6287104 | 0.0216523 | 3.1504058 | 12795 | 100 |
| VolatileAcidity | 0.6410856 | 0.5556141 | 0.00000 | 0.25000 | 0.41000 | 0.91000 | 3.68000 | 0.3261720 | 0.660000 | 0.8666770 | 1.6529782 | 0.0216523 | 3.0833040 | 12795 | 100 |
With our transformations complete, we can now add these into our
cleaned_train dataframe and continue on to build our
models. To better measure each model performance, we split our data into
a training and testing data set. We will train using the first, then
measure model performance again the testing hold out set.
## [1] "Number of Training Samples: 10238"
## [1] "Number of Testing Samples: 2557"
In this first model, we include all available features:
FixedAcidity, VolatileAcidity,
CitricAcid, ResidualSugar,
Chlorides, FreeSulfurDioxide,
TotalSulfurDioxide, Density, pH,
Sulphates, Alcohol, LabelAppeal,
AcidIndex, STARS
##
## Call:
## glm(formula = TARGET ~ ., family = poisson, data = trainingData)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.7892257665 0.2181427186 3.618 0.000297 ***
## FixedAcidity -0.0000909220 0.0011721364 -0.078 0.938171
## VolatileAcidity -0.0338187850 0.0106085477 -3.188 0.001433 **
## CitricAcid 0.0132612453 0.0092961426 1.427 0.153715
## ResidualSugar 0.0000007492 0.0002263610 0.003 0.997359
## Chlorides -0.0110526480 0.0244656468 -0.452 0.651440
## FreeSulfurDioxide 0.0000887012 0.0000517645 1.714 0.086611 .
## TotalSulfurDioxide 0.0000829237 0.0000347822 2.384 0.017122 *
## Density -0.2817475213 0.2128419603 -1.324 0.185589
## pH -0.0188391082 0.0084099701 -2.240 0.025085 *
## Sulphates -0.0145419498 0.0088317235 -1.647 0.099649 .
## Alcohol 0.0048269578 0.0015330983 3.148 0.001641 **
## LabelAppeal 0.1544563349 0.0068400657 22.581 < 0.0000000000000002 ***
## AcidIndex -0.0772590165 0.0051544902 -14.989 < 0.0000000000000002 ***
## STARS1 0.7771273256 0.0218970731 35.490 < 0.0000000000000002 ***
## STARS2 1.0984642747 0.0204361382 53.751 < 0.0000000000000002 ***
## STARS3 1.2181307614 0.0215291274 56.581 < 0.0000000000000002 ***
## STARS4 1.3319344540 0.0272443113 48.889 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 18214 on 10237 degrees of freedom
## Residual deviance: 10883 on 10220 degrees of freedom
## AIC: 36525
##
## Number of Fisher Scoring iterations: 6
## RMSE Rsquared MAE aic bic
## 2.5918810 0.5189506 2.2210284 36524.6844093 36654.8939175
In this Model 2, we only include the most predictive features :
VolatileAcidity, TotalSulfurDioxide,
Alcohol, LabelAppeal, AcidIndex,
STARS
##
## Call:
## glm(formula = TARGET ~ VolatileAcidity + TotalSulfurDioxide +
## Alcohol + LabelAppeal + AcidIndex + STARS, family = poisson,
## data = trainingData)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.44778794 0.04993620 8.967 < 0.0000000000000002 ***
## VolatileAcidity -0.03437320 0.01060849 -3.240 0.00119 **
## TotalSulfurDioxide 0.00008357 0.00003477 2.404 0.01623 *
## Alcohol 0.00484363 0.00153248 3.161 0.00157 **
## LabelAppeal 0.15438417 0.00683771 22.578 < 0.0000000000000002 ***
## AcidIndex -0.07690721 0.00507587 -15.152 < 0.0000000000000002 ***
## STARS1 0.77900265 0.02188616 35.593 < 0.0000000000000002 ***
## STARS2 1.10114091 0.02042280 53.917 < 0.0000000000000002 ***
## STARS3 1.22080022 0.02151477 56.742 < 0.0000000000000002 ***
## STARS4 1.33463103 0.02722954 49.014 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 18214 on 10237 degrees of freedom
## Residual deviance: 10898 on 10228 degrees of freedom
## AIC: 36523
##
## Number of Fisher Scoring iterations: 6
## RMSE Rsquared MAE aic bic
## 2.5921425 0.5179013 2.2215012 36523.3146236 36595.6532392
Similar to Poisson Model 1, the predictors for the following model
are: FixedAcidity, VolatileAcidity,
CitricAcid, ResidualSugar,
Chlorides, FreeSulfurDioxide,
TotalSulfurDioxide, Density, pH,
Sulphates, Alcohol, LabelAppeal,
AcidIndex, STARS
##
## Call:
## glm.nb(formula = TARGET ~ ., data = trainingData, init.theta = 40596.4241,
## link = log)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.7892550644 0.2181525397 3.618 0.000297 ***
## FixedAcidity -0.0000909421 0.0011721890 -0.078 0.938160
## VolatileAcidity -0.0338199312 0.0106090117 -3.188 0.001433 **
## CitricAcid 0.0132615966 0.0092965767 1.427 0.153723
## ResidualSugar 0.0000007457 0.0002263713 0.003 0.997372
## Chlorides -0.0110528191 0.0244667637 -0.452 0.651450
## FreeSulfurDioxide 0.0000887052 0.0000517669 1.714 0.086611 .
## TotalSulfurDioxide 0.0000829281 0.0000347838 2.384 0.017121 *
## Density -0.2817517962 0.2128516115 -1.324 0.185603
## pH -0.0188403506 0.0084103528 -2.240 0.025082 *
## Sulphates -0.0145424359 0.0088321177 -1.647 0.099653 .
## Alcohol 0.0048269046 0.0015331678 3.148 0.001642 **
## LabelAppeal 0.1544552702 0.0068403734 22.580 < 0.0000000000000002 ***
## AcidIndex -0.0772613586 0.0051546889 -14.989 < 0.0000000000000002 ***
## STARS1 0.7771263073 0.0218975340 35.489 < 0.0000000000000002 ***
## STARS2 1.0984632972 0.0204365988 53.750 < 0.0000000000000002 ***
## STARS3 1.2181305180 0.0215297110 56.579 < 0.0000000000000002 ***
## STARS4 1.3319351604 0.0272454888 48.886 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(40596.42) family taken to be 1)
##
## Null deviance: 18213 on 10237 degrees of freedom
## Residual deviance: 10882 on 10220 degrees of freedom
## AIC: 36527
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 40596
## Std. Err.: 37984
## Warning while fitting theta: iteration limit reached
##
## 2 x log-likelihood: -36489.02
## RMSE Rsquared MAE aic bic
## 2.5918809 0.5189507 2.2210281 36527.0228900 36664.4662598
Similar to Poisson Model 2, the predictors for the following model
are: VolatileAcidity, FreeSulfurDioxide,
TotalSulfurDioxide, Alcohol,
LabelAppeal, AcidIndex, STARS
##
## Call:
## glm.nb(formula = TARGET ~ VolatileAcidity + FreeSulfurDioxide +
## TotalSulfurDioxide + Alcohol + LabelAppeal + AcidIndex +
## STARS, data = cleaned_train, init.theta = 40655.73652, link = log)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.47792499 0.04482061 10.663 < 0.0000000000000002 ***
## VolatileAcidity -0.03707089 0.00939632 -3.945 0.0000797 ***
## FreeSulfurDioxide 0.00005855 0.00004663 1.256 0.20927
## TotalSulfurDioxide 0.00008273 0.00003122 2.650 0.00805 **
## Alcohol 0.00375994 0.00137412 2.736 0.00621 **
## LabelAppeal 0.15910515 0.00612466 25.978 < 0.0000000000000002 ***
## AcidIndex -0.08030228 0.00449884 -17.850 < 0.0000000000000002 ***
## STARS1 0.77091345 0.01952986 39.474 < 0.0000000000000002 ***
## STARS2 1.09217216 0.01820457 59.994 < 0.0000000000000002 ***
## STARS3 1.21280507 0.01917562 63.247 < 0.0000000000000002 ***
## STARS4 1.32891964 0.02427990 54.733 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(40655.74) family taken to be 1)
##
## Null deviance: 22860 on 12794 degrees of freedom
## Residual deviance: 13684 on 12784 degrees of freedom
## AIC: 45651
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 40656
## Std. Err.: 34184
## Warning while fitting theta: iteration limit reached
##
## 2 x log-likelihood: -45626.74
## RMSE Rsquared MAE aic bic
## 2.5916959 0.5191463 2.2224820 45650.7381419 45740.2198589
The predictors for the following model are:
FixedAcidity, VolatileAcidity,
CitricAcid, ResidualSugar,
Chlorides, FreeSulfurDioxide,
TotalSulfurDioxide, Density, pH,
Sulphates, Alcohol, LabelAppeal,
AcidIndex, STARS
##
## Call:
## lm(formula = TARGET ~ ., data = trainingData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.7500 -0.8682 0.0239 0.8519 6.2500
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.76395948 0.49318708 5.604 0.0000000215 ***
## FixedAcidity 0.00046808 0.00263790 0.177 0.85916
## VolatileAcidity -0.10072430 0.02355649 -4.276 0.0000192128 ***
## CitricAcid 0.04269888 0.02142398 1.993 0.04628 *
## ResidualSugar -0.00023949 0.00051519 -0.465 0.64204
## Chlorides -0.02668630 0.05551780 -0.481 0.63075
## FreeSulfurDioxide 0.00027371 0.00011899 2.300 0.02145 *
## TotalSulfurDioxide 0.00025870 0.00007956 3.251 0.00115 **
## Density -0.82949711 0.48373374 -1.715 0.08642 .
## pH -0.05238229 0.01912540 -2.739 0.00618 **
## Sulphates -0.03737697 0.01981556 -1.886 0.05929 .
## Alcohol 0.01574741 0.00347893 4.527 0.0000060642 ***
## LabelAppeal 0.45188116 0.01526953 29.594 < 0.0000000000000002 ***
## AcidIndex -0.19406714 0.01029532 -18.850 < 0.0000000000000002 ***
## STARS1 1.38165553 0.03688755 37.456 < 0.0000000000000002 ***
## STARS2 2.41499920 0.03585743 67.350 < 0.0000000000000002 ***
## STARS3 2.98625589 0.04153429 71.899 < 0.0000000000000002 ***
## STARS4 3.66886490 0.06626609 55.366 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.309 on 10220 degrees of freedom
## Multiple R-squared: 0.5386, Adjusted R-squared: 0.5378
## F-statistic: 701.6 on 17 and 10220 DF, p-value: < 0.00000000000000022
## RMSE Rsquared MAE aic bic
## 1.3075630 0.5385528 1.0276638 34583.1334169 34720.5767867
For the final Linear Model, we leverage stepAIC on our
Linear Model 5 to choose the most important features.
##
## Call:
## lm(formula = TARGET ~ VolatileAcidity + CitricAcid + FreeSulfurDioxide +
## TotalSulfurDioxide + Density + pH + Sulphates + Alcohol +
## LabelAppeal + AcidIndex + STARS, data = trainingData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.7399 -0.8674 0.0238 0.8526 6.2554
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.75688975 0.49276487 5.595 0.0000000227 ***
## VolatileAcidity -0.10084579 0.02355283 -4.282 0.0000187177 ***
## CitricAcid 0.04273660 0.02141695 1.995 0.04602 *
## FreeSulfurDioxide 0.00027420 0.00011897 2.305 0.02120 *
## TotalSulfurDioxide 0.00025796 0.00007954 3.243 0.00119 **
## Density -0.83195073 0.48361572 -1.720 0.08541 .
## pH -0.05240947 0.01912021 -2.741 0.00613 **
## Sulphates -0.03743763 0.01980990 -1.890 0.05881 .
## Alcohol 0.01575533 0.00347847 4.529 0.0000059825 ***
## LabelAppeal 0.45192875 0.01526602 29.604 < 0.0000000000000002 ***
## AcidIndex -0.19386468 0.01013673 -19.125 < 0.0000000000000002 ***
## STARS1 1.38228824 0.03686744 37.493 < 0.0000000000000002 ***
## STARS2 2.41529060 0.03584458 67.382 < 0.0000000000000002 ***
## STARS3 2.98644267 0.04152478 71.920 < 0.0000000000000002 ***
## STARS4 3.66914480 0.06625020 55.383 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.309 on 10223 degrees of freedom
## Multiple R-squared: 0.5385, Adjusted R-squared: 0.5379
## F-statistic: 852.2 on 14 and 10223 DF, p-value: < 0.00000000000000022
## RMSE Rsquared MAE aic bic
## 1.3075933 0.5385314 1.0277302 34577.6091408 34693.3509259
| RMSE | Rsquared | MAE | aic | bic | |
|---|---|---|---|---|---|
| Poission_Evaluate1 | 2.591881 | 0.5189506 | 2.221028 | 36524.68 | 36654.89 |
| Poission_Evaluate2 | 2.592143 | 0.5179013 | 2.221501 | 36523.31 | 36595.65 |
| Negative_Binomial_eval3 | 2.591881 | 0.5189507 | 2.221028 | 36527.02 | 36664.47 |
| Negative_Binomial_eval4 | 2.591696 | 0.5191463 | 2.222482 | 45650.74 | 45740.22 |
| Linear_Regression_eval5 | 1.307563 | 0.5385528 | 1.027664 | 34583.13 | 34720.58 |
| Linear_Regression_eval6 | 1.307593 | 0.5385314 | 1.027730 | 34577.61 | 34693.35 |
This table summarizes the RMSE, RSQUARED, MAE, AIC and BIC for all SIX models. The Linear regressions (Linear Model 5 and Linear Model 6) had the overall best performance based on RMSE and RSQUARED; as well as Linear Model 6 had the best performance based on AIC and BIC.
Finally, we chose Multiple Linear Regression Model 6 as our final model since it had a far lower AIC and BIC.
## FixedAcidity VolatileAcidity CitricAcid ResidualSugar Chlorides
## 1 5.4 0.860 0.27 10.7 0.092
## 2 12.4 0.385 0.76 19.7 1.169
## 3 7.2 1.750 0.17 33.0 0.065
## 4 6.2 0.100 1.80 1.0 0.179
## 5 11.4 0.210 0.28 1.2 0.038
## 6 17.6 0.040 1.15 1.4 0.535
## FreeSulfurDioxide TotalSulfurDioxide Density pH Sulphates Alcohol
## 1 23 398 0.98527 5.02 0.64 12.30
## 2 37 68 0.99048 3.37 1.09 16.00
## 3 9 76 1.04641 4.61 0.68 8.55
## 4 104 89 0.98877 3.20 2.11 12.30
## 5 70 53 1.02899 2.54 0.07 4.80
## 6 250 140 0.95028 3.06 0.02 11.40
## LabelAppeal AcidIndex STARS TARGET
## 1 1 6 0 1.1664538
## 2 2 6 2 4.1448181
## 3 2 8 1 2.3420568
## 4 1 8 1 2.2829844
## 5 2 10 0 0.8295706
## 6 3 8 4 5.6085833
The histogram shows that our predictions have a similar shape to our training Target variable, the means and medians are almost identical, and the kurtosis values are close.
The predicted file is uploaded to Github: https://github.com/waheeb123/Data-621/blob/main/Homeworks/Homework%205/DATA621_HW5_Predictions.csv
The Linear Regression model that we chose as the best model has an
adjusted R2 value of 0.53. All predictors and levels within each
categorical predictor are significant at the 5% level. The model
coefficients make intuitive sense. Sales of wine decrease with each
increase in the AcidIndex value. Having a
STARS rating results in more sales compared to no ratings,
and higher STARS ratings are associated with higher sales. Finally,
increases in alcohol content are associated with higher sales.The
interpretation of the coefficients aligns with the insights gained
during exploratory data analysis.