Homework-5

Introduction

Data is the key to making informed decisions and achieving success in modern business. We’ll begin our analysis by examining the dataset for outliers, missing data, potential encoding errors, multicollinearity etc., then, we’ll implement any required data cleaning procedures. Once we’ve prepared a reliable dataset, we’ll construct and assess three distinct linear models to forecast sales. Our dataset comprises both training and evaluation data; we’ll train the models using the primary training dataset and then assess their performance against the separate evaluation dataset. Finally, we’ll choose a best model that strikes the optimal balance between accuracy and simplicity.

Data Exploration

## tibble [12,795 × 15] (S3: tbl_df/tbl/data.frame)
##  $ TARGET            : int [1:12795] 3 3 5 3 4 0 0 4 3 6 ...
##  $ FixedAcidity      : num [1:12795] 3.2 4.5 7.1 5.7 8 11.3 7.7 6.5 14.8 5.5 ...
##  $ VolatileAcidity   : num [1:12795] 1.16 0.16 2.64 0.385 0.33 0.32 0.29 -1.22 0.27 -0.22 ...
##  $ CitricAcid        : num [1:12795] -0.98 -0.81 -0.88 0.04 -1.26 0.59 -0.4 0.34 1.05 0.39 ...
##  $ ResidualSugar     : num [1:12795] 54.2 26.1 14.8 18.8 9.4 ...
##  $ Chlorides         : num [1:12795] -0.567 -0.425 0.037 -0.425 NA 0.556 0.06 0.04 -0.007 -0.277 ...
##  $ FreeSulfurDioxide : num [1:12795] NA 15 214 22 -167 -37 287 523 -213 62 ...
##  $ TotalSulfurDioxide: num [1:12795] 268 -327 142 115 108 15 156 551 NA 180 ...
##  $ Density           : num [1:12795] 0.993 1.028 0.995 0.996 0.995 ...
##  $ pH                : num [1:12795] 3.33 3.38 3.12 2.24 3.12 3.2 3.49 3.2 4.93 3.09 ...
##  $ Sulphates         : num [1:12795] -0.59 0.7 0.48 1.83 1.77 1.29 1.21 NA 0.26 0.75 ...
##  $ Alcohol           : num [1:12795] 9.9 NA 22 6.2 13.7 15.4 10.3 11.6 15 12.6 ...
##  $ LabelAppeal       : int [1:12795] 0 -1 -1 -1 0 0 0 1 0 0 ...
##  $ AcidIndex         : int [1:12795] 8 7 8 6 9 11 8 7 6 8 ...
##  $ STARS             : int [1:12795] 2 3 3 1 2 NA NA 3 NA 4 ...

## tibble [3,335 × 15] (S3: tbl_df/tbl/data.frame)
##  $ TARGET            : logi [1:3335] NA NA NA NA NA NA ...
##  $ FixedAcidity      : num [1:3335] 5.4 12.4 7.2 6.2 11.4 17.6 15.5 15.9 11.6 3.8 ...
##  $ VolatileAcidity   : num [1:3335] -0.86 0.385 1.75 0.1 0.21 0.04 0.53 1.19 0.32 0.22 ...
##  $ CitricAcid        : num [1:3335] 0.27 -0.76 0.17 1.8 0.28 -1.15 -0.53 1.14 0.55 0.31 ...
##  $ ResidualSugar     : num [1:3335] -10.7 -19.7 -33 1 1.2 1.4 4.6 31.9 -50.9 -7.7 ...
##  $ Chlorides         : num [1:3335] 0.092 1.169 0.065 -0.179 0.038 ...
##  $ FreeSulfurDioxide : num [1:3335] 23 -37 9 104 70 -250 10 115 35 40 ...
##  $ TotalSulfurDioxide: num [1:3335] 398 68 76 89 53 140 17 381 83 129 ...
##  $ Density           : num [1:3335] 0.985 0.99 1.046 0.989 1.029 ...
##  $ pH                : num [1:3335] 5.02 3.37 4.61 3.2 2.54 3.06 3.07 2.99 3.32 4.72 ...
##  $ Sulphates         : num [1:3335] 0.64 1.09 0.68 2.11 -0.07 -0.02 0.75 0.31 2.18 -0.64 ...
##  $ Alcohol           : num [1:3335] 12.3 16 8.55 12.3 4.8 11.4 8.5 11.4 -0.5 10.9 ...
##  $ LabelAppeal       : int [1:3335] -1 0 0 -1 0 1 0 1 0 0 ...
##  $ AcidIndex         : int [1:3335] 6 6 8 8 10 8 12 7 12 7 ...
##  $ STARS             : int [1:3335] NA 2 1 1 NA 4 3 NA NA NA ...

The training data set has 12,795 rows and 15 columns: 14 features and 1 response variable, TARGET. The variable INDEX is used for observed identification. Twelve of the features describe the chemical properties of wine. The remaining two predictors are rating variables: LabelAppeal refers to the perceived attractiveness of a wine’s product label, while STARS is an assessment of wine quality. The output variable, TARGET, is a count measure indicating the number of wine case purchases by distributors.

Summary Statistics
Variable	N	Mean	Std. Dev.	Min	Pctl. 25	Pctl. 75	Max
TARGET	12795	3	1.9	0	2	4	8
FixedAcidity	12795	7.1	6.3	-18	5.2	9.5	34
VolatileAcidity	12795	0.32	0.78	-2.8	0.13	0.64	3.7
CitricAcid	12795	0.31	0.86	-3.2	0.03	0.58	3.9
ResidualSugar	12179	5.4	34	-128	-2	16	141
Chlorides	12157	0.055	0.32	-1.2	-0.031	0.15	1.4
FreeSulfurDioxide	12148	31	149	-555	0	70	623
TotalSulfurDioxide	12113	121	232	-823	27	208	1057
Density	12795	0.99	0.027	0.89	0.99	1	1.1
pH	12400	3.2	0.68	0.48	3	3.5	6.1
Sulphates	11585	0.53	0.93	-3.1	0.28	0.86	4.2
Alcohol	12142	10	3.7	-4.7	9	12	26
LabelAppeal	12795	-0.0091	0.89	-2	-1	1	2
AcidIndex	12795	7.8	1.3	4	7	8	17
STARS	9436	2	0.9	1	1	3	4

Predictor Variables

Of the 14 feature columns, 8 of them such as ResidualSugar, Chlorides, FreeSulfurDioxide, TotalSulfurDioxide, pH, Sulphates, Alcohol, and STARS variables contain several missing values. LabelAppeal, AcidIndex, and STARS are discrete variables (i.e categorical) and the rest are continuous. We also noticed that several numerical features representing chemical quantities in the wine exhibit negative minimum values. We hypothesize that the original chemical measurements might have been normalized (potentially via a log transform), allowing for negative values. However, from a physical standpoint, negative concentrations shouldn’t be possible. Despite this, we’ve opted to retain these values as they are.

Response Variables

We see that the response variable, TARGET value is always between 0 and 8, which makes sense as this is the “Number of Cases of Wine Sold”. In addition, the distribution of wine cases sold, given at least one sale, exhibits symmetry and approximates normality. The target variable, number of cases, is shown below. The data shows a large number of zero values.

Variables Distributions

We observe that continuous variables exhibit a somewhat normal steep distribution. However, variables such as AcidIndex and STARS display right skewness.

In the box plot, there are not many outliers in the variables. However, TotalSulfurDioxide, FreeSulfurDioxide, and ResidualSugar variables have large ranges compared to other variables.We can tell a high number of variables have numerous outliers.

Relationship Between Categorical and Response Variables

The bar charts compare the three discrete categorical variables to the TARGET variable. AcidIndex shows large quantity of wine were sold with the index number 7 and 8. LabelAppeal shows us generic label does yield higher number of wine samples per order. Lastly, STARS shows high star wine bottles have high price tags. For each of these predictors, there appears to be a significant relationship between the ordered levels and the number of wine cases sold.

Multicolinearity

	x
TARGET	1.0000000
FixedAcidity	-0.0125381
VolatileAcidity	-0.0759979
CitricAcid	0.0023450
ResidualSugar	0.0035196
Chlorides	-0.0304301
FreeSulfurDioxide	0.0226398
TotalSulfurDioxide	0.0216021
Density	-0.0475989
pH	0.0002199
Sulphates	-0.0212204
Alcohol	0.0737771
LabelAppeal	0.4979465
AcidIndex	-0.1676431
STARS	0.5546857

In the correlation table, we can see that STARS and LabelAppeal are most positively correlated variables with the response variable. Also, we some mild negative correlation between the response variable and AcidIndex variable.

Data Preparation

Negative Values

The data has some wine quality measures that are negative that should not be. We will simply take the absolute value of these for now. The alternative would be to center by adding the min of each variable. Since we are given little information about the source of this dataset, and why these quality measures are so off, it is difficult to ascertain the best overall approach. For LabelAppeal, we will add the min.

Missing Data

According to the graph, the data set has multiple variables with missing variables. The STARS variable has the most NA values. The Sulphates variable records missing values in roughly 10% of observations, while the remaining six predictors have missing values ranging from 3% to 5%. These missing values will be imputed later on during the data preparation using the MICE package and random forest prediction method.

We can see that each of the remaining variables with missing values seem to be MAR, as the mice imputation distributions roughly match the existing. We’ll also run the mice imputation again on both the train and test set. Instead of using it for our models, however, we’ll simplify our run and fill in our data. Finally, after our analysis, we can use it in in our model, we’ll update STARS to become a factor variable.

Descriptive Summaries and Correlation Review

	Mean	Std.Dev	Min	Q1	Median	Q3	Max	MAD	IQR	CV	Skewness	SE.Skewness	Kurtosis	N.Valid	Pct.Valid
AcidIndex	7.7727237	1.3239264	4.00000	7.00000	8.00000	8.00000	17.00000	1.4826000	1.000000	0.1703298	1.6484959	0.0216523	5.1900925	12795	100
Alcohol	10.4970736	3.7271443	-4.70000	9.00000	10.40000	12.40000	26.50000	2.4462900	3.400000	0.3550651	-0.0203077	0.0216523	1.5448710	12795	100
Chlorides	0.2227006	0.2340474	0.00000	0.04600	0.09900	0.36900	1.35100	0.1008168	0.323000	1.0509511	1.4732261	0.0216523	2.1533220	12795	100
CitricAcid	0.6863150	0.6060052	0.00000	0.28000	0.44000	0.97000	3.86000	0.3261720	0.690000	0.8829841	1.6428101	0.0216523	2.9474433	12795	100
Density	0.9942027	0.0265376	0.88809	0.98772	0.99449	1.00052	1.09924	0.0093552	0.012795	0.0266924	-0.0186938	0.0216523	1.8999592	12795	100
FixedAcidity	8.0632513	4.9961186	0.00000	5.60000	7.00000	9.80000	34.40000	2.9652000	4.200000	0.6196159	1.1742806	0.0216523	1.9666235	12795	100
FreeSulfurDioxide	106.8281360	108.4394316	0.00000	28.00000	56.00000	172.00000	623.00000	60.7866000	144.000000	1.0150831	1.5281310	0.0216523	2.4394290	12795	100
LabelAppeal	1.9909340	0.8910892	0.00000	1.00000	2.00000	3.00000	4.00000	1.4826000	2.000000	0.4475735	0.0084295	0.0216523	-0.2622916	12795	100
pH	3.2069050	0.6796027	0.48000	2.95000	3.20000	3.47000	6.13000	0.3854760	0.520000	0.2119186	0.0466170	0.0216523	1.6349942	12795	100
ResidualSugar	23.4240953	24.9813809	0.00000	3.60000	12.90000	38.80000	141.15000	16.3086000	35.200000	1.0664822	1.4657347	0.0216523	2.2249837	12795	100
Sulphates	0.8446760	0.6542444	0.00000	0.43000	0.59000	1.09000	4.24000	0.3261720	0.660000	0.7745507	1.6908843	0.0216523	3.1944468	12795	100
TARGET	3.0290739	1.9263682	0.00000	2.00000	3.00000	4.00000	8.00000	1.4826000	2.000000	0.6359595	-0.3263010	0.0216523	-0.8772457	12795	100
TotalSulfurDioxide	203.8295819	162.6289281	0.00000	100.00000	154.00000	261.00000	1057.00000	102.2994000	161.000000	0.7978672	1.6287104	0.0216523	3.1504058	12795	100
VolatileAcidity	0.6410856	0.5556141	0.00000	0.25000	0.41000	0.91000	3.68000	0.3261720	0.660000	0.8666770	1.6529782	0.0216523	3.0833040	12795	100

Split the Sample data Set

With our transformations complete, we can now add these into our cleaned_train dataframe and continue on to build our models. To better measure each model performance, we split our data into a training and testing data set. We will train using the first, then measure model performance again the testing hold out set.

## [1] "Number of Training Samples:  10238"

## [1] "Number of Testing Samples:  2557"

Build Models

Poisson Regression Model 1

In this first model, we include all available features: FixedAcidity, VolatileAcidity, CitricAcid, ResidualSugar, Chlorides, FreeSulfurDioxide, TotalSulfurDioxide, Density, pH, Sulphates, Alcohol, LabelAppeal, AcidIndex, STARS

## 
## Call:
## glm(formula = TARGET ~ ., family = poisson, data = trainingData)
## 
## Coefficients:
##                         Estimate    Std. Error z value             Pr(>|z|)    
## (Intercept)         0.7892257665  0.2181427186   3.618             0.000297 ***
## FixedAcidity       -0.0000909220  0.0011721364  -0.078             0.938171    
## VolatileAcidity    -0.0338187850  0.0106085477  -3.188             0.001433 ** 
## CitricAcid          0.0132612453  0.0092961426   1.427             0.153715    
## ResidualSugar       0.0000007492  0.0002263610   0.003             0.997359    
## Chlorides          -0.0110526480  0.0244656468  -0.452             0.651440    
## FreeSulfurDioxide   0.0000887012  0.0000517645   1.714             0.086611 .  
## TotalSulfurDioxide  0.0000829237  0.0000347822   2.384             0.017122 *  
## Density            -0.2817475213  0.2128419603  -1.324             0.185589    
## pH                 -0.0188391082  0.0084099701  -2.240             0.025085 *  
## Sulphates          -0.0145419498  0.0088317235  -1.647             0.099649 .  
## Alcohol             0.0048269578  0.0015330983   3.148             0.001641 ** 
## LabelAppeal         0.1544563349  0.0068400657  22.581 < 0.0000000000000002 ***
## AcidIndex          -0.0772590165  0.0051544902 -14.989 < 0.0000000000000002 ***
## STARS1              0.7771273256  0.0218970731  35.490 < 0.0000000000000002 ***
## STARS2              1.0984642747  0.0204361382  53.751 < 0.0000000000000002 ***
## STARS3              1.2181307614  0.0215291274  56.581 < 0.0000000000000002 ***
## STARS4              1.3319344540  0.0272443113  48.889 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 18214  on 10237  degrees of freedom
## Residual deviance: 10883  on 10220  degrees of freedom
## AIC: 36525
## 
## Number of Fisher Scoring iterations: 6

##          RMSE      Rsquared           MAE           aic           bic 
##     2.5918810     0.5189506     2.2210284 36524.6844093 36654.8939175

Poisson Regression Model 2

In this Model 2, we only include the most predictive features : VolatileAcidity, TotalSulfurDioxide, Alcohol, LabelAppeal, AcidIndex, STARS

## 
## Call:
## glm(formula = TARGET ~ VolatileAcidity + TotalSulfurDioxide + 
##     Alcohol + LabelAppeal + AcidIndex + STARS, family = poisson, 
##     data = trainingData)
## 
## Coefficients:
##                       Estimate  Std. Error z value             Pr(>|z|)    
## (Intercept)         0.44778794  0.04993620   8.967 < 0.0000000000000002 ***
## VolatileAcidity    -0.03437320  0.01060849  -3.240              0.00119 ** 
## TotalSulfurDioxide  0.00008357  0.00003477   2.404              0.01623 *  
## Alcohol             0.00484363  0.00153248   3.161              0.00157 ** 
## LabelAppeal         0.15438417  0.00683771  22.578 < 0.0000000000000002 ***
## AcidIndex          -0.07690721  0.00507587 -15.152 < 0.0000000000000002 ***
## STARS1              0.77900265  0.02188616  35.593 < 0.0000000000000002 ***
## STARS2              1.10114091  0.02042280  53.917 < 0.0000000000000002 ***
## STARS3              1.22080022  0.02151477  56.742 < 0.0000000000000002 ***
## STARS4              1.33463103  0.02722954  49.014 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 18214  on 10237  degrees of freedom
## Residual deviance: 10898  on 10228  degrees of freedom
## AIC: 36523
## 
## Number of Fisher Scoring iterations: 6

##          RMSE      Rsquared           MAE           aic           bic 
##     2.5921425     0.5179013     2.2215012 36523.3146236 36595.6532392

Negative Binomial Regression Model 3

Similar to Poisson Model 1, the predictors for the following model are: FixedAcidity, VolatileAcidity, CitricAcid, ResidualSugar, Chlorides, FreeSulfurDioxide, TotalSulfurDioxide, Density, pH, Sulphates, Alcohol, LabelAppeal, AcidIndex, STARS

## 
## Call:
## glm.nb(formula = TARGET ~ ., data = trainingData, init.theta = 40596.4241, 
##     link = log)
## 
## Coefficients:
##                         Estimate    Std. Error z value             Pr(>|z|)    
## (Intercept)         0.7892550644  0.2181525397   3.618             0.000297 ***
## FixedAcidity       -0.0000909421  0.0011721890  -0.078             0.938160    
## VolatileAcidity    -0.0338199312  0.0106090117  -3.188             0.001433 ** 
## CitricAcid          0.0132615966  0.0092965767   1.427             0.153723    
## ResidualSugar       0.0000007457  0.0002263713   0.003             0.997372    
## Chlorides          -0.0110528191  0.0244667637  -0.452             0.651450    
## FreeSulfurDioxide   0.0000887052  0.0000517669   1.714             0.086611 .  
## TotalSulfurDioxide  0.0000829281  0.0000347838   2.384             0.017121 *  
## Density            -0.2817517962  0.2128516115  -1.324             0.185603    
## pH                 -0.0188403506  0.0084103528  -2.240             0.025082 *  
## Sulphates          -0.0145424359  0.0088321177  -1.647             0.099653 .  
## Alcohol             0.0048269046  0.0015331678   3.148             0.001642 ** 
## LabelAppeal         0.1544552702  0.0068403734  22.580 < 0.0000000000000002 ***
## AcidIndex          -0.0772613586  0.0051546889 -14.989 < 0.0000000000000002 ***
## STARS1              0.7771263073  0.0218975340  35.489 < 0.0000000000000002 ***
## STARS2              1.0984632972  0.0204365988  53.750 < 0.0000000000000002 ***
## STARS3              1.2181305180  0.0215297110  56.579 < 0.0000000000000002 ***
## STARS4              1.3319351604  0.0272454888  48.886 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(40596.42) family taken to be 1)
## 
##     Null deviance: 18213  on 10237  degrees of freedom
## Residual deviance: 10882  on 10220  degrees of freedom
## AIC: 36527
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  40596 
##           Std. Err.:  37984 
## Warning while fitting theta: iteration limit reached 
## 
##  2 x log-likelihood:  -36489.02

##          RMSE      Rsquared           MAE           aic           bic 
##     2.5918809     0.5189507     2.2210281 36527.0228900 36664.4662598

Negative Binomial Regression Model 4

Similar to Poisson Model 2, the predictors for the following model are: VolatileAcidity, FreeSulfurDioxide, TotalSulfurDioxide, Alcohol, LabelAppeal, AcidIndex, STARS

## 
## Call:
## glm.nb(formula = TARGET ~ VolatileAcidity + FreeSulfurDioxide + 
##     TotalSulfurDioxide + Alcohol + LabelAppeal + AcidIndex + 
##     STARS, data = cleaned_train, init.theta = 40655.73652, link = log)
## 
## Coefficients:
##                       Estimate  Std. Error z value             Pr(>|z|)    
## (Intercept)         0.47792499  0.04482061  10.663 < 0.0000000000000002 ***
## VolatileAcidity    -0.03707089  0.00939632  -3.945            0.0000797 ***
## FreeSulfurDioxide   0.00005855  0.00004663   1.256              0.20927    
## TotalSulfurDioxide  0.00008273  0.00003122   2.650              0.00805 ** 
## Alcohol             0.00375994  0.00137412   2.736              0.00621 ** 
## LabelAppeal         0.15910515  0.00612466  25.978 < 0.0000000000000002 ***
## AcidIndex          -0.08030228  0.00449884 -17.850 < 0.0000000000000002 ***
## STARS1              0.77091345  0.01952986  39.474 < 0.0000000000000002 ***
## STARS2              1.09217216  0.01820457  59.994 < 0.0000000000000002 ***
## STARS3              1.21280507  0.01917562  63.247 < 0.0000000000000002 ***
## STARS4              1.32891964  0.02427990  54.733 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(40655.74) family taken to be 1)
## 
##     Null deviance: 22860  on 12794  degrees of freedom
## Residual deviance: 13684  on 12784  degrees of freedom
## AIC: 45651
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  40656 
##           Std. Err.:  34184 
## Warning while fitting theta: iteration limit reached 
## 
##  2 x log-likelihood:  -45626.74

##          RMSE      Rsquared           MAE           aic           bic 
##     2.5916959     0.5191463     2.2224820 45650.7381419 45740.2198589

Multiple Linear Regression Model 5

The predictors for the following model are: FixedAcidity, VolatileAcidity, CitricAcid, ResidualSugar, Chlorides, FreeSulfurDioxide, TotalSulfurDioxide, Density, pH, Sulphates, Alcohol, LabelAppeal, AcidIndex, STARS

## 
## Call:
## lm(formula = TARGET ~ ., data = trainingData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.7500 -0.8682  0.0239  0.8519  6.2500 
## 
## Coefficients:
##                       Estimate  Std. Error t value             Pr(>|t|)    
## (Intercept)         2.76395948  0.49318708   5.604         0.0000000215 ***
## FixedAcidity        0.00046808  0.00263790   0.177              0.85916    
## VolatileAcidity    -0.10072430  0.02355649  -4.276         0.0000192128 ***
## CitricAcid          0.04269888  0.02142398   1.993              0.04628 *  
## ResidualSugar      -0.00023949  0.00051519  -0.465              0.64204    
## Chlorides          -0.02668630  0.05551780  -0.481              0.63075    
## FreeSulfurDioxide   0.00027371  0.00011899   2.300              0.02145 *  
## TotalSulfurDioxide  0.00025870  0.00007956   3.251              0.00115 ** 
## Density            -0.82949711  0.48373374  -1.715              0.08642 .  
## pH                 -0.05238229  0.01912540  -2.739              0.00618 ** 
## Sulphates          -0.03737697  0.01981556  -1.886              0.05929 .  
## Alcohol             0.01574741  0.00347893   4.527         0.0000060642 ***
## LabelAppeal         0.45188116  0.01526953  29.594 < 0.0000000000000002 ***
## AcidIndex          -0.19406714  0.01029532 -18.850 < 0.0000000000000002 ***
## STARS1              1.38165553  0.03688755  37.456 < 0.0000000000000002 ***
## STARS2              2.41499920  0.03585743  67.350 < 0.0000000000000002 ***
## STARS3              2.98625589  0.04153429  71.899 < 0.0000000000000002 ***
## STARS4              3.66886490  0.06626609  55.366 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.309 on 10220 degrees of freedom
## Multiple R-squared:  0.5386, Adjusted R-squared:  0.5378 
## F-statistic: 701.6 on 17 and 10220 DF,  p-value: < 0.00000000000000022

##          RMSE      Rsquared           MAE           aic           bic 
##     1.3075630     0.5385528     1.0276638 34583.1334169 34720.5767867

Multiple Linear Regression Model 6

For the final Linear Model, we leverage stepAIC on our Linear Model 5 to choose the most important features.

## 
## Call:
## lm(formula = TARGET ~ VolatileAcidity + CitricAcid + FreeSulfurDioxide + 
##     TotalSulfurDioxide + Density + pH + Sulphates + Alcohol + 
##     LabelAppeal + AcidIndex + STARS, data = trainingData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.7399 -0.8674  0.0238  0.8526  6.2554 
## 
## Coefficients:
##                       Estimate  Std. Error t value             Pr(>|t|)    
## (Intercept)         2.75688975  0.49276487   5.595         0.0000000227 ***
## VolatileAcidity    -0.10084579  0.02355283  -4.282         0.0000187177 ***
## CitricAcid          0.04273660  0.02141695   1.995              0.04602 *  
## FreeSulfurDioxide   0.00027420  0.00011897   2.305              0.02120 *  
## TotalSulfurDioxide  0.00025796  0.00007954   3.243              0.00119 ** 
## Density            -0.83195073  0.48361572  -1.720              0.08541 .  
## pH                 -0.05240947  0.01912021  -2.741              0.00613 ** 
## Sulphates          -0.03743763  0.01980990  -1.890              0.05881 .  
## Alcohol             0.01575533  0.00347847   4.529         0.0000059825 ***
## LabelAppeal         0.45192875  0.01526602  29.604 < 0.0000000000000002 ***
## AcidIndex          -0.19386468  0.01013673 -19.125 < 0.0000000000000002 ***
## STARS1              1.38228824  0.03686744  37.493 < 0.0000000000000002 ***
## STARS2              2.41529060  0.03584458  67.382 < 0.0000000000000002 ***
## STARS3              2.98644267  0.04152478  71.920 < 0.0000000000000002 ***
## STARS4              3.66914480  0.06625020  55.383 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.309 on 10223 degrees of freedom
## Multiple R-squared:  0.5385, Adjusted R-squared:  0.5379 
## F-statistic: 852.2 on 14 and 10223 DF,  p-value: < 0.00000000000000022

##          RMSE      Rsquared           MAE           aic           bic 
##     1.3075933     0.5385314     1.0277302 34577.6091408 34693.3509259

Select Models

	RMSE	Rsquared	MAE	aic	bic
Poission_Evaluate1	2.591881	0.5189506	2.221028	36524.68	36654.89
Poission_Evaluate2	2.592143	0.5179013	2.221501	36523.31	36595.65
Negative_Binomial_eval3	2.591881	0.5189507	2.221028	36527.02	36664.47
Negative_Binomial_eval4	2.591696	0.5191463	2.222482	45650.74	45740.22
Linear_Regression_eval5	1.307563	0.5385528	1.027664	34583.13	34720.58
Linear_Regression_eval6	1.307593	0.5385314	1.027730	34577.61	34693.35

This table summarizes the RMSE, RSQUARED, MAE, AIC and BIC for all SIX models. The Linear regressions (Linear Model 5 and Linear Model 6) had the overall best performance based on RMSE and RSQUARED; as well as Linear Model 6 had the best performance based on AIC and BIC.

Finally, we chose Multiple Linear Regression Model 6 as our final model since it had a far lower AIC and BIC.

Prediction

##   FixedAcidity VolatileAcidity CitricAcid ResidualSugar Chlorides
## 1          5.4           0.860       0.27          10.7     0.092
## 2         12.4           0.385       0.76          19.7     1.169
## 3          7.2           1.750       0.17          33.0     0.065
## 4          6.2           0.100       1.80           1.0     0.179
## 5         11.4           0.210       0.28           1.2     0.038
## 6         17.6           0.040       1.15           1.4     0.535
##   FreeSulfurDioxide TotalSulfurDioxide Density   pH Sulphates Alcohol
## 1                23                398 0.98527 5.02      0.64   12.30
## 2                37                 68 0.99048 3.37      1.09   16.00
## 3                 9                 76 1.04641 4.61      0.68    8.55
## 4               104                 89 0.98877 3.20      2.11   12.30
## 5                70                 53 1.02899 2.54      0.07    4.80
## 6               250                140 0.95028 3.06      0.02   11.40
##   LabelAppeal AcidIndex STARS    TARGET
## 1           1         6     0 1.1664538
## 2           2         6     2 4.1448181
## 3           2         8     1 2.3420568
## 4           1         8     1 2.2829844
## 5           2        10     0 0.8295706
## 6           3         8     4 5.6085833

The histogram shows that our predictions have a similar shape to our training Target variable, the means and medians are almost identical, and the kurtosis values are close.

The predicted file is uploaded to Github: https://github.com/waheeb123/Data-621/blob/main/Homeworks/Homework%205/DATA621_HW5_Predictions.csv

Conclusions

The Linear Regression model that we chose as the best model has an adjusted R2 value of 0.53. All predictors and levels within each categorical predictor are significant at the 5% level. The model coefficients make intuitive sense. Sales of wine decrease with each increase in the AcidIndex value. Having a STARS rating results in more sales compared to no ratings, and higher STARS ratings are associated with higher sales. Finally, increases in alcohol content are associated with higher sales.The interpretation of the coefficients aligns with the insights gained during exploratory data analysis.