Overview

In this homework assignment we explored, analyzed and modeled a data set containing information on approximately 12,000 commercially available wines. The variables were mostly related to the chemical properties of the wine being sold. The response variable was the number of sample cases of wine that were purchased by wine distribution companies after sampling the wine.

Initially, we examined the data for any problems that may have existed, such as missing data, outliers, and multi-collinearity. Next we took the necessary steps to clean the data, and built two Poisson regression, two negative binomial regression, and two multiple linear regression models using the training dataset.

We trained the models and evaluated them based on how well they performed against the provided evaluation data. Finally, we selected a final model that provided the best balance between accuracy and simplicity to predict the number of cases of wine sold given certain properties of the wine.


1. Data Exploration

The training dataset contained 12,795 observations of 14 predictor variables, where each record represented a commercially available wine.

These variables included measures of acidity and amounts of various chemical compounds, as well as qualitative and marketing-related data such as reviewer stars and consumer responses to label design.

The prediction dataset contained 3335 observations over the same predictor variables.

  • Target: Number of Cases Purchased
  • AcidIndex: Proprietary method of testing total acidity of wine by using a weighted average
  • Alcohol: Alcohol Content
  • Chlorides: Chloride content of wine
  • CitricAcid: Citric Acid Content
  • Density: Density of Wine
  • FixedAcidity: Fixed Acidity of Wine
  • FreeSulfurDioxide: Free Sulfur Dioxide content of wine
  • LabelAppeal: Marketing Score indicating the appeal of label design for consumers. High numbers suggest customers like the label design. Negative numbers suggest customers don’t like the design.
  • ResidualSugar: Residual Sugar of wine
  • Stars: Wine rating by a team of experts. 4 Stars = Excellent, 1 Star = Poor. A high number of stars suggests high sales
  • Sulphates: Sulfate content of wine
  • TotalSulfurDioxide: Total Sulfur Dioxide of Wine
  • VolatileAcidity: Volatile Acid content of wine
  • pH: pH of wine

1.1 Summary Statistics

To explore summary statistics and distribution characteristics of our dataset, we first needed to conduct some basic transformations and cleanup:

  • The ‘training’ dataset contained a single response variable target, a numeric variable indicating the number of cases purchased.
  • The ‘prediction’ dataset contained no values for target, suggesting this data was intended for prediction rather than validation and evaluation of model performance. For clarity we renamed this dataset ‘prediction’ and created a separate validation hold-out from the training data.
  • There was a numeric index column labeling the observations which could be excluded from the models.

While exploring this data, we made the following observations:

  • Data contained only numeric values.
  • 4 variables were discrete (stars, labelappeal, acidindex, target).
  • 8 variables out of 14 contained missing values.
  • target (number of cases purchased) varied between 0 and 8.
Summary Statistics
variable complete_rate n_missing min max
acidindex 1.00 0 4.00 17.00
alcohol 0.95 838 -4.70 26.50
chlorides 0.95 776 -1.17 1.35
citricacid 1.00 0 -3.24 3.86
density 1.00 0 0.89 1.10
fixedacidity 1.00 0 -18.20 34.40
freesulfurdioxide 0.95 799 -563.00 623.00
labelappeal 1.00 0 -2.00 2.00
ph 0.97 499 0.48 6.21
residualsugar 0.95 784 -128.30 145.40
stars 0.74 4200 1.00 4.00
sulphates 0.91 1520 -3.13 4.24
totalsulfurdioxide 0.95 839 -823.00 1057.00
volatileacidity 1.00 0 -2.83 3.68

1.2 Distribution

One of the first characteristics that stood out was the presence of negative values for many chemical compounds, and the relative normality of their distributions. This suggested they had already been power-transformed to produce normal distributions for modeling.

Variables related to sugars, chlorides, acidity, sulfides and sulfates all seemed to fall in this category. Considering that we are analyzing very small amounts of chemical compounds, we might reasonably assume their natural distributions would be highly skewed.

The variable acidindex (proprietary method of testing total acidity of wine by using a weighted average) seemed to be slightly right-skewed.

The histogram of the target variable showed a mostly normal distribution, though we considered the possibility of using zero-inflated models to handle the high frequency of zero values.
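
As a rough check of how many zeros we were dealing with, something like the following could be run against the raw training data (a sketch; df_train and TARGET as in the appendix code):

# share and frequency of observations with zero cases purchased
mean(df_train$TARGET == 0)
table(df_train$TARGET)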


1.3 Boxplots

Graphing our variables’ distributions with boxplots didn’t identify any outliers we needed to deal with.

We did see a relationship between stars/labelappeal and the target variable. As labelappeal increased, the target variable also went up, suggesting a positive relationship between label design appeal and cases purchased.

We also noticed that missing/NA values for stars were associated with low values for target. Using the hint from the assignment that “sometimes, the fact that a variable is missing is actually predictive of the target”, we neither discarded these observations nor imputed new values for them.
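
A simple grouped summary illustrates the pattern (a sketch against the raw training data; column names as in the source file):

# mean cases purchased for wines with and without a STARS rating
df_train %>%
  group_by(stars_missing = is.na(STARS)) %>%
  summarise(mean_target = mean(TARGET), n = n())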


1.4 Scatter Plots

Graphing our variables against the target with scatterplots helped us understand relationships between our independent variables and the target variable.

With the target being discrete, it was difficult to check linearity of relationships in the data. However, stars and labelappeal variables seemed to have some positive association with the target.


1.5 Correlation Matrix

In the correlation plot below, we observed that the variables Stars and LabelAppeal were the most positively correlated with the response variable. There was a slight negative correlation between AcidIndex and the target variable. In terms of multicollinearity, we didn’t see high correlations between predictors, so we likely wouldn’t need to address it in our models.


2. Data Preparation

2.1 Transformations, Outliers

The field index was removed from the data as it didn’t affect anything in the models, simply being a table index of an observation.

We tried back-transforming the variables related to sugars, chlorides, acidity, sulfides and sulfates by exponentiating with the natural base and other bases, but did not arrive at an obvious or consistent transformation approach, so we may not be able to interpret model results on the scale of the original values for these variables.

A number of chemical variables had negative values that seemed somewhat implausible. However, we decided to trust that these variables had undergone log transformation and that the values could be considered valid.

2.2 Missing Data

Next we decided to find and impute any missing data. There were 8 predictor variables that contained NAs:

Missing Data
is_na pct
stars 4200 0.26
sulphates 1520 0.09
residualsugar 784 0.05
chlorides 776 0.05
freesulfurdioxide 799 0.05
totalsulfurdioxide 839 0.05
alcohol 838 0.05
ph 499 0.03

Heeding the warning in the assignment, “sometimes, the fact that a variable is missing is actually predictive of the target”, we considered each of these variables carefully. While there could be data “missing completely at random” (MCAR) that we wished to impute, this may not always have been the case.


2.2.1 Missing Data - Stars

The predictor Stars suggested that out of 16,000 wine samples, about 25% had never been professionally reviewed. If we assumed the existence of a review had some impact on the sales of a wine brand (whatever the reviewer’s sentiment), then imputing mean or predicted values here might distort our model.

Therefore, we simply preserved the NAs for Stars and transformed the variable to a factor. Further, we converted these NA values to a ‘0’ factor level when applying the zero-inflated models, below.


2.2.2 Missing Data - Chemical Compounds

Next we considered some of the missing chemical compounds in our wines; alcohol, sugars, chlorides, sulfites and sulfates, and measures such as ph.

First, we assumed that all wines in this dataset had an actual pH greater than zero (a pH of zero would represent extreme acidity, on the level of powerful industrial acids), so rather than filling missing values with zero we decided to impute more reasonable values.

Based on some reading into the organic wines segment, there is a growing demand in the market for specialty products such as low-sulfite, low-sugar and low-alcohol wines. However, this still represents a very small segment of the overall market, and chemically it was not likely for these compounds to be completely absent from the final product.

Additionally, the predictors freesulfurdioxide and totalsulfurdioxide were linked - the amount of ‘Free’ SO2 in wine is always a subset of the ‘Total’ SO2 present. We only identified 59 cases where both these values were NA, while over 1500 cases had missing values for only one or the other.
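
These overlap counts can be verified directly (a sketch against the combined dataset df_all, after the column names were lower-cased):

# overlap of missing values between the two sulfur dioxide measures
sum(is.na(df_all$freesulfurdioxide) & is.na(df_all$totalsulfurdioxide))      # both missing
sum(xor(is.na(df_all$freesulfurdioxide), is.na(df_all$totalsulfurdioxide)))  # exactly one missing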

Based on these observations, we used the MICE imputation method to predict and impute the missing values for residualsugar, chlorides, freesulfurdioxide, totalsulfurdioxide, sulphates, alcohol and ph.

Target/source labels and non-chemical predictors labelappeal and stars were excluded as predictors for the imputation.


2.3 Data Sparseness - Label Appeal

labelappeal was a numeric score of consumer ratings for a wine brand’s label design. It had also been pre-transformed to produce a normal distribution for modeling; however this was a very sparse variable with nearly half the cases having a value of zero.

This could have been a candidate for handling with zero-inflated models as well, so we didn’t change the values but simply converted labelappeal from a numeric to a factor.


2.4 Examine Final Dataset

After completing the previous operations, we observed reasonably-imputed values and nearly-normal distributions for our numeric predictors, taking special note of the frequency of zero values for labelappeal and stars.

Final Dataset - Missing and Zeros
variable n_missing n_zero
acidindex 0 0
alcohol 0 5
chlorides 0 8
citricacid 0 151
density 0 0
fixedacidity 0 47
freesulfurdioxide 0 12
labelappeal 0 7087
ph 0 0
residualsugar 0 6
stars 0 4200
sulphates 0 28
totalsulfurdioxide 0 11
volatileacidity 0 22


2.5 Split Datasets

With transformations complete, we split our data back into training and prediction datasets based on our source_flag, and created a 15% validation hold-out from the training data.


3. Build Models

3.1 Poisson Regression 1

Poisson Regression assumes that the variance and mean of the dependent variable (target) are roughly equal, otherwise we may be looking at over- or under-dispersion. The data used for the models had no missing values (stars=NA was substituted with zeroes), while Label Appeal and Acid Index were transformed to factors.

pr1 <- glm(target ~ ., family = 'poisson', data = try_train)
## 
## Call:
## glm(formula = target ~ ., family = "poisson", data = try_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2338  -0.6656  -0.0058   0.4475   3.6797  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         1.441e+00  4.942e-01   2.917  0.00354 ** 
## fixedacidity        2.069e-04  8.935e-04   0.232  0.81691    
## volatileacidity    -2.918e-02  7.114e-03  -4.102 4.09e-05 ***
## citricacid          5.365e-03  6.388e-03   0.840  0.40101    
## residualsugar      -1.716e-05  1.650e-04  -0.104  0.91718    
## chlorides          -3.300e-02  1.744e-02  -1.892  0.05844 .  
## freesulfurdioxide   8.500e-05  3.726e-05   2.281  0.02254 *  
## totalsulfurdioxide  7.665e-05  2.410e-05   3.180  0.00147 ** 
## density            -2.640e-01  2.078e-01  -1.271  0.20388    
## ph                 -1.029e-02  8.230e-03  -1.250  0.21114    
## sulphates          -1.174e-02  5.935e-03  -1.978  0.04796 *  
## alcohol             4.148e-03  1.497e-03   2.770  0.00560 ** 
## labelappeal-1       2.340e-01  4.089e-02   5.722 1.05e-08 ***
## labelappeal0        4.230e-01  3.987e-02  10.612  < 2e-16 ***
## labelappeal1        5.542e-01  4.055e-02  13.667  < 2e-16 ***
## labelappeal2        7.004e-01  4.580e-02  15.292  < 2e-16 ***
## acidindex5         -5.730e-01  4.528e-01  -1.265  0.20570    
## acidindex6         -5.196e-01  4.483e-01  -1.159  0.24644    
## acidindex7         -5.601e-01  4.481e-01  -1.250  0.21128    
## acidindex8         -5.894e-01  4.481e-01  -1.315  0.18841    
## acidindex9         -7.053e-01  4.484e-01  -1.573  0.11572    
## acidindex10        -8.614e-01  4.492e-01  -1.918  0.05515 .  
## acidindex11        -1.207e+00  4.519e-01  -2.671  0.00756 ** 
## acidindex12        -1.235e+00  4.568e-01  -2.704  0.00684 ** 
## acidindex13        -1.030e+00  4.591e-01  -2.243  0.02487 *  
## acidindex14        -1.141e+00  4.717e-01  -2.418  0.01561 *  
## acidindex15        -9.209e-01  5.401e-01  -1.705  0.08820 .  
## acidindex16        -1.102e+00  6.332e-01  -1.740  0.08185 .  
## acidindex17        -1.614e+00  6.334e-01  -2.548  0.01083 *  
## stars2              3.267e-01  1.559e-02  20.949  < 2e-16 ***
## stars3              4.427e-01  1.695e-02  26.124  < 2e-16 ***
## stars4              5.613e-01  2.352e-02  23.863  < 2e-16 ***
## stars0             -7.480e-01  2.122e-02 -35.257  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 19539  on 10865  degrees of freedom
## Residual deviance: 11596  on 10833  degrees of freedom
## AIC: 38714
## 
## Number of Fisher Scoring iterations: 6
AIC 38713.99
Dispersion 0.89
Log-Lik -19323.99

Stepwise selection helped to reduce the AIC metric.

Observations 10866
Dependent variable target
Type Generalized linear model
Family poisson
Link log
χ²(27) 7938.37
Pseudo-R² (Cragg-Uhler) 0.53
Pseudo-R² (McFadden) 0.17
AIC 38708.01
BIC 38912.22
Est. S.E. z val. p
(Intercept) 1.14 0.45 2.54 0.01
volatileacidity -0.03 0.01 -4.15 0.00
chlorides -0.03 0.02 -1.91 0.06
freesulfurdioxide 0.00 0.00 2.29 0.02
totalsulfurdioxide 0.00 0.00 3.16 0.00
sulphates -0.01 0.01 -1.96 0.05
alcohol 0.00 0.00 2.82 0.00
labelappeal-1 0.23 0.04 5.72 0.00
labelappeal0 0.42 0.04 10.60 0.00
labelappeal1 0.55 0.04 13.67 0.00
labelappeal2 0.70 0.05 15.27 0.00
acidindex5 -0.57 0.45 -1.26 0.21
acidindex6 -0.51 0.45 -1.15 0.25
acidindex7 -0.55 0.45 -1.24 0.22
acidindex8 -0.58 0.45 -1.30 0.19
acidindex9 -0.70 0.45 -1.56 0.12
acidindex10 -0.85 0.45 -1.90 0.06
acidindex11 -1.20 0.45 -2.65 0.01
acidindex12 -1.23 0.46 -2.69 0.01
acidindex13 -1.02 0.46 -2.23 0.03
acidindex14 -1.13 0.47 -2.40 0.02
acidindex15 -0.91 0.54 -1.68 0.09
acidindex16 -1.09 0.63 -1.72 0.08
acidindex17 -1.61 0.63 -2.55 0.01
stars2 0.33 0.02 20.98 0.00
stars3 0.44 0.02 26.17 0.00
stars4 0.56 0.02 23.88 0.00
stars0 -0.75 0.02 -35.28 0.00
Standard errors: MLE
AIC 38708.01
Dispersion 0.89
Log-Lik -19326.00

Model Performance

We noted that our model generated ‘dummies’ from our categorical variables labelappeal, stars, and acidindex, and of the 14 total predictors, all but 5 had statistical significance.

The following variables had highly positive coefficients in our model, indicating a higher number of cases purchased. High reviewer ratings and label appeal seemed to lead to better sales:

Positive Coefficients
Est p
(Intercept) 1.141 0.011
labelappeal2 0.699 0.000
stars4 0.562 0.000
labelappeal1 0.554 0.000
stars3 0.443 0.000
labelappeal0 0.423 0.000
stars2 0.327 0.000
labelappeal-1 0.234 0.000

Variables with highly negative coefficients indicated a lower number of cases purchased. Lack of reviewer ratings and high levels of acidity seemed to lead to lower sales:

Negative Coefficients
Est p
acidindex17 -1.614 0.011
acidindex12 -1.226 0.007
acidindex11 -1.198 0.008
acidindex14 -1.132 0.016
acidindex13 -1.021 0.026
acidindex10 -0.853 0.057
stars0 -0.748 0.000

The null deviance of 19,539 defined how well the target variable could be predicted by a model with only an intercept term.

The residual deviance of 11,600 defined how well the target variable could be predicted by the AIC model that we fit with the predictor variables listed above. The lower the value, the better the model’s predictions of the response variable.

The p-value associated with this Chi-Square Statistic was 0 (less than .05), so the model could be useful.
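
This p-value corresponds to a likelihood-ratio (deviance) chi-square test against the intercept-only model, which can be computed from the fitted object (a sketch, using the stepwise model pr1_aic from the appendix code):

chisq_stat <- pr1_aic$null.deviance - pr1_aic$deviance
chisq_df   <- pr1_aic$df.null - pr1_aic$df.residual
pchisq(chisq_stat, df = chisq_df, lower.tail = FALSE)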

The Akaike information criterion (AIC) was 38,708. The lower the AIC value, the better the model’s ability to fit the data. We used this metric to compare the relative performance of the six models, below.

Notably, our Dispersion Parameter was 0.89, which suggested a degree of under-dispersion in the data.

Model Assumptions

We evaluated the modeling assumptions using standard diagnostic plots, and a Variance Inflation Factor (VIF) to assess collinearity.

The first plot below (predicted values vs. residuals) showed that many of the standardized residuals fell outside the range of -2 to 2. Examining the standardized Pearson residuals plot, the model seemed adequate through the middle range but showed extremes on the right. In general, this model didn’t fit our data well because of the zero counts in the response variable, so we had to consider models that could address this issue.

All variables had a VIF less than 5. As a result, multi-collinearity may not have been a problem for this model.

Variance Inflation Factor
GVIF df GVIF^(1/(2*df))
stars 1.175 4 1.020
labelappeal 1.139 4 1.016
acidindex 1.051 13 1.002
alcohol 1.011 1 1.006
volatileacidity 1.006 1 1.003
totalsulfurdioxide 1.004 1 1.002
freesulfurdioxide 1.004 1 1.002
chlorides 1.003 1 1.002
sulphates 1.002 1 1.001
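
The (generalized) VIF values above can be reproduced with car::vif, which reports GVIF, df and GVIF^(1/(2*df)) for models containing factor terms (a sketch, assuming the stepwise model pr1_aic):

car::vif(pr1_aic)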

3.2 Poisson Regression 2

Sometimes the number of zeroes / missing values in a dataset can distort the results of a normal count regression. We built a zero-inflated Poisson model to handle these values in our labelappeal and stars predictors and the target response, to see if we could improve model accuracy. By default, a binomial model with a logit link was used for the zero component and a Poisson with a log link was used for the count component.

pr2 <- zeroinfl(target ~ .| ., data=try_train, dist = 'poisson')
## 
## Call:
## zeroinfl(formula = target ~ . | ., data = try_train, dist = "poisson")
## 
## Pearson residuals:
##       Min        1Q    Median        3Q       Max 
## -2.280483 -0.427363 -0.001335  0.382451  4.806834 
## 
## Count model coefficients (poisson with log link):
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         8.536e-01  5.282e-01   1.616   0.1060    
## fixedacidity        3.491e-04  9.204e-04   0.379   0.7044    
## volatileacidity    -1.328e-02  7.299e-03  -1.819   0.0689 .  
## citricacid          1.045e-03  6.513e-03   0.160   0.8726    
## residualsugar      -3.229e-05  1.693e-04  -0.191   0.8487    
## chlorides          -1.901e-02  1.788e-02  -1.063   0.2878    
## freesulfurdioxide   1.822e-05  3.773e-05   0.483   0.6291    
## totalsulfurdioxide -7.294e-06  2.398e-05  -0.304   0.7610    
## density            -2.421e-01  2.146e-01  -1.128   0.2592    
## ph                  3.791e-03  8.469e-03   0.448   0.6544    
## sulphates          -5.903e-04  6.083e-03  -0.097   0.9227    
## alcohol             7.090e-03  1.527e-03   4.643 3.43e-06 ***
## labelappeal-1       4.238e-01  4.446e-02   9.531  < 2e-16 ***
## labelappeal0        7.118e-01  4.345e-02  16.381  < 2e-16 ***
## labelappeal1        8.998e-01  4.417e-02  20.373  < 2e-16 ***
## labelappeal2        1.065e+00  4.919e-02  21.640  < 2e-16 ***
## acidindex5         -1.837e-01  4.832e-01  -0.380   0.7038    
## acidindex6         -1.474e-01  4.789e-01  -0.308   0.7583    
## acidindex7         -1.715e-01  4.787e-01  -0.358   0.7201    
## acidindex8         -1.865e-01  4.788e-01  -0.390   0.6968    
## acidindex9         -2.207e-01  4.790e-01  -0.461   0.6450    
## acidindex10        -2.846e-01  4.799e-01  -0.593   0.5531    
## acidindex11        -2.811e-01  4.827e-01  -0.582   0.5603    
## acidindex12        -1.959e-01  4.878e-01  -0.402   0.6879    
## acidindex13        -1.157e-01  4.898e-01  -0.236   0.8133    
## acidindex14        -1.237e-01  5.034e-01  -0.246   0.8059    
## acidindex15        -1.059e-01  5.706e-01  -0.186   0.8527    
## acidindex16         5.334e-03  6.560e-01   0.008   0.9935    
## acidindex17        -3.414e-01  6.562e-01  -0.520   0.6029    
## stars2              1.249e-01  1.629e-02   7.666 1.78e-14 ***
## stars3              2.190e-01  1.756e-02  12.466  < 2e-16 ***
## stars4              3.154e-01  2.404e-02  13.123  < 2e-16 ***
## stars0             -5.830e-02  2.293e-02  -2.543   0.0110 *  
## 
## Zero-inflation model coefficients (binomial with logit link):
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        -6.507e+00  2.126e+00  -3.060 0.002211 ** 
## fixedacidity        1.544e-03  6.049e-03   0.255 0.798595    
## volatileacidity     1.563e-01  4.657e-02   3.356 0.000791 ***
## citricacid         -3.469e-02  4.288e-02  -0.809 0.418424    
## residualsugar      -3.322e-04  1.103e-03  -0.301 0.763305    
## chlorides           8.908e-02  1.166e-01   0.764 0.444853    
## freesulfurdioxide  -7.064e-04  2.539e-04  -2.782 0.005399 ** 
## totalsulfurdioxide -9.312e-04  1.602e-04  -5.814 6.10e-09 ***
## density             7.630e-01  1.421e+00   0.537 0.591369    
## ph                  1.800e-01  5.452e-02   3.302 0.000961 ***
## sulphates           1.243e-01  3.968e-02   3.134 0.001725 ** 
## alcohol             2.800e-02  1.003e-02   2.792 0.005241 ** 
## labelappeal-1       1.395e+00  3.402e-01   4.100 4.13e-05 ***
## labelappeal0        2.145e+00  3.376e-01   6.353 2.10e-10 ***
## labelappeal1        2.850e+00  3.440e-01   8.286  < 2e-16 ***
## labelappeal2        3.228e+00  4.007e-01   8.057 7.82e-16 ***
## acidindex5          9.117e-01  1.619e+00   0.563 0.573287    
## acidindex6          5.455e-01  1.555e+00   0.351 0.725722    
## acidindex7          7.854e-01  1.551e+00   0.506 0.612620    
## acidindex8          9.631e-01  1.551e+00   0.621 0.534713    
## acidindex9          1.664e+00  1.554e+00   1.071 0.284115    
## acidindex10         1.960e+00  1.558e+00   1.258 0.208458    
## acidindex11         3.195e+00  1.569e+00   2.036 0.041718 *  
## acidindex12         3.414e+00  1.593e+00   2.143 0.032115 *  
## acidindex13         3.658e+00  1.630e+00   2.244 0.024809 *  
## acidindex14         2.872e+00  1.628e+00   1.764 0.077813 .  
## acidindex15         2.883e+00  1.961e+00   1.470 0.141514    
## acidindex16         1.525e+01  7.389e+02   0.021 0.983529    
## acidindex17         1.476e+01  6.166e+02   0.024 0.980900    
## stars2             -3.696e+00  3.467e-01 -10.660  < 2e-16 ***
## stars3             -2.673e+01  8.194e+02  -0.033 0.973978    
## stars4             -1.960e+01  1.295e+03  -0.015 0.987921    
## stars0              2.061e+00  8.268e-02  24.925  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Number of iterations in BFGS optimization: 72 
## Log-likelihood: -1.724e+04 on 66 Df
AIC 34615.12
Dispersion 0.45
Log-Lik -17241.56

Model Performance

The results for the positive/negative coefficients seemed strange in the zero-inflation model. We should remember that our zero-inflated model produced two sets of coefficients — one for the Poisson (count) portion of the model and one for the binary (zero-inflation) portion of the model.

The Poisson portion of the model showed that high reviewer ratings and label appeal seemed to lead to better sales, while high acidity reduced them (which agreed with our first Poisson model).

The binary response (zero-inflated) portion of the model indicated the lack of reviewer ratings (stars) decreased the chances of purchase.

The p-value associated with this Chi-Square Statistic was 0 (less than .05), so the model could be useful.

Using a zero-inflated model, the dispersion parameter dropped significantly, but we obtained a better overall result for counts of 3 or more.
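
One way to see this is to compare observed count frequencies with the frequencies implied by the zero-inflated fit (a sketch; predict(..., type = "prob") from pscl returns per-observation probabilities for each count value 0 through 8):

obs <- table(try_train$target)                # observed frequencies
exp <- colSums(predict(pr2, type = "prob"))   # model-implied expected frequencies
round(rbind(observed = obs, expected = exp), 1)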

Model Assumptions

The resulting plot below (predicted values vs. residuals) showed that many of the standardized residuals fell outside the range of -2 and 2.

To use Poisson models, we assume the mean and variance of the target variable are roughly equal - however we observed that assumption was violated here. A negative binomial model seemed more appropriate since it allows the variance to be larger than the mean.

Mean Variance
3.021 3.724
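
These figures can be reproduced directly from the prepared training data (a sketch):

c(mean = mean(try_train$target), variance = var(try_train$target))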

3.3 Negative Binomial Regression 1

Generally, we would use Negative Binomial Regression in cases of over-dispersion (where the variance of our dependent variable is significantly greater than the mean). This did not appear to be the case with our dataset, where there was a small difference between the mean and the variance, but we applied it here and examined the results:

nb1 <- glm.nb(target ~ ., data = try_train)
## 
## Call:
## glm.nb(formula = target ~ ., data = try_train, init.theta = 40448.60337, 
##     link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2337  -0.6655  -0.0058   0.4475   3.6795  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         1.441e+00  4.942e-01   2.917  0.00354 ** 
## fixedacidity        2.069e-04  8.935e-04   0.232  0.81692    
## volatileacidity    -2.918e-02  7.114e-03  -4.102 4.09e-05 ***
## citricacid          5.365e-03  6.389e-03   0.840  0.40102    
## residualsugar      -1.715e-05  1.650e-04  -0.104  0.91721    
## chlorides          -3.300e-02  1.744e-02  -1.892  0.05844 .  
## freesulfurdioxide   8.501e-05  3.726e-05   2.281  0.02254 *  
## totalsulfurdioxide  7.666e-05  2.410e-05   3.180  0.00147 ** 
## density            -2.640e-01  2.078e-01  -1.271  0.20390    
## ph                 -1.029e-02  8.231e-03  -1.250  0.21113    
## sulphates          -1.174e-02  5.935e-03  -1.978  0.04796 *  
## alcohol             4.148e-03  1.497e-03   2.770  0.00560 ** 
## labelappeal-1       2.340e-01  4.089e-02   5.722 1.05e-08 ***
## labelappeal0        4.230e-01  3.987e-02  10.612  < 2e-16 ***
## labelappeal1        5.542e-01  4.055e-02  13.666  < 2e-16 ***
## labelappeal2        7.004e-01  4.581e-02  15.291  < 2e-16 ***
## acidindex5         -5.730e-01  4.528e-01  -1.265  0.20571    
## acidindex6         -5.196e-01  4.483e-01  -1.159  0.24646    
## acidindex7         -5.601e-01  4.481e-01  -1.250  0.21130    
## acidindex8         -5.894e-01  4.481e-01  -1.315  0.18842    
## acidindex9         -7.053e-01  4.484e-01  -1.573  0.11573    
## acidindex10        -8.614e-01  4.492e-01  -1.918  0.05516 .  
## acidindex11        -1.207e+00  4.519e-01  -2.671  0.00756 ** 
## acidindex12        -1.235e+00  4.568e-01  -2.704  0.00684 ** 
## acidindex13        -1.030e+00  4.591e-01  -2.243  0.02487 *  
## acidindex14        -1.141e+00  4.717e-01  -2.418  0.01562 *  
## acidindex15        -9.209e-01  5.401e-01  -1.705  0.08821 .  
## acidindex16        -1.102e+00  6.333e-01  -1.740  0.08185 .  
## acidindex17        -1.614e+00  6.334e-01  -2.548  0.01083 *  
## stars2              3.267e-01  1.559e-02  20.949  < 2e-16 ***
## stars3              4.427e-01  1.695e-02  26.123  < 2e-16 ***
## stars4              5.613e-01  2.352e-02  23.861  < 2e-16 ***
## stars0             -7.480e-01  2.122e-02 -35.256  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(40448.6) family taken to be 1)
## 
##     Null deviance: 19538  on 10865  degrees of freedom
## Residual deviance: 11596  on 10833  degrees of freedom
## AIC: 38716
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  40449 
##           Std. Err.:  36874 
## Warning while fitting theta: iteration limit reached 
## 
##  2 x log-likelihood:  -38648.35
AIC 38716.35
Dispersion 0.89
Log-Lik -19324.17

Model Performance

Of the total predictors, all but 5 had statistical significance.

The results for the positive/negative coefficients seemed reasonable; high reviewer ratings and label appeal increased sales while high acidity levels and low reviewer ratings decreased them.

Positive Coefficients
Est p
(Intercept) 1.441 0.004
labelappeal2 0.700 0.000
stars4 0.561 0.000
labelappeal1 0.554 0.000
stars3 0.443 0.000
labelappeal0 0.423 0.000
stars2 0.327 0.000
labelappeal-1 0.234 0.000
Negative Coefficients
Est p
acidindex17 -1.614 0.011
acidindex12 -1.235 0.007
acidindex11 -1.207 0.008
acidindex14 -1.141 0.016
acidindex13 -1.030 0.025
acidindex10 -0.861 0.055
stars0 -0.748 0.000

The residual deviance was 11,596. Since the residual deviance was greater than the residual degrees of freedom, this suggested over-dispersion (the estimates were correct, but the standard errors were wrong and unaccounted for by the model).

The p-value associated with this Chi-Square Statistic was 0 (less than .05), so the model could be useful.

The Akaike information criterion (AIC) was 38,716. The lower the AIC value, the better the model’s ability to fit the data.

The Theta parameter, however, seemed too large (about 40,449). This occurred because the data was almost equidispersed (variance ≈ mean), which a negative binomial distribution cannot quite achieve. As a result, the optimizer gave up while trying to send the dispersion parameter toward infinity.

As expected, the negative binomial regression did not outperform the Poisson. There was little difference between the Poisson and negative binomial models due to the small difference between the mean and variance.

Model Assumptions

The resulting plot below (predicted values vs. residuals) showed that many of the standardized residuals fell outside the range of -2 to 2. Examining the standardized Pearson residuals plot, the model seemed adequate in the middle range, with extremes at the ends. In general, this model didn’t fit our data well because of the zero counts in the response variable, so we had to consider other models that could address this issue.

All variables had a VIF less than 5. As a result, multi-collinearity may not have been a problem for this model.

Variance Inflation Factor
GVIF df GVIF^(1/(2*df))
stars 1.178 4 1.021
labelappeal 1.142 4 1.017
acidindex 1.090 13 1.003
fixedacidity 1.024 1 1.012
alcohol 1.013 1 1.006
ph 1.007 1 1.004
residualsugar 1.007 1 1.003
citricacid 1.006 1 1.003
volatileacidity 1.006 1 1.003
totalsulfurdioxide 1.005 1 1.003
freesulfurdioxide 1.004 1 1.002
density 1.004 1 1.002
chlorides 1.004 1 1.002
sulphates 1.003 1 1.002

3.4 Negative Binomial Regression 2

We built a Zero-Inflated Negative Binomial model to handle the large number of zero values in our labelappeal and stars predictors, to see if we could improve model accuracy.

nb2 <- zeroinfl(target ~ . | ., data=try_train, dist = 'negbin')
## 
## Call:
## zeroinfl(formula = target ~ . | ., data = try_train, dist = "negbin")
## 
## Pearson residuals:
##       Min        1Q    Median        3Q       Max 
## -2.280486 -0.427363 -0.001335  0.382451  4.806844 
## 
## Count model coefficients (negbin with log link):
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         8.537e-01  5.282e-01   1.616   0.1060    
## fixedacidity        3.491e-04  9.204e-04   0.379   0.7044    
## volatileacidity    -1.328e-02  7.299e-03  -1.819   0.0689 .  
## citricacid          1.045e-03  6.513e-03   0.160   0.8726    
## residualsugar      -3.229e-05  1.693e-04  -0.191   0.8487    
## chlorides          -1.901e-02  1.788e-02  -1.063   0.2878    
## freesulfurdioxide   1.822e-05  3.773e-05   0.483   0.6291    
## totalsulfurdioxide -7.294e-06  2.398e-05  -0.304   0.7610    
## density            -2.421e-01  2.146e-01  -1.128   0.2592    
## ph                  3.791e-03  8.469e-03   0.448   0.6544    
## sulphates          -5.903e-04  6.083e-03  -0.097   0.9227    
## alcohol             7.090e-03  1.527e-03   4.643 3.43e-06 ***
## labelappeal-1       4.238e-01  4.446e-02   9.531  < 2e-16 ***
## labelappeal0        7.118e-01  4.345e-02  16.381  < 2e-16 ***
## labelappeal1        8.998e-01  4.417e-02  20.373  < 2e-16 ***
## labelappeal2        1.065e+00  4.919e-02  21.640  < 2e-16 ***
## acidindex5         -1.837e-01  4.832e-01  -0.380   0.7038    
## acidindex6         -1.474e-01  4.789e-01  -0.308   0.7583    
## acidindex7         -1.716e-01  4.787e-01  -0.358   0.7201    
## acidindex8         -1.865e-01  4.787e-01  -0.390   0.6968    
## acidindex9         -2.207e-01  4.790e-01  -0.461   0.6450    
## acidindex10        -2.847e-01  4.799e-01  -0.593   0.5531    
## acidindex11        -2.811e-01  4.827e-01  -0.582   0.5603    
## acidindex12        -1.960e-01  4.878e-01  -0.402   0.6879    
## acidindex13        -1.157e-01  4.898e-01  -0.236   0.8133    
## acidindex14        -1.237e-01  5.034e-01  -0.246   0.8059    
## acidindex15        -1.059e-01  5.706e-01  -0.186   0.8527    
## acidindex16         5.325e-03  6.560e-01   0.008   0.9935    
## acidindex17        -3.414e-01  6.562e-01  -0.520   0.6029    
## stars2              1.249e-01  1.629e-02   7.666 1.78e-14 ***
## stars3              2.190e-01  1.756e-02  12.466  < 2e-16 ***
## stars4              3.154e-01  2.404e-02  13.123  < 2e-16 ***
## stars0             -5.830e-02  2.293e-02  -2.543   0.0110 *  
## Log(theta)          1.730e+01        NaN     NaN      NaN    
## 
## Zero-inflation model coefficients (binomial with logit link):
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        -6.507e+00  2.126e+00  -3.060 0.002211 ** 
## fixedacidity        1.543e-03  6.049e-03   0.255 0.798606    
## volatileacidity     1.563e-01  4.657e-02   3.356 0.000791 ***
## citricacid         -3.469e-02  4.288e-02  -0.809 0.418418    
## residualsugar      -3.321e-04  1.103e-03  -0.301 0.763312    
## chlorides           8.908e-02  1.166e-01   0.764 0.444859    
## freesulfurdioxide  -7.064e-04  2.539e-04  -2.782 0.005399 ** 
## totalsulfurdioxide -9.312e-04  1.602e-04  -5.814 6.10e-09 ***
## density             7.629e-01  1.421e+00   0.537 0.591410    
## ph                  1.800e-01  5.452e-02   3.302 0.000961 ***
## sulphates           1.243e-01  3.968e-02   3.134 0.001725 ** 
## alcohol             2.800e-02  1.003e-02   2.792 0.005241 ** 
## labelappeal-1       1.395e+00  3.402e-01   4.100 4.13e-05 ***
## labelappeal0        2.145e+00  3.376e-01   6.353 2.11e-10 ***
## labelappeal1        2.850e+00  3.440e-01   8.285  < 2e-16 ***
## labelappeal2        3.228e+00  4.007e-01   8.057 7.82e-16 ***
## acidindex5          9.117e-01  1.619e+00   0.563 0.573297    
## acidindex6          5.455e-01  1.555e+00   0.351 0.725724    
## acidindex7          7.854e-01  1.551e+00   0.506 0.612622    
## acidindex8          9.631e-01  1.551e+00   0.621 0.534714    
## acidindex9          1.664e+00  1.554e+00   1.071 0.284115    
## acidindex10         1.960e+00  1.558e+00   1.258 0.208458    
## acidindex11         3.195e+00  1.569e+00   2.036 0.041718 *  
## acidindex12         3.414e+00  1.593e+00   2.143 0.032115 *  
## acidindex13         3.658e+00  1.630e+00   2.244 0.024808 *  
## acidindex14         2.872e+00  1.628e+00   1.764 0.077809 .  
## acidindex15         2.883e+00  1.961e+00   1.470 0.141513    
## acidindex16         1.525e+01  7.389e+02   0.021 0.983529    
## acidindex17         1.476e+01  6.166e+02   0.024 0.980900    
## stars2             -3.696e+00  3.467e-01 -10.660  < 2e-16 ***
## stars3             -2.673e+01  8.194e+02  -0.033 0.973978    
## stars4             -1.960e+01  1.295e+03  -0.015 0.987921    
## stars0              2.061e+00  8.268e-02  24.925  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Theta = 32608481.9495 
## Number of iterations in BFGS optimization: 72 
## Log-likelihood: -1.724e+04 on 67 Df
AIC 34617.12
Dispersion 0.45
Log-Lik -17241.56

Model Performance

The results for the positive/negative coefficients seemed reasonable and similar to the Poisson models. Our zero-inflated model produced two sets of coefficients — one for the negative binomial (count) portion of the model and one for the zero-inflation binary response (probability) portion of the model.

The count portion of the model showed that high reviewer ratings and label appeal seemed to lead to better sales, while high acidity reduced them.

The binary response portion of the model indicated the lack of reviewer ratings (stars) decreased the chances of purchase.

The p-value associated with this Chi-Square Statistic was 0 (less than .05), so the model could be useful.

The zero-inflated negative binomial model showed a similar improvement to the zero-inflated Poisson, but as before did not outperform the zero-inflated Poisson. With a zero-inflated model, the dispersion parameter dropped significantly, but we obtained a better overall result for counts of 3 or more.

Model Assumptions

The resulting plot below (predicted values vs. residuals) showed that many of the standardized residuals fell outside the range of -2 and 2 but seemed better than for the Poisson models.


3.5 Multiple Linear Regression 1

For our first Multiple Linear Regression, we used all predictors.

lm1 <- lm(target ~ ., data=try_train)
## 
## Call:
## lm(formula = target ~ ., data = try_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.9867 -0.8555  0.0357  0.8405  6.0794 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         3.544e+00  1.038e+00   3.413 0.000645 ***
## fixedacidity        1.015e-03  2.032e-03   0.499 0.617459    
## volatileacidity    -9.190e-02  1.606e-02  -5.724 1.07e-08 ***
## citricacid          1.875e-02  1.458e-02   1.286 0.198501    
## residualsugar      -9.677e-06  3.746e-04  -0.026 0.979392    
## chlorides          -1.078e-01  3.954e-02  -2.726 0.006426 ** 
## freesulfurdioxide   2.521e-04  8.480e-05   2.973 0.002958 ** 
## totalsulfurdioxide  2.196e-04  5.439e-05   4.037 5.45e-05 ***
## density            -7.830e-01  4.733e-01  -1.654 0.098068 .  
## ph                 -2.964e-02  1.859e-02  -1.594 0.110873    
## sulphates          -2.893e-02  1.345e-02  -2.150 0.031550 *  
## alcohol             1.332e-02  3.387e-03   3.932 8.49e-05 ***
## labelappeal-1       3.653e-01  6.811e-02   5.364 8.33e-08 ***
## labelappeal0        8.289e-01  6.642e-02  12.480  < 2e-16 ***
## labelappeal1        1.289e+00  6.937e-02  18.584  < 2e-16 ***
## labelappeal2        1.904e+00  9.157e-02  20.792  < 2e-16 ***
## acidindex5         -8.276e-01  9.396e-01  -0.881 0.378429    
## acidindex6         -6.783e-01  9.268e-01  -0.732 0.464237    
## acidindex7         -7.991e-01  9.261e-01  -0.863 0.388242    
## acidindex8         -9.000e-01  9.262e-01  -0.972 0.331202    
## acidindex9         -1.220e+00  9.267e-01  -1.317 0.187937    
## acidindex10        -1.534e+00  9.279e-01  -1.653 0.098357 .  
## acidindex11        -1.965e+00  9.298e-01  -2.113 0.034618 *  
## acidindex12        -2.021e+00  9.343e-01  -2.163 0.030592 *  
## acidindex13        -1.950e+00  9.417e-01  -2.071 0.038392 *  
## acidindex14        -1.791e+00  9.503e-01  -1.885 0.059476 .  
## acidindex15        -1.466e+00  1.050e+00  -1.396 0.162636    
## acidindex16        -1.928e+00  1.134e+00  -1.700 0.089115 .  
## acidindex17        -2.414e+00  1.050e+00  -2.298 0.021600 *  
## stars2              1.060e+00  3.543e-02  29.911  < 2e-16 ***
## stars3              1.613e+00  4.088e-02  39.450  < 2e-16 ***
## stars4              2.285e+00  6.486e-02  35.226  < 2e-16 ***
## stars0             -1.327e+00  3.579e-02 -37.087  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.307 on 10833 degrees of freedom
## Multiple R-squared:  0.5424, Adjusted R-squared:  0.5411 
## F-statistic: 401.3 on 32 and 10833 DF,  p-value: < 2.2e-16
AIC 36694.00
Adj R2 0.54

Model Performance

The F-statistic was 401 and, of the 14 variables, all but 5 had statistically significant p-values.

The adjusted R-squared of 0.54 indicated that the model explained about 54% of the total variance in the response variable target.

The RMSE was 1.31.
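
The RMSE can also be checked on the validation hold-out (a sketch, using the try_valid split created earlier):

lm1_pred <- predict(lm1, newdata = try_valid)
sqrt(mean((try_valid$target - lm1_pred)^2))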

Positive coefficients indicated an increase in purchases. In this model, wines with a better rating, a more appealing label, and higher alcohol content tended to have more purchases.

Similarly, negative coefficients indicated fewer purchases. In this model, wines with a high acid level, no rating, or a high concentration of sulphates/chlorides tended to have fewer purchases.

Model Assumptions

An examination of the residuals indicated that the Residuals vs Fitted plot didn’t show a constant variability of the residuals, but the Q-Q plot indicated a much better level of normality compared to the previous models.

The variables included in our model had a VIF of 1 (or slightly above), indicating the variables were not correlated, for the most part.


3.6 Multiple Linear Regression 2

For our second Multiple Linear Regression, we added a stepwise feature selection.

lm2_all <- lm(target ~ ., data=try_train)
lm2 <- stepAIC(lm2_all, trace=FALSE, direction='both')
## 
## Call:
## lm(formula = target ~ volatileacidity + chlorides + freesulfurdioxide + 
##     totalsulfurdioxide + density + ph + sulphates + alcohol + 
##     labelappeal + acidindex + stars, data = try_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.9820 -0.8568  0.0367  0.8435  6.0692 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         3.546e+00  1.038e+00   3.417 0.000635 ***
## volatileacidity    -9.232e-02  1.605e-02  -5.751 9.09e-09 ***
## chlorides          -1.084e-01  3.953e-02  -2.743 0.006104 ** 
## freesulfurdioxide   2.530e-04  8.477e-05   2.985 0.002842 ** 
## totalsulfurdioxide  2.195e-04  5.436e-05   4.038 5.42e-05 ***
## density            -7.912e-01  4.732e-01  -1.672 0.094563 .  
## ph                 -2.999e-02  1.859e-02  -1.613 0.106731    
## sulphates          -2.904e-02  1.344e-02  -2.160 0.030791 *  
## alcohol             1.341e-02  3.386e-03   3.959 7.56e-05 ***
## labelappeal-1       3.646e-01  6.810e-02   5.355 8.75e-08 ***
## labelappeal0        8.284e-01  6.641e-02  12.474  < 2e-16 ***
## labelappeal1        1.289e+00  6.936e-02  18.584  < 2e-16 ***
## labelappeal2        1.904e+00  9.155e-02  20.792  < 2e-16 ***
## acidindex5         -8.150e-01  9.388e-01  -0.868 0.385357    
## acidindex6         -6.621e-01  9.259e-01  -0.715 0.474565    
## acidindex7         -7.806e-01  9.252e-01  -0.844 0.398853    
## acidindex8         -8.807e-01  9.252e-01  -0.952 0.341165    
## acidindex9         -1.198e+00  9.257e-01  -1.294 0.195548    
## acidindex10        -1.511e+00  9.268e-01  -1.631 0.102981    
## acidindex11        -1.940e+00  9.287e-01  -2.089 0.036743 *  
## acidindex12        -1.994e+00  9.331e-01  -2.137 0.032658 *  
## acidindex13        -1.922e+00  9.403e-01  -2.044 0.040992 *  
## acidindex14        -1.758e+00  9.489e-01  -1.852 0.064032 .  
## acidindex15        -1.438e+00  1.049e+00  -1.371 0.170466    
## acidindex16        -1.896e+00  1.133e+00  -1.674 0.094188 .  
## acidindex17        -2.370e+00  1.049e+00  -2.260 0.023860 *  
## stars2              1.060e+00  3.542e-02  29.921  < 2e-16 ***
## stars3              1.613e+00  4.087e-02  39.458  < 2e-16 ***
## stars4              2.285e+00  6.485e-02  35.236  < 2e-16 ***
## stars0             -1.328e+00  3.578e-02 -37.115  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.307 on 10836 degrees of freedom
## Multiple R-squared:  0.5424, Adjusted R-squared:  0.5411 
## F-statistic: 442.8 on 29 and 10836 DF,  p-value: < 2.2e-16
AIC 36689.91
Adj R2 0.54

Model Performance

The F-statistic was 443 and, of the 11 variables retained by stepwise selection, all but 2 (density and ph) had statistically significant p-values.

The adjusted R-squared of 0.54 indicated that the model explained about 54% of the total variance in the response variable target.

The RMSE was 1.31.

Positive coefficients indicated an increase in purchases. In this model, wines with a higher reviewer rating, greater label appeal, and higher alcohol content tended to have more purchases.

Wines with no reviewer ratings, high acidity levels, and high concentrations of sulphates/chlorides tended to have fewer purchases.

Model Assumptions

An examination of the residuals indicated that the Residuals vs Fitted plot didn’t show a constant variability of the residuals, but the Q-Q plot indicated a much better level of normality compared to the previous models.

The variables included in our model had a VIF of 1 (or slightly above), indicating the variables were not correlated, for the most part.


4. Model Selection

The following table summarized performance metrics for each of the six models on the training dataset (SMAPE, MASE, MAE, RMSE, AIC, BIC, and, for the linear models, adjusted R2 and F-statistic); a sketch of how such metrics can be assembled follows the table.

Results suggested that the performance improved when using zero-inflated models.

Model performance also improved after applying transformations and methods for selecting significant predictor variables, such as Lasso and the Variance Inflation Factor.

  • Poisson model 2 and Negative Binomial model 2 (zero-inflated) had similar results for all metrics.
  • AIC/BIC was lowest for the Poisson model 2, and the worst results were for Poisson/Negative Binomial models without modeling an excess of zero counts.
  • MAE (mean absolute error between observed and predicted values) was the lowest for the Poisson model 2, while Multiple linear models performed with higher mean error.
  • RMSE (root mean square error) was similar across all the models, with the exception of the zero-inflated Poisson and negative binomial models, which performed best on this metric.

Overall, the Zero-inflated Poisson model appeared to produce the best fit.

Model smape mase mae RMSE AIC BIC Adjusted R2 F-statistic
Poisson Regression 1 62.0858 0.4637 0.9987 1.2708 38708.01 38912.22 NA NA
Poisson Regression 2 60.5765 0.4344 0.9358 1.2296 34615.12 35096.48 NA NA
Negative Binomial 1 62.1109 0.4644 1.0004 1.2714 38716.35 38964.32 NA NA
Negative Binomial 2 60.5765 0.4344 0.9358 1.2296 34617.12 35105.77 NA NA
Multiple Linear Model 1 62.6345 0.4653 1.0023 1.2714 36694.00 36941.97 0.5411 401.3394
Multiple Linear Model 2 62.6335 0.4653 1.0022 1.2753 36689.91 36916.01 0.5411 442.8361
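
Metrics of this kind can be assembled with the yardstick package loaded in the appendix; a minimal sketch for a single model (the zero-inflated Poisson, evaluated on the training data):

pred <- predict(pr2, type = "response")
tibble(
  smape = smape_vec(try_train$target, pred),
  mase  = mase_vec(try_train$target, pred),
  mae   = mae_vec(try_train$target, pred),
  rmse  = rmse_vec(try_train$target, pred)
)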

5. Predictions

Using the zero-inflated Poisson regression model (no missing values, some variables converted to factors), we predicted target values for the 3335 observations in our evaluation dataset.

We applied all training data transformations to the evaluation data as well.

We wrote the results to a file “eval_predictions.csv” and saved it to the GitHub repo.
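
A sketch of this final step (object names taken from the appendix code; the exact output layout may differ from the submitted file):

eval_pred <- predict(pr2, newdata = try_df_predict, type = "response")
eval_out  <- try_df_predict %>% mutate(target = round(eval_pred))
write.csv(eval_out, "eval_predictions.csv", row.names = FALSE)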


6. Conclusion

The most important predictive factors for the number of purchases - by far - were the number of reviewer stars (stars), the label design appeal for consumers (labelappeal), and the total acidity of wine (acidindex).

Based on our models, a good wine should have good reviews, a low acid index, and an attractive label.

There were some issues with the accuracy of predictions of zero cases purchased (target=0). This could be a candidate for further research and analysis, together with the high Theta being produced for the Negative Binomial models.

For future consideration, we might also wish to look into advanced variable selection, as it could improve the performance of our models.


References

  1. Faraway, J. J. (2014). Linear Models with R, Second Edition. CRC Press.
  2. James, G., Witten, D., Hastie, T., & Tibshirani, R. (2014). An Introduction to Statistical Learning: With Applications in R. Springer.
  3. Faraway, J. J. (2016). Extending the Linear Model with R: Generalized Linear, Mixed Effects and Nonparametric Regression Models. CRC Press.
  4. Admin. (2018, February 27). Total Sulfur Dioxide – Why it Matters, Too! Midwest Grape and Wine Industry Institute. https://www.extension.iastate.edu/wine/total-sulfur-dioxide-why-it-matters-too
  5. Zero-Inflated Poisson Regression | R Data Analysis Examples. (n.d.). https://stats.oarc.ucla.edu/r/dae/zip/

Appendix

R Code

## ---- include=FALSE-------------------------------------------------------------------------------------------------------------------------------------------
# if error for kableExtra, do
# devtools::install_github("kupietz/kableExtra")


## ----setup, include=FALSE-------------------------------------------------------------------------------------------------------------------------------------
# chunks
knitr::opts_chunk$set(echo=FALSE, eval=TRUE, include=TRUE, 
message=FALSE, warning=FALSE, fig.height=5, fig.align='center')


# libraries
library(tidyverse)
library(kableExtra)
library(MASS) # glm.nb()
library(mice)
library(pscl) # zeroinfl()
library(skimr)
library(sjPlot)
library(mpath)
library(yardstick)
library(caret)
library(corrplot)
library(Hmisc)
library(jtools)
library(car)
library(knitr)

# ggplot
theme_set(theme_light())



## # only required for final knit

## #h4, h5 {margin-top: 20px;}


## ----common functions-----------------------------------------------------------------------------------------------------------------------------------------
nice_table <- function(df, cap=NULL, cols=NULL, dig=3, fw=F){
  if (is.null(cols)) {c <- colnames(df)} else {c <- cols}
  table <- df %>% 
    kable(caption=cap, col.names=c, digits=dig) %>% 
    kable_styling(
      bootstrap_options = c("striped", "hover", "condensed"),
      html_font = 'monospace',
      full_width = fw)
  return(table)
}

model_diag <- function(model){
  model_sum <- summary(model)
  aic <- AIC(model)
  ar2 <- model_sum$adj.r.squared
  disp <- sum(resid(model,'pearson')^2)/model$df.residual
  loglik <- logLik(model)
  
  vec <- c(ifelse(is.null(aic), NA, aic),
           ifelse(is.null(ar2), NA, ar2),
           ifelse(is.null(disp), NA, disp),
           ifelse(is.null(loglik), NA, loglik))
  
  names(vec) <- c('AIC','Adj R2','Dispersion','Log-Lik')
  return(vec)
}


## -------------------------------------------------------------------------------------------------------------------------------------------------------------
# insert chunk & run with rmarkdown::render('DATA621_HW5_Group1.Rmd') to see knit vars in local environment.
#knitr::knit_exit()


## ----data_load------------------------------------------------------------------------------------------------------------------------------------------------
# load data
df_train <- read.csv('https://raw.githubusercontent.com/cliftonleesps/data621_group1/main/hw5/data/wine-training-data.csv')
df_predict <- read.csv('https://raw.githubusercontent.com/cliftonleesps/data621_group1/main/hw5/data/wine-evaluation-data.csv')


## ----clean_data, message=FALSE, warning=FALSE, error=FALSE----------------------------------------------------------------------------------------------------
# fix index name in predict for bind
df_predict <- df_predict %>% rename(INDEX=IN)
df_train <- df_train %>% rename(INDEX = ï..INDEX)  # first header mangled by the UTF-8 BOM in the source CSV

# union for exploration and prep
df_train <- df_train %>% mutate(source='train')
df_predict <- df_predict %>% mutate(source='predict')
df_all <- bind_rows(df_train, df_predict)

# fix column labels
names(df_all) <- str_to_lower(str_replace_all(names(df_all), c(" " = "_" , "," = "", "\\*" = "", "\\(" = "", "\\)" = "", "`" = "", "\\/" = "_")))

# drop un-necessary columns
df_all <- df_all %>% dplyr::select(!index)


## ----summary--------------------------------------------------------------------------------------------------------------------------------------------------
# summary stats for all non-dummy variables
df_all %>% 
  dplyr::select(!c(source, target)) %>%
  skim() %>%
  dplyr::select(skim_variable, complete_rate, n_missing, 
                numeric.p0, numeric.p100) %>%
  rename(variable=skim_variable, min=numeric.p0, max=numeric.p100) %>%
  mutate(complete_rate=round(complete_rate,2), 
         min=round(min,2), max=round(max,2)) %>%
  arrange(variable) %>%
  nice_table(cap='Summary Statistics')


## ----distrib--------------------------------------------------------------------------------------------------------------------------------------------------
numeric_cols <- sapply(df_all, is.numeric)

df_all[,numeric_cols] %>%
  dplyr::select(!target) %>%
  pivot_longer(everything(),names_to = c('variables'),values_to = c('values')) %>% 
  ggplot() +
  geom_histogram(aes(x=values, y = ..density..), alpha=0.5, colour='black', size=0.2) +
  geom_density(aes(x=values), color='purple') +
  facet_wrap(vars(variables), scales="free")


## -------------------------------------------------------------------------------------------------------------------------------------------------------------
df_all%>%
  drop_na(target) %>%
  ggplot() +
  geom_histogram(aes(x=target), colour='black', size=0.2) 


## ----boxplots-------------------------------------------------------------------------------------------------------------------------------------------------
box_plots <- df_train %>% 
  dplyr::select(!c(INDEX, source))
  
box_plots %>% 
  gather(key = 'variables', value = 'values')  %>%
  ggplot(aes(variables, values)) + 
  geom_boxplot(fill = 'coral4') + 
    labs(title = 'Boxplot: Training data',
       x = 'Variables',
       y = 'Values')+
  facet_wrap(. ~variables, scales='free', ncol=6)


## -------------------------------------------------------------------------------------------------------------------------------------------------------------
box_plots %>% 
  dplyr::select(c(TARGET, STARS, LabelAppeal, AcidIndex)) %>%
  pivot_longer(!c(TARGET), names_to = 'variables', values_to = 'values') %>% 
  arrange(variables, values) %>%
  ggplot(mapping = aes(x = factor(values), y = TARGET)) + 
  geom_boxplot(fill = 'coral4') +   
  labs(title = 'Boxplot: Discrete variables',
       x = 'Variables',
       y = 'Values')+
  facet_wrap(~variables, scales = 'free',  ncol = 1)


## -------------------------------------------------------------------------------------------------------------------------------------------------------------
featurePlot(box_plots[,2:ncol(box_plots)], box_plots[,1], plot = "scatter")


## -------------------------------------------------------------------------------------------------------------------------------------------------------------
rcore <- rcorr(as.matrix(box_plots %>% dplyr::select(where(is.numeric))))
coeff <- rcore$r
corrplot(coeff, tl.cex = .5, tl.col="black", method = 'color', addCoef.col = "black",
         type="upper", order="hclust", number.cex=0.7, diag=FALSE)


## -------------------------------------------------------------------------------------------------------------------------------------------------------------
# working copy
df_prep <- df_all

# find columns with NAs
x <- df_prep %>%
  dplyr::select(!target) %>%
  apply(2, function(col) sum(is.na(col)))

x[x>0] %>% as.data.frame() %>% 
  setNames('is_na') %>% 
  mutate(pct = round(is_na/nrow(df_prep),2)) %>%
  arrange(desc(pct)) %>%
  nice_table(cap = 'Missing Data')


## -------------------------------------------------------------------------------------------------------------------------------------------------------------
# substitute 0 instead of NA later in the code
df_prep <- df_prep %>%
  mutate(stars = as.factor(stars))


## -------------------------------------------------------------------------------------------------------------------------------------------------------------
# MICE setup
init = mice(df_prep, maxit=0) 
meth = init$method
predM = init$predictorMatrix

# omit from MICE predictors
predM[, c('target','labelappeal','stars','source')] = 0

# omit from MICE imputation
meth[c('target','labelappeal','stars','source')] = ''

impute = mice(df_prep, method=meth, predictorMatrix=predM, m=5, printFlag=FALSE)

df_prep_imputed <- complete(impute)


## -------------------------------------------------------------------------------------------------------------------------------------------------------------
df_prep_imputed <- df_prep_imputed %>%
  mutate(labelappeal = as.factor(labelappeal)) %>%
  mutate(acidindex = as.factor(acidindex)) 


## -------------------------------------------------------------------------------------------------------------------------------------------------------------
#dataframe for models without NA
try_df <- df_prep_imputed

# Get levels and add "0"
levels <- levels(try_df$stars)
levels[length(levels) + 1] <- "0"

# refactor Stars to include "0" as a factor level
# and replace NA with "0"
try_df$stars <- factor(try_df$stars, levels = levels)
try_df$stars[is.na(try_df$stars)] <- "0"


## -------------------------------------------------------------------------------------------------------------------------------------------------------------
n_missing <- apply(try_df, 2, function(col) sum(is.na(col)))
n_zero <- colSums(try_df==0)

df <- data.frame(n_missing, n_zero)
df <- cbind(variable=rownames(df),df)
rownames(df) <- NULL

df %>% 
  dplyr::filter(variable != 'source' & variable != 'target') %>%
  arrange(variable) %>% 
  nice_table(cap = 'Final Dataset - Missing and Zeros')


## -------------------------------------------------------------------------------------------------------------------------------------------------------------
numeric_cols <- sapply(try_df, is.numeric)

try_df[,numeric_cols] %>%
  dplyr::select(!target) %>%
  pivot_longer(everything(),names_to = c('variables'),values_to = c('values')) %>% 
  ggplot() +
  geom_histogram(aes(x=values, y = after_stat(density)), alpha=0.5, colour='black', size=0.2) +
  geom_density(aes(x=values), color='purple') +
  facet_wrap(vars(variables), scales="free")


## ----eval=FALSE-----------------------------------------------------------------------------------------------------------------------------------------------
## # random seed
## set.seed(42)
## # split back to train and test, remove source flag
## df_train_all <- df_prep_imputed %>% filter(source == 'train') %>% dplyr::select(!source)
## 
## df_predict <- df_prep_imputed %>% filter(source == 'predict') %>% dplyr::select(!source)
## 
## # create a validation holdout from the training dataset
## row_sample <- sample(c(TRUE,FALSE), nrow(df_train_all), replace=TRUE, prob=c(0.85,0.15))
## 
## df_train <- df_train_all[row_sample,]
## df_valid <- df_train_all[!row_sample,]


## -------------------------------------------------------------------------------------------------------------------------------------------------------------
# random seed
set.seed(42)
# split back to train and test, remove source flag
try_df_all <- try_df %>% filter(source == 'train') %>% dplyr::select(!source)

try_df_predict <- try_df %>% filter(source == 'predict') %>% dplyr::select(!source)

# create a validation holdout from the training dataset
row_samples <- sample(c(TRUE,FALSE), nrow(try_df_all), replace=TRUE, prob=c(0.85,0.15))

try_train <- try_df_all[row_samples,]
try_valid <- try_df_all[!row_samples,]
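
# (Optional check, not in the original script): confirm the roughly 85/15 split
# of the training data into training and validation sets.
nrow(try_train); nrow(try_valid)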


## ---- echo=TRUE-----------------------------------------------------------------------------------------------------------------------------------------------
pr1 <- glm(target ~ ., family = 'poisson', data = try_train)


## -------------------------------------------------------------------------------------------------------------------------------------------------------------
summary(pr1)

pr1_diag <- model_diag(pr1)

pr1_diag[c('AIC','Dispersion','Log-Lik')] %>% 
  round(2) %>% nice_table()


## ----pr_aic, warning=FALSE, message=FALSE---------------------------------------------------------------------------------------------------------------------
pr1_aic <- pr1 %>% stepAIC(trace = FALSE)
summ(pr1_aic)


## -------------------------------------------------------------------------------------------------------------------------------------------------------------
pr1_aic_diag <- model_diag(pr1_aic)

pr1_aic_diag[c('AIC','Dispersion','Log-Lik')] %>% 
  round(2) %>% nice_table()


## -------------------------------------------------------------------------------------------------------------------------------------------------------------
as.data.frame(summ(pr1_aic)$coeftable) %>%
  dplyr::select(Est.,p) %>%
  filter(p < 0.06 & Est. > 0.2) %>%
  arrange(desc(Est.)) %>%
  nice_table(cap='Positive Coefficients', cols=c('Est','p'))


## -------------------------------------------------------------------------------------------------------------------------------------------------------------
as.data.frame(summ(pr1_aic)$coeftable) %>%
  dplyr::select(Est.,p) %>%
  filter(p < 0.06 & Est. < -0.2) %>%
  arrange(Est.) %>%
  nice_table(cap='Negative Coefficients', cols=c('Est','p'))


## ---- fig.align='center', out.width='50%', fig.height=4-------------------------------------------------------------------------------------------------------
# predict based on our validation holdout
pr1_valid <- predict(pr1_aic, newdata = try_valid, type = "response")

# bind the predicted and actuals for comparison
pr1_eval <- bind_cols(target = try_valid$target, predicted= pr1_valid)
#pr1_eval[is.na(pr1_eval)] <- 0

# xy plot of validation predictions and targets
pr1_eval %>%
  ggplot(aes(x = target, y = predicted)) +
  geom_point(alpha = .3) +
  geom_smooth(method="lm", color='grey', alpha=.3, se=FALSE)

# density plot of validation predictions and targets
#pr1_eval %>%
 # ggplot() +
  #geom_density(aes(x=target), fill='green', alpha=0.25) +
  #geom_density(aes(x=round(predicted,0)), fill='blue', alpha=0.25)


## -------------------------------------------------------------------------------------------------------------------------------------------------------------
par(mfrow=c(2,2))
plot(pr1_aic)


## -------------------------------------------------------------------------------------------------------------------------------------------------------------
as.data.frame(car::vif(pr1_aic)) %>% 
  arrange(desc(GVIF)) %>%
  nice_table(cap='Variance Inflation Factor', cols=c('GVIF','df','adj'))


## ---- echo=TRUE-----------------------------------------------------------------------------------------------------------------------------------------------
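# In pscl::zeroinfl(), the formula to the left of '|' specifies the count model
# and the formula to the right specifies the zero-inflation (logit) model; here
# both parts use all available predictors.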
pr2 <- zeroinfl(target ~ .| ., data=try_train, dist = 'poisson')


## -------------------------------------------------------------------------------------------------------------------------------------------------------------
summary(pr2)

pr2_diag <- model_diag(pr2)

pr2_diag[c('AIC','Dispersion','Log-Lik')] %>% 
  round(2) %>% nice_table()


## ---- fig.align='center', out.width='50%', fig.height=4-------------------------------------------------------------------------------------------------------
# predict based on our validation holdout
pr2_valid <- predict(pr2, newdata = try_valid, type="response")

# bind the predicted and actuals for comparison
pr2_eval <- bind_cols(target = try_valid$target, predicted= pr2_valid)
#pr2_eval[is.na(pr2_eval)] <- 0

# xy plot of validation predictions and targets
pr2_eval %>%
  ggplot(aes(x = target, y = predicted)) +
  geom_point(alpha = .3) +
  geom_smooth(method="lm", color='grey', alpha=.3, se=FALSE)

# density plot of validation predictions and targets
#pr2_eval %>%
 # ggplot() +
  #geom_density(aes(x=target), fill='green', alpha=0.25) +
  #geom_density(aes(x=round(predicted,0)), fill='blue', alpha=0.25)


## -------------------------------------------------------------------------------------------------------------------------------------------------------------
# Diagnostic plots

# xy plot of model fitted and residuals
bind_cols(fitted=unname(fitted(pr2)), resid=unname(resid(pr2))) %>%
  ggplot(aes(x=fitted, y=resid)) +
  geom_point(alpha = .3) +
  geom_hline(yintercept=2, linetype='dashed') +
  geom_hline(yintercept=-2, linetype='dashed')


## -------------------------------------------------------------------------------------------------------------------------------------------------------------
varian_mean <- tibble(
                      "Mean" = numeric(), 
                      "Variance" = numeric()
                )

varian_mean <- varian_mean %>% add_row(tibble_row(
                      "Mean" = mean(try_train$target), 
                      "Variance" = var(try_train$target)
                     ))

kable(varian_mean, digits=3) %>% 
  kable_styling(bootstrap_options = "basic", position = "center")
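
# (Supplementary check, not in the original script): a Pearson-based dispersion
# statistic for the Poisson fit; values well above 1 indicate overdispersion.
sum(residuals(pr1, type = 'pearson')^2) / df.residual(pr1)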


## ---- echo=TRUE,  warning=FALSE, message=FALSE----------------------------------------------------------------------------------------------------------------
nb1 <- glm.nb(target ~ ., data = try_train)


## ----warning=FALSE, message=FALSE-----------------------------------------------------------------------------------------------------------------------------
summary(nb1)

nb1_diag <- model_diag(nb1)

nb1_diag[c('AIC','Dispersion','Log-Lik')] %>% 
  round(2) %>% nice_table()


## ----warning=FALSE, message=FALSE, error=FALSE----------------------------------------------------------------------------------------------------------------
est <- as.data.frame(summary(nb1)$coefficients[,1])
pval <- as.data.frame(summary(nb1)$coefficients[,4])

coefs <- cbind(est, pval)
colnames(coefs) <- c("Est", "p")

coefs %>%
  filter(p < 0.06 & Est > 0.2) %>%
  arrange(desc(Est)) %>%
  nice_table(cap='Positive Coefficients', cols=c('Est','p'))


## ----warning=FALSE, message=FALSE, error=FALSE----------------------------------------------------------------------------------------------------------------
coefs %>%
  filter(p < 0.06 & Est < -0.2) %>%
  arrange(Est) %>%
  nice_table(cap='Negative Coefficients', cols=c('Est','p'))


## ---- fig.align='center', out.width='50%', fig.height=4-------------------------------------------------------------------------------------------------------
# predict based on our validation holdout
nb1_valid <- predict(nb1, newdata = try_valid, type="response")

# bind the predicted and actuals for comparison
nb1_eval <- bind_cols(target = try_valid$target, predicted= nb1_valid)
#nb1_eval[is.na(nb1_eval)] <- 0

# xy plot of validation predictions and targets
nb1_eval %>%
  ggplot(aes(x = target, y = predicted)) +
  geom_point(alpha = .3) +
  geom_smooth(method="lm", color='grey', alpha=.3, se=FALSE)

# density plot of validation predictions and targets
#nb1_eval %>%
 # ggplot() +
  #geom_density(aes(x=target), fill='green', alpha=0.25) +
 # geom_density(aes(x=round(predicted,0)), fill='blue', alpha=0.25)


## -------------------------------------------------------------------------------------------------------------------------------------------------------------
par(mfrow=c(2,2))
plot(nb1)


## -------------------------------------------------------------------------------------------------------------------------------------------------------------
as.data.frame(car::vif(nb1)) %>% 
  arrange(desc(GVIF)) %>%
  nice_table(cap='Variance Inflation Factor', cols=c('GVIF','df','adj'))


## ---- echo=TRUE-----------------------------------------------------------------------------------------------------------------------------------------------
nb2 <- zeroinfl(target ~ . | ., data=try_train, dist = 'negbin')


## -------------------------------------------------------------------------------------------------------------------------------------------------------------
summary(nb2)

nb2_diag <- model_diag(nb2)

nb2_diag[c('AIC','Dispersion','Log-Lik')] %>% 
  round(2) %>% nice_table()


## ---- fig.align='center', out.width='50%', fig.height=4-------------------------------------------------------------------------------------------------------
# predict based on our validation holdout
nb2_valid <- predict(nb2, newdata = try_valid, type="response")

# bind the predicted and actuals for comparison
nb2_eval <- bind_cols(target = try_valid$target, predicted= nb2_valid)
#nb2_eval[is.na(nb2_eval)] <- 0

# xy plot of validation predictions and targets
nb2_eval %>%
  ggplot(aes(x = target, y = predicted)) +
  geom_point(alpha = .3) +
  geom_smooth(method="lm", color='grey', alpha=.3, se=FALSE)

# density plot of validation predictions and targets
#nb2_eval %>%
 # ggplot() +
  #geom_density(aes(x=target), fill='green', alpha=0.25) +
  #geom_density(aes(x=round(predicted,0)), fill='blue', alpha=0.25)


## -------------------------------------------------------------------------------------------------------------------------------------------------------------
# Diagnostic plots

# xy plot of model fitted and residuals
bind_cols(fitted=unname(fitted(nb2)), resid=unname(resid(nb2))) %>%
  ggplot(aes(x=fitted, y=resid)) +
  geom_point(alpha = .3) +
  geom_hline(yintercept=2, linetype='dashed') +
  geom_hline(yintercept=-2, linetype='dashed')


## ---- echo=TRUE-----------------------------------------------------------------------------------------------------------------------------------------------
lm1 <- lm(target ~ ., data=try_train)


## -------------------------------------------------------------------------------------------------------------------------------------------------------------
summary(lm1)

lm1_diag <- model_diag(lm1)

lm1_diag[c('AIC','Adj R2')] %>% 
  round(2) %>% nice_table()


## ----fig.align='center', fig.height=3, fig.width=4------------------------------------------------------------------------------------------------------------
# validate and calculate RMSE
lm1_valid <- predict(lm1, newdata = try_valid)
lm1_eval <- bind_cols(target = try_valid$target, predicted=lm1_valid)
lm1_rmse <- sqrt(mean((lm1_eval$target - lm1_eval$predicted)^2)) 

# plot targets vs predicted
lm1_eval %>%
  ggplot(aes(x = target, y = predicted)) +
  geom_point(alpha = .3) +
  geom_smooth(method="lm", color='grey', alpha=.3, se=FALSE) +
  labs(title=paste('RMSE:',round(lm1_rmse,1)))


## ---- fig.align='default', out.width='50%', fig.height=5------------------------------------------------------------------------------------------------------
# residual plots
plot(lm1,ask=FALSE)


## ----fig.height=4---------------------------------------------------------------------------------------------------------------------------------------------

vif_values <- vif(lm1)
vif_values <- rownames_to_column(as.data.frame(vif_values), var = "var")

vif_values %>%
  ggplot(aes(y=GVIF, x=var)) +
  coord_flip() + 
  geom_hline(yintercept=5, linetype="dashed", color = "red") +
  geom_bar(stat = 'identity', width=0.3 ,position=position_dodge()) 


## ---- echo=TRUE-----------------------------------------------------------------------------------------------------------------------------------------------
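# Refit the full linear model, then reduce it with stepwise selection in both
# directions using AIC as the criterion.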
lm2_all <- lm(target ~ ., data=try_train)
lm2 <- stepAIC(lm2_all, trace=FALSE, direction='both')


## -------------------------------------------------------------------------------------------------------------------------------------------------------------
summary(lm2)

lm2_diag <- model_diag(lm2)

lm2_diag[c('AIC','Adj R2')] %>% 
  round(2) %>% nice_table()


## ----fig.align='center', fig.height=3, fig.width=4------------------------------------------------------------------------------------------------------------
# validate and calculate RMSE
lm2_valid <- predict(lm2, newdata = try_valid)
lm2_eval <- bind_cols(target = try_valid$target, predicted=lm2_valid)
lm2_rmse <- sqrt(mean((lm2_eval$target - lm2_eval$predicted)^2)) 

# plot targets vs predicted
lm2_eval %>%
  ggplot(aes(x = target, y = predicted)) +
  geom_point(alpha = .3) +
  geom_smooth(method="lm", color='grey', alpha=.3, se=FALSE) +
  labs(title=paste('RMSE:',round(lm2_rmse,1)))


## ---- fig.align='default', out.width='50%', fig.height=5------------------------------------------------------------------------------------------------------
# residual plots
plot(lm2,ask=FALSE)


## ----fig.height=4---------------------------------------------------------------------------------------------------------------------------------------------

vif_values <- vif(lm2)
vif_values <- rownames_to_column(as.data.frame(vif_values), var = "var")

vif_values %>%
  ggplot(aes(y=GVIF, x=var)) +
  coord_flip() + 
  geom_hline(yintercept=5, linetype="dashed", color = "red") +
  geom_bar(stat = 'identity', width=0.3 ,position=position_dodge()) 


## ----eval=FALSE-----------------------------------------------------------------------------------------------------------------------------------------------
## 
## ------------------------------------------------------------------------
## 
## ### 3.7 Lasso
## 
## #We used zipath() to fit a zero-inflated negative binomial regression model with variable selection via lasso regularization.
## #Both the count and the zero-inflation (logit) models start with all the predictor variables, and we kept the set of coefficients along the regularization path that produced the smallest AIC.
## 
## fit.lasso <- zipath(target~.|.,data = try_train, family = "negbin")
## 
## minBic <- which.min(BIC(fit.lasso))
## coef(fit.lasso, minBic)
## cat("theta estimate", fit.lasso$theta[minBic])
## 
## minAic <- which.min(AIC(fit.lasso))
## as.data.frame(coef(fit.lasso, minAic)) %>% nice_table(cap='Zero-Inflated Model Coefficients', dig=4)


## ----eval=FALSE, fig.height=4---------------------------------------------------------------------------------------------------------------------------------
## #### Model Performance
## 
## #**Theta Estimate:** `r cat("theta estimate", fit.lasso$theta[minAic])`
## 
## #The coefficients for the count model that survive the regularization process include the dummy variables for `labelappeal` and `stars`, and the variable `density`.
## 
## #-   `labelappeal` - the dummy variables derived from label appeal are strong indicators of the number of cases that will be purchased
## #-   `stars` - the dummy variables derived from wine ratings are strong indicators of the number of cases that will be purchased
## #-   `density` - a negative indicator of the number of cases purchased by distributors, suggesting that lighter wines are more popular than full-bodied wines
## #-   the following variables drop out of the final model: `residualsugar`, `totalsulfurdioxide`, `freesulfurdioxide`, and `fixedacidity`
## 
## c_df <- data.frame(enframe(coef(fit.lasso, minAic)$count[-1]))  %>%
##             arrange(value) %>% mutate_if(is.numeric,round, digits = 4)
## c_df %>%
## ggplot() +
##   geom_col(aes(y = reorder(name,value), x=value, fill = {value > 0})) +
##   xlab(label = "") +
##   ggtitle(expression(paste("Lasso Coefficients (Count)"))) +
##   theme_bw() +
##   theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust=0.5),legend.position = "none")


## ----eval=FALSE, fig.height=4---------------------------------------------------------------------------------------------------------------------------------
## 
## # The coefficients for the zero inflation model that survived the regularization process include `stars`, `labelappeal-1`, `labelappeal1` ,and `volatileacidity`.
## #
## # -   `stars` - the dummy variables derived from wine ratings are strong indicators of number of cases that will be purchased
## # -   `labelappeal-1` is negative and `labelappeal1` is positive with no other label related dummy variable being included in the model. This would suggest that label aesthetics only count at the margins between positive and negative customer sentiment.
## # -   `volatileacidity` - the only other variable included in the final model; its negative coefficient suggests that lower volatile acid content is preferred when making a purchasing decision.
## 
## 
## c_df <- data.frame(enframe(coef(fit.lasso, minAic)$zero[-1])) %>%
##             arrange(value) %>% mutate_if(is.numeric,round, digits = 4)
## c_df %>%
## ggplot() +
##   geom_col(aes(y = reorder(name,value), x=value, fill = {value > 0})) +
##   xlab(label = "") +
##   ggtitle(expression(paste("Lasso Coefficients (Zero-Inflation)"))) +
##   theme_bw() +
##   theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust=0.5),legend.position = "none")


## ----eval=FALSE, fig.align='default', out.width='50%', fig.height=4-------------------------------------------------------------------------------------------
## #The model still over-predicted counts of 1-2, and greatly under-predicted counts of zero.
## 
## #df_valid_cleaned <- df_valid %>% filter(acidindex != 4)
## lasso_eval <- predict(fit.lasso, newdata=try_valid, which=minAic)
## lasso_eval <- data.frame(lasso_eval)
## colnames(lasso_eval) <- c('predicted')
## lasso_eval$target <- try_valid$target
## #lasso_eval$target <- df_valid_cleaned$target
## lasso_rmse <- sqrt(mean((lasso_eval$target - lasso_eval$predicted)^2))
## 
## # Diagnostic plots
## 
## # xy plot of model residuals vs. predictions
## lasso_eval %>%
##   mutate(resid = target - predicted) %>%
##   ggplot(aes(x=predicted, y=resid)) +
##   geom_point(alpha = .3) +
##   geom_hline(yintercept=2, linetype='dashed') +
##   geom_hline(yintercept=-2, linetype='dashed')
## 
## # xy plot of validation predictions and targets
## lasso_eval %>%
##   ggplot(aes(x = target, y = predicted)) +
##   geom_point(alpha = .3) +
##   geom_smooth(method="lm", color='grey', alpha=.3, se=FALSE)
## 
## # density plot of validation predictions and targets
## lasso_eval %>%
##   ggplot() +
##   geom_density(aes(x=target), fill='green', alpha=0.25) +
##   geom_density(aes(x=round(predicted,0)), fill='blue', alpha=0.25)
## 
## 
## # Lasso
## m_df <- lasso_eval %>% multi_metric(truth=target, estimate=predicted)
## 
## results_lm_tbl <- results_lm_tbl %>% add_row(tibble_row(
##                       Model = "Lasso",
##                       smape = m_df[[1,3]],
##                       mase = m_df[[2,3]],
##                       mae = m_df[[3,3]],
##                       "RMSE" = m_df[[4,3]],
##                       AIC = as.numeric(AIC(fit.lasso)[minAic]),
##                       BIC = NA,
##                       "Adjusted R2" = NA,
##                       "F-statistic" = NA
##                      ))


## -------------------------------------------------------------------------------------------------------------------------------------------------------------

results_lm_tbl <- tibble(
                      Model = character(),
                      smape = numeric(), 
                      mase = numeric(), 
                      mae = numeric(),
                      "RMSE" = numeric(),
                      AIC = numeric(),
                      BIC = numeric(),
                      "Adjusted R2" = numeric(),
                      "F-statistic" = numeric()
                )


multi_metric <- metric_set(smape, mase,mae, yardstick::rmse)
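
# metric_set() returns a function that computes all four metrics at once; e.g.
# (illustrative): multi_metric(pr1_eval, truth = target, estimate = predicted)
# yields one row per metric, in the order the metrics were supplied above,
# which the index lookups in the chunks below rely on.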




## -------------------------------------------------------------------------------------------------------------------------------------------------------------
## Poisson Regression  1 StepAIC
pr1_df <-pr1_eval %>% multi_metric(truth=target, estimate=predicted)
b <- summary(pr1_aic)

results_lm_tbl <- results_lm_tbl %>% add_row(tibble_row(
                      Model = "Poisson Regression 1",
                      smape = pr1_df[[1,3]],
                      mase = pr1_df[[2,3]],
                      mae = pr1_df[[3,3]],
                      "RMSE" = pr1_df[[4,3]],
                      AIC = b$aic,
                      BIC = BIC(pr1_aic),
                      "Adjusted R2" = NA,
                      "F-statistic" = NA
                     ))



## -------------------------------------------------------------------------------------------------------------------------------------------------------------
## Poisson Regression 2
pr2_df <-pr2_eval %>% multi_metric(truth=target, estimate=predicted)

results_lm_tbl <- results_lm_tbl %>% add_row(tibble_row(
                      Model = "Poisson Regression 2",
                      smape = pr2_df[[1,3]],
                      mase = pr2_df[[2,3]],
                      mae = pr2_df[[3,3]],
                      "RMSE" = pr2_df[[4,3]],
                      AIC = AIC(pr2),
                      BIC = BIC(pr2),
                      "Adjusted R2" = NA,
                      "F-statistic" = NA
                     ))


## -------------------------------------------------------------------------------------------------------------------------------------------------------------
## Negative Binomial 1
nb1_df <-nb1_eval %>% multi_metric(truth=target, estimate=predicted)
b <- summary(nb1)

results_lm_tbl <- results_lm_tbl %>% add_row(tibble_row(
                      Model = "Negative Binomial 1",
                      smape = nb1_df[[1,3]],
                      mase = nb1_df[[2,3]],
                      mae = nb1_df[[3,3]],
                      "RMSE" = nb1_df[[4,3]],
                      AIC = b$aic,
                      BIC = BIC(nb1),
                      "Adjusted R2" = NA,
                      "F-statistic" = NA
                     ))


## -------------------------------------------------------------------------------------------------------------------------------------------------------------
## Negative Binomial 2
nb2_df <-nb2_eval %>% multi_metric(truth=target, estimate=predicted)

results_lm_tbl <- results_lm_tbl %>% add_row(tibble_row(
                      Model = "Negative Binomial 2",
                      smape = nb2_df[[1,3]],
                      mase = nb2_df[[2,3]],
                      mae = nb2_df[[3,3]],
                      "RMSE" = nb2_df[[4,3]],
                      AIC = AIC(nb2),
                      BIC = BIC(nb2),
                      "Adjusted R2" = NA,
                      "F-statistic" = NA
                     ))


## -------------------------------------------------------------------------------------------------------------------------------------------------------------
# Multiple Linear Regression 1
lm1_df <-lm1_eval %>% multi_metric(truth=target, estimate=predicted)
b <- summary(lm1)

results_lm_tbl <- results_lm_tbl %>% add_row(tibble_row(
                      Model = "Multiple Linear Model 1",
                      smape = lm1_df[[1,3]],
                      mase = lm1_df[[2,3]],
                      mae = lm1_df[[3,3]],
                      "RMSE" = nb1_df[[4,3]],
                      AIC = AIC(lm1),
                      BIC = BIC(lm1),
                      "Adjusted R2" = b$adj.r.squared,
                      "F-statistic" = b$fstatistic[[1]]
                     ))


## -------------------------------------------------------------------------------------------------------------------------------------------------------------
# Multiple Linear Regression 2
lm2_df <-lm2_eval %>% multi_metric(truth=target, estimate=predicted)
b <- summary(lm2)

results_lm_tbl <- results_lm_tbl %>% add_row(tibble_row(
                      Model = "Multiple Linear Model 2",
                      smape = lm2_df[[1,3]],
                      mase = lm2_df[[2,3]],
                      mae = lm2_df[[3,3]],
                      "RMSE" = lm2_df[[4,3]],
                      AIC = AIC(lm2),
                      BIC = BIC(lm2),
                      "Adjusted R2" = b$adj.r.squared,
                      "F-statistic" = b$fstatistic[[1]]
                     ))


## -------------------------------------------------------------------------------------------------------------------------------------------------------------
kable(results_lm_tbl, digits=4) %>% 
  kable_styling(bootstrap_options = "basic", position = "center")


## ---- eval=FALSE----------------------------------------------------------------------------------------------------------------------------------------------
## 
## #tab_model(pr1_aic, nb1,
##        #   dv.labels = c('Poisson','Negative Binomial'),
##         #  show.df = FALSE, show.aic = TRUE, show.fstat=TRUE, show.se = TRUE,
##         #  show.ci=FALSE, show.p = TRUE,show.stat=TRUE, digits.p=4)
## 
## 
## #tab_model(lm1, lm2,
##    #       dv.labels = c('Multiple Linear Regression 1','Multiple Linear Regression 2'),
##   #        show.df = FALSE, show.aic = TRUE, show.fstat=TRUE, show.se = TRUE,
##    #       show.ci=FALSE, show.p = TRUE,show.stat=TRUE, digits.p=4)
## 
## 
## #tab_model(pr2, nb2,
##          # dv.labels = c( 'Poisson zeroinfl', 'Negative Binomial zeroinfl'),
##          # show.df = FALSE, show.aic = TRUE, show.fstat=TRUE, show.se = TRUE,
##           #show.zeroinf = TRUE, show.p = TRUE, show.ci=FALSE, show.stat=TRUE,
##          # digits.p=4)
## 


## -------------------------------------------------------------------------------------------------------------------------------------------------------------
eval_df <- try_df_predict %>% dplyr::select(-target)

# predict on the evaluation (prediction) dataset using the selected zero-inflated Poisson model
zero_pois_pred <- predict(pr2, newdata = eval_df, type="response")

eval_df$target_pred <- zero_pois_pred
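
# (Optional, not part of the original output): since the target is a count of
# sample cases, the continuous predictions could be rounded to whole cases in a
# hypothetical extra column, e.g.:
# eval_df$target_pred_cases <- round(zero_pois_pred)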



## -------------------------------------------------------------------------------------------------------------------------------------------------------------
# reorder predictions to front
eval_df <- eval_df %>% 
  relocate(target_pred)

predictions <- eval_df %>% 
  dplyr::select(target_pred)


## ----results_table--------------------------------------------------------------------------------------------------------------------------------------------
DT::datatable(
      eval_df,
      extensions = c('Scroller'),
      options = list(scrollY = 350,
                     scrollX = 500,
                     deferRender = TRUE,
                     scroller = TRUE,
                     dom = 'lBfrtip',
                     fixedColumns = TRUE, 
                     searching = FALSE), 
      rownames = FALSE) 


## ----eval=FALSE-----------------------------------------------------------------------------------------------------------------------------------------------
## write.csv(eval_df, 'eval_predictions.csv', row.names=F)
## write.csv(predictions, 'predictions_only.csv', row.names=F)