A large wine manufacturer (LWM) is considering revisions to its current lineup of product offerings to improve total sales revenue. To aid in the decision-making process, LWM acquired a data set that identifies physical characteristics and ratings for more than 12,000 commercial wines. The data set also quantifies the number of wine cases purchased by distributors after sampling each product.
LWM has contracted with the data analytics team (DAT) to build a model that predicts the number of wine cases sold to distributors, given a set of unique wine characteristics. Because distributor purchases are strongly correlated with sales at restaurants and other retail outlets, where LWM derives the bulk of its revenue, the model output could help LWM determine wine qualities that are associated with strong sales.
The training data set comprises 12,795 observations and 16 variables and is approximately 1.4 MB in size. The variable INDEX is used for observation identification purposes only. The other variables include 14 features and 1 response variable, TARGET.
Twelve of the features describe the chemical properties of wine. The remaining two predictors are rating variables: LabelAppeal refers to the perceived attractiveness of a wine’s product label, while STARS is a subjective assessment of wine quality.
Finally, the output variable, TARGET, is a count measure indicating the number of wine case purchases by distributors.
For a detailed listing of all variable names and descriptions, please refer to Table , located at the end of this report. Subsequent tables and figures are also provided after the body of this document.
| Variable Name | Definition | Theoretical Effect |
|---|---|---|
| INDEX | Identification only | |
| TARGET | Number of Wine Cases Purchased | |
| AcidIndex | Proprietary method of testing total acidity | |
| Alcohol | Alcohol Content | |
| Chlorides | Chloride Content | |
| CitricAcid | Citric Acid Content | |
| Density | Density of wine | |
| FixedAcidity | Fixed acidity of wine | |
| FreeSulfurDioxide | Sulfur Dioxide content of wine | |
| LabelAppeal | Marketing score of label attractiveness | High value suggests high sales |
| ResidualSugar | Residual Sugar of wine | |
| STARS | Expert Rating: 1=Poor through 4=Excellent | High value implies high sales |
| Sulphates | Sulfate content | |
| TotalSulfurDioxide | Total Sulfur Dioxide | |
| VolatileAcidity | Volatile acid content of wine | |
| pH | pH of wine |
There are three categorical features in the wine data set: LabelAppeal, STARS, and AcidIndex. The first two of these features are rating variables; see the Overview section for detailed descriptions. AcidIndex is related to the acidic quality of wine.
LabelAppeal scores range from -2 through 2. Based on the data set documentation, a negative score suggests a poor impression of the label design, a zero score is neutral, and a positive score implies an attractive design. The distribution of scores in the training data is symmetric and roughly bell-shaped.
STARS scores range from 1 through 4, with experts characterizing more wines as low quality (1-2) than high quality (3-4). A significant number of wines in the training data–over 25%–have no STARS score.
AcidIndex is a discrete variable with observed values ranging from 4 through 17. The data documentation describes the calculation of the index value using a proprietary method involving weighted averages. The values are clearly ordinal, but we have little additional information for understanding the meaning of this variable. For example, we cannot determine, based on the description provided, whether an index value of 10 reflects a wine with twice the acidic content of another wine with a value of 5. Unlike the other variables in the unadjusted data set, AcidIndex has a moderate right skew.
Refer to Table for a numerical summary of each categorical variable. Table provides additional descriptive statistics, and Figure contains graphical summaries.
| Variable | Levels | Count | Pct | Cumul. Pct |
|---|---|---|---|---|
| LabelAppeal | -2 | 504 | 3.9 | 3.9 |
| | -1 | 3136 | 24.5 | 28.4 |
| | 0 | 5617 | 43.9 | 72.3 |
| | 1 | 3048 | 23.8 | 96.2 |
| | 2 | 490 | 3.8 | 100 |
| STARS | 1 | 3042 | 23.8 | 23.8 |
| | 2 | 3570 | 27.9 | 51.7 |
| | 3 | 2212 | 17.3 | 69 |
| | 4 | 612 | 4.8 | 73.7 |
| | NA | 3359 | 26.3 | 100 |
| AcidIndex | 4 | 3 | 0 | 0 |
| | 5 | 75 | 0.6 | 0.6 |
| | 6 | 1197 | 9.4 | 10 |
| | 7 | 4878 | 38.1 | 48.1 |
| | 8 | 4142 | 32.4 | 80.5 |
| | 9 | 1427 | 11.2 | 91.6 |
| | 10 | 551 | 4.3 | 95.9 |
| | 11 | 258 | 2 | 97.9 |
| | 12 | 128 | 1 | 98.9 |
| | 13 | 69 | 0.5 | 99.5 |
| | 14 | 47 | 0.4 | 99.8 |
| | 15 | 8 | 0.1 | 99.9 |
| | 16 | 5 | 0 | 99.9 |
| | 17 | 7 | 0.1 | 100 |
We made a judgment call in classifying AcidIndex as a categorical variable, based on the limited information provided in the variable documentation. The variable is clearly discrete, with only 14 unique values in the training data. Another analyst could reasonably argue that this feature should be treated as continuous for modeling purposes.
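For reference, a minimal sketch of the factor conversion is shown below. The data frame name `wine_train` is an assumption used throughout the code sketches in this report, not the actual object name in our scripts.

```r
# Hypothetical sketch: treat the rating-style predictors as factors.
# The data frame name `wine_train` is assumed for illustration.
wine_train$LabelAppeal <- factor(wine_train$LabelAppeal, levels = -2:2)
wine_train$STARS       <- factor(wine_train$STARS, levels = 1:4)  # NA handling is discussed later
wine_train$AcidIndex   <- factor(wine_train$AcidIndex)            # judgment call: categorical, not numeric
```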
There are two additional discrete predictors in the wine data set: TotalSulfurDioxide and FreeSulfurDioxide. Both variables measure related chemical properties. Plots in Figure indicate that the variables are symmetric but have rather fat tails. We are concerned about the negative values exhibited by these variables: sulfur dioxide is typically measured in mg/L (or, equivalently, ppm), which cannot be negative. We also note that both variables have missing values. We’ll address these issues in subsequent sections.
For modeling purposes, we will treat the two sulfur dioxide features like any other continuous predictor. This treatment seems reasonable, given that both predictors take on a very wide array of discrete values.
Nine predictors in the wine data set are continuous, with each variable corresponding to a particular chemical property: FixedAcidity, VolatileAcidity, CitricAcid, ResidualSugar, Chlorides, Density, pH, Sulphates, and Alcohol.
The three predictors related to acidity and the Density variable have a significant number of outlier observations–roughly 2% of each feature’s observations are classified as outliers.
All nine predictors are roughly symmetrical and leptokurtic. Finally, all continuous features–with the exception of pH and Density–exhibit negative values. Based on our cursory research, none of these predictors are measured in units that would include negative values. We will address these issues in a later section.
Descriptive statistics and graphical summaries for the continuous features can be found in Table and Figure , respectively.
The output variable, TARGET, has a zero-inflated distribution, with more than 20% of observations indicating no sales. Counts in the training data range from 0 through 8, although the values are theoretically unbounded above. Finally, the distribution of cases sold, given at least one sale, is symmetric and approximately normal.
Refer to Table and Figure for additional details.
| Variable | Min | Max | Range | Mean | Med | Stdev | Skew | Kurt | NAs | Outliers |
|---|---|---|---|---|---|---|---|---|---|---|
| TARGET | 0.0 | 8.0 | 8.0 | 3.0 | 3.0 | 1.9 | -0.3 | 2.1 | 0 | 0 |
| FixedAcidity | -18.1 | 34.4 | 52.5 | 7.1 | 6.9 | 6.3 | 0.0 | 4.7 | 0 | 195 |
| VolatileAcidity | -2.8 | 3.7 | 6.5 | 0.3 | 0.3 | 0.8 | 0.0 | 4.8 | 0 | 216 |
| CitricAcid | -3.2 | 3.9 | 7.1 | 0.3 | 0.3 | 0.9 | 0.0 | 4.8 | 0 | 215 |
| ResidualSugar | -127.8 | 141.2 | 268.9 | 5.4 | 3.9 | 33.7 | -0.1 | 4.9 | 616 | 0 |
| Chlorides | -1.2 | 1.4 | 2.5 | 0.1 | 0.0 | 0.3 | 0.0 | 4.8 | 638 | 0 |
| FreeSulfurDioxide | -555.0 | 623.0 | 1,178.0 | 30.8 | 30.0 | 148.7 | 0.0 | 4.8 | 647 | 0 |
| TotalSulfurDioxide | -823.0 | 1,057.0 | 1,880.0 | 120.7 | 123.0 | 231.9 | 0.0 | 4.7 | 682 | 0 |
| Density | 0.9 | 1.1 | 0.2 | 1.0 | 1.0 | 0.0 | 0.0 | 4.9 | 0 | 226 |
| pH | 0.5 | 6.1 | 5.7 | 3.2 | 3.2 | 0.7 | 0.0 | 4.6 | 395 | 0 |
| Sulphates | -3.1 | 4.2 | 7.4 | 0.5 | 0.5 | 0.9 | 0.0 | 4.8 | 1,210 | 0 |
| Alcohol | -4.7 | 26.5 | 31.2 | 10.5 | 10.4 | 3.7 | 0.0 | 4.5 | 653 | 0 |
| LabelAppeal | -2.0 | 2.0 | 4.0 | 0.0 | 0.0 | 0.9 | 0.0 | 2.7 | 0 | 0 |
| AcidIndex | 4.0 | 17.0 | 13.0 | 7.8 | 8.0 | 1.3 | 1.6 | 8.2 | 0 | 264 |
| STARS | 1.0 | 4.0 | 3.0 | 2.0 | 2.0 | 0.9 | 0.4 | 2.3 | 3,359 | 0 |
(ref:hist-caption) Histogram of Wine Training Data Variables
[1] "INDEX" "TARGET" "FixedAcidity"
[4] "VolatileAcidity" "CitricAcid" "ResidualSugar"
[7] "Chlorides" "FreeSulfurDioxide" "TotalSulfurDioxide"
[10] "Density" "pH" "Sulphates"
[13] "Alcohol" "LabelAppeal" "AcidIndex"
[16] "STARS"
(ref:hist-caption)
In Figure , we compare the three categorical variables to the TARGET variable. For each of these predictors, there appears to be a significant relationship between the ordered levels and the number of wine cases sold. Unit increases in the LabelAppeal metric are associated with consistent increases in the response. Similarly, increases in the STARS rating correspond with higher wine sales. Interestingly, wines that have not been rated tend to have low sales compared to their rated counterparts. Finally, AcidIndex scores appear to be negatively correlated with the response.
The relationship between the TARGET variable and the continuous variables is less clear. We initially created scatterplots to explore these relationships. Unfortunately, there is significant variability in response values across narrow bands of similar predictor values, so patterns are not easily discerned. As an alternative, we plotted loess curves to get a general sense of the relationship between the predictors and the response; see Figure . The plotted curves are fairly complex, but we noted several mild or moderate patterns.
(ref:bp-caption) Boxplots of TARGET vs. Categorical Variables
(ref:bp-caption)
(ref:lc-caption) Loess Curves of TARGET vs. Continuous Variables
(ref:lc-caption)
In Figure , we produced a correlation matrix of all candidate variables. For computational purposes, we converted all categorical variables back to their original numerical form.
None of the predictors have a particularly strong correlation with the TARGET variable. The STARS predictor has a moderate positive association with the response. LabelAppeal has a moderately weak positive correlation with TARGET, while AcidIndex has a fairly weak negative association with the response. All other predictors have extremely weak linear associations with TARGET.
Finally, the correlations between predictor variables are all weak. In other words, there are no immediately obvious collinearity issues.
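A minimal sketch of how such a matrix could be produced is shown below, assuming the `wine_train` data frame introduced earlier and the corrplot package; the exact plotting options are illustrative.

```r
library(corrplot)

# Hypothetical sketch: place every variable on a numeric scale (factors revert to
# their underlying codes), drop the INDEX identifier, and compute pairwise correlations.
num_train <- wine_train[, setdiff(names(wine_train), "INDEX")]
num_train[] <- lapply(num_train, function(x) as.numeric(as.character(x)))
cor_mat <- cor(num_train, use = "pairwise.complete.obs")
corrplot(cor_mat, method = "circle", type = "lower", tl.cex = 0.7)
```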
(ref:cor-caption) Correlation of Wine Data Variables
(ref:cor-caption)
As discussed previously, none of the predictors related to chemical composition should have negative values, as the units of measurement for these variables are always non-negative.
Based on the descriptive statistics in Table and the histogram plots in Figure , we see a similarity in the magnitudes of the negative and positive ranges for each of these predictors of concern. In other words, the distributions are roughly symmetric about zero.
It is possible that the minus signs in the data reflect data entry errors or some other identifying characteristic, but we have no data documentation to support these hypotheses. On the other hand, we found a resource online (see the References section) that analyzes the distributions of various wine chemical properties. In that resource, most of the chemical-property distributions contained only positive values and were right skewed.
Based on this information, we decided to apply a simple absolute value transformation to the predictors of concern. In total, we applied this transformation to nine predictors: FixedAcidity, VolatileAcidity, CitricAcid, ResidualSugar, Chlorides, FreeSulfurDioxide, TotalSulfurDioxide, Sulphates, and Alcohol.
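A minimal sketch of this step, under the same `wine_train` naming assumption:

```r
# Hypothetical sketch: absolute-value transformation for the chemistry predictors
# that exhibit implausible negative readings.
abs_cols <- c("FixedAcidity", "VolatileAcidity", "CitricAcid", "ResidualSugar",
              "Chlorides", "FreeSulfurDioxide", "TotalSulfurDioxide",
              "Sulphates", "Alcohol")
wine_train[abs_cols] <- lapply(wine_train[abs_cols], abs)
```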
Missing values affect roughly 50% of the records in the training data. They occur in varying degrees across eight of the candidate predictors: STARS, Sulphates, TotalSulfurDioxide, Alcohol, FreeSulfurDioxide, Chlorides, ResidualSugar, and pH. For a summary description of missing elements, refer to Figure .
The STARS variable has the highest number of missing values, with roughly one quarter of observations coded as NA. Given the large number of missing records, and the apparently distinct relationship between the NA level and TARGET relative to the other STARS levels, we decided to retain the NA records as their own level rather than impute them.
The Sulphates variable has missing values in approximately 10% of observations, while the six other predictors have missing values in the 3% - 5% range. Given the relatively small percentages of missing values for the remaining predictors, we will impute values using R’s MICE package. The use of the MICE procedure results in less bias in model regression coefficients compared to simpler methods like mean substitution.
Refer to Figure for a graphical depiction of the MICE imputation procedure. The blue curves represent the distribution of non-missing values, while the red curves are the imputed values. The red and blue distributions are very similar. This result suggests that the imputation procedure approximated the original distribution accurately.
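A minimal sketch of the imputation step is shown below, assuming the `wine_train` data frame; the number of imputations, the seed, and the method shown are illustrative defaults rather than the exact settings used.

```r
library(mice)

# Hypothetical sketch: keep the unrated wines as an explicit STARS level, then
# impute the remaining incomplete predictors with predictive mean matching.
wine_train$STARS <- addNA(wine_train$STARS)

impute_cols <- c("Sulphates", "TotalSulfurDioxide", "Alcohol", "FreeSulfurDioxide",
                 "Chlorides", "ResidualSugar", "pH")
imp <- mice(wine_train[impute_cols], m = 5, method = "pmm", seed = 123, printFlag = FALSE)
densityplot(imp)  # observed vs. imputed distributions, as in the figure referenced above
wine_train[impute_cols] <- complete(imp, 1)
```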
(ref:missing-caption) Plot of Missing Observations in Wine Training Data
(ref:missing-caption)
(ref:mice-caption) Graphical Summary of MICE Imputation Procedure
(ref:mice-caption)
We have made substantial revisions to our predictor variables, so we will review the revised variables in light of the transformations applied thus far.
In Table , we recalculated basic descriptive statistics for all variables. We also generated new histograms, correlation plots, and loess curve plots–see Figures , , and , respectively.
We see that many of the predictor variables now have a moderate right skew. The majority of the predictors are still leptokurtic. Also, our transformed variables indicate a higher prevalence of outlier observations. The revised loess curve plots indicate some significant changes in the relationship between the predictors and the TARGET variable. For instance, sulfur dioxide content now seems to be negatively correlated with wine sales. Interestingly, the loess curves are almost linear or only slightly curved for typical predictor values. However, the curves are very complex and unstable for predictor values that are sparsely populated. Finally, the revised correlation plot indicates no material changes to the original correlations.
| Variable | Min | Max | Range | Mean | Med | Stdev | Skew | Kurt | NAs | Outliers |
|---|---|---|---|---|---|---|---|---|---|---|
| TARGET | 0.0 | 8.0 | 8.0 | 3.0 | 3.0 | 1.9 | -0.3 | 2.1 | 0 | 0 |
| FixedAcidity | 0.0 | 34.4 | 34.4 | 8.1 | 7.0 | 5.0 | 1.2 | 5.0 | 0 | 156 |
| VolatileAcidity | 0.0 | 3.7 | 3.7 | 0.6 | 0.4 | 0.6 | 1.7 | 6.1 | 0 | 239 |
| CitricAcid | 0.0 | 3.9 | 3.9 | 0.7 | 0.4 | 0.6 | 1.6 | 5.9 | 0 | 230 |
| ResidualSugar | 0.0 | 141.2 | 141.2 | 23.4 | 12.9 | 24.9 | 1.5 | 5.2 | 0 | 242 |
| Chlorides | 0.0 | 1.4 | 1.4 | 0.2 | 0.1 | 0.2 | 1.5 | 5.2 | 0 | 235 |
| FreeSulfurDioxide | 0.0 | 623.0 | 623.0 | 107.3 | 56.0 | 108.8 | 1.5 | 5.4 | 0 | 246 |
| TotalSulfurDioxide | 0.0 | 1,057.0 | 1,057.0 | 203.9 | 153.0 | 162.9 | 1.6 | 6.0 | 0 | 202 |
| Density | 0.9 | 1.1 | 0.2 | 1.0 | 1.0 | 0.0 | 0.0 | 4.9 | 0 | 226 |
| pH | 0.5 | 6.1 | 5.7 | 3.2 | 3.2 | 0.7 | 0.0 | 4.6 | 0 | 189 |
| Sulphates | 0.0 | 4.2 | 4.2 | 0.8 | 0.6 | 0.7 | 1.7 | 6.2 | 0 | 211 |
| Alcohol | 0.0 | 26.5 | 26.5 | 10.5 | 10.4 | 3.6 | 0.2 | 4.0 | 0 | 85 |
| LabelAppeal | -2.0 | 2.0 | 4.0 | 0.0 | 0.0 | 0.9 | 0.0 | 2.7 | 0 | 0 |
| AcidIndex | 4.0 | 17.0 | 13.0 | 7.8 | 8.0 | 1.3 | 1.6 | 8.2 | 0 | 264 |
| STARS | 1.0 | 4.0 | 3.0 | 2.0 | 2.0 | 0.9 | 0.4 | 2.3 | 3,359 | 0 |
(ref:revhisto) Revised Histogram of Wine Training Data Variables
(ref:revhisto)
(ref:cor-caption-rev) Revised Correlation of Wine Data Variables
(ref:cor-caption-rev)
(ref:lccaprev) Revised Loess Curves of TARGET vs. Continuous Variables
(ref:lccaprev)
We previously decided to treat AcidIndex as an ordered, categorical variable. Because the index values are sparsely populated at the extreme low and high ends, we will bin the categories into fewer groupings. Specifically, we will merge values of 4 and 5 into the level 6 category, and we will merge values of 11 and higher into the level 10 category. This decision is also supported by the boxplot in Figure , where the extreme low and high index values appear to have a similar relationship with the TARGET variable.
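A minimal sketch of the binning step, again assuming the `wine_train` naming:

```r
# Hypothetical sketch: collapse the sparse extreme AcidIndex levels.
# Values of 4-5 are merged into level 6, and values of 11 and above into level 10.
idx <- as.numeric(as.character(wine_train$AcidIndex))
wine_train$AcidIndex <- factor(pmin(pmax(idx, 6), 10), levels = 6:10)
```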
We contemplated applying the power transformations suggested by the Box-Cox procedure to some of the predictor variables. Because many of the predictors have a moderate right skew, we could reasonably apply log, square root, and/or quarter-root transformations. These transformations may make sense in a basic OLS model; however, we are building a variety of GLMs in addition to OLS models, and most GLMs do not require such transformations. We ultimately decided to refrain from applying any power transformations at this point in the modeling process.
We also debated whether or not to winsorize some of the candidate predictors, given the prevalence of outlier observations. While this procedure can soften the influence of outlier observations, it can also introduce bias in the point estimates and inference mechanisms. We therefore decided to forgo winsorizing the predictors.
Finally, we noted that our client, LWM, requires an interpretable model. By refraining from applying complex predictor transformations, our model should remain relatively easy to explain.
For our first linear regression model, we decided to include the following four predictors: AcidIndex, STARS, LabelAppeal, and Alcohol. We included each of the categorical predictors because of their clear relationship with TARGET; see the boxplots in Figure . Also, the correlation plot (Figure ) indicated higher-magnitude correlations for these variables compared to the other candidate predictors.
We included the Alcohol variable for a variety of reasons. First, we recognize that our absolute value transformation may not have been the best approach to handling negative values; fortunately, nearly all of the raw Alcohol values were positive. Second, this variable is not significantly skewed and has only mildly fat tails. We also observe fewer outliers with this variable compared to many other predictors. Finally, Alcohol's relationship with TARGET (refer to Figure ) appears to be more straightforward than the other relationships.
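A fit along these lines might look like the sketch below; the model object name is illustrative.

```r
# Hypothetical sketch of Model 1: ordinary least squares with the two rating
# variables, the binned acidity index, and alcohol content.
mod1 <- lm(TARGET ~ AcidIndex + STARS + LabelAppeal + Alcohol, data = wine_train)
summary(mod1)  # adjusted R-squared and t statistics
confint(mod1)  # 95% confidence intervals, as reported in the regression table
```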
See Table for regression output. The model has an adjusted \(R^2\) value of 0.54. All predictors and levels within each categorical predictor are significant at the 5% level. The model coefficients make intuitive sense. Sales of wine decrease with each increase in the AcidIndex value. Having a STARS rating results in more sales compared to no ratings, and higher STARS ratings are associated with higher sales. Finally, increases in alcohol content are associated with higher sales. This is consistent with observations from our EDA work.
| Predictor | \(b\) | 95% CI | \(t(12781)\) | \(p\) |
|---|---|---|---|---|
| Intercept | 0.62 | $[0.47$, $0.78]$ | 7.93 | < .001 |
| AcidIndex7 | -0.09 | $[-0.17$, $-0.01]$ | -2.14 | .032 |
| AcidIndex8 | -0.20 | $[-0.28$, $-0.12]$ | -4.70 | < .001 |
| AcidIndex9 | -0.51 | $[-0.61$, $-0.41]$ | -10.04 | < .001 |
| AcidIndex10 | -1.06 | $[-1.17$, $-0.95]$ | -19.27 | < .001 |
| STARS1 | 1.36 | $[1.30$, $1.43]$ | 41.27 | < .001 |
| STARS2 | 2.41 | $[2.34$, $2.47]$ | 75.05 | < .001 |
| STARS3 | 2.97 | $[2.90$, $3.04]$ | 79.95 | < .001 |
| STARS4 | 3.66 | $[3.55$, $3.78]$ | 61.73 | < .001 |
| LabelAppeal-1 | 0.36 | $[0.24$, $0.49]$ | 5.76 | < .001 |
| LabelAppeal0 | 0.83 | $[0.71$, $0.95]$ | 13.49 | < .001 |
| LabelAppeal1 | 1.29 | $[1.17$, $1.42]$ | 20.15 | < .001 |
| LabelAppeal2 | 1.88 | $[1.71$, $2.05]$ | 22.25 | < .001 |
| Alcohol | 0.01 | $[0.01$, $0.02]$ | 3.92 | < .001 |
For our second multiple linear regression model, we used stepwise regression with variable selection occurring in both directions. Model output can be found in Table . This second model has an adjusted \(R^2\) of 0.54 and retains 12 of the 14 candidate predictors; only FixedAcidity and ResidualSugar were excluded from the model.
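Before turning to that output, a hypothetical sketch of the selection call, starting from all candidate predictors and letting AIC guide the search:

```r
# Hypothetical sketch of Model 2: stepwise selection in both directions,
# starting from a model with every candidate predictor (INDEX excluded).
full_fit <- lm(TARGET ~ . - INDEX, data = wine_train)
mod2 <- step(full_fit, direction = "both", trace = 0)
summary(mod2)
```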
| Predictor | \(b\) | 95% CI | \(t(12773)\) | \(p\) |
|---|---|---|---|---|
| Intercept | 1.69 | $[0.82$, $2.56]$ | 3.80 | < .001 |
| STARS1 | 1.36 | $[1.29$, $1.42]$ | 41.13 | < .001 |
| STARS2 | 2.39 | $[2.33$, $2.46]$ | 74.66 | < .001 |
| STARS3 | 2.96 | $[2.88$, $3.03]$ | 79.68 | < .001 |
| STARS4 | 3.65 | $[3.54$, $3.77]$ | 61.65 | < .001 |
| LabelAppeal-1 | 0.36 | $[0.23$, $0.48]$ | 5.67 | < .001 |
| LabelAppeal0 | 0.82 | $[0.70$, $0.94]$ | 13.45 | < .001 |
| LabelAppeal1 | 1.29 | $[1.16$, $1.41]$ | 20.09 | < .001 |
| LabelAppeal2 | 1.88 | $[1.71$, $2.04]$ | 22.25 | < .001 |
| AcidIndex7 | -0.09 | $[-0.17$, $-0.01]$ | -2.25 | .024 |
| AcidIndex8 | -0.20 | $[-0.28$, $-0.12]$ | -4.81 | < .001 |
| AcidIndex9 | -0.51 | $[-0.61$, $-0.41]$ | -10.00 | < .001 |
| AcidIndex10 | -1.05 | $[-1.16$, $-0.94]$ | -19.02 | < .001 |
| VolatileAcidity | -0.11 | $[-0.15$, $-0.07]$ | -5.22 | < .001 |
| Alcohol | 0.01 | $[0.01$, $0.02]$ | 4.08 | < .001 |
| TotalSulfurDioxide | 0.00 | $[0.00$, $0.00]$ | 2.73 | .006 |
| Sulphates | -0.04 | $[-0.07$, $-0.01]$ | -2.25 | .024 |
| Density | -0.92 | $[-1.78$, $-0.07]$ | -2.11 | .035 |
| FreeSulfurDioxide | 0.00 | $[0.00$, $0.00]$ | 1.87 | .062 |
| Chlorides | -0.09 | $[-0.19$, $0.01]$ | -1.82 | .069 |
| PH | -0.03 | $[-0.06$, $0.00]$ | -1.74 | .082 |
| CitricAcid | 0.03 | $[-0.01$, $0.07]$ | 1.53 | .127 |
Several predictors contribute relatively little to the model. We created a reduced model that removes pH, CitricAcid, and Sulphates; see Table for the regression output of this smaller model. The adjusted \(R^2\) is still 0.54, and the remaining coefficients are significant at or near the 5% level.
| Predictor | \(b\) | 95% CI | \(t(12776)\) | \(p\) |
|---|---|---|---|---|
| Intercept | 1.60 | $[0.73$, $2.46]$ | 3.62 | < .001 |
| STARS1 | 1.36 | $[1.29$, $1.42]$ | 41.22 | < .001 |
| STARS2 | 2.40 | $[2.33$, $2.46]$ | 74.75 | < .001 |
| STARS3 | 2.96 | $[2.89$, $3.03]$ | 79.77 | < .001 |
| STARS4 | 3.65 | $[3.54$, $3.77]$ | 61.69 | < .001 |
| LabelAppeal-1 | 0.36 | $[0.24$, $0.48]$ | 5.71 | < .001 |
| LabelAppeal0 | 0.83 | $[0.71$, $0.95]$ | 13.47 | < .001 |
| LabelAppeal1 | 1.29 | $[1.16$, $1.41]$ | 20.13 | < .001 |
| LabelAppeal2 | 1.88 | $[1.71$, $2.04]$ | 22.26 | < .001 |
| AcidIndex7 | -0.09 | $[-0.17$, $-0.01]$ | -2.19 | .029 |
| AcidIndex8 | -0.20 | $[-0.28$, $-0.12]$ | -4.71 | < .001 |
| AcidIndex9 | -0.50 | $[-0.60$, $-0.40]$ | -9.93 | < .001 |
| AcidIndex10 | -1.04 | $[-1.15$, $-0.94]$ | -18.98 | < .001 |
| VolatileAcidity | -0.11 | $[-0.15$, $-0.07]$ | -5.27 | < .001 |
| Alcohol | 0.01 | $[0.01$, $0.02]$ | 4.09 | < .001 |
| TotalSulfurDioxide | 0.00 | $[0.00$, $0.00]$ | 2.74 | .006 |
| Density | -0.95 | $[-1.80$, $-0.09]$ | -2.17 | .030 |
| FreeSulfurDioxide | 0.00 | $[0.00$, $0.00]$ | 1.91 | .057 |
| Chlorides | -0.09 | $[-0.19$, $0.00]$ | -1.90 | .058 |
Let’s perform an ANOVA test to see whether the full model is significantly different from the reduced model; refer to Table .
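A comparison along these lines can be produced with a call like the following; the object names and the pH column spelling are assumptions.

```r
# Hypothetical sketch: F test comparing the reduced Model 2.1 against the full
# stepwise Model 2 (the reduced fit drops pH, CitricAcid, and Sulphates).
mod2.1 <- update(mod2, . ~ . - pH - CitricAcid - Sulphates)
anova(mod2.1, mod2)
```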
| Res. Df | RSS | Df | Sum of Sq | F | Pr(>F) |
|---|---|---|---|---|---|
| 12,776.00 | 21,800.44 | NA | NA | NA | NA |
| 12,773.00 | 21,782.73 | 3.00 | 17.71 | 3.46 | 0.02 |
The p-value of the ANOVA test is approximately 2%. For the sake of parsimony and interpretability, we proceed with the reduced Model 2.1.
All coefficients included in Model 2 that were also in Model 1 have similar signs and magnitudes as those in the prior model. Furthermore, the signs of the additional predictors are consistent with results from our EDA. For instance, we expected the number of wine cases sold to decrease with increases in VolatileAcidity. We also expected some improvement in TARGET output with increases in TotalSulfurDioxide. Finally, our EDA (see Figure ) indicated a general decrease in TARGET values with increases in Chlorides and Density. This is consistent with the negative model coefficients observed in Model 2.1.
A Poisson model is potentially more appropriate for our modeling problem because our output variable is a count measure.
In a Poisson-based model, we assume the dependent variable’s mean is identical to its variance. Let’s briefly check this assumption by comparing the unconditional mean and variance of the TARGET variable. The variable’s mean is 3.03 and the variance is 3.71. Because the variance is somewhat higher than the mean, there may be an issue with overdispersion.
Let’s start by building a Poisson GLM using the same variables from Model 1. Model output is provided in Table . The AIC value is 45604. With the exception of one AcidIndex dummy variable, all coefficients are statistically significant, with signs and magnitudes that are consistent with our understanding of the predictors’ relationships with TARGET.
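A minimal sketch of the dispersion check and the Poisson fit, under the same naming assumptions as the earlier sketches:

```r
# Hypothetical sketch: unconditional mean/variance check, then Model 3 as a
# Poisson GLM with the Model 1 predictors.
mean(wine_train$TARGET)  # ~3.03
var(wine_train$TARGET)   # ~3.71

mod3 <- glm(TARGET ~ AcidIndex + STARS + LabelAppeal + Alcohol,
            family = poisson(link = "log"), data = wine_train)
summary(mod3)
AIC(mod3)
```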
| Predictor | \(b\) | 95% CI | \(z\) | \(p\) |
|---|---|---|---|---|
| Intercept | -0.14 | $[-0.23$, $-0.06]$ | -3.26 | .001 |
| AcidIndex7 | -0.03 | $[-0.06$, $0.00]$ | -1.71 | .088 |
| AcidIndex8 | -0.06 | $[-0.10$, $-0.03]$ | -3.48 | .001 |
| AcidIndex9 | -0.17 | $[-0.22$, $-0.13]$ | -7.70 | < .001 |
| AcidIndex10 | -0.48 | $[-0.53$, $-0.42]$ | -16.49 | < .001 |
| STARS1 | 0.76 | $[0.72$, $0.80]$ | 38.95 | < .001 |
| STARS2 | 1.08 | $[1.05$, $1.12]$ | 59.38 | < .001 |
| STARS3 | 1.20 | $[1.16$, $1.24]$ | 62.58 | < .001 |
| STARS4 | 1.32 | $[1.28$, $1.37]$ | 54.50 | < .001 |
| LabelAppeal-1 | 0.24 | $[0.16$, $0.31]$ | 6.27 | < .001 |
| LabelAppeal0 | 0.43 | $[0.36$, $0.50]$ | 11.56 | < .001 |
| LabelAppeal1 | 0.56 | $[0.49$, $0.64]$ | 14.88 | < .001 |
| LabelAppeal2 | 0.70 | $[0.61$, $0.78]$ | 16.44 | < .001 |
| Alcohol | 0.00 | $[0.00$, $0.01]$ | 2.76 | .006 |
As with Model 2, we also fit a Poisson model using stepwise regression in both directions (Model 4). Model output is provided in Table . Based on this output, we will consider removing three predictors: TotalSulfurDioxide, Chlorides, and Density.

| Predictor | \(b\) | 95% CI | \(z\) | \(p\) |
|---|---|---|---|---|
| Intercept | 0.23 | $[-0.16$, $0.62]$ | 1.16 | .244 |
| STARS1 | 0.76 | $[0.72$, $0.80]$ | 38.84 | < .001 |
| STARS2 | 1.08 | $[1.04$, $1.12]$ | 59.11 | < .001 |
| STARS3 | 1.20 | $[1.16$, $1.24]$ | 62.31 | < .001 |
| STARS4 | 1.32 | $[1.27$, $1.37]$ | 54.35 | < .001 |
| LabelAppeal-1 | 0.24 | $[0.16$, $0.31]$ | 6.22 | < .001 |
| LabelAppeal0 | 0.43 | $[0.36$, $0.50]$ | 11.52 | < .001 |
| LabelAppeal1 | 0.56 | $[0.49$, $0.63]$ | 14.84 | < .001 |
| LabelAppeal2 | 0.70 | $[0.61$, $0.78]$ | 16.42 | < .001 |
| AcidIndex7 | -0.03 | $[-0.06$, $0.00]$ | -1.79 | .073 |
| AcidIndex8 | -0.06 | $[-0.10$, $-0.03]$ | -3.55 | < .001 |
| AcidIndex9 | -0.17 | $[-0.22$, $-0.13]$ | -7.64 | < .001 |
| AcidIndex10 | -0.47 | $[-0.53$, $-0.42]$ | -16.32 | < .001 |
| VolatileAcidity | -0.04 | $[-0.05$, $-0.02]$ | -3.84 | < .001 |
| Alcohol | 0.00 | $[0.00$, $0.01]$ | 2.85 | .004 |
| Sulphates | -0.02 | $[-0.03$, $0.00]$ | -1.93 | .053 |
| TotalSulfurDioxide | 0.00 | $[0.00$, $0.00]$ | 2.00 | .045 |
| Density | -0.31 | $[-0.68$, $0.07]$ | -1.60 | .110 |
| Chlorides | -0.03 | $[-0.08$, $0.01]$ | -1.50 | .134 |
| PH | -0.01 | $[-0.03$, $0.00]$ | -1.43 | .152 |
We’ll build a reduced model with these three variables removed. Output can be found in Table .
| Predictor | \(b\) | 95% CI | \(z\) | \(p\) |
|---|---|---|---|---|
| Intercept | -0.07 | $[-0.17$, $0.03]$ | -1.32 | .187 |
| STARS1 | 0.76 | $[0.72$, $0.80]$ | 38.87 | < .001 |
| STARS2 | 1.08 | $[1.04$, $1.12]$ | 59.17 | < .001 |
| STARS3 | 1.20 | $[1.16$, $1.24]$ | 62.39 | < .001 |
| STARS4 | 1.32 | $[1.27$, $1.37]$ | 54.36 | < .001 |
| LabelAppeal-1 | 0.24 | $[0.16$, $0.31]$ | 6.22 | < .001 |
| LabelAppeal0 | 0.43 | $[0.35$, $0.50]$ | 11.51 | < .001 |
| LabelAppeal1 | 0.56 | $[0.49$, $0.63]$ | 14.84 | < .001 |
| LabelAppeal2 | 0.70 | $[0.61$, $0.78]$ | 16.39 | < .001 |
| AcidIndex7 | -0.03 | $[-0.06$, $0.00]$ | -1.82 | .069 |
| AcidIndex8 | -0.06 | $[-0.10$, $-0.03]$ | -3.58 | < .001 |
| AcidIndex9 | -0.17 | $[-0.22$, $-0.13]$ | -7.72 | < .001 |
| AcidIndex10 | -0.48 | $[-0.53$, $-0.42]$ | -16.46 | < .001 |
| VolatileAcidity | -0.04 | $[-0.06$, $-0.02]$ | -3.90 | < .001 |
| Alcohol | 0.00 | $[0.00$, $0.01]$ | 2.81 | .005 |
| Sulphates | -0.02 | $[-0.03$, $0.00]$ | -1.98 | .048 |
| PH | -0.01 | $[-0.03$, $0.00]$ | -1.44 | .151 |
We’ll now perform an analysis of deviance test to compare the full and reduced models; see Table . The p-value for this test is roughly 3%, so we will proceed with the full model.
| Resid. Df | Resid. Dev | Df | Deviance | Pr(>Chi) |
|---|---|---|---|---|
| 12,778.00 | 13,614.22 | NA | NA | NA |
| 12,775.00 | 13,605.48 | 3.00 | 8.74 | 0.03 |
We noted earlier that the TARGET variable is zero-inflated. Therefore, it makes sense to attempt to fit a zero-inflated Poisson regression model.
We’ll use the same set of predictors from Models 2.1 and 4. Model output is located in Table 13.
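A minimal sketch of the call, assuming the pscl package; the predictor list shown is illustrative rather than the exact set retained.

```r
library(pscl)

# Hypothetical sketch of Model 5: zero-inflated Poisson. Without an explicit "|"
# separator in the formula, zeroinfl() uses the same regressors for both the
# count component and the zero-inflation component.
mod5 <- zeroinfl(TARGET ~ STARS + LabelAppeal + AcidIndex + VolatileAcidity +
                   Alcohol + TotalSulfurDioxide + Chlorides + Density,
                 data = wine_train, dist = "poisson")
summary(mod5)
```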
While the signs and magnitudes of the coefficients are consistent with our understanding of these variables, we see four variables in the Poisson (log link) component of the model with p-values greater than 5%.
Model 6 simply removes the insignificant predictors from Model 5. Output can be found in Table 15.
All coefficients in the resulting model–with the exception of one dummy variable relating to AcidIndex–are significant.
In Model 7, we build a negative binomial regression model. This model type seems appropriate given that the data is possibly overdispersed.
We’ll begin by using the small subset of predictors used in Model 1. Output is located in Table 17.
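A minimal sketch of the fit, using MASS::glm.nb with the assumed naming:

```r
library(MASS)

# Hypothetical sketch of Model 7: negative binomial regression with the Model 1
# predictors; glm.nb() estimates the dispersion parameter theta along with the
# regression coefficients.
mod7 <- glm.nb(TARGET ~ AcidIndex + STARS + LabelAppeal + Alcohol, data = wine_train)
summary(mod7)
```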
All coefficients are significant. The values of the coefficients are consistent with previous models examined.
In the next model, we fit a negative binomial model using stepwise regression. Model output is in Table 18. All categorical predictors are included in the final model, as well as VolatileAcidity, Alcohol, TotalSulfurDioxide, Chlorides, and Density. Of these predictors, TotalSulfurDioxide, Chlorides, and Density are not significant at the 5% threshold.
We’ll produce a reduced model with the three predictors removed. Output can be found in Table 19.
Now we’ll do an analysis of deviance test to compare the full and reduced models–see Table 20 for details.
| theta | Resid. df | 2 x log-lik. | Test | df | LR stat. | Pr(Chi) |
|---|---|---|---|---|---|---|
| 40,747.48 | 12,778.00 | -45,556.66 | NA | NA | NA | NA |
| 40,786.10 | 12,775.00 | -45,547.92 | 1 vs 2 | 3.00 | 8.74 | 0.03 |
We proceed with the full model, given the p-value of 3%.
Next, we build a zero-inflated negative binomial model. We’ll start with the set of candidate predictors used in Model 8; refer to Table 21 for detailed output.
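A minimal sketch of this fit, reusing the naming and illustrative predictor set from the Model 5 sketch:

```r
library(pscl)

# Hypothetical sketch of Model 9: zero-inflated negative binomial.
mod9 <- zeroinfl(TARGET ~ STARS + LabelAppeal + AcidIndex + VolatileAcidity +
                   Alcohol + TotalSulfurDioxide + Chlorides + Density,
                 data = wine_train, dist = "negbin")
summary(mod9)
```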
All variables are significant except VolatileAcidity, TotalSulfurDioxide, Chlorides, and Density.
For our last model, we’ll start with Model 9, but remove the predictors that were not statistically significant. Full model output is in Table 23.
All predictors are statistically significant in this model.
We have 10 models in total to compare. We need a common set of statistics that we can use to compare each model. We are able to calculate BIC for 6 out of the 10 models, and AIC for all 10 models. Refer to Table 25 for a full model comparison summary.
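The comparison statistics can be assembled with calls like the following; the list below includes only the model objects sketched in this report, and the tryCatch guard reflects that BIC may not be available for every model class.

```r
# Hypothetical sketch: AIC, BIC, and log-likelihood for a subset of the fitted models.
models <- list(mod1 = mod1, mod3 = mod3, mod5 = mod5, mod7 = mod7, mod9 = mod9)
data.frame(
  AIC    = sapply(models, AIC),
  BIC    = sapply(models, function(m) tryCatch(BIC(m), error = function(e) NA)),
  loglik = sapply(models, function(m) as.numeric(logLik(m)))
)
```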
| Model | AIC | BIC | Log-likelihood |
|---|---|---|---|
| Model 1 | 43,207.25 | 43,319.10 | -21,588.63 |
| Model 2.1 | 43,168.78 | 43,317.92 | -21,564.39 |
| Model 3 | 45,605.86 | 45,710.25 | -22,788.93 |
| Model 4 | 45,587.50 | 45,736.64 | -22,773.75 |
| Model 5 | 40,818.21 | NA | -20,373.10 |
| Model 6 | 40,833.28 | NA | -20,386.64 |
| Model 7 | 45,608.28 | 45,720.13 | -22,789.14 |
| Model 8 | 45,589.92 | 45,746.51 | -22,773.96 |
| Model 9 | 40,820.21 | NA | -20,373.10 |
| Model 10 | 40,846.96 | NA | -20,394.48 |
Using AIC, we see that the multiple linear regression models (Models 1 and 2.1) outperform the basic Poisson models (Models 3 and 4) and the negative binomial models (Models 7 and 8).
This result is somewhat surprising, as we suspected that the negative binomial models would be superior to the other model types. Interestingly, the basic Poisson and negative binomial models performed about the same.
On the other hand, the zero-inflated Poisson models (Models 5 and 6) and the zero-inflated negative binomial models (Models 9 and 10) performed best overall. Within each pair, the models that retained the statistically insignificant predictors (Models 5 and 9) had the lower AIC values.
While Models 5 and 9 have virtually identical AIC measures, we select Model 9, the zero-inflated negative binomial model, as our model to use going forward, due to possible overdispersion issues with the data.
We applied the same imputation and transformation procedures used for the training data to our evaluation data set.
Next, we predicted TARGET values using Model 9, one of our zero-inflated Negative Binomial models.
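A minimal sketch of the scoring step; the evaluation data frame name `wine_eval` is an assumption, while the output file name matches the file linked below.

```r
# Hypothetical sketch: score the prepared evaluation data with Model 9 and
# write out the predictions.
wine_eval$TARGET_pred <- predict(mod9, newdata = wine_eval, type = "response")
write.csv(wine_eval, "test_data_with_predictions.csv", row.names = FALSE)
```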
Finally, we uploaded the prediction file to GitHub:
https://github.com/niteen11/MSDS/blob/master/DATA621/Week%205/test_data_with_predictions.csv
As a final check, we’ll compare the distribution of the training data TARGET variable with the distribution of the predicted response values in the test data set; refer to Table for a summary.
| Variable | Min | Max | Range | Mean | Med | Stdev | Skew | Kurt | NAs | Outliers |
|---|---|---|---|---|---|---|---|---|---|---|
| train.TARGET | 0.00 | 8.00 | 8.00 | 3.03 | 3.00 | 1.93 | -0.33 | 2.12 | 0.00 | 0.00 |
| test.TARGET | 0.14 | 7.09 | 6.95 | 3.06 | 3.02 | 1.48 | 0.12 | 2.25 | 0.00 | 0.00 |
The distributions are fairly similar. The means and medians are almost identical, and the kurtosis values are close. The standard deviation and range of the predicted TARGET values are slightly lower than the comparable statistics in the training data.
We produced an interpretable model that LWM can use to guide potential modifications to its current wine offerings.
Based on our modeling process, we’ve determined that high expert ratings and attractive label designs are important factors that are routinely associated with high wine sales. Wines that are less acidic tend to sell better, and higher alcohol content is associated with above-average sales.
Additional attributes that are common in high-selling wines include low concentrations of volatile acids, chlorides, and total sulfur dioxide. Finally, low-density wines tend to sell better than high-density wines.