A large wine manufacturer (LWM) is considering revisions to its current lineup of product offerings to improve total sales revenue. To aid in the decision-making process, LWM acquired a data set that identifies physical characteristics and ratings for more than 12,000 commercial wines. The data set also quantifies the number of wine cases purchased by distributors after sampling each product.
LWM has contracted with the data analytics team (DAT) to build a model that predicts the number of wine cases sold to distributors, given a set of unique wine characteristics. Because distributor purchases are strongly correlated with sales at restaurants and other retail outlets, where LWM derives the bulk of its revenue, the model output could help LWM determine wine qualities that are associated with strong sales.
The training data set comprises 12,795 observations and 16 variables and is approximately 1.4 MB in size. The variable INDEX is used for observation identification purposes only. The other variables include 14 features and 1 response variable, TARGET.
Twelve of the features describe the chemical properties of wine. The remaining two predictors are rating variables: LabelAppeal refers to the perceived attractiveness of a wine’s product label, while STARS is a subjective assessment of wine quality.
Finally, the output variable, TARGET, is a count measure indicating the number of wine case purchases by distributors.
For a detailed listing of all variable names and descriptions, please refer to Table , located at the end of this report. Subsequent tables and figures are also provided after the body of this document.
| Variable Name | Definition | Theoretical Effect |
|---|---|---|
| INDEX | Identification only | |
| TARGET | Number of Wine Cases Purchased | |
| AcidIndex | Proprietary method of testing total acidity | |
| Alcohol | Alcohol Content | |
| Chlorides | Chloride Content | |
| CitricAcid | Citric Acid Content | |
| Density | Density of wine | |
| FixedAcidity | Fixed acidity of wine | |
| FreeSulfurDioxide | Sulfur Dioxide content of wine | |
| LabelAppeal | Marketing score of label attractiveness | High value suggests high sales |
| ResidualSugar | Residual Sugar of wine | |
| STARS | Expert Rating: 1=Poor through 4=Excellent | High value implies high sales |
| Sulphates | Sulfate content | |
| TotalSulfurDioxide | Total Sulfur Dioxide | |
| VolatileAcidity | Volatile acid content of wine | |
| pH | pH of wine |
There are three categorical features in the wine data set: LabelAppeal, STARS, and AcidIndex. The first two of these features are rating variables; see the Overview section for detailed descriptions. AcidIndex is related to the acidic quality of wine.
LabelAppeal scores range from -2 through 2. Based on the data set documentation, a negative score suggests a poor impression of the label design, a zero score is neutral, and a positive score implies an attractive design. The distribution of scores in the training data is symmetric and roughly bell-shaped.
STARS scores range from 1 through 4, with experts characterizing more wines as low quality (1-2) than high quality (3-4). A significant number of wines in the training data–over 25%–have no STARS score.
AcidIndex is a discrete variable with observed values ranging from 4 through 17. The data documentation describes the calculation of the index value using a proprietary method involving weighted averages. The values are clearly ordinal, but we have little additional information for understanding the meaning of this variable. For example, we cannot determine, based on the description provided, whether an index value of 10 reflects a wine with twice the acidic content of another wine with a value of 5. Unlike the other variables in the unadjusted data set, AcidIndex has a moderate right skew.
Refer to Table for a numerical summary of each categorical variable. Table provides additional descriptive statistics, and Figure contains graphical summaries.
| Variable | Levels | Count | Pct | Cumul. Pct |
|---|---|---|---|---|
| LabelAppeal | -2 | 504 | 3.9 | 3.9 |
| | -1 | 3136 | 24.5 | 28.4 |
| | 0 | 5617 | 43.9 | 72.3 |
| | 1 | 3048 | 23.8 | 96.2 |
| | 2 | 490 | 3.8 | 100 |
| STARS | 1 | 3042 | 23.8 | 23.8 |
| | 2 | 3570 | 27.9 | 51.7 |
| | 3 | 2212 | 17.3 | 69 |
| | 4 | 612 | 4.8 | 73.7 |
| | NA | 3359 | 26.3 | 100 |
| AcidIndex | 4 | 3 | 0 | 0 |
| | 5 | 75 | 0.6 | 0.6 |
| | 6 | 1197 | 9.4 | 10 |
| | 7 | 4878 | 38.1 | 48.1 |
| | 8 | 4142 | 32.4 | 80.5 |
| | 9 | 1427 | 11.2 | 91.6 |
| | 10 | 551 | 4.3 | 95.9 |
| | 11 | 258 | 2 | 97.9 |
| | 12 | 128 | 1 | 98.9 |
| | 13 | 69 | 0.5 | 99.5 |
| | 14 | 47 | 0.4 | 99.8 |
| | 15 | 8 | 0.1 | 99.9 |
| | 16 | 5 | 0 | 99.9 |
| | 17 | 7 | 0.1 | 100 |
We made a judgment call in classifying AcidIndex as a categorical variable, based on the limited information provided in the variable documentation. The variable is clearly discrete, with only 14 unique values in the training data. Another analyst could reasonably argue that this feature should be treated as continuous for modeling purposes.
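For reference, a minimal sketch of the factor conversion is shown below. The data frame name `wine_train` is an assumption used throughout the code sketches in this report, not the actual object name in our scripts.

```r
# Hypothetical sketch: treat the rating-style predictors as factors.
# The data frame name `wine_train` is assumed for illustration.
wine_train$LabelAppeal <- factor(wine_train$LabelAppeal, levels = -2:2)
wine_train$STARS       <- factor(wine_train$STARS, levels = 1:4)  # NA handling is discussed later
wine_train$AcidIndex   <- factor(wine_train$AcidIndex)            # judgment call: categorical, not numeric
```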
There are two additional discrete predictors in the wine data set: TotalSulfurDioxide and FreeSulfurDioxide. Both variables measure related chemical properties. Plots in Figure indicate that the variables are symmetric but have rather fat tails. We are concerned about the negative values exhibited by these variables: sulfur dioxide is typically measured in mg/L (or, equivalently, ppm), which cannot be negative. We also note that both variables have missing values. We’ll address these issues in subsequent sections.
For modeling purposes, we will treat the two sulfur dioxide features like any other continuous predictor. This treatment seems reasonable, given that both predictors take on a very wide array of discrete values.
Nine predictors in the wine data set are continuous, with each variable corresponding to a particular chemical property: FixedAcidity, VolatileAcidity, CitricAcid, ResidualSugar, Chlorides, Density, pH, Sulphates, and Alcohol.
The three predictors related to acidity and the Density variable have a significant number of outlier observations–roughly 2% of each feature’s observations are classified as outliers.
All nine predictors are roughly symmetrical and leptokurtic. Finally, all continuous features–with the exception of pH and Density–exhibit negative values. Based on our cursory research, none of these predictors are measured in units that would include negative values. We will address these issues in a later section.
Descriptive statistics and graphical summaries for the continuous features can be found in Table and Figure , respectively.
The output variable, TARGET, has a zero-inflated distribution, with more than 20% of observations indicating no sales. Counts in the training data range from 0 through 8, although the values are theoretically unbounded above. Finally, the distribution of cases sold, given at least one sale, is symmetric and approximately normal.
Refer to Table and Figure for additional details.
| Variable | Min | Max | Range | Mean | Med | Stdev | Skew | Kurt | NAs | Outliers |
|---|---|---|---|---|---|---|---|---|---|---|
| TARGET | 0.0 | 8.0 | 8.0 | 3.0 | 3.0 | 1.9 | -0.3 | 2.1 | 0 | 0 |
| FixedAcidity | -18.1 | 34.4 | 52.5 | 7.1 | 6.9 | 6.3 | 0.0 | 4.7 | 0 | 195 |
| VolatileAcidity | -2.8 | 3.7 | 6.5 | 0.3 | 0.3 | 0.8 | 0.0 | 4.8 | 0 | 216 |
| CitricAcid | -3.2 | 3.9 | 7.1 | 0.3 | 0.3 | 0.9 | 0.0 | 4.8 | 0 | 215 |
| ResidualSugar | -127.8 | 141.2 | 268.9 | 5.4 | 3.9 | 33.7 | -0.1 | 4.9 | 616 | 0 |
| Chlorides | -1.2 | 1.4 | 2.5 | 0.1 | 0.0 | 0.3 | 0.0 | 4.8 | 638 | 0 |
| FreeSulfurDioxide | -555.0 | 623.0 | 1,178.0 | 30.8 | 30.0 | 148.7 | 0.0 | 4.8 | 647 | 0 |
| TotalSulfurDioxide | -823.0 | 1,057.0 | 1,880.0 | 120.7 | 123.0 | 231.9 | 0.0 | 4.7 | 682 | 0 |
| Density | 0.9 | 1.1 | 0.2 | 1.0 | 1.0 | 0.0 | 0.0 | 4.9 | 0 | 226 |
| pH | 0.5 | 6.1 | 5.7 | 3.2 | 3.2 | 0.7 | 0.0 | 4.6 | 395 | 0 |
| Sulphates | -3.1 | 4.2 | 7.4 | 0.5 | 0.5 | 0.9 | 0.0 | 4.8 | 1,210 | 0 |
| Alcohol | -4.7 | 26.5 | 31.2 | 10.5 | 10.4 | 3.7 | 0.0 | 4.5 | 653 | 0 |
| LabelAppeal | -2.0 | 2.0 | 4.0 | 0.0 | 0.0 | 0.9 | 0.0 | 2.7 | 0 | 0 |
| AcidIndex | 4.0 | 17.0 | 13.0 | 7.8 | 8.0 | 1.3 | 1.6 | 8.2 | 0 | 264 |
| STARS | 1.0 | 4.0 | 3.0 | 2.0 | 2.0 | 0.9 | 0.4 | 2.3 | 3,359 | 0 |
(ref:hist-caption) Histogram of Wine Training Data Variables
[1] "INDEX" "TARGET" "FixedAcidity"
[4] "VolatileAcidity" "CitricAcid" "ResidualSugar"
[7] "Chlorides" "FreeSulfurDioxide" "TotalSulfurDioxide"
[10] "Density" "pH" "Sulphates"
[13] "Alcohol" "LabelAppeal" "AcidIndex"
[16] "STARS"
(ref:hist-caption)
In Figure , we compare the three categorical variables to the TARGET variable. For each of these predictors, there appears to be a significant relationship between the ordered levels and the number of wine cases sold. Unit increases in the LabelAppeal metric are associated with consistent increases in the response. Similarly, increases in the STARS rating correspond with higher wine sales. Interestingly, wines that have not been rated tend to have low sales compared to their rated counterparts. Finally, AcidIndex scores appear to be negatively correlated with the response.
The relationship between the TARGET variable and the continuous variables is less clear. We initially created scatterplots to explore these relationships. Unfortunately, there is significant variability in response values across narrow bands of similar predictor values, so patterns are not easily discerned. As an alternative, we plotted loess curves to get a general sense of the relationship between the predictors and the response; see Figure . The plotted curves are fairly complex, but we noted several mild or moderate patterns.
(ref:bp-caption) Boxplots of TARGET vs. Categorical Variables
(ref:bp-caption)
(ref:lc-caption) Loess Curves of TARGET vs. Continuous Variables
(ref:lc-caption)
In Figure , we produced a correlation matrix of all candidate variables. For computational purposes, we converted all categorical variables back to their original numerical form.
None of the predictors have a particularly strong correlation with the TARGET variable. The STARS predictor has a moderate positive association with the response. LabelAppeal has a moderately weak positive correlation with TARGET, while AcidIndex has a fairly weak negative association with the response. All other predictors have extremely weak linear associations with TARGET.
Finally, the correlations between predictor variables are all weak. In other words, there are no immediately obvious collinearity issues.
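A minimal sketch of how such a matrix could be produced is shown below, assuming the `wine_train` data frame introduced earlier and the corrplot package; the exact plotting options are illustrative.

```r
library(corrplot)

# Hypothetical sketch: place every variable on a numeric scale (factors revert to
# their underlying codes), drop the INDEX identifier, and compute pairwise correlations.
num_train <- wine_train[, setdiff(names(wine_train), "INDEX")]
num_train[] <- lapply(num_train, function(x) as.numeric(as.character(x)))
cor_mat <- cor(num_train, use = "pairwise.complete.obs")
corrplot(cor_mat, method = "circle", type = "lower", tl.cex = 0.7)
```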
(ref:cor-caption) Correlation of Wine Data Variables
(ref:cor-caption)
As discussed previously, none of the predictors related to chemical composition should have negative values, as the units of measurement for these variables are always non-negative.
Based on the descriptive statistics in Table and the histogram plots in Figure , we see a similarity in the magnitudes of the negative and positive ranges for each of these predictors of concern. In other words, the distributions are roughly symmetric about zero.
It is possible that the minus signs in the data reflect data entry errors or some other identifying characteristic, but we have no data documentation to support these hypotheses. On the other hand, we found a resource online (see the References section) that analyzes the distributions of various wine chemical properties. In that resource, most of the chemical-property distributions contained only positive values and were right skewed.
Based on this information, we decided to apply a simple absolute value transformation to the predictors of concern. In total, we applied this transformation to nine predictors: FixedAcidity, VolatileAcidity, CitricAcid, ResidualSugar, Chlorides, FreeSulfurDioxide, TotalSulfurDioxide, Sulphates, and Alcohol.
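A minimal sketch of this step, under the same `wine_train` naming assumption:

```r
# Hypothetical sketch: absolute-value transformation for the chemistry predictors
# that exhibit implausible negative readings.
abs_cols <- c("FixedAcidity", "VolatileAcidity", "CitricAcid", "ResidualSugar",
              "Chlorides", "FreeSulfurDioxide", "TotalSulfurDioxide",
              "Sulphates", "Alcohol")
wine_train[abs_cols] <- lapply(wine_train[abs_cols], abs)
```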
Missing values affect roughly 50% of the records in the training data. They occur in varying degrees across eight of the candidate predictors: STARS, Sulphates, TotalSulfurDioxide, Alcohol, FreeSulfurDioxide, Chlorides, ResidualSugar, and pH. For a summary description of missing elements, refer to Figure .
The STARS variable has the highest number of missing values, with roughly one quarter of observations coded as NA. Given the large number of missing records, and the apparently distinct relationship between the NA level and TARGET relative to the other STARS levels, we decided to retain the NA records as their own level rather than impute them.
The Sulphates variable has missing values in approximately 10% of observations, while the six other predictors have missing values in the 3% - 5% range. Given the relatively small percentages of missing values for the remaining predictors, we will impute values using R’s MICE package. The use of the MICE procedure results in less bias in model regression coefficients compared to simpler methods like mean substitution.
Refer to Figure for a graphical depiction of the MICE imputation procedure. The blue curves represent the distribution of non-missing values, while the red curves are the imputed values. The red and blue distributions are very similar. This result suggests that the imputation procedure approximated the original distribution accurately.
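A minimal sketch of the imputation step is shown below, assuming the `wine_train` data frame; the number of imputations, the seed, and the method shown are illustrative defaults rather than the exact settings used.

```r
library(mice)

# Hypothetical sketch: keep the unrated wines as an explicit STARS level, then
# impute the remaining incomplete predictors with predictive mean matching.
wine_train$STARS <- addNA(wine_train$STARS)

impute_cols <- c("Sulphates", "TotalSulfurDioxide", "Alcohol", "FreeSulfurDioxide",
                 "Chlorides", "ResidualSugar", "pH")
imp <- mice(wine_train[impute_cols], m = 5, method = "pmm", seed = 123, printFlag = FALSE)
densityplot(imp)  # observed vs. imputed distributions, as in the figure referenced above
wine_train[impute_cols] <- complete(imp, 1)
```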
(ref:missing-caption) Plot of Missing Observations in Wine Training Data
(ref:missing-caption)
(ref:mice-caption) Graphical Summary of MICE Imputation Procedure
(ref:mice-caption)
We have made substantial revisions to our predictor variables, so we will review the revised variables in light of the transformations applied thus far.
In Table , we recalculated basic descriptive statistics for all variables. We also generated new histograms, correlation plots, and loess curve plots–see Figures , , and , respectively.
We see that many of the predictor variables now have a moderate right skew. The majority of the predictors are still leptokurtic. Also, our transformed variables indicate a higher prevalence of outlier observations. The revised loess curve plots indicate some significant changes in the relationship between the predictors and the TARGET variable. For instance, sulfur dioxide content now seems to be negatively correlated with wine sales. Interestingly, the loess curves are almost linear or only slightly curved for typical predictor values. However, the curves are very complex and unstable for predictor values that are sparsely populated. Finally, the revised correlation plot indicates no material changes to the original correlations.
| Variable | Min | Max | Range | Mean | Med | Stdev | Skew | Kurt | NAs | Outliers |
|---|---|---|---|---|---|---|---|---|---|---|
| TARGET | 0.0 | 8.0 | 8.0 | 3.0 | 3.0 | 1.9 | -0.3 | 2.1 | 0 | 0 |
| FixedAcidity | 0.0 | 34.4 | 34.4 | 8.1 | 7.0 | 5.0 | 1.2 | 5.0 | 0 | 156 |
| VolatileAcidity | 0.0 | 3.7 | 3.7 | 0.6 | 0.4 | 0.6 | 1.7 | 6.1 | 0 | 239 |
| CitricAcid | 0.0 | 3.9 | 3.9 | 0.7 | 0.4 | 0.6 | 1.6 | 5.9 | 0 | 230 |
| ResidualSugar | 0.0 | 141.2 | 141.2 | 23.4 | 12.9 | 24.9 | 1.5 | 5.2 | 0 | 242 |
| Chlorides | 0.0 | 1.4 | 1.4 | 0.2 | 0.1 | 0.2 | 1.5 | 5.2 | 0 | 235 |
| FreeSulfurDioxide | 0.0 | 623.0 | 623.0 | 107.3 | 56.0 | 108.8 | 1.5 | 5.4 | 0 | 246 |
| TotalSulfurDioxide | 0.0 | 1,057.0 | 1,057.0 | 203.9 | 153.0 | 162.9 | 1.6 | 6.0 | 0 | 202 |
| Density | 0.9 | 1.1 | 0.2 | 1.0 | 1.0 | 0.0 | 0.0 | 4.9 | 0 | 226 |
| pH | 0.5 | 6.1 | 5.7 | 3.2 | 3.2 | 0.7 | 0.0 | 4.6 | 0 | 189 |
| Sulphates | 0.0 | 4.2 | 4.2 | 0.8 | 0.6 | 0.7 | 1.7 | 6.2 | 0 | 211 |
| Alcohol | 0.0 | 26.5 | 26.5 | 10.5 | 10.4 | 3.6 | 0.2 | 4.0 | 0 | 85 |
| LabelAppeal | -2.0 | 2.0 | 4.0 | 0.0 | 0.0 | 0.9 | 0.0 | 2.7 | 0 | 0 |
| AcidIndex | 4.0 | 17.0 | 13.0 | 7.8 | 8.0 | 1.3 | 1.6 | 8.2 | 0 | 264 |
| STARS | 1.0 | 4.0 | 3.0 | 2.0 | 2.0 | 0.9 | 0.4 | 2.3 | 3,359 | 0 |
(ref:revhisto) Revised Histogram of Wine Training Data Variables
(ref:revhisto)
(ref:cor-caption-rev) Revised Correlation of Wine Data Variables
(ref:cor-caption-rev)
(ref:lccaprev) Revised Loess Curves of TARGET vs. Continuous Variables
(ref:lccaprev)
We previously decided to treat AcidIndex as an ordered, categorical variable. Because the index values are sparsely populated at the extreme low and high ends, we will bin the categories into fewer groupings. Specifically, we will merge values of 4 and 5 into the level 6 category, and we will merge values of 11 and higher into the level 10 category. This decision is also supported by the boxplot in Figure , where the extreme low and high index values appear to have a similar relationship with the TARGET variable.
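A minimal sketch of the binning step, again assuming the `wine_train` naming:

```r
# Hypothetical sketch: collapse the sparse extreme AcidIndex levels.
# Values of 4-5 are merged into level 6, and values of 11 and above into level 10.
idx <- as.numeric(as.character(wine_train$AcidIndex))
wine_train$AcidIndex <- factor(pmin(pmax(idx, 6), 10), levels = 6:10)
```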
We contemplated applying the power transformations suggested by the Box-Cox procedure to some of the predictor variables. Because many of the predictors have a moderate right skew, we could reasonably apply log, square root, and/or quarter-root transformations. These transformations may make sense in a basic OLS model; however, we are building a variety of GLMs in addition to OLS models, and most GLMs do not require such transformations. We ultimately decided to refrain from applying any power transformations at this point in the modeling process.
We also debated whether or not to winsorize some of the candidate predictors, given the prevalence of outlier observations. While this procedure can soften the influence of outlier observations, it can also introduce bias in the point estimates and inference mechanisms. We therefore decided to forgo winsorizing the predictors.
Finally, we noted that our client, LWM, requires an interpretable model. By refraining from applying complex predictor transformations, our model should remain relatively easy to explain.
For our first linear regression model, we decided to include the following four predictors: AcidIndex, STARS, LabelAppeal, and Alcohol. We included each of the categorical predictors because of their clear relationship with TARGET; see the boxplots in Figure . Also, the correlation plot (Figure ) indicated higher-magnitude correlations for these variables compared to the other candidate predictors.
We included the Alcohol variable for a variety of reasons. First, we recognize that our absolute value transformation may not have been the best approach to handling negative values; fortunately, nearly all of the raw Alcohol values were positive. Second, this variable is not significantly skewed and has only mildly fat tails. We also observe fewer outliers with this variable compared to many other predictors. Finally, Alcohol's relationship with TARGET (refer to Figure ) appears to be more straightforward than the other relationships.
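A fit along these lines might look like the sketch below; the model object name is illustrative.

```r
# Hypothetical sketch of Model 1: ordinary least squares with the two rating
# variables, the binned acidity index, and alcohol content.
mod1 <- lm(TARGET ~ AcidIndex + STARS + LabelAppeal + Alcohol, data = wine_train)
summary(mod1)  # adjusted R-squared and t statistics
confint(mod1)  # 95% confidence intervals, as reported in the regression table
```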
See Table for regression output. The model has an adjusted \(R^2\) value of 0.54. All predictors and levels within each categorical predictor are significant at the 5% level. The model coefficients make intuitive sense. Sales of wine decrease with each increase in the AcidIndex value. Having a STARS rating results in more sales compared to no ratings, and higher STARS ratings are associated with higher sales. Finally, increases in alcohol content are associated with higher sales. This is consistent with observations from our EDA work.
| Predictor | \(b\) | 95% CI | \(t(12781)\) | \(p\) |
|---|---|---|---|---|
| Intercept | 0.62 | $[0.47$, $0.78]$ | 7.93 | < .001 |
| AcidIndex7 | -0.09 | $[-0.17$, $-0.01]$ | -2.14 | .032 |
| AcidIndex8 | -0.20 | $[-0.28$, $-0.12]$ | -4.70 | < .001 |
| AcidIndex9 | -0.51 | $[-0.61$, $-0.41]$ | -10.04 | < .001 |
| AcidIndex10 | -1.06 | $[-1.17$, $-0.95]$ | -19.27 | < .001 |
| STARS1 | 1.36 | $[1.30$, $1.43]$ | 41.27 | < .001 |
| STARS2 | 2.41 | $[2.34$, $2.47]$ | 75.05 | < .001 |
| STARS3 | 2.97 | $[2.90$, $3.04]$ | 79.95 | < .001 |
| STARS4 | 3.66 | $[3.55$, $3.78]$ | 61.73 | < .001 |
| LabelAppeal-1 | 0.36 | $[0.24$, $0.49]$ | 5.76 | < .001 |
| LabelAppeal0 | 0.83 | $[0.71$, $0.95]$ | 13.49 | < .001 |
| LabelAppeal1 | 1.29 | $[1.17$, $1.42]$ | 20.15 | < .001 |
| LabelAppeal2 | 1.88 | $[1.71$, $2.05]$ | 22.25 | < .001 |
| Alcohol | 0.01 | $[0.01$, $0.02]$ | 3.92 | < .001 |
For our second multiple linear regression model, we used stepwise regression with variable selection occurring in both directions. Model output can be found in Table . This second model has an adjusted \(R^2\) of 0.54 and retains 12 of the 14 candidate predictors; only FixedAcidity and ResidualSugar were excluded from the model.
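Before turning to that output, a hypothetical sketch of the selection call, starting from all candidate predictors and letting AIC guide the search:

```r
# Hypothetical sketch of Model 2: stepwise selection in both directions,
# starting from a model with every candidate predictor (INDEX excluded).
full_fit <- lm(TARGET ~ . - INDEX, data = wine_train)
mod2 <- step(full_fit, direction = "both", trace = 0)
summary(mod2)
```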
| Predictor | \(b\) | 95% CI | \(t(12773)\) | \(p\) |
|---|---|---|---|---|
| Intercept | 1.69 | $[0.82$, $2.56]$ | 3.80 | < .001 |
| STARS1 | 1.36 | $[1.29$, $1.42]$ | 41.13 | < .001 |
| STARS2 | 2.39 | $[2.33$, $2.46]$ | 74.66 | < .001 |
| STARS3 | 2.96 | $[2.88$, $3.03]$ | 79.68 | < .001 |
| STARS4 | 3.65 | $[3.54$, $3.77]$ | 61.65 | < .001 |
| LabelAppeal-1 | 0.36 | $[0.23$, $0.48]$ | 5.67 | < .001 |
| LabelAppeal0 | 0.82 | $[0.70$, $0.94]$ | 13.45 | < .001 |
| LabelAppeal1 | 1.29 | $[1.16$, $1.41]$ | 20.09 | < .001 |
| LabelAppeal2 | 1.88 | $[1.71$, $2.04]$ | 22.25 | < .001 |
| AcidIndex7 | -0.09 | $[-0.17$, $-0.01]$ | -2.25 | .024 |
| AcidIndex8 | -0.20 | $[-0.28$, $-0.12]$ | -4.81 | < .001 |
| AcidIndex9 | -0.51 | $[-0.61$, $-0.41]$ | -10.00 | < .001 |
| AcidIndex10 | -1.05 | $[-1.16$, $-0.94]$ | -19.02 | < .001 |
| VolatileAcidity | -0.11 | $[-0.15$, $-0.07]$ | -5.22 | < .001 |
| Alcohol | 0.01 | $[0.01$, $0.02]$ | 4.08 | < .001 |
| TotalSulfurDioxide | 0.00 | $[0.00$, $0.00]$ | 2.73 | .006 |
| Sulphates | -0.04 | $[-0.07$, $-0.01]$ | -2.25 | .024 |
| Density | -0.92 | $[-1.78$, $-0.07]$ | -2.11 | .035 |
| FreeSulfurDioxide | 0.00 | $[0.00$, $0.00]$ | 1.87 | .062 |
| Chlorides | -0.09 | $[-0.19$, $0.01]$ | -1.82 | .069 |
| PH | -0.03 | $[-0.06$, $0.00]$ | -1.74 | .082 |
| CitricAcid | 0.03 | $[-0.01$, $0.07]$ | 1.53 | .127 |
Several predictors contribute relatively little to the model. We created a reduced model that removes pH, CitricAcid, and Sulphates; see Table for the regression output of this smaller model. The adjusted \(R^2\) is still 0.54, and the remaining coefficients are significant at or near the 5% level.
| Predictor | \(b\) | 95% CI | \(t(12776)\) | \(p\) |
|---|---|---|---|---|
| Intercept | 1.60 | $[0.73$, $2.46]$ | 3.62 | < .001 |
| STARS1 | 1.36 | $[1.29$, $1.42]$ | 41.22 | < .001 |
| STARS2 | 2.40 | $[2.33$, $2.46]$ | 74.75 | < .001 |
| STARS3 | 2.96 | $[2.89$, $3.03]$ | 79.77 | < .001 |
| STARS4 | 3.65 | $[3.54$, $3.77]$ | 61.69 | < .001 |
| LabelAppeal-1 | 0.36 | $[0.24$, $0.48]$ | 5.71 | < .001 |
| LabelAppeal0 | 0.83 | $[0.71$, $0.95]$ | 13.47 | < .001 |
| LabelAppeal1 | 1.29 | $[1.16$, $1.41]$ | 20.13 | < .001 |
| LabelAppeal2 | 1.88 | $[1.71$, $2.04]$ | 22.26 | < .001 |
| AcidIndex7 | -0.09 | $[-0.17$, $-0.01]$ | -2.19 | .029 |
| AcidIndex8 | -0.20 | $[-0.28$, $-0.12]$ | -4.71 | < .001 |
| AcidIndex9 | -0.50 | $[-0.60$, $-0.40]$ | -9.93 | < .001 |
| AcidIndex10 | -1.04 | $[-1.15$, $-0.94]$ | -18.98 | < .001 |
| VolatileAcidity | -0.11 | $[-0.15$, $-0.07]$ | -5.27 | < .001 |
| Alcohol | 0.01 | $[0.01$, $0.02]$ | 4.09 | < .001 |
| TotalSulfurDioxide | 0.00 | $[0.00$, $0.00]$ | 2.74 | .006 |
| Density | -0.95 | $[-1.80$, $-0.09]$ | -2.17 | .030 |
| FreeSulfurDioxide | 0.00 | $[0.00$, $0.00]$ | 1.91 | .057 |
| Chlorides | -0.09 | $[-0.19$, $0.00]$ | -1.90 | .058 |
Let’s perform an ANOVA test to see whether the full model is significantly different from the reduced model; refer to Table .
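A comparison along these lines can be produced with a call like the following; the object names and the pH column spelling are assumptions.

```r
# Hypothetical sketch: F test comparing the reduced Model 2.1 against the full
# stepwise Model 2 (the reduced fit drops pH, CitricAcid, and Sulphates).
mod2.1 <- update(mod2, . ~ . - pH - CitricAcid - Sulphates)
anova(mod2.1, mod2)
```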
| Res. Df | RSS | Df | Sum of Sq | F | Pr(>F) |
|---|---|---|---|---|---|
| 12,776.00 | 21,800.44 | NA | NA | NA | NA |
| 12,773.00 | 21,782.73 | 3.00 | 17.71 | 3.46 | 0.02 |
The p-value of the ANOVA test is approximately 2%. For the sake of parsimony and interpretability, we proceed with the reduced Model 2.1.
All coefficients included in Model 2 that were also in Model 1 have similar signs and magnitudes as those in the prior model. Furthermore, the signs of the additional predictors are consistent with results from our EDA. For instance, we expected the number of wine cases sold to decrease with increases in VolatileAcidity. We also expected some improvement in TARGET output with increases in TotalSulfurDioxide. Finally, our EDA (see Figure ) indicated a general decrease in TARGET values with increases in Chlorides and Density. This is consistent with the negative model coefficients observed in Model 2.1.
A Poisson model is potentially more appropriate for our modeling problem because our output variable is a count measure.
In a Poisson-based model, we assume the dependent variable’s mean is identical to its variance. Let’s briefly check this assumption by comparing the unconditional mean and variance of the TARGET variable. The variable’s mean is 3.03 and the variance is 3.71. Because the variance is somewhat higher than the mean, there may be an issue with overdispersion.
Let’s start by building a Poisson GLM using the same variables from Model 1. Model output is provided in Table . The AIC value is 45604. With the exception of one AcidIndex dummy variable, all coefficients are statistically significant, with signs and magnitudes that are consistent with our understanding of the predictors’ relationships with TARGET.
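A minimal sketch of the dispersion check and the Poisson fit, under the same naming assumptions as the earlier sketches:

```r
# Hypothetical sketch: unconditional mean/variance check, then Model 3 as a
# Poisson GLM with the Model 1 predictors.
mean(wine_train$TARGET)  # ~3.03
var(wine_train$TARGET)   # ~3.71

mod3 <- glm(TARGET ~ AcidIndex + STARS + LabelAppeal + Alcohol,
            family = poisson(link = "log"), data = wine_train)
summary(mod3)
AIC(mod3)
```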
| Predictor | \(b\) | 95% CI | \(z\) | \(p\) |
|---|---|---|---|---|
| Intercept | -0.14 | $[-0.23$, $-0.06]$ | -3.26 | .001 |
| AcidIndex7 | -0.03 | $[-0.06$, $0.00]$ | -1.71 | .088 |
| AcidIndex8 | -0.06 | $[-0.10$, $-0.03]$ | -3.48 | .001 |
| AcidIndex9 | -0.17 | $[-0.22$, $-0.13]$ | -7.70 | < .001 |
| AcidIndex10 | -0.48 | $[-0.53$, $-0.42]$ | -16.49 | < .001 |
| STARS1 | 0.76 | $[0.72$, $0.80]$ | 38.95 | < .001 |
| STARS2 | 1.08 | $[1.05$, $1.12]$ | 59.38 | < .001 |
| STARS3 | 1.20 | $[1.16$, $1.24]$ | 62.58 | < .001 |
| STARS4 | 1.32 | $[1.28$, $1.37]$ | 54.50 | < .001 |
| LabelAppeal-1 | 0.24 | $[0.16$, $0.31]$ | 6.27 | < .001 |
| LabelAppeal0 | 0.43 | $[0.36$, $0.50]$ | 11.56 | < .001 |
| LabelAppeal1 | 0.56 | $[0.49$, $0.64]$ | 14.88 | < .001 |
| LabelAppeal2 | 0.70 | $[0.61$, $0.78]$ | 16.44 | < .001 |
| Alcohol | 0.00 | $[0.00$, $0.01]$ | 2.76 | .006 |
As with Model 2, we also fit a Poisson model using stepwise regression in both directions (Model 4). Model output is provided in Table . Based on this output, we will consider removing three predictors: TotalSulfurDioxide, Chlorides, and Density.

| Predictor | \(b\) | 95% CI | \(z\) | \(p\) |
|---|---|---|---|---|
| Intercept | 0.23 | $[-0.16$, $0.62]$ | 1.16 | .244 |
| STARS1 | 0.76 | $[0.72$, $0.80]$ | 38.84 | < .001 |
| STARS2 | 1.08 | $[1.04$, $1.12]$ | 59.11 | < .001 |
| STARS3 | 1.20 | $[1.16$, $1.24]$ | 62.31 | < .001 |
| STARS4 | 1.32 | $[1.27$, $1.37]$ | 54.35 | < .001 |
| LabelAppeal-1 | 0.24 | $[0.16$, $0.31]$ | 6.22 | < .001 |
| LabelAppeal0 | 0.43 | $[0.36$, $0.50]$ | 11.52 | < .001 |
| LabelAppeal1 | 0.56 | $[0.49$, $0.63]$ | 14.84 | < .001 |
| LabelAppeal2 | 0.70 | $[0.61$, $0.78]$ | 16.42 | < .001 |
| AcidIndex7 | -0.03 | $[-0.06$, $0.00]$ | -1.79 | .073 |
| AcidIndex8 | -0.06 | $[-0.10$, $-0.03]$ | -3.55 | < .001 |
| AcidIndex9 | -0.17 | $[-0.22$, $-0.13]$ | -7.64 | < .001 |
| AcidIndex10 | -0.47 | $[-0.53$, $-0.42]$ | -16.32 | < .001 |
| VolatileAcidity | -0.04 | $[-0.05$, $-0.02]$ | -3.84 | < .001 |
| Alcohol | 0.00 | $[0.00$, $0.01]$ | 2.85 | .004 |
| Sulphates | -0.02 | $[-0.03$, $0.00]$ | -1.93 | .053 |
| TotalSulfurDioxide | 0.00 | $[0.00$, $0.00]$ | 2.00 | .045 |
| Density | -0.31 | $[-0.68$, $0.07]$ | -1.60 | .110 |
| Chlorides | -0.03 | $[-0.08$, $0.01]$ | -1.50 | .134 |
| PH | -0.01 | $[-0.03$, $0.00]$ | -1.43 | .152 |
We’ll build a reduced model with these three variables removed. Output can be found in Table .
| Predictor | \(b\) | 95% CI | \(z\) | \(p\) |
|---|---|---|---|---|
| Intercept | -0.07 | $[-0.17$, $0.03]$ | -1.32 | .187 |
| STARS1 | 0.76 | $[0.72$, $0.80]$ | 38.87 | < .001 |
| STARS2 | 1.08 | $[1.04$, $1.12]$ | 59.17 | < .001 |
| STARS3 | 1.20 | $[1.16$, $1.24]$ | 62.39 | < .001 |
| STARS4 | 1.32 | $[1.27$, $1.37]$ | 54.36 | < .001 |
| LabelAppeal-1 | 0.24 | $[0.16$, $0.31]$ | 6.22 | < .001 |
| LabelAppeal0 | 0.43 | $[0.35$, $0.50]$ | 11.51 | < .001 |
| LabelAppeal1 | 0.56 | $[0.49$, $0.63]$ | 14.84 | < .001 |
| LabelAppeal2 | 0.70 | $[0.61$, $0.78]$ | 16.39 | < .001 |
| AcidIndex7 | -0.03 | $[-0.06$, $0.00]$ | -1.82 | .069 |
| AcidIndex8 | -0.06 | $[-0.10$, $-0.03]$ | -3.58 | < .001 |
| AcidIndex9 | -0.17 | $[-0.22$, $-0.13]$ | -7.72 | < .001 |
| AcidIndex10 | -0.48 | $[-0.53$, $-0.42]$ | -16.46 | < .001 |
| VolatileAcidity | -0.04 | $[-0.06$, $-0.02]$ | -3.90 | < .001 |
| Alcohol | 0.00 | $[0.00$, $0.01]$ | 2.81 | .005 |
| Sulphates | -0.02 | $[-0.03$, $0.00]$ | -1.98 | .048 |
| PH | -0.01 | $[-0.03$, $0.00]$ | -1.44 | .151 |
We’ll now perform an analysis of deviance test to compare the full and reduced models; see Table . The p-value for this test is roughly 3%, so we will proceed with the full model.
| Resid. Df | Resid. Dev | Df | Deviance | Pr(>Chi) |
|---|---|---|---|---|
| 12,778.00 | 13,614.22 | NA | NA | NA |
| 12,775.00 | 13,605.48 | 3.00 | 8.74 | 0.03 |
We noted earlier that the TARGET variable is zero-inflated. Therefore, it makes sense to attempt to fit a zero-inflated Poisson regression model.
We’ll use the same set of predictors from Models 2.1 and 4. Model output is located in Table 13.
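A minimal sketch of the call, assuming the pscl package; the predictor list shown is illustrative rather than the exact set retained.

```r
library(pscl)

# Hypothetical sketch of Model 5: zero-inflated Poisson. Without an explicit "|"
# separator in the formula, zeroinfl() uses the same regressors for both the
# count component and the zero-inflation component.
mod5 <- zeroinfl(TARGET ~ STARS + LabelAppeal + AcidIndex + VolatileAcidity +
                   Alcohol + TotalSulfurDioxide + Chlorides + Density,
                 data = wine_train, dist = "poisson")
summary(mod5)
```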
While the signs and magnitudes of the coefficients are consistent with our understanding of these variables, we see four variables in the Poisson (log link) component of the model with p-values greater than 5%.
Model 6 simply removes the insignificant predictors from Model 5. Output can be found in Table 15.
All coefficients in the resulting model–with the exception of one dummy variable relating to AcidIndex–are significant.
In Model 7, we build a negative binomial regression model. This model type seems appropriate given that the data is possibly overdispersed.
We’ll begin by using the small subset of predictors used in Model 1. Output is located in Table 17.
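A minimal sketch of the fit, using MASS::glm.nb with the assumed naming:

```r
library(MASS)

# Hypothetical sketch of Model 7: negative binomial regression with the Model 1
# predictors; glm.nb() estimates the dispersion parameter theta along with the
# regression coefficients.
mod7 <- glm.nb(TARGET ~ AcidIndex + STARS + LabelAppeal + Alcohol, data = wine_train)
summary(mod7)
```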
All coefficients are significant. The values of the coefficients are consistent with previous models examined.
In the next model, we fit a negative binomial model using stepwise regression. Model output is in Table 18. All categorical predictors are included in the final model, as well as VolatileAcidity, Alcohol, TotalSulfurDioxide, Chlorides, and Density. Of these predictors, TotalSulfurDioxide, Chlorides, and Density are not significant at the 5% threshold.
We’ll produce a reduced model with the three predictors removed. Output can be found in Table 19.
Now we’ll do an analysis of deviance test to compare the full and reduced models–see Table 20 for details.
| theta | Resid. df | 2 x log-lik. | Test | df | LR stat. | Pr(Chi) |
|---|---|---|---|---|---|---|
| 40,747.48 | 12,778.00 | -45,556.66 | NA | NA | NA | NA |
| 40,786.10 | 12,775.00 | -45,547.92 | 1 vs 2 | 3.00 | 8.74 | 0.03 |
We proceed with the full model, given the p-value of 3%.
Next, we build a zero-inflated negative binomial model. We’ll start with the set of candidate predictors used in Model 8; refer to Table 21 for detailed output.
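A minimal sketch of this fit, reusing the naming and illustrative predictor set from the Model 5 sketch:

```r
library(pscl)

# Hypothetical sketch of Model 9: zero-inflated negative binomial.
mod9 <- zeroinfl(TARGET ~ STARS + LabelAppeal + AcidIndex + VolatileAcidity +
                   Alcohol + TotalSulfurDioxide + Chlorides + Density,
                 data = wine_train, dist = "negbin")
summary(mod9)
```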
All variables are significant except VolatileAcidity, TotalSulfurDioxide, Chlorides, and Density.
For our last model, we’ll start with Model 9, but remove the predictors that were not statistically significant. Full model output is in Table 23.
All predictors are statistically significant in this model.
We have 10 models in total to compare. We need a common set of statistics that we can use to compare each model. We are able to calculate BIC for 6 out of the 10 models, and AIC for all 10 models. Refer to Table 25 for a full model comparison summary.
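The comparison statistics can be assembled with calls like the following; the list below includes only the model objects sketched in this report, and the tryCatch guard reflects that BIC may not be available for every model class.

```r
# Hypothetical sketch: AIC, BIC, and log-likelihood for a subset of the fitted models.
models <- list(mod1 = mod1, mod3 = mod3, mod5 = mod5, mod7 = mod7, mod9 = mod9)
data.frame(
  AIC    = sapply(models, AIC),
  BIC    = sapply(models, function(m) tryCatch(BIC(m), error = function(e) NA)),
  loglik = sapply(models, function(m) as.numeric(logLik(m)))
)
```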
| Model | AIC | BIC | Log-likelihood |
|---|---|---|---|
| Model 1 | 43,207.25 | 43,319.10 | -21,588.63 |
| Model 2.1 | 43,168.78 | 43,317.92 | -21,564.39 |
| Model 3 | 45,605.86 | 45,710.25 | -22,788.93 |
| Model 4 | 45,587.50 | 45,736.64 | -22,773.75 |
| Model 5 | 40,818.21 | NA | -20,373.10 |
| Model 6 | 40,833.28 | NA | -20,386.64 |
| Model 7 | 45,608.28 | 45,720.13 | -22,789.14 |
| Model 8 | 45,589.92 | 45,746.51 | -22,773.96 |
| Model 9 | 40,820.21 | NA | -20,373.10 |
| Model 10 | 40,846.96 | NA | -20,394.48 |
Using AIC, we see that the multiple linear regression models (Models 1 and 2.1) outperform the basic Poisson models (Models 3 and 4) and the negative binomial models (Models 7 and 8).
This result is somewhat surprising, as we suspected that the negative binomial models would be superior to the other model types. Interestingly, the basic Poisson and negative binomial models performed about the same.
On the other hand, the zero-inflated Poisson models (Models 5 and 6) and the zero-inflated negative binomial models (Models 9 and 10) performed best overall. Within each pair, the models that retained the statistically insignificant predictors (Models 5 and 9) had the lower AIC values.
While Models 5 and 9 have virtually identical AIC measures, we select Model 9, the zero-inflated negative binomial model, as our model to use going forward, due to possible overdispersion issues with the data.
We applied the same imputation and transformation procedures used for the training data to our evaluation data set.
Next, we predicted TARGET values using Model 9, one of our zero-inflated Negative Binomial models.
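A minimal sketch of the scoring step; the evaluation data frame name `wine_eval` is an assumption, while the output file name matches the file linked below.

```r
# Hypothetical sketch: score the prepared evaluation data with Model 9 and
# write out the predictions.
wine_eval$TARGET_pred <- predict(mod9, newdata = wine_eval, type = "response")
write.csv(wine_eval, "test_data_with_predictions.csv", row.names = FALSE)
```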
Finally, we uploaded the prediction file to GitHub:
https://github.com/niteen11/MSDS/blob/master/DATA621/Week%205/test_data_with_predictions.csv
As a final check, we’ll compare the distribution of the training data TARGET variable with the distribution of the predicted response values in the test data set; refer to Table for a summary.
| Variable | Min | Max | Range | Mean | Med | Stdev | Skew | Kurt | NAs | Outliers |
|---|---|---|---|---|---|---|---|---|---|---|
| train.TARGET | 0.00 | 8.00 | 8.00 | 3.03 | 3.00 | 1.93 | -0.33 | 2.12 | 0.00 | 0.00 |
| test.TARGET | 0.14 | 7.09 | 6.95 | 3.06 | 3.02 | 1.48 | 0.12 | 2.25 | 0.00 | 0.00 |
The distributions are fairly similar. The means and medians are almost identical, and the kurtosis values are close. The standard deviation and range of the predicted TARGET values are slightly lower than the comparable statistics in the training data.
We produced an interpretable model that LWM can use to guide potential modifications to its current wine offerings.
Based on our modeling process, we’ve determined that high expert ratings and attractive label designs are important factors that are routinely associated with high wine sales. Wines that are less acidic tend to sell better, and higher alcohol content is associated with above-average sales.
Additional attributes that are common in high-selling wines include low concentrations of volatile acids, chlorides, and total sulfur dioxide. Finally, low-density wines tend to sell better than high-density wines.