We will explore, analyze, and model a data set containing approximately 12,000 records of commercially available wines. Each record has predictor variables describing the wine’s chemical properties, critic ratings, and the visual appeal of its label. The response variable indicates how many cases were sold.
We will create a count regression model to predict the number of cases of wine that will be sold given certain properties of the wine. We don’t expect great success using techniques available in this class because neither the type of wine nor its price point is specified.
Different types of wine are valued for different characteristics, such as acidity in certain white wines meant to be paired with spicy foods, or tannins in certain red wines meant to be paired with fatty foods. What makes one wine sell successfully in its category could make another wine do poorly in its category, which will muddle our results.
Wine is also purchased at different price points. The characteristics of wines of the same type may vary across price points, or purchasing decisions may favor stocking low-price wines regardless of their characteristics.
We’ll see what we can accomplish with the data provided.
Subsections:
- Loading Data
- Description of Variables
- Assumptions about the Data
Here we load in the data from the CSVs which we have downloaded. The data is separated into a training set for building the model and an evaluation set for testing the model.
We can see that we have the expected number of records loaded in: 12,795.
## Rows: 12,795
## Columns: 16
## $ INDEX <dbl> 1, 2, 4, 5, 6, 7, 8, 11, 12, 13, 14, 15, 16, 17, 19…
## $ TARGET <dbl> 3, 3, 5, 3, 4, 0, 0, 4, 3, 6, 0, 4, 3, 7, 4, 0, 0, …
## $ FixedAcidity <dbl> 3.2, 4.5, 7.1, 5.7, 8.0, 11.3, 7.7, 6.5, 14.8, 5.5,…
## $ VolatileAcidity <dbl> 1.160, 0.160, 2.640, 0.385, 0.330, 0.320, 0.290, -1…
## $ CitricAcid <dbl> -0.98, -0.81, -0.88, 0.04, -1.26, 0.59, -0.40, 0.34…
## $ ResidualSugar <dbl> 54.20, 26.10, 14.80, 18.80, 9.40, 2.20, 21.50, 1.40…
## $ Chlorides <dbl> -0.567, -0.425, 0.037, -0.425, NA, 0.556, 0.060, 0.…
## $ FreeSulfurDioxide <dbl> NA, 15, 214, 22, -167, -37, 287, 523, -213, 62, 551…
## $ TotalSulfurDioxide <dbl> 268, -327, 142, 115, 108, 15, 156, 551, NA, 180, 65…
## $ Density <dbl> 0.99280, 1.02792, 0.99518, 0.99640, 0.99457, 0.9994…
## $ pH <dbl> 3.33, 3.38, 3.12, 2.24, 3.12, 3.20, 3.49, 3.20, 4.9…
## $ Sulphates <dbl> -0.59, 0.70, 0.48, 1.83, 1.77, 1.29, 1.21, NA, 0.26…
## $ Alcohol <dbl> 9.9, NA, 22.0, 6.2, 13.7, 15.4, 10.3, 11.6, 15.0, 1…
## $ LabelAppeal <dbl> 0, -1, -1, -1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 2, 0, 0, …
## $ AcidIndex <dbl> 8, 7, 8, 6, 9, 11, 8, 7, 6, 8, 5, 10, 7, 8, 9, 8, 9…
## $ STARS <dbl> 2, 3, 3, 1, 2, NA, NA, 3, NA, 4, 1, 2, 2, 3, NA, NA…
Below is a short description of the variables of interest in the data set:
Variable | Quantitative or Categorical | Definition |
---|---|---|
INDEX | Categorical | Identification Variable |
TARGET | Count | Number of Cases Purchased |
AcidIndex | Quantitative | Proprietary total acidity measure |
Alcohol | Quantitative | Alcohol Content |
Chlorides | Quantitative | Chloride content of wine |
CitricAcid | Quantitative | Citric Acid Content |
Density | Quantitative | Density of Wine |
FixedAcidity | Quantitative | Fixed Acidity of wine |
FreeSulfurDioxide | Quantitative | Free sulfur dioxide content of wine |
LabelAppeal | Likert | Marketing Score with higher numbers meaning more label appeal |
ResidualSugar | Quantitative | Residual sugar content of wine |
STARS | Likert | Wine rating by a team of experts: 4 Stars = Excellent, 1 Star = Poor |
Sulphates | Quantitative | Sulfate content of wine |
TotalSulfurDioxide | Quantitative | Total Sulfur Dioxide of Wine |
VolatileAcidity | Quantitative | Volatile Acid content of wine |
pH | Quantitative | pH of wine |
It looks like LabelAppeal and the wine ratings are going to drive the number of cases bought, given how difficult it is to use chemical properties to evaluate appeal when neither the type of wine nor the price point is specified.
However, it may be that some properties, such as sulfur dioxide, indicate a flaw in the wine that is undesirable regardless of wine type or price point.
From the table above we can identify groups of variables such as ‘acidity’ (four variables) and ‘sulfur dioxide’ (two variables), so we may see some correlation within those groups or be able to derive new variables from them.
Note that volatile means it can evaporate, which provides aroma and also means it can be lost to the air during winemaking before the wine is bottled. Among the four acid variables it looks like we may just use AcidIndex, the proprietary measure of acidity, and VolatileAcidity. From a review of Agrovin’s Techniques for Correcting Wine Acidity it looks like too much volatile acidity may indicate a flaw in the wine. Also, citric acid is only one of the acids that make up a wine’s fixed acidity. The others are tartaric, malic, succinic, and lactic, and of the five the most important is tartaric, which we assume is accounted for in the proprietary measure of acidity, AcidIndex.
We have free and total sulfur dioxide, so we could create a column for fixed (bound) sulfur dioxide as the difference between the two, as sketched below.
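A minimal sketch of that derived column, using the hypothetical name BoundSulfurDioxide; we did not end up adding it to the models below.

# Bound ("fixed") sulfur dioxide: total minus free
train_1 <- train_1 |>
  dplyr::mutate(BoundSulfurDioxide = TotalSulfurDioxide - FreeSulfurDioxide)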
Density is also not an intuitive variable. It’s the density of the wine: its weight in grams divided by its volume in mL. Water has a density of 1.00 g/mL, wine on average has a density of 0.99 g/mL, and ethanol has a density of 0.79 g/mL. So density in wine may be a proxy for higher alcohol content.
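A quick way to check that proxy idea on our data, assuming train_1 is loaded as in the appendix, is the complete-case correlation between the two columns:

# A negative correlation here would support density acting as an (inverse) proxy for alcohol
cor(train_1$Density, train_1$Alcohol, use = "complete.obs")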
Subsections:
- Data Preprocessing - Data Skimming
- Histogram
- Outliers
- Missing Values
- Correlation
From our preliminary variable exploration, we know that we have both qualitative and quantitative variables in our data. Before we can begin exploring the data, we need to ensure that each variable is stored as the appropriate type.
Right away we see the variables that need to be factorized, such as LabelAppeal and STARS. We considered factorizing AcidIndex, however it is roughly normally distributed across its range of values and so it looks like it can stay numeric.
We begin our data preprocessing efforts by converting the categorical columns LabelAppeal and STARS from doubles into factors. We then view the levels of each of these factors to better understand which levels need to be modified and which factors should be ordered.
## $LabelAppeal
## [1] "-2" "-1" "0" "1" "2"
##
## $STARS
## [1] "1" "2" "3" "4"
We begin our data exploration by skimming the training data for immediate summary statistics, data types, and a visualization of distributions with mini histograms.
We still have our 12,795 rows with 16 columns, now split between 2 factor and 14 numeric variables.
Name | train_1 |
Number of rows | 12795 |
Number of columns | 16 |
Column type frequency: | |
factor | 2 |
numeric | 14 |
Group variables | None |
Variable type: factor
skim_variable | n_missing | ordered | n_unique | top_counts |
---|---|---|---|---|
LabelAppeal | 0 | FALSE | 5 | 0: 5617, -1: 3136, 1: 3048, -2: 504 |
STARS | 3359 | FALSE | 4 | 2: 3570, 1: 3042, 3: 2212, 4: 612 |
Variable type: numeric
skim_variable | n_missing | mean | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|
INDEX | 0 | 8069.980 | 1.00 | 4037.50 | 8110.00 | 12106.50 | 16129.00 | ▇▇▇▇▇ |
TARGET | 0 | 3.029 | 0.00 | 2.00 | 3.00 | 4.00 | 8.00 | ▆▇▇▆▁ |
FixedAcidity | 0 | 7.076 | -18.10 | 5.20 | 6.90 | 9.50 | 34.40 | ▁▂▇▂▁ |
VolatileAcidity | 0 | 0.324 | -2.79 | 0.13 | 0.28 | 0.64 | 3.68 | ▁▂▇▂▁ |
CitricAcid | 0 | 0.308 | -3.24 | 0.03 | 0.31 | 0.58 | 3.86 | ▁▂▇▂▁ |
ResidualSugar | 616 | 5.419 | -127.80 | -2.00 | 3.90 | 15.90 | 141.15 | ▁▂▇▂▁ |
Chlorides | 638 | 0.055 | -1.17 | -0.03 | 0.05 | 0.15 | 1.35 | ▁▂▇▂▁ |
FreeSulfurDioxide | 647 | 30.846 | -555.00 | 0.00 | 30.00 | 70.00 | 623.00 | ▁▂▇▂▁ |
TotalSulfurDioxide | 682 | 120.714 | -823.00 | 27.00 | 123.00 | 208.00 | 1057.00 | ▁▂▇▂▁ |
Density | 0 | 0.994 | 0.89 | 0.99 | 0.99 | 1.00 | 1.10 | ▁▂▇▂▁ |
pH | 395 | 3.208 | 0.48 | 2.96 | 3.20 | 3.47 | 6.13 | ▁▂▇▂▁ |
Sulphates | 1210 | 0.527 | -3.13 | 0.28 | 0.50 | 0.86 | 4.24 | ▁▂▇▂▁ |
Alcohol | 653 | 10.489 | -4.70 | 9.00 | 10.40 | 12.40 | 26.50 | ▁▂▇▂▁ |
AcidIndex | 0 | 7.773 | 4.00 | 7.00 | 8.00 | 8.00 | 17.00 | ▁▇▁▁▁ |
We can determine from the counts of the factor levels, which we expand below, that none of our factor variables is degenerate: no level has an extremely low count compared to either the total population or the most frequent level. This means we do not need to worry about removing factor variables due to degeneracy.
LabelAppeal | n |
---|---|
-2 | 504 |
-1 | 3136 |
0 | 5617 |
1 | 3048 |
2 | 490 |

STARS | n |
---|---|
1 | 3042 |
2 | 3570 |
3 | 2212 |
4 | 612 |
NA | 3359 |
ResidualSugar, Chlorides, FreeSulfurDioxide, TotalSulfurDioxide, and Alcohol all have roughly 600–700 missing values. pH has ~400 missing values and Sulphates has ~1,200 missing values.
For any of these, a missing value shouldn’t stand in for some other value. Typically that would be zero, but none of these measurements should theoretically be zero. It may be that wine from certain regions doesn’t get tested for certain characteristics, in which case we would worry about the missingness being interconnected.
Plotting the number of missing values in each row, we can see that the vast majority of rows have only a single missing value and a minuscule number have 3 or more columns with missing values. Thus, we will consider these missing values not to be interconnected.
For our purposes we’re going to delete records with missing values for Alcohol or ResidualSugar: Alcohol because we assume it has an impact on the number of cases purchased, and ResidualSugar because a flaw like excess residual sugar might lead the test result to be omitted if the tester is the wine seller themselves. For the rest we feel comfortable using MICE to impute values and then move on to demonstrate a count regression model.
To spot-check two of the relationships between missing values we’ve plotted scatterplots that draw missing values as jittered points below the axis minimums on both the y- and x-axis, so we can see how the distribution of one variable differs when the other is missing.
We’ve plotted Alcohol vs. pH and Chlorides vs. ResidualSugar, and in both cases there is no unusual pattern in the distribution of the second variable based on the missingness of the first.
For the last part of data exploration we take a look at which variables might be collinear, and which ones might be correlated with our response variable. Below we plot a correlation plot covering each variable. It is a bit hard to see the plot in its entirety, but resizing it and making it fullscreen allows us to read the information.
In our case we have some collinearity between AcidIndex and FixedAcidity, which makes sense. There is also a slight correlation between Sulphates and FixedAcidity, and between TotalSulfurDioxide and AcidIndex, so it may be that higher acidity and sulfur compounds together constitute a flavor profile in wines.
Interestingly, there is a little correlation between being rated 1 star and VolatileAcidity, so it may be that volatile acidity is a flaw in wine that critics evaluate negatively. Also, the correlation with Alcohol runs from positive to negative across STARS 4 through 1, with 4 stars having the highest correlation, so it appears alcohol has a noticeable impact on ratings.
It looks like there’s a very faint correlation between 4 STARS and Density, so our assumption in the Assumptions about the Data section that density is a proxy for alcohol may be correct. However, we assume the other compounds in wine mute the relationship and it’s better to just use Alcohol directly.
Next we create a correlation funnel that plots each variable against TARGET binned into above and below its mean. This serves as a guide to which variables will have the strongest impact on how many cases are purchased, starting with LabelAppeal and STARS, as we predicted, followed by AcidIndex and Alcohol. Interestingly, CitricAcid is high up there; possibly it has an outsized effect on flavor independent of its contribution to acidity.
After exploring our data we move on to preparing it to be more useful for our goal of creating a count regression model to predict cases of wine sold.
Subsections:
- Missing Values
- Distribution Transformation
- Interaction Columns
- Correlation
When there are missing values within our data, attempting to fit a model directly will silently drop roughly 40% of the rows used to calculate the model. To avoid this loss of data we will deal with the missing values ourselves.
As discussed in our data exploration we are removing records missing Alcohol and ResidualSugar values.
## There are 0 NA values in the Alcohol column
## There are 0 NA values in the ResidualSugar column
Finally, the remaining columns that still have NAs in them have all been analyzed and deemed suitable for MICE imputation. Through MICE (Multivariate Imputation by Chained Equations) each incomplete variable is modeled from the other variables, giving replacement values that are as close as possible to what the missing entries most likely would have been.
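One common sanity check on the imputations, sketched here assuming it is run before the completed data overwrite train_1 (the mids object is called cheesed in the appendix), is to compare the densities of imputed and observed values:

imp <- mice(train_1, printFlag = FALSE)   # mids object holding the multiple imputations
densityplot(imp)                          # overlay imputed vs. observed densities per variable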
## INDEX TARGET FixedAcidity VolatileAcidity
## 0 0 0 0
## CitricAcid ResidualSugar Chlorides FreeSulfurDioxide
## 0 0 0 0
## TotalSulfurDioxide Density pH Sulphates
## 0 0 0 0
## Alcohol LabelAppeal AcidIndex STARS
## 0 0 0 0
After all of our data preparation is done we want to revisit the correlation plot to see if any of the changes we have made have materially changed the correlation values. It looks like the MICE imputations and the deletion of records missing Alcohol or ResidualSugar values slightly increased the correlations in our data.
We begin our model building with a comparison of the three main types of count regression models and end with a comparison of three sets of selected features within the preferred model. This is how we will select the final count regression model to predict how many cases of wine are purchased.
We’re going to fit the following three types of count regression models with the same set of features to evaluate which model is best for our data.
Feature selection for this initial step will be based on the backwards stepwise selection achieved for the Poisson regression model.
Our prediction is that we’ll end up using the negative binomial model. A Poisson model is best suited to counts arising from a large number of trials with a small success probability, and while most of our wines have at least one case purchased, none sell a very large number of cases. An overdispersed (quasi-)Poisson model is for when there are many zero or low counts that, as long as no external factor is causing them, can be accommodated by letting the variance differ from the mean. In our case it looks like neither situation clearly applies, so we assume the negative binomial model is more suitable. The negative binomial model is a type of Poisson model in which the rate parameter lambda is itself gamma distributed.
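Concretely, in the usual shape–rate parameterization, the Poisson–gamma mixture behind the negative binomial (with dispersion parameter $\theta$, which the negative.binomial(1) family used in the appendix fixes at 1) can be written as:

$$
Y \mid \lambda \sim \mathrm{Poisson}(\lambda), \qquad
\lambda \sim \mathrm{Gamma}\!\left(\theta,\; \tfrac{\theta}{\mu}\right)
\;\Longrightarrow\;
E(Y) = \mu, \qquad \operatorname{Var}(Y) = \mu + \frac{\mu^{2}}{\theta}.
$$

The variance exceeds the mean, and the model approaches the ordinary Poisson as $\theta$ grows large.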
The first model that we will tackle is a Poisson regression model. We utilize backward stepwise selection to gradually remove variables, determining how significant each may be within our model.
##
## Call:
## glm(formula = TARGET ~ VolatileAcidity + Chlorides + FreeSulfurDioxide +
## TotalSulfurDioxide + Density + pH + Sulphates + Alcohol +
## LabelAppeal + AcidIndex + STARS, family = poisson, data = train_1)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.248e+00 2.088e-01 5.976 2.29e-09 ***
## VolatileAcidity -3.941e-02 6.885e-03 -5.724 1.04e-08 ***
## Chlorides -6.521e-02 1.690e-02 -3.858 0.000114 ***
## FreeSulfurDioxide 1.140e-04 3.596e-05 3.171 0.001519 **
## TotalSulfurDioxide 6.951e-05 2.322e-05 2.993 0.002759 **
## Density -3.020e-01 2.019e-01 -1.496 0.134742
## pH -1.633e-02 7.894e-03 -2.069 0.038549 *
## Sulphates -1.095e-02 5.765e-03 -1.899 0.057504 .
## Alcohol 4.407e-03 1.450e-03 3.039 0.002376 **
## LabelAppeal-1 2.253e-01 4.004e-02 5.627 1.84e-08 ***
## LabelAppeal0 4.121e-01 3.902e-02 10.562 < 2e-16 ***
## LabelAppeal1 5.343e-01 3.965e-02 13.476 < 2e-16 ***
## LabelAppeal2 6.610e-01 4.473e-02 14.778 < 2e-16 ***
## AcidIndex -9.494e-02 4.701e-03 -20.196 < 2e-16 ***
## STARS2 6.388e-01 1.407e-02 45.412 < 2e-16 ***
## STARS3 8.222e-01 1.547e-02 53.131 < 2e-16 ***
## STARS4 9.526e-01 2.206e-02 43.178 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 20657 on 11562 degrees of freedom
## Residual deviance: 13844 on 11546 degrees of freedom
## AIC: 42706
##
## Number of Fisher Scoring iterations: 5
The second model we will tackle is an overdispersed Poisson (quasi-Poisson) model. We will hold the feature selection the same as above to compare the models directly.
##
## Call:
## glm(formula = TARGET ~ . - INDEX - FixedAcidity - CitricAcid -
## ResidualSugar, family = quasipoisson, data = train_1)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.248e+00 1.967e-01 6.345 2.31e-10 ***
## VolatileAcidity -3.941e-02 6.484e-03 -6.078 1.26e-09 ***
## Chlorides -6.521e-02 1.592e-02 -4.096 4.23e-05 ***
## FreeSulfurDioxide 1.140e-04 3.386e-05 3.367 0.000762 ***
## TotalSulfurDioxide 6.951e-05 2.187e-05 3.178 0.001485 **
## Density -3.020e-01 1.902e-01 -1.588 0.112297
## pH -1.633e-02 7.435e-03 -2.197 0.028054 *
## Sulphates -1.095e-02 5.430e-03 -2.017 0.043737 *
## Alcohol 4.407e-03 1.366e-03 3.226 0.001257 **
## LabelAppeal-1 2.253e-01 3.771e-02 5.974 2.38e-09 ***
## LabelAppeal0 4.121e-01 3.675e-02 11.215 < 2e-16 ***
## LabelAppeal1 5.343e-01 3.735e-02 14.308 < 2e-16 ***
## LabelAppeal2 6.610e-01 4.212e-02 15.691 < 2e-16 ***
## AcidIndex -9.494e-02 4.427e-03 -21.444 < 2e-16 ***
## STARS2 6.388e-01 1.325e-02 48.218 < 2e-16 ***
## STARS3 8.222e-01 1.457e-02 56.414 < 2e-16 ***
## STARS4 9.526e-01 2.078e-02 45.846 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasipoisson family taken to be 0.8869995)
##
## Null deviance: 20657 on 11562 degrees of freedom
## Residual deviance: 13844 on 11546 degrees of freedom
## AIC: NA
##
## Number of Fisher Scoring iterations: 5
The third model is a negative binomial model. We will hold the feature selection the same as for the two models above to compare them directly.
##
## Call:
## glm(formula = TARGET ~ . - INDEX - FixedAcidity - CitricAcid -
## ResidualSugar, family = negative.binomial(1), data = train_1)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.558e+00 2.336e-01 6.669 2.69e-11 ***
## VolatileAcidity -5.254e-02 7.763e-03 -6.768 1.37e-11 ***
## Chlorides -9.372e-02 1.908e-02 -4.911 9.17e-07 ***
## FreeSulfurDioxide 1.608e-04 4.068e-05 3.952 7.79e-05 ***
## TotalSulfurDioxide 1.132e-04 2.622e-05 4.316 1.60e-05 ***
## Density -3.238e-01 2.276e-01 -1.422 0.15497
## pH -2.716e-02 8.906e-03 -3.050 0.00229 **
## Sulphates -1.707e-02 6.504e-03 -2.624 0.00870 **
## Alcohol 3.433e-03 1.632e-03 2.104 0.03543 *
## LabelAppeal-1 2.196e-01 3.617e-02 6.072 1.30e-09 ***
## LabelAppeal0 3.894e-01 3.525e-02 11.049 < 2e-16 ***
## LabelAppeal1 4.833e-01 3.642e-02 13.270 < 2e-16 ***
## LabelAppeal2 5.922e-01 4.570e-02 12.959 < 2e-16 ***
## AcidIndex -1.240e-01 4.925e-03 -25.181 < 2e-16 ***
## STARS2 6.455e-01 1.438e-02 44.899 < 2e-16 ***
## STARS3 8.412e-01 1.709e-02 49.224 < 2e-16 ***
## STARS4 9.842e-01 2.810e-02 35.028 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(1) family taken to be 0.3047134)
##
## Null deviance: 8184.4 on 11562 degrees of freedom
## Residual deviance: 6371.1 on 11546 degrees of freedom
## AIC: 50382
##
## Number of Fisher Scoring iterations: 5
From the models’ summary outputs it’s not clear which is best, so we fall back on our assumption that the negative binomial model is the most suitable and carry it forward.
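One way to probe that assumption, as a sketch rather than part of our model sequence, is to let MASS::glm.nb estimate the dispersion parameter theta from the data instead of fixing it at 1 the way negative.binomial(1) does; a very large estimated theta would suggest the counts are barely overdispersed relative to Poisson (glm_3_nb is a hypothetical name; train_1 is as prepared above).

library(MASS)
# Same feature set as glm_3, but with theta estimated rather than fixed at 1
glm_3_nb <- glm.nb(TARGET ~ . - INDEX - FixedAcidity - CitricAcid - ResidualSugar,
                   data = train_1)
glm_3_nb$theta   # estimated dispersion parameter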
It looks as if the backwards stepwise feature selection obtained in the first model didn’t remove enough features so that’s something we’ll explore in the next section.
It looks like the marginal model plots for the Poisson model show that the non-factorized variable with the clearest effect on the model is AcidIndex.
Lastly, as a check on our Data Preparation section, we’ve looked at the Cook’s distances for our data; they are all less than 0.0025, so we don’t have any influential outliers impacting the model.
In the previous three models we were consistently using the set of features achieved through backwards stepwise elimination on the poisson model. The negative binomial model using this is our base model and we will compare it to two other models with fewer and fewer features to identify which is best.
We remove Density because the p-value is 0.15497. This is supported by our data analysis where we assume that Density should be relatively similar across wines except between extremely high and low alcohol-content wines.
We remove pH and Sulphates because they had the lowest correlation to TARGET in the Data Preparation section.
We’re also choosing to take out Alcohol even though it was shown earlier to have correlation with TARGET, because its p-value in the model is among the highest of the remaining predictors.
With these four features removed the fit is essentially unchanged: the dispersion parameter is 0.3048038 for model 3b compared to 0.3047134 for model 3. Even with no change in fit we would take model 3b, since four fewer features for the same fit is an improvement.
##
## Call:
## glm(formula = TARGET ~ . - INDEX - FixedAcidity - CitricAcid -
## ResidualSugar - Density - pH - Sulphates - Alcohol, family = negative.binomial(1),
## data = train_1)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.179e+00 5.114e-02 23.048 < 2e-16 ***
## VolatileAcidity -5.252e-02 7.762e-03 -6.767 1.38e-11 ***
## Chlorides -9.348e-02 1.907e-02 -4.902 9.63e-07 ***
## FreeSulfurDioxide 1.574e-04 4.068e-05 3.870 0.000109 ***
## TotalSulfurDioxide 1.113e-04 2.621e-05 4.247 2.18e-05 ***
## LabelAppeal-1 2.191e-01 3.616e-02 6.060 1.41e-09 ***
## LabelAppeal0 3.883e-01 3.524e-02 11.021 < 2e-16 ***
## LabelAppeal1 4.822e-01 3.641e-02 13.244 < 2e-16 ***
## LabelAppeal2 5.893e-01 4.569e-02 12.897 < 2e-16 ***
## AcidIndex -1.243e-01 4.906e-03 -25.336 < 2e-16 ***
## STARS2 6.464e-01 1.438e-02 44.966 < 2e-16 ***
## STARS3 8.452e-01 1.706e-02 49.550 < 2e-16 ***
## STARS4 9.873e-01 2.806e-02 35.187 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(1) family taken to be 0.3048038)
##
## Null deviance: 8184.4 on 11562 degrees of freedom
## Residual deviance: 6378.0 on 11550 degrees of freedom
## AIC: 50381
##
## Number of Fisher Scoring iterations: 5
In this version of the model we remove FreeSulfurDioxide, TotalSulfurDioxide, VolatileAcidity and Chlorides.
What we see is a very small change in the dispersion parameter (0.3048038 in model 3b versus 0.3043724 in model 3c), so there is very little difference in fit with four fewer features.
##
## Call:
## glm(formula = TARGET ~ LabelAppeal + AcidIndex + STARS, family = negative.binomial(1),
## data = train_1)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.192608 0.050756 23.497 < 2e-16 ***
## LabelAppeal-1 0.218279 0.036103 6.046 1.53e-09 ***
## LabelAppeal0 0.387390 0.035182 11.011 < 2e-16 ***
## LabelAppeal1 0.482764 0.036347 13.282 < 2e-16 ***
## LabelAppeal2 0.585360 0.045621 12.831 < 2e-16 ***
## AcidIndex -0.126325 0.004887 -25.849 < 2e-16 ***
## STARS2 0.647537 0.014345 45.141 < 2e-16 ***
## STARS3 0.847077 0.017017 49.778 < 2e-16 ***
## STARS4 0.988160 0.028016 35.271 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(1) family taken to be 0.3043724)
##
## Null deviance: 8184.4 on 11562 degrees of freedom
## Residual deviance: 6410.0 on 11554 degrees of freedom
## AIC: 50405
##
## Number of Fisher Scoring iterations: 5
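Before interpreting the coefficients, a quick sketch of a check that dropping the four chemistry variables costs little: a deviance comparison of the nested fits, using the model names from the appendix.

# glm_3c (LabelAppeal + AcidIndex + STARS) is nested within glm_3b,
# so an approximate chi-square test on the change in deviance applies
anova(glm_3c, glm_3b, test = "Chisq")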
The intercept is 1.192608 on the log scale, so at the reference levels of the factors (LabelAppeal of -2 and 1 STAR) and an AcidIndex of zero the expected count would be exp(1.193) ≈ 3.3 cases, though an AcidIndex of zero is below the observed range of 4 to 17.
All of the predictors have very low p-values indicating they are all highly significant.
The coefficient of AcidIndex is negative, so we would expect that when this predictor is higher, fewer cases are sold.
The remaining predictors are factorized variables, but we can see how higher numbers of STARS equate to more cases sold because the coefficients go up with each star. Similarly we can see an increase in coefficients for each step up in LabelAppeal.
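Because the model uses a log link, exponentiating the coefficients turns them into multiplicative effects on the expected number of cases sold; for example, exp(0.988) ≈ 2.7 for STARS4 means a 4-star wine is expected to sell roughly 2.7 times as many cases as a 1-star wine, all else equal. A minimal sketch:

exp(coef(glm_3c))   # rate ratios: multiplicative effect of each predictor on expected cases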
The Marginal Effects Plot below shows the relative coefficients and we can see the increasing stepwise effect of higher values of both LabelAppeal and STARS.
The residual deviance is 6,410.0 which is lower than the null deviance (8,184.4), indicating that the model provides a better fit than a null model with no predictors.
The dispersion parameter (0.3043724) should be close to 1 for a well-fitted model; if it is greater than 1, there is overdispersion. It looks like we have an underdispersed model.
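That reported dispersion value can be reproduced directly as the Pearson chi-square statistic divided by the residual degrees of freedom; a sketch assuming glm_3c as fit in the appendix:

# Pearson-based estimate of the dispersion parameter for glm_3c
sum(residuals(glm_3c, type = "pearson")^2) / df.residual(glm_3c)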
The AIC, or Akaike Information Criterion, is a measure of model quality that penalizes models for having more parameters. In our case it is slightly higher at 50,405 than for our previous two models (50,382 and 50,381), but they seem close enough to justify accepting the model with the fewest parameters, model 3c.
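For a side-by-side view of the likelihood-based fits (the quasi-Poisson model has no likelihood and therefore no AIC), assuming the model objects are named as in the appendix:

AIC(glm_3, glm_3b, glm_3c)   # lower is better; the differences here are small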
Note that we are not evaluating the Fisher scoring iterations. This is the number of iterations performed during the fitting process and is a measure of how complex the optimization was.
Keeping glm_3c for our count regression model to predict cases sold seems like the best idea. There was very little deterioration in the negative binomial model profile with the fewest parameters that we tried. It is not a good fit, but this exercise has demonstrated some interesting connections in the data.
Unfortunately, we were not provided TARGET values for our test data, eval_1. Had we been, we would have been able to compare how well our model performed on the test data with the code below:
eval_predictions <- predict(glm_3c, newdata = eval_1, type = "response")
eval_actual <- eval_1$TARGET
eval_mae <- mean(abs(eval_actual - eval_predictions))
eval_mse <- mean((eval_actual - eval_predictions)^2)
print(paste("Mean Absolute Error (MAE):", eval_mae))
## [1] "Mean Absolute Error (MAE): NA"
print(paste("Mean Squared Error (MSE):", eval_mse))
## [1] "Mean Squared Error (MSE): NA"
It seems that for our count regression model we weren’t able to get a competent model. We did learn more about the chemistry of wine, and it seems that the proprietary index of specific acids in a wine does have some predictive value for how many cases will be bought, as do the rating of the wine and the visual appeal of the label. The other interesting relationships from our data analysis section were not borne out by the model but could prove true when holding the type of wine constant.
The importance of label appeal matches one team member’s personal experience working the cash register at a friend’s wine shop opening: a lot of the same bottles were purchased despite a huge selection, even looking only at the lower price-point wines, and they all had appealing labels.
Avenues for future analysis could be to try the two other combinations of features, b and c, with the other model types, 1 and 2, for four more models to compare.
We could also try different things at a data level, such as imputing the missing values for Alcohol and ResidualSugar or excluding records with missing values in other variables.
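A minimal sketch of that first alternative, imputing Alcohol and ResidualSugar instead of dropping those records (train_alt is a hypothetical name; file and column names are as used in the appendix):

library(tidyverse)
library(mice)
# Keep all 12,795 rows and let MICE impute Alcohol and ResidualSugar along with the rest
train_alt <- read_csv("wine-training-data.csv", show_col_types = FALSE) |>
  mutate(across(c(LabelAppeal, STARS), as_factor)) |>
  mice(printFlag = FALSE) |>
  complete()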
#install.packages("margins")
library(tidyverse)
library(olsrr)
library(MASS)
library(skimr)
library(DataExplorer)
library(corrplot)
library(fabletools)
library(ggfortify)
library(mice)
library(caret)
library(naniar)
library(cowplot)
library(correlationfunnel)
library(car)
library(pROC)
library(smotefamily)
library(margins)
train_1 <- read_csv("wine-training-data.csv", show_col_types = FALSE)
eval_1 <- read_csv("wine-evaluation-data.csv", show_col_types = FALSE)
glimpse(train_1)
train_1 |>
mutate(
across(
c(LabelAppeal, STARS),
as_factor
)
) -> train_1
eval_1 |>
mutate(
across(
c(LabelAppeal, STARS),
as_factor
)
) -> eval_1
train_1 |>
dplyr::select(where(is.factor)) |>
sapply(levels)
train_1 |>
skim() |>
dplyr::select(-complete_rate, -numeric.sd) |>
mutate(numeric.mean = format(numeric.mean, scientific = FALSE, digits = 2))
d1 <- train_1 |>
dplyr::select(where(is.factor)) |>
group_by(LabelAppeal) |>
summarise(n = n())
d2 <- train_1 |>
dplyr::select(where(is.factor)) |>
group_by(STARS) |>
summarise(n = n())
knitr::kable(
list(d1, d2)
)
tibble(NAs = rowSums(is.na(train_1))) |>
filter(NAs != 0) |>
group_by(NAs) |>
summarise(n = n()) |>
ggplot(aes(x = NAs, y = n)) +
geom_bar(stat="identity") +
labs(title = "Combined missing values are rare", subtitle = "# NAs per row", y = NULL, x = NULL) +
theme_minimal()
p1 <- train_1 |>
  ggplot(aes(x = pH, y = Alcohol)) +
  geom_miss_point() +   # draws missing values as jittered points below the axis minimums
  theme_minimal() +
  labs(title = "Records missing pH don't significantly differ in Alcohol distribution", y = "Alcohol", x = "pH")
p2 <- train_1 |>
  ggplot(aes(x = ResidualSugar, y = Chlorides)) +
  geom_miss_point() +
  theme_minimal() +
  labs(title = "Records missing ResidualSugar don't significantly differ in Chlorides distribution", y = "Chlorides", x = "ResidualSugar")
plot_grid(p1, p2, ncol = 1)
train_1 |>
na.omit() |>
dplyr::select(-INDEX, -TARGET) |>
plot_correlation(
ggtheme = theme_minimal(),
theme_config = list(legend.position = "none", axis.text.x = element_text(angle =
45)))
train_1 %>%
dplyr::select(-INDEX) %>%
mutate(TARGET_BIN = if_else(TARGET > mean(TARGET), 1, 0)) %>%
dplyr::select(-TARGET) %>%
na.omit() %>%
binarize(n_bins = 6, thresh_infreq = 0.01, name_infreq = "OTHER", one_hot = TRUE) %>%
correlate(TARGET_BIN__1) %>%
plot_correlation_funnel(alpha = 0.7, interactive = TRUE)
train_1 %>%
filter(!is.na(Alcohol)) -> train_1
eval_1 %>%
filter(!is.na(Alcohol)) -> eval_1
cat("There are",sum(is.na(train_1$Alcohol)),"NA values in the Alcohol column")
train_1 %>%
filter(!is.na(ResidualSugar)) -> train_1
eval_1 %>%
filter(!is.na(ResidualSugar)) -> eval_1
cat("There are",sum(is.na(train_1$ResidualSugar)),"NA values in the ResidualSugar column")
train_1 |>
mice(printFlag = FALSE) -> cheesed
cheesed |>
complete() -> train_1
eval_1 |>
mice(printFlag = FALSE) -> cheesed
cheesed |>
complete() -> eval_1
colSums(is.na(train_1))
train_1 |>
na.omit() %>%
dplyr::select(-INDEX) %>%
plot_correlation(
ggtheme = theme_minimal(),
theme_config = list(legend.position = "none", axis.text.x = element_text(angle =
45)))
glm_1 <- glm(TARGET ~ . - INDEX,
data = train_1,
family = poisson) |>
stepAIC(
direction = "backward",
trace = FALSE
)
summary(glm_1)
glm_2 <- glm(TARGET ~ . - INDEX - FixedAcidity - CitricAcid - ResidualSugar,
data = train_1,
family = quasipoisson)
summary(glm_2)
glm_3 <- glm(TARGET ~ . - INDEX - FixedAcidity - CitricAcid - ResidualSugar,
data = train_1,
negative.binomial(1))
summary(glm_3)
mmps(glm_1, ask = FALSE)
plot(glm_1, which = 4, id.n=3)
glm_3b <- glm(TARGET ~ . - INDEX - FixedAcidity - CitricAcid - ResidualSugar - Density - pH - Sulphates - Alcohol,
data = train_1,
negative.binomial(1))
summary(glm_3b)
glm_3c <- glm(TARGET ~ LabelAppeal + AcidIndex + STARS,
data = train_1,
negative.binomial(1))
summary(glm_3c)
margins_object <- margins(glm_3c)
plot(margins_object, main = "Marginal Effects Plot")
eval_predictions <- predict(glm_3c, newdata = eval_1, type = "response")
eval_actual <- eval_1$TARGET
eval_mae <- mean(abs(eval_actual - eval_predictions))
eval_mse <- mean((eval_actual - eval_predictions)^2)
print(paste("Mean Absolute Error (MAE):", eval_mae))
print(paste("Mean Squared Error (MSE):", eval_mse))