We will explore, analyze, and model a data set containing approximately 12,000 records of commercially available wines. Each record has predictor variables describing the wine’s chemical properties, critic ratings, and the visual appeal of its label. The response variable indicates how many cases were sold.
We will create a count regression model to predict the number of cases of wine that will be sold given certain properties of the wine. We don’t expect great success using techniques available in this class because neither the type of wine nor its price point is specified.
Different types of wine are valued for different characteristics, such as acidity in certain white wines meant to be paired with spicy foods, or tannins in certain red wines meant to be paired with fatty foods. What makes one wine sell successfully in its category could make another wine do poorly in its category, which will muddle our results.
Wine is also purchased at different price points. The characteristics of wines of the same type may vary across price points, or purchasing decisions may favor stocking low-price wines regardless of their characteristics.
We’ll see what we can accomplish with the data provided.
Subsections:
- Loading Data
- Description of Variables
- Assumptions about the Data
Here we load in the data from the CSVs which we have downloaded. The data is separated into a training set for building the model and an evaluation set for testing the model.
We can see that we have the expected number of records loaded in: 12,795.
## Rows: 12,795
## Columns: 16
## $ INDEX <dbl> 1, 2, 4, 5, 6, 7, 8, 11, 12, 13, 14, 15, 16, 17, 19…
## $ TARGET <dbl> 3, 3, 5, 3, 4, 0, 0, 4, 3, 6, 0, 4, 3, 7, 4, 0, 0, …
## $ FixedAcidity <dbl> 3.2, 4.5, 7.1, 5.7, 8.0, 11.3, 7.7, 6.5, 14.8, 5.5,…
## $ VolatileAcidity <dbl> 1.160, 0.160, 2.640, 0.385, 0.330, 0.320, 0.290, -1…
## $ CitricAcid <dbl> -0.98, -0.81, -0.88, 0.04, -1.26, 0.59, -0.40, 0.34…
## $ ResidualSugar <dbl> 54.20, 26.10, 14.80, 18.80, 9.40, 2.20, 21.50, 1.40…
## $ Chlorides <dbl> -0.567, -0.425, 0.037, -0.425, NA, 0.556, 0.060, 0.…
## $ FreeSulfurDioxide <dbl> NA, 15, 214, 22, -167, -37, 287, 523, -213, 62, 551…
## $ TotalSulfurDioxide <dbl> 268, -327, 142, 115, 108, 15, 156, 551, NA, 180, 65…
## $ Density <dbl> 0.99280, 1.02792, 0.99518, 0.99640, 0.99457, 0.9994…
## $ pH <dbl> 3.33, 3.38, 3.12, 2.24, 3.12, 3.20, 3.49, 3.20, 4.9…
## $ Sulphates <dbl> -0.59, 0.70, 0.48, 1.83, 1.77, 1.29, 1.21, NA, 0.26…
## $ Alcohol <dbl> 9.9, NA, 22.0, 6.2, 13.7, 15.4, 10.3, 11.6, 15.0, 1…
## $ LabelAppeal <dbl> 0, -1, -1, -1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 2, 0, 0, …
## $ AcidIndex <dbl> 8, 7, 8, 6, 9, 11, 8, 7, 6, 8, 5, 10, 7, 8, 9, 8, 9…
## $ STARS <dbl> 2, 3, 3, 1, 2, NA, NA, 3, NA, 4, 1, 2, 2, 3, NA, NA…
Below is a short description of the variables of interest in the data set:
Variable | Quantitative or Categorical | Definition |
---|---|---|
INDEX | Categorical | Identification Variable |
TARGET | Count | Number of Cases Purchased |
AcidIndex | Quantitative | Proprietary total acidity measure |
Alcohol | Quantitative | Alcohol Content |
Chlorides | Quantitative | Chloride content of wine |
CitricAcid | Quantitative | Citric Acid Content |
Density | Quantitative | Density of Wine |
FixedAcidity | Quantitative | Fixed Acidity of wine |
FreeSulfurDioxide | Quantitative | Free sulfur dioxide content of wine |
LabelAppeal | Likert | Marketing Score with higher numbers meaning more label appeal |
ResidualSugar | Quantitative | Residual sugar content of wine |
STARS | Likert | Wine rating by a team of experts: 4 Stars = Excellent, 1 Star = Poor |
Sulphates | Quantitative | Sulfate content of wine |
TotalSulfurDioxide | Quantitative | Total Sulfur Dioxide of Wine |
VolatileAcidity | Quantitative | Volatile Acid content of wine |
pH | Quantitative | pH of wine |
It looks like LabelAppeal and the wine ratings are going to drive the number of cases bought, given how difficult it is to use chemical properties to evaluate appeal when neither the type of wine nor the price point is specified.
However, it may be that some properties, such as sulfur dioxide, indicate a flaw in the wine that is undesirable regardless of wine type or price point.
From the table above we can identify groups of variables such as ‘acidity’ (four variables) and ‘sulfur dioxide’ (two variables), so we may see some correlation within those groups or be able to derive new variables from them.
Note that volatile means it can evaporate, which provides aroma and also means it can be lost to the air during winemaking before the wine is bottled. Among the four acid variables it looks like we may just use AcidIndex, the proprietary measure of acidity, and VolatileAcidity. From a review of Agrovin’s Techniques for Correcting Wine Acidity it looks like too much volatile acidity may indicate a flaw in the wine. Also, citric acid is only one of the acids that make up a wine’s fixed acidity. The others are tartaric, malic, succinic, and lactic, and of the five the most important is tartaric, which we assume is accounted for in the proprietary measure of acidity, AcidIndex.
We have free and total sulfur dioxide, so we could create a column for fixed (bound) sulfur dioxide as the difference between the two, as sketched below.
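A minimal sketch of that derived column, using the hypothetical name BoundSulfurDioxide; we did not end up adding it to the models below.

# Bound ("fixed") sulfur dioxide: total minus free
train_1 <- train_1 |>
  dplyr::mutate(BoundSulfurDioxide = TotalSulfurDioxide - FreeSulfurDioxide)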
Density is also not an intuitive variable. It’s the density of the wine: its weight in grams divided by its volume in mL. Water has a density of 1.00 g/mL, wine on average has a density of 0.99 g/mL, and ethanol has a density of 0.79 g/mL. So density in wine may be a proxy for higher alcohol content.
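A quick way to check that proxy idea on our data, assuming train_1 is loaded as in the appendix, is the complete-case correlation between the two columns:

# A negative correlation here would support density acting as an (inverse) proxy for alcohol
cor(train_1$Density, train_1$Alcohol, use = "complete.obs")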
Subsections:
- Data Preprocessing - Data Skimming
- Histogram
- Outliers
- Missing Values
- Correlation
From our preliminary variable exploration, we know that we have both qualitative and quantitative variables in our data. Before we can begin exploring the data, we need to ensure that each variable is stored as the appropriate type.
Right away we see the variables that need to be factorized, such as LabelAppeal and STARS. We considered factorizing AcidIndex, however it is roughly normally distributed across its range of values and so it looks like it can stay numeric.
We begin our data preprocessing efforts by converting the categorical columns LabelAppeal and STARS from doubles into factors. We then view the levels of each of these factors to better understand which levels need to be modified and which factors should be ordered.
## $LabelAppeal
## [1] "-2" "-1" "0" "1" "2"
##
## $STARS
## [1] "1" "2" "3" "4"
We begin our data exploration by skimming the training data for immediate summary statistics, data types, and a visualization of distributions with mini histograms.
We still have our 12,795 rows with 16 columns, now split between 2 factor and 14 numeric variables.
Name | train_1 |
Number of rows | 12795 |
Number of columns | 16 |
Column type frequency: | |
factor | 2 |
numeric | 14 |
Group variables | None |
Variable type: factor
skim_variable | n_missing | ordered | n_unique | top_counts |
---|---|---|---|---|
LabelAppeal | 0 | FALSE | 5 | 0: 5617, -1: 3136, 1: 3048, -2: 504 |
STARS | 3359 | FALSE | 4 | 2: 3570, 1: 3042, 3: 2212, 4: 612 |
Variable type: numeric
skim_variable | n_missing | mean | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|
INDEX | 0 | 8069.980 | 1.00 | 4037.50 | 8110.00 | 12106.50 | 16129.00 | ▇▇▇▇▇ |
TARGET | 0 | 3.029 | 0.00 | 2.00 | 3.00 | 4.00 | 8.00 | ▆▇▇▆▁ |
FixedAcidity | 0 | 7.076 | -18.10 | 5.20 | 6.90 | 9.50 | 34.40 | ▁▂▇▂▁ |
VolatileAcidity | 0 | 0.324 | -2.79 | 0.13 | 0.28 | 0.64 | 3.68 | ▁▂▇▂▁ |
CitricAcid | 0 | 0.308 | -3.24 | 0.03 | 0.31 | 0.58 | 3.86 | ▁▂▇▂▁ |
ResidualSugar | 616 | 5.419 | -127.80 | -2.00 | 3.90 | 15.90 | 141.15 | ▁▂▇▂▁ |
Chlorides | 638 | 0.055 | -1.17 | -0.03 | 0.05 | 0.15 | 1.35 | ▁▂▇▂▁ |
FreeSulfurDioxide | 647 | 30.846 | -555.00 | 0.00 | 30.00 | 70.00 | 623.00 | ▁▂▇▂▁ |
TotalSulfurDioxide | 682 | 120.714 | -823.00 | 27.00 | 123.00 | 208.00 | 1057.00 | ▁▂▇▂▁ |
Density | 0 | 0.994 | 0.89 | 0.99 | 0.99 | 1.00 | 1.10 | ▁▂▇▂▁ |
pH | 395 | 3.208 | 0.48 | 2.96 | 3.20 | 3.47 | 6.13 | ▁▂▇▂▁ |
Sulphates | 1210 | 0.527 | -3.13 | 0.28 | 0.50 | 0.86 | 4.24 | ▁▂▇▂▁ |
Alcohol | 653 | 10.489 | -4.70 | 9.00 | 10.40 | 12.40 | 26.50 | ▁▂▇▂▁ |
AcidIndex | 0 | 7.773 | 4.00 | 7.00 | 8.00 | 8.00 | 17.00 | ▁▇▁▁▁ |
We can determine from the counts of the factor levels, which we expand below, that none of our factor variables is degenerate: no level has an extremely low count compared to either the total population or the most frequent level. This means we do not need to worry about removing factor variables due to degeneracy.
LabelAppeal | n |
---|---|
-2 | 504 |
-1 | 3136 |
0 | 5617 |
1 | 3048 |
2 | 490 |

STARS | n |
---|---|
1 | 3042 |
2 | 3570 |
3 | 2212 |
4 | 612 |
NA | 3359 |
ResidualSugar, Chlorides, FreeSulfurDioxide, TotalSulfurDioxide, and Alcohol all have roughly 600–700 missing values. pH has ~400 missing values and Sulphates has ~1,200 missing values.
For any of these, a missing value shouldn’t stand in for some other value. Typically that would be zero, but none of these measurements should theoretically be zero. It may be that wine from certain regions doesn’t get tested for certain characteristics, in which case we would worry about the missingness being interconnected.
Plotting the number of missing values in each row, we can see that the vast majority of rows have only a single missing value and a minuscule number have 3 or more columns with missing values. Thus, we will consider these missing values not to be interconnected.
For our purposes we’re going to delete records with missing values for Alcohol or ResidualSugar: Alcohol because we assume it has an impact on the number of cases purchased, and ResidualSugar because a flaw like excess residual sugar might lead the test result to be omitted if the tester is the wine seller themselves. For the rest we feel comfortable using MICE to impute values and then move on to demonstrate a count regression model.
To spot-check two of the relationships between missing values we’ve plotted scatterplots that draw missing values as jittered points below the axis minimums on both the y- and x-axis, so we can see how the distribution of one variable differs when the other is missing.
We’ve plotted Alcohol vs. pH and Chlorides vs. ResidualSugar, and in both cases there is no unusual pattern in the distribution of the second variable based on the missingness of the first.
For the last part of data exploration we take a look at which variables might be collinear, and which ones might be correlated with our response variable. Below we plot a correlation plot covering each variable. It is a bit hard to see the plot in its entirety, but resizing it and making it fullscreen allows us to read the information.
In our case we have some collinearity between AcidIndex and FixedAcidity, which makes sense. There is also a slight correlation between Sulphates and FixedAcidity, and between TotalSulfurDioxide and AcidIndex, so it may be that higher acidity and sulfur compounds together constitute a flavor profile in wines.
Interestingly, there is a little correlation between being rated 1 star and VolatileAcidity, so it may be that volatile acidity is a flaw in wine that critics evaluate negatively. Also, the correlation with Alcohol runs from positive to negative across STARS 4 through 1, with 4 stars having the highest correlation, so it appears alcohol has a noticeable impact on ratings.
It looks like there’s a very faint correlation between 4 STARS and Density, so our assumption in the Assumptions about the Data section that density is a proxy for alcohol may be correct. However, we assume the other compounds in wine mute the relationship and it’s better to just use Alcohol directly.
Next we create a correlation funnel that plots each variable against TARGET binned into above and below its mean. This serves as a guide to which variables will have the strongest impact on how many cases are purchased, starting with LabelAppeal and STARS, as we predicted, followed by AcidIndex and Alcohol. Interestingly, CitricAcid is high up there; possibly it has an outsized effect on flavor independent of its contribution to acidity.
After exploring our data we move on to preparing it to be more useful for our goal of creating a count regression model to predict cases of wine sold.
Subsections:
- Missing Values
- Distribution Transformation
- Interaction Columns
- Correlation
When there are missing values within our data, attempting to fit a model directly will silently drop roughly 40% of the rows used to calculate the model. To avoid this loss of data we will deal with the missing values ourselves.
As discussed in our data exploration we are removing records missing Alcohol and ResidualSugar values.
## There are 0 NA values in the Alcohol column
## There are 0 NA values in the ResidualSugar column
Finally, the remaining columns that still have NAs in them have all been analyzed and deemed suitable for MICE imputation. Through MICE (Multivariate Imputation by Chained Equations) each incomplete variable is modeled from the other variables, giving replacement values that are as close as possible to what the missing entries most likely would have been.
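One common sanity check on the imputations, sketched here assuming it is run before the completed data overwrite train_1 (the mids object is called cheesed in the appendix), is to compare the densities of imputed and observed values:

imp <- mice(train_1, printFlag = FALSE)   # mids object holding the multiple imputations
densityplot(imp)                          # overlay imputed vs. observed densities per variable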
## INDEX TARGET FixedAcidity VolatileAcidity
## 0 0 0 0
## CitricAcid ResidualSugar Chlorides FreeSulfurDioxide
## 0 0 0 0
## TotalSulfurDioxide Density pH Sulphates
## 0 0 0 0
## Alcohol LabelAppeal AcidIndex STARS
## 0 0 0 0
After all of our data preparation is done we want to revisit the correlation plot to see if any of the changes we have made have materially changed the correlation values. It looks like the MICE imputations and the deletion of records missing Alcohol or ResidualSugar values slightly increased the correlations in our data.
We begin our model building with a comparison of the three main types of count regression models and end with a comparison of three sets of selected features within the preferred model. This is how we will select the final count regression model to predict how many cases of wine are purchased.
We’re going to fit the following three types of count regression models with the same set of features to evaluate which model is best for our data.
Feature selection for this initial step will be based on the backwards stepwise selection achieved for the Poisson regression model.
Our prediction is that we’ll end up using the negative binomial model. A Poisson model is best suited to counts arising from a large number of trials with a small success probability, and while most of our wines have at least one case purchased, none sell a very large number of cases. An overdispersed (quasi-)Poisson model is for when there are many zero or low counts that, as long as no external factor is causing them, can be accommodated by letting the variance differ from the mean. In our case it looks like neither situation clearly applies, so we assume the negative binomial model is more suitable. The negative binomial model is a type of Poisson model in which the rate parameter lambda is itself gamma distributed.
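Concretely, in the usual shape–rate parameterization, the Poisson–gamma mixture behind the negative binomial (with dispersion parameter $\theta$, which the negative.binomial(1) family used in the appendix fixes at 1) can be written as:

$$
Y \mid \lambda \sim \mathrm{Poisson}(\lambda), \qquad
\lambda \sim \mathrm{Gamma}\!\left(\theta,\; \tfrac{\theta}{\mu}\right)
\;\Longrightarrow\;
E(Y) = \mu, \qquad \operatorname{Var}(Y) = \mu + \frac{\mu^{2}}{\theta}.
$$

The variance exceeds the mean, and the model approaches the ordinary Poisson as $\theta$ grows large.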
The first model that we will tackle is a Poisson regression model. We utilize backward stepwise selection to gradually remove variables, determining how significant each may be within our model.
##
## Call:
## glm(formula = TARGET ~ VolatileAcidity + Chlorides + FreeSulfurDioxide +
## TotalSulfurDioxide + Density + pH + Sulphates + Alcohol +
## LabelAppeal + AcidIndex + STARS, family = poisson, data = train_1)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.248e+00 2.088e-01 5.976 2.29e-09 ***
## VolatileAcidity -3.941e-02 6.885e-03 -5.724 1.04e-08 ***
## Chlorides -6.521e-02 1.690e-02 -3.858 0.000114 ***
## FreeSulfurDioxide 1.140e-04 3.596e-05 3.171 0.001519 **
## TotalSulfurDioxide 6.951e-05 2.322e-05 2.993 0.002759 **
## Density -3.020e-01 2.019e-01 -1.496 0.134742
## pH -1.633e-02 7.894e-03 -2.069 0.038549 *
## Sulphates -1.095e-02 5.765e-03 -1.899 0.057504 .
## Alcohol 4.407e-03 1.450e-03 3.039 0.002376 **
## LabelAppeal-1 2.253e-01 4.004e-02 5.627 1.84e-08 ***
## LabelAppeal0 4.121e-01 3.902e-02 10.562 < 2e-16 ***
## LabelAppeal1 5.343e-01 3.965e-02 13.476 < 2e-16 ***
## LabelAppeal2 6.610e-01 4.473e-02 14.778 < 2e-16 ***
## AcidIndex -9.494e-02 4.701e-03 -20.196 < 2e-16 ***
## STARS2 6.388e-01 1.407e-02 45.412 < 2e-16 ***
## STARS3 8.222e-01 1.547e-02 53.131 < 2e-16 ***
## STARS4 9.526e-01 2.206e-02 43.178 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 20657 on 11562 degrees of freedom
## Residual deviance: 13844 on 11546 degrees of freedom
## AIC: 42706
##
## Number of Fisher Scoring iterations: 5
The second model we will tackle is an overdispersed Poisson (quasi-Poisson) model. We will hold the feature selection the same as above to compare the models directly.
##
## Call:
## glm(formula = TARGET ~ . - INDEX - FixedAcidity - CitricAcid -
## ResidualSugar, family = quasipoisson, data = train_1)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.248e+00 1.967e-01 6.345 2.31e-10 ***
## VolatileAcidity -3.941e-02 6.484e-03 -6.078 1.26e-09 ***
## Chlorides -6.521e-02 1.592e-02 -4.096 4.23e-05 ***
## FreeSulfurDioxide 1.140e-04 3.386e-05 3.367 0.000762 ***
## TotalSulfurDioxide 6.951e-05 2.187e-05 3.178 0.001485 **
## Density -3.020e-01 1.902e-01 -1.588 0.112297
## pH -1.633e-02 7.435e-03 -2.197 0.028054 *
## Sulphates -1.095e-02 5.430e-03 -2.017 0.043737 *
## Alcohol 4.407e-03 1.366e-03 3.226 0.001257 **
## LabelAppeal-1 2.253e-01 3.771e-02 5.974 2.38e-09 ***
## LabelAppeal0 4.121e-01 3.675e-02 11.215 < 2e-16 ***
## LabelAppeal1 5.343e-01 3.735e-02 14.308 < 2e-16 ***
## LabelAppeal2 6.610e-01 4.212e-02 15.691 < 2e-16 ***
## AcidIndex -9.494e-02 4.427e-03 -21.444 < 2e-16 ***
## STARS2 6.388e-01 1.325e-02 48.218 < 2e-16 ***
## STARS3 8.222e-01 1.457e-02 56.414 < 2e-16 ***
## STARS4 9.526e-01 2.078e-02 45.846 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasipoisson family taken to be 0.8869995)
##
## Null deviance: 20657 on 11562 degrees of freedom
## Residual deviance: 13844 on 11546 degrees of freedom
## AIC: NA
##
## Number of Fisher Scoring iterations: 5
The third model is a negative binomial model. We will hold the feature selection the same as for the two models above to compare them directly.
##
## Call:
## glm(formula = TARGET ~ . - INDEX - FixedAcidity - CitricAcid -
## ResidualSugar, family = negative.binomial(1), data = train_1)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.558e+00 2.336e-01 6.669 2.69e-11 ***
## VolatileAcidity -5.254e-02 7.763e-03 -6.768 1.37e-11 ***
## Chlorides -9.372e-02 1.908e-02 -4.911 9.17e-07 ***
## FreeSulfurDioxide 1.608e-04 4.068e-05 3.952 7.79e-05 ***
## TotalSulfurDioxide 1.132e-04 2.622e-05 4.316 1.60e-05 ***
## Density -3.238e-01 2.276e-01 -1.422 0.15497
## pH -2.716e-02 8.906e-03 -3.050 0.00229 **
## Sulphates -1.707e-02 6.504e-03 -2.624 0.00870 **
## Alcohol 3.433e-03 1.632e-03 2.104 0.03543 *
## LabelAppeal-1 2.196e-01 3.617e-02 6.072 1.30e-09 ***
## LabelAppeal0 3.894e-01 3.525e-02 11.049 < 2e-16 ***
## LabelAppeal1 4.833e-01 3.642e-02 13.270 < 2e-16 ***
## LabelAppeal2 5.922e-01 4.570e-02 12.959 < 2e-16 ***
## AcidIndex -1.240e-01 4.925e-03 -25.181 < 2e-16 ***
## STARS2 6.455e-01 1.438e-02 44.899 < 2e-16 ***
## STARS3 8.412e-01 1.709e-02 49.224 < 2e-16 ***
## STARS4 9.842e-01 2.810e-02 35.028 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(1) family taken to be 0.3047134)
##
## Null deviance: 8184.4 on 11562 degrees of freedom
## Residual deviance: 6371.1 on 11546 degrees of freedom
## AIC: 50382
##
## Number of Fisher Scoring iterations: 5
From the models’ summary outputs it’s not clear which is best, so we fall back on our assumption that the negative binomial model is the most suitable and carry it forward.
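One way to probe that assumption, as a sketch rather than part of our model sequence, is to let MASS::glm.nb estimate the dispersion parameter theta from the data instead of fixing it at 1 the way negative.binomial(1) does; a very large estimated theta would suggest the counts are barely overdispersed relative to Poisson (glm_3_nb is a hypothetical name; train_1 is as prepared above).

library(MASS)
# Same feature set as glm_3, but with theta estimated rather than fixed at 1
glm_3_nb <- glm.nb(TARGET ~ . - INDEX - FixedAcidity - CitricAcid - ResidualSugar,
                   data = train_1)
glm_3_nb$theta   # estimated dispersion parameter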
It looks as if the backwards stepwise feature selection obtained in the first model didn’t remove enough features so that’s something we’ll explore in the next section.
It looks like the marginal model plots for the Poisson model show that the non-factorized variable with the clearest effect on the model is AcidIndex.
Lastly, as a check on our Data Preparation section, we’ve looked at the Cook’s distances for our data; they are all less than 0.0025, so we don’t have any influential outliers impacting the model.
In the previous three models we were consistently using the set of features achieved through backwards stepwise elimination on the poisson model. The negative binomial model using this is our base model and we will compare it to two other models with fewer and fewer features to identify which is best.
We remove Density because the p-value is 0.15497. This is supported by our data analysis where we assume that Density should be relatively similar across wines except between extremely high and low alcohol-content wines.
We remove pH and Sulphates because they had the lowest correlation to TARGET in the Data Preparation section.
We’re also choosing to take out Alcohol even though it was shown earlier to have correlation with TARGET, because its p-value in the model is among the highest of the remaining predictors.
With these four features removed the fit is essentially unchanged: the dispersion parameter is 0.3048038 for model 3b compared to 0.3047134 for model 3. Even with no change in fit we would take model 3b, since four fewer features for the same fit is an improvement.
##
## Call:
## glm(formula = TARGET ~ . - INDEX - FixedAcidity - CitricAcid -
## ResidualSugar - Density - pH - Sulphates - Alcohol, family = negative.binomial(1),
## data = train_1)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.179e+00 5.114e-02 23.048 < 2e-16 ***
## VolatileAcidity -5.252e-02 7.762e-03 -6.767 1.38e-11 ***
## Chlorides -9.348e-02 1.907e-02 -4.902 9.63e-07 ***
## FreeSulfurDioxide 1.574e-04 4.068e-05 3.870 0.000109 ***
## TotalSulfurDioxide 1.113e-04 2.621e-05 4.247 2.18e-05 ***
## LabelAppeal-1 2.191e-01 3.616e-02 6.060 1.41e-09 ***
## LabelAppeal0 3.883e-01 3.524e-02 11.021 < 2e-16 ***
## LabelAppeal1 4.822e-01 3.641e-02 13.244 < 2e-16 ***
## LabelAppeal2 5.893e-01 4.569e-02 12.897 < 2e-16 ***
## AcidIndex -1.243e-01 4.906e-03 -25.336 < 2e-16 ***
## STARS2 6.464e-01 1.438e-02 44.966 < 2e-16 ***
## STARS3 8.452e-01 1.706e-02 49.550 < 2e-16 ***
## STARS4 9.873e-01 2.806e-02 35.187 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(1) family taken to be 0.3048038)
##
## Null deviance: 8184.4 on 11562 degrees of freedom
## Residual deviance: 6378.0 on 11550 degrees of freedom
## AIC: 50381
##
## Number of Fisher Scoring iterations: 5
In this version of the model we remove FreeSulfurDioxide, TotalSulfurDioxide, VolatileAcidity and Chlorides.
What we see is a very small change in the dispersion parameter (0.3048038 in model 3b versus 0.3043724 in model 3c), so there is very little difference in fit with four fewer features.
##
## Call:
## glm(formula = TARGET ~ LabelAppeal + AcidIndex + STARS, family = negative.binomial(1),
## data = train_1)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.192608 0.050756 23.497 < 2e-16 ***
## LabelAppeal-1 0.218279 0.036103 6.046 1.53e-09 ***
## LabelAppeal0 0.387390 0.035182 11.011 < 2e-16 ***
## LabelAppeal1 0.482764 0.036347 13.282 < 2e-16 ***
## LabelAppeal2 0.585360 0.045621 12.831 < 2e-16 ***
## AcidIndex -0.126325 0.004887 -25.849 < 2e-16 ***
## STARS2 0.647537 0.014345 45.141 < 2e-16 ***
## STARS3 0.847077 0.017017 49.778 < 2e-16 ***
## STARS4 0.988160 0.028016 35.271 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(1) family taken to be 0.3043724)
##
## Null deviance: 8184.4 on 11562 degrees of freedom
## Residual deviance: 6410.0 on 11554 degrees of freedom
## AIC: 50405
##
## Number of Fisher Scoring iterations: 5
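Before interpreting the coefficients, a quick sketch of a check that dropping the four chemistry variables costs little: a deviance comparison of the nested fits, using the model names from the appendix.

# glm_3c (LabelAppeal + AcidIndex + STARS) is nested within glm_3b,
# so an approximate chi-square test on the change in deviance applies
anova(glm_3c, glm_3b, test = "Chisq")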
The intercept is 1.192608 on the log scale, so at the reference levels of the factors (LabelAppeal of -2 and 1 STAR) and an AcidIndex of zero the expected count would be exp(1.193) ≈ 3.3 cases, though an AcidIndex of zero is below the observed range of 4 to 17.
All of the predictors have very low p-values indicating they are all highly significant.
The coefficient of AcidIndex is negative, so we would expect that when this predictor is higher, fewer cases are sold.
The remaining predictors are factorized variables, but we can see how higher numbers of STARS equate to more cases sold because the coefficients go up with each star. Similarly we can see an increase in coefficients for each step up in LabelAppeal.
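Because the model uses a log link, exponentiating the coefficients turns them into multiplicative effects on the expected number of cases sold; for example, exp(0.988) ≈ 2.7 for STARS4 means a 4-star wine is expected to sell roughly 2.7 times as many cases as a 1-star wine, all else equal. A minimal sketch:

exp(coef(glm_3c))   # rate ratios: multiplicative effect of each predictor on expected cases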
The Marginal Effects Plot below shows the relative coefficients and we can see the increasing stepwise effect of higher values of both LabelAppeal and STARS.
The residual deviance is 6,410.0 which is lower than the null deviance (8,184.4), indicating that the model provides a better fit than a null model with no predictors.
The dispersion parameter (0.3043724) should be close to 1 for a well-fitted model; if it is greater than 1, there is overdispersion. It looks like we have an underdispersed model.
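That reported dispersion value can be reproduced directly as the Pearson chi-square statistic divided by the residual degrees of freedom; a sketch assuming glm_3c as fit in the appendix:

# Pearson-based estimate of the dispersion parameter for glm_3c
sum(residuals(glm_3c, type = "pearson")^2) / df.residual(glm_3c)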
The AIC, or Akaike Information Criterion, is a measure of model quality that penalizes models for having more parameters. In our case it is slightly higher at 50,405 than for our previous two models (50,382 and 50,381), but they seem close enough to justify accepting the model with the fewest parameters, model 3c.
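For a side-by-side view of the likelihood-based fits (the quasi-Poisson model has no likelihood and therefore no AIC), assuming the model objects are named as in the appendix:

AIC(glm_3, glm_3b, glm_3c)   # lower is better; the differences here are small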
Note that we are not evaluating the Fisher scoring iterations. This is the number of iterations performed during the fitting process and is a measure of how complex the optimization was.
Keeping glm_3c for our count regression model to predict cases sold seems like the best idea. There was very little deterioration in the negative binomial model profile with the fewest parameters that we tried. It is not a good fit, but this exercise has demonstrated some interesting connections in the data.
Unfortunately, we were not provided TARGET values for our test data, eval_1. Had we been, we would have been able to compare how well our model performed on the test data with the code below:
eval_predictions <- predict(glm_3c, newdata = eval_1, type = "response")
eval_actual <- eval_1$TARGET
eval_mae <- mean(abs(eval_actual - eval_predictions))
eval_mse <- mean((eval_actual - eval_predictions)^2)
print(paste("Mean Absolute Error (MAE):", eval_mae))
## [1] "Mean Absolute Error (MAE): NA"
print(paste("Mean Squared Error (MSE):", eval_mse))
## [1] "Mean Squared Error (MSE): NA"
It seems that for our count regression model we weren’t able to get a competent model. We did learn more about the chemistry of wine, and it seems that the proprietary index of specific acids in a wine does have some predictive value for how many cases will be bought, as do the rating of the wine and the visual appeal of the label. The other interesting relationships from our data analysis section were not borne out by the model but could prove true when holding the type of wine constant.
The importance of label appeal matches one team member’s personal experience working the cash register at a friend’s wine shop opening: a lot of the same bottles were purchased despite a huge selection, even looking only at the lower price-point wines, and they all had appealing labels.
Avenues for future analysis could be to try the two other combinations of features, b and c, with the other model types, 1 and 2, for four more models to compare.
We could also try different things at a data level, such as imputing the missing values for Alcohol and ResidualSugar or excluding records with missing values in other variables.
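A minimal sketch of that first alternative, imputing Alcohol and ResidualSugar instead of dropping those records (train_alt is a hypothetical name; file and column names are as used in the appendix):

library(tidyverse)
library(mice)
# Keep all 12,795 rows and let MICE impute Alcohol and ResidualSugar along with the rest
train_alt <- read_csv("wine-training-data.csv", show_col_types = FALSE) |>
  mutate(across(c(LabelAppeal, STARS), as_factor)) |>
  mice(printFlag = FALSE) |>
  complete()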
#install.packages("margins")
library(tidyverse)
library(olsrr)
library(MASS)
library(skimr)
library(DataExplorer)
library(corrplot)
library(fabletools)
library(ggfortify)
library(mice)
library(caret)
library(naniar)
library(cowplot)
library(correlationfunnel)
library(car)
library(pROC)
library(smotefamily)
library(margins)
train_1 <- read_csv("wine-training-data.csv", show_col_types = FALSE)
eval_1 <- read_csv("wine-evaluation-data.csv", show_col_types = FALSE)
glimpse(train_1)
train_1 |>
mutate(
across(
c(LabelAppeal, STARS),
as_factor
)
) -> train_1
eval_1 |>
mutate(
across(
c(LabelAppeal, STARS),
as_factor
)
) -> eval_1
train_1 |>
dplyr::select(where(is.factor)) |>
sapply(levels)
train_1 |>
skim() |>
dplyr::select(-complete_rate, -numeric.sd) |>
mutate(numeric.mean = format(numeric.mean, scientific = FALSE, digits = 2))
d1 <- train_1 |>
dplyr::select(where(is.factor)) |>
group_by(LabelAppeal) |>
summarise(n = n())
d2 <- train_1 |>
dplyr::select(where(is.factor)) |>
group_by(STARS) |>
summarise(n = n())
knitr::kable(
list(d1, d2)
)
tibble(NAs = rowSums(is.na(train_1))) |>
filter(NAs != 0) |>
group_by(NAs) |>
summarise(n = n()) |>
ggplot(aes(x = NAs, y = n)) +
geom_bar(stat="identity") +
labs(title = "Combined missing values are rare", subtitle = "# NAs per row", y = NULL, x = NULL) +
theme_minimal()
p1 <- train_1 |>
  ggplot(aes(x = pH, y = Alcohol)) +
  geom_miss_point() +   # draws missing values as jittered points below the axis minimums
  theme_minimal() +
  labs(title = "Records missing pH don't significantly differ in Alcohol distribution", y = "Alcohol", x = "pH")
p2 <- train_1 |>
  ggplot(aes(x = ResidualSugar, y = Chlorides)) +
  geom_miss_point() +
  theme_minimal() +
  labs(title = "Records missing ResidualSugar don't significantly differ in Chlorides distribution", y = "Chlorides", x = "ResidualSugar")
plot_grid(p1, p2, ncol = 1)
train_1 |>
na.omit() |>
dplyr::select(-INDEX, -TARGET) |>
plot_correlation(
ggtheme = theme_minimal(),
theme_config = list(legend.position = "none", axis.text.x = element_text(angle =
45)))
train_1 %>%
dplyr::select(-INDEX) %>%
mutate(TARGET_BIN = if_else(TARGET > mean(TARGET), 1, 0)) %>%
dplyr::select(-TARGET) %>%
na.omit() %>%
binarize(n_bins = 6, thresh_infreq = 0.01, name_infreq = "OTHER", one_hot = TRUE) %>%
correlate(TARGET_BIN__1) %>%
plot_correlation_funnel(alpha = 0.7, interactive = TRUE)
train_1 %>%
filter(!is.na(Alcohol)) -> train_1
eval_1 %>%
filter(!is.na(Alcohol)) -> eval_1
cat("There are",sum(is.na(train_1$Alcohol)),"NA values in the Alcohol column")
train_1 %>%
filter(!is.na(ResidualSugar)) -> train_1
eval_1 %>%
filter(!is.na(ResidualSugar)) -> eval_1
cat("There are",sum(is.na(train_1$ResidualSugar)),"NA values in the ResidualSugar column")
train_1 |>
mice(printFlag = FALSE) -> cheesed
cheesed |>
complete() -> train_1
eval_1 |>
mice(printFlag = FALSE) -> cheesed
cheesed |>
complete() -> eval_1
colSums(is.na(train_1))
train_1 |>
na.omit() %>%
dplyr::select(-INDEX) %>%
plot_correlation(
ggtheme = theme_minimal(),
theme_config = list(legend.position = "none", axis.text.x = element_text(angle =
45)))
glm_1 <- glm(TARGET ~ . - INDEX,
data = train_1,
family = poisson) |>
stepAIC(
direction = "backward",
trace = FALSE
)
summary(glm_1)
glm_2 <- glm(TARGET ~ . - INDEX - FixedAcidity - CitricAcid - ResidualSugar,
data = train_1,
family = quasipoisson)
summary(glm_2)
glm_3 <- glm(TARGET ~ . - INDEX - FixedAcidity - CitricAcid - ResidualSugar,
data = train_1,
negative.binomial(1))
summary(glm_3)
mmps(glm_1, ask = FALSE)
plot(glm_1, which = 4, id.n=3)
glm_3b <- glm(TARGET ~ . - INDEX - FixedAcidity - CitricAcid - ResidualSugar - Density - pH - Sulphates - Alcohol,
data = train_1,
negative.binomial(1))
summary(glm_3b)
glm_3c <- glm(TARGET ~ LabelAppeal + AcidIndex + STARS,
data = train_1,
negative.binomial(1))
summary(glm_3c)
margins_object <- margins(glm_3c)
plot(margins_object, main = "Marginal Effects Plot")
eval_predictions <- predict(glm_3c, newdata = eval_1, type = "response")
eval_actual <- eval_1$TARGET
eval_mae <- mean(abs(eval_actual - eval_predictions))
eval_mse <- mean((eval_actual - eval_predictions)^2)
print(paste("Mean Absolute Error (MAE):", eval_mae))
print(paste("Mean Squared Error (MSE):", eval_mse))