DATA 621 Homework #5

Introduction

We have been given a dataset with information on 12795 commercially available wines. The variables in the dataset are mostly related to the chemical properties of the wine being sold. A large wine manufacturer is studying the data in order to predict the number of wine cases ordered based upon the wine characteristics. If the wine manufacturer can predict the number of cases, then that manufacturer will be able to adjust their wine offering to maximize sales. The objective is explore, analyze and build a count regression model to predict the number of cases of wine that will be sold given certain properties of the wine.

Data Exploration & Preparation

As previously stated we have 12795 observations for developing the model. The evaluation dataset (which we will set aside for now) consists of 3335 observations. We will begin by taking a look at the modeling data:

TARGET FixedAcidity VolatileAcidity CitricAcid ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide Density pH Sulphates Alcohol LabelAppeal AcidIndex STARS
Min. :0.000 Min. :-18.100 Min. :-2.7900 Min. :-3.2400 Min. :-127.800 Min. :-1.1710 Min. :-555.00 Min. :-823.0 Min. :0.8881 Min. :0.480 Min. :-3.1300 Min. :-4.70 Min. :-2.000000 Min. : 4.000 Min. :1.000
1st Qu.:2.000 1st Qu.: 5.200 1st Qu.: 0.1300 1st Qu.: 0.0300 1st Qu.: -2.000 1st Qu.:-0.0310 1st Qu.: 0.00 1st Qu.: 27.0 1st Qu.:0.9877 1st Qu.:2.960 1st Qu.: 0.2800 1st Qu.: 9.00 1st Qu.:-1.000000 1st Qu.: 7.000 1st Qu.:1.000
Median :3.000 Median : 6.900 Median : 0.2800 Median : 0.3100 Median : 3.900 Median : 0.0460 Median : 30.00 Median : 123.0 Median :0.9945 Median :3.200 Median : 0.5000 Median :10.40 Median : 0.000000 Median : 8.000 Median :2.000
Mean :3.029 Mean : 7.076 Mean : 0.3241 Mean : 0.3084 Mean : 5.419 Mean : 0.0548 Mean : 30.85 Mean : 120.7 Mean :0.9942 Mean :3.208 Mean : 0.5271 Mean :10.49 Mean :-0.009066 Mean : 7.773 Mean :2.042
3rd Qu.:4.000 3rd Qu.: 9.500 3rd Qu.: 0.6400 3rd Qu.: 0.5800 3rd Qu.: 15.900 3rd Qu.: 0.1530 3rd Qu.: 70.00 3rd Qu.: 208.0 3rd Qu.:1.0005 3rd Qu.:3.470 3rd Qu.: 0.8600 3rd Qu.:12.40 3rd Qu.: 1.000000 3rd Qu.: 8.000 3rd Qu.:3.000
Max. :8.000 Max. : 34.400 Max. : 3.6800 Max. : 3.8600 Max. : 141.150 Max. : 1.3510 Max. : 623.00 Max. :1057.0 Max. :1.0992 Max. :6.130 Max. : 4.2400 Max. :26.50 Max. : 2.000000 Max. :17.000 Max. :4.000
NA NA NA NA NA’s :616 NA’s :638 NA’s :647 NA’s :682 NA NA’s :395 NA’s :1210 NA’s :653 NA NA NA’s :3359

Two things jump out about this dataset. The first thing is that there are variables with negative values. These negative values don’t make sense in the context of chemical properties of wine. After all, how can you have negative chlorides?

Second, there are quite a few records with missing values. We will need to either impute values or drop the records with incomplete data. These two issues will need to be corrected in the data.

Correcting Invalid Values

We will begin by correcting the invalid values. We will assume the negative values were a data entry issue and a negative value was entered when a positive value was desired. If these negative values were changed into positive numbers, they would be within the range of plausible values based on the other data.

Missing Values

The instructions indicate that sometimes, the fact that a variable is missing is actually predictive of the target. Let’s see what we can learn about TARGET from the missing values.

We learn that the observations that are missing STARS are much more likely to have a zero TARGET. So let’s look at how many stars wine with a zero TARGET have:

STARS n
1 607
2 89

Based on this it looks like we should assign 1 for all the missing STARS.

There is little to no information encoded in the NAs for Alcohol, ResidualSugar, Chlorides, FreeSulfurDioxide, TotalSulfurDioxide, pH, and Sulphates. We will impute using the median value.

Training Test Split

Next we will look at splitting the data into a training and testing set at a standard 70-30 split between train and test:

Bivariate Analysis

Now let’s examine the correlations between the variables:

Hmm. This suggests that the chemical makeup of the wine has little bearing on the number of cases sold. The label appeal, acidity index value and star rating have the strongest correlation. This also suggests that it may be worth the effort to do a better job filling in the missing values in the STARS variable.

Model Building & Evaluation

Evaluation Approach

In order to assess the models we will return the the original intent of the exercise: If the wine manufacturer can predict the number of cases, then that manufacturer will be able to adjust their wine offering to maximize sales. We will evaluate the models based on the number of cases that are likely to be sold. We will assume that when the model overestimates the cases sold, these cases will never be sold and represent a loss. When the model underestimates there is also a loss but it is unseen. It is potential sales that are not realized.

We will summarize the evaluation by first constructing a confusion matrix like summary looking at what our model predicts vs what actually occured. We will also summarize the total number of cases that would be sold, the net cases sold (after adjusting for the glut due to overestimating), and a final adjusted net which factors in “losses” from under estimation.

evaluate_model <- function(model, test_df, yhat = FALSE){
  temp <- data.frame(yhat=c(0:8), TARGET = c(0:8), n=c(0))
  
  if(yhat){
    test_df$yhat <- yhat
  } else {
    test_df$yhat <- round(predict.glm(model, newdata=test_df, type="response"), 0)
  }
  
  test_df <- test_df %>%
    group_by(yhat, TARGET) %>%
    tally() %>%
    mutate(accuracy = ifelse(yhat > TARGET, "Over", ifelse(yhat < TARGET, "Under", "Accurate"))) %>%
    mutate(cases_sold = ifelse(yhat > TARGET, TARGET, yhat) * n,
           glut = ifelse(yhat > TARGET, yhat - TARGET, 0) * n,
           missed_opportunity = ifelse(yhat < TARGET, TARGET - yhat, 0) * n) %>%
    mutate(net_cases_sold = cases_sold - glut,
           adj_net_cases_sold = cases_sold - glut - missed_opportunity)
  
  results <- test_df %>%
    group_by(accuracy) %>%
    summarise(n = sum(n)) %>%
    spread(accuracy, n)
  
  accurate <- results$Accurate
  over <- results$Over
  under <- results$Under
  
  cases_sold <- sum(test_df$cases_sold)
  net_cases_sold <- sum(test_df$net_cases_sold)
  adj_net_cases_sold <- sum(test_df$adj_net_cases_sold)
  missed_opportunity <- sum(test_df$missed_opportunity)
  glut <- sum(test_df$glut)
  
  confusion_matrix <- test_df %>%
    bind_rows(temp) %>%
    group_by(yhat, TARGET) %>%
    summarise(n = sum(n)) %>%
    spread(TARGET, n, fill = 0)
  
  return(list("confusion_matrix" = confusion_matrix, "results" = results, "df" = test_df, "accurate" = accurate, "over" = over, "under" = under, "cases_sold" = cases_sold, "net_cases_sold" = net_cases_sold, "adj_net_cases_sold" = adj_net_cases_sold, "glut" = glut, "missed_opportunity" = missed_opportunity))
}

Chemical Property Model

We will begin by first constructing a count regression model based on the chemical properties of the wine. So we will be excluding the label, star rating and the acid index as it is a weighted average of the acidity variables. Based on our bivariate analysis we would not expect this model to preform well. Since the mean is not equal to the variance we will be using the quasi-Poisson model.


Call:
glm(formula = TARGET ~ ., family = quasipoisson, data = train1)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.7505  -0.7187   0.1491   0.7245   2.5330  

Coefficients:
                      Estimate  Std. Error t value         Pr(>|t|)    
(Intercept)         1.73286580  0.25394606   6.824 0.00000000000945 ***
FixedAcidity       -0.00587468  0.00135015  -4.351 0.00001369486133 ***
VolatileAcidity    -0.08484398  0.01249535  -6.790 0.00000000001192 ***
CitricAcid          0.00947082  0.01107239   0.855          0.39238    
ResidualSugar       0.00023887  0.00027490   0.869          0.38492    
Chlorides          -0.09293940  0.02997436  -3.101          0.00194 ** 
FreeSulfurDioxide   0.00013301  0.00006294   2.113          0.03459 *  
TotalSulfurDioxide  0.00010001  0.00004135   2.419          0.01559 *  
Density            -0.64377662  0.25159620  -2.559          0.01052 *  
pH                 -0.00219600  0.01001878  -0.219          0.82651    
Sulphates          -0.02528552  0.01096751  -2.305          0.02116 *  
Alcohol             0.01099055  0.00186578   5.891 0.00000000398672 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for quasipoisson family taken to be 1.216629)

    Null deviance: 15984  on 8957  degrees of freedom
Residual deviance: 15817  on 8946  degrees of freedom
AIC: NA

Number of Fisher Scoring iterations: 5

The following is a confusion matrix like examination of the model with the predicted cases going down the rows and the actual cases going across the columns.

yhat 0 1 2 3 4 5 6 7 8
0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0
2 17 2 6 14 15 3 2 1 0
3 792 74 314 751 914 581 216 45 8
4 11 0 4 18 24 20 4 1 0
5 0 0 0 0 0 0 0 0 0
6 0 0 0 0 0 0 0 0 0
7 0 0 0 0 0 0 0 0 0
8 0 0 0 0 0 0 0 0 0

The model is only accurate 781 times. It overestimates 1232 times and underestimates 1824 times. If decisions were based off of this model, one would expect to sell 8589. If we subtract out the extra cases due to overestimation we would have 5645 net cases sold. If we adjust this for the cases that could have been sold if the model didn’t underestimate we would have 2604 cases sold.

Second Chemical Property Model

We will try one more chemical property model but only select the variables that were statistically significant in the preious model (FixedAcidity, VolatileAcidity and Alcohol).


Call:
glm(formula = TARGET ~ FixedAcidity + VolatileAcidity + Alcohol, 
    family = quasipoisson, data = train)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.7172  -0.7017   0.1321   0.7044   2.5983  

Coefficients:
                 Estimate Std. Error t value             Pr(>|t|)    
(Intercept)      1.094133   0.024678  44.335 < 0.0000000000000002 ***
FixedAcidity    -0.005920   0.001349  -4.389     0.00001153301758 ***
VolatileAcidity -0.086542   0.012496  -6.925     0.00000000000464 ***
Alcohol          0.010932   0.001865   5.861     0.00000000475204 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for quasipoisson family taken to be 1.216481)

    Null deviance: 15984  on 8957  degrees of freedom
Residual deviance: 15859  on 8954  degrees of freedom
AIC: NA

Number of Fisher Scoring iterations: 5

We would expect to have 8574 cases sold using this model (with a net of 5643 and adjusted net of 2587). Looking at the confusion matrix we see this model is accurate only 784 out of 3837 times.

yhat 0 1 2 3 4 5 6 7 8
0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0
2 11 2 3 5 6 2 3 0 0
3 806 73 320 772 938 598 217 47 8
4 3 1 1 6 9 4 2 0 0
5 0 0 0 0 0 0 0 0 0
6 0 0 0 0 0 0 0 0 0
7 0 0 0 0 0 0 0 0 0
8 0 0 0 0 0 0 0 0 0

The previous model is slightly better but is still a horible model. Building a model based off of the chemical properties does not seem like a good solution. Let’s turn our attention to the other variables with stronger correlations.

Preceived Quality Model

This model relies on the consumer preception of quality based on the label, and the star ratings of the experts.


Call:
glm(formula = TARGET ~ LabelAppeal + STARS, family = quasipoisson, 
    data = train)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-2.83925  -0.67249   0.08322   0.53669   2.77111  

Coefficients:
            Estimate Std. Error t value            Pr(>|t|)    
(Intercept) 0.426188   0.013822   30.83 <0.0000000000000002 ***
LabelAppeal 0.138170   0.006936   19.92 <0.0000000000000002 ***
STARS       0.345702   0.006095   56.72 <0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for quasipoisson family taken to be 0.9132717)

    Null deviance: 15984  on 8957  degrees of freedom
Residual deviance: 11780  on 8955  degrees of freedom
AIC: NA

Number of Fisher Scoring iterations: 5

Here is a confusion matrix for this model:

yhat 0 1 2 3 4 5 6 7 8
0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0
2 771 74 256 405 286 89 11 2 0
3 44 2 62 276 311 92 15 3 0
4 5 0 5 97 261 237 59 4 0
5 0 0 1 5 62 117 43 7 0
6 0 0 0 0 26 26 47 7 3
7 0 0 0 0 7 41 32 18 1
8 0 0 0 0 0 2 15 6 4

Perceived Quality Plus Model

We will begin with the preceeding model but add in the Acid Index and the flag if the STARS was imputed.


Call:
glm(formula = TARGET ~ LabelAppeal + STARS + AcidIndex + STARS_imputed, 
    family = quasipoisson, data = train)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-3.1907  -0.6846  -0.0020   0.4711   3.7802  

Coefficients:
               Estimate Std. Error t value            Pr(>|t|)    
(Intercept)    1.512852   0.042068   35.96 <0.0000000000000002 ***
LabelAppeal    0.162061   0.006881   23.55 <0.0000000000000002 ***
STARS          0.189468   0.006835   27.72 <0.0000000000000002 ***
AcidIndex     -0.084084   0.005015  -16.77 <0.0000000000000002 ***
STARS_imputed -0.825021   0.020442  -40.36 <0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for quasipoisson family taken to be 0.8847836)

    Null deviance: 15984.3  on 8957  degrees of freedom
Residual deviance:  9727.3  on 8953  degrees of freedom
AIC: NA

Number of Fisher Scoring iterations: 6

Let’s see how the predictions line up with the reality:

yhat 0 1 2 3 4 5 6 7 8
0 0 0 0 0 0 0 0 0 0
1 548 42 92 137 59 19 6 1 0
2 129 29 90 72 31 15 5 3 0
3 113 5 130 420 367 91 4 0 0
4 30 0 11 146 387 288 59 2 0
5 0 0 1 7 96 136 63 9 0
6 0 0 0 1 13 48 63 24 4
7 0 0 0 0 0 7 19 7 4
8 0 0 0 0 0 0 3 1 0

We would expect to have 9342 net cases sold using this model (with a net of 7071 and adjusted net of 4783). Looking at the confusion matrix we see this model is accurate only 979 out of 3837 times.

There are no zero predictions in this model which is not accurate. We will try to address this in our last model

Adjusted Perceived Quality Plus Model

We will start with the predictions from preceived quality plus model. We will then manually override the the predictions that have the stars filled in with a predicted zero.

yhat 0 1 2 3 4 5 6 7 8
0 617 42 92 137 68 29 10 4 0
1 1 0 0 1 0 0 0 0 0
2 59 29 90 71 22 5 1 0 0
3 113 5 130 420 367 91 4 0 0
4 30 0 11 146 387 288 59 2 0
5 0 0 1 7 96 136 63 9 0
6 0 0 0 1 13 48 63 24 4
7 0 0 0 0 0 7 19 7 4
8 0 0 0 0 0 0 3 1 0

We see that applying the adjustment based on if the stars are imputed increased the correct number of cases where it should be predicted as a zero. It also introduced some underestimation.

We would expect to have 9671 net cases sold using this model (with a net of 7839 and adjusted net of 5880). Looking at the confusion matrix we see this model is accurate only 1145 out of 3837 times.

Model Selection

The following is a summary of the five models

Model Number of Cases Sold Number of Cases Overestimated to Sell Missed Opportunities Times Accurate
Chemical Property Model 1 8589 2944 3041 781
Chemical Property Model 2 8574 2931 3056 784
Perceived Quality Model 9342 2271 2288 979
Perceived Quality Plus Model 9671 1832 1959 1145
Adjusted Perceived Quality Plus Model 9262 1145 2368 1720

The Perceived Quality Plus Model has the most cases sold out of the models. However the Adjusted Perceived Quality Plus Model has the highest accuracy. One of the reasons why the Perceived Quality Plus Model predicted more cases sold is because we forced more zero predictions in the Adjusted Perceived Quality Plus Model.

If we subtract out the waste from overestimation the Adjusted Perceived Quality Plus Model is the best model. This has real business consequences as having extra cases of wine on hand that don’t sell becomes a real cost to a business. For these two reasons it will be the model of choice in producing estimates for the hold out datase.

Mike Silva

2019-12-05