DATA 621 Homework #5
From Work by Critical Thinking Group 3
DATA 621 Homework #5
Introduction
We have been given a dataset with information on 12795 commercially available wines. The variables in the dataset are mostly related to the chemical properties of the wine being sold. A large wine manufacturer is studying the data in order to predict the number of wine cases ordered based upon the wine characteristics. If the wine manufacturer can predict the number of cases, then that manufacturer will be able to adjust their wine offering to maximize sales. The objective is explore, analyze and build a count regression model to predict the number of cases of wine that will be sold given certain properties of the wine.
Data Exploration & Preparation
As previously stated we have 12795 observations for developing the model. The evaluation dataset (which we will set aside for now) consists of 3335 observations. We will begin by taking a look at the modeling data:
TARGET | FixedAcidity | VolatileAcidity | CitricAcid | ResidualSugar | Chlorides | FreeSulfurDioxide | TotalSulfurDioxide | Density | pH | Sulphates | Alcohol | LabelAppeal | AcidIndex | STARS | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Min. :0.000 | Min. :-18.100 | Min. :-2.7900 | Min. :-3.2400 | Min. :-127.800 | Min. :-1.1710 | Min. :-555.00 | Min. :-823.0 | Min. :0.8881 | Min. :0.480 | Min. :-3.1300 | Min. :-4.70 | Min. :-2.000000 | Min. : 4.000 | Min. :1.000 | |
1st Qu.:2.000 | 1st Qu.: 5.200 | 1st Qu.: 0.1300 | 1st Qu.: 0.0300 | 1st Qu.: -2.000 | 1st Qu.:-0.0310 | 1st Qu.: 0.00 | 1st Qu.: 27.0 | 1st Qu.:0.9877 | 1st Qu.:2.960 | 1st Qu.: 0.2800 | 1st Qu.: 9.00 | 1st Qu.:-1.000000 | 1st Qu.: 7.000 | 1st Qu.:1.000 | |
Median :3.000 | Median : 6.900 | Median : 0.2800 | Median : 0.3100 | Median : 3.900 | Median : 0.0460 | Median : 30.00 | Median : 123.0 | Median :0.9945 | Median :3.200 | Median : 0.5000 | Median :10.40 | Median : 0.000000 | Median : 8.000 | Median :2.000 | |
Mean :3.029 | Mean : 7.076 | Mean : 0.3241 | Mean : 0.3084 | Mean : 5.419 | Mean : 0.0548 | Mean : 30.85 | Mean : 120.7 | Mean :0.9942 | Mean :3.208 | Mean : 0.5271 | Mean :10.49 | Mean :-0.009066 | Mean : 7.773 | Mean :2.042 | |
3rd Qu.:4.000 | 3rd Qu.: 9.500 | 3rd Qu.: 0.6400 | 3rd Qu.: 0.5800 | 3rd Qu.: 15.900 | 3rd Qu.: 0.1530 | 3rd Qu.: 70.00 | 3rd Qu.: 208.0 | 3rd Qu.:1.0005 | 3rd Qu.:3.470 | 3rd Qu.: 0.8600 | 3rd Qu.:12.40 | 3rd Qu.: 1.000000 | 3rd Qu.: 8.000 | 3rd Qu.:3.000 | |
Max. :8.000 | Max. : 34.400 | Max. : 3.6800 | Max. : 3.8600 | Max. : 141.150 | Max. : 1.3510 | Max. : 623.00 | Max. :1057.0 | Max. :1.0992 | Max. :6.130 | Max. : 4.2400 | Max. :26.50 | Max. : 2.000000 | Max. :17.000 | Max. :4.000 | |
NA | NA | NA | NA | NA’s :616 | NA’s :638 | NA’s :647 | NA’s :682 | NA | NA’s :395 | NA’s :1210 | NA’s :653 | NA | NA | NA’s :3359 |
Two things jump out about this dataset. The first thing is that there are variables with negative values. These negative values don’t make sense in the context of chemical properties of wine. After all, how can you have negative chlorides?
Second, there are quite a few records with missing values. We will need to either impute values or drop the records with incomplete data. These two issues will need to be corrected in the data.
Correcting Invalid Values
We will begin by correcting the invalid values. We will assume the negative values were a data entry issue and a negative value was entered when a positive value was desired. If these negative values were changed into positive numbers, they would be within the range of plausible values based on the other data.
fix_negative_values <- function(df){
df %>%
mutate(FixedAcidity = abs(FixedAcidity),
VolatileAcidity = abs(VolatileAcidity),
CitricAcid = abs(CitricAcid),
ResidualSugar = abs(ResidualSugar),
Chlorides = abs(Chlorides),
FreeSulfurDioxide = abs(FreeSulfurDioxide),
TotalSulfurDioxide = abs(TotalSulfurDioxide),
Sulphates = abs(Sulphates),
Alcohol = abs(Alcohol))
}
df <- fix_negative_values(df)
hold_out_data <- fix_negative_values(hold_out_data)
Missing Values
The instructions indicate that sometimes, the fact that a variable is missing is actually predictive of the target. Let’s see what we can learn about TARGET
from the missing values.
temp <- df %>%
select(TARGET, STARS, Alcohol, ResidualSugar, Chlorides, FreeSulfurDioxide, TotalSulfurDioxide, pH, Sulphates) %>%
gather("variable", "value", -TARGET) %>%
mutate(na = ifelse(is.na(value), "Yes", "No")) %>%
group_by(TARGET, na, variable) %>%
tally() %>%
ungroup()
temp <- temp %>%
group_by(variable, na) %>%
summarise(total = sum(n)) %>%
merge(temp) %>%
mutate(share = n / total)
ggplot(temp, aes(TARGET, share, fill = na)) +
geom_bar(stat = "identity", position = "dodge") +
scale_fill_brewer(palette = "Set1") +
facet_wrap(~variable, ncol = 4) +
ylab("Percent of Group")
We learn that the observations that are missing STARS
are much more likely to have a zero TARGET
. So let’s look at how many stars wine with a zero TARGET
have:
df %>%
filter(TARGET == 0) %>%
select(STARS) %>%
na.omit() %>%
group_by(STARS) %>%
tally() %>%
kable() %>%
kable_styling()
STARS | n |
---|---|
1 | 607 |
2 | 89 |
Based on this it looks like we should assign 1 for all the missing STARS
.
There is little to no information encoded in the NAs for Alcohol
, ResidualSugar
, Chlorides
, FreeSulfurDioxide
, TotalSulfurDioxide
, pH
, and Sulphates
. We will impute using the median value.
fix_missing_values <- function(df){
df %>%
mutate(
ResidualSugar = ifelse(is.na(ResidualSugar), median(ResidualSugar, na.rm = T), ResidualSugar),
Chlorides = ifelse(is.na(Chlorides), median(Chlorides, na.rm = T), Chlorides),
FreeSulfurDioxide = ifelse(is.na(FreeSulfurDioxide), median(FreeSulfurDioxide, na.rm = T), FreeSulfurDioxide),
TotalSulfurDioxide = ifelse(is.na(TotalSulfurDioxide), median(TotalSulfurDioxide, na.rm = T), TotalSulfurDioxide),
pH = ifelse(is.na(pH), median(pH, na.rm = T), pH),
Sulphates = ifelse(is.na(Sulphates), median(Sulphates, na.rm = T), Sulphates),
Alcohol = ifelse(is.na(Alcohol), median(Alcohol, na.rm = T), Alcohol),
STARS_imputed = ifelse(is.na(STARS), 1, 0),
STARS = ifelse(is.na(STARS), 1, STARS)
)
}
df <- fix_missing_values(df)
hold_out_data <- fix_missing_values(hold_out_data)
Training Test Split
Next we will look at splitting the data into a training and testing set at a standard 70-30 split between train and test:
Univariate Analysis
Response Variable
In order to model this as a Poisson process the mean should be equal to the variance
mean | variance |
---|---|
3.028243 | 3.701446 |
They are not the same. The implications of this requirement not being met is the the \(\beta\)s are correct but the standard errors will be wrong.
Predictors
dens <- lapply(predictors, FUN=function(var) {
ggplot(df, aes_string(x = var)) +
geom_density(fill = "gray") +
geom_vline(aes(xintercept = mean(train[, var])), color = "blue", size=1) +
geom_vline(aes(xintercept = median(train[, var])), color = "red", size=1) +
geom_vline(aes(xintercept = quantile(train[, var], 0.25)), linetype = "dashed", size = 0.5) +
geom_vline(aes(xintercept = quantile(train[, var], 0.75)), linetype = "dashed", size = 0.5)
})
do.call(grid.arrange, args = c(dens, list(ncol = 3)))
Some of the predictors (the chemical composition variables) are right skewed.
Bivariate Analysis
Now let’s examine the correlations between the variables:
Hmm. This suggests that the chemical makeup of the wine has little bearing on the number of cases sold. The label appeal, acidity index value and star rating have the strongest correlation. This also suggests that it may be worth the effort to do a better job filling in the missing values in the STARS
variable.
Model Building & Evaluation
Evaluation Approach
In order to assess the models we will return the the original intent of the exercise: If the wine manufacturer can predict the number of cases, then that manufacturer will be able to adjust their wine offering to maximize sales. We will evaluate the models based on the number of cases that are likely to be sold. We will assume that when the model overestimates the cases sold, these cases will never be sold and represent a loss. When the model underestimates there is also a loss but it is unseen. It is potential sales that are not realized.
We will summarize the evaluation by first constructing a confusion matrix like summary looking at what our model predicts vs what actually occured. We will also summarize the total number of cases that would be sold, the net cases sold (after adjusting for the glut due to overestimating), and a final adjusted net which factors in “losses” from under estimation.
evaluate_model <- function(model, test_df, yhat = FALSE){
temp <- data.frame(yhat=c(0:8), TARGET = c(0:8), n=c(0))
if(yhat){
test_df$yhat <- yhat
} else {
test_df$yhat <- round(predict.glm(model, newdata=test_df, type="response"), 0)
}
test_df <- test_df %>%
group_by(yhat, TARGET) %>%
tally() %>%
mutate(accuracy = ifelse(yhat > TARGET, "Over", ifelse(yhat < TARGET, "Under", "Accurate"))) %>%
mutate(cases_sold = ifelse(yhat > TARGET, TARGET, yhat) * n,
glut = ifelse(yhat > TARGET, yhat - TARGET, 0) * n,
missed_opportunity = ifelse(yhat < TARGET, TARGET - yhat, 0) * n) %>%
mutate(net_cases_sold = cases_sold - glut,
adj_net_cases_sold = cases_sold - glut - missed_opportunity)
results <- test_df %>%
group_by(accuracy) %>%
summarise(n = sum(n)) %>%
spread(accuracy, n)
accurate <- results$Accurate
over <- results$Over
under <- results$Under
cases_sold <- sum(test_df$cases_sold)
net_cases_sold <- sum(test_df$net_cases_sold)
adj_net_cases_sold <- sum(test_df$adj_net_cases_sold)
missed_opportunity <- sum(test_df$missed_opportunity)
glut <- sum(test_df$glut)
confusion_matrix <- test_df %>%
bind_rows(temp) %>%
group_by(yhat, TARGET) %>%
summarise(n = sum(n)) %>%
spread(TARGET, n, fill = 0)
return(list("confusion_matrix" = confusion_matrix, "results" = results, "df" = test_df, "accurate" = accurate, "over" = over, "under" = under, "cases_sold" = cases_sold, "net_cases_sold" = net_cases_sold, "adj_net_cases_sold" = adj_net_cases_sold, "glut" = glut, "missed_opportunity" = missed_opportunity))
}
Chemical Property Model
We will begin by first constructing a count regression model based on the chemical properties of the wine. So we will be excluding the label, star rating and the acid index as it is a weighted average of the acidity variables. Based on our bivariate analysis we would not expect this model to preform well. Since the mean is not equal to the variance we will be using the quasi-Poisson model.
train1 <- train %>%
select(-LabelAppeal, -AcidIndex, -STARS, -STARS_imputed)
model1 <- glm(TARGET ~ ., family = quasipoisson, train1)
summary(model1)
Call:
glm(formula = TARGET ~ ., family = quasipoisson, data = train1)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.7505 -0.7187 0.1491 0.7245 2.5330
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.73286580 0.25394606 6.824 0.00000000000945 ***
FixedAcidity -0.00587468 0.00135015 -4.351 0.00001369486133 ***
VolatileAcidity -0.08484398 0.01249535 -6.790 0.00000000001192 ***
CitricAcid 0.00947082 0.01107239 0.855 0.39238
ResidualSugar 0.00023887 0.00027490 0.869 0.38492
Chlorides -0.09293940 0.02997436 -3.101 0.00194 **
FreeSulfurDioxide 0.00013301 0.00006294 2.113 0.03459 *
TotalSulfurDioxide 0.00010001 0.00004135 2.419 0.01559 *
Density -0.64377662 0.25159620 -2.559 0.01052 *
pH -0.00219600 0.01001878 -0.219 0.82651
Sulphates -0.02528552 0.01096751 -2.305 0.02116 *
Alcohol 0.01099055 0.00186578 5.891 0.00000000398672 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for quasipoisson family taken to be 1.216629)
Null deviance: 15984 on 8957 degrees of freedom
Residual deviance: 15817 on 8946 degrees of freedom
AIC: NA
Number of Fisher Scoring iterations: 5
The following is a confusion matrix like examination of the model with the predicted cases going down the rows and the actual cases going across the columns.
yhat | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 17 | 2 | 6 | 14 | 15 | 3 | 2 | 1 | 0 |
3 | 792 | 74 | 314 | 751 | 914 | 581 | 216 | 45 | 8 |
4 | 11 | 0 | 4 | 18 | 24 | 20 | 4 | 1 | 0 |
5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
The model is only accurate 781 times. It overestimates 1232 times and underestimates 1824 times. If decisions were based off of this model, one would expect to sell 8589. If we subtract out the extra cases due to overestimation we would have 5645 net cases sold. If we adjust this for the cases that could have been sold if the model didn’t underestimate we would have 2604 cases sold.
Second Chemical Property Model
We will try one more chemical property model but only select the variables that were statistically significant in the preious model (FixedAcidity
, VolatileAcidity
and Alcohol
).
model2 <- glm(TARGET ~ FixedAcidity + VolatileAcidity + Alcohol, family = quasipoisson, train)
summary(model2)
Call:
glm(formula = TARGET ~ FixedAcidity + VolatileAcidity + Alcohol,
family = quasipoisson, data = train)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.7172 -0.7017 0.1321 0.7044 2.5983
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.094133 0.024678 44.335 < 0.0000000000000002 ***
FixedAcidity -0.005920 0.001349 -4.389 0.00001153301758 ***
VolatileAcidity -0.086542 0.012496 -6.925 0.00000000000464 ***
Alcohol 0.010932 0.001865 5.861 0.00000000475204 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for quasipoisson family taken to be 1.216481)
Null deviance: 15984 on 8957 degrees of freedom
Residual deviance: 15859 on 8954 degrees of freedom
AIC: NA
Number of Fisher Scoring iterations: 5
We would expect to have 8574 cases sold using this model (with a net of 5643 and adjusted net of 2587). Looking at the confusion matrix we see this model is accurate only 784 out of 3837 times.
yhat | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 11 | 2 | 3 | 5 | 6 | 2 | 3 | 0 | 0 |
3 | 806 | 73 | 320 | 772 | 938 | 598 | 217 | 47 | 8 |
4 | 3 | 1 | 1 | 6 | 9 | 4 | 2 | 0 | 0 |
5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
The previous model is slightly better but is still a horible model. Building a model based off of the chemical properties does not seem like a good solution. Let’s turn our attention to the other variables with stronger correlations.
Preceived Quality Model
This model relies on the consumer preception of quality based on the label, and the star ratings of the experts.
Call:
glm(formula = TARGET ~ LabelAppeal + STARS, family = quasipoisson,
data = train)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.83925 -0.67249 0.08322 0.53669 2.77111
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.426188 0.013822 30.83 <0.0000000000000002 ***
LabelAppeal 0.138170 0.006936 19.92 <0.0000000000000002 ***
STARS 0.345702 0.006095 56.72 <0.0000000000000002 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for quasipoisson family taken to be 0.9132717)
Null deviance: 15984 on 8957 degrees of freedom
Residual deviance: 11780 on 8955 degrees of freedom
AIC: NA
Number of Fisher Scoring iterations: 5
Here is a confusion matrix for this model:
yhat | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 771 | 74 | 256 | 405 | 286 | 89 | 11 | 2 | 0 |
3 | 44 | 2 | 62 | 276 | 311 | 92 | 15 | 3 | 0 |
4 | 5 | 0 | 5 | 97 | 261 | 237 | 59 | 4 | 0 |
5 | 0 | 0 | 1 | 5 | 62 | 117 | 43 | 7 | 0 |
6 | 0 | 0 | 0 | 0 | 26 | 26 | 47 | 7 | 3 |
7 | 0 | 0 | 0 | 0 | 7 | 41 | 32 | 18 | 1 |
8 | 0 | 0 | 0 | 0 | 0 | 2 | 15 | 6 | 4 |
Perceived Quality Plus Model
We will begin with the preceeding model but add in the Acid Index and the flag if the STARS was imputed.
model4 <- glm(TARGET ~ LabelAppeal + STARS + AcidIndex + STARS_imputed, family = quasipoisson, train)
summary(model4)
Call:
glm(formula = TARGET ~ LabelAppeal + STARS + AcidIndex + STARS_imputed,
family = quasipoisson, data = train)
Deviance Residuals:
Min 1Q Median 3Q Max
-3.1907 -0.6846 -0.0020 0.4711 3.7802
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.512852 0.042068 35.96 <0.0000000000000002 ***
LabelAppeal 0.162061 0.006881 23.55 <0.0000000000000002 ***
STARS 0.189468 0.006835 27.72 <0.0000000000000002 ***
AcidIndex -0.084084 0.005015 -16.77 <0.0000000000000002 ***
STARS_imputed -0.825021 0.020442 -40.36 <0.0000000000000002 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for quasipoisson family taken to be 0.8847836)
Null deviance: 15984.3 on 8957 degrees of freedom
Residual deviance: 9727.3 on 8953 degrees of freedom
AIC: NA
Number of Fisher Scoring iterations: 6
Let’s see how the predictions line up with the reality:
yhat | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 548 | 42 | 92 | 137 | 59 | 19 | 6 | 1 | 0 |
2 | 129 | 29 | 90 | 72 | 31 | 15 | 5 | 3 | 0 |
3 | 113 | 5 | 130 | 420 | 367 | 91 | 4 | 0 | 0 |
4 | 30 | 0 | 11 | 146 | 387 | 288 | 59 | 2 | 0 |
5 | 0 | 0 | 1 | 7 | 96 | 136 | 63 | 9 | 0 |
6 | 0 | 0 | 0 | 1 | 13 | 48 | 63 | 24 | 4 |
7 | 0 | 0 | 0 | 0 | 0 | 7 | 19 | 7 | 4 |
8 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 1 | 0 |
We would expect to have 9342 net cases sold using this model (with a net of 7071 and adjusted net of 4783). Looking at the confusion matrix we see this model is accurate only 979 out of 3837 times.
There are no zero predictions in this model which is not accurate. We will try to address this in our last model
Adjusted Perceived Quality Plus Model
We will start with the predictions from preceived quality plus model. We will then manually override the the predictions that have the stars filled in with a predicted zero.
temp <- test
temp$yhat <- round(predict.glm(model4, newdata=temp, type="response"), 0)
temp <- temp %>%
mutate(yhat = ifelse(STARS_imputed == 1, 0, yhat))
model5_results <- evaluate_model(model4, test, temp$yhat)
yhat | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
---|---|---|---|---|---|---|---|---|---|
0 | 617 | 42 | 92 | 137 | 68 | 29 | 10 | 4 | 0 |
1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
2 | 59 | 29 | 90 | 71 | 22 | 5 | 1 | 0 | 0 |
3 | 113 | 5 | 130 | 420 | 367 | 91 | 4 | 0 | 0 |
4 | 30 | 0 | 11 | 146 | 387 | 288 | 59 | 2 | 0 |
5 | 0 | 0 | 1 | 7 | 96 | 136 | 63 | 9 | 0 |
6 | 0 | 0 | 0 | 1 | 13 | 48 | 63 | 24 | 4 |
7 | 0 | 0 | 0 | 0 | 0 | 7 | 19 | 7 | 4 |
8 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 1 | 0 |
We see that applying the adjustment based on if the stars are imputed increased the correct number of cases where it should be predicted as a zero. It also introduced some underestimation.
We would expect to have 9671 net cases sold using this model (with a net of 7839 and adjusted net of 5880). Looking at the confusion matrix we see this model is accurate only 1145 out of 3837 times.
Model Selection
The following is a summary of the five models
Model | Number of Cases Sold | Number of Cases Overestimated to Sell | Missed Opportunities | Times Accurate |
---|---|---|---|---|
Chemical Property Model 1 | 8589 | 2944 | 3041 | 781 |
Chemical Property Model 2 | 8574 | 2931 | 3056 | 784 |
Perceived Quality Model | 9342 | 2271 | 2288 | 979 |
Perceived Quality Plus Model | 9671 | 1832 | 1959 | 1145 |
Adjusted Perceived Quality Plus Model | 9262 | 1145 | 2368 | 1720 |
The Perceived Quality Plus Model
has the most cases sold out of the models. However the Adjusted Perceived Quality Plus Model
has the highest accuracy. One of the reasons why the Perceived Quality Plus Model
predicted more cases sold is because we forced more zero predictions in the Adjusted Perceived Quality Plus Model
.
If we subtract out the waste from overestimation the Adjusted Perceived Quality Plus Model
is the best model. This has real business consequences as having extra cases of wine on hand that don’t sell becomes a real cost to a business. For these two reasons it will be the model of choice in producing estimates for the hold out datase.