library(tidyverse)
library(forcats)
library(modelr)
library(skimr)
library(knitr)
library(kableExtra)
library(broom)
library(caTools)
library(pscl)
library(grid)
library(gridExtra)
library(GGally)
library(mice)
library(car)
library(MASS)
library(caret)
library(corrplot)
library(reshape)
library(ggthemes)
library(moments)
library(qqplotr)
library(gridExtra)
library(geoR)
library(alr3)
library(caret)
library(pROC)
library(DataExplorer)
library(visdat)
library(janitor)
library(pander)
evaluate_model <- function(model, test_df, yhat = FALSE){
temp <- data.frame(yhat=c(0:8), TARGET = c(0:8), n=c(0))
if(yhat){
test_df$yhat <- yhat
} else {
test_df$yhat <- round(predict.glm(model, newdata=test_df, type="response"), 0)
}
test_df <- test_df %>%
group_by(yhat, TARGET) %>%
tally() %>%
mutate(accuracy = ifelse(yhat > TARGET, "Over", ifelse(yhat < TARGET, "Under", "Accurate"))) %>%
mutate(cases_sold = ifelse(yhat > TARGET, TARGET, yhat) * n,
glut = ifelse(yhat > TARGET, yhat - TARGET, 0) * n,
missed_opportunity = ifelse(yhat < TARGET, TARGET - yhat, 0) * n) %>%
mutate(net_cases_sold = cases_sold - glut,
adj_net_cases_sold = cases_sold - glut - missed_opportunity)
results <- test_df %>%
group_by(accuracy) %>%
summarise(n = sum(n)) %>%
spread(accuracy, n)
accurate <- results$Accurate
over <- results$Over
under <- results$Under
cases_sold <- sum(test_df$cases_sold)
net_cases_sold <- sum(test_df$net_cases_sold)
adj_net_cases_sold <- sum(test_df$adj_net_cases_sold)
missed_opportunity <- sum(test_df$missed_opportunity)
glut <- sum(test_df$glut)
confusion_matrix <- test_df %>%
bind_rows(temp) %>%
group_by(yhat, TARGET) %>%
summarise(n = sum(n)) %>%
spread(TARGET, n, fill = 0)
return(list("confusion_matrix" = confusion_matrix, "results" = results, "df" = test_df, "accurate" = accurate, "over" = over, "under" = under, "cases_sold" = cases_sold, "net_cases_sold" = net_cases_sold, "adj_net_cases_sold" = adj_net_cases_sold, "glut" = glut, "missed_opportunity" = missed_opportunity))
}
The objective of this analysis to predict the number of cases of wine that will be sold given certain properties of the wine. The data available for this analysis is set forth below:
VARIABLE NAME | DEFINITION | THEORETICAL EFFECT |
---|---|---|
INDEX |
Identification Variable (do not use) | None |
TARGET |
Number of Cases Purchased | None |
AcidIndex |
acidity of wine by using a weighted average | TBD |
Alcohol |
Alcohol Content | |
Chlorides |
Chloride content of wine | |
CitricAcid |
Citric Acid Content | |
Density |
Density of Wine | |
FixedAcidity |
Fixed Acidity of Wine | |
FreeSulfurDioxide |
Sulfur Dioxide content of wine | |
LabelAppeal |
Marketing Score indicator the appeal of label | |
ResidualSugar |
Residual Sugar of wine | |
STARS |
Wine rating by a team of experts. | A high number of stars suggests high sales |
… | 4 Stars = Excellent, 1 Star = Poor | |
Sulphates |
Sulfate content of wine | |
TotalSulfurDioxide |
Total Sulfur Dioxide of Wine | |
VolatileAcidity |
Volatile Acid content of wine | |
pH |
pH of wine |
After the initial data load, a variety of packages and functions were employed to preprocess the data. The object of the preprocessing phase is two fold. First we seek to develop a understanding of the data set and its variables and second we seek to identify potential problems that could undermine our prediction of wine sales.
The data set is comprised of 12,795 observations and 16 variables. Review of the str() function output below indicates that all variables are integer or numeric. There are also signs of missing data that will have to be explored further. The CitrixAcid, VolatileAcidity, Chlorides, Sulphtes and TotalSulfurDioxiode are also presenting negative value. This is conterintuitive for variables that are supposed to reflect chemical properties of wine and warrants some additional investigation.
Figure 1b, Summary Function Output, provides a convenient means to review descriptive statistics on the data set variables. Using the combination of Figures 1b and Figure 2 its evident that our dependent variable, TARGET, has a range of 0 to 8 (cases purchased) and a mode of four (4). The Summary table also re-enforces our concerns with missing and negative values. Next, we employed histograms (Figure 2) to provide an additional perspective of the data. One observation from the histograms is that many of the variables have similar histograms: uni modal, near-normal shapes with relatively fat tails. It also appears that the LabelAppeal, AcidIndex and STARS variable may be better suited as factors. The STARS variables seems to have a large percentage of missing data. Finally, there appears to be a fairly high number of zero values in the TARGET variable. Accordingly, our models will have to be be able to predict both positive and zero-value outcomes.
To drill-down on the missing data concern we employed the vis_dat() (Figure 3). This function provides a simple and visually appealing way to assess the level of missing data in a data set. Figure 3 indicates that numerous variable have a fairly a large percentage of missing data. The STARS variable suffers from particularly large percentage of missing value. This will be addressed in the preprocessing action section below.
'data.frame': 12795 obs. of 16 variables:
$ ï..INDEX : int 1 2 4 5 6 7 8 11 12 13 ...
$ TARGET : int 3 3 5 3 4 0 0 4 3 6 ...
$ FixedAcidity : num 3.2 4.5 7.1 5.7 8 11.3 7.7 6.5 14.8 5.5 ...
$ VolatileAcidity : num 1.16 0.16 2.64 0.385 0.33 0.32 0.29 -1.22 0.27 -0.22 ...
$ CitricAcid : num -0.98 -0.81 -0.88 0.04 -1.26 0.59 -0.4 0.34 1.05 0.39 ...
$ ResidualSugar : num 54.2 26.1 14.8 18.8 9.4 ...
$ Chlorides : num -0.567 -0.425 0.037 -0.425 NA 0.556 0.06 0.04 -0.007 -0.277 ...
$ FreeSulfurDioxide : num NA 15 214 22 -167 -37 287 523 -213 62 ...
$ TotalSulfurDioxide: num 268 -327 142 115 108 15 156 551 NA 180 ...
$ Density : num 0.993 1.028 0.995 0.996 0.995 ...
$ pH : num 3.33 3.38 3.12 2.24 3.12 3.2 3.49 3.2 4.93 3.09 ...
$ Sulphates : num -0.59 0.7 0.48 1.83 1.77 1.29 1.21 NA 0.26 0.75 ...
$ Alcohol : num 9.9 NA 22 6.2 13.7 15.4 10.3 11.6 15 12.6 ...
$ LabelAppeal : int 0 -1 -1 -1 0 0 0 1 0 0 ...
$ AcidIndex : int 8 7 8 6 9 11 8 7 6 8 ...
$ STARS : int 2 3 3 1 2 NA NA 3 NA 4 ...
ï..INDEX TARGET FixedAcidity VolatileAcidity CitricAcid ResidualSugar Chlorides
Min. : 1 Min. :0.00 Min. :-18.1 Min. :-2.79 Min. :-3.24 Min. :-128 Min. :-1
1st Qu.: 4038 1st Qu.:2.00 1st Qu.: 5.2 1st Qu.: 0.13 1st Qu.: 0.03 1st Qu.: -2 1st Qu.: 0
Median : 8110 Median :3.00 Median : 6.9 Median : 0.28 Median : 0.31 Median : 4 Median : 0
Mean : 8070 Mean :3.03 Mean : 7.1 Mean : 0.32 Mean : 0.31 Mean : 5 Mean : 0
3rd Qu.:12106 3rd Qu.:4.00 3rd Qu.: 9.5 3rd Qu.: 0.64 3rd Qu.: 0.58 3rd Qu.: 16 3rd Qu.: 0
Max. :16129 Max. :8.00 Max. : 34.4 Max. : 3.68 Max. : 3.86 Max. : 141 Max. : 1
NA's :616 NA's :638
FreeSulfurDioxide TotalSulfurDioxide Density pH Sulphates Alcohol LabelAppeal
Min. :-555 Min. :-823 Min. :0.888 Min. :0 Min. :-3 Min. :-5 Min. :-2.000
1st Qu.: 0 1st Qu.: 27 1st Qu.:0.988 1st Qu.:3 1st Qu.: 0 1st Qu.: 9 1st Qu.:-1.000
Median : 30 Median : 123 Median :0.994 Median :3 Median : 0 Median :10 Median : 0.000
Mean : 31 Mean : 121 Mean :0.994 Mean :3 Mean : 1 Mean :10 Mean :-0.009
3rd Qu.: 70 3rd Qu.: 208 3rd Qu.:1.001 3rd Qu.:3 3rd Qu.: 1 3rd Qu.:12 3rd Qu.: 1.000
Max. : 623 Max. :1057 Max. :1.099 Max. :6 Max. : 4 Max. :26 Max. : 2.000
NA's :647 NA's :682 NA's :395 NA's :1210 NA's :653
AcidIndex STARS
Min. : 4.00 Min. :1
1st Qu.: 7.00 1st Qu.:1
Median : 8.00 Median :2
Mean : 7.77 Mean :2
3rd Qu.: 8.00 3rd Qu.:3
Max. :17.00 Max. :4
NA's :3359
mylevels <- names(df[,2:15])
summary_plot <- df %>%
gather() %>%
mutate(facet = factor(key, levels=mylevels)) %>%
ggplot(aes(value)) +
facet_wrap(~ facet, scales = "free") +
geom_histogram() + theme_pander() +
theme(axis.text.y = element_text(size=7),
strip.text.x = element_text(size= 9),
axis.text.x = element_text(size=6),
plot.title = element_text(hjust = 0.5, size=10),
axis.title.y = element_text(size=8)) +
labs(x=NULL, title="Wine Data Histograms")
summary_plot
Given our review of the data we will take the following preprocessing actions:
Issues | Action |
---|---|
Missing Data | Use the mice function to impute missing values |
Missing STARS | Make NA its own level |
Negative Values1 | Take absolute value of negative values in the models (abs(x)) |
New Factors | Make STARS, LabelAppeal and AcidIndex Factors |
Change Factors | Use Forcat’s Recode function to collapse AcidIndex into four buckets |
Index Variable | Remove from data set via dplyr |
Update str and summary tables are set forth below, post pre-processing actions. Note - negative values will still be visible given the decision to transform the negative values in the model specification utilizing the abs() function.
# Factors
df <- df %>%
dplyr::select(-"ï..INDEX") %>%
mutate(STARS = factor(STARS)) %>%
mutate(STARS = fct_explicit_na(STARS,na_level = "0")) %>%
mutate(LabelAppeal = factor(LabelAppeal)) %>%
mutate(AcidIndex = if_else(AcidIndex <= 7,4L,AcidIndex)) %>%
mutate(AcidIndex = if_else(AcidIndex == 8 | AcidIndex == 9 ,3L,AcidIndex)) %>%
mutate(AcidIndex = if_else(AcidIndex == 10 | AcidIndex == 15 ,2L,AcidIndex)) %>%
mutate(AcidIndex = if_else(AcidIndex == 16 | AcidIndex == 17 | AcidIndex == 11 | AcidIndex == 12 | AcidIndex == 13 | AcidIndex == 14 ,1L,AcidIndex)) %>%
mutate(AcidIndex = factor(AcidIndex))
#Remove negativ values
# Impute missing value.
tmp_data <- mice(df,maxit=3, method='pmm',seed=20, print=F)
df <- complete(tmp_data,1)
df$FixedAcidity <- abs(df$FixedAcidity)
df$VolatileAcidity <- abs(df$VolatileAcidity)
df$CitricAcid <- abs(df$CitricAcid)
df$ResidualSugar <- abs(df$ResidualSugar)
df$Chlorides <- abs(df$Chlorides)
df$FreeSulfurDioxide <- abs(df$FreeSulfurDioxide)
df$TotalSulfurDioxide <- abs(df$TotalSulfurDioxide)
df$Sulphates <- abs(df$Sulphates)
df$Alcohol <- abs(df$Alcohol)
'data.frame': 12795 obs. of 15 variables:
$ TARGET : int 3 3 5 3 4 0 0 4 3 6 ...
$ FixedAcidity : num 3.2 4.5 7.1 5.7 8 11.3 7.7 6.5 14.8 5.5 ...
$ VolatileAcidity : num 1.16 0.16 2.64 0.385 0.33 0.32 0.29 1.22 0.27 0.22 ...
$ CitricAcid : num 0.98 0.81 0.88 0.04 1.26 0.59 0.4 0.34 1.05 0.39 ...
$ ResidualSugar : num 54.2 26.1 14.8 18.8 9.4 ...
$ Chlorides : num 0.567 0.425 0.037 0.425 0.047 0.556 0.06 0.04 0.007 0.277 ...
$ FreeSulfurDioxide : num 50 15 214 22 167 37 287 523 213 62 ...
$ TotalSulfurDioxide: num 268 327 142 115 108 15 156 551 307 180 ...
$ Density : num 0.993 1.028 0.995 0.996 0.995 ...
$ pH : num 3.33 3.38 3.12 2.24 3.12 3.2 3.49 3.2 4.93 3.09 ...
$ Sulphates : num 0.59 0.7 0.48 1.83 1.77 1.29 1.21 0.42 0.26 0.75 ...
$ Alcohol : num 9.9 3.1 22 6.2 13.7 15.4 10.3 11.6 15 12.6 ...
$ LabelAppeal : Factor w/ 5 levels "-2","-1","0",..: 3 2 2 2 3 3 3 4 3 3 ...
$ AcidIndex : Factor w/ 4 levels "1","2","3","4": 3 4 3 4 3 1 3 4 4 3 ...
$ STARS : Factor w/ 5 levels "1","2","3","4",..: 2 3 3 1 2 5 5 3 5 4 ...
TARGET FixedAcidity VolatileAcidity CitricAcid ResidualSugar Chlorides FreeSulfurDioxide
Min. :0.00 Min. : 0.0 Min. :0.00 Min. :0.00 Min. : 0.0 Min. :0.000 Min. : 0
1st Qu.:2.00 1st Qu.: 5.6 1st Qu.:0.25 1st Qu.:0.28 1st Qu.: 3.6 1st Qu.:0.045 1st Qu.: 28
Median :3.00 Median : 7.0 Median :0.41 Median :0.44 Median : 12.9 Median :0.098 Median : 56
Mean :3.03 Mean : 8.1 Mean :0.64 Mean :0.69 Mean : 23.4 Mean :0.222 Mean :107
3rd Qu.:4.00 3rd Qu.: 9.8 3rd Qu.:0.91 3rd Qu.:0.97 3rd Qu.: 38.6 3rd Qu.:0.368 3rd Qu.:171
Max. :8.00 Max. :34.4 Max. :3.68 Max. :3.86 Max. :141.2 Max. :1.351 Max. :623
TotalSulfurDioxide Density pH Sulphates Alcohol LabelAppeal AcidIndex STARS
Min. : 0 Min. :0.888 Min. :0.48 Min. :0.00 Min. : 0.0 -2: 504 1: 514 1:3042
1st Qu.: 100 1st Qu.:0.988 1st Qu.:2.96 1st Qu.:0.44 1st Qu.: 9.0 -1:3136 2: 559 2:3570
Median : 154 Median :0.994 Median :3.20 Median :0.59 Median :10.4 0 :5617 3:5569 3:2212
Mean : 205 Mean :0.994 Mean :3.21 Mean :0.85 Mean :10.5 1 :3048 4:6153 4: 612
3rd Qu.: 263 3rd Qu.:1.001 3rd Qu.:3.47 3rd Qu.:1.10 3rd Qu.:12.4 2 : 490 0:3359
Max. :1057 Max. :1.099 Max. :6.13 Max. :4.24 Max. :26.5
With out preprocessing complete we will not utilize some visual EDA to gain additional insight into the data and to inform our opinion on models specification and variable selection. EDA will utilize a variety of charts and plots including bar, box plots, correlation plots and more.
The box plots below set forth show fairly strong relationships between the TARGET value and out factors. Theses are encouraging plots that indicate these variables will likely play a role in our prediction models. Note that the , per our pre-processing actions, has be recoded into four buckets. Also missing values in the STAR variable have been assigned to the 0 bucket.
bp1 <- ggplot(df, aes(LabelAppeal,TARGET)) + geom_boxplot() + theme_fivethirtyeight() +
theme(axis.title = element_text(size=10),
plot.title = element_text(hjust= 0.5, size = 12)) +
labs(title = 'TARGET vs LabelAppeal')
bp2 <- ggplot(df, aes(STARS,TARGET)) + geom_boxplot() + theme_fivethirtyeight() +
theme(axis.title = element_text(size=10),
plot.title = element_text(hjust= 0.5, size = 12)) +
labs(title = 'TARGET vs STARS')
bp3 <- ggplot(df, aes(AcidIndex,TARGET)) + geom_boxplot() + theme_fivethirtyeight() +
theme(axis.title = element_text(size=10),
plot.title = element_text(hjust= 0.5, size = 12)) +
labs(title = 'TARGET vs AcidIndex')
grid.arrange(bp1, bp2, bp3, ncol = 3)
The Loess plots are bit difficult to interpret. For some variables there does appear to be a near-linear relationship between the variable and Target. Alcohol and VolatileAcidity, for example are two variables that show promise from a modeling perspective. That said, the correlation plot and the loess plots lead one to believe that the factor variables with be the stars in our modeling phase.
var_plot <- df %>%
dplyr::select(-STARS, -AcidIndex, -LabelAppeal) %>%
gather(key="var", value="val", -TARGET) %>%
ggplot(aes(val, TARGET))+
facet_wrap(~var, scales = "free") +
geom_smooth(method="loess") +
theme_fivethirtyeight()
var_plot
The correlation plot is striking for its apparent lack of correlation. This would seem to suggest that the chemical makeup of wine has no bearing on sales. This would also lead one to believe that these variables may play a diminished role in our models. There does not seem to be any particularly strong correlation between variables.
Before starting our modeling our data was partitioned into a train set (70%) and a test set (30%). The training data will be utilized to create the models and then newly created models will be applied to the test data for predictions and matrix analysis.
Our approach to modeling was to make strong use of the factor variable and limited use of the continuous variables given the uncertainty around the negative values. When continuous values were employed the absolute value of the variable is utilized in the model. We also employed three varieties of models in our analysis: Linear, Poisson, Negative Binomial Zero-Inflated.
A manually iterative process was employed to narrow the models down to the five contenders. Model summaries and confusion matrix data is presented for each model. The model evaluation section then picks a winner based upon a variety of factors, including: prediction ability (can the model predict all relevant value ranges), accuracy, AIC, BIC and LogLik.
mod1 <- glm(TARGET ~ STARS + AcidIndex + LabelAppeal + Alcohol, family = poisson, train)
summary(mod1)
Call:
glm(formula = TARGET ~ STARS + AcidIndex + LabelAppeal + Alcohol,
family = poisson, data = train)
Deviance Residuals:
Min 1Q Median 3Q Max
-3.218 -0.652 0.001 0.460 3.757
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.02884 0.06718 -0.43 0.668
STARS2 0.32492 0.01718 18.92 < 2e-16 ***
STARS3 0.44835 0.01862 24.08 < 2e-16 ***
STARS4 0.57340 0.02623 21.86 < 2e-16 ***
STARS0 -0.75863 0.02335 -32.49 < 2e-16 ***
AcidIndex2 0.31552 0.05954 5.30 1.2e-07 ***
AcidIndex3 0.54272 0.04764 11.39 < 2e-16 ***
AcidIndex4 0.60928 0.04752 12.82 < 2e-16 ***
LabelAppeal-1 0.24764 0.04581 5.41 6.5e-08 ***
LabelAppeal0 0.43080 0.04469 9.64 < 2e-16 ***
LabelAppeal1 0.57071 0.04542 12.56 < 2e-16 ***
LabelAppeal2 0.69160 0.05112 13.53 < 2e-16 ***
Alcohol 0.00430 0.00169 2.55 0.011 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 16053.8 on 8957 degrees of freedom
Residual deviance: 9614.9 on 8945 degrees of freedom
AIC: 31980
Number of Fisher Scoring iterations: 6
pred <- predict(mod1, newdata=test, type='response')
predRound <- as.factor(round(pred,0)-1)
testData <- as.factor(test$TARGET)
levels(predRound) <- c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "0")
levels(testData) <- c("0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "10")
cm <- confusionMatrix(predRound, testData)
cm$overall[1]
Accuracy
0.217
Reference
Prediction 0 1 2 3 4 5 6 7 8 9 10
0 0 0 0 0 0 0 0 0 0 0 0
1 3 0 0 0 0 0 0 0 0 0 0
2 546 39 100 135 76 20 1 1 0 0 0
3 144 32 142 143 45 22 6 1 1 0 0
4 99 9 74 292 264 87 6 2 0 0 0
5 13 0 15 205 406 197 41 0 0 0 0
6 4 0 0 8 156 233 110 18 1 0 0
7 0 0 0 0 6 46 59 16 5 0 0
8 0 0 0 0 0 1 4 2 1 0 0
9 0 0 0 0 0 0 0 0 0 0 0
10 0 0 0 0 0 0 0 0 0 0 0
Call:
lm(formula = TARGET ~ STARS + AcidIndex + LabelAppeal + Alcohol,
data = train)
Residuals:
Min 1Q Median 3Q Max
-4.936 -0.849 0.060 0.822 6.201
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.70717 0.11117 6.36 2.1e-10 ***
STARS2 1.05570 0.03924 26.90 < 2e-16 ***
STARS3 1.63031 0.04514 36.12 < 2e-16 ***
STARS4 2.33799 0.07308 31.99 < 2e-16 ***
STARS0 -1.36059 0.03964 -34.33 < 2e-16 ***
AcidIndex2 0.45676 0.09666 4.73 2.3e-06 ***
AcidIndex3 0.96510 0.07308 13.21 < 2e-16 ***
AcidIndex4 1.17590 0.07303 16.10 < 2e-16 ***
LabelAppeal-1 0.38411 0.07623 5.04 4.8e-07 ***
LabelAppeal0 0.82929 0.07431 11.16 < 2e-16 ***
LabelAppeal1 1.31885 0.07754 17.01 < 2e-16 ***
LabelAppeal2 1.84886 0.10192 18.14 < 2e-16 ***
Alcohol 0.01346 0.00384 3.50 0.00046 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.31 on 8945 degrees of freedom
Multiple R-squared: 0.536, Adjusted R-squared: 0.535
F-statistic: 861 on 12 and 8945 DF, p-value: <2e-16
pred <- predict(mod2, newdata=test)
predRound <- as.factor(round(pred,0))
levels(predRound) <- levels(as.factor(test$TARGET))
confusionMatrix(predRound, as.factor(test$TARGET))
Confusion Matrix and Statistics
Reference
Prediction 0 1 2 3 4 5 6 7 8
0 2 0 0 0 0 0 0 0 0
1 67 7 9 6 2 0 0 0 0
2 386 32 88 125 52 9 1 0 0
3 224 31 134 137 55 26 5 2 1
4 109 10 78 279 254 78 5 0 0
5 20 0 22 229 465 262 63 2 0
6 1 0 0 7 120 188 103 20 2
7 0 0 0 0 5 43 50 16 5
8 0 0 0 0 0 0 0 0 0
Overall Statistics
Accuracy : 0.226
95% CI : (0.213, 0.24)
No Information Rate : 0.248
P-Value [Acc > NIR] : 0.999
Kappa : 0.087
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5 Class: 6 Class: 7 Class: 8
Sensitivity 0.002472 0.08750 0.2659 0.1750 0.2665 0.4323 0.4537 0.40000 0.00000
Specificity 1.000000 0.97764 0.8274 0.8435 0.8062 0.7521 0.9064 0.97287 1.00000
Pos Pred Value 1.000000 0.07692 0.1270 0.2228 0.3124 0.2465 0.2336 0.13445 NaN
Neg Pred Value 0.789570 0.98051 0.9227 0.7995 0.7688 0.8760 0.9635 0.99354 0.99792
Prevalence 0.210842 0.02085 0.0863 0.2041 0.2484 0.1579 0.0592 0.01042 0.00208
Detection Rate 0.000521 0.00182 0.0229 0.0357 0.0662 0.0683 0.0268 0.00417 0.00000
Detection Prevalence 0.000521 0.02372 0.1806 0.1603 0.2119 0.2770 0.1149 0.03101 0.00000
Balanced Accuracy 0.501236 0.53257 0.5466 0.5092 0.5363 0.5922 0.6801 0.68644 0.50000
mod3 <- glm(TARGET ~ STARS + AcidIndex + LabelAppeal + VolatileAcidity, family = poisson, train)
summary(mod3)
Call:
glm(formula = TARGET ~ STARS + AcidIndex + LabelAppeal + VolatileAcidity,
family = poisson, data = train)
Deviance Residuals:
Min 1Q Median 3Q Max
-3.239 -0.644 -0.010 0.463 3.751
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.0431 0.0653 0.66 0.5094
STARS2 0.3234 0.0172 18.82 < 2e-16 ***
STARS3 0.4484 0.0186 24.09 < 2e-16 ***
STARS4 0.5761 0.0262 22.01 < 2e-16 ***
STARS0 -0.7589 0.0233 -32.50 < 2e-16 ***
AcidIndex2 0.3126 0.0595 5.25 1.5e-07 ***
AcidIndex3 0.5399 0.0476 11.33 < 2e-16 ***
AcidIndex4 0.6072 0.0475 12.77 < 2e-16 ***
LabelAppeal-1 0.2453 0.0458 5.35 8.6e-08 ***
LabelAppeal0 0.4279 0.0447 9.58 < 2e-16 ***
LabelAppeal1 0.5672 0.0454 12.49 < 2e-16 ***
LabelAppeal2 0.6901 0.0511 13.50 < 2e-16 ***
VolatileAcidity -0.0332 0.0113 -2.94 0.0033 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 16053.8 on 8957 degrees of freedom
Residual deviance: 9612.6 on 8945 degrees of freedom
AIC: 31977
Number of Fisher Scoring iterations: 6
pred <- predict(mod3, newdata=test, type='response')
predRound <- as.factor(round(pred,0)-1)
testData <- as.factor(test$TARGET)
levels(predRound) <- c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "0")
levels(testData) <- c("0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "10")
cm <- confusionMatrix(predRound, testData)
cm$overall[1]
Accuracy
0.22
Reference
Prediction 0 1 2 3 4 5 6 7 8 9 10
0 0 0 0 0 0 0 0 0 0 0 0
1 3 0 0 0 0 0 0 0 0 0 0
2 552 39 100 135 76 22 1 1 0 0 0
3 136 32 142 141 50 19 6 1 1 0 0
4 100 9 75 294 265 86 6 2 0 0 0
5 14 0 14 194 392 206 40 1 0 0 0
6 4 0 0 19 164 219 110 12 0 0 0
7 0 0 0 0 6 53 61 22 5 0 0
8 0 0 0 0 0 1 3 1 2 0 0
9 0 0 0 0 0 0 0 0 0 0 0
10 0 0 0 0 0 0 0 0 0 0 0
Call:
lm(formula = TARGET ~ STARS + AcidIndex + LabelAppeal + VolatileAcidity,
data = train)
Residuals:
Min 1Q Median 3Q Max
-4.977 -0.854 0.053 0.831 6.189
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.9320 0.1047 8.90 < 2e-16 ***
STARS2 1.0511 0.0392 26.78 < 2e-16 ***
STARS3 1.6306 0.0451 36.14 < 2e-16 ***
STARS4 2.3454 0.0730 32.14 < 2e-16 ***
STARS0 -1.3618 0.0396 -34.37 < 2e-16 ***
AcidIndex2 0.4483 0.0966 4.64 3.5e-06 ***
AcidIndex3 0.9541 0.0731 13.06 < 2e-16 ***
AcidIndex4 1.1664 0.0730 15.97 < 2e-16 ***
LabelAppeal-1 0.3776 0.0762 4.95 7.4e-07 ***
LabelAppeal0 0.8227 0.0743 11.07 < 2e-16 ***
LabelAppeal1 1.3102 0.0775 16.90 < 2e-16 ***
LabelAppeal2 1.8465 0.1019 18.12 < 2e-16 ***
VolatileAcidity -0.1028 0.0252 -4.09 4.4e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.31 on 8945 degrees of freedom
Multiple R-squared: 0.536, Adjusted R-squared: 0.536
F-statistic: 862 on 12 and 8945 DF, p-value: <2e-16
pred <- predict(mod4, newdata=test)
predRound <- as.factor(round(pred,0))
levels(predRound) <- levels(as.factor(test$TARGET))
confusionMatrix(predRound, as.factor(test$TARGET))
Confusion Matrix and Statistics
Reference
Prediction 0 1 2 3 4 5 6 7 8
0 1 0 0 0 0 0 0 0 0
1 69 9 10 5 2 0 0 0 0
2 376 30 90 113 46 9 0 0 0
3 234 31 131 151 68 26 6 2 1
4 109 10 74 271 243 81 5 0 0
5 19 0 26 225 454 258 62 2 0
6 1 0 0 18 134 183 101 15 0
7 0 0 0 0 6 49 53 21 7
8 0 0 0 0 0 0 0 0 0
Overall Statistics
Accuracy : 0.228
95% CI : (0.215, 0.241)
No Information Rate : 0.248
P-Value [Acc > NIR] : 0.999
Kappa : 0.09
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5 Class: 6 Class: 7 Class: 8
Sensitivity 0.001236 0.11250 0.2719 0.1928 0.2550 0.4257 0.4449 0.52500 0.00000
Specificity 1.000000 0.97711 0.8363 0.8366 0.8093 0.7561 0.9028 0.96971 1.00000
Pos Pred Value 1.000000 0.09474 0.1355 0.2323 0.3064 0.2467 0.2235 0.15441 NaN
Neg Pred Value 0.789364 0.98103 0.9240 0.8017 0.7668 0.8753 0.9628 0.99487 0.99792
Prevalence 0.210842 0.02085 0.0863 0.2041 0.2484 0.1579 0.0592 0.01042 0.00208
Detection Rate 0.000261 0.00235 0.0235 0.0394 0.0633 0.0672 0.0263 0.00547 0.00000
Detection Prevalence 0.000261 0.02476 0.1731 0.1694 0.2067 0.2726 0.1178 0.03544 0.00000
Balanced Accuracy 0.500618 0.54480 0.5541 0.5147 0.5321 0.5909 0.6739 0.74736 0.50000
mod5 <- zeroinfl(TARGET ~ STARS + LabelAppeal + AcidIndex + TotalSulfurDioxide + VolatileAcidity, data=train, dist="negbin")
summary(mod5)
Call:
zeroinfl(formula = TARGET ~ STARS + LabelAppeal + AcidIndex + TotalSulfurDioxide + VolatileAcidity,
data = train, dist = "negbin")
Pearson residuals:
Min 1Q Median 3Q Max
-2.289 -0.447 0.034 0.400 4.371
Count model coefficients (negbin with log link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) 4.76e-01 7.00e-02 6.80 1.1e-11 ***
STARS2 1.19e-01 1.80e-02 6.61 3.9e-11 ***
STARS3 2.20e-01 1.93e-02 11.38 < 2e-16 ***
STARS4 3.28e-01 2.67e-02 12.26 < 2e-16 ***
STARS0 -6.56e-02 2.53e-02 -2.60 0.0094 **
LabelAppeal-1 4.43e-01 5.07e-02 8.74 < 2e-16 ***
LabelAppeal0 7.35e-01 4.96e-02 14.81 < 2e-16 ***
LabelAppeal1 9.26e-01 5.04e-02 18.38 < 2e-16 ***
LabelAppeal2 1.08e+00 5.57e-02 19.37 < 2e-16 ***
AcidIndex2 -5.76e-02 6.16e-02 -0.94 0.3497
AcidIndex3 1.91e-02 4.90e-02 0.39 0.6973
AcidIndex4 4.88e-02 4.88e-02 1.00 0.3172
TotalSulfurDioxide -4.77e-05 3.71e-05 -1.28 0.1995
VolatileAcidity -8.65e-03 1.15e-02 -0.75 0.4506
Log(theta) 1.97e+01 NA NA NA
Zero-inflation model coefficients (binomial with logit link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.63e+00 4.80e-01 -3.38 0.00072 ***
STARS2 -3.89e+00 4.44e-01 -8.78 < 2e-16 ***
STARS3 -1.84e+01 4.68e+02 -0.04 0.96855
STARS4 -1.86e+01 8.95e+02 -0.02 0.98340
STARS0 2.04e+00 8.95e-02 22.79 < 2e-16 ***
LabelAppeal-1 1.44e+00 4.46e-01 3.22 0.00128 **
LabelAppeal0 2.26e+00 4.43e-01 5.10 3.5e-07 ***
LabelAppeal1 2.91e+00 4.49e-01 6.48 9.2e-11 ***
LabelAppeal2 3.48e+00 5.08e-01 6.84 7.9e-12 ***
AcidIndex2 -1.32e+00 2.44e-01 -5.39 7.0e-08 ***
AcidIndex3 -2.03e+00 1.94e-01 -10.49 < 2e-16 ***
AcidIndex4 -2.50e+00 1.96e-01 -12.75 < 2e-16 ***
TotalSulfurDioxide -1.22e-03 2.59e-04 -4.70 2.5e-06 ***
VolatileAcidity 2.12e-01 7.06e-02 3.01 0.00262 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Theta = 352438663.38
Number of iterations in BFGS optimization: 41
Log-likelihood: -1.43e+04 on 29 Df
pred <- predict(mod5, newdata=test, type='response')
predRound <- as.factor(round(pred,0))
testData <- as.factor(test$TARGET)
cm <- confusionMatrix(predRound, testData)
cm$overall[1]
Accuracy
0.346
Reference
Prediction 0 1 2 3 4 5 6 7 8
0 91 1 0 4 2 1 0 0 0
1 431 32 66 79 54 20 3 2 1
2 168 38 181 189 62 17 4 0 0
3 100 9 74 315 275 81 5 0 0
4 15 0 10 191 391 177 34 1 0
5 4 0 0 5 161 247 105 10 0
6 0 0 0 0 8 58 66 22 3
7 0 0 0 0 0 5 10 5 4
8 0 0 0 0 0 0 0 0 0
The selection process is fairly straightforward. Only the Zero-Inflated Negative Binomial model (ZINB) was able to meet or prediction ability criteria. Specifically all other models struggled or simply were unable to predict zero values. Given the business purpose of this analysis, this is a fundamental requirement.
The ZINB model also outperformed all other models in terms of confusion matrix accuracy, AIC, BIC, logLik and length of model name. Summary results are set forth below.
mod1_out <- cbind(AIC=AIC(mod1), BIC = BIC(mod1), loglik=logLik(mod1))
mod2_out <- cbind(AIC=AIC(mod2),BIC = BIC(mod2), loglik=logLik(mod2))
mod3_out <- cbind(AIC=AIC(mod3),BIC = BIC(mod3), loglik=logLik(mod3))
mod4_out <- cbind(AIC=AIC(mod4), BIC = BIC(mod4), loglik=logLik(mod4))
mod5_out <- cbind(AIC=AIC(mod5), BIC = BIC(mod5), loglik=logLik(mod5))
model_comp <- rbind(mod1_out, mod2_out,mod3_out,mod4_out,mod5_out)
rownames(model_comp) <- c("mod1_out","mod2_out","mod3_out","mod4_out","mod5_out")
c <-kable(model_comp) %>%
kable_styling()
c
AIC | BIC | loglik | |
---|---|---|---|
mod1_out | 31980 | 32072 | -15977 |
mod2_out | 30331 | 30431 | -15152 |
mod3_out | 31977 | 32070 | -15976 |
mod4_out | 30327 | 30426 | -15149 |
mod5_out | 28603 | 28809 | -14273 |
The final step of our analysis was to apply the same transformations and imputations to the evaluation data set that we applied to the training data set and predict target values using our selected model.
Key findings from our analysis are set forth below.