In this assignment, we are tasked with predicting the number of cases of wine sold to wine distribution companies following a sampling. The target variable, cases of wine sold, is count data and therefore will be modeled using appropriate techniques such as Poisson and Negative Binomial regressions.
All the variable in this dataset are numeric and continuous except for AcidIndex
, STARS
and LabelAppeal
which are discrete. The target variable TARGET
is also discrete. There are a number of missing observations for certain chemical compositiion variables as well as a large number of wines with no STARS
rating. The distribution of the continous variables appear well centered.
Name | data_train |
Number of rows | 12795 |
Number of columns | 16 |
_______________________ | |
Column type frequency: | |
numeric | 16 |
________________________ | |
Group variables | None |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
INDEX | 0 | 1.00 | 8069.98 | 4656.91 | 1.00 | 4037.50 | 8110.00 | 12106.50 | 16129.00 | ▇▇▇▇▇ |
TARGET | 0 | 1.00 | 3.03 | 1.93 | 0.00 | 2.00 | 3.00 | 4.00 | 8.00 | ▆▇▇▆▁ |
FixedAcidity | 0 | 1.00 | 7.08 | 6.32 | -18.10 | 5.20 | 6.90 | 9.50 | 34.40 | ▁▂▇▂▁ |
VolatileAcidity | 0 | 1.00 | 0.32 | 0.78 | -2.79 | 0.13 | 0.28 | 0.64 | 3.68 | ▁▂▇▂▁ |
CitricAcid | 0 | 1.00 | 0.31 | 0.86 | -3.24 | 0.03 | 0.31 | 0.58 | 3.86 | ▁▂▇▂▁ |
ResidualSugar | 616 | 0.95 | 5.42 | 33.75 | -127.80 | -2.00 | 3.90 | 15.90 | 141.15 | ▁▂▇▂▁ |
Chlorides | 638 | 0.95 | 0.05 | 0.32 | -1.17 | -0.03 | 0.05 | 0.15 | 1.35 | ▁▂▇▂▁ |
FreeSulfurDioxide | 647 | 0.95 | 30.85 | 148.71 | -555.00 | 0.00 | 30.00 | 70.00 | 623.00 | ▁▂▇▂▁ |
TotalSulfurDioxide | 682 | 0.95 | 120.71 | 231.91 | -823.00 | 27.00 | 123.00 | 208.00 | 1057.00 | ▁▂▇▂▁ |
Density | 0 | 1.00 | 0.99 | 0.03 | 0.89 | 0.99 | 0.99 | 1.00 | 1.10 | ▁▂▇▂▁ |
pH | 395 | 0.97 | 3.21 | 0.68 | 0.48 | 2.96 | 3.20 | 3.47 | 6.13 | ▁▂▇▂▁ |
Sulphates | 1210 | 0.91 | 0.53 | 0.93 | -3.13 | 0.28 | 0.50 | 0.86 | 4.24 | ▁▂▇▂▁ |
Alcohol | 653 | 0.95 | 10.49 | 3.73 | -4.70 | 9.00 | 10.40 | 12.40 | 26.50 | ▁▂▇▂▁ |
LabelAppeal | 0 | 1.00 | -0.01 | 0.89 | -2.00 | -1.00 | 0.00 | 1.00 | 2.00 | ▁▅▇▅▁ |
AcidIndex | 0 | 1.00 | 7.77 | 1.32 | 4.00 | 7.00 | 8.00 | 8.00 | 17.00 | ▁▇▁▁▁ |
STARS | 3359 | 0.74 | 2.04 | 0.90 | 1.00 | 1.00 | 2.00 | 3.00 | 4.00 | ▇▇▁▅▂ |
Name | data_test |
Number of rows | 3335 |
Number of columns | 16 |
_______________________ | |
Column type frequency: | |
logical | 1 |
numeric | 15 |
________________________ | |
Group variables | None |
Variable type: logical
skim_variable | n_missing | complete_rate | mean | count |
---|---|---|---|---|
TARGET | 3335 | 0 | NaN | : |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
IN | 0 | 1.00 | 8048.31 | 4655.48 | 3.00 | 4018.50 | 7906.00 | 12061.00 | 16130.00 | ▇▇▇▇▇ |
FixedAcidity | 0 | 1.00 | 6.86 | 6.32 | -18.20 | 5.20 | 6.90 | 9.00 | 33.50 | ▁▂▇▂▁ |
VolatileAcidity | 0 | 1.00 | 0.31 | 0.81 | -2.83 | 0.08 | 0.28 | 0.63 | 3.61 | ▁▂▇▂▁ |
CitricAcid | 0 | 1.00 | 0.31 | 0.87 | -3.12 | 0.00 | 0.31 | 0.60 | 3.76 | ▁▂▇▂▁ |
ResidualSugar | 168 | 0.95 | 5.32 | 34.37 | -128.30 | -2.60 | 3.60 | 17.20 | 145.40 | ▁▂▇▂▁ |
Chlorides | 138 | 0.96 | 0.06 | 0.31 | -1.15 | 0.02 | 0.05 | 0.17 | 1.26 | ▁▂▇▂▁ |
FreeSulfurDioxide | 152 | 0.95 | 34.95 | 149.63 | -563.00 | 3.00 | 30.00 | 79.25 | 617.00 | ▁▂▇▂▁ |
TotalSulfurDioxide | 157 | 0.95 | 123.41 | 225.80 | -769.00 | 27.25 | 124.00 | 210.00 | 1004.00 | ▁▂▇▂▁ |
Density | 0 | 1.00 | 0.99 | 0.03 | 0.89 | 0.99 | 0.99 | 1.00 | 1.10 | ▁▂▇▂▁ |
pH | 104 | 0.97 | 3.24 | 0.68 | 0.60 | 2.98 | 3.21 | 3.49 | 6.21 | ▁▂▇▂▁ |
Sulphates | 310 | 0.91 | 0.53 | 0.91 | -3.07 | 0.33 | 0.50 | 0.82 | 4.18 | ▁▂▇▂▁ |
Alcohol | 185 | 0.94 | 10.58 | 3.76 | -4.20 | 9.00 | 10.40 | 12.50 | 25.60 | ▁▂▇▂▁ |
LabelAppeal | 0 | 1.00 | 0.01 | 0.89 | -2.00 | -1.00 | 0.00 | 1.00 | 2.00 | ▁▅▇▅▁ |
AcidIndex | 0 | 1.00 | 7.75 | 1.32 | 5.00 | 7.00 | 8.00 | 8.00 | 17.00 | ▇▇▁▁▁ |
STARS | 841 | 0.75 | 2.04 | 0.91 | 1.00 | 1.00 | 2.00 | 3.00 | 4.00 | ▇▇▁▅▂ |
The visualizations below display the distributions of the dataset. All continuous variables are centered and close to normally distributed. We note that some variables are centered around zero and take negative values which is unexpected and will be investigated further and transformed for our analysis. The distribution of the TARGET
variable looks like it could be well described by the poisson distribution which has equal mean and variance but the high number of zero values justifies the use of a zero-inflated model as well. The mean and variance of TARGET
are 3.02 and 3.71 respectively, which is close enough to satisfy the equal mean-variance assumption of the poisson distribution.
# histogram
data_train %>%
dplyr::select(-c("AcidIndex", "STARS", "TARGET", "LabelAppeal")) %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~key, scale = "free", ncol = 3) +
geom_histogram(binwidth = function(x) 2 * IQR(x) / (length(x)^(1/3)), fill="pink") +
theme_minimal()
In the box plot below, we plot several variables in to two panels. TotalSulfurDioxide, FreeSulfurDioxide, and ResidualSugar variables have large ranges compared to other variables. Therefore, we separated those variables in to a different panel to view their distribution. From both panel, we can tell a high number of variables have numerous outliers.
library(ggpubr)
# boxplot
p1 <- data_train %>%
dplyr::select(-c("TotalSulfurDioxide", "FreeSulfurDioxide", "ResidualSugar")) %>%
gather(na.rm = TRUE) %>%
ggplot(aes(factor(key), value)) +
geom_boxplot(outlier.colour = "#e281cf", outlier.shape = 1, color = "#5aa1ed") +
coord_flip() +
labs(title = "Boxplot of Chemical Properties of Wine", x = "Chemical Properties", y = "Values") +
theme_minimal()
p2 <- data_train %>%
dplyr::select(c("TotalSulfurDioxide", "FreeSulfurDioxide", "ResidualSugar")) %>%
gather(na.rm = TRUE) %>%
ggplot(aes(factor(key), value)) +
geom_boxplot(outlier.colour = "#e281cf", outlier.shape = 1, color = "#5aa1ed") +
#labs(title = "Boxplot of Chemical Properties of Wine", x = "Chemical Properties", y = "Values") +
theme_minimal()
ggarrange(p1, p2)
In the bar char below, AcidIndex
tells us large quantity of wine were sold with the index number 7
and 8
. LabelAppeal
tells us generic labeled wine sells the most; however, better label does yield higher number of wine samples per order. Lastly, STARS
tells us excellent quality does not result in high wine orders. It could be due to high star wine bottle’s high price tag.
# barchart
p3 <- data_train %>%
dplyr::select(TARGET, STARS) %>%
mutate(STARS = as.factor(STARS),
TARGET = as.factor(TARGET)) %>%
ggplot(aes(STARS)) +
geom_bar(aes(fill = TARGET)) +
theme_minimal()
p4 <- data_train %>%
dplyr::select(TARGET, LabelAppeal) %>%
mutate(STARS = as.factor(LabelAppeal),
TARGET = as.factor(TARGET)) %>%
ggplot(aes(LabelAppeal)) +
geom_bar(aes(fill = TARGET)) +
theme_minimal()
p5 <- data_train %>%
dplyr::select(TARGET, AcidIndex) %>%
mutate(STARS = as.factor(AcidIndex),
TARGET = as.factor(TARGET)) %>%
ggplot(aes(AcidIndex)) +
geom_bar(aes(fill = TARGET)) +
theme_minimal()
ggarrange(p5, ggarrange(p3, p4, ncol = 2, nrow = 1, legend = "none"), nrow = 2, common.legend = TRUE)
# top correlation
wine_train_corr <- data_train %>%
drop_na() %>%
cor()
kable(sort(wine_train_corr[,1], decreasing = T), col.names = c("Correlation")) %>%
kable_styling(full_width = F)
Correlation | |
---|---|
TARGET | 1.0000000 |
STARS | 0.5546857 |
LabelAppeal | 0.4979465 |
Alcohol | 0.0737771 |
FreeSulfurDioxide | 0.0226398 |
TotalSulfurDioxide | 0.0216021 |
ResidualSugar | 0.0035196 |
CitricAcid | 0.0023450 |
pH | 0.0002199 |
FixedAcidity | -0.0125381 |
Sulphates | -0.0212204 |
Chlorides | -0.0304301 |
Density | -0.0475989 |
VolatileAcidity | -0.0759979 |
AcidIndex | -0.1676431 |
In the correlation table and plot below, we see, STARS
and LabelAppeal
are most positively correlated variables with the response variable. We expected this because our variable description mentions these variable’s theoretical affect are higher than other variables. Also, we some mild negative correlation between the response variable and AcidIndex
variable.
library(corrplot)
library(RColorBrewer)
# correlation plot
corrplot(wine_train_corr,
method = "number",
type = "lower",
col = brewer.pal(n = 15, name = "Reds"),
number.cex = .7, tl.cex = .7,
tl.col = "black", tl.srt = 45)
Using the aggr
function from VIM package
, We see several variables have missing values. According to UCI Machine Learning, who published this dataset, all wine contain some natural sulfites. Therefore, we will impute the missing values for sulfite chemical properties. Also, it’s rare to find wines with less than 1 gram/liter of sugar. We will impute this as well. Matter of fact, since all missing values are missing at random, we will impute all the missing values using the mice package and random forest method. Mice uses multivariate imputations to estimate the missing values. Using multiple imputations helps in resolving the uncertainty for the missingness. Our target variable will be removed as a predictor variable but still will be imputed. Our response variables will be removed as predictor variables but still will be imputed.
library(VIM)
# missing value columns
aggr(data_train,
sortVars=TRUE,
labels=names(data_train),
cex.axis=.5,
bars = FALSE,
col = c("white", "#E46726"),
combined = TRUE,
#border = NA,
ylab = "Missing Values")
Variables sorted by number of missings:
Variable Count
STARS 3359
Sulphates 1210
TotalSulfurDioxide 682
Alcohol 653
FreeSulfurDioxide 647
Chlorides 638
ResidualSugar 616
pH 395
TARGET 0
FixedAcidity 0
VolatileAcidity 0
CitricAcid 0
Density 0
LabelAppeal 0
AcidIndex 0
iter imp variable
1 1 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
1 2 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
1 3 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
1 4 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
1 5 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
2 1 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
2 2 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
2 3 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
2 4 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
2 5 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
3 1 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
3 2 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
3 3 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
3 4 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
3 5 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
4 1 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
4 2 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
4 3 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
4 4 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
4 5 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
5 1 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
5 2 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
5 3 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
5 4 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
5 5 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
meth <- init$method
predM <- init$predictorMatrix
predM[, c("TARGET")] <- 0 #this code will remove the variable as a predictor but still will be imputed
data_train_impute <- mice(data_train, method = 'rf', predictorMatrix=predM)
iter imp variable
1 1 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
1 2 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
1 3 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
1 4 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
1 5 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
2 1 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
2 2 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
2 3 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
2 4 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
2 5 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
3 1 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
3 2 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
3 3 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
3 4 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
3 5 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
4 1 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
4 2 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
4 3 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
4 4 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
4 5 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
5 1 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
5 2 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
5 3 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
5 4 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
5 5 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
data_train_imputed <- complete(data_train_impute)
print(paste0("Missing value after imputation: ", sum(is.na(data_train_imputed))))
[1] "Missing value after imputation: 0"
# poisson model with the missing values
model1 <- glm(TARGET ~ ., family = poisson, data_train)
summary(model1)
Call:
glm(formula = TARGET ~ ., family = poisson, data = data_train)
Deviance Residuals:
Min 1Q Median 3Q Max
-3.2158 -0.2734 0.0616 0.3732 1.6830
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.593e+00 2.506e-01 6.359 2.03e-10 ***
FixedAcidity 3.293e-04 1.053e-03 0.313 0.75447
VolatileAcidity -2.560e-02 8.353e-03 -3.065 0.00218 **
CitricAcid -7.259e-04 7.575e-03 -0.096 0.92365
ResidualSugar -6.141e-05 1.941e-04 -0.316 0.75165
Chlorides -3.007e-02 2.056e-02 -1.463 0.14346
FreeSulfurDioxide 6.734e-05 4.404e-05 1.529 0.12620
TotalSulfurDioxide 2.081e-05 2.855e-05 0.729 0.46618
Density -3.725e-01 2.462e-01 -1.513 0.13026
pH -4.661e-03 9.598e-03 -0.486 0.62722
Sulphates -5.164e-03 7.051e-03 -0.732 0.46398
Alcohol 3.948e-03 1.771e-03 2.229 0.02579 *
LabelAppeal 1.771e-01 7.954e-03 22.271 < 2e-16 ***
AcidIndex -4.870e-02 5.903e-03 -8.251 < 2e-16 ***
STARS 1.871e-01 7.487e-03 24.993 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 5844.1 on 6435 degrees of freedom
Residual deviance: 4009.1 on 6421 degrees of freedom
(6359 observations deleted due to missingness)
AIC: 23172
Number of Fisher Scoring iterations: 5
[1] "Goodness of Fit Test:"
with(model1, cbind(res.deviance = deviance, df = df.residual, p = pchisq(deviance, df.residual, lower.tail=FALSE)))
res.deviance df p
[1,] 4009.142 6421 1
The deviance residuals is quite symmetrical. This means that the predicted points are close to actual observed points. As expected and seen in our correlation table, STARS
, LabelAppeal
and AcidIndex
are significant variables. And the variation in standard error is low. The goodness of fit test has a high p value which indicates that the model fits the data well.
# poisson model with the imputed values
model2 <- glm(TARGET ~ ., family = poisson, data_train_imputed)
summary(model2)
Call:
glm(formula = TARGET ~ ., family = poisson, data = data_train_imputed)
Deviance Residuals:
Min 1Q Median 3Q Max
-4.2375 -0.5140 0.2139 0.6265 2.5573
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.020e+00 1.955e-01 10.331 < 2e-16 ***
FixedAcidity -3.926e-04 8.194e-04 -0.479 0.631854
VolatileAcidity -4.870e-02 6.486e-03 -7.507 6.03e-14 ***
CitricAcid 1.263e-02 5.891e-03 2.144 0.032025 *
ResidualSugar 1.644e-04 1.515e-04 1.085 0.277860
Chlorides -5.719e-02 1.615e-02 -3.542 0.000397 ***
FreeSulfurDioxide 1.492e-04 3.448e-05 4.326 1.52e-05 ***
TotalSulfurDioxide 1.017e-04 2.221e-05 4.578 4.68e-06 ***
Density -3.779e-01 1.919e-01 -1.969 0.048955 *
pH -2.450e-02 7.565e-03 -3.239 0.001200 **
Sulphates -1.833e-02 5.542e-03 -3.308 0.000941 ***
Alcohol 5.327e-03 1.385e-03 3.847 0.000120 ***
LabelAppeal 1.951e-01 6.039e-03 32.309 < 2e-16 ***
AcidIndex -1.216e-01 4.455e-03 -27.307 < 2e-16 ***
STARS 1.946e-01 5.799e-03 33.566 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 22861 on 12794 degrees of freedom
Residual deviance: 18447 on 12780 degrees of freedom
AIC: 50419
Number of Fisher Scoring iterations: 5
[1] "Goodness of Fit Test:"
with(model2, cbind(res.deviance = deviance, df = df.residual, p = pchisq(deviance, df.residual, lower.tail=FALSE)))
res.deviance df p
[1,] 18446.92 12780 1.038084e-214
Again the deviance residuals are quite the same. However, with imputation, there are more significant variables introduced in the model. Additionally, the AIC score improved dramatically from 23172 to 50384. Also, the deviance residuals decreased significantly to 18,412 from 40,000. However, the goodness of fit test has a very small p value, suggesting that this model does not fit the data well. Since the residual deviance is greater than the degrees of freedom, then some over-dispersion exists.
Given what we observed when looking at the distribution of TARGET
, we should expext the inflated count of zeros to affect the model and bias results. For this reason, we move to a zero-inflated model to reflect this.
We try a quasipoisson model to account for nay overdispersion and to see if the results change significantly. As seen in the summary below, the models are nearly identical.
# poisson model with the imputed values
model3 <- glm(TARGET ~ ., family = quasipoisson(link='log'), data_train_imputed)
summary(model3)
Call:
glm(formula = TARGET ~ ., family = quasipoisson(link = "log"),
data = data_train_imputed)
Deviance Residuals:
Min 1Q Median 3Q Max
-4.2375 -0.5140 0.2139 0.6265 2.5573
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.020e+00 1.914e-01 10.556 < 2e-16 ***
FixedAcidity -3.926e-04 8.019e-04 -0.490 0.624443
VolatileAcidity -4.870e-02 6.348e-03 -7.671 1.83e-14 ***
CitricAcid 1.263e-02 5.765e-03 2.191 0.028478 *
ResidualSugar 1.644e-04 1.483e-04 1.109 0.267529
Chlorides -5.719e-02 1.580e-02 -3.619 0.000296 ***
FreeSulfurDioxide 1.492e-04 3.374e-05 4.421 9.91e-06 ***
TotalSulfurDioxide 1.017e-04 2.173e-05 4.678 2.92e-06 ***
Density -3.779e-01 1.878e-01 -2.012 0.044246 *
pH -2.450e-02 7.404e-03 -3.310 0.000937 ***
Sulphates -1.833e-02 5.424e-03 -3.380 0.000727 ***
Alcohol 5.327e-03 1.355e-03 3.931 8.51e-05 ***
LabelAppeal 1.951e-01 5.910e-03 33.014 < 2e-16 ***
AcidIndex -1.216e-01 4.360e-03 -27.903 < 2e-16 ***
STARS 1.946e-01 5.675e-03 34.298 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for quasipoisson family taken to be 0.9577432)
Null deviance: 22861 on 12794 degrees of freedom
Residual deviance: 18447 on 12780 degrees of freedom
AIC: NA
Number of Fisher Scoring iterations: 5
[1] "Goodness of Fit Test:"
with(model3, cbind(res.deviance = deviance, df = df.residual, p = pchisq(deviance, df.residual, lower.tail=FALSE)))
res.deviance df p
[1,] 18446.92 12780 1.038084e-214
We saw earlier that the dependent variable had an excess number of zeros which skewed the distribution from a typical poisson. The zero inflated model generates coefficients for the zero count part of the model as well as for the count part.
Call:
zeroinfl(formula = TARGET ~ ., data = data_train_imputed, dist = "poisson")
Pearson residuals:
Min 1Q Median 3Q Max
-2.5523 -0.3699 0.1592 0.4997 4.1466
Count model coefficients (poisson with log link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.416e+00 2.068e-01 6.847 7.56e-12 ***
FixedAcidity 3.457e-04 8.573e-04 0.403 0.686765
VolatileAcidity -1.292e-02 6.876e-03 -1.878 0.060345 .
CitricAcid -7.268e-04 6.141e-03 -0.118 0.905787
ResidualSugar -1.076e-04 1.589e-04 -0.677 0.498245
Chlorides -1.685e-02 1.700e-02 -0.991 0.321551
FreeSulfurDioxide 3.159e-05 3.543e-05 0.892 0.372519
TotalSulfurDioxide -2.811e-05 2.257e-05 -1.246 0.212943
Density -2.906e-01 2.027e-01 -1.434 0.151694
pH 6.640e-03 7.965e-03 0.834 0.404528
Sulphates 9.380e-04 5.863e-03 0.160 0.872902
Alcohol 7.578e-03 1.444e-03 5.249 1.53e-07 ***
LabelAppeal 2.509e-01 6.423e-03 39.065 < 2e-16 ***
AcidIndex -1.695e-02 4.991e-03 -3.396 0.000683 ***
STARS 9.212e-02 6.291e-03 14.644 < 2e-16 ***
Zero-inflation model coefficients (binomial with logit link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) -5.6217507 1.0555623 -5.326 1.00e-07 ***
FixedAcidity 0.0034100 0.0043691 0.780 0.4351
VolatileAcidity 0.2313922 0.0350883 6.595 4.26e-11 ***
CitricAcid -0.0802892 0.0318198 -2.523 0.0116 *
ResidualSugar -0.0014474 0.0008129 -1.780 0.0750 .
Chlorides 0.2487335 0.0866513 2.871 0.0041 **
FreeSulfurDioxide -0.0007772 0.0001843 -4.216 2.49e-05 ***
TotalSulfurDioxide -0.0008167 0.0001184 -6.899 5.25e-12 ***
Density 0.7365822 1.0363870 0.711 0.4773
pH 0.1942998 0.0406918 4.775 1.80e-06 ***
Sulphates 0.1219135 0.0299885 4.065 4.80e-05 ***
Alcohol 0.0114074 0.0074290 1.536 0.1247
LabelAppeal 0.3612193 0.0338672 10.666 < 2e-16 ***
AcidIndex 0.4786918 0.0199582 23.985 < 2e-16 ***
STARS -0.6768722 0.0371960 -18.197 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Number of iterations in BFGS optimization: 35
Log-likelihood: -2.248e+04 on 30 Df
model4b <- zeroinfl(TARGET ~ . - FixedAcidity - Density, data_train_imputed, dist = 'poisson')
summary(model4b)
Call:
zeroinfl(formula = TARGET ~ . - FixedAcidity - Density, data = data_train_imputed,
dist = "poisson")
Pearson residuals:
Min 1Q Median 3Q Max
-2.5516 -0.3710 0.1591 0.4985 3.9624
Count model coefficients (poisson with log link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.130e+00 5.243e-02 21.543 < 2e-16 ***
VolatileAcidity -1.297e-02 6.876e-03 -1.887 0.059225 .
CitricAcid -6.628e-04 6.140e-03 -0.108 0.914042
ResidualSugar -1.083e-04 1.589e-04 -0.682 0.495384
Chlorides -1.743e-02 1.700e-02 -1.025 0.305133
FreeSulfurDioxide 3.086e-05 3.541e-05 0.871 0.383511
TotalSulfurDioxide -2.894e-05 2.256e-05 -1.283 0.199660
pH 6.653e-03 7.966e-03 0.835 0.403656
Sulphates 1.048e-03 5.862e-03 0.179 0.858106
Alcohol 7.595e-03 1.444e-03 5.261 1.44e-07 ***
LabelAppeal 2.508e-01 6.422e-03 39.059 < 2e-16 ***
AcidIndex -1.702e-02 4.944e-03 -3.442 0.000578 ***
STARS 9.233e-02 6.290e-03 14.680 < 2e-16 ***
Zero-inflation model coefficients (binomial with logit link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.8891475 0.2417675 -20.223 < 2e-16 ***
VolatileAcidity 0.2321447 0.0350717 6.619 3.61e-11 ***
CitricAcid -0.0806133 0.0317915 -2.536 0.0112 *
ResidualSugar -0.0014497 0.0008128 -1.784 0.0745 .
Chlorides 0.2492837 0.0866164 2.878 0.0040 **
FreeSulfurDioxide -0.0007745 0.0001843 -4.202 2.65e-05 ***
TotalSulfurDioxide -0.0008167 0.0001184 -6.900 5.20e-12 ***
pH 0.1942247 0.0406838 4.774 1.81e-06 ***
Sulphates 0.1223315 0.0299698 4.082 4.47e-05 ***
Alcohol 0.0113516 0.0074270 1.528 0.1264
LabelAppeal 0.3601954 0.0338376 10.645 < 2e-16 ***
AcidIndex 0.4818269 0.0196286 24.547 < 2e-16 ***
STARS -0.6765979 0.0371710 -18.202 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Number of iterations in BFGS optimization: 31
Log-likelihood: -2.248e+04 on 26 Df
Based on the models tested, we select the Zero Inflated model due to the highest accuracy of the 4 models
pred_train <- data.frame(TARGET=data_train_imputed$TARGET,
model2=model2$fitted.values,
model3=model3$fitted.values,
model4b=model4b$fitted.values)
pred_train <- round(pred_train, 0)
colnames(pred_train) <- c("TARGET","Poisson (Imputed)" ,"Quasipoisson Model", "Zero Inflated")
pred_train %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~key, scale = "free", ncol = 4) +
geom_bar(fill="pink") +
theme_minimal() + labs(x="Cases Bought", y = "Count", title = "Prediction Histogram")
model2_fitted.values <- factor(round(model2$fitted.values),levels=rev(0:9))
model3_fitted.values <- factor(round(model3$fitted.values),levels=rev(0:9))
model4b_fitted.values <- factor(round(model4b$fitted.values),levels=rev(0:9))
m2_cfm <- confusionMatrix(model2_fitted.values, factor(data_train_imputed$TARGET,levels=rev(0:9)))
m3_cfm <- confusionMatrix(model3_fitted.values, factor(data_train_imputed$TARGET,levels=rev(0:9)))
m4_cfm <- confusionMatrix(model4b_fitted.values, factor(data_train_imputed$TARGET,levels=rev(0:9)))
models_sum <- data.frame(m2_cfm$overall, m3_cfm$overall, m4_cfm$overall)
colnames(models_sum) <- c("Poisson (Imputed)" ,"Quasipoisson Model", "Zero Inflated")
round(models_sum, 2)
iter imp variable
1 1 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
1 2 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
1 3 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
1 4 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
1 5 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
2 1 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
2 2 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
2 3 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
2 4 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
2 5 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
3 1 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
3 2 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
3 3 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
3 4 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
3 5 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
4 1 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
4 2 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
4 3 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
4 4 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
4 5 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
5 1 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
5 2 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
5 3 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
5 4 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
5 5 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
meth <- init$method
predM <- init$predictorMatrix
predM[, c("TARGET")] <- 0 #this code will remove the variable as a predictor but still will be imputed
data_test_impute <- mice(data_test, method = 'rf', predictorMatrix=predM)
iter imp variable
1 1 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
1 2 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
1 3 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
1 4 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
1 5 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
2 1 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
2 2 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
2 3 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
2 4 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
2 5 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
3 1 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
3 2 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
3 3 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
3 4 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
3 5 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
4 1 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
4 2 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
4 3 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
4 4 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
4 5 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
5 1 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
5 2 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
5 3 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
5 4 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
5 5 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
data_test_imputed <- complete(data_test_impute)
print(paste0("Missing value after imputation: ", sum(is.na(data_test_imputed))))
[1] "Missing value after imputation: 3335"
test_predict <- predict(model4b, newdata=data_test_imputed)
test_predict <- round(test_predict,0)
data_pred <- data.frame(TARGET=test_predict)
ggplot(data_pred, aes(x=TARGET)) + geom_bar(fill="orchid3") + theme_minimal() +
labs(y="Count", title = "Prediction: Zero Inflated Model (Model4b)") +
scale_x_discrete(name = "Cases Bought", limits=c("1","2","3","4", "5", "6", "7", "8"))