Overview
In this homework assignment, you will explore, analyze and model a data set containing information on approximately 12,000 commercially available wines. The variables are mostly related to the chemical properties of the wine being sold. The response variable is the number of sample cases of wine that were purchased by wine distribution companies after sampling a wine. These cases would be used to provide tasting samples to restaurants and wine stores around the United States. The more sample cases purchased, the more likely is a wine to be sold at a high end restaurant. A large wine manufacturer is studying the data in order to predict the number of wine cases ordered based upon the wine characteristics. If the wine manufacturer can predict the number of cases, then that manufacturer will be able to adjust their wine offering to maximize sales.
Your objective is to build a count regression model to predict the number of cases of wine that will be sold given certain properties of the wine. HINT: Sometimes, the fact that a variable is missing is actually predictive of the target. You can only use the variables given to you (or variables that you derive from the variables provided). Below is a short description of the variables of interest in the data set:
image of data description
Count Regression Model
Count Regression model is a regression model that relates a non-negative integer value (0, 1, 2, 3, etc.) field of interest (a target variable) to one or more fields that are expected to have an influence on the target variable, and are often called predictor variables.
Objective
Our objective is to build poisson regression models, negative binomial regression models and multiple linear regression models on the training data and predict the number of cases of wine that will be sold given certain properties of the wine. Given 2 datasets (i.e training and evaluation), Using the training data set, will evaluate the performance of the count regression model and make predictions using the evaluation data set.
Library load
Data Load
Train dataset consists of 12795 observations (rows) and 15 features (variables).
wine_train <-
read.csv("https://raw.githubusercontent.com/mharrisonbaker/DATA621_GroupWork2/main/HW5/wine-training-data.csv",
header = TRUE) %>% dplyr::select(-ï..INDEX)
dim(wine_train)
## [1] 12795 15
kable(wine_train[1:10,]) %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),latex_options="scale_down")
TARGET | FixedAcidity | VolatileAcidity | CitricAcid | ResidualSugar | Chlorides | FreeSulfurDioxide | TotalSulfurDioxide | Density | pH | Sulphates | Alcohol | LabelAppeal | AcidIndex | STARS |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
3 | 3.2 | 1.160 | -0.98 | 54.20 | -0.567 | NA | 268 | 0.99280 | 3.33 | -0.59 | 9.9 | 0 | 8 | 2 |
3 | 4.5 | 0.160 | -0.81 | 26.10 | -0.425 | 15 | -327 | 1.02792 | 3.38 | 0.70 | NA | -1 | 7 | 3 |
5 | 7.1 | 2.640 | -0.88 | 14.80 | 0.037 | 214 | 142 | 0.99518 | 3.12 | 0.48 | 22.0 | -1 | 8 | 3 |
3 | 5.7 | 0.385 | 0.04 | 18.80 | -0.425 | 22 | 115 | 0.99640 | 2.24 | 1.83 | 6.2 | -1 | 6 | 1 |
4 | 8.0 | 0.330 | -1.26 | 9.40 | NA | -167 | 108 | 0.99457 | 3.12 | 1.77 | 13.7 | 0 | 9 | 2 |
0 | 11.3 | 0.320 | 0.59 | 2.20 | 0.556 | -37 | 15 | 0.99940 | 3.20 | 1.29 | 15.4 | 0 | 11 | NA |
0 | 7.7 | 0.290 | -0.40 | 21.50 | 0.060 | 287 | 156 | 0.99572 | 3.49 | 1.21 | 10.3 | 0 | 8 | NA |
4 | 6.5 | -1.220 | 0.34 | 1.40 | 0.040 | 523 | 551 | 1.03236 | 3.20 | NA | 11.6 | 1 | 7 | 3 |
3 | 14.8 | 0.270 | 1.05 | 11.25 | -0.007 | -213 | NA | 0.99620 | 4.93 | 0.26 | 15.0 | 0 | 6 | NA |
6 | 5.5 | -0.220 | 0.39 | 1.80 | -0.277 | 62 | 180 | 0.94724 | 3.09 | 0.75 | 12.6 | 0 | 8 | 4 |
Data Exploration
Summary of data shows some variables having NAs, we will see the disribution of these using boxplot to identify outliers.
## TARGET FixedAcidity VolatileAcidity CitricAcid
## Min. :0.000 Min. :-18.100 Min. :-2.7900 Min. :-3.2400
## 1st Qu.:2.000 1st Qu.: 5.200 1st Qu.: 0.1300 1st Qu.: 0.0300
## Median :3.000 Median : 6.900 Median : 0.2800 Median : 0.3100
## Mean :3.029 Mean : 7.076 Mean : 0.3241 Mean : 0.3084
## 3rd Qu.:4.000 3rd Qu.: 9.500 3rd Qu.: 0.6400 3rd Qu.: 0.5800
## Max. :8.000 Max. : 34.400 Max. : 3.6800 Max. : 3.8600
##
## ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide
## Min. :-127.800 Min. :-1.1710 Min. :-555.00 Min. :-823.0
## 1st Qu.: -2.000 1st Qu.:-0.0310 1st Qu.: 0.00 1st Qu.: 27.0
## Median : 3.900 Median : 0.0460 Median : 30.00 Median : 123.0
## Mean : 5.419 Mean : 0.0548 Mean : 30.85 Mean : 120.7
## 3rd Qu.: 15.900 3rd Qu.: 0.1530 3rd Qu.: 70.00 3rd Qu.: 208.0
## Max. : 141.150 Max. : 1.3510 Max. : 623.00 Max. :1057.0
## NA's :616 NA's :638 NA's :647 NA's :682
## Density pH Sulphates Alcohol
## Min. :0.8881 Min. :0.480 Min. :-3.1300 Min. :-4.70
## 1st Qu.:0.9877 1st Qu.:2.960 1st Qu.: 0.2800 1st Qu.: 9.00
## Median :0.9945 Median :3.200 Median : 0.5000 Median :10.40
## Mean :0.9942 Mean :3.208 Mean : 0.5271 Mean :10.49
## 3rd Qu.:1.0005 3rd Qu.:3.470 3rd Qu.: 0.8600 3rd Qu.:12.40
## Max. :1.0992 Max. :6.130 Max. : 4.2400 Max. :26.50
## NA's :395 NA's :1210 NA's :653
## LabelAppeal AcidIndex STARS
## Min. :-2.000000 Min. : 4.000 Min. :1.000
## 1st Qu.:-1.000000 1st Qu.: 7.000 1st Qu.:1.000
## Median : 0.000000 Median : 8.000 Median :2.000
## Mean :-0.009066 Mean : 7.773 Mean :2.042
## 3rd Qu.: 1.000000 3rd Qu.: 8.000 3rd Qu.:3.000
## Max. : 2.000000 Max. :17.000 Max. :4.000
## NA's :3359
Data distributions
All variables are symetrically unimodal distributed except TARGET.
Correlation of variables
STARTS and LabelAppeal seems to have a positive correlation with the number of cases sold.
kable(cor(drop_na(wine_train))[,14], "html", escape = F) %>%
kable_styling("striped", full_width = F) %>%
column_spec(1, bold = T) %>%
scroll_box(height = "500px")
x | |
---|---|
TARGET | -0.1676431 |
FixedAcidity | 0.1541678 |
VolatileAcidity | 0.0250530 |
CitricAcid | 0.0545838 |
ResidualSugar | -0.0203019 |
Chlorides | -0.0017134 |
FreeSulfurDioxide | -0.0147337 |
TotalSulfurDioxide | -0.0221293 |
Density | 0.0477788 |
pH | -0.0537129 |
Sulphates | 0.0310718 |
Alcohol | -0.0558919 |
LabelAppeal | 0.0103010 |
AcidIndex | 1.0000000 |
STARS | -0.0954826 |
library(corrgram)
corrgram(drop_na(wine_train), order=TRUE,
upper.panel=panel.cor, main="correlation")
Data Preparation
ResidualSugar, Chlorides, FreeSulfurDioxide, TotalSulfurDioxide, pH, Sulphates, Alcohol, and STARS have NAs. Replace NAs with median value.
wine_train$ResidualSugar[is.na(wine_train$ResidualSugar)] <- median(wine_train$ResidualSugar, na.rm=TRUE)
wine_train$Chlorides[is.na(wine_train$Chlorides)] <- median(wine_train$Chlorides, na.rm=TRUE)
wine_train$FreeSulfurDioxide[is.na(wine_train$FreeSulfurDioxide)] <- median(wine_train$FreeSulfurDioxide, na.rm=TRUE)
wine_train$TotalSulfurDioxide[is.na(wine_train$TotalSulfurDioxide)] <- median(wine_train$TotalSulfurDioxide, na.rm=TRUE)
wine_train$pH[is.na(wine_train$pH)] <- median(wine_train$pH, na.rm=TRUE)
wine_train$Sulphates[is.na(wine_train$Sulphates)] <- median(wine_train$Sulphates, na.rm=TRUE)
wine_train$Alcohol[is.na(wine_train$Alcohol)] <- median(wine_train$Alcohol, na.rm=TRUE)
wine_train$STARS[is.na(wine_train$STARS)] <- median(wine_train$STARS, na.rm=TRUE)
# to check NAS exist or not
visdat::vis_miss(wine_train)
Build Models
We will build poisson regression models, negative binomial regression models, multiple linear regression models, and zero-inflated poisson model using all variables and backward approach.
poisson regression model1: all variables
##
## Call:
## glm(formula = TARGET ~ ., family = poisson, data = wine_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.4819 -0.5261 0.2041 0.6365 2.5504
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.017e+00 1.957e-01 10.305 < 2e-16 ***
## FixedAcidity -4.515e-04 8.195e-04 -0.551 0.581697
## VolatileAcidity -5.045e-02 6.492e-03 -7.771 7.82e-15 ***
## CitricAcid 1.331e-02 5.892e-03 2.258 0.023925 *
## ResidualSugar 1.442e-04 1.545e-04 0.934 0.350499
## Chlorides -6.005e-02 1.645e-02 -3.650 0.000262 ***
## FreeSulfurDioxide 1.424e-04 3.513e-05 4.054 5.04e-05 ***
## TotalSulfurDioxide 1.065e-04 2.268e-05 4.695 2.67e-06 ***
## Density -4.315e-01 1.921e-01 -2.247 0.024651 *
## pH -2.389e-02 7.639e-03 -3.128 0.001762 **
## Sulphates -1.877e-02 5.739e-03 -3.271 0.001073 **
## Alcohol 5.368e-03 1.410e-03 3.807 0.000141 ***
## LabelAppeal 1.965e-01 6.021e-03 32.634 < 2e-16 ***
## AcidIndex -1.222e-01 4.463e-03 -27.381 < 2e-16 ***
## STARS 2.198e-01 6.468e-03 33.986 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 22861 on 12794 degrees of freedom
## Residual deviance: 18419 on 12780 degrees of freedom
## AIC: 50391
##
## Number of Fisher Scoring iterations: 5
- Residual vs fitted plot shows red line is fairly flat so linearty assumption is met.
- QQ-plot shows some observations are fall on the line with minimal deviatation
- Scale-Location plot helps us to check the assumption of equal variance (homoscedasticity). Here red color line not completely staright but it is not bad.
- Residual vs Leverage plot helps us to determine if we have influential outliers in our data. Residuals are not present outside of the cooks distance
poisson regression model2: with selected variables
Remove acidity variables and chemical variables from models. We will build one more glm model, using the variables:Density, PH, Sulphates, Alcohol, LabelAppeal, AcidIndex, and STARS.
model2 = glm(TARGET ~
Alcohol + LabelAppeal + AcidIndex + STARS, data=wine_train, family=poisson)
summary(model2)
##
## Call:
## glm(formula = TARGET ~ Alcohol + LabelAppeal + AcidIndex + STARS,
## family = poisson, data = wine_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.2829 -0.4886 0.2126 0.6327 2.7117
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.517121 0.040464 37.493 < 2e-16 ***
## Alcohol 0.005317 0.001410 3.771 0.000163 ***
## LabelAppeal 0.196742 0.006015 32.709 < 2e-16 ***
## AcidIndex -0.124712 0.004369 -28.543 < 2e-16 ***
## STARS 0.222430 0.006461 34.427 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 22861 on 12794 degrees of freedom
## Residual deviance: 18568 on 12790 degrees of freedom
## AIC: 50520
##
## Number of Fisher Scoring iterations: 5
This model looks better than first model.
negative binomial regression model3: with all variables
##
## Call:
## glm.nb(formula = TARGET ~ ., data = wine_train, init.theta = 39421.16225,
## link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.4818 -0.5261 0.2041 0.6365 2.5503
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.017e+00 1.957e-01 10.305 < 2e-16 ***
## FixedAcidity -4.515e-04 8.196e-04 -0.551 0.581705
## VolatileAcidity -5.045e-02 6.493e-03 -7.770 7.83e-15 ***
## CitricAcid 1.331e-02 5.892e-03 2.258 0.023931 *
## ResidualSugar 1.443e-04 1.545e-04 0.934 0.350498
## Chlorides -6.005e-02 1.645e-02 -3.650 0.000262 ***
## FreeSulfurDioxide 1.424e-04 3.513e-05 4.054 5.04e-05 ***
## TotalSulfurDioxide 1.065e-04 2.269e-05 4.695 2.67e-06 ***
## Density -4.315e-01 1.921e-01 -2.247 0.024655 *
## pH -2.389e-02 7.639e-03 -3.128 0.001763 **
## Sulphates -1.877e-02 5.739e-03 -3.271 0.001073 **
## Alcohol 5.368e-03 1.410e-03 3.806 0.000141 ***
## LabelAppeal 1.965e-01 6.021e-03 32.633 < 2e-16 ***
## AcidIndex -1.222e-01 4.464e-03 -27.380 < 2e-16 ***
## STARS 2.198e-01 6.468e-03 33.984 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(39421.16) family taken to be 1)
##
## Null deviance: 22860 on 12794 degrees of freedom
## Residual deviance: 18419 on 12780 degrees of freedom
## AIC: 50394
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 39421
## Std. Err.: 59707
## Warning while fitting theta: iteration limit reached
##
## 2 x log-likelihood: -50361.62
negative binomial regression model4: with selected variables
model4 <- glm.nb(TARGET ~ Alcohol + LabelAppeal + AcidIndex + STARS, data = wine_train)
summary(model4)
##
## Call:
## glm.nb(formula = TARGET ~ Alcohol + LabelAppeal + AcidIndex +
## STARS, data = wine_train, init.theta = 38799.08563, link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.2828 -0.4885 0.2126 0.6327 2.7116
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.517138 0.040465 37.492 < 2e-16 ***
## Alcohol 0.005317 0.001410 3.771 0.000163 ***
## LabelAppeal 0.196743 0.006015 32.708 < 2e-16 ***
## AcidIndex -0.124714 0.004369 -28.542 < 2e-16 ***
## STARS 0.222428 0.006461 34.426 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(38799.09) family taken to be 1)
##
## Null deviance: 22860 on 12794 degrees of freedom
## Residual deviance: 18567 on 12790 degrees of freedom
## AIC: 50522
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 38799
## Std. Err.: 60155
## Warning while fitting theta: iteration limit reached
##
## 2 x log-likelihood: -50510.44
Linear Model model5: with all variables
##
## Call:
## lm(formula = TARGET ~ ., data = wine_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.2211 -0.7540 0.3598 1.1254 4.3550
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.355e+00 5.517e-01 9.707 < 2e-16 ***
## FixedAcidity -1.168e-03 2.315e-03 -0.505 0.613911
## VolatileAcidity -1.549e-01 1.838e-02 -8.429 < 2e-16 ***
## CitricAcid 3.976e-02 1.673e-02 2.377 0.017476 *
## ResidualSugar 4.716e-04 4.371e-04 1.079 0.280670
## Chlorides -1.931e-01 4.638e-02 -4.164 3.15e-05 ***
## FreeSulfurDioxide 4.286e-04 9.941e-05 4.312 1.63e-05 ***
## TotalSulfurDioxide 3.098e-04 6.387e-05 4.851 1.25e-06 ***
## Density -1.274e+00 5.427e-01 -2.347 0.018959 *
## pH -6.387e-02 2.154e-02 -2.965 0.003028 **
## Sulphates -5.485e-02 1.623e-02 -3.380 0.000728 ***
## Alcohol 1.883e-02 3.972e-03 4.739 2.17e-06 ***
## LabelAppeal 5.945e-01 1.686e-02 35.250 < 2e-16 ***
## AcidIndex -3.259e-01 1.117e-02 -29.169 < 2e-16 ***
## STARS 7.478e-01 1.946e-02 38.431 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.626 on 12780 degrees of freedom
## Multiple R-squared: 0.2879, Adjusted R-squared: 0.2871
## F-statistic: 369.1 on 14 and 12780 DF, p-value: < 2.2e-16
Linear Model model6: with selected variables
Added CitricAcid, Sulphates, VolatileAcidity, Chlorides in to the model, these varaibles having low p-value to get better fit model.
model6 <- lm(TARGET ~ CitricAcid + Sulphates + VolatileAcidity + Chlorides +
Alcohol + LabelAppeal + AcidIndex + STARS, data = wine_train)
summary(model6)
##
## Call:
## lm(formula = TARGET ~ CitricAcid + Sulphates + VolatileAcidity +
## Chlorides + Alcohol + LabelAppeal + AcidIndex + STARS, data = wine_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.0223 -0.7411 0.3702 1.1218 4.4756
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.969865 0.106568 37.252 < 2e-16 ***
## CitricAcid 0.041882 0.016760 2.499 0.01247 *
## Sulphates -0.054633 0.016256 -3.361 0.00078 ***
## VolatileAcidity -0.158457 0.018409 -8.608 < 2e-16 ***
## Chlorides -0.199935 0.046443 -4.305 1.68e-05 ***
## Alcohol 0.018274 0.003978 4.594 4.40e-06 ***
## LabelAppeal 0.594530 0.016895 35.190 < 2e-16 ***
## AcidIndex -0.330626 0.010972 -30.133 < 2e-16 ***
## STARS 0.748890 0.019491 38.422 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.63 on 12786 degrees of freedom
## Multiple R-squared: 0.2846, Adjusted R-squared: 0.2842
## F-statistic: 636 on 8 and 12786 DF, p-value: < 2.2e-16
Above model shows Multiple R-squared 0.2846 means model explains 29% variation in the response variable.
Zero inflation model7: poisson
Zero inflation model:
Zero-inflated poisson regression is used to model count data that has an excess of zero counts. Further, theory suggests that the excess zeros are generated by a separate process from the count values and that the excess zeros can be modeled independently. Thus, the zip model has two parts, a poisson count model and the logit model for predicting excess zeros.
##
## Call:
## zeroinfl(formula = TARGET ~ . | STARS, data = wine_train, dist = "poisson")
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## -1.5795 -0.2927 0.1833 0.5092 2.2406
##
## Count model coefficients (poisson with log link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.646e+00 2.061e-01 7.987 1.38e-15 ***
## FixedAcidity 9.837e-05 8.538e-04 0.115 0.90828
## VolatileAcidity -1.900e-02 6.843e-03 -2.777 0.00548 **
## CitricAcid 2.249e-03 6.119e-03 0.368 0.71316
## ResidualSugar -4.630e-05 1.614e-04 -0.287 0.77420
## Chlorides -2.728e-02 1.722e-02 -1.584 0.11313
## FreeSulfurDioxide 4.982e-05 3.593e-05 1.386 0.16561
## TotalSulfurDioxide -3.468e-06 2.296e-05 -0.151 0.87995
## Density -3.195e-01 2.016e-01 -1.585 0.11288
## pH 7.112e-04 7.996e-03 0.089 0.92913
## Sulphates -3.625e-03 6.027e-03 -0.601 0.54755
## Alcohol 7.189e-03 1.461e-03 4.919 8.69e-07 ***
## LabelAppeal 2.435e-01 6.342e-03 38.399 < 2e-16 ***
## AcidIndex -4.335e-02 5.400e-03 -8.028 9.91e-16 ***
## STARS 1.025e-01 6.463e-03 15.852 < 2e-16 ***
##
## Zero-inflation model coefficients (binomial with logit link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.36082 0.06746 -5.349 8.85e-08 ***
## STARS -0.58527 0.03405 -17.188 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Number of iterations in BFGS optimization: 21
## Log-likelihood: -2.302e+04 on 17 Df
Zero inflation model8: negative binomial
##
## Call:
## zeroinfl(formula = TARGET ~ . | STARS, data = wine_train, dist = "negbin")
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## -1.5795 -0.2927 0.1833 0.5092 2.2406
##
## Count model coefficients (negbin with log link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.646e+00 2.061e-01 7.987 1.38e-15 ***
## FixedAcidity 9.837e-05 8.538e-04 0.115 0.90828
## VolatileAcidity -1.900e-02 6.843e-03 -2.777 0.00548 **
## CitricAcid 2.249e-03 6.119e-03 0.368 0.71317
## ResidualSugar -4.630e-05 1.614e-04 -0.287 0.77421
## Chlorides -2.728e-02 1.722e-02 -1.584 0.11313
## FreeSulfurDioxide 4.982e-05 3.593e-05 1.386 0.16561
## TotalSulfurDioxide -3.468e-06 2.296e-05 -0.151 0.87995
## Density -3.195e-01 2.016e-01 -1.585 0.11289
## pH 7.111e-04 7.996e-03 0.089 0.92913
## Sulphates -3.625e-03 6.027e-03 -0.601 0.54754
## Alcohol 7.189e-03 1.461e-03 4.919 8.69e-07 ***
## LabelAppeal 2.435e-01 6.342e-03 38.399 < 2e-16 ***
## AcidIndex -4.335e-02 5.400e-03 -8.028 9.91e-16 ***
## STARS 1.025e-01 6.463e-03 15.852 < 2e-16 ***
## Log(theta) 1.797e+01 NA NA NA
##
## Zero-inflation model coefficients (binomial with logit link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.36082 0.06746 -5.349 8.85e-08 ***
## STARS -0.58527 0.03405 -17.188 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Theta = 63639583.2798
## Number of iterations in BFGS optimization: 21
## Log-likelihood: -2.302e+04 on 18 Df
Model Selection
Models by matrices MSE and AIC.
aic1 <- model1$aic
aic2 <- model2$aic
aic3 <- model3$aic
aic4 <- model4$aic
aic5 <- model5$aic
aic6 <- model6$aic
aic7 <- model7$aic
aic8 <- model8$aic
mse1 <- mean((wine_train$TARGET - predict(model1))^2)
mse2 <- mean((wine_train$TARGET - predict(model2))^2)
mse3 <- mean((wine_train$TARGET - predict(model3))^2)
mse4 <- mean((wine_train$TARGET - predict(model4))^2)
mse5 <- mean((wine_train$TARGET - predict(model5))^2)
mse6 <- mean((wine_train$TARGET - predict(model6))^2)
mse7 <- mean((wine_train$TARGET - predict(model7))^2)
mse8 <- mean((wine_train$TARGET - predict(model8))^2)
MSE <- c(mse1, mse2, mse3, mse4, mse5, mse6, mse7, mse8)
AIC <- c(aic1, aic2, aic3, aic4, aic5, aic6, aic7, aic8)
compare_df <- cbind(MSE,AIC)
rownames(compare_df) <- c("Model1", "Model2", "Model3", "Model4", "Model5", "Model6", "Model7", "Model8")
DT::datatable(compare_df)
Model 6 working fine than other models. Apply model 6 on test data.
Load test dataset to apply model and predict the TARGET.
wine_test =
read.csv("https://raw.githubusercontent.com/mharrisonbaker/DATA621_GroupWork2/main/HW5/wine-evaluation-data.csv",
header = TRUE) %>% dplyr::select(-IN)
dim(wine_test)
## [1] 3335 15
Test data cleaning
Remove Index column and replace NAS with Median value.
## TARGET FixedAcidity VolatileAcidity CitricAcid
## Mode:logical Min. :-18.200 Min. :-2.8300 Min. :-3.1200
## NA's:3335 1st Qu.: 5.200 1st Qu.: 0.0800 1st Qu.: 0.0000
## Median : 6.900 Median : 0.2800 Median : 0.3100
## Mean : 6.864 Mean : 0.3103 Mean : 0.3124
## 3rd Qu.: 9.000 3rd Qu.: 0.6300 3rd Qu.: 0.6050
## Max. : 33.500 Max. : 3.6100 Max. : 3.7600
##
## ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide
## Min. :-128.300 Min. :-1.15000 Min. :-563.00 Min. :-769.00
## 1st Qu.: -2.600 1st Qu.: 0.01600 1st Qu.: 3.00 1st Qu.: 27.25
## Median : 3.600 Median : 0.04700 Median : 30.00 Median : 124.00
## Mean : 5.319 Mean : 0.06143 Mean : 34.95 Mean : 123.41
## 3rd Qu.: 17.200 3rd Qu.: 0.17100 3rd Qu.: 79.25 3rd Qu.: 210.00
## Max. : 145.400 Max. : 1.26300 Max. : 617.00 Max. :1004.00
## NA's :168 NA's :138 NA's :152 NA's :157
## Density pH Sulphates Alcohol
## Min. :0.8898 Min. :0.600 Min. :-3.0700 Min. :-4.20
## 1st Qu.:0.9883 1st Qu.:2.980 1st Qu.: 0.3300 1st Qu.: 9.00
## Median :0.9946 Median :3.210 Median : 0.5000 Median :10.40
## Mean :0.9947 Mean :3.237 Mean : 0.5346 Mean :10.58
## 3rd Qu.:1.0005 3rd Qu.:3.490 3rd Qu.: 0.8200 3rd Qu.:12.50
## Max. :1.0998 Max. :6.210 Max. : 4.1800 Max. :25.60
## NA's :104 NA's :310 NA's :185
## LabelAppeal AcidIndex STARS
## Min. :-2.00000 Min. : 5.000 Min. :1.00
## 1st Qu.:-1.00000 1st Qu.: 7.000 1st Qu.:1.00
## Median : 0.00000 Median : 8.000 Median :2.00
## Mean : 0.01349 Mean : 7.748 Mean :2.04
## 3rd Qu.: 1.00000 3rd Qu.: 8.000 3rd Qu.:3.00
## Max. : 2.00000 Max. :17.000 Max. :4.00
## NA's :841
wine_test$ResidualSugar[is.na(wine_test$ResidualSugar)] <- median(wine_test$ResidualSugar, na.rm=TRUE)
wine_test$Chlorides[is.na(wine_test$Chlorides)] <- median(wine_test$Chlorides, na.rm=TRUE)
wine_test$FreeSulfurDioxide[is.na(wine_test$FreeSulfurDioxide)] <- median(wine_test$FreeSulfurDioxide, na.rm=TRUE)
wine_test$TotalSulfurDioxide[is.na(wine_test$TotalSulfurDioxide)] <- median(wine_test$TotalSulfurDioxide, na.rm=TRUE)
wine_test$pH[is.na(wine_test$pH)] <- median(wine_test$pH, na.rm=TRUE)
wine_test$Sulphates[is.na(wine_test$Sulphates)] <- median(wine_test$Sulphates, na.rm=TRUE)
wine_test$Alcohol[is.na(wine_test$Alcohol)] <- median(wine_test$Alcohol, na.rm=TRUE)
wine_test$STARS[is.na(wine_test$STARS)] <- median(wine_test$STARS, na.rm=TRUE)
# to check NAS exist or not
visdat::vis_miss(wine_test)
## TARGET FixedAcidity VolatileAcidity CitricAcid ResidualSugar Chlorides
## 1 3 5.4 -0.860 0.27 -10.7 0.092
## 2 3 12.4 0.385 -0.76 -19.7 1.169
## 3 2 7.2 1.750 0.17 -33.0 0.065
## 4 2 6.2 0.100 1.80 1.0 -0.179
## 5 2 11.4 0.210 0.28 1.2 0.038
## 6 5 17.6 0.040 -1.15 1.4 0.535
## FreeSulfurDioxide TotalSulfurDioxide Density pH Sulphates Alcohol
## 1 23 398 0.98527 5.02 0.64 12.30
## 2 -37 68 0.99048 3.37 1.09 16.00
## 3 9 76 1.04641 4.61 0.68 8.55
## 4 104 89 0.98877 3.20 2.11 12.30
## 5 70 53 1.02899 2.54 -0.07 4.80
## 6 -250 140 0.95028 3.06 -0.02 11.40
## LabelAppeal AcidIndex STARS
## 1 -1 6 2
## 2 0 6 2
## 3 0 8 1
## 4 -1 8 1
## 5 0 10 2
## 6 1 8 4
Add 3 new column in wine_test
for a classification of wine - poor, okay, and good. Poor will be those with scores of 0,1,2,3; 4 will be okay and more than 4 will be good.
These newly added 3 column will show TRUE and FALSE values based on TARGET.
wine_test$Poor <- wine_test$TARGET <= 3
wine_test$Okay <- wine_test$TARGET == 4
wine_test$Good <- wine_test$TARGET >= 5
# created copy of test_wine and move last 3 col to 2,3,4 index
wine_test_new <- wine_test %>% dplyr::select(TARGET, Poor, Okay, Good, everything())
kable(wine_test_new[1:20,]) %>%
kable_styling(full_width = T)
TARGET | Poor | Okay | Good | FixedAcidity | VolatileAcidity | CitricAcid | ResidualSugar | Chlorides | FreeSulfurDioxide | TotalSulfurDioxide | Density | pH | Sulphates | Alcohol | LabelAppeal | AcidIndex | STARS |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
3 | TRUE | FALSE | FALSE | 5.4 | -0.860 | 0.27 | -10.7 | 0.092 | 23 | 398 | 0.98527 | 5.02 | 0.64 | 12.30 | -1 | 6 | 2 |
3 | TRUE | FALSE | FALSE | 12.4 | 0.385 | -0.76 | -19.7 | 1.169 | -37 | 68 | 0.99048 | 3.37 | 1.09 | 16.00 | 0 | 6 | 2 |
2 | TRUE | FALSE | FALSE | 7.2 | 1.750 | 0.17 | -33.0 | 0.065 | 9 | 76 | 1.04641 | 4.61 | 0.68 | 8.55 | 0 | 8 | 1 |
2 | TRUE | FALSE | FALSE | 6.2 | 0.100 | 1.80 | 1.0 | -0.179 | 104 | 89 | 0.98877 | 3.20 | 2.11 | 12.30 | -1 | 8 | 1 |
2 | TRUE | FALSE | FALSE | 11.4 | 0.210 | 0.28 | 1.2 | 0.038 | 70 | 53 | 1.02899 | 2.54 | -0.07 | 4.80 | 0 | 10 | 2 |
5 | FALSE | FALSE | TRUE | 17.6 | 0.040 | -1.15 | 1.4 | 0.535 | -250 | 140 | 0.95028 | 3.06 | -0.02 | 11.40 | 1 | 8 | 4 |
2 | TRUE | FALSE | FALSE | 15.5 | 0.530 | -0.53 | 4.6 | 1.263 | 10 | 17 | 1.00020 | 3.07 | 0.75 | 8.50 | 0 | 12 | 3 |
4 | FALSE | TRUE | FALSE | 15.9 | 1.190 | 1.14 | 31.9 | -0.299 | 115 | 381 | 1.03416 | 2.99 | 0.31 | 11.40 | 1 | 7 | 2 |
1 | TRUE | FALSE | FALSE | 11.6 | 0.320 | 0.55 | -50.9 | 0.076 | 35 | 83 | 1.00020 | 3.32 | 2.18 | -0.50 | 0 | 12 | 2 |
3 | TRUE | FALSE | FALSE | 3.8 | 0.220 | 0.31 | -7.7 | 0.039 | 40 | 129 | 0.90610 | 4.72 | -0.64 | 10.90 | 0 | 7 | 2 |
2 | TRUE | FALSE | FALSE | 6.8 | 1.680 | 0.44 | -13.3 | 0.046 | 30 | 583 | 1.00833 | 3.21 | 1.64 | 12.60 | 0 | 8 | 1 |
3 | TRUE | FALSE | FALSE | 9.0 | -0.210 | 0.04 | 51.4 | 0.237 | -213 | -527 | 0.99516 | 3.16 | 0.70 | 14.70 | 1 | 10 | 2 |
3 | TRUE | FALSE | FALSE | 24.6 | 0.030 | -1.20 | 1.3 | 0.035 | 241 | 297 | 0.99232 | 2.22 | 0.50 | 9.80 | 0 | 9 | 2 |
3 | TRUE | FALSE | FALSE | 13.0 | 0.210 | 0.32 | -3.2 | -0.263 | 111 | 141 | 0.95918 | 3.20 | 0.50 | 4.20 | 0 | 8 | 2 |
2 | TRUE | FALSE | FALSE | 17.9 | -0.420 | -0.91 | 7.1 | 0.045 | -177 | 169 | 0.95307 | 3.17 | -1.12 | 13.20 | -1 | 9 | 2 |
1 | TRUE | FALSE | FALSE | 10.0 | 0.200 | 1.27 | 30.9 | 0.050 | 19 | 152 | 0.99400 | 3.21 | 0.42 | 13.80 | -1 | 11 | 2 |
2 | TRUE | FALSE | FALSE | 7.4 | 0.290 | 0.50 | 8.5 | -0.480 | 178 | 647 | 0.97275 | 3.45 | 0.50 | 10.20 | -1 | 8 | 1 |
1 | TRUE | FALSE | FALSE | 11.7 | 1.180 | -0.94 | -62.0 | 0.675 | 7 | -393 | 0.99974 | 3.96 | 0.69 | 5.20 | 1 | 13 | 2 |
3 | TRUE | FALSE | FALSE | 9.7 | 0.410 | -1.00 | 3.6 | -0.235 | 24 | 113 | 0.99772 | 3.44 | 0.53 | 9.80 | 0 | 7 | 2 |
5 | FALSE | FALSE | TRUE | -5.2 | -0.980 | -0.08 | 6.4 | 0.046 | 180 | 166 | 0.99400 | 3.30 | 2.18 | 9.90 | 1 | 5 | 3 |
Reference:
https://stats.idre.ucla.edu/r/dae/negative-binomial-regression/