Data 621 Homework 5

Introduction

In this assignment, we are tasked with predicting the number of cases of wine sold to wine distribution companies following a sampling. The target variable, cases of wine sold, is count data and therefore will be modeled using appropriate techniques such as Poisson and Negative Binomial regressions.

data_train <- read.csv("https://raw.githubusercontent.com/salma71/Data_621/master/HW_5/wine-training-data.csv", header = TRUE)
data_test <- read.csv("https://raw.githubusercontent.com/salma71/Data_621/master/HW_5/wine-evaluation-data.csv", header = TRUE)

Data Exploration

All the variables in this dataset are numeric and continuous except for AcidIndex, STARS, and LabelAppeal, which are discrete. The target variable TARGET is also discrete. There are a number of missing observations for certain chemical composition variables, as well as a large number of wines with no STARS rating. The distributions of the continuous variables appear well centered.

skim(data_train)
Data summary
Name data_train
Number of rows 12795
Number of columns 16
_______________________
Column type frequency:
numeric 16
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
INDEX 0 1.00 8069.98 4656.91 1.00 4037.50 8110.00 12106.50 16129.00 ▇▇▇▇▇
TARGET 0 1.00 3.03 1.93 0.00 2.00 3.00 4.00 8.00 ▆▇▇▆▁
FixedAcidity 0 1.00 7.08 6.32 -18.10 5.20 6.90 9.50 34.40 ▁▂▇▂▁
VolatileAcidity 0 1.00 0.32 0.78 -2.79 0.13 0.28 0.64 3.68 ▁▂▇▂▁
CitricAcid 0 1.00 0.31 0.86 -3.24 0.03 0.31 0.58 3.86 ▁▂▇▂▁
ResidualSugar 616 0.95 5.42 33.75 -127.80 -2.00 3.90 15.90 141.15 ▁▂▇▂▁
Chlorides 638 0.95 0.05 0.32 -1.17 -0.03 0.05 0.15 1.35 ▁▂▇▂▁
FreeSulfurDioxide 647 0.95 30.85 148.71 -555.00 0.00 30.00 70.00 623.00 ▁▂▇▂▁
TotalSulfurDioxide 682 0.95 120.71 231.91 -823.00 27.00 123.00 208.00 1057.00 ▁▂▇▂▁
Density 0 1.00 0.99 0.03 0.89 0.99 0.99 1.00 1.10 ▁▂▇▂▁
pH 395 0.97 3.21 0.68 0.48 2.96 3.20 3.47 6.13 ▁▂▇▂▁
Sulphates 1210 0.91 0.53 0.93 -3.13 0.28 0.50 0.86 4.24 ▁▂▇▂▁
Alcohol 653 0.95 10.49 3.73 -4.70 9.00 10.40 12.40 26.50 ▁▂▇▂▁
LabelAppeal 0 1.00 -0.01 0.89 -2.00 -1.00 0.00 1.00 2.00 ▁▅▇▅▁
AcidIndex 0 1.00 7.77 1.32 4.00 7.00 8.00 8.00 17.00 ▁▇▁▁▁
STARS 3359 0.74 2.04 0.90 1.00 1.00 2.00 3.00 4.00 ▇▇▁▅▂
skim(data_test)
Data summary
Name data_test
Number of rows 3335
Number of columns 16
_______________________
Column type frequency:
logical 1
numeric 15
________________________
Group variables None

Variable type: logical

skim_variable n_missing complete_rate mean count
TARGET 3335 0 NaN :

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
IN 0 1.00 8048.31 4655.48 3.00 4018.50 7906.00 12061.00 16130.00 ▇▇▇▇▇
FixedAcidity 0 1.00 6.86 6.32 -18.20 5.20 6.90 9.00 33.50 ▁▂▇▂▁
VolatileAcidity 0 1.00 0.31 0.81 -2.83 0.08 0.28 0.63 3.61 ▁▂▇▂▁
CitricAcid 0 1.00 0.31 0.87 -3.12 0.00 0.31 0.60 3.76 ▁▂▇▂▁
ResidualSugar 168 0.95 5.32 34.37 -128.30 -2.60 3.60 17.20 145.40 ▁▂▇▂▁
Chlorides 138 0.96 0.06 0.31 -1.15 0.02 0.05 0.17 1.26 ▁▂▇▂▁
FreeSulfurDioxide 152 0.95 34.95 149.63 -563.00 3.00 30.00 79.25 617.00 ▁▂▇▂▁
TotalSulfurDioxide 157 0.95 123.41 225.80 -769.00 27.25 124.00 210.00 1004.00 ▁▂▇▂▁
Density 0 1.00 0.99 0.03 0.89 0.99 0.99 1.00 1.10 ▁▂▇▂▁
pH 104 0.97 3.24 0.68 0.60 2.98 3.21 3.49 6.21 ▁▂▇▂▁
Sulphates 310 0.91 0.53 0.91 -3.07 0.33 0.50 0.82 4.18 ▁▂▇▂▁
Alcohol 185 0.94 10.58 3.76 -4.20 9.00 10.40 12.50 25.60 ▁▂▇▂▁
LabelAppeal 0 1.00 0.01 0.89 -2.00 -1.00 0.00 1.00 2.00 ▁▅▇▅▁
AcidIndex 0 1.00 7.75 1.32 5.00 7.00 8.00 8.00 17.00 ▇▇▁▁▁
STARS 841 0.75 2.04 0.91 1.00 1.00 2.00 3.00 4.00 ▇▇▁▅▂
# remove index column as it is not needed
data_train <- data_train %>% 
  dplyr::select(-"INDEX")
data_test <- data_test %>% 
  dplyr::select(-"IN")

Visualization

The visualizations below display the distributions of the variables in the dataset. All continuous variables are centered and close to normally distributed. We note that some variables are centered around zero and take negative values, which is unexpected and will be investigated further and transformed for our analysis. The distribution of the TARGET variable looks like it could be well described by the Poisson distribution, which has equal mean and variance, but the high number of zero values also justifies the use of a zero-inflated model. The mean and variance of TARGET are 3.02 and 3.71 respectively, which is close enough to satisfy the equal mean-variance assumption of the Poisson distribution.
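
As a quick numeric check of the claims above, the mean, variance, and share of zeros in TARGET can be computed directly (a minimal sketch):

# mean-variance check for the Poisson assumption and the proportion of zero counts in TARGET
c(mean = mean(data_train$TARGET),
  variance = var(data_train$TARGET),
  prop_zero = mean(data_train$TARGET == 0))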

# histogram
data_train %>% 
  dplyr::select(-c("AcidIndex", "STARS", "TARGET", "LabelAppeal")) %>% 
  gather() %>% 
  ggplot(aes(value)) +
  facet_wrap(~key, scale = "free",  ncol = 3) +
  geom_histogram(binwidth = function(x) 2 * IQR(x) / (length(x)^(1/3)), fill="pink") +
  theme_minimal()

data_train %>% 
  ggplot(aes(x=TARGET)) + 
  geom_histogram(fill='pink')

In the box plots below, we plot the variables in two panels. TotalSulfurDioxide, FreeSulfurDioxide, and ResidualSugar have much larger ranges than the other variables, so we separate them into their own panel to view their distributions. From both panels, we can see that many variables have numerous outliers.

library(ggpubr)
# boxplot
p1 <- data_train %>% 
  dplyr::select(-c("TotalSulfurDioxide", "FreeSulfurDioxide", "ResidualSugar")) %>% 
  gather(na.rm = TRUE) %>% 
  ggplot(aes(factor(key), value)) +
  geom_boxplot(outlier.colour = "#e281cf", outlier.shape = 1,  color = "#5aa1ed") +
  coord_flip() +
  labs(title = "Boxplot of Chemical Properties of Wine", x = "Chemical Properties", y = "Values") +
  theme_minimal()
p2 <- data_train %>% 
  dplyr::select(c("TotalSulfurDioxide", "FreeSulfurDioxide", "ResidualSugar")) %>% 
  gather(na.rm = TRUE) %>% 
  ggplot(aes(factor(key), value)) +
  geom_boxplot(outlier.colour = "#e281cf", outlier.shape = 1, color = "#5aa1ed") +
  #labs(title = "Boxplot of Chemical Properties of Wine", x = "Chemical Properties", y = "Values") +
  theme_minimal()
ggarrange(p1, p2)

In the bar charts below, AcidIndex shows that most of the wine sold has an index of 7 or 8. LabelAppeal shows that generically labeled wine sells the most; however, a better label does yield a higher number of cases per order. Lastly, STARS shows that excellent quality does not necessarily result in more wine orders, which could be due to the higher price tag of highly rated bottles.

# barchart
p3 <- data_train %>% 
  dplyr::select(TARGET, STARS) %>% 
  mutate(STARS = as.factor(STARS),
         TARGET = as.factor(TARGET)) %>% 
  ggplot(aes(STARS)) +
  geom_bar(aes(fill = TARGET)) +
  theme_minimal()
p4 <- data_train %>%
  dplyr::select(TARGET, LabelAppeal) %>% 
  mutate(LabelAppeal = as.factor(LabelAppeal),
         TARGET = as.factor(TARGET)) %>% 
  ggplot(aes(LabelAppeal)) +
  geom_bar(aes(fill = TARGET)) +
  theme_minimal()
p5 <- data_train %>% 
  dplyr::select(TARGET, AcidIndex) %>% 
  mutate(AcidIndex = as.factor(AcidIndex),
         TARGET = as.factor(TARGET)) %>% 
  ggplot(aes(AcidIndex)) +
  geom_bar(aes(fill = TARGET)) +
  theme_minimal()
ggarrange(p5, ggarrange(p3, p4, ncol = 2, nrow = 1, legend = "none"), nrow = 2, common.legend = TRUE)

Correlations with Response Variable

# top correlation
wine_train_corr <- data_train %>% 
  drop_na() %>% 
  cor()
kable(sort(wine_train_corr[,1], decreasing = T), col.names = c("Correlation")) %>% 
  kable_styling(full_width = F)
Correlation
TARGET 1.0000000
STARS 0.5546857
LabelAppeal 0.4979465
Alcohol 0.0737771
FreeSulfurDioxide 0.0226398
TotalSulfurDioxide 0.0216021
ResidualSugar 0.0035196
CitricAcid 0.0023450
pH 0.0002199
FixedAcidity -0.0125381
Sulphates -0.0212204
Chlorides -0.0304301
Density -0.0475989
VolatileAcidity -0.0759979
AcidIndex -0.1676431

In the correlation table and plot below, we see that STARS and LabelAppeal are the variables most positively correlated with the response. We expected this because the variable descriptions indicate that these variables have a larger theoretical effect on sales than the others. We also see a mild negative correlation between the response variable and AcidIndex.

library(corrplot)
library(RColorBrewer)
# correlation plot
corrplot(wine_train_corr, 
         method = "number", 
         type = "lower",
         col = brewer.pal(n = 9, name = "Reds"), # the "Reds" palette supports at most 9 colors
         number.cex = .7, tl.cex = .7,
         tl.col = "black", tl.srt = 45)

Data Preparation

Using the aggr function from the VIM package, we see that several variables have missing values. According to the UCI Machine Learning Repository, which published this dataset, all wines contain some natural sulfites, so we will impute the missing values for the sulfite-related chemical properties. It is also rare to find wines with less than 1 gram/liter of sugar, so we will impute ResidualSugar as well. In fact, since we treat all missing values as missing at random, we will impute every missing value using the mice package with the random forest method. mice uses multivariate imputation to estimate the missing values, and using multiple imputations helps resolve the uncertainty around the missingness. The TARGET variable will be removed as a predictor in the imputation model but will still be imputed.

library(VIM)
# missing value columns
aggr(data_train, 
     sortVars=TRUE, 
     labels=names(data_train), 
     cex.axis=.5, 
     bars = FALSE, 
     col = c("white", "#E46726"),
     combined = TRUE,
     #border = NA,
     ylab = "Missing Values") 


 Variables sorted by number of missings: 
           Variable Count
              STARS  3359
          Sulphates  1210
 TotalSulfurDioxide   682
            Alcohol   653
  FreeSulfurDioxide   647
          Chlorides   638
      ResidualSugar   616
                 pH   395
             TARGET     0
       FixedAcidity     0
    VolatileAcidity     0
         CitricAcid     0
            Density     0
        LabelAppeal     0
          AcidIndex     0
library(mice)
# imputating train data
init <- mice(data_train)
meth <- init$method
predM <- init$predictorMatrix
predM[, c("TARGET")] <- 0 #this code will remove the variable as a predictor but still will be imputed
data_train_impute <- mice(data_train, method = 'rf', predictorMatrix=predM)
data_train_imputed <- complete(data_train_impute)
print(paste0("Missing value after imputation: ", sum(is.na(data_train_imputed))))

Model Building

Model 1: Poisson (Raw data)

# poisson model with the missing values
model1 <- glm(TARGET ~ ., family = poisson, data_train)
summary(model1)

Call:
glm(formula = TARGET ~ ., family = poisson, data = data_train)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-3.2158  -0.2734   0.0616   0.3732   1.6830  

Coefficients:
                     Estimate Std. Error z value Pr(>|z|)    
(Intercept)         1.593e+00  2.506e-01   6.359 2.03e-10 ***
FixedAcidity        3.293e-04  1.053e-03   0.313  0.75447    
VolatileAcidity    -2.560e-02  8.353e-03  -3.065  0.00218 ** 
CitricAcid         -7.259e-04  7.575e-03  -0.096  0.92365    
ResidualSugar      -6.141e-05  1.941e-04  -0.316  0.75165    
Chlorides          -3.007e-02  2.056e-02  -1.463  0.14346    
FreeSulfurDioxide   6.734e-05  4.404e-05   1.529  0.12620    
TotalSulfurDioxide  2.081e-05  2.855e-05   0.729  0.46618    
Density            -3.725e-01  2.462e-01  -1.513  0.13026    
pH                 -4.661e-03  9.598e-03  -0.486  0.62722    
Sulphates          -5.164e-03  7.051e-03  -0.732  0.46398    
Alcohol             3.948e-03  1.771e-03   2.229  0.02579 *  
LabelAppeal         1.771e-01  7.954e-03  22.271  < 2e-16 ***
AcidIndex          -4.870e-02  5.903e-03  -8.251  < 2e-16 ***
STARS               1.871e-01  7.487e-03  24.993  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 5844.1  on 6435  degrees of freedom
Residual deviance: 4009.1  on 6421  degrees of freedom
  (6359 observations deleted due to missingness)
AIC: 23172

Number of Fisher Scoring iterations: 5
print('Goodness of Fit Test:')
[1] "Goodness of Fit Test:"
with(model1, cbind(res.deviance = deviance, df = df.residual,  p = pchisq(deviance, df.residual, lower.tail=FALSE)))
     res.deviance   df p
[1,]     4009.142 6421 1

The deviance residuals are fairly symmetrical, which suggests that the predicted values are close to the observed values. As expected from our correlation table, STARS, LabelAppeal, and AcidIndex are significant variables, and the variation in the standard errors is low. The goodness-of-fit test has a high p-value, which indicates that the model fits the data well. Note, however, that 6,359 observations were dropped from this model because of missing values.

Model 2: Poisson (Imputed Data)

# poisson model with the imputed values
model2 <- glm(TARGET ~ ., family = poisson, data_train_imputed)
summary(model2)

Call:
glm(formula = TARGET ~ ., family = poisson, data = data_train_imputed)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-3.9897  -0.5121   0.2106   0.6255   2.5442  

Coefficients:
                     Estimate Std. Error z value Pr(>|z|)    
(Intercept)         2.075e+00  1.953e-01  10.628  < 2e-16 ***
FixedAcidity       -3.273e-04  8.202e-04  -0.399  0.68988    
VolatileAcidity    -4.780e-02  6.492e-03  -7.362 1.81e-13 ***
CitricAcid          1.280e-02  5.894e-03   2.171  0.02989 *  
ResidualSugar       1.490e-04  1.517e-04   0.982  0.32609    
Chlorides          -6.662e-02  1.611e-02  -4.136 3.54e-05 ***
FreeSulfurDioxide   1.321e-04  3.452e-05   3.826  0.00013 ***
TotalSulfurDioxide  1.055e-04  2.219e-05   4.756 1.97e-06 ***
Density            -4.285e-01  1.917e-01  -2.235  0.02542 *  
pH                 -2.443e-02  7.518e-03  -3.250  0.00116 ** 
Sulphates          -1.610e-02  5.527e-03  -2.914  0.00357 ** 
Alcohol             5.455e-03  1.380e-03   3.953 7.71e-05 ***
LabelAppeal         1.989e-01  6.015e-03  33.071  < 2e-16 ***
AcidIndex          -1.213e-01  4.469e-03 -27.149  < 2e-16 ***
STARS               1.894e-01  5.790e-03  32.713  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 22861  on 12794  degrees of freedom
Residual deviance: 18495  on 12780  degrees of freedom
AIC: 50467

Number of Fisher Scoring iterations: 5
print('Goodness of Fit Test:')
[1] "Goodness of Fit Test:"
with(model2, cbind(res.deviance = deviance, df = df.residual, p = pchisq(deviance, df.residual, lower.tail=FALSE)))
     res.deviance    df             p
[1,]     18495.18 12780 6.080355e-218

The deviance residuals are again fairly symmetrical, and with imputation more of the predictors become significant. The AIC (50,467 vs. 23,172) and residual deviance (18,495 vs. 4,009) are larger than in Model 1, but the two models are not directly comparable: Model 1 was fit on only the 6,436 complete cases, whereas Model 2 uses all 12,795 imputed observations. The goodness-of-fit test now has a very small p-value, suggesting that this model does not fit the data well. Since the residual deviance exceeds the residual degrees of freedom, some over-dispersion may be present.
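
As a rough check on this, the dispersion can be estimated from the model itself; ratios of the residual deviance or Pearson statistic to the residual degrees of freedom well above 1 indicate over-dispersion (a minimal sketch using model2):

# dispersion diagnostics for the imputed-data Poisson model
deviance(model2) / df.residual(model2)                            # deviance-based ratio
sum(residuals(model2, type = "pearson")^2) / df.residual(model2)  # Pearson-based ratio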

Given what we observed when looking at the distribution of TARGET, we should expect the inflated count of zeros to affect the model and bias the results. For this reason, we also move to a zero-inflated model to reflect this.
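
To see how pronounced the zero inflation is, we can compare the observed number of zero counts with the number a plain Poisson fit would expect (a minimal sketch based on model2):

# observed zeros in the training data vs. zeros expected under the fitted Poisson model
c(observed_zeros = sum(data_train_imputed$TARGET == 0),
  expected_zeros = round(sum(dpois(0, fitted(model2)))))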

Model 3: Quasipoisson Model

We try a quasi-Poisson model to account for any overdispersion and to see whether the results change significantly. As seen in the summary below, the coefficient estimates are identical to those of Model 2, and with an estimated dispersion parameter of about 0.96 the standard errors are nearly unchanged.

# quasipoisson model with the imputed values
model3 <- glm(TARGET ~ ., family = quasipoisson(link='log'), data_train_imputed)
summary(model3)

Call:
glm(formula = TARGET ~ ., family = quasipoisson(link = "log"), 
    data = data_train_imputed)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-3.9897  -0.5121   0.2106   0.6255   2.5442  

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)         2.075e+00  1.912e-01  10.854  < 2e-16 ***
FixedAcidity       -3.273e-04  8.031e-04  -0.408 0.683619    
VolatileAcidity    -4.780e-02  6.356e-03  -7.520 5.86e-14 ***
CitricAcid          1.280e-02  5.771e-03   2.218 0.026582 *  
ResidualSugar       1.490e-04  1.485e-04   1.003 0.315892    
Chlorides          -6.662e-02  1.577e-02  -4.224 2.41e-05 ***
FreeSulfurDioxide   1.321e-04  3.380e-05   3.907 9.38e-05 ***
TotalSulfurDioxide  1.055e-04  2.172e-05   4.858 1.20e-06 ***
Density            -4.285e-01  1.877e-01  -2.283 0.022466 *  
pH                 -2.443e-02  7.361e-03  -3.319 0.000906 ***
Sulphates          -1.610e-02  5.411e-03  -2.976 0.002925 ** 
Alcohol             5.455e-03  1.351e-03   4.038 5.43e-05 ***
LabelAppeal         1.989e-01  5.889e-03  33.777  < 2e-16 ***
AcidIndex          -1.213e-01  4.376e-03 -27.729  < 2e-16 ***
STARS               1.894e-01  5.669e-03  33.411  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for quasipoisson family taken to be 0.9586313)

    Null deviance: 22861  on 12794  degrees of freedom
Residual deviance: 18495  on 12780  degrees of freedom
AIC: NA

Number of Fisher Scoring iterations: 5
print('Goodness of Fit Test:')
[1] "Goodness of Fit Test:"
with(model3, cbind(res.deviance = deviance, df = df.residual, p = pchisq(deviance, df.residual, lower.tail=FALSE)))
     res.deviance    df             p
[1,]     18495.18 12780 6.080355e-218
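
Because the estimated dispersion parameter is about 0.96, the quasi-Poisson standard errors are essentially the same as the Poisson ones; this can be verified side by side (a minimal sketch):

# compare coefficient standard errors from the Poisson (model2) and quasi-Poisson (model3) fits
round(cbind(poisson_se = summary(model2)$coefficients[, "Std. Error"],
            quasi_se = summary(model3)$coefficients[, "Std. Error"]), 5)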

Model 4: Zero Inflated

We saw earlier that the dependent variable has an excess number of zeros, which skews the distribution away from a typical Poisson. The zero-inflated model fits two sets of coefficients: a binomial (logit) model for the probability of an excess zero and a Poisson (log link) model for the count part.

library(pscl)  # zeroinfl() is provided by the pscl package
model4 <- zeroinfl(TARGET ~ ., data_train_imputed, dist = 'poisson')
summary(model4)

Call:
zeroinfl(formula = TARGET ~ ., data = data_train_imputed, dist = "poisson")

Pearson residuals:
    Min      1Q  Median      3Q     Max 
-2.3594 -0.3697  0.1600  0.4991  4.0626 

Count model coefficients (poisson with log link):
                     Estimate Std. Error z value Pr(>|z|)    
(Intercept)         1.462e+00  2.068e-01   7.071 1.54e-12 ***
FixedAcidity        3.460e-04  8.588e-04   0.403  0.68700    
VolatileAcidity    -1.248e-02  6.860e-03  -1.819  0.06897 .  
CitricAcid         -1.818e-04  6.143e-03  -0.030  0.97639    
ResidualSugar      -9.720e-05  1.590e-04  -0.611  0.54110    
Chlorides          -2.046e-02  1.698e-02  -1.205  0.22828    
FreeSulfurDioxide   2.811e-05  3.541e-05   0.794  0.42732    
TotalSulfurDioxide -2.378e-05  2.247e-05  -1.058  0.29001    
Density            -3.329e-01  2.029e-01  -1.641  0.10075    
pH                  5.900e-03  7.923e-03   0.745  0.45647    
Sulphates           1.382e-05  5.855e-03   0.002  0.99812    
Alcohol             7.407e-03  1.436e-03   5.157 2.51e-07 ***
LabelAppeal         2.512e-01  6.379e-03  39.378  < 2e-16 ***
AcidIndex          -1.706e-02  4.992e-03  -3.419  0.00063 ***
STARS               9.259e-02  6.250e-03  14.816  < 2e-16 ***

Zero-inflation model coefficients (binomial with logit link):
                     Estimate Std. Error z value Pr(>|z|)    
(Intercept)        -5.6784393  1.0502155  -5.407 6.41e-08 ***
FixedAcidity        0.0030266  0.0043405   0.697  0.48561    
VolatileAcidity     0.2296997  0.0349216   6.578 4.78e-11 ***
CitricAcid         -0.0775271  0.0316287  -2.451  0.01424 *  
ResidualSugar      -0.0013992  0.0008099  -1.728  0.08404 .  
Chlorides           0.2792240  0.0861694   3.240  0.00119 ** 
FreeSulfurDioxide  -0.0006491  0.0001830  -3.547  0.00039 ***
TotalSulfurDioxide -0.0008063  0.0001170  -6.889 5.62e-12 ***
Density             0.8137179  1.0328093   0.788  0.43077    
pH                  0.1907648  0.0403594   4.727 2.28e-06 ***
Sulphates           0.1040291  0.0297462   3.497  0.00047 ***
Alcohol             0.0084735  0.0073599   1.151  0.24961    
LabelAppeal         0.3329681  0.0332265  10.021  < 2e-16 ***
AcidIndex           0.4724425  0.0198762  23.769  < 2e-16 ***
STARS              -0.6268896  0.0362024 -17.316  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Number of iterations in BFGS optimization: 35 
Log-likelihood: -2.25e+04 on 30 Df
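
For intuition, the fitted response from zeroinfl combines the two parts: the expected count equals the count-model mean times one minus the estimated probability of an excess zero. A minimal sketch using the predict methods from pscl:

# decompose the zero-inflated prediction for the first few training rows
mu <- predict(model4, type = "count")[1:5]   # mean of the Poisson count component
p0 <- predict(model4, type = "zero")[1:5]    # estimated probability of an excess zero
cbind(mu, p0, expected = mu * (1 - p0), fitted = fitted(model4)[1:5])

Since FixedAcidity and Density are not significant in either part of the model, we refit the zero-inflated model without them below.
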
model4b <- zeroinfl(TARGET ~ . - FixedAcidity - Density, data_train_imputed, dist = 'poisson')
summary(model4b)

Call:
zeroinfl(formula = TARGET ~ . - FixedAcidity - Density, data = data_train_imputed, 
    dist = "poisson")

Pearson residuals:
    Min      1Q  Median      3Q     Max 
-2.3359 -0.3699  0.1592  0.4988  3.7857 

Count model coefficients (poisson with log link):
                     Estimate Std. Error z value Pr(>|z|)    
(Intercept)         1.134e+00  5.211e-02  21.763  < 2e-16 ***
VolatileAcidity    -1.256e-02  6.860e-03  -1.830 0.067228 .  
CitricAcid         -9.605e-05  6.142e-03  -0.016 0.987523    
ResidualSugar      -9.721e-05  1.590e-04  -0.611 0.541086    
Chlorides          -2.121e-02  1.697e-02  -1.249 0.211532    
FreeSulfurDioxide   2.711e-05  3.539e-05   0.766 0.443629    
TotalSulfurDioxide -2.462e-05  2.246e-05  -1.096 0.273068    
pH                  5.949e-03  7.924e-03   0.751 0.452753    
Sulphates           1.413e-04  5.853e-03   0.024 0.980736    
Alcohol             7.416e-03  1.436e-03   5.164 2.41e-07 ***
LabelAppeal         2.511e-01  6.378e-03  39.376  < 2e-16 ***
AcidIndex          -1.718e-02  4.945e-03  -3.474 0.000512 ***
STARS               9.272e-02  6.250e-03  14.835  < 2e-16 ***

Zero-inflation model coefficients (binomial with logit link):
                     Estimate Std. Error z value Pr(>|z|)    
(Intercept)        -4.8715049  0.2403040 -20.272  < 2e-16 ***
VolatileAcidity     0.2304532  0.0349022   6.603 4.03e-11 ***
CitricAcid         -0.0778846  0.0315992  -2.465 0.013710 *  
ResidualSugar      -0.0014010  0.0008097  -1.730 0.083585 .  
Chlorides           0.2793632  0.0861372   3.243 0.001182 ** 
FreeSulfurDioxide  -0.0006467  0.0001830  -3.534 0.000409 ***
TotalSulfurDioxide -0.0008059  0.0001170  -6.887 5.68e-12 ***
pH                  0.1908997  0.0403516   4.731 2.24e-06 ***
Sulphates           0.1043426  0.0297334   3.509 0.000449 ***
Alcohol             0.0084314  0.0073580   1.146 0.251840    
LabelAppeal         0.3321148  0.0332068  10.001  < 2e-16 ***
AcidIndex           0.4754149  0.0195356  24.336  < 2e-16 ***
STARS              -0.6265672  0.0361863 -17.315  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Number of iterations in BFGS optimization: 31 
Log-likelihood: -2.25e+04 on 26 Df
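
To formally compare the zero-inflated specification against the standard Poisson, the Vuong test for non-nested models from the pscl package could also be consulted; a statistic significantly in favor of model4b would support the zero-inflated specification (shown here as a sketch, not relied on for the selection below):

# Vuong non-nested test: zero-inflated Poisson (model4b) vs. standard Poisson (model2)
vuong(model4b, model2)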

Model Selection

Based on the models tested, we select the zero-inflated model (Model 4b) because it achieves the highest accuracy of the candidate models, as shown in the prediction histograms and confusion matrix summaries below.

pred_train <- data.frame(TARGET=data_train_imputed$TARGET, 
                         model2=model2$fitted.values, 
                         model3=model3$fitted.values, 
                         model4b=model4b$fitted.values)
pred_train <- round(pred_train, 0)
colnames(pred_train) <- c("TARGET","Poisson (Imputed)" ,"Quasipoisson Model", "Zero Inflated")
pred_train %>%
  gather() %>% 
  ggplot(aes(value)) +
  facet_wrap(~key, scale = "free",  ncol = 4) +
  geom_bar(fill="pink") +
  theme_minimal() + labs(x="Cases Bought", y = "Count", title = "Prediction Histogram")

library(caret)  # confusionMatrix() is provided by the caret package
model2_fitted.values <- factor(round(model2$fitted.values),levels=rev(0:9))
model3_fitted.values <- factor(round(model3$fitted.values),levels=rev(0:9))
model4b_fitted.values <- factor(round(model4b$fitted.values),levels=rev(0:9))
m2_cfm <- confusionMatrix(model2_fitted.values, factor(data_train_imputed$TARGET,levels=rev(0:9)))
m3_cfm <- confusionMatrix(model3_fitted.values, factor(data_train_imputed$TARGET,levels=rev(0:9)))
m4_cfm <- confusionMatrix(model4b_fitted.values, factor(data_train_imputed$TARGET,levels=rev(0:9)))
models_sum <- data.frame(m2_cfm$overall, m3_cfm$overall, m4_cfm$overall)
colnames(models_sum) <- c("Poisson (Imputed)" ,"Quasipoisson Model", "Zero Inflated")
round(models_sum, 2)

Finally, we impute the missing values in the evaluation data using the same procedure as for the training data, then use the selected zero-inflated model to predict the number of cases bought and write the predictions to a CSV file.

# imputing test data
init <- mice(data_test)
meth <- init$method
predM <- init$predictorMatrix
predM[, c("TARGET")] <- 0 #this code will remove the variable as a predictor but still will be imputed
data_test_impute <- mice(data_test, method = 'rf', predictorMatrix=predM)
data_test_imputed <- complete(data_test_impute)
print(paste0("Missing value after imputation: ", sum(is.na(data_test_imputed))))
test_predict <- predict(model4b, newdata=data_test_imputed)
test_predict <- round(test_predict,0)
data_pred <- data.frame(TARGET=test_predict)
ggplot(data_pred, aes(x=TARGET)) + geom_bar(fill="orchid3") + theme_minimal() +
  labs(y="Count", title = "Prediction: Zero Inflated Model (Model4b)") + 
    scale_x_discrete(name = "Cases Bought", limits=c("1","2","3","4", "5", "6", "7", "8"))

write.csv(test_predict, "WinePredictions.csv")