Overview

In this homework assignment, we explore, analyze, and model a data set containing information on approximately 12,000 commercially available wines. The variables are mostly related to the chemical properties of the wine being sold. The response variable is the number of sample cases of wine that were purchased by wine distribution companies after sampling a wine. These cases would be used to provide tasting samples to restaurants and wine stores around the United States. The more sample cases purchased, the more likely a wine is to be sold at a high-end restaurant. A large wine manufacturer is studying the data in order to predict the number of wine cases ordered based on the wine characteristics. If the manufacturer can predict the number of cases, it will be able to adjust its wine offering to maximize sales.

Our objective is to build a count regression model to predict the number of cases of wine that will be sold given certain properties of the wine. HINT: Sometimes, the fact that a variable is missing is actually predictive of the target. We will only use the variables given to us (or variables that we derive from the variables provided). Below is a short description of the variables of interest in the data set:

VARIABLE NAME DEFINITION THEORETICAL EFFECT

  • INDEX: Identification Variable (do not use)

    • EFFECT: None
  • TARGET: Number of Cases Purchased

    • EFFECT: None
  • AcidIndex: Proprietary method of testing total acidity of wine by using a weighted average

  • Alcohol: Alcohol Content

  • Chlorides: Chloride content of wine

  • CitricAcid: Citric Acid Content

  • Density: Density of Wine

  • FixedAcidity: Fixed Acidity of Wine

  • FreeSulfurDioxide: Sulfur Dioxide content of wine

  • LabelAppeal: Marketing Score indicating the appeal of label design for consumers. High numbers suggest customers like the label design. Negative numbers suggest customers don’t like the design.

    • EFFECT: Many consumers purchase based on the visual appeal of the wine label design. Higher numbers suggest better sales.
  • ResidualSugar: Residual Sugar of wine

  • STARS: Wine rating by a team of experts. 4 Stars = Excellent, 1 Star = Poor

    • EFFECT: A high number of stars suggests high sales
  • Sulphates: Sulfate content of wine

  • TotalSulfurDioxide: Total Sulfur Dioxide of Wine

  • VolatileAcidity: Volatile Acid content of wine

  • pH: pH of wine

library(tidyverse)   # attaches ggplot2, dplyr, purrr, tibble, ...
library(reshape2)    # melt(), used below for the density plots
library(caret)
library(e1071)
library(pracma)
library(pROC)
library(psych)
library(kableExtra)
library(Hmisc)
library(VIF)
library(FactoMineR)
library(corrplot)
library(MASS)        # stepAIC(), glm.nb(); note that it masks dplyr::select()
library(mice)
library(gridExtra)
library(lindia)
library(DT)
library(VIM)
library(car)
wine_train <- read.csv("https://raw.githubusercontent.com/javernw/DATA621-Business-Analytics-and-Data-Mining/master/wine-training-data.csv")

wine_eval <- read.csv("https://raw.githubusercontent.com/javernw/DATA621-Business-Analytics-and-Data-Mining/master/wine-evaluation-data.csv")

DATA EXPLORATION

Preview

head(wine_train) %>% as_tibble()
str(wine_train)
## 'data.frame':    12795 obs. of  16 variables:
##  $ INDEX             : int  1 2 4 5 6 7 8 11 12 13 ...
##  $ TARGET            : int  3 3 5 3 4 0 0 4 3 6 ...
##  $ FixedAcidity      : num  3.2 4.5 7.1 5.7 8 11.3 7.7 6.5 14.8 5.5 ...
##  $ VolatileAcidity   : num  1.16 0.16 2.64 0.385 0.33 0.32 0.29 -1.22 0.27 -0.22 ...
##  $ CitricAcid        : num  -0.98 -0.81 -0.88 0.04 -1.26 0.59 -0.4 0.34 1.05 0.39 ...
##  $ ResidualSugar     : num  54.2 26.1 14.8 18.8 9.4 ...
##  $ Chlorides         : num  -0.567 -0.425 0.037 -0.425 NA 0.556 0.06 0.04 -0.007 -0.277 ...
##  $ FreeSulfurDioxide : num  NA 15 214 22 -167 -37 287 523 -213 62 ...
##  $ TotalSulfurDioxide: num  268 -327 142 115 108 15 156 551 NA 180 ...
##  $ Density           : num  0.993 1.028 0.995 0.996 0.995 ...
##  $ pH                : num  3.33 3.38 3.12 2.24 3.12 3.2 3.49 3.2 4.93 3.09 ...
##  $ Sulphates         : num  -0.59 0.7 0.48 1.83 1.77 1.29 1.21 NA 0.26 0.75 ...
##  $ Alcohol           : num  9.9 NA 22 6.2 13.7 15.4 10.3 11.6 15 12.6 ...
##  $ LabelAppeal       : int  0 -1 -1 -1 0 0 0 1 0 0 ...
##  $ AcidIndex         : int  8 7 8 6 9 11 8 7 6 8 ...
##  $ STARS             : int  2 3 3 1 2 NA NA 3 NA 4 ...
summary(wine_train)
##      INDEX           TARGET       FixedAcidity     VolatileAcidity  
##  Min.   :    1   Min.   :0.000   Min.   :-18.100   Min.   :-2.7900  
##  1st Qu.: 4038   1st Qu.:2.000   1st Qu.:  5.200   1st Qu.: 0.1300  
##  Median : 8110   Median :3.000   Median :  6.900   Median : 0.2800  
##  Mean   : 8070   Mean   :3.029   Mean   :  7.076   Mean   : 0.3241  
##  3rd Qu.:12106   3rd Qu.:4.000   3rd Qu.:  9.500   3rd Qu.: 0.6400  
##  Max.   :16129   Max.   :8.000   Max.   : 34.400   Max.   : 3.6800  
##                                                                     
##    CitricAcid      ResidualSugar        Chlorides       FreeSulfurDioxide
##  Min.   :-3.2400   Min.   :-127.800   Min.   :-1.1710   Min.   :-555.00  
##  1st Qu.: 0.0300   1st Qu.:  -2.000   1st Qu.:-0.0310   1st Qu.:   0.00  
##  Median : 0.3100   Median :   3.900   Median : 0.0460   Median :  30.00  
##  Mean   : 0.3084   Mean   :   5.419   Mean   : 0.0548   Mean   :  30.85  
##  3rd Qu.: 0.5800   3rd Qu.:  15.900   3rd Qu.: 0.1530   3rd Qu.:  70.00  
##  Max.   : 3.8600   Max.   : 141.150   Max.   : 1.3510   Max.   : 623.00  
##                    NA's   :616        NA's   :638       NA's   :647      
##  TotalSulfurDioxide    Density             pH          Sulphates      
##  Min.   :-823.0     Min.   :0.8881   Min.   :0.480   Min.   :-3.1300  
##  1st Qu.:  27.0     1st Qu.:0.9877   1st Qu.:2.960   1st Qu.: 0.2800  
##  Median : 123.0     Median :0.9945   Median :3.200   Median : 0.5000  
##  Mean   : 120.7     Mean   :0.9942   Mean   :3.208   Mean   : 0.5271  
##  3rd Qu.: 208.0     3rd Qu.:1.0005   3rd Qu.:3.470   3rd Qu.: 0.8600  
##  Max.   :1057.0     Max.   :1.0992   Max.   :6.130   Max.   : 4.2400  
##  NA's   :682                         NA's   :395     NA's   :1210     
##     Alcohol       LabelAppeal          AcidIndex          STARS      
##  Min.   :-4.70   Min.   :-2.000000   Min.   : 4.000   Min.   :1.000  
##  1st Qu.: 9.00   1st Qu.:-1.000000   1st Qu.: 7.000   1st Qu.:1.000  
##  Median :10.40   Median : 0.000000   Median : 8.000   Median :2.000  
##  Mean   :10.49   Mean   :-0.009066   Mean   : 7.773   Mean   :2.042  
##  3rd Qu.:12.40   3rd Qu.: 1.000000   3rd Qu.: 8.000   3rd Qu.:3.000  
##  Max.   :26.50   Max.   : 2.000000   Max.   :17.000   Max.   :4.000  
##  NA's   :653                                          NA's   :3359

Number of Cases Purchased

cases_purchased <- table(wine_train$TARGET) %>% data.frame()
cases_purchased %>% ggplot(aes(x = Var1, y = Freq)) + geom_bar(stat = "identity", fill = "blue") + labs(x = "Cases", y = "Counts")

Skewness in Data

w1 = reshape2::melt(wine_train[,-1])  # long format: one row per (variable, value) pair
ggplot(w1, aes(x= value)) + 
    geom_density(fill='blue') + facet_wrap(~variable, scales = 'free') 

A few of the variables (TARGET, LabelAppeal, STARS) have multimodal distributions, reflecting that they take only a few integer values. The others appear roughly symmetric and bell-shaped, which suggests they are approximately normally distributed.
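The visual impression from the density plots can be checked numerically. The following is a minimal sketch and not part of the original analysis: `shape_summary` is a hypothetical helper that computes moment-based skewness and excess kurtosis for each numeric column in base R, and it is demonstrated on simulated data rather than on `wine_train`.

```r
# Hypothetical helper: moment-based skewness and excess kurtosis per numeric column.
shape_summary <- function(df) {
  num <- df[vapply(df, is.numeric, logical(1))]
  stats <- lapply(num, function(x) {
    x <- x[!is.na(x)]                                   # ignore missing values
    z <- (x - mean(x)) / sqrt(mean((x - mean(x))^2))    # standardize (population sd)
    c(skewness = mean(z^3), excess_kurtosis = mean(z^4) - 3)
  })
  out <- as.data.frame(do.call(rbind, stats))
  out$variable <- rownames(out)
  rownames(out) <- NULL
  out[, c("variable", "skewness", "excess_kurtosis")]
}

# Simulated stand-in data (assumption: NOT the wine data)
set.seed(1)
demo <- data.frame(
  symmetric    = rnorm(5000),
  right_skewed = rexp(5000),
  with_na      = c(rnorm(4990), rep(NA, 10))
)
shape <- shape_summary(demo)
print(shape)
```

On the wine data the call would be `shape_summary(wine_train[,-1])`. Skewness near zero combined with a large positive excess kurtosis would point to distributions that are symmetric but heavier-tailed than a normal curve.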

Marketing Scores

m_scores <- wine_train$LabelAppeal %>% table() %>% data.frame() %>%  mutate(per = (Freq/sum(Freq))*100)
names(m_scores)[1]<-"score"
lbls <- paste(m_scores$score, "\n", round(m_scores$per, 2)) # add percents to labels
lbls <- paste(lbls,"%",sep="") # add % to labels
pie(m_scores$Freq,labels = lbls, col= c("#990000", "#336600", "#CC6600", "#CCCC00", "#4CC099"), main="Marketing Scores Proportioned")

About 28% of the wines have a negative label appeal score, meaning customers do not favor their label designs.

Boxplot: Exploring Outliers

ggplot(stack(wine_train[,-1]), aes(x = ind, y = values)) + 
  geom_boxplot() +
   theme(legend.position="none") +
  theme(axis.text.x=element_text(angle=45, hjust=1)) + 
  theme(panel.background = element_rect(fill = 'grey'))
## Warning: Removed 8200 rows containing non-finite values (stat_boxplot).

Correlation

wine_corr <- cor(wine_train[,-1], use = "na.or.complete")
corrplot(wine_corr)

We can see a moderate positive correlation between the target variable and the predictors STARS and LabelAppeal.
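To read the strongest relationships off as numbers rather than from the plot, the correlations with the target can be ranked. This is a sketch under assumptions: `target_correlations` is a hypothetical helper, the data are simulated, and it uses pairwise-complete observations, which differs from the `"na.or.complete"` setting used for the plot above.

```r
# Hypothetical helper: correlation of every numeric column with a target column,
# sorted by absolute value. Uses pairwise-complete observations (an assumption).
target_correlations <- function(df, target = "TARGET") {
  cors <- cor(df, use = "pairwise.complete.obs")[, target]
  cors <- cors[names(cors) != target]            # drop the target's self-correlation
  cors[order(abs(cors), decreasing = TRUE)]
}

# Simulated stand-in data (NOT the wine data)
set.seed(2)
n <- 1000
demo <- data.frame(strong = rnorm(n), weak = rnorm(n), noise = rnorm(n))
demo$TARGET <- 2 * demo$strong + 0.3 * demo$weak + rnorm(n)
demo$weak[sample(n, 50)] <- NA                   # some missing values, as in the wine data
ranked <- target_correlations(demo)
print(round(ranked, 3))
```

For the wine data, `target_correlations(wine_train[,-1])` would list the predictors in order of their linear association with TARGET.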

Missing Values

Amelia::missmap(wine_train, col = c("#999900", "#660033"))

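About 4% of the values are missing: the NA counts in the summary above add up to 8,200 of 12,795 × 16 = 204,720 cells. STARS accounts for the largest share, with 3,359 missing ratings. We handle the missing values in the data preparation step.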
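The missingness map gives the overall picture; a per-column table makes it easier to see which variables drive it. The sketch below uses a hypothetical helper, `missing_summary`, on a toy data frame that stands in for the wine data.

```r
# Hypothetical helper: count and percentage of missing values per column.
missing_summary <- function(df) {
  out <- data.frame(
    variable    = names(df),
    n_missing   = colSums(is.na(df)),
    pct_missing = round(100 * colMeans(is.na(df)), 2),
    row.names   = NULL
  )
  out[order(out$n_missing, decreasing = TRUE), ]
}

# Toy stand-in data (NOT the wine data)
demo <- data.frame(a = c(1, NA, 3, 4), b = c(NA, NA, 3, 4), c = 1:4)
miss <- missing_summary(demo)
print(miss)

overall_pct <- 100 * mean(is.na(demo))   # share of all cells that are missing
print(overall_pct)
```

Applied to the raw training data, `missing_summary(wine_train)` would report the same NA counts that `summary(wine_train)` prints, sorted from most to least affected.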

DATA PREPARATION

Handling Missing Values

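wine_train <- wine_train[,-1]  # drop INDEX (an identifier, not a predictor)
# Impute the predictors only (TARGET, now column 1, is excluded) by predictive mean matching
temp <- mice(wine_train[,-1],m=5,maxit=10,meth='pmm',seed=500, printFlag = F)
temp <- complete(temp)  # by default returns the first of the 5 imputed data sets
temp$TARGET <- wine_train$TARGET  # re-attach the response as the last column
wine_train <- temp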
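The assignment hint notes that missingness itself can be predictive, and imputation erases that information unless it is recorded first. The following sketch shows one way to do so. It is an assumption rather than the original analysis: `add_missing_flags` is a hypothetical helper, the data are a toy example, and on the real data it would have to run before the `mice()` call.

```r
# Hypothetical helper: add a 0/1 "<name>_missing" column for every column containing NAs.
add_missing_flags <- function(df, exclude = character(0)) {
  cols <- setdiff(names(df)[colSums(is.na(df)) > 0], exclude)
  for (col in cols) {
    df[[paste0(col, "_missing")]] <- as.integer(is.na(df[[col]]))
  }
  df
}

# Toy stand-in data (NOT the wine data)
demo <- data.frame(
  TARGET  = c(0, 3, 4, 0, 5),
  STARS   = c(NA, 2, 3, NA, 4),
  Alcohol = c(9.9, NA, 12.1, 10.3, 11.0)
)
flagged <- add_missing_flags(demo, exclude = "TARGET")
print(flagged)

# Mean TARGET for rows with (1) and without (0) a STARS rating
by_flag <- tapply(flagged$TARGET, flagged$STARS_missing, mean)
print(by_flag)
```

If the group means differ clearly on the real data, the flag columns are candidates for inclusion as derived predictors alongside the imputed values.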

BUILD MODELS

(at least two for each)

Poisson Models

p_mod1 <- glm(TARGET ~., family="poisson", data=wine_train)
summary(p_mod1)
## 
## Call:
## glm(formula = TARGET ~ ., family = "poisson", data = wine_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.8802  -0.4949   0.2216   0.6259   2.6109  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         2.167e+00  1.952e-01  11.099  < 2e-16 ***
## FixedAcidity       -4.152e-04  8.196e-04  -0.507  0.61240    
## VolatileAcidity    -5.151e-02  6.487e-03  -7.940 2.02e-15 ***
## CitricAcid          1.393e-02  5.894e-03   2.364  0.01810 *  
## ResidualSugar       1.854e-04  1.505e-04   1.232  0.21811    
## Chlorides          -6.561e-02  1.593e-02  -4.119 3.80e-05 ***
## FreeSulfurDioxide   1.360e-04  3.431e-05   3.965 7.35e-05 ***
## TotalSulfurDioxide  9.759e-05  2.196e-05   4.445 8.80e-06 ***
## Density            -4.446e-01  1.919e-01  -2.317  0.02051 *  
## pH                 -2.442e-02  7.520e-03  -3.247  0.00117 ** 
## Sulphates          -1.481e-02  5.460e-03  -2.712  0.00668 ** 
## Alcohol             5.430e-03  1.373e-03   3.954 7.70e-05 ***
## LabelAppeal         1.999e-01  6.118e-03  32.680  < 2e-16 ***
## AcidIndex          -1.235e-01  4.464e-03 -27.672  < 2e-16 ***
## STARS               1.606e-01  5.836e-03  27.523  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 22861  on 12794  degrees of freedom
## Residual deviance: 18809  on 12780  degrees of freedom
## AIC: 50782
## 
## Number of Fisher Scoring iterations: 5
p_mod2 <- stepAIC(p_mod1, trace = F)
summary(p_mod2)
## 
## Call:
## glm(formula = TARGET ~ VolatileAcidity + CitricAcid + Chlorides + 
##     FreeSulfurDioxide + TotalSulfurDioxide + Density + pH + Sulphates + 
##     Alcohol + LabelAppeal + AcidIndex + STARS, family = "poisson", 
##     data = wine_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.8956  -0.4974   0.2212   0.6256   2.6176  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         2.168e+00  1.952e-01  11.102  < 2e-16 ***
## VolatileAcidity    -5.161e-02  6.487e-03  -7.956 1.78e-15 ***
## CitricAcid          1.383e-02  5.893e-03   2.346  0.01898 *  
## Chlorides          -6.560e-02  1.593e-02  -4.119 3.80e-05 ***
## FreeSulfurDioxide   1.363e-04  3.431e-05   3.974 7.08e-05 ***
## TotalSulfurDioxide  9.806e-05  2.195e-05   4.467 7.94e-06 ***
## Density            -4.444e-01  1.919e-01  -2.316  0.02057 *  
## pH                 -2.427e-02  7.519e-03  -3.228  0.00125 ** 
## Sulphates          -1.491e-02  5.459e-03  -2.732  0.00630 ** 
## Alcohol             5.397e-03  1.373e-03   3.931 8.46e-05 ***
## LabelAppeal         2.000e-01  6.118e-03  32.689  < 2e-16 ***
## AcidIndex          -1.239e-01  4.411e-03 -28.088  < 2e-16 ***
## STARS               1.607e-01  5.835e-03  27.535  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 22861  on 12794  degrees of freedom
## Residual deviance: 18811  on 12782  degrees of freedom
## AIC: 50779
## 
## Number of Fisher Scoring iterations: 5
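Because the Poisson model uses a log link, its coefficients are on the log scale; exponentiating them gives the multiplicative change in the expected number of cases per one-unit increase in a predictor. The sketch below applies this to three estimates copied from the `p_mod2` summary above.

```r
# Estimates copied from the p_mod2 summary above (log scale)
coefs <- c(LabelAppeal = 0.2000, STARS = 0.1607, AcidIndex = -0.1239)

rate_ratios <- exp(coefs)               # multiplicative effect on the expected count
pct_change  <- 100 * (rate_ratios - 1)  # the same effect as a percentage change
print(round(cbind(rate_ratios, pct_change), 3))
```

Holding the other predictors fixed, each additional LabelAppeal point corresponds to roughly 22% more expected cases, each additional star to about 17% more, and each additional AcidIndex point to about 12% fewer. With the fitted object available, `exp(coef(p_mod2))` gives the full set.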

Negative Binomial Models

nb_mod1 <- glm.nb(TARGET ~., data = wine_train)
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
summary(nb_mod1)
## 
## Call:
## glm.nb(formula = TARGET ~ ., data = wine_train, init.theta = 32958.95846, 
##     link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.8800  -0.4949   0.2215   0.6259   2.6109  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         2.167e+00  1.952e-01  11.099  < 2e-16 ***
## FixedAcidity       -4.153e-04  8.197e-04  -0.507  0.61240    
## VolatileAcidity    -5.151e-02  6.488e-03  -7.940 2.02e-15 ***
## CitricAcid          1.393e-02  5.894e-03   2.364  0.01810 *  
## ResidualSugar       1.854e-04  1.505e-04   1.232  0.21812    
## Chlorides          -6.561e-02  1.593e-02  -4.119 3.80e-05 ***
## FreeSulfurDioxide   1.360e-04  3.431e-05   3.965 7.35e-05 ***
## TotalSulfurDioxide  9.759e-05  2.196e-05   4.445 8.80e-06 ***
## Density            -4.446e-01  1.919e-01  -2.317  0.02052 *  
## pH                 -2.442e-02  7.521e-03  -3.247  0.00117 ** 
## Sulphates          -1.481e-02  5.460e-03  -2.712  0.00668 ** 
## Alcohol             5.430e-03  1.373e-03   3.953 7.71e-05 ***
## LabelAppeal         1.999e-01  6.118e-03  32.678  < 2e-16 ***
## AcidIndex          -1.235e-01  4.464e-03 -27.671  < 2e-16 ***
## STARS               1.606e-01  5.836e-03  27.522  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(32958.96) family taken to be 1)
## 
##     Null deviance: 22859  on 12794  degrees of freedom
## Residual deviance: 18808  on 12780  degrees of freedom
## AIC: 50784
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  32959 
##           Std. Err.:  59343 
## Warning while fitting theta: iteration limit reached 
## 
##  2 x log-likelihood:  -50751.61
nb_mod2 <- stepAIC(nb_mod1, trace = F)
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in glm.nb(formula = TARGET ~ FixedAcidity + VolatileAcidity + CitricAcid
## + : alternation limit reached
summary(nb_mod2)
## 
## Call:
## glm.nb(formula = TARGET ~ VolatileAcidity + CitricAcid + Chlorides + 
##     FreeSulfurDioxide + TotalSulfurDioxide + Density + pH + Sulphates + 
##     Alcohol + LabelAppeal + AcidIndex + STARS, data = wine_train, 
##     init.theta = 32956.86552, link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.8954  -0.4974   0.2212   0.6256   2.6176  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         2.168e+00  1.953e-01  11.101  < 2e-16 ***
## VolatileAcidity    -5.161e-02  6.487e-03  -7.956 1.78e-15 ***
## CitricAcid          1.383e-02  5.894e-03   2.346  0.01899 *  
## Chlorides          -6.561e-02  1.593e-02  -4.119 3.81e-05 ***
## FreeSulfurDioxide   1.363e-04  3.431e-05   3.973 7.08e-05 ***
## TotalSulfurDioxide  9.806e-05  2.195e-05   4.467 7.94e-06 ***
## Density            -4.444e-01  1.919e-01  -2.316  0.02057 *  
## pH                 -2.427e-02  7.519e-03  -3.228  0.00125 ** 
## Sulphates          -1.491e-02  5.459e-03  -2.732  0.00630 ** 
## Alcohol             5.397e-03  1.373e-03   3.931 8.47e-05 ***
## LabelAppeal         2.000e-01  6.118e-03  32.687  < 2e-16 ***
## AcidIndex          -1.239e-01  4.411e-03 -28.087  < 2e-16 ***
## STARS               1.607e-01  5.835e-03  27.534  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(32956.87) family taken to be 1)
## 
##     Null deviance: 22859  on 12794  degrees of freedom
## Residual deviance: 18810  on 12782  degrees of freedom
## AIC: 50781
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  32957 
##           Std. Err.:  59361 
## Warning while fitting theta: iteration limit reached 
## 
##  2 x log-likelihood:  -50753.41
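The negative binomial estimates above are nearly identical to the Poisson ones, and theta is estimated at about 32,957 with an even larger standard error. In the `glm.nb` parameterization the variance is mu + mu^2/theta, so a very large theta means the model has effectively collapsed to a Poisson fit. One way to check for overdispersion directly is the Pearson dispersion statistic. The sketch below uses a hypothetical helper, `pearson_dispersion`, on simulated counts rather than the wine data.

```r
# Hypothetical helper: Pearson chi-square divided by residual degrees of freedom.
# Values near 1 are consistent with the Poisson assumption variance = mean.
pearson_dispersion <- function(model) {
  sum(residuals(model, type = "pearson")^2) / df.residual(model)
}

# Simulated stand-in data (NOT the wine data)
set.seed(3)
n  <- 2000
x  <- rnorm(n)
mu <- exp(0.5 + 0.3 * x)
y_pois <- rpois(n, mu)                    # equidispersed counts
y_over <- rnbinom(n, mu = mu, size = 1)   # overdispersed: variance = mu + mu^2 / 1

fit_pois <- glm(y_pois ~ x, family = poisson)
fit_over <- glm(y_over ~ x, family = poisson)

disp_pois <- pearson_dispersion(fit_pois)
disp_over <- pearson_dispersion(fit_over)
print(c(equidispersed = disp_pois, overdispersed = disp_over))
```

For the models in this report the call would be `pearson_dispersion(p_mod1)`. The ratio of residual deviance to degrees of freedom in the Poisson summary (18809 / 12780, about 1.47) is a related but rougher indicator.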

Multiple Linear Regression Models

lm_mod1 <- lm(TARGET ~., data = wine_train)
summary(lm_mod1)
## 
## Call:
## lm(formula = TARGET ~ ., data = wine_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.9212 -0.7015  0.3945  1.1197  4.3514 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         5.829e+00  5.628e-01  10.356  < 2e-16 ***
## FixedAcidity       -9.759e-04  2.365e-03  -0.413  0.67985    
## VolatileAcidity    -1.555e-01  1.878e-02  -8.283  < 2e-16 ***
## CitricAcid          3.964e-02  1.709e-02   2.320  0.02038 *  
## ResidualSugar       5.833e-04  4.355e-04   1.339  0.18045    
## Chlorides          -2.042e-01  4.587e-02  -4.452 8.59e-06 ***
## FreeSulfurDioxide   4.137e-04  9.914e-05   4.173 3.03e-05 ***
## TotalSulfurDioxide  2.792e-04  6.324e-05   4.415 1.02e-05 ***
## Density            -1.286e+00  5.544e-01  -2.319  0.02042 *  
## pH                 -6.329e-02  2.166e-02  -2.922  0.00348 ** 
## Sulphates          -4.275e-02  1.576e-02  -2.712  0.00669 ** 
## Alcohol             1.857e-02  3.954e-03   4.697 2.67e-06 ***
## LabelAppeal         6.010e-01  1.756e-02  34.217  < 2e-16 ***
## AcidIndex          -3.256e-01  1.144e-02 -28.463  < 2e-16 ***
## STARS               5.223e-01  1.751e-02  29.826  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.661 on 12780 degrees of freedom
## Multiple R-squared:  0.257,  Adjusted R-squared:  0.2562 
## F-statistic: 315.7 on 14 and 12780 DF,  p-value: < 2.2e-16
lm_mod2 <- stepAIC(lm_mod1, trace = F)
summary(lm_mod2)
## 
## Call:
## lm(formula = TARGET ~ VolatileAcidity + CitricAcid + Chlorides + 
##     FreeSulfurDioxide + TotalSulfurDioxide + Density + pH + Sulphates + 
##     Alcohol + LabelAppeal + AcidIndex + STARS, data = wine_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.9462 -0.7008  0.3930  1.1189  4.3585 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         5.828e+00  5.628e-01  10.356  < 2e-16 ***
## VolatileAcidity    -1.557e-01  1.878e-02  -8.290  < 2e-16 ***
## CitricAcid          3.945e-02  1.709e-02   2.308  0.02099 *  
## Chlorides          -2.046e-01  4.587e-02  -4.460 8.25e-06 ***
## FreeSulfurDioxide   4.151e-04  9.912e-05   4.188 2.84e-05 ***
## TotalSulfurDioxide  2.808e-04  6.323e-05   4.441 9.04e-06 ***
## Density            -1.282e+00  5.544e-01  -2.313  0.02075 *  
## pH                 -6.294e-02  2.166e-02  -2.906  0.00367 ** 
## Sulphates          -4.308e-02  1.575e-02  -2.735  0.00625 ** 
## Alcohol             1.847e-02  3.953e-03   4.672 3.01e-06 ***
## LabelAppeal         6.010e-01  1.756e-02  34.221  < 2e-16 ***
## AcidIndex          -3.265e-01  1.126e-02 -28.999  < 2e-16 ***
## STARS               5.226e-01  1.751e-02  29.844  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.661 on 12782 degrees of freedom
## Multiple R-squared:  0.2569, Adjusted R-squared:  0.2562 
## F-statistic: 368.2 on 12 and 12782 DF,  p-value: < 2.2e-16

Zero Inflated Models

Poisson
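The source leaves this subsection without code. The following is only a sketch of how a zero-inflated Poisson model could be fitted. It assumes the `pscl` package, which is not among the packages loaded above, and uses simulated data with a single illustrative predictor instead of the wine variables.

```r
library(pscl)  # assumption: pscl is installed; it is not loaded elsewhere in this report

# Simulated stand-in data (NOT the wine data)
set.seed(4)
n <- 2000
stars <- sample(1:4, n, replace = TRUE)
p_zero  <- plogis(1 - 0.8 * stars)        # structural zeros more likely at low ratings
is_zero <- rbinom(n, 1, p_zero)
counts  <- ifelse(is_zero == 1, 0L, rpois(n, exp(0.5 + 0.25 * stars)))
demo <- data.frame(TARGET = counts, STARS = stars)

# Count model before "|", zero-inflation (logit) model after it
zip_fit <- zeroinfl(TARGET ~ STARS | STARS, data = demo, dist = "poisson")
cf <- coef(zip_fit)
print(cf)
```

The coefficients prefixed `count_` describe the expected count among wines that can sell, while those prefixed `zero_` describe the log-odds of a structural zero. Passing `dist = "negbin"` to the same call fits the zero-inflated negative binomial counterpart.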

Negative Binomial

SELECT MODELS
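The source does not yet fill in this section. One possible approach, sketched here under assumptions, is to compare AIC among models from the same family and to compare all models, including the linear ones, by prediction error on held-out data. The data below are simulated, and `rmse` is a hypothetical helper.

```r
# Hypothetical helper: root mean squared prediction error.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Simulated stand-in data (NOT the wine data); x2 is pure noise
set.seed(5)
n <- 3000
demo <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
demo$TARGET <- rpois(n, exp(1 + 0.4 * demo$x1))

train_idx <- sample(n, size = 0.8 * n)    # 80/20 train/hold-out split
train <- demo[train_idx, ]
test  <- demo[-train_idx, ]

fits <- list(
  poisson_full = glm(TARGET ~ x1 + x2, family = poisson, data = train),
  poisson_x1   = glm(TARGET ~ x1, family = poisson, data = train),
  linear_full  = lm(TARGET ~ x1 + x2, data = train)
)

# type = "response" returns expected counts for the Poisson fits
holdout_rmse <- sapply(fits, function(m) {
  rmse(test$TARGET, predict(m, newdata = test, type = "response"))
})
print(round(holdout_rmse, 3))

# AIC compared within the Poisson family only
aic_table <- AIC(fits$poisson_full, fits$poisson_x1)
print(aic_table)
```

Among the four count models fitted above, the reported AIC values are 50782, 50779, 50784 and 50781, so the stepwise Poisson model (`p_mod2`) has the lowest AIC by a small margin.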