DATA 621 Homework 5, Group : Banu Boopalan, Gregg Maloy, Alexander Moyse, Umais Siddiqui

Running Code

Code
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, error = FALSE)

Overview

In this homework assignment, you will explore, analyze and model a data set containing information on approximately 12,000 commercially available wines. The variables are mostly related to the chemical properties of the wine being sold. The response variable is the number of sample cases of wine that were purchased by wine distribution companies after sampling a wine. These cases would be used to provide tasting samples to restaurants and wine stores around the United States. The more sample cases purchased, the more likely is a wine to be sold at a high end restaurant. A large wine manufacturer is studying the data in order to predict the number of wine cases ordered based upon the wine characteristics. If the wine manufacturer can predict the number of cases, then that manufacturer will be able to adjust their wine offering to maximize sales.

Your objective is to build a count regression model to predict the number of cases of wine that will be sold given certain properties of the wine. HINT: Sometimes, the fact that a variable is missing is actually predictive of the target. You can only use the variables given to you (or variables that you derive from the variables provided). Below is a short description of the variables of interest in the data set:

Code
set.seed(200)


wine_train <- read.csv("C:/Users/Banu/Documents/RScriptfiles/DATA621/Homework 5/HOMEWORK 5/wine-training-data.csv",stringsAsFactors = F,header=TRUE)

wine_test <- read.csv("C:/Users/Banu/Documents/RScriptfiles/DATA621/Homework 5/HOMEWORK 5/wine-evaluation-data.csv",stringsAsFactors = F,header=TRUE)



variables <- c("INDEX", "TARGET", " ", " ", "ACID INDEX", "ALCOHOL", "CHLORIDES", "CITRIC ACID", "DENSITY", "FIXED ACIDITY", "FREE SULFUR DIOXIDE", "LABEL APPEAL", "RESIDUAL SUGAR", "STARS", "SULPHATES", "TOTAL SULFUR DIOXIDE", "VOLATILE ACIDITY", "pH")
define <- c("Identification Variable (do not use)", "Number of Cases Purchased", " ", " ", "Proprietary method of testing total acidity of wine by using a weighted average", "Alcohol Content", "Chloride content of wine", "Citric Acid Content", "Density of Wine", "Fixed Acidity of Wine", "Sulfur Dioxide content of wine", "Marketing Score indicating the appeal of label design for consumers. High numbers suggest customers like the label design. Negative numbers suggest customers don't like design.", "Residual Sugar of wine", "Wine rating by a team of experts. 4 Stars = Excellent, 1 Star = Poor", "Sulfate content of wine", "Total Sulfur Dioxide of Wine", "Volatile Acid content of wine.", "pH of wine")
theoreffect <- c("None", "None", " ", " ", " ", " ", " ", " ", " ", " ", " ", "Many consumers purchase based on the visual appeal of the wine label design. Higher numbers suggest better sales.", " ", "A high number of stars suggests high sales", " ", " ", " ", " ")

kable(cbind(variables, define, theoreffect), col.names = c("VARIABLE NAME", "DEFINITION", "THEORETICAL EFFECT")) %>% 
 kable_paper(full_width = T)
VARIABLE NAME DEFINITION THEORETICAL EFFECT
INDEX Identification Variable (do not use) None
TARGET Number of Cases Purchased None
ACID INDEX Proprietary method of testing total acidity of wine by using a weighted average
ALCOHOL Alcohol Content
CHLORIDES Chloride content of wine
CITRIC ACID Citric Acid Content
DENSITY Density of Wine
FIXED ACIDITY Fixed Acidity of Wine
FREE SULFUR DIOXIDE Sulfur Dioxide content of wine
LABEL APPEAL Marketing Score indicating the appeal of label design for consumers. High numbers suggest customers like the label design. Negative numbers suggest customers don't like design. Many consumers purchase based on the visual appeal of the wine label design. Higher numbers suggest better sales.
RESIDUAL SUGAR Residual Sugar of wine
STARS Wine rating by a team of experts. 4 Stars = Excellent, 1 Star = Poor A high number of stars suggests high sales
SULPHATES Sulfate content of wine
TOTAL SULFUR DIOXIDE Total Sulfur Dioxide of Wine
VOLATILE ACIDITY Volatile Acid content of wine.
pH pH of wine
Code
# remove index column as it is not needed
wine_train <- wine_train %>% 
  dplyr::select(-"INDEX")

wine_test <- wine_test %>% 
  dplyr::select(-"IN")


head(wine_train)
  TARGET FixedAcidity VolatileAcidity CitricAcid ResidualSugar Chlorides
1      3          3.2           1.160      -0.98          54.2    -0.567
2      3          4.5           0.160      -0.81          26.1    -0.425
3      5          7.1           2.640      -0.88          14.8     0.037
4      3          5.7           0.385       0.04          18.8    -0.425
5      4          8.0           0.330      -1.26           9.4        NA
6      0         11.3           0.320       0.59           2.2     0.556
  FreeSulfurDioxide TotalSulfurDioxide Density   pH Sulphates Alcohol
1                NA                268 0.99280 3.33     -0.59     9.9
2                15               -327 1.02792 3.38      0.70      NA
3               214                142 0.99518 3.12      0.48    22.0
4                22                115 0.99640 2.24      1.83     6.2
5              -167                108 0.99457 3.12      1.77    13.7
6               -37                 15 0.99940 3.20      1.29    15.4
  LabelAppeal AcidIndex STARS
1           0         8     2
2          -1         7     3
3          -1         8     3
4          -1         6     1
5           0         9     2
6           0        11    NA
Code
head(wine_test)
  TARGET FixedAcidity VolatileAcidity CitricAcid ResidualSugar Chlorides
1     NA          5.4          -0.860       0.27         -10.7     0.092
2     NA         12.4           0.385      -0.76         -19.7     1.169
3     NA          7.2           1.750       0.17         -33.0     0.065
4     NA          6.2           0.100       1.80           1.0    -0.179
5     NA         11.4           0.210       0.28           1.2     0.038
6     NA         17.6           0.040      -1.15           1.4     0.535
  FreeSulfurDioxide TotalSulfurDioxide Density   pH Sulphates Alcohol
1                23                398 0.98527 5.02      0.64   12.30
2               -37                 68 0.99048 3.37      1.09   16.00
3                 9                 76 1.04641 4.61      0.68    8.55
4               104                 89 0.98877 3.20      2.11   12.30
5                70                 53 1.02899 2.54     -0.07    4.80
6              -250                140 0.95028 3.06     -0.02   11.40
  LabelAppeal AcidIndex STARS
1          -1         6    NA
2           0         6     2
3           0         8     1
4          -1         8     1
5           0        10    NA
6           1         8     4
Code
skim(wine_test)
Data summary
Name wine_test
Number of rows 3335
Number of columns 15
_______________________
Column type frequency:
logical 1
numeric 14
________________________
Group variables None

Variable type: logical

skim_variable n_missing complete_rate mean count
TARGET 3335 0 NaN :

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
FixedAcidity 0 1.00 6.86 6.32 -18.20 5.20 6.90 9.00 33.50 ▁▂▇▂▁
VolatileAcidity 0 1.00 0.31 0.81 -2.83 0.08 0.28 0.63 3.61 ▁▂▇▂▁
CitricAcid 0 1.00 0.31 0.87 -3.12 0.00 0.31 0.60 3.76 ▁▂▇▂▁
ResidualSugar 168 0.95 5.32 34.37 -128.30 -2.60 3.60 17.20 145.40 ▁▂▇▂▁
Chlorides 138 0.96 0.06 0.31 -1.15 0.02 0.05 0.17 1.26 ▁▂▇▂▁
FreeSulfurDioxide 152 0.95 34.95 149.63 -563.00 3.00 30.00 79.25 617.00 ▁▂▇▂▁
TotalSulfurDioxide 157 0.95 123.41 225.80 -769.00 27.25 124.00 210.00 1004.00 ▁▂▇▂▁
Density 0 1.00 0.99 0.03 0.89 0.99 0.99 1.00 1.10 ▁▂▇▂▁
pH 104 0.97 3.24 0.68 0.60 2.98 3.21 3.49 6.21 ▁▂▇▂▁
Sulphates 310 0.91 0.53 0.91 -3.07 0.33 0.50 0.82 4.18 ▁▂▇▂▁
Alcohol 185 0.94 10.58 3.76 -4.20 9.00 10.40 12.50 25.60 ▁▂▇▂▁
LabelAppeal 0 1.00 0.01 0.89 -2.00 -1.00 0.00 1.00 2.00 ▁▅▇▅▁
AcidIndex 0 1.00 7.75 1.32 5.00 7.00 8.00 8.00 17.00 ▇▇▁▁▁
STARS 841 0.75 2.04 0.91 1.00 1.00 2.00 3.00 4.00 ▇▇▁▅▂

DATA EXPLORATION

Missing/Summary Review

Code
skim(wine_train)
Data summary
Name wine_train
Number of rows 12795
Number of columns 15
_______________________
Column type frequency:
numeric 15
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
TARGET 0 1.00 3.03 1.93 0.00 2.00 3.00 4.00 8.00 ▆▇▇▆▁
FixedAcidity 0 1.00 7.08 6.32 -18.10 5.20 6.90 9.50 34.40 ▁▂▇▂▁
VolatileAcidity 0 1.00 0.32 0.78 -2.79 0.13 0.28 0.64 3.68 ▁▂▇▂▁
CitricAcid 0 1.00 0.31 0.86 -3.24 0.03 0.31 0.58 3.86 ▁▂▇▂▁
ResidualSugar 616 0.95 5.42 33.75 -127.80 -2.00 3.90 15.90 141.15 ▁▂▇▂▁
Chlorides 638 0.95 0.05 0.32 -1.17 -0.03 0.05 0.15 1.35 ▁▂▇▂▁
FreeSulfurDioxide 647 0.95 30.85 148.71 -555.00 0.00 30.00 70.00 623.00 ▁▂▇▂▁
TotalSulfurDioxide 682 0.95 120.71 231.91 -823.00 27.00 123.00 208.00 1057.00 ▁▂▇▂▁
Density 0 1.00 0.99 0.03 0.89 0.99 0.99 1.00 1.10 ▁▂▇▂▁
pH 395 0.97 3.21 0.68 0.48 2.96 3.20 3.47 6.13 ▁▂▇▂▁
Sulphates 1210 0.91 0.53 0.93 -3.13 0.28 0.50 0.86 4.24 ▁▂▇▂▁
Alcohol 653 0.95 10.49 3.73 -4.70 9.00 10.40 12.40 26.50 ▁▂▇▂▁
LabelAppeal 0 1.00 -0.01 0.89 -2.00 -1.00 0.00 1.00 2.00 ▁▅▇▅▁
AcidIndex 0 1.00 7.77 1.32 4.00 7.00 8.00 8.00 17.00 ▁▇▁▁▁
STARS 3359 0.74 2.04 0.90 1.00 1.00 2.00 3.00 4.00 ▇▇▁▅▂
Code
skim(wine_test)
Data summary
Name wine_test
Number of rows 3335
Number of columns 15
_______________________
Column type frequency:
logical 1
numeric 14
________________________
Group variables None

Variable type: logical

skim_variable n_missing complete_rate mean count
TARGET 3335 0 NaN :

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
FixedAcidity 0 1.00 6.86 6.32 -18.20 5.20 6.90 9.00 33.50 ▁▂▇▂▁
VolatileAcidity 0 1.00 0.31 0.81 -2.83 0.08 0.28 0.63 3.61 ▁▂▇▂▁
CitricAcid 0 1.00 0.31 0.87 -3.12 0.00 0.31 0.60 3.76 ▁▂▇▂▁
ResidualSugar 168 0.95 5.32 34.37 -128.30 -2.60 3.60 17.20 145.40 ▁▂▇▂▁
Chlorides 138 0.96 0.06 0.31 -1.15 0.02 0.05 0.17 1.26 ▁▂▇▂▁
FreeSulfurDioxide 152 0.95 34.95 149.63 -563.00 3.00 30.00 79.25 617.00 ▁▂▇▂▁
TotalSulfurDioxide 157 0.95 123.41 225.80 -769.00 27.25 124.00 210.00 1004.00 ▁▂▇▂▁
Density 0 1.00 0.99 0.03 0.89 0.99 0.99 1.00 1.10 ▁▂▇▂▁
pH 104 0.97 3.24 0.68 0.60 2.98 3.21 3.49 6.21 ▁▂▇▂▁
Sulphates 310 0.91 0.53 0.91 -3.07 0.33 0.50 0.82 4.18 ▁▂▇▂▁
Alcohol 185 0.94 10.58 3.76 -4.20 9.00 10.40 12.50 25.60 ▁▂▇▂▁
LabelAppeal 0 1.00 0.01 0.89 -2.00 -1.00 0.00 1.00 2.00 ▁▅▇▅▁
AcidIndex 0 1.00 7.75 1.32 5.00 7.00 8.00 8.00 17.00 ▇▇▁▁▁
STARS 841 0.75 2.04 0.91 1.00 1.00 2.00 3.00 4.00 ▇▇▁▅▂

Visualizations

Code
# histogram
wine_train %>% 
  dplyr::select(-c("AcidIndex", "STARS", "TARGET", "LabelAppeal")) %>% 
  gather() %>% 
  ggplot(aes(value)) +
  facet_wrap(~key, scale = "free",  ncol = 3) +
  geom_histogram(binwidth = function(x) 2 * IQR(x) / (length(x)^(1/3)), fill="light blue") +
  theme_minimal()

Code
wine_train %>% 
  ggplot(aes(x=TARGET)) + 
  geom_histogram(fill='light blue')

Code
# Prepare data for ggplot
wine_df <- wine_train %>% 
  gather(key = 'variable', value = 'value')
# Boxplots for each variable
ggplot(wine_df, aes(variable, value)) + 
  geom_boxplot() + 
  facet_wrap(. ~variable, scales='free', ncol=6)

Code
wine_factor <- wine_train %>% 
  dplyr::select(TARGET, STARS, LabelAppeal, AcidIndex) %>%
  pivot_longer(cols = -TARGET, names_to="variable", values_to="value") %>%
  arrange(variable, value)

wine_factor %>% 
  ggplot(mapping = aes(x = factor(value), y = TARGET)) +
    geom_boxplot() + 
    facet_wrap(.~variable, scales="free") +
    theme_bw() +
    theme(axis.text.x = element_text(angle = 90))

Code
#if(!require(devtools)) install.packages("devtools")
#devtools::install_github("kassambara/ggpubr")


library(ggpubr)

# barchart
w3 <- wine_train %>% 
  dplyr::select(TARGET, STARS) %>% 
  mutate(STARS = as.factor(STARS),
         TARGET = as.factor(TARGET)) %>% 
  ggplot(aes(STARS)) +
  geom_bar(aes(fill = TARGET)) +
  theme_minimal()
w4 <- wine_train %>%
  dplyr::select(TARGET, LabelAppeal) %>% 
  mutate(STARS = as.factor(LabelAppeal),
         TARGET = as.factor(TARGET)) %>% 
  ggplot(aes(LabelAppeal)) +
  geom_bar(aes(fill = TARGET)) +
  theme_minimal()
w5 <- wine_train %>% 
  dplyr::select(TARGET, AcidIndex) %>% 
  mutate(STARS = as.factor(AcidIndex),
         TARGET = as.factor(TARGET)) %>% 
  ggplot(aes(AcidIndex)) +
  geom_bar(aes(fill = TARGET)) +
  theme_minimal()
ggarrange(w5, ggarrange(w3, w4, ncol = 2, nrow = 1, legend = "none"), nrow = 2, common.legend = TRUE)

Code
# top correlation
wtrain_corr <- wine_train %>% 
  drop_na() %>% 
  cor()
kable(sort(wtrain_corr[,1], decreasing = T), col.names = c("Correlation")) %>% 
  kable_styling(full_width = F)
Correlation
TARGET 1.0000000
STARS 0.5546857
LabelAppeal 0.4979465
Alcohol 0.0737771
FreeSulfurDioxide 0.0226398
TotalSulfurDioxide 0.0216021
ResidualSugar 0.0035196
CitricAcid 0.0023450
pH 0.0002199
FixedAcidity -0.0125381
Sulphates -0.0212204
Chlorides -0.0304301
Density -0.0475989
VolatileAcidity -0.0759979
AcidIndex -0.1676431
Code
library(VIM)
aggr(wine_train, 
     sortVars=TRUE, 
     labels=names(wine_train), 
     cex.axis=.5, 
     bars = FALSE, 
     col = c("white", "#E46726"),
     combined = TRUE,
     #border = NA,
     ylab = "Missing Values") 


 Variables sorted by number of missings: 
           Variable Count
              STARS  3359
          Sulphates  1210
 TotalSulfurDioxide   682
            Alcohol   653
  FreeSulfurDioxide   647
          Chlorides   638
      ResidualSugar   616
                 pH   395
             TARGET     0
       FixedAcidity     0
    VolatileAcidity     0
         CitricAcid     0
            Density     0
        LabelAppeal     0
          AcidIndex     0
Code
library(corrplot)
library(RColorBrewer)
# correlation plot
corrplot(wtrain_corr, 
         method = "number", 
         type = "lower",
         col = brewer.pal(n = 15, name = "Reds"),
         number.cex = .7, tl.cex = .7,
         tl.col = "black", tl.srt = 45)

Code
corrplot(wtrain_corr, tl.cex = .9, tl.col="black", method = 'color', addCoef.col = "black",
         type="upper", order="hclust", number.cex=0.6, diag=FALSE)

DATA PREPARATION

Code
library(VIM)
# missing value columns
aggr(wine_train, 
     sortVars=TRUE, 
     labels=names(wine_train), 
     cex.axis=.5, 
     bars = FALSE, 
     col = c("white", "light blue"),
     combined = TRUE,
     #border = NA,
     ylab = "Missing Values") 


 Variables sorted by number of missings: 
           Variable Count
              STARS  3359
          Sulphates  1210
 TotalSulfurDioxide   682
            Alcohol   653
  FreeSulfurDioxide   647
          Chlorides   638
      ResidualSugar   616
                 pH   395
             TARGET     0
       FixedAcidity     0
    VolatileAcidity     0
         CitricAcid     0
            Density     0
        LabelAppeal     0
          AcidIndex     0
Code
library(VIM)
# missing value columns
aggr(wine_test, 
     sortVars=TRUE, 
     labels=names(wine_test), 
     cex.axis=.5, 
     bars = FALSE, 
     col = c("white", "light blue"),
     combined = TRUE,
     #border = NA,
     ylab = "Missing Values") 


 Variables sorted by number of missings: 
           Variable Count
             TARGET  3335
              STARS   841
          Sulphates   310
            Alcohol   185
      ResidualSugar   168
 TotalSulfurDioxide   157
  FreeSulfurDioxide   152
          Chlorides   138
                 pH   104
       FixedAcidity     0
    VolatileAcidity     0
         CitricAcid     0
            Density     0
        LabelAppeal     0
          AcidIndex     0
Code
library(mice)
# imputing train data 

wimputetrain <- mice(wine_train)

 iter imp variable
  1   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  1   2  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  1   3  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  1   4  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  1   5  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  2   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  2   2  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  2   3  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  2   4  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  2   5  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  3   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  3   2  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  3   3  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  3   4  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  3   5  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  4   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  4   2  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  4   3  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  4   4  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  4   5  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  5   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  5   2  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  5   3  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  5   4  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  5   5  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
Code
meth <- wimputetrain$method
predM <- wimputetrain$predictorMatrix
predM[, c("TARGET")] <- 0 
final_impute <- mice(wine_train, method = 'rf', predictorMatrix=predM)

 iter imp variable
  1   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  1   2  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  1   3  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  1   4  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  1   5  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  2   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  2   2  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  2   3  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  2   4  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  2   5  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  3   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  3   2  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  3   3  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  3   4  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  3   5  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  4   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  4   2  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  4   3  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  4   4  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  4   5  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  5   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  5   2  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  5   3  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  5   4  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  5   5  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
Code
final_impute1 <- complete(final_impute)
print(paste0("Missing value after imputation: ", sum(is.na(final_impute1))))
[1] "Missing value after imputation: 0"
Code
aggr(final_impute1, 
     sortVars=TRUE, 
     labels=names(wine_train), 
     cex.axis=.5, 
     bars = FALSE, 
     col = c("white", "light blue"),
     combined = TRUE,
     #border = NA,
     ylab = "Missing Values") 


 Variables sorted by number of missings: 
           Variable Count
             TARGET     0
       FixedAcidity     0
    VolatileAcidity     0
         CitricAcid     0
      ResidualSugar     0
          Chlorides     0
  FreeSulfurDioxide     0
 TotalSulfurDioxide     0
            Density     0
                 pH     0
          Sulphates     0
            Alcohol     0
        LabelAppeal     0
          AcidIndex     0
              STARS     0
Code
# histogram
final_impute1 %>% 
  dplyr::select(-c("AcidIndex", "STARS", "TARGET", "LabelAppeal")) %>% 
  gather() %>% 
  ggplot(aes(value)) +
  facet_wrap(~key, scale = "free",  ncol = 3) +
  geom_histogram(binwidth = function(x) 2 * IQR(x) / (length(x)^(1/3)), fill="light blue") +
  theme_minimal()

Code
# wine_train$ResidualSugar[is.na(wine_train$ResidualSugar)] <- median(wine_train$ResidualSugar, na.rm=TRUE)
# wine_train$Chlorides[is.na(wine_train$Chlorides)] <- median(wine_train$Chlorides, na.rm=TRUE)
# wine_train$FreeSulfurDioxide[is.na(wine_train$FreeSulfurDioxide)] <- median(wine_train$FreeSulfurDioxide, na.rm=TRUE)
# wine_train$TotalSulfurDioxide[is.na(wine_train$TotalSulfurDioxide)] <- median(wine_train$TotalSulfurDioxide, na.rm=TRUE)
# wine_train$pH[is.na(wine_train$pH)] <- median(wine_train$pH, na.rm=TRUE)
# wine_train$Sulphates[is.na(wine_train$Sulphates)] <- median(wine_train$Sulphates, na.rm=TRUE)
# wine_train$Alcohol[is.na(wine_train$Alcohol)] <- median(wine_train$Alcohol, na.rm=TRUE)
# wine_train$STARS[is.na(wine_train$STARS)] <- median(wine_train$STARS, na.rm=TRUE)

BUILD MODELS

  • Using the training data set, build at least two different poisson regression models, at least two different negative binomial regression models, and at least two multiple linear regression models, using different variables (or the same variables with different transformations).

  • Sometimes poisson and negative binomial regression models give the same results. If that is the case, comment on that. Consider changing the input variables if that occurs so that you get different models.

  • Although not covered in class, you may also want to consider building zero-inflated poisson and negative binomial regression models. You may select the variables manually, use an approach such as Forward or Stepwise, use a different approach such as trees, or use a combination of techniques.

  • Describe the techniques you used. If you manually selected a variable for inclusion into the model or exclusion into the model, indicate why this was done.

  • Discuss the coefficients in the models, do they make sense? In this case, about the only thing you can comment on is the number of stars and the wine label appeal.

  • However, you might comment on the coefficient and magnitude of variables and how they are similar or different from model to model. For example, you might say “pH seems to have a major positive impact in my poisson regression model, but a negative effect in my multiple linear regression model”.

  • Are you keeping the model even though it is counter intuitive? Why? The boss needs to know.

Code
r2glm <- function(model) {

  summaryLog <- summary(model)
  1 - summaryLog$deviance / summaryLog$null.deviance

}
Code
set.seed(100)
model_metrics <- function(model) {

  data.frame("RSE" = model$sigma,
             "Adj R2" = model$adj.r.squared,
             "F-Statistic" = model$fstatistic[1])
  
  performance <- data.frame("RSE" = model$sigma,
                                    "Adj R2" = model$adj.r.squared,
                                    "F-Statistic" = model$fstatistic[1])
  return(performance)
}
Code
set.seed(100)
#poisson models with all variables
model1 <- glm(TARGET ~ ., family = poisson, wine_train)
summary(model1)

Call:
glm(formula = TARGET ~ ., family = poisson, data = wine_train)

Coefficients:
                     Estimate Std. Error z value Pr(>|z|)    
(Intercept)         1.593e+00  2.506e-01   6.359 2.03e-10 ***
FixedAcidity        3.293e-04  1.053e-03   0.313  0.75447    
VolatileAcidity    -2.560e-02  8.353e-03  -3.065  0.00218 ** 
CitricAcid         -7.259e-04  7.575e-03  -0.096  0.92365    
ResidualSugar      -6.141e-05  1.941e-04  -0.316  0.75165    
Chlorides          -3.007e-02  2.056e-02  -1.463  0.14346    
FreeSulfurDioxide   6.734e-05  4.404e-05   1.529  0.12620    
TotalSulfurDioxide  2.081e-05  2.855e-05   0.729  0.46618    
Density            -3.725e-01  2.462e-01  -1.513  0.13026    
pH                 -4.661e-03  9.598e-03  -0.486  0.62722    
Sulphates          -5.164e-03  7.051e-03  -0.732  0.46398    
Alcohol             3.948e-03  1.771e-03   2.229  0.02579 *  
LabelAppeal         1.771e-01  7.954e-03  22.271  < 2e-16 ***
AcidIndex          -4.870e-02  5.903e-03  -8.251  < 2e-16 ***
STARS               1.871e-01  7.487e-03  24.993  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 5844.1  on 6435  degrees of freedom
Residual deviance: 4009.1  on 6421  degrees of freedom
  (6359 observations deleted due to missingness)
AIC: 23172

Number of Fisher Scoring iterations: 5
Code
r2glm(model1)
[1] 0.3139796
Code
anova(model1, test = "Chisq")
Analysis of Deviance Table

Model: poisson, link: log

Response: TARGET

Terms added sequentially (first to last)

                   Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
NULL                                6435     5844.1              
FixedAcidity        1     0.66      6434     5843.4  0.416923    
VolatileAcidity     1    24.07      6433     5819.3 9.305e-07 ***
CitricAcid          1     0.00      6432     5819.3  0.961864    
ResidualSugar       1     0.05      6431     5819.3  0.825953    
Chlorides           1     3.62      6430     5815.7  0.057171 .  
FreeSulfurDioxide   1     1.90      6429     5813.8  0.167654    
TotalSulfurDioxide  1     1.85      6428     5811.9  0.173597    
Density             1     8.95      6427     5803.0  0.002776 ** 
pH                  1     0.00      6426     5803.0  0.992117    
Sulphates           1     1.97      6425     5801.0  0.160678    
Alcohol             1    22.97      6424     5778.0 1.641e-06 ***
LabelAppeal         1  1031.50      6423     4746.5 < 2.2e-16 ***
AcidIndex           1   118.65      6422     4627.9 < 2.2e-16 ***
STARS               1   618.72      6421     4009.1 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Code
plot(model1)

Code
#install.packages("faraway")
library("faraway")
Code
set.seed(100)
print('Goodness of Fit Test:')
[1] "Goodness of Fit Test:"
Code
with(model1, cbind(res.deviance = deviance, df = df.residual,  p = pchisq(deviance, df.residual, lower.tail=FALSE)))
     res.deviance   df p
[1,]     4009.142 6421 1
Code
model_metrics(model1)
data frame with 0 columns and 0 rows
Code
set.seed(100)
# poisson model with the imputed values
model2 <- glm(TARGET ~ ., family = poisson, final_impute1)
summary(model2)

Call:
glm(formula = TARGET ~ ., family = poisson, data = final_impute1)

Coefficients:
                     Estimate Std. Error z value Pr(>|z|)    
(Intercept)         2.153e+00  1.955e-01  11.012  < 2e-16 ***
FixedAcidity       -5.151e-04  8.197e-04  -0.628 0.529739    
VolatileAcidity    -5.110e-02  6.488e-03  -7.876 3.37e-15 ***
CitricAcid          1.366e-02  5.896e-03   2.318 0.020467 *  
ResidualSugar       1.757e-04  1.507e-04   1.166 0.243545    
Chlorides          -6.050e-02  1.610e-02  -3.759 0.000171 ***
FreeSulfurDioxide   1.395e-04  3.419e-05   4.080 4.51e-05 ***
TotalSulfurDioxide  1.058e-04  2.202e-05   4.806 1.54e-06 ***
Density            -4.637e-01  1.920e-01  -2.415 0.015744 *  
pH                 -2.507e-02  7.519e-03  -3.335 0.000854 ***
Sulphates          -1.266e-02  5.470e-03  -2.315 0.020593 *  
Alcohol             5.042e-03  1.371e-03   3.677 0.000236 ***
LabelAppeal         1.956e-01  6.108e-03  32.018  < 2e-16 ***
AcidIndex          -1.227e-01  4.465e-03 -27.488  < 2e-16 ***
STARS               1.764e-01  5.809e-03  30.378  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 22861  on 12794  degrees of freedom
Residual deviance: 18649  on 12780  degrees of freedom
AIC: 50621

Number of Fisher Scoring iterations: 5
Code
plot(model2)

Code
halfnorm(residuals(model2))

Code
plot(log(fitted(model2)),log((final_impute1$TARGET-fitted(model2))^2), 
     xlab=expression(hat(mu)),ylab=expression((y-hat(mu))^2))
abline(0,1)

Code
(dp<- sum(residuals(model2,type="pearson")^2)/model2$df.res)
[1] 0.9634397
Code
summary(model2,dispersion=dp)

Call:
glm(formula = TARGET ~ ., family = poisson, data = final_impute1)

Coefficients:
                     Estimate Std. Error z value Pr(>|z|)    
(Intercept)         2.153e+00  1.919e-01  11.219  < 2e-16 ***
FixedAcidity       -5.151e-04  8.046e-04  -0.640 0.522032    
VolatileAcidity    -5.110e-02  6.368e-03  -8.024 1.02e-15 ***
CitricAcid          1.366e-02  5.787e-03   2.361 0.018214 *  
ResidualSugar       1.757e-04  1.479e-04   1.188 0.234797    
Chlorides          -6.050e-02  1.580e-02  -3.829 0.000129 ***
FreeSulfurDioxide   1.395e-04  3.356e-05   4.156 3.23e-05 ***
TotalSulfurDioxide  1.058e-04  2.161e-05   4.896 9.77e-07 ***
Density            -4.637e-01  1.885e-01  -2.460 0.013886 *  
pH                 -2.507e-02  7.380e-03  -3.397 0.000680 ***
Sulphates          -1.266e-02  5.369e-03  -2.359 0.018330 *  
Alcohol             5.042e-03  1.346e-03   3.746 0.000180 ***
LabelAppeal         1.956e-01  5.995e-03  32.620  < 2e-16 ***
AcidIndex          -1.227e-01  4.382e-03 -28.005  < 2e-16 ***
STARS               1.764e-01  5.701e-03  30.949  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 0.9634397)

    Null deviance: 22861  on 12794  degrees of freedom
Residual deviance: 18649  on 12780  degrees of freedom
AIC: 50621

Number of Fisher Scoring iterations: 5
Code
set.seed(100)
print('Goodness of Fit Test:')
[1] "Goodness of Fit Test:"
Code
with(model2, cbind(res.deviance = deviance, df = df.residual,  p = pchisq(deviance, df.residual, lower.tail=FALSE)))
     res.deviance    df             p
[1,]     18648.82 12780 2.330118e-228
Code
model_metrics(model2)
data frame with 0 columns and 0 rows
Code
r2glm(model2)
[1] 0.1842479
Code
anova(model2, test = "Chisq")
Analysis of Deviance Table

Model: poisson, link: log

Response: TARGET

Terms added sequentially (first to last)

                   Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
NULL                               12794      22861              
FixedAcidity        1    37.64     12793      22823 8.527e-10 ***
VolatileAcidity     1   121.94     12792      22701 < 2.2e-16 ***
CitricAcid          1     0.96     12791      22700 0.3275400    
ResidualSugar       1     3.94     12790      22696 0.0470345 *  
Chlorides           1    20.12     12789      22676 7.255e-06 ***
FreeSulfurDioxide   1    23.83     12788      22653 1.053e-06 ***
TotalSulfurDioxide  1    33.27     12787      22619 8.018e-09 ***
Density             1    18.07     12786      22601 2.124e-05 ***
pH                  1     1.47     12785      22600 0.2253360    
Sulphates           1    13.36     12784      22586 0.0002573 ***
Alcohol             1    55.95     12783      22530 7.442e-14 ***
LabelAppeal         1  1978.17     12782      20552 < 2.2e-16 ***
AcidIndex           1   991.12     12781      19561 < 2.2e-16 ***
STARS               1   912.23     12780      18649 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Code
set.seed(100)
model3 <- glm(TARGET ~ ., family = quasipoisson(link='log'), final_impute1)
summary(model3)

Call:
glm(formula = TARGET ~ ., family = quasipoisson(link = "log"), 
    data = final_impute1)

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)         2.153e+00  1.919e-01  11.219  < 2e-16 ***
FixedAcidity       -5.151e-04  8.046e-04  -0.640 0.522044    
VolatileAcidity    -5.110e-02  6.368e-03  -8.024 1.11e-15 ***
CitricAcid          1.366e-02  5.787e-03   2.361 0.018229 *  
ResidualSugar       1.757e-04  1.479e-04   1.188 0.234820    
Chlorides          -6.050e-02  1.580e-02  -3.829 0.000129 ***
FreeSulfurDioxide   1.395e-04  3.356e-05   4.156 3.25e-05 ***
TotalSulfurDioxide  1.058e-04  2.161e-05   4.896 9.89e-07 ***
Density            -4.637e-01  1.885e-01  -2.460 0.013900 *  
pH                 -2.507e-02  7.380e-03  -3.397 0.000682 ***
Sulphates          -1.266e-02  5.369e-03  -2.359 0.018345 *  
Alcohol             5.042e-03  1.346e-03   3.746 0.000180 ***
LabelAppeal         1.956e-01  5.995e-03  32.620  < 2e-16 ***
AcidIndex          -1.227e-01  4.382e-03 -28.005  < 2e-16 ***
STARS               1.764e-01  5.701e-03  30.949  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for quasipoisson family taken to be 0.9634428)

    Null deviance: 22861  on 12794  degrees of freedom
Residual deviance: 18649  on 12780  degrees of freedom
AIC: NA

Number of Fisher Scoring iterations: 5
Code
plot(model3)

Code
drop1(model3,test="F")
Single term deletions

Model:
TARGET ~ FixedAcidity + VolatileAcidity + CitricAcid + ResidualSugar + 
    Chlorides + FreeSulfurDioxide + TotalSulfurDioxide + Density + 
    pH + Sulphates + Alcohol + LabelAppeal + AcidIndex + STARS
                   Df Deviance  F value    Pr(>F)    
<none>                   18649                       
FixedAcidity        1    18649   0.2706 0.6029343    
VolatileAcidity     1    18711  42.4910 7.368e-11 ***
CitricAcid          1    18654   3.6814 0.0550449 .  
ResidualSugar       1    18650   0.9321 0.3343401    
Chlorides           1    18663   9.6817 0.0018652 ** 
FreeSulfurDioxide   1    18666  11.4049 0.0007347 ***
TotalSulfurDioxide  1    18672  15.8225 6.995e-05 ***
Density             1    18655   3.9953 0.0456492 *  
pH                  1    18660   7.6213 0.0057765 ** 
Sulphates           1    18654   3.6737 0.0553011 .  
Alcohol             1    18662   9.2670 0.0023379 ** 
LabelAppeal         1    19674 702.6412 < 2.2e-16 ***
AcidIndex           1    19457 553.6452 < 2.2e-16 ***
STARS               1    19561 625.1480 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Code
model_metrics(model3)
data frame with 0 columns and 0 rows
Code
r2glm(model3)
[1] 0.1842479
Code
anova(model3, test = "Chisq")
Analysis of Deviance Table

Model: quasipoisson, link: log

Response: TARGET

Terms added sequentially (first to last)

                   Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
NULL                               12794      22861              
FixedAcidity        1    37.64     12793      22823 4.102e-10 ***
VolatileAcidity     1   121.94     12792      22701 < 2.2e-16 ***
CitricAcid          1     0.96     12791      22700 0.3185300    
ResidualSugar       1     3.94     12790      22696 0.0430405 *  
Chlorides           1    20.12     12789      22676 4.868e-06 ***
FreeSulfurDioxide   1    23.83     12788      22653 6.588e-07 ***
TotalSulfurDioxide  1    33.27     12787      22619 4.190e-09 ***
Density             1    18.07     12786      22601 1.482e-05 ***
pH                  1     1.47     12785      22600 0.2167376    
Sulphates           1    13.36     12784      22586 0.0001964 ***
Alcohol             1    55.95     12783      22530 2.529e-14 ***
LabelAppeal         1  1978.17     12782      20552 < 2.2e-16 ***
AcidIndex           1   991.12     12781      19561 < 2.2e-16 ***
STARS               1   912.23     12780      18649 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Code
set.seed(100)
print('Goodness of Fit Test:')
[1] "Goodness of Fit Test:"
Code
with(model3, cbind(res.deviance = deviance, df = df.residual,  p = pchisq(deviance, df.residual, lower.tail=FALSE)))
     res.deviance    df             p
[1,]     18648.82 12780 2.330118e-228
Code
set.seed(100)
model4 <- glm(TARGET ~ ., family = negative.binomial(1), final_impute1)
summary(model4)

Call:
glm(formula = TARGET ~ ., family = negative.binomial(1), data = final_impute1)

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)         2.267e+00  2.066e-01  10.976  < 2e-16 ***
FixedAcidity       -4.834e-04  8.664e-04  -0.558 0.576935    
VolatileAcidity    -5.736e-02  6.880e-03  -8.336  < 2e-16 ***
CitricAcid          1.257e-02  6.255e-03   2.010 0.044504 *  
ResidualSugar       2.333e-04  1.596e-04   1.462 0.143803    
Chlorides          -6.870e-02  1.700e-02  -4.041 5.35e-05 ***
FreeSulfurDioxide   1.555e-04  3.620e-05   4.297 1.75e-05 ***
TotalSulfurDioxide  1.203e-04  2.323e-05   5.179 2.26e-07 ***
Density            -4.729e-01  2.031e-01  -2.328 0.019924 *  
pH                 -3.062e-02  7.942e-03  -3.855 0.000116 ***
Sulphates          -1.634e-02  5.784e-03  -2.826 0.004726 ** 
Alcohol             4.712e-03  1.450e-03   3.250 0.001158 ** 
LabelAppeal         1.991e-01  6.440e-03  30.919  < 2e-16 ***
AcidIndex          -1.373e-01  4.398e-03 -31.223  < 2e-16 ***
STARS               1.904e-01  6.321e-03  30.120  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Negative Binomial(1) family taken to be 0.2722089)

    Null deviance: 9042.5  on 12794  degrees of freedom
Residual deviance: 7931.5  on 12780  degrees of freedom
AIC: 56693

Number of Fisher Scoring iterations: 5
Code
print('Goodness of Fit Test:')
[1] "Goodness of Fit Test:"
Code
with(model4, cbind(res.deviance = deviance, df = df.residual,  p = pchisq(deviance, df.residual, lower.tail=FALSE)))
     res.deviance    df p
[1,]     7931.463 12780 1
Code
model_metrics(model4)
data frame with 0 columns and 0 rows
Code
model4

Call:  glm(formula = TARGET ~ ., family = negative.binomial(1), data = final_impute1)

Coefficients:
       (Intercept)        FixedAcidity     VolatileAcidity          CitricAcid  
         2.2670936          -0.0004834          -0.0573578           0.0125690  
     ResidualSugar           Chlorides   FreeSulfurDioxide  TotalSulfurDioxide  
         0.0002333          -0.0686994           0.0001555           0.0001203  
           Density                  pH           Sulphates             Alcohol  
        -0.4729458          -0.0306159          -0.0163440           0.0047118  
       LabelAppeal           AcidIndex               STARS  
         0.1991308          -0.1373164           0.1904056  

Degrees of Freedom: 12794 Total (i.e. Null);  12780 Residual
Null Deviance:      9043 
Residual Deviance: 7931     AIC: 56690
Code
#khat = 35593 and standard error = 59638

model4nb <- glm.nb(TARGET ~ .,final_impute1)
summary(model4nb)

Call:
glm.nb(formula = TARGET ~ ., data = final_impute1, init.theta = 34870.30494, 
    link = log)

Coefficients:
                     Estimate Std. Error z value Pr(>|z|)    
(Intercept)         2.153e+00  1.955e-01  11.011  < 2e-16 ***
FixedAcidity       -5.151e-04  8.197e-04  -0.628 0.529762    
VolatileAcidity    -5.110e-02  6.488e-03  -7.876 3.38e-15 ***
CitricAcid          1.366e-02  5.896e-03   2.318 0.020475 *  
ResidualSugar       1.757e-04  1.507e-04   1.166 0.243545    
Chlorides          -6.050e-02  1.610e-02  -3.758 0.000171 ***
FreeSulfurDioxide   1.395e-04  3.419e-05   4.080 4.51e-05 ***
TotalSulfurDioxide  1.058e-04  2.202e-05   4.806 1.54e-06 ***
Density            -4.637e-01  1.920e-01  -2.415 0.015748 *  
pH                 -2.507e-02  7.519e-03  -3.335 0.000854 ***
Sulphates          -1.266e-02  5.470e-03  -2.315 0.020595 *  
Alcohol             5.042e-03  1.371e-03   3.677 0.000236 ***
LabelAppeal         1.956e-01  6.108e-03  32.017  < 2e-16 ***
AcidIndex          -1.227e-01  4.465e-03 -27.487  < 2e-16 ***
STARS               1.765e-01  5.809e-03  30.376  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Negative Binomial(34870.3) family taken to be 1)

    Null deviance: 22860  on 12794  degrees of freedom
Residual deviance: 18648  on 12780  degrees of freedom
AIC: 50623

Number of Fisher Scoring iterations: 1

              Theta:  34870 
          Std. Err.:  59563 
Warning while fitting theta: iteration limit reached 

 2 x log-likelihood:  -50590.95 
Code
halfnorm(residuals(model4nb))

Code
plot(log(fitted(model4nb)),log((final_impute1$TARGET-fitted(model4nb))^2), 
     xlab=expression(hat(mu)),ylab=expression((y-hat(mu))^2))
abline(0,1)

Code
(dp<- sum(residuals(model4nb,type="pearson")^2)/model4nb$df.res)
[1] 0.963362
Code
summary(model4nb,dispersion=dp)

Call:
glm.nb(formula = TARGET ~ ., data = final_impute1, init.theta = 34870.30494, 
    link = log)

Coefficients:
                     Estimate Std. Error z value Pr(>|z|)    
(Intercept)         2.153e+00  1.919e-01  11.219  < 2e-16 ***
FixedAcidity       -5.151e-04  8.046e-04  -0.640 0.522038    
VolatileAcidity    -5.110e-02  6.368e-03  -8.024 1.02e-15 ***
CitricAcid          1.366e-02  5.787e-03   2.361 0.018216 *  
ResidualSugar       1.757e-04  1.479e-04   1.188 0.234778    
Chlorides          -6.050e-02  1.580e-02  -3.829 0.000129 ***
FreeSulfurDioxide   1.395e-04  3.356e-05   4.156 3.23e-05 ***
TotalSulfurDioxide  1.058e-04  2.161e-05   4.896 9.77e-07 ***
Density            -4.637e-01  1.885e-01  -2.460 0.013887 *  
pH                 -2.507e-02  7.380e-03  -3.397 0.000680 ***
Sulphates          -1.266e-02  5.369e-03  -2.359 0.018327 *  
Alcohol             5.042e-03  1.346e-03   3.746 0.000180 ***
LabelAppeal         1.956e-01  5.995e-03  32.620  < 2e-16 ***
AcidIndex          -1.227e-01  4.382e-03 -28.005  < 2e-16 ***
STARS               1.765e-01  5.701e-03  30.949  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Negative Binomial(34870.3) family taken to be 0.963362)

    Null deviance: 22860  on 12794  degrees of freedom
Residual deviance: 18648  on 12780  degrees of freedom
AIC: 50623

Number of Fisher Scoring iterations: 1

              Theta:  34870 
          Std. Err.:  59563 
Warning while fitting theta: iteration limit reached 

 2 x log-likelihood:  -50590.95 

Negative Binomial reduced (remove insignificant variables first)

Code
set.seed(100)
model5 <- glm(TARGET ~ . - FixedAcidity - CitricAcid - ResidualSugar - Density , family = negative.binomial(1), final_impute1)
summary(model5)

Call:
glm(formula = TARGET ~ . - FixedAcidity - CitricAcid - ResidualSugar - 
    Density, family = negative.binomial(1), data = final_impute1)

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)         1.800e+00  4.913e-02  36.628  < 2e-16 ***
VolatileAcidity    -5.805e-02  6.876e-03  -8.443  < 2e-16 ***
Chlorides          -7.013e-02  1.699e-02  -4.129 3.67e-05 ***
FreeSulfurDioxide   1.559e-04  3.617e-05   4.311 1.64e-05 ***
TotalSulfurDioxide  1.204e-04  2.321e-05   5.189 2.14e-07 ***
pH                 -3.037e-02  7.938e-03  -3.826 0.000131 ***
Sulphates          -1.646e-02  5.779e-03  -2.849 0.004390 ** 
Alcohol             4.750e-03  1.449e-03   3.279 0.001044 ** 
LabelAppeal         1.994e-01  6.437e-03  30.975  < 2e-16 ***
AcidIndex          -1.376e-01  4.323e-03 -31.823  < 2e-16 ***
STARS               1.906e-01  6.318e-03  30.164  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Negative Binomial(1) family taken to be 0.2719912)

    Null deviance: 9042.5  on 12794  degrees of freedom
Residual deviance: 7934.7  on 12784  degrees of freedom
AIC: 56688

Number of Fisher Scoring iterations: 5
Code
print('Goodness of Fit Test:')
[1] "Goodness of Fit Test:"
Code
with(model5, cbind(res.deviance = deviance, df = df.residual,  p = pchisq(deviance, df.residual, lower.tail=FALSE)))
     res.deviance    df p
[1,]     7934.724 12784 1
Code
model_metrics(model5)
data frame with 0 columns and 0 rows
Code
halfnorm(residuals(model5))

Code
plot(log(fitted(model5)),log((final_impute1$TARGET-fitted(model5))^2), 
     xlab=expression(hat(mu)),ylab=expression((y-hat(mu))^2))
abline(0,1)

Code
(dp<- sum(residuals(model5,type="pearson")^2)/model5$df.res)
[1] 0.2719901
Code
summary(model5,dispersion=dp)

Call:
glm(formula = TARGET ~ . - FixedAcidity - CitricAcid - ResidualSugar - 
    Density, family = negative.binomial(1), data = final_impute1)

Coefficients:
                     Estimate Std. Error z value Pr(>|z|)    
(Intercept)         1.800e+00  4.913e-02  36.628  < 2e-16 ***
VolatileAcidity    -5.805e-02  6.876e-03  -8.443  < 2e-16 ***
Chlorides          -7.013e-02  1.699e-02  -4.129 3.65e-05 ***
FreeSulfurDioxide   1.559e-04  3.617e-05   4.311 1.63e-05 ***
TotalSulfurDioxide  1.204e-04  2.321e-05   5.189 2.11e-07 ***
pH                 -3.037e-02  7.938e-03  -3.826  0.00013 ***
Sulphates          -1.646e-02  5.779e-03  -2.849  0.00438 ** 
Alcohol             4.750e-03  1.449e-03   3.279  0.00104 ** 
LabelAppeal         1.994e-01  6.437e-03  30.975  < 2e-16 ***
AcidIndex          -1.376e-01  4.323e-03 -31.823  < 2e-16 ***
STARS               1.906e-01  6.318e-03  30.164  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Negative Binomial(1) family taken to be 0.2719901)

    Null deviance: 9042.5  on 12794  degrees of freedom
Residual deviance: 7934.7  on 12784  degrees of freedom
AIC: 56688

Number of Fisher Scoring iterations: 5

Negative Binomial reduced (remove additional significant variables)

Code
set.seed(100)
model6 <- glm(TARGET ~ . - FixedAcidity - CitricAcid - ResidualSugar - Density - VolatileAcidity - Chlorides - FreeSulfurDioxide - TotalSulfurDioxide , family = negative.binomial(1), final_impute1)
summary(model6)

Call:
glm(formula = TARGET ~ . - FixedAcidity - CitricAcid - ResidualSugar - 
    Density - VolatileAcidity - Chlorides - FreeSulfurDioxide - 
    TotalSulfurDioxide, family = negative.binomial(1), data = final_impute1)

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.818737   0.048649  37.385  < 2e-16 ***
pH          -0.030350   0.007909  -3.837 0.000125 ***
Sulphates   -0.015844   0.005760  -2.751 0.005952 ** 
Alcohol      0.004461   0.001443   3.091 0.002001 ** 
LabelAppeal  0.199783   0.006413  31.152  < 2e-16 ***
AcidIndex   -0.139989   0.004294 -32.597  < 2e-16 ***
STARS        0.191107   0.006291  30.376  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Negative Binomial(1) family taken to be 0.2704147)

    Null deviance: 9042.5  on 12794  degrees of freedom
Residual deviance: 7971.7  on 12788  degrees of freedom
AIC: 56717

Number of Fisher Scoring iterations: 5
Code
print('Goodness of Fit Test:')
[1] "Goodness of Fit Test:"
Code
with(model6, cbind(res.deviance = deviance, df = df.residual,  p = pchisq(deviance, df.residual, lower.tail=FALSE)))
     res.deviance    df p
[1,]     7971.736 12788 1
Code
model_metrics(model6)
data frame with 0 columns and 0 rows
Code
plot(model6)

Code
set.seed(100)

model7 <- lm(TARGET ~ FixedAcidity + VolatileAcidity + CitricAcid + ResidualSugar + Chlorides + FreeSulfurDioxide + TotalSulfurDioxide + Density + pH + Sulphates + Alcohol + as.factor(LabelAppeal) + as.factor(AcidIndex) +
as.factor(STARS),data=final_impute1)
summary(model7)

Call:
lm(formula = TARGET ~ FixedAcidity + VolatileAcidity + CitricAcid + 
    ResidualSugar + Chlorides + FreeSulfurDioxide + TotalSulfurDioxide + 
    Density + pH + Sulphates + Alcohol + as.factor(LabelAppeal) + 
    as.factor(AcidIndex) + as.factor(STARS), data = final_impute1)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.6149 -0.6851  0.3624  1.0981  4.8820 

Coefficients:
                           Estimate Std. Error t value Pr(>|t|)    
(Intercept)               1.716e+00  1.094e+00   1.569  0.11666    
FixedAcidity             -6.841e-04  2.334e-03  -0.293  0.76943    
VolatileAcidity          -1.493e-01  1.854e-02  -8.053 8.78e-16 ***
CitricAcid                3.785e-02  1.686e-02   2.244  0.02484 *  
ResidualSugar             4.296e-04  4.307e-04   0.997  0.31863    
Chlorides                -1.839e-01  4.576e-02  -4.019 5.89e-05 ***
FreeSulfurDioxide         3.895e-04  9.757e-05   3.991 6.61e-05 ***
TotalSulfurDioxide        2.792e-04  6.255e-05   4.465 8.09e-06 ***
Density                  -1.402e+00  5.472e-01  -2.562  0.01041 *  
pH                       -6.141e-02  2.139e-02  -2.872  0.00409 ** 
Sulphates                -3.158e-02  1.557e-02  -2.028  0.04258 *  
Alcohol                   1.976e-02  3.909e-03   5.056 4.34e-07 ***
as.factor(LabelAppeal)-1  5.641e-01  7.894e-02   7.146 9.40e-13 ***
as.factor(LabelAppeal)0   1.160e+00  7.711e-02  15.043  < 2e-16 ***
as.factor(LabelAppeal)1   1.730e+00  8.077e-02  21.419  < 2e-16 ***
as.factor(LabelAppeal)2   2.396e+00  1.062e-01  22.565  < 2e-16 ***
as.factor(AcidIndex)5     1.199e+00  9.665e-01   1.241  0.21476    
as.factor(AcidIndex)6     1.293e+00  9.491e-01   1.362  0.17311    
as.factor(AcidIndex)7     1.180e+00  9.483e-01   1.245  0.21328    
as.factor(AcidIndex)8     1.002e+00  9.485e-01   1.056  0.29078    
as.factor(AcidIndex)9     5.423e-01  9.492e-01   0.571  0.56783    
as.factor(AcidIndex)10   -9.396e-02  9.509e-01  -0.099  0.92129    
as.factor(AcidIndex)11   -7.273e-01  9.539e-01  -0.762  0.44583    
as.factor(AcidIndex)12   -9.939e-01  9.595e-01  -1.036  0.30028    
as.factor(AcidIndex)13   -3.783e-01  9.688e-01  -0.390  0.69619    
as.factor(AcidIndex)14   -4.739e-01  9.785e-01  -0.484  0.62820    
as.factor(AcidIndex)15    2.691e-01  1.112e+00   0.242  0.80875    
as.factor(AcidIndex)16   -1.093e+00  1.199e+00  -0.911  0.36232    
as.factor(AcidIndex)17   -1.538e+00  1.134e+00  -1.356  0.17509    
as.factor(STARS)2         7.323e-01  3.488e-02  20.996  < 2e-16 ***
as.factor(STARS)3         1.182e+00  4.137e-02  28.560  < 2e-16 ***
as.factor(STARS)4         1.657e+00  6.656e-02  24.897  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.638 on 12763 degrees of freedom
Multiple R-squared:  0.2784,    Adjusted R-squared:  0.2766 
F-statistic: 158.8 on 31 and 12763 DF,  p-value: < 2.2e-16
Code
sqrt(sum(model7$residuals^2) / model7$df.residual)
[1] 1.63838
Code
set.seed(100)
lm8 <- stepAIC(model7, direction = "both",
               scope = list(upper = model7, lower = ~ 1),
               scale = 0, trace = FALSE)
summary(lm8)

Call:
lm(formula = TARGET ~ VolatileAcidity + CitricAcid + Chlorides + 
    FreeSulfurDioxide + TotalSulfurDioxide + Density + pH + Sulphates + 
    Alcohol + as.factor(LabelAppeal) + as.factor(AcidIndex) + 
    as.factor(STARS), data = final_impute1)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.5993 -0.6856  0.3608  1.0993  4.8883 

Coefficients:
                           Estimate Std. Error t value Pr(>|t|)    
(Intercept)               1.746e+00  1.093e+00   1.597  0.11020    
VolatileAcidity          -1.495e-01  1.854e-02  -8.060 8.28e-16 ***
CitricAcid                3.776e-02  1.686e-02   2.239  0.02517 *  
Chlorides                -1.841e-01  4.576e-02  -4.022 5.79e-05 ***
FreeSulfurDioxide         3.908e-04  9.755e-05   4.006 6.20e-05 ***
TotalSulfurDioxide        2.809e-04  6.252e-05   4.492 7.10e-06 ***
Density                  -1.400e+00  5.471e-01  -2.559  0.01051 *  
pH                       -6.119e-02  2.138e-02  -2.862  0.00422 ** 
Sulphates                -3.176e-02  1.557e-02  -2.040  0.04134 *  
Alcohol                   1.969e-02  3.908e-03   5.039 4.74e-07 ***
as.factor(LabelAppeal)-1  5.647e-01  7.893e-02   7.154 8.87e-13 ***
as.factor(LabelAppeal)0   1.160e+00  7.710e-02  15.043  < 2e-16 ***
as.factor(LabelAppeal)1   1.731e+00  8.076e-02  21.428  < 2e-16 ***
as.factor(LabelAppeal)2   2.397e+00  1.062e-01  22.576  < 2e-16 ***
as.factor(AcidIndex)5     1.166e+00  9.659e-01   1.207  0.22736    
as.factor(AcidIndex)6     1.258e+00  9.484e-01   1.327  0.18464    
as.factor(AcidIndex)7     1.145e+00  9.476e-01   1.209  0.22685    
as.factor(AcidIndex)8     9.674e-01  9.477e-01   1.021  0.30738    
as.factor(AcidIndex)9     5.060e-01  9.484e-01   0.534  0.59367    
as.factor(AcidIndex)10   -1.314e-01  9.500e-01  -0.138  0.89001    
as.factor(AcidIndex)11   -7.653e-01  9.530e-01  -0.803  0.42193    
as.factor(AcidIndex)12   -1.033e+00  9.585e-01  -1.078  0.28107    
as.factor(AcidIndex)13   -4.182e-01  9.677e-01  -0.432  0.66565    
as.factor(AcidIndex)14   -5.129e-01  9.774e-01  -0.525  0.59980    
as.factor(AcidIndex)15    2.338e-01  1.111e+00   0.210  0.83330    
as.factor(AcidIndex)16   -1.122e+00  1.198e+00  -0.936  0.34916    
as.factor(AcidIndex)17   -1.578e+00  1.133e+00  -1.393  0.16358    
as.factor(STARS)2         7.328e-01  3.487e-02  21.015  < 2e-16 ***
as.factor(STARS)3         1.182e+00  4.136e-02  28.578  < 2e-16 ***
as.factor(STARS)4         1.657e+00  6.655e-02  24.902  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.638 on 12765 degrees of freedom
Multiple R-squared:  0.2783,    Adjusted R-squared:  0.2767 
F-statistic: 169.8 on 29 and 12765 DF,  p-value: < 2.2e-16
Code
plot(lm8)

Code
set.seed(100)
# model9 <- zeroinfl(TARGET ~. | STARS, data=final_impute1)
# summary(model9)

model9 <- zeroinfl(TARGET ~., final_impute1, dist = 'poisson')
summary(model9)

Call:
zeroinfl(formula = TARGET ~ ., data = final_impute1, dist = "poisson")

Pearson residuals:
    Min      1Q  Median      3Q     Max 
-2.2428 -0.3653  0.1667  0.5012  4.3875 

Count model coefficients (poisson with log link):
                     Estimate Std. Error z value Pr(>|z|)    
(Intercept)         1.449e+00  2.065e-01   7.017 2.26e-12 ***
FixedAcidity        2.377e-04  8.575e-04   0.277  0.78165    
VolatileAcidity    -1.340e-02  6.872e-03  -1.950  0.05120 .  
CitricAcid         -2.172e-05  6.143e-03  -0.004  0.99718    
ResidualSugar      -6.728e-05  1.581e-04  -0.426  0.67037    
Chlorides          -1.801e-02  1.693e-02  -1.064  0.28743    
FreeSulfurDioxide   3.035e-05  3.496e-05   0.868  0.38530    
TotalSulfurDioxide -2.622e-05  2.228e-05  -1.177  0.23917    
Density            -3.341e-01  2.026e-01  -1.649  0.09917 .  
pH                  6.368e-03  7.930e-03   0.803  0.42195    
Sulphates           1.363e-03  5.772e-03   0.236  0.81328    
Alcohol             7.214e-03  1.427e-03   5.055 4.30e-07 ***
LabelAppeal         2.466e-01  6.428e-03  38.369  < 2e-16 ***
AcidIndex          -1.589e-02  4.983e-03  -3.188  0.00143 ** 
STARS               9.578e-02  6.294e-03  15.217  < 2e-16 ***

Zero-inflation model coefficients (binomial with logit link):
                     Estimate Std. Error z value Pr(>|z|)    
(Intercept)        -5.9321816  1.0390266  -5.709 1.13e-08 ***
FixedAcidity        0.0026326  0.0043097   0.611 0.541298    
VolatileAcidity     0.2346703  0.0347026   6.762 1.36e-11 ***
CitricAcid         -0.0803708  0.0314027  -2.559 0.010486 *  
ResidualSugar      -0.0013743  0.0008004  -1.717 0.085974 .  
Chlorides           0.2554050  0.0850914   3.002 0.002686 ** 
FreeSulfurDioxide  -0.0006658  0.0001796  -3.708 0.000209 ***
TotalSulfurDioxide -0.0008123  0.0001156  -7.026 2.12e-12 ***
Density             0.8564562  1.0224154   0.838 0.402211    
pH                  0.1938946  0.0400659   4.839 1.30e-06 ***
Sulphates           0.0911598  0.0290604   3.137 0.001707 ** 
Alcohol             0.0109721  0.0072459   1.514 0.129960    
LabelAppeal         0.3333973  0.0331303  10.063  < 2e-16 ***
AcidIndex           0.4764164  0.0197806  24.085  < 2e-16 ***
STARS              -0.5334391  0.0347708 -15.342  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Number of iterations in BFGS optimization: 37 
Log-likelihood: -2.254e+04 on 30 Df
Code
set.seed(100)
library(randomForest)
model10 <- randomForest(TARGET ~., final_impute1)

model10

Call:
 randomForest(formula = TARGET ~ ., data = final_impute1) 
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 4

          Mean of squared residuals: 2.554585
                    % Var explained: 31.15
Code
set.seed(100)
hurd.mod2 <- hurdle(TARGET ~ AcidIndex + Alcohol + LabelAppeal + STARS |
                        VolatileAcidity + FreeSulfurDioxide +
                        TotalSulfurDioxide + pH + Sulphates + Alcohol +
                        LabelAppeal + AcidIndex + STARS,
                    data = final_impute1)
hurd.summ <- summary(hurd.mod2)

hurd.summ

Call:
hurdle(formula = TARGET ~ AcidIndex + Alcohol + LabelAppeal + STARS | 
    VolatileAcidity + FreeSulfurDioxide + TotalSulfurDioxide + pH + Sulphates + 
        Alcohol + LabelAppeal + AcidIndex + STARS, data = final_impute1)

Pearson residuals:
    Min      1Q  Median      3Q     Max 
-2.2746 -0.3759  0.1674  0.5034  3.6135 

Count model coefficients (truncated poisson with log link):
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)  1.138239   0.043803  25.986  < 2e-16 ***
AcidIndex   -0.017645   0.004924  -3.584 0.000339 ***
Alcohol      0.007507   0.001439   5.217 1.81e-07 ***
LabelAppeal  0.249872   0.006512  38.371  < 2e-16 ***
STARS        0.097081   0.006264  15.499  < 2e-16 ***
Zero hurdle model coefficients (binomial with logit link):
                     Estimate Std. Error z value Pr(>|z|)    
(Intercept)         4.370e+00  2.030e-01  21.525  < 2e-16 ***
VolatileAcidity    -2.145e-01  2.938e-02  -7.299 2.90e-13 ***
FreeSulfurDioxide   6.149e-04  1.540e-04   3.993 6.52e-05 ***
TotalSulfurDioxide  6.756e-04  9.873e-05   6.843 7.76e-12 ***
pH                 -1.645e-01  3.381e-02  -4.866 1.14e-06 ***
Sulphates          -7.999e-02  2.453e-02  -3.261  0.00111 ** 
Alcohol            -6.325e-03  6.163e-03  -1.026  0.30476    
LabelAppeal        -1.570e-01  2.710e-02  -5.793 6.91e-09 ***
AcidIndex          -4.270e-01  1.688e-02 -25.305  < 2e-16 ***
STARS               5.063e-01  2.931e-02  17.272  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Number of iterations in BFGS optimization: 11 
Log-likelihood: -2.256e+04 on 15 Df
Code
plot(fitted(hurd.mod2), fitted(model9),xlab="Hurdle pred",ylab="Zero infl poisson")
abline(0,1)

Code
set.seed(100)
model11 <- zeroinfl(TARGET ~ AcidIndex+ Alcohol + LabelAppeal + STARS | VolatileAcidity + FreeSulfurDioxide + TotalSulfurDioxide  + pH + Sulphates +    Alcohol + LabelAppeal + AcidIndex + STARS, data = final_impute1)

summary(model11)

Call:
zeroinfl(formula = TARGET ~ AcidIndex + Alcohol + LabelAppeal + STARS | 
    VolatileAcidity + FreeSulfurDioxide + TotalSulfurDioxide + pH + Sulphates + 
        Alcohol + LabelAppeal + AcidIndex + STARS, data = final_impute1)

Pearson residuals:
    Min      1Q  Median      3Q     Max 
-2.2910 -0.3650  0.1668  0.4996  4.1425 

Count model coefficients (poisson with log link):
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)  1.138204   0.043655  26.073  < 2e-16 ***
AcidIndex   -0.016738   0.004915  -3.406  0.00066 ***
Alcohol      0.007265   0.001426   5.096 3.47e-07 ***
LabelAppeal  0.247012   0.006423  38.460  < 2e-16 ***
STARS        0.095769   0.006290  15.226  < 2e-16 ***

Zero-inflation model coefficients (binomial with logit link):
                     Estimate Std. Error z value Pr(>|z|)    
(Intercept)        -5.0456291  0.2381556 -21.186  < 2e-16 ***
VolatileAcidity     0.2468154  0.0343463   7.186 6.67e-13 ***
FreeSulfurDioxide  -0.0007116  0.0001780  -3.997 6.42e-05 ***
TotalSulfurDioxide -0.0007974  0.0001146  -6.955 3.52e-12 ***
pH                  0.1876248  0.0396506   4.732 2.22e-06 ***
Sulphates           0.0923097  0.0287160   3.215  0.00131 ** 
Alcohol             0.0104456  0.0072259   1.446  0.14829    
LabelAppeal         0.3323864  0.0330600  10.054  < 2e-16 ***
AcidIndex           0.4755215  0.0193439  24.583  < 2e-16 ***
STARS              -0.5340782  0.0346984 -15.392  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Number of iterations in BFGS optimization: 22 
Log-likelihood: -2.256e+04 on 15 Df
Code
(lrt <- 2* (model9$loglik-model11$loglik))
[1] 31.94336
Code
1-pchisq(31.82583,15)
[1] 0.006798843
Code
exp(coef(model11))
      count_(Intercept)         count_AcidIndex           count_Alcohol 
            3.121157914             0.983401152             1.007291473 
      count_LabelAppeal             count_STARS        zero_(Intercept) 
            1.280194333             1.100504780             0.006437409 
   zero_VolatileAcidity  zero_FreeSulfurDioxide zero_TotalSulfurDioxide 
            1.279942812             0.999288666             0.999202910 
                zero_pH          zero_Sulphates            zero_Alcohol 
            1.206380833             1.096704402             1.010500378 
       zero_LabelAppeal          zero_AcidIndex              zero_STARS 
            1.394291527             1.608852970             0.586209398 

SELECT MODELS

For the count regression model, will you use a metric such as AIC, average squared error, etc.?

Be sure to explain how you can make inferences from the model, and discuss other relevant model output.

If you like the multiple linear regression model the best, please say why.

However, you must select a count regression model for model deployment.

Using the training data set, evaluate the performance of the count regression model. Make predictions using the evaluation data set.

Code
set.seed(100)

predtrain <- data.frame(TARGET=final_impute1$TARGET,
                         model2=model2$fitted.values,
                         model3=model3$fitted.values,
                         model4=model4$fitted.values,
                         model5=model5$fitted.values,
                         model6=model6$fitted.values,
                         model7=model7$fitted.values,
                         model8=lm8$fitted.values,
                         model9=model9$fitted.values,
                        #model10=model10$fitted.values)
                        model11=model11$fitted.values,
                        model12=hurd.mod2$fitted.values)
predtrain <- round(predtrain, 0)
colnames(predtrain) <- c("TARGET","Poisson-Imputed" ,"Quasipoisson", "negative.binom", "negative.binom remove insig", "negative.binom reduce insig/sig v","linear", "linear step AIC", "Zero Inflated","zeroinfl with reduced", "hurdle model")

predtrain %>%
  gather() %>%
  ggplot(aes(value)) +
  facet_wrap(~key, scale = "free",  ncol = 5) +
  geom_bar(fill="pink") +
  theme_minimal() + labs(x="Cases Bought", y = "Count", title = "Prediction Histogram")

Code
hist(hurd.mod2$fitted.values)

Code
set.seed(100)

aic2 <- model2$aic
aic3 <- model3$aic
aic4 <- model4$aic
aic5 <- model5$aic
aic6 <- model6$aic
aic7 <- model7$aic
aic8 <- lm8$aic


bic2 <- model2$bic
bic3 <- model3$bic
bic4 <- model4$bic
bic5 <- model5$bic
bic6 <- model6$bic
bic7 <- model7$bic
bic8 <- lm8$bic

#aic9 <- model9@aic 
#mse1 <- mean((final_impute1$TARGET - predict(model1))^2)
mse2 <- mean((final_impute1$TARGET - predict(model2))^2)
mse3 <- mean((final_impute1$TARGET - predict(model3))^2)
mse4 <- mean((final_impute1$TARGET - predict(model4))^2)
mse5 <- mean((final_impute1$TARGET - predict(model5))^2)
mse6 <- mean((final_impute1$TARGET - predict(model6))^2)
mse7 <- mean((final_impute1$TARGET - predict(model7))^2)
mse8 <- mean((final_impute1$TARGET - predict(lm8))^2)
#mse9 <- mean((final_impute1$TARGET - predict(model9))^2)
MSE <- c(mse2, mse3, mse4, mse5, mse6, mse7, mse8)
AIC <- c(aic2, aic3, aic4, aic5, aic6, aic7, aic8)
compare_df <- cbind(MSE,AIC)
rownames(compare_df) <- c("M2Poisson-Imputed" ,"M3Quasipoisson", "M4negative.binom", "M5negative.binom remove insig", "M6negative.binom reduce insig/sig v","M7linear", "M8linear step AIC")
DT::datatable(compare_df)
Code
final_impute1$PoissonImputed = predict(model2,final_impute1,type = "response")
final_impute1$Quasipoisson= predict(model3,final_impute1,type = "response")
final_impute1$NB_M1= predict(model4,final_impute1,type = "response")
final_impute1$NB_RI= predict(model5,final_impute1,type = "response")
final_impute1$NB_SIS= predict(model6,final_impute1,type = "response")
final_impute1$Linear_M1= predict(model7,final_impute1)
final_impute1$Linear_STEP= predict(lm8,final_impute1)
final_impute1$ZeroInf_M1= predict(model9,final_impute1,type = "response")
final_impute1$RF= predict(model10,final_impute1,type = "response")
final_impute1$ZeroInf_M2= predict(model11,final_impute1,type = "response")
final_impute1$Hurdle= predict(hurd.mod2,final_impute1,type = "response")


Final_test_df = data.frame(
  Models =c("M2Poisson-Imputed" ,"M3Quasipoisson", "M4negative.binom", "M5negative.binom remove insig", "M6negative.binom reduce insig/sig v","M7linear", "M8linear step AIC", "Zero Inflated","Random Forest","zeroinfl with reduced", "hurdle model"),
RMSE=c( sqrt(mean((final_impute1$TARGET - final_impute1$PoissonImputed)^2)),
 sqrt(mean((final_impute1$TARGET - final_impute1$Quasipoisson)^2)),
 sqrt(mean((final_impute1$TARGET - final_impute1$NB_M1)^2)),
 sqrt(mean((final_impute1$TARGET - final_impute1$NB_RI)^2)),
 sqrt(mean((final_impute1$TARGET - final_impute1$NB_SIS)^2)),
 sqrt(mean((final_impute1$TARGET - final_impute1$Linear_M1)^2)),
 sqrt(mean((final_impute1$TARGET - final_impute1$Linear_STEP)^2)),
sqrt(mean((final_impute1$TARGET - final_impute1$ZeroInf_M1)^2)),
sqrt(mean((final_impute1$TARGET - final_impute1$RF)^2)),
 sqrt(mean((final_impute1$TARGET - final_impute1$ZeroInf_M2)^2)),
sqrt(mean((final_impute1$TARGET - final_impute1$Hurdle)^2))))


( sqrt(mean((final_impute1$TARGET - final_impute1$PoissonImputed)^2)))
[1] 1.659053
Code
Final_test_df
                                Models      RMSE
1                    M2Poisson-Imputed 1.6590528
2                       M3Quasipoisson 1.6590528
3                     M4negative.binom 1.6659795
4        M5negative.binom remove insig 1.6668970
5  M6negative.binom reduce insig/sig v 1.6726298
6                             M7linear 1.6363296
7                    M8linear step AIC 1.6363996
8                        Zero Inflated 1.6399992
9                        Random Forest 0.6974797
10               zeroinfl with reduced 1.6429352
11                        hurdle model 1.6452958
Code
set.seed(100)

model2_fitted.values <- factor(round(model2$fitted.values),levels=rev(0:9))
lm8_fitted.values <- factor(round(lm8$fitted.values),levels=rev(0:9))
model9_fitted.values <- factor(round(model9$fitted.values),levels=rev(0:9))
model11_fitted.values <- factor(round(model11$fitted.values),levels=rev(0:9))
model12_fitted.values <- factor(round(model9$fitted.values),levels=rev(0:9))


m2_cfm <- confusionMatrix(model2_fitted.values, factor(final_impute1$TARGET,levels=rev(0:9)))
m8_cfm <- confusionMatrix(lm8_fitted.values, factor(final_impute1$TARGET,levels=rev(0:9)))
m9_cfm <- confusionMatrix(model9_fitted.values, factor(final_impute1$TARGET,levels=rev(0:9)))
m11_cfm <- confusionMatrix(model11_fitted.values, factor(final_impute1$TARGET,levels=rev(0:9)))
m12_cfm <- confusionMatrix(model12_fitted.values, factor(final_impute1$TARGET,levels=rev(0:9)))

models_sum <- data.frame(m2_cfm$overall, m8_cfm$overall, m9_cfm$overall,m11_cfm$overall,m12_cfm$overall)
colnames(models_sum) <- c("Poisson (Imputed)" ,"Linear (stepaic)", "Zero Inflated","zeroinfl with reduced", "hurdle model")
round(models_sum, 2)
               Poisson (Imputed) Linear (stepaic) Zero Inflated
Accuracy                    0.25             0.27          0.27
Kappa                       0.10             0.12          0.12
AccuracyLower               0.24             0.27          0.26
AccuracyUpper               0.25             0.28          0.28
AccuracyNull                0.25             0.25          0.25
AccuracyPValue              0.62             0.00          0.00
McnemarPValue                NaN              NaN           NaN
               zeroinfl with reduced hurdle model
Accuracy                        0.27         0.27
Kappa                           0.12         0.12
AccuracyLower                   0.26         0.26
AccuracyUpper                   0.27         0.28
AccuracyNull                    0.25         0.25
AccuracyPValue                  0.00         0.00
McnemarPValue                    NaN          NaN

Make Predictions on the Test Evaluation Data set

Code
imputetest <- mice(wine_test)

 iter imp variable
  1   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  1   2  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  1   3  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  1   4  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  1   5  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  2   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  2   2  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  2   3  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  2   4  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  2   5  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  3   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  3   2  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  3   3  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  3   4  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  3   5  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  4   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  4   2  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  4   3  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  4   4  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  4   5  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  5   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  5   2  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  5   3  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  5   4  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  5   5  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
Code
meth <- imputetest$method
predM <- imputetest$predictorMatrix
predM[, c("TARGET")] <- 0
data_test_impute <- mice(wine_test, method = 'rf', predictorMatrix=predM)

 iter imp variable
  1   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  1   2  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  1   3  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  1   4  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  1   5  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  2   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  2   2  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  2   3  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  2   4  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  2   5  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  3   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  3   2  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  3   3  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  3   4  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  3   5  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  4   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  4   2  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  4   3  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  4   4  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  4   5  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  5   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  5   2  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  5   3  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  5   4  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  5   5  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
Code
final_testimpute <- complete(data_test_impute)
print(paste0("Missing value after imputation: ", sum(is.na(final_testimpute))))
[1] "Missing value after imputation: 3335"
Code
test_predict <- predict(model9, newdata=final_testimpute)
test_predict <- round(test_predict,0)
data_pred <- data.frame(TARGET=test_predict)

ggplot(data_pred, aes(x=TARGET)) + geom_bar(fill="light pink") + theme_minimal() +
  labs(y="Count", title = "Prediction: Zero Inflated Model (Model9)") + 
    scale_x_discrete(name = "Cases Bought", limits=c("1","2","3","4", "5", "6", "7", "8"))

Code
ggplot(wine_train, aes(x=TARGET)) + geom_bar(fill="light blue") + theme_minimal() +
  labs(y="Count", title = "Actual") + 
    scale_x_discrete(name = "Training dataset Cases Bought", limits=c("1","2","3","4", "5", "6", "7", "8"))

Code
write.csv(test_predict, "WinePredictionsFinal.csv")