DATA 621 Homework 5, Group : Banu Boopalan, Gregg Maloy, Alexander Moyse, Umais Siddiqui

Running Code

Code

knitr::opts_chunk$set(echo = TRUE, warning = FALSE, error = FALSE)

Overview

In this homework assignment, you will explore, analyze and model a data set containing information on approximately 12,000 commercially available wines. The variables are mostly related to the chemical properties of the wine being sold. The response variable is the number of sample cases of wine that were purchased by wine distribution companies after sampling a wine. These cases would be used to provide tasting samples to restaurants and wine stores around the United States. The more sample cases purchased, the more likely is a wine to be sold at a high end restaurant. A large wine manufacturer is studying the data in order to predict the number of wine cases ordered based upon the wine characteristics. If the wine manufacturer can predict the number of cases, then that manufacturer will be able to adjust their wine offering to maximize sales.

Your objective is to build a count regression model to predict the number of cases of wine that will be sold given certain properties of the wine. HINT: Sometimes, the fact that a variable is missing is actually predictive of the target. You can only use the variables given to you (or variables that you derive from the variables provided). Below is a short description of the variables of interest in the data set:

Code

set.seed(200)


wine_train <- read.csv("C:/Users/Banu/Documents/RScriptfiles/DATA621/Homework 5/HOMEWORK 5/wine-training-data.csv",stringsAsFactors = F,header=TRUE)

wine_test <- read.csv("C:/Users/Banu/Documents/RScriptfiles/DATA621/Homework 5/HOMEWORK 5/wine-evaluation-data.csv",stringsAsFactors = F,header=TRUE)



variables <- c("INDEX", "TARGET", " ", " ", "ACID INDEX", "ALCOHOL", "CHLORIDES", "CITRIC ACID", "DENSITY", "FIXED ACIDITY", "FREE SULFUR DIOXIDE", "LABEL APPEAL", "RESIDUAL SUGAR", "STARS", "SULPHATES", "TOTAL SULFUR DIOXIDE", "VOLATILE ACIDITY", "pH")
define <- c("Identification Variable (do not use)", "Number of Cases Purchased", " ", " ", "Proprietary method of testing total acidity of wine by using a weighted average", "Alcohol Content", "Chloride content of wine", "Citric Acid Content", "Density of Wine", "Fixed Acidity of Wine", "Sulfur Dioxide content of wine", "Marketing Score indicating the appeal of label design for consumers. High numbers suggest customers like the label design. Negative numbers suggest customers don't like design.", "Residual Sugar of wine", "Wine rating by a team of experts. 4 Stars = Excellent, 1 Star = Poor", "Sulfate content of wine", "Total Sulfur Dioxide of Wine", "Volatile Acid content of wine.", "pH of wine")
theoreffect <- c("None", "None", " ", " ", " ", " ", " ", " ", " ", " ", " ", "Many consumers purchase based on the visual appeal of the wine label design. Higher numbers suggest better sales.", " ", "A high number of stars suggests high sales", " ", " ", " ", " ")

kable(cbind(variables, define, theoreffect), col.names = c("VARIABLE NAME", "DEFINITION", "THEORETICAL EFFECT")) %>% 
 kable_paper(full_width = T)

VARIABLE NAME	DEFINITION	THEORETICAL EFFECT
INDEX	Identification Variable (do not use)	None
TARGET	Number of Cases Purchased	None


ACID INDEX	Proprietary method of testing total acidity of wine by using a weighted average
ALCOHOL	Alcohol Content
CHLORIDES	Chloride content of wine
CITRIC ACID	Citric Acid Content
DENSITY	Density of Wine
FIXED ACIDITY	Fixed Acidity of Wine
FREE SULFUR DIOXIDE	Sulfur Dioxide content of wine
LABEL APPEAL	Marketing Score indicating the appeal of label design for consumers. High numbers suggest customers like the label design. Negative numbers suggest customers don't like design.	Many consumers purchase based on the visual appeal of the wine label design. Higher numbers suggest better sales.
RESIDUAL SUGAR	Residual Sugar of wine
STARS	Wine rating by a team of experts. 4 Stars = Excellent, 1 Star = Poor	A high number of stars suggests high sales
SULPHATES	Sulfate content of wine
TOTAL SULFUR DIOXIDE	Total Sulfur Dioxide of Wine
VOLATILE ACIDITY	Volatile Acid content of wine.
pH	pH of wine

Code

# remove index column as it is not needed
wine_train <- wine_train %>% 
  dplyr::select(-"INDEX")

wine_test <- wine_test %>% 
  dplyr::select(-"IN")


head(wine_train)

  TARGET FixedAcidity VolatileAcidity CitricAcid ResidualSugar Chlorides
1      3          3.2           1.160      -0.98          54.2    -0.567
2      3          4.5           0.160      -0.81          26.1    -0.425
3      5          7.1           2.640      -0.88          14.8     0.037
4      3          5.7           0.385       0.04          18.8    -0.425
5      4          8.0           0.330      -1.26           9.4        NA
6      0         11.3           0.320       0.59           2.2     0.556
  FreeSulfurDioxide TotalSulfurDioxide Density   pH Sulphates Alcohol
1                NA                268 0.99280 3.33     -0.59     9.9
2                15               -327 1.02792 3.38      0.70      NA
3               214                142 0.99518 3.12      0.48    22.0
4                22                115 0.99640 2.24      1.83     6.2
5              -167                108 0.99457 3.12      1.77    13.7
6               -37                 15 0.99940 3.20      1.29    15.4
  LabelAppeal AcidIndex STARS
1           0         8     2
2          -1         7     3
3          -1         8     3
4          -1         6     1
5           0         9     2
6           0        11    NA

Code

head(wine_test)

  TARGET FixedAcidity VolatileAcidity CitricAcid ResidualSugar Chlorides
1     NA          5.4          -0.860       0.27         -10.7     0.092
2     NA         12.4           0.385      -0.76         -19.7     1.169
3     NA          7.2           1.750       0.17         -33.0     0.065
4     NA          6.2           0.100       1.80           1.0    -0.179
5     NA         11.4           0.210       0.28           1.2     0.038
6     NA         17.6           0.040      -1.15           1.4     0.535
  FreeSulfurDioxide TotalSulfurDioxide Density   pH Sulphates Alcohol
1                23                398 0.98527 5.02      0.64   12.30
2               -37                 68 0.99048 3.37      1.09   16.00
3                 9                 76 1.04641 4.61      0.68    8.55
4               104                 89 0.98877 3.20      2.11   12.30
5                70                 53 1.02899 2.54     -0.07    4.80
6              -250                140 0.95028 3.06     -0.02   11.40
  LabelAppeal AcidIndex STARS
1          -1         6    NA
2           0         6     2
3           0         8     1
4          -1         8     1
5           0        10    NA
6           1         8     4

Code

skim(wine_test)

Data summary
Name	wine_test
Number of rows	3335
Number of columns	15
_______________________
Column type frequency:
logical	1
numeric	14
________________________
Group variables	None

Variable type: logical

skim_variable	n_missing	complete_rate	mean	count
TARGET	3335	0	NaN	:

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
FixedAcidity	0	1.00	6.86	6.32	-18.20	5.20	6.90	9.00	33.50	▁▂▇▂▁
VolatileAcidity	0	1.00	0.31	0.81	-2.83	0.08	0.28	0.63	3.61	▁▂▇▂▁
CitricAcid	0	1.00	0.31	0.87	-3.12	0.00	0.31	0.60	3.76	▁▂▇▂▁
ResidualSugar	168	0.95	5.32	34.37	-128.30	-2.60	3.60	17.20	145.40	▁▂▇▂▁
Chlorides	138	0.96	0.06	0.31	-1.15	0.02	0.05	0.17	1.26	▁▂▇▂▁
FreeSulfurDioxide	152	0.95	34.95	149.63	-563.00	3.00	30.00	79.25	617.00	▁▂▇▂▁
TotalSulfurDioxide	157	0.95	123.41	225.80	-769.00	27.25	124.00	210.00	1004.00	▁▂▇▂▁
Density	0	1.00	0.99	0.03	0.89	0.99	0.99	1.00	1.10	▁▂▇▂▁
pH	104	0.97	3.24	0.68	0.60	2.98	3.21	3.49	6.21	▁▂▇▂▁
Sulphates	310	0.91	0.53	0.91	-3.07	0.33	0.50	0.82	4.18	▁▂▇▂▁
Alcohol	185	0.94	10.58	3.76	-4.20	9.00	10.40	12.50	25.60	▁▂▇▂▁
LabelAppeal	0	1.00	0.01	0.89	-2.00	-1.00	0.00	1.00	2.00	▁▅▇▅▁
AcidIndex	0	1.00	7.75	1.32	5.00	7.00	8.00	8.00	17.00	▇▇▁▁▁
STARS	841	0.75	2.04	0.91	1.00	1.00	2.00	3.00	4.00	▇▇▁▅▂

DATA EXPLORATION

Missing/Summary Review

Code

skim(wine_train)

Data summary
Name	wine_train
Number of rows	12795
Number of columns	15
_______________________
Column type frequency:
numeric	15
________________________
Group variables	None

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
TARGET	0	1.00	3.03	1.93	0.00	2.00	3.00	4.00	8.00	▆▇▇▆▁
FixedAcidity	0	1.00	7.08	6.32	-18.10	5.20	6.90	9.50	34.40	▁▂▇▂▁
VolatileAcidity	0	1.00	0.32	0.78	-2.79	0.13	0.28	0.64	3.68	▁▂▇▂▁
CitricAcid	0	1.00	0.31	0.86	-3.24	0.03	0.31	0.58	3.86	▁▂▇▂▁
ResidualSugar	616	0.95	5.42	33.75	-127.80	-2.00	3.90	15.90	141.15	▁▂▇▂▁
Chlorides	638	0.95	0.05	0.32	-1.17	-0.03	0.05	0.15	1.35	▁▂▇▂▁
FreeSulfurDioxide	647	0.95	30.85	148.71	-555.00	0.00	30.00	70.00	623.00	▁▂▇▂▁
TotalSulfurDioxide	682	0.95	120.71	231.91	-823.00	27.00	123.00	208.00	1057.00	▁▂▇▂▁
Density	0	1.00	0.99	0.03	0.89	0.99	0.99	1.00	1.10	▁▂▇▂▁
pH	395	0.97	3.21	0.68	0.48	2.96	3.20	3.47	6.13	▁▂▇▂▁
Sulphates	1210	0.91	0.53	0.93	-3.13	0.28	0.50	0.86	4.24	▁▂▇▂▁
Alcohol	653	0.95	10.49	3.73	-4.70	9.00	10.40	12.40	26.50	▁▂▇▂▁
LabelAppeal	0	1.00	-0.01	0.89	-2.00	-1.00	0.00	1.00	2.00	▁▅▇▅▁
AcidIndex	0	1.00	7.77	1.32	4.00	7.00	8.00	8.00	17.00	▁▇▁▁▁
STARS	3359	0.74	2.04	0.90	1.00	1.00	2.00	3.00	4.00	▇▇▁▅▂

Code

skim(wine_test)

Data summary
Name	wine_test
Number of rows	3335
Number of columns	15
_______________________
Column type frequency:
logical	1
numeric	14
________________________
Group variables	None

Variable type: logical

skim_variable	n_missing	complete_rate	mean	count
TARGET	3335	0	NaN	:

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
FixedAcidity	0	1.00	6.86	6.32	-18.20	5.20	6.90	9.00	33.50	▁▂▇▂▁
VolatileAcidity	0	1.00	0.31	0.81	-2.83	0.08	0.28	0.63	3.61	▁▂▇▂▁
CitricAcid	0	1.00	0.31	0.87	-3.12	0.00	0.31	0.60	3.76	▁▂▇▂▁
ResidualSugar	168	0.95	5.32	34.37	-128.30	-2.60	3.60	17.20	145.40	▁▂▇▂▁
Chlorides	138	0.96	0.06	0.31	-1.15	0.02	0.05	0.17	1.26	▁▂▇▂▁
FreeSulfurDioxide	152	0.95	34.95	149.63	-563.00	3.00	30.00	79.25	617.00	▁▂▇▂▁
TotalSulfurDioxide	157	0.95	123.41	225.80	-769.00	27.25	124.00	210.00	1004.00	▁▂▇▂▁
Density	0	1.00	0.99	0.03	0.89	0.99	0.99	1.00	1.10	▁▂▇▂▁
pH	104	0.97	3.24	0.68	0.60	2.98	3.21	3.49	6.21	▁▂▇▂▁
Sulphates	310	0.91	0.53	0.91	-3.07	0.33	0.50	0.82	4.18	▁▂▇▂▁
Alcohol	185	0.94	10.58	3.76	-4.20	9.00	10.40	12.50	25.60	▁▂▇▂▁
LabelAppeal	0	1.00	0.01	0.89	-2.00	-1.00	0.00	1.00	2.00	▁▅▇▅▁
AcidIndex	0	1.00	7.75	1.32	5.00	7.00	8.00	8.00	17.00	▇▇▁▁▁
STARS	841	0.75	2.04	0.91	1.00	1.00	2.00	3.00	4.00	▇▇▁▅▂

Visualizations

Code

# histogram
wine_train %>% 
  dplyr::select(-c("AcidIndex", "STARS", "TARGET", "LabelAppeal")) %>% 
  gather() %>% 
  ggplot(aes(value)) +
  facet_wrap(~key, scale = "free",  ncol = 3) +
  geom_histogram(binwidth = function(x) 2 * IQR(x) / (length(x)^(1/3)), fill="light blue") +
  theme_minimal()

Code

wine_train %>% 
  ggplot(aes(x=TARGET)) + 
  geom_histogram(fill='light blue')

Code

# Prepare data for ggplot
wine_df <- wine_train %>% 
  gather(key = 'variable', value = 'value')
# Boxplots for each variable
ggplot(wine_df, aes(variable, value)) + 
  geom_boxplot() + 
  facet_wrap(. ~variable, scales='free', ncol=6)

Code

wine_factor <- wine_train %>% 
  dplyr::select(TARGET, STARS, LabelAppeal, AcidIndex) %>%
  pivot_longer(cols = -TARGET, names_to="variable", values_to="value") %>%
  arrange(variable, value)

wine_factor %>% 
  ggplot(mapping = aes(x = factor(value), y = TARGET)) +
    geom_boxplot() + 
    facet_wrap(.~variable, scales="free") +
    theme_bw() +
    theme(axis.text.x = element_text(angle = 90))

Code

#if(!require(devtools)) install.packages("devtools")
#devtools::install_github("kassambara/ggpubr")


library(ggpubr)

# barchart
w3 <- wine_train %>% 
  dplyr::select(TARGET, STARS) %>% 
  mutate(STARS = as.factor(STARS),
         TARGET = as.factor(TARGET)) %>% 
  ggplot(aes(STARS)) +
  geom_bar(aes(fill = TARGET)) +
  theme_minimal()
w4 <- wine_train %>%
  dplyr::select(TARGET, LabelAppeal) %>% 
  mutate(STARS = as.factor(LabelAppeal),
         TARGET = as.factor(TARGET)) %>% 
  ggplot(aes(LabelAppeal)) +
  geom_bar(aes(fill = TARGET)) +
  theme_minimal()
w5 <- wine_train %>% 
  dplyr::select(TARGET, AcidIndex) %>% 
  mutate(STARS = as.factor(AcidIndex),
         TARGET = as.factor(TARGET)) %>% 
  ggplot(aes(AcidIndex)) +
  geom_bar(aes(fill = TARGET)) +
  theme_minimal()
ggarrange(w5, ggarrange(w3, w4, ncol = 2, nrow = 1, legend = "none"), nrow = 2, common.legend = TRUE)

Code

# top correlation
wtrain_corr <- wine_train %>% 
  drop_na() %>% 
  cor()
kable(sort(wtrain_corr[,1], decreasing = T), col.names = c("Correlation")) %>% 
  kable_styling(full_width = F)

	Correlation
TARGET	1.0000000
STARS	0.5546857
LabelAppeal	0.4979465
Alcohol	0.0737771
FreeSulfurDioxide	0.0226398
TotalSulfurDioxide	0.0216021
ResidualSugar	0.0035196
CitricAcid	0.0023450
pH	0.0002199
FixedAcidity	-0.0125381
Sulphates	-0.0212204
Chlorides	-0.0304301
Density	-0.0475989
VolatileAcidity	-0.0759979
AcidIndex	-0.1676431

Code

library(VIM)
aggr(wine_train, 
     sortVars=TRUE, 
     labels=names(wine_train), 
     cex.axis=.5, 
     bars = FALSE, 
     col = c("white", "#E46726"),
     combined = TRUE,
     #border = NA,
     ylab = "Missing Values")


 Variables sorted by number of missings: 
           Variable Count
              STARS  3359
          Sulphates  1210
 TotalSulfurDioxide   682
            Alcohol   653
  FreeSulfurDioxide   647
          Chlorides   638
      ResidualSugar   616
                 pH   395
             TARGET     0
       FixedAcidity     0
    VolatileAcidity     0
         CitricAcid     0
            Density     0
        LabelAppeal     0
          AcidIndex     0

Code

library(corrplot)
library(RColorBrewer)
# correlation plot
corrplot(wtrain_corr, 
         method = "number", 
         type = "lower",
         col = brewer.pal(n = 15, name = "Reds"),
         number.cex = .7, tl.cex = .7,
         tl.col = "black", tl.srt = 45)

Code

corrplot(wtrain_corr, tl.cex = .9, tl.col="black", method = 'color', addCoef.col = "black",
         type="upper", order="hclust", number.cex=0.6, diag=FALSE)

DATA PREPARATION

Code

library(VIM)
# missing value columns
aggr(wine_train, 
     sortVars=TRUE, 
     labels=names(wine_train), 
     cex.axis=.5, 
     bars = FALSE, 
     col = c("white", "light blue"),
     combined = TRUE,
     #border = NA,
     ylab = "Missing Values")


 Variables sorted by number of missings: 
           Variable Count
              STARS  3359
          Sulphates  1210
 TotalSulfurDioxide   682
            Alcohol   653
  FreeSulfurDioxide   647
          Chlorides   638
      ResidualSugar   616
                 pH   395
             TARGET     0
       FixedAcidity     0
    VolatileAcidity     0
         CitricAcid     0
            Density     0
        LabelAppeal     0
          AcidIndex     0

Code

library(VIM)
# missing value columns
aggr(wine_test, 
     sortVars=TRUE, 
     labels=names(wine_test), 
     cex.axis=.5, 
     bars = FALSE, 
     col = c("white", "light blue"),
     combined = TRUE,
     #border = NA,
     ylab = "Missing Values")


 Variables sorted by number of missings: 
           Variable Count
             TARGET  3335
              STARS   841
          Sulphates   310
            Alcohol   185
      ResidualSugar   168
 TotalSulfurDioxide   157
  FreeSulfurDioxide   152
          Chlorides   138
                 pH   104
       FixedAcidity     0
    VolatileAcidity     0
         CitricAcid     0
            Density     0
        LabelAppeal     0
          AcidIndex     0

Code

library(mice)
# imputing train data 

wimputetrain <- mice(wine_train)


 iter imp variable
  1   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  1   2  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  1   3  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  1   4  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  1   5  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  2   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  2   2  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  2   3  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  2   4  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  2   5  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  3   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  3   2  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  3   3  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  3   4  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  3   5  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  4   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  4   2  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  4   3  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  4   4  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  4   5  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  5   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  5   2  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  5   3  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  5   4  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  5   5  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS

Code

meth <- wimputetrain$method
predM <- wimputetrain$predictorMatrix
predM[, c("TARGET")] <- 0 
final_impute <- mice(wine_train, method = 'rf', predictorMatrix=predM)


 iter imp variable
  1   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  1   2  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  1   3  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  1   4  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  1   5  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  2   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  2   2  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  2   3  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  2   4  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  2   5  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  3   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  3   2  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  3   3  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  3   4  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  3   5  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  4   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  4   2  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  4   3  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  4   4  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  4   5  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  5   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  5   2  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  5   3  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  5   4  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  5   5  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS

Code

final_impute1 <- complete(final_impute)
print(paste0("Missing value after imputation: ", sum(is.na(final_impute1))))

[1] "Missing value after imputation: 0"

Code

aggr(final_impute1, 
     sortVars=TRUE, 
     labels=names(wine_train), 
     cex.axis=.5, 
     bars = FALSE, 
     col = c("white", "light blue"),
     combined = TRUE,
     #border = NA,
     ylab = "Missing Values")


 Variables sorted by number of missings: 
           Variable Count
             TARGET     0
       FixedAcidity     0
    VolatileAcidity     0
         CitricAcid     0
      ResidualSugar     0
          Chlorides     0
  FreeSulfurDioxide     0
 TotalSulfurDioxide     0
            Density     0
                 pH     0
          Sulphates     0
            Alcohol     0
        LabelAppeal     0
          AcidIndex     0
              STARS     0

Code

# histogram
final_impute1 %>% 
  dplyr::select(-c("AcidIndex", "STARS", "TARGET", "LabelAppeal")) %>% 
  gather() %>% 
  ggplot(aes(value)) +
  facet_wrap(~key, scale = "free",  ncol = 3) +
  geom_histogram(binwidth = function(x) 2 * IQR(x) / (length(x)^(1/3)), fill="light blue") +
  theme_minimal()

Code

# wine_train$ResidualSugar[is.na(wine_train$ResidualSugar)] <- median(wine_train$ResidualSugar, na.rm=TRUE)
# wine_train$Chlorides[is.na(wine_train$Chlorides)] <- median(wine_train$Chlorides, na.rm=TRUE)
# wine_train$FreeSulfurDioxide[is.na(wine_train$FreeSulfurDioxide)] <- median(wine_train$FreeSulfurDioxide, na.rm=TRUE)
# wine_train$TotalSulfurDioxide[is.na(wine_train$TotalSulfurDioxide)] <- median(wine_train$TotalSulfurDioxide, na.rm=TRUE)
# wine_train$pH[is.na(wine_train$pH)] <- median(wine_train$pH, na.rm=TRUE)
# wine_train$Sulphates[is.na(wine_train$Sulphates)] <- median(wine_train$Sulphates, na.rm=TRUE)
# wine_train$Alcohol[is.na(wine_train$Alcohol)] <- median(wine_train$Alcohol, na.rm=TRUE)
# wine_train$STARS[is.na(wine_train$STARS)] <- median(wine_train$STARS, na.rm=TRUE)

BUILD MODELS

Using the training data set, build at least two different poisson regression models, at least two different negative binomial regression models, and at least two multiple linear regression models, using different variables (or the same variables with different transformations).
Sometimes poisson and negative binomial regression models give the same results. If that is the case, comment on that. Consider changing the input variables if that occurs so that you get different models.
Although not covered in class, you may also want to consider building zero-inflated poisson and negative binomial regression models. You may select the variables manually, use an approach such as Forward or Stepwise, use a different approach such as trees, or use a combination of techniques.
Describe the techniques you used. If you manually selected a variable for inclusion into the model or exclusion into the model, indicate why this was done.
Discuss the coefficients in the models, do they make sense? In this case, about the only thing you can comment on is the number of stars and the wine label appeal.
However, you might comment on the coefficient and magnitude of variables and how they are similar or different from model to model. For example, you might say “pH seems to have a major positive impact in my poisson regression model, but a negative effect in my multiple linear regression model”.
Are you keeping the model even though it is counter intuitive? Why? The boss needs to know.

Code

r2glm <- function(model) {

  summaryLog <- summary(model)
  1 - summaryLog$deviance / summaryLog$null.deviance

}

Code

set.seed(100)
model_metrics <- function(model) {

  data.frame("RSE" = model$sigma,
             "Adj R2" = model$adj.r.squared,
             "F-Statistic" = model$fstatistic[1])
  
  performance <- data.frame("RSE" = model$sigma,
                                    "Adj R2" = model$adj.r.squared,
                                    "F-Statistic" = model$fstatistic[1])
  return(performance)
}

Code

set.seed(100)
#poisson models with all variables
model1 <- glm(TARGET ~ ., family = poisson, wine_train)
summary(model1)


Call:
glm(formula = TARGET ~ ., family = poisson, data = wine_train)

Coefficients:
                     Estimate Std. Error z value Pr(>|z|)    
(Intercept)         1.593e+00  2.506e-01   6.359 2.03e-10 ***
FixedAcidity        3.293e-04  1.053e-03   0.313  0.75447    
VolatileAcidity    -2.560e-02  8.353e-03  -3.065  0.00218 ** 
CitricAcid         -7.259e-04  7.575e-03  -0.096  0.92365    
ResidualSugar      -6.141e-05  1.941e-04  -0.316  0.75165    
Chlorides          -3.007e-02  2.056e-02  -1.463  0.14346    
FreeSulfurDioxide   6.734e-05  4.404e-05   1.529  0.12620    
TotalSulfurDioxide  2.081e-05  2.855e-05   0.729  0.46618    
Density            -3.725e-01  2.462e-01  -1.513  0.13026    
pH                 -4.661e-03  9.598e-03  -0.486  0.62722    
Sulphates          -5.164e-03  7.051e-03  -0.732  0.46398    
Alcohol             3.948e-03  1.771e-03   2.229  0.02579 *  
LabelAppeal         1.771e-01  7.954e-03  22.271  < 2e-16 ***
AcidIndex          -4.870e-02  5.903e-03  -8.251  < 2e-16 ***
STARS               1.871e-01  7.487e-03  24.993  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 5844.1  on 6435  degrees of freedom
Residual deviance: 4009.1  on 6421  degrees of freedom
  (6359 observations deleted due to missingness)
AIC: 23172

Number of Fisher Scoring iterations: 5

Code

r2glm(model1)

[1] 0.3139796

Code

anova(model1, test = "Chisq")

Analysis of Deviance Table

Model: poisson, link: log

Response: TARGET

Terms added sequentially (first to last)

                   Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
NULL                                6435     5844.1              
FixedAcidity        1     0.66      6434     5843.4  0.416923    
VolatileAcidity     1    24.07      6433     5819.3 9.305e-07 ***
CitricAcid          1     0.00      6432     5819.3  0.961864    
ResidualSugar       1     0.05      6431     5819.3  0.825953    
Chlorides           1     3.62      6430     5815.7  0.057171 .  
FreeSulfurDioxide   1     1.90      6429     5813.8  0.167654    
TotalSulfurDioxide  1     1.85      6428     5811.9  0.173597    
Density             1     8.95      6427     5803.0  0.002776 ** 
pH                  1     0.00      6426     5803.0  0.992117    
Sulphates           1     1.97      6425     5801.0  0.160678    
Alcohol             1    22.97      6424     5778.0 1.641e-06 ***
LabelAppeal         1  1031.50      6423     4746.5 < 2.2e-16 ***
AcidIndex           1   118.65      6422     4627.9 < 2.2e-16 ***
STARS               1   618.72      6421     4009.1 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Code

plot(model1)

Code

#install.packages("faraway")
library("faraway")

Code

set.seed(100)
print('Goodness of Fit Test:')

[1] "Goodness of Fit Test:"

Code

with(model1, cbind(res.deviance = deviance, df = df.residual,  p = pchisq(deviance, df.residual, lower.tail=FALSE)))

     res.deviance   df p
[1,]     4009.142 6421 1

Code

model_metrics(model1)

data frame with 0 columns and 0 rows

Code

set.seed(100)
# poisson model with the imputed values
model2 <- glm(TARGET ~ ., family = poisson, final_impute1)
summary(model2)


Call:
glm(formula = TARGET ~ ., family = poisson, data = final_impute1)

Coefficients:
                     Estimate Std. Error z value Pr(>|z|)    
(Intercept)         2.153e+00  1.955e-01  11.012  < 2e-16 ***
FixedAcidity       -5.151e-04  8.197e-04  -0.628 0.529739    
VolatileAcidity    -5.110e-02  6.488e-03  -7.876 3.37e-15 ***
CitricAcid          1.366e-02  5.896e-03   2.318 0.020467 *  
ResidualSugar       1.757e-04  1.507e-04   1.166 0.243545    
Chlorides          -6.050e-02  1.610e-02  -3.759 0.000171 ***
FreeSulfurDioxide   1.395e-04  3.419e-05   4.080 4.51e-05 ***
TotalSulfurDioxide  1.058e-04  2.202e-05   4.806 1.54e-06 ***
Density            -4.637e-01  1.920e-01  -2.415 0.015744 *  
pH                 -2.507e-02  7.519e-03  -3.335 0.000854 ***
Sulphates          -1.266e-02  5.470e-03  -2.315 0.020593 *  
Alcohol             5.042e-03  1.371e-03   3.677 0.000236 ***
LabelAppeal         1.956e-01  6.108e-03  32.018  < 2e-16 ***
AcidIndex          -1.227e-01  4.465e-03 -27.488  < 2e-16 ***
STARS               1.764e-01  5.809e-03  30.378  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 22861  on 12794  degrees of freedom
Residual deviance: 18649  on 12780  degrees of freedom
AIC: 50621

Number of Fisher Scoring iterations: 5

Code

plot(model2)

Code

halfnorm(residuals(model2))

Code

plot(log(fitted(model2)),log((final_impute1$TARGET-fitted(model2))^2), 
     xlab=expression(hat(mu)),ylab=expression((y-hat(mu))^2))
abline(0,1)

Code

(dp<- sum(residuals(model2,type="pearson")^2)/model2$df.res)

[1] 0.9634397

Code

summary(model2,dispersion=dp)


Call:
glm(formula = TARGET ~ ., family = poisson, data = final_impute1)

Coefficients:
                     Estimate Std. Error z value Pr(>|z|)    
(Intercept)         2.153e+00  1.919e-01  11.219  < 2e-16 ***
FixedAcidity       -5.151e-04  8.046e-04  -0.640 0.522032    
VolatileAcidity    -5.110e-02  6.368e-03  -8.024 1.02e-15 ***
CitricAcid          1.366e-02  5.787e-03   2.361 0.018214 *  
ResidualSugar       1.757e-04  1.479e-04   1.188 0.234797    
Chlorides          -6.050e-02  1.580e-02  -3.829 0.000129 ***
FreeSulfurDioxide   1.395e-04  3.356e-05   4.156 3.23e-05 ***
TotalSulfurDioxide  1.058e-04  2.161e-05   4.896 9.77e-07 ***
Density            -4.637e-01  1.885e-01  -2.460 0.013886 *  
pH                 -2.507e-02  7.380e-03  -3.397 0.000680 ***
Sulphates          -1.266e-02  5.369e-03  -2.359 0.018330 *  
Alcohol             5.042e-03  1.346e-03   3.746 0.000180 ***
LabelAppeal         1.956e-01  5.995e-03  32.620  < 2e-16 ***
AcidIndex          -1.227e-01  4.382e-03 -28.005  < 2e-16 ***
STARS               1.764e-01  5.701e-03  30.949  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 0.9634397)

    Null deviance: 22861  on 12794  degrees of freedom
Residual deviance: 18649  on 12780  degrees of freedom
AIC: 50621

Number of Fisher Scoring iterations: 5

Code

set.seed(100)
print('Goodness of Fit Test:')

[1] "Goodness of Fit Test:"

Code

with(model2, cbind(res.deviance = deviance, df = df.residual,  p = pchisq(deviance, df.residual, lower.tail=FALSE)))

     res.deviance    df             p
[1,]     18648.82 12780 2.330118e-228

Code

model_metrics(model2)

data frame with 0 columns and 0 rows

Code

r2glm(model2)

[1] 0.1842479

Code

anova(model2, test = "Chisq")

Analysis of Deviance Table

Model: poisson, link: log

Response: TARGET

Terms added sequentially (first to last)

                   Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
NULL                               12794      22861              
FixedAcidity        1    37.64     12793      22823 8.527e-10 ***
VolatileAcidity     1   121.94     12792      22701 < 2.2e-16 ***
CitricAcid          1     0.96     12791      22700 0.3275400    
ResidualSugar       1     3.94     12790      22696 0.0470345 *  
Chlorides           1    20.12     12789      22676 7.255e-06 ***
FreeSulfurDioxide   1    23.83     12788      22653 1.053e-06 ***
TotalSulfurDioxide  1    33.27     12787      22619 8.018e-09 ***
Density             1    18.07     12786      22601 2.124e-05 ***
pH                  1     1.47     12785      22600 0.2253360    
Sulphates           1    13.36     12784      22586 0.0002573 ***
Alcohol             1    55.95     12783      22530 7.442e-14 ***
LabelAppeal         1  1978.17     12782      20552 < 2.2e-16 ***
AcidIndex           1   991.12     12781      19561 < 2.2e-16 ***
STARS               1   912.23     12780      18649 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Code

set.seed(100)
model3 <- glm(TARGET ~ ., family = quasipoisson(link='log'), final_impute1)
summary(model3)


Call:
glm(formula = TARGET ~ ., family = quasipoisson(link = "log"), 
    data = final_impute1)

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)         2.153e+00  1.919e-01  11.219  < 2e-16 ***
FixedAcidity       -5.151e-04  8.046e-04  -0.640 0.522044    
VolatileAcidity    -5.110e-02  6.368e-03  -8.024 1.11e-15 ***
CitricAcid          1.366e-02  5.787e-03   2.361 0.018229 *  
ResidualSugar       1.757e-04  1.479e-04   1.188 0.234820    
Chlorides          -6.050e-02  1.580e-02  -3.829 0.000129 ***
FreeSulfurDioxide   1.395e-04  3.356e-05   4.156 3.25e-05 ***
TotalSulfurDioxide  1.058e-04  2.161e-05   4.896 9.89e-07 ***
Density            -4.637e-01  1.885e-01  -2.460 0.013900 *  
pH                 -2.507e-02  7.380e-03  -3.397 0.000682 ***
Sulphates          -1.266e-02  5.369e-03  -2.359 0.018345 *  
Alcohol             5.042e-03  1.346e-03   3.746 0.000180 ***
LabelAppeal         1.956e-01  5.995e-03  32.620  < 2e-16 ***
AcidIndex          -1.227e-01  4.382e-03 -28.005  < 2e-16 ***
STARS               1.764e-01  5.701e-03  30.949  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for quasipoisson family taken to be 0.9634428)

    Null deviance: 22861  on 12794  degrees of freedom
Residual deviance: 18649  on 12780  degrees of freedom
AIC: NA

Number of Fisher Scoring iterations: 5

Code

plot(model3)

Code

drop1(model3,test="F")

Single term deletions

Model:
TARGET ~ FixedAcidity + VolatileAcidity + CitricAcid + ResidualSugar + 
    Chlorides + FreeSulfurDioxide + TotalSulfurDioxide + Density + 
    pH + Sulphates + Alcohol + LabelAppeal + AcidIndex + STARS
                   Df Deviance  F value    Pr(>F)    
<none>                   18649                       
FixedAcidity        1    18649   0.2706 0.6029343    
VolatileAcidity     1    18711  42.4910 7.368e-11 ***
CitricAcid          1    18654   3.6814 0.0550449 .  
ResidualSugar       1    18650   0.9321 0.3343401    
Chlorides           1    18663   9.6817 0.0018652 ** 
FreeSulfurDioxide   1    18666  11.4049 0.0007347 ***
TotalSulfurDioxide  1    18672  15.8225 6.995e-05 ***
Density             1    18655   3.9953 0.0456492 *  
pH                  1    18660   7.6213 0.0057765 ** 
Sulphates           1    18654   3.6737 0.0553011 .  
Alcohol             1    18662   9.2670 0.0023379 ** 
LabelAppeal         1    19674 702.6412 < 2.2e-16 ***
AcidIndex           1    19457 553.6452 < 2.2e-16 ***
STARS               1    19561 625.1480 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Code

model_metrics(model3)

data frame with 0 columns and 0 rows

Code

r2glm(model3)

[1] 0.1842479

Code

anova(model3, test = "Chisq")

Analysis of Deviance Table

Model: quasipoisson, link: log

Response: TARGET

Terms added sequentially (first to last)

                   Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
NULL                               12794      22861              
FixedAcidity        1    37.64     12793      22823 4.102e-10 ***
VolatileAcidity     1   121.94     12792      22701 < 2.2e-16 ***
CitricAcid          1     0.96     12791      22700 0.3185300    
ResidualSugar       1     3.94     12790      22696 0.0430405 *  
Chlorides           1    20.12     12789      22676 4.868e-06 ***
FreeSulfurDioxide   1    23.83     12788      22653 6.588e-07 ***
TotalSulfurDioxide  1    33.27     12787      22619 4.190e-09 ***
Density             1    18.07     12786      22601 1.482e-05 ***
pH                  1     1.47     12785      22600 0.2167376    
Sulphates           1    13.36     12784      22586 0.0001964 ***
Alcohol             1    55.95     12783      22530 2.529e-14 ***
LabelAppeal         1  1978.17     12782      20552 < 2.2e-16 ***
AcidIndex           1   991.12     12781      19561 < 2.2e-16 ***
STARS               1   912.23     12780      18649 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Code

set.seed(100)
print('Goodness of Fit Test:')

[1] "Goodness of Fit Test:"

Code

with(model3, cbind(res.deviance = deviance, df = df.residual,  p = pchisq(deviance, df.residual, lower.tail=FALSE)))

     res.deviance    df             p
[1,]     18648.82 12780 2.330118e-228

Code

set.seed(100)
model4 <- glm(TARGET ~ ., family = negative.binomial(1), final_impute1)
summary(model4)


Call:
glm(formula = TARGET ~ ., family = negative.binomial(1), data = final_impute1)

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)         2.267e+00  2.066e-01  10.976  < 2e-16 ***
FixedAcidity       -4.834e-04  8.664e-04  -0.558 0.576935    
VolatileAcidity    -5.736e-02  6.880e-03  -8.336  < 2e-16 ***
CitricAcid          1.257e-02  6.255e-03   2.010 0.044504 *  
ResidualSugar       2.333e-04  1.596e-04   1.462 0.143803    
Chlorides          -6.870e-02  1.700e-02  -4.041 5.35e-05 ***
FreeSulfurDioxide   1.555e-04  3.620e-05   4.297 1.75e-05 ***
TotalSulfurDioxide  1.203e-04  2.323e-05   5.179 2.26e-07 ***
Density            -4.729e-01  2.031e-01  -2.328 0.019924 *  
pH                 -3.062e-02  7.942e-03  -3.855 0.000116 ***
Sulphates          -1.634e-02  5.784e-03  -2.826 0.004726 ** 
Alcohol             4.712e-03  1.450e-03   3.250 0.001158 ** 
LabelAppeal         1.991e-01  6.440e-03  30.919  < 2e-16 ***
AcidIndex          -1.373e-01  4.398e-03 -31.223  < 2e-16 ***
STARS               1.904e-01  6.321e-03  30.120  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Negative Binomial(1) family taken to be 0.2722089)

    Null deviance: 9042.5  on 12794  degrees of freedom
Residual deviance: 7931.5  on 12780  degrees of freedom
AIC: 56693

Number of Fisher Scoring iterations: 5

Code

print('Goodness of Fit Test:')

[1] "Goodness of Fit Test:"

Code

with(model4, cbind(res.deviance = deviance, df = df.residual,  p = pchisq(deviance, df.residual, lower.tail=FALSE)))

     res.deviance    df p
[1,]     7931.463 12780 1

Code

model_metrics(model4)

data frame with 0 columns and 0 rows

Code

model4


Call:  glm(formula = TARGET ~ ., family = negative.binomial(1), data = final_impute1)

Coefficients:
       (Intercept)        FixedAcidity     VolatileAcidity          CitricAcid  
         2.2670936          -0.0004834          -0.0573578           0.0125690  
     ResidualSugar           Chlorides   FreeSulfurDioxide  TotalSulfurDioxide  
         0.0002333          -0.0686994           0.0001555           0.0001203  
           Density                  pH           Sulphates             Alcohol  
        -0.4729458          -0.0306159          -0.0163440           0.0047118  
       LabelAppeal           AcidIndex               STARS  
         0.1991308          -0.1373164           0.1904056  

Degrees of Freedom: 12794 Total (i.e. Null);  12780 Residual
Null Deviance:      9043 
Residual Deviance: 7931     AIC: 56690

Code

#khat = 35593 and standard error = 59638

model4nb <- glm.nb(TARGET ~ .,final_impute1)
summary(model4nb)


Call:
glm.nb(formula = TARGET ~ ., data = final_impute1, init.theta = 34870.30494, 
    link = log)

Coefficients:
                     Estimate Std. Error z value Pr(>|z|)    
(Intercept)         2.153e+00  1.955e-01  11.011  < 2e-16 ***
FixedAcidity       -5.151e-04  8.197e-04  -0.628 0.529762    
VolatileAcidity    -5.110e-02  6.488e-03  -7.876 3.38e-15 ***
CitricAcid          1.366e-02  5.896e-03   2.318 0.020475 *  
ResidualSugar       1.757e-04  1.507e-04   1.166 0.243545    
Chlorides          -6.050e-02  1.610e-02  -3.758 0.000171 ***
FreeSulfurDioxide   1.395e-04  3.419e-05   4.080 4.51e-05 ***
TotalSulfurDioxide  1.058e-04  2.202e-05   4.806 1.54e-06 ***
Density            -4.637e-01  1.920e-01  -2.415 0.015748 *  
pH                 -2.507e-02  7.519e-03  -3.335 0.000854 ***
Sulphates          -1.266e-02  5.470e-03  -2.315 0.020595 *  
Alcohol             5.042e-03  1.371e-03   3.677 0.000236 ***
LabelAppeal         1.956e-01  6.108e-03  32.017  < 2e-16 ***
AcidIndex          -1.227e-01  4.465e-03 -27.487  < 2e-16 ***
STARS               1.765e-01  5.809e-03  30.376  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Negative Binomial(34870.3) family taken to be 1)

    Null deviance: 22860  on 12794  degrees of freedom
Residual deviance: 18648  on 12780  degrees of freedom
AIC: 50623

Number of Fisher Scoring iterations: 1

              Theta:  34870 
          Std. Err.:  59563 
Warning while fitting theta: iteration limit reached 

 2 x log-likelihood:  -50590.95

Code

halfnorm(residuals(model4nb))

Code

plot(log(fitted(model4nb)),log((final_impute1$TARGET-fitted(model4nb))^2), 
     xlab=expression(hat(mu)),ylab=expression((y-hat(mu))^2))
abline(0,1)

Code

(dp<- sum(residuals(model4nb,type="pearson")^2)/model4nb$df.res)

[1] 0.963362

Code

summary(model4nb,dispersion=dp)


Call:
glm.nb(formula = TARGET ~ ., data = final_impute1, init.theta = 34870.30494, 
    link = log)

Coefficients:
                     Estimate Std. Error z value Pr(>|z|)    
(Intercept)         2.153e+00  1.919e-01  11.219  < 2e-16 ***
FixedAcidity       -5.151e-04  8.046e-04  -0.640 0.522038    
VolatileAcidity    -5.110e-02  6.368e-03  -8.024 1.02e-15 ***
CitricAcid          1.366e-02  5.787e-03   2.361 0.018216 *  
ResidualSugar       1.757e-04  1.479e-04   1.188 0.234778    
Chlorides          -6.050e-02  1.580e-02  -3.829 0.000129 ***
FreeSulfurDioxide   1.395e-04  3.356e-05   4.156 3.23e-05 ***
TotalSulfurDioxide  1.058e-04  2.161e-05   4.896 9.77e-07 ***
Density            -4.637e-01  1.885e-01  -2.460 0.013887 *  
pH                 -2.507e-02  7.380e-03  -3.397 0.000680 ***
Sulphates          -1.266e-02  5.369e-03  -2.359 0.018327 *  
Alcohol             5.042e-03  1.346e-03   3.746 0.000180 ***
LabelAppeal         1.956e-01  5.995e-03  32.620  < 2e-16 ***
AcidIndex          -1.227e-01  4.382e-03 -28.005  < 2e-16 ***
STARS               1.765e-01  5.701e-03  30.949  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Negative Binomial(34870.3) family taken to be 0.963362)

    Null deviance: 22860  on 12794  degrees of freedom
Residual deviance: 18648  on 12780  degrees of freedom
AIC: 50623

Number of Fisher Scoring iterations: 1

              Theta:  34870 
          Std. Err.:  59563 
Warning while fitting theta: iteration limit reached 

 2 x log-likelihood:  -50590.95

Negative Binomial reduced (remove insignificant variables first)

Code

set.seed(100)
model5 <- glm(TARGET ~ . - FixedAcidity - CitricAcid - ResidualSugar - Density , family = negative.binomial(1), final_impute1)
summary(model5)


Call:
glm(formula = TARGET ~ . - FixedAcidity - CitricAcid - ResidualSugar - 
    Density, family = negative.binomial(1), data = final_impute1)

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)         1.800e+00  4.913e-02  36.628  < 2e-16 ***
VolatileAcidity    -5.805e-02  6.876e-03  -8.443  < 2e-16 ***
Chlorides          -7.013e-02  1.699e-02  -4.129 3.67e-05 ***
FreeSulfurDioxide   1.559e-04  3.617e-05   4.311 1.64e-05 ***
TotalSulfurDioxide  1.204e-04  2.321e-05   5.189 2.14e-07 ***
pH                 -3.037e-02  7.938e-03  -3.826 0.000131 ***
Sulphates          -1.646e-02  5.779e-03  -2.849 0.004390 ** 
Alcohol             4.750e-03  1.449e-03   3.279 0.001044 ** 
LabelAppeal         1.994e-01  6.437e-03  30.975  < 2e-16 ***
AcidIndex          -1.376e-01  4.323e-03 -31.823  < 2e-16 ***
STARS               1.906e-01  6.318e-03  30.164  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Negative Binomial(1) family taken to be 0.2719912)

    Null deviance: 9042.5  on 12794  degrees of freedom
Residual deviance: 7934.7  on 12784  degrees of freedom
AIC: 56688

Number of Fisher Scoring iterations: 5

Code

print('Goodness of Fit Test:')

[1] "Goodness of Fit Test:"

Code

with(model5, cbind(res.deviance = deviance, df = df.residual,  p = pchisq(deviance, df.residual, lower.tail=FALSE)))

     res.deviance    df p
[1,]     7934.724 12784 1

Code

model_metrics(model5)

data frame with 0 columns and 0 rows

Code

halfnorm(residuals(model5))

Code

plot(log(fitted(model5)),log((final_impute1$TARGET-fitted(model5))^2), 
     xlab=expression(hat(mu)),ylab=expression((y-hat(mu))^2))
abline(0,1)

Code

(dp<- sum(residuals(model5,type="pearson")^2)/model5$df.res)

[1] 0.2719901

Code

summary(model5,dispersion=dp)


Call:
glm(formula = TARGET ~ . - FixedAcidity - CitricAcid - ResidualSugar - 
    Density, family = negative.binomial(1), data = final_impute1)

Coefficients:
                     Estimate Std. Error z value Pr(>|z|)    
(Intercept)         1.800e+00  4.913e-02  36.628  < 2e-16 ***
VolatileAcidity    -5.805e-02  6.876e-03  -8.443  < 2e-16 ***
Chlorides          -7.013e-02  1.699e-02  -4.129 3.65e-05 ***
FreeSulfurDioxide   1.559e-04  3.617e-05   4.311 1.63e-05 ***
TotalSulfurDioxide  1.204e-04  2.321e-05   5.189 2.11e-07 ***
pH                 -3.037e-02  7.938e-03  -3.826  0.00013 ***
Sulphates          -1.646e-02  5.779e-03  -2.849  0.00438 ** 
Alcohol             4.750e-03  1.449e-03   3.279  0.00104 ** 
LabelAppeal         1.994e-01  6.437e-03  30.975  < 2e-16 ***
AcidIndex          -1.376e-01  4.323e-03 -31.823  < 2e-16 ***
STARS               1.906e-01  6.318e-03  30.164  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Negative Binomial(1) family taken to be 0.2719901)

    Null deviance: 9042.5  on 12794  degrees of freedom
Residual deviance: 7934.7  on 12784  degrees of freedom
AIC: 56688

Number of Fisher Scoring iterations: 5

Negative Binomial reduced (remove additional significant variables)

Code

set.seed(100)
model6 <- glm(TARGET ~ . - FixedAcidity - CitricAcid - ResidualSugar - Density - VolatileAcidity - Chlorides - FreeSulfurDioxide - TotalSulfurDioxide , family = negative.binomial(1), final_impute1)
summary(model6)


Call:
glm(formula = TARGET ~ . - FixedAcidity - CitricAcid - ResidualSugar - 
    Density - VolatileAcidity - Chlorides - FreeSulfurDioxide - 
    TotalSulfurDioxide, family = negative.binomial(1), data = final_impute1)

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.818737   0.048649  37.385  < 2e-16 ***
pH          -0.030350   0.007909  -3.837 0.000125 ***
Sulphates   -0.015844   0.005760  -2.751 0.005952 ** 
Alcohol      0.004461   0.001443   3.091 0.002001 ** 
LabelAppeal  0.199783   0.006413  31.152  < 2e-16 ***
AcidIndex   -0.139989   0.004294 -32.597  < 2e-16 ***
STARS        0.191107   0.006291  30.376  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Negative Binomial(1) family taken to be 0.2704147)

    Null deviance: 9042.5  on 12794  degrees of freedom
Residual deviance: 7971.7  on 12788  degrees of freedom
AIC: 56717

Number of Fisher Scoring iterations: 5

Code

print('Goodness of Fit Test:')

[1] "Goodness of Fit Test:"

Code

with(model6, cbind(res.deviance = deviance, df = df.residual,  p = pchisq(deviance, df.residual, lower.tail=FALSE)))

     res.deviance    df p
[1,]     7971.736 12788 1

Code

model_metrics(model6)

data frame with 0 columns and 0 rows

Code

plot(model6)

Code

set.seed(100)

model7 <- lm(TARGET ~ FixedAcidity + VolatileAcidity + CitricAcid + ResidualSugar + Chlorides + FreeSulfurDioxide + TotalSulfurDioxide + Density + pH + Sulphates + Alcohol + as.factor(LabelAppeal) + as.factor(AcidIndex) +
as.factor(STARS),data=final_impute1)
summary(model7)


Call:
lm(formula = TARGET ~ FixedAcidity + VolatileAcidity + CitricAcid + 
    ResidualSugar + Chlorides + FreeSulfurDioxide + TotalSulfurDioxide + 
    Density + pH + Sulphates + Alcohol + as.factor(LabelAppeal) + 
    as.factor(AcidIndex) + as.factor(STARS), data = final_impute1)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.6149 -0.6851  0.3624  1.0981  4.8820 

Coefficients:
                           Estimate Std. Error t value Pr(>|t|)    
(Intercept)               1.716e+00  1.094e+00   1.569  0.11666    
FixedAcidity             -6.841e-04  2.334e-03  -0.293  0.76943    
VolatileAcidity          -1.493e-01  1.854e-02  -8.053 8.78e-16 ***
CitricAcid                3.785e-02  1.686e-02   2.244  0.02484 *  
ResidualSugar             4.296e-04  4.307e-04   0.997  0.31863    
Chlorides                -1.839e-01  4.576e-02  -4.019 5.89e-05 ***
FreeSulfurDioxide         3.895e-04  9.757e-05   3.991 6.61e-05 ***
TotalSulfurDioxide        2.792e-04  6.255e-05   4.465 8.09e-06 ***
Density                  -1.402e+00  5.472e-01  -2.562  0.01041 *  
pH                       -6.141e-02  2.139e-02  -2.872  0.00409 ** 
Sulphates                -3.158e-02  1.557e-02  -2.028  0.04258 *  
Alcohol                   1.976e-02  3.909e-03   5.056 4.34e-07 ***
as.factor(LabelAppeal)-1  5.641e-01  7.894e-02   7.146 9.40e-13 ***
as.factor(LabelAppeal)0   1.160e+00  7.711e-02  15.043  < 2e-16 ***
as.factor(LabelAppeal)1   1.730e+00  8.077e-02  21.419  < 2e-16 ***
as.factor(LabelAppeal)2   2.396e+00  1.062e-01  22.565  < 2e-16 ***
as.factor(AcidIndex)5     1.199e+00  9.665e-01   1.241  0.21476    
as.factor(AcidIndex)6     1.293e+00  9.491e-01   1.362  0.17311    
as.factor(AcidIndex)7     1.180e+00  9.483e-01   1.245  0.21328    
as.factor(AcidIndex)8     1.002e+00  9.485e-01   1.056  0.29078    
as.factor(AcidIndex)9     5.423e-01  9.492e-01   0.571  0.56783    
as.factor(AcidIndex)10   -9.396e-02  9.509e-01  -0.099  0.92129    
as.factor(AcidIndex)11   -7.273e-01  9.539e-01  -0.762  0.44583    
as.factor(AcidIndex)12   -9.939e-01  9.595e-01  -1.036  0.30028    
as.factor(AcidIndex)13   -3.783e-01  9.688e-01  -0.390  0.69619    
as.factor(AcidIndex)14   -4.739e-01  9.785e-01  -0.484  0.62820    
as.factor(AcidIndex)15    2.691e-01  1.112e+00   0.242  0.80875    
as.factor(AcidIndex)16   -1.093e+00  1.199e+00  -0.911  0.36232    
as.factor(AcidIndex)17   -1.538e+00  1.134e+00  -1.356  0.17509    
as.factor(STARS)2         7.323e-01  3.488e-02  20.996  < 2e-16 ***
as.factor(STARS)3         1.182e+00  4.137e-02  28.560  < 2e-16 ***
as.factor(STARS)4         1.657e+00  6.656e-02  24.897  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.638 on 12763 degrees of freedom
Multiple R-squared:  0.2784,    Adjusted R-squared:  0.2766 
F-statistic: 158.8 on 31 and 12763 DF,  p-value: < 2.2e-16

Code

sqrt(sum(model7$residuals^2) / model7$df.residual)

[1] 1.63838

Code

set.seed(100)
lm8 <- stepAIC(model7, direction = "both",
               scope = list(upper = model7, lower = ~ 1),
               scale = 0, trace = FALSE)
summary(lm8)


Call:
lm(formula = TARGET ~ VolatileAcidity + CitricAcid + Chlorides + 
    FreeSulfurDioxide + TotalSulfurDioxide + Density + pH + Sulphates + 
    Alcohol + as.factor(LabelAppeal) + as.factor(AcidIndex) + 
    as.factor(STARS), data = final_impute1)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.5993 -0.6856  0.3608  1.0993  4.8883 

Coefficients:
                           Estimate Std. Error t value Pr(>|t|)    
(Intercept)               1.746e+00  1.093e+00   1.597  0.11020    
VolatileAcidity          -1.495e-01  1.854e-02  -8.060 8.28e-16 ***
CitricAcid                3.776e-02  1.686e-02   2.239  0.02517 *  
Chlorides                -1.841e-01  4.576e-02  -4.022 5.79e-05 ***
FreeSulfurDioxide         3.908e-04  9.755e-05   4.006 6.20e-05 ***
TotalSulfurDioxide        2.809e-04  6.252e-05   4.492 7.10e-06 ***
Density                  -1.400e+00  5.471e-01  -2.559  0.01051 *  
pH                       -6.119e-02  2.138e-02  -2.862  0.00422 ** 
Sulphates                -3.176e-02  1.557e-02  -2.040  0.04134 *  
Alcohol                   1.969e-02  3.908e-03   5.039 4.74e-07 ***
as.factor(LabelAppeal)-1  5.647e-01  7.893e-02   7.154 8.87e-13 ***
as.factor(LabelAppeal)0   1.160e+00  7.710e-02  15.043  < 2e-16 ***
as.factor(LabelAppeal)1   1.731e+00  8.076e-02  21.428  < 2e-16 ***
as.factor(LabelAppeal)2   2.397e+00  1.062e-01  22.576  < 2e-16 ***
as.factor(AcidIndex)5     1.166e+00  9.659e-01   1.207  0.22736    
as.factor(AcidIndex)6     1.258e+00  9.484e-01   1.327  0.18464    
as.factor(AcidIndex)7     1.145e+00  9.476e-01   1.209  0.22685    
as.factor(AcidIndex)8     9.674e-01  9.477e-01   1.021  0.30738    
as.factor(AcidIndex)9     5.060e-01  9.484e-01   0.534  0.59367    
as.factor(AcidIndex)10   -1.314e-01  9.500e-01  -0.138  0.89001    
as.factor(AcidIndex)11   -7.653e-01  9.530e-01  -0.803  0.42193    
as.factor(AcidIndex)12   -1.033e+00  9.585e-01  -1.078  0.28107    
as.factor(AcidIndex)13   -4.182e-01  9.677e-01  -0.432  0.66565    
as.factor(AcidIndex)14   -5.129e-01  9.774e-01  -0.525  0.59980    
as.factor(AcidIndex)15    2.338e-01  1.111e+00   0.210  0.83330    
as.factor(AcidIndex)16   -1.122e+00  1.198e+00  -0.936  0.34916    
as.factor(AcidIndex)17   -1.578e+00  1.133e+00  -1.393  0.16358    
as.factor(STARS)2         7.328e-01  3.487e-02  21.015  < 2e-16 ***
as.factor(STARS)3         1.182e+00  4.136e-02  28.578  < 2e-16 ***
as.factor(STARS)4         1.657e+00  6.655e-02  24.902  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.638 on 12765 degrees of freedom
Multiple R-squared:  0.2783,    Adjusted R-squared:  0.2767 
F-statistic: 169.8 on 29 and 12765 DF,  p-value: < 2.2e-16

Code

plot(lm8)

Code

set.seed(100)
# model9 <- zeroinfl(TARGET ~. | STARS, data=final_impute1)
# summary(model9)

model9 <- zeroinfl(TARGET ~., final_impute1, dist = 'poisson')
summary(model9)


Call:
zeroinfl(formula = TARGET ~ ., data = final_impute1, dist = "poisson")

Pearson residuals:
    Min      1Q  Median      3Q     Max 
-2.2428 -0.3653  0.1667  0.5012  4.3875 

Count model coefficients (poisson with log link):
                     Estimate Std. Error z value Pr(>|z|)    
(Intercept)         1.449e+00  2.065e-01   7.017 2.26e-12 ***
FixedAcidity        2.377e-04  8.575e-04   0.277  0.78165    
VolatileAcidity    -1.340e-02  6.872e-03  -1.950  0.05120 .  
CitricAcid         -2.172e-05  6.143e-03  -0.004  0.99718    
ResidualSugar      -6.728e-05  1.581e-04  -0.426  0.67037    
Chlorides          -1.801e-02  1.693e-02  -1.064  0.28743    
FreeSulfurDioxide   3.035e-05  3.496e-05   0.868  0.38530    
TotalSulfurDioxide -2.622e-05  2.228e-05  -1.177  0.23917    
Density            -3.341e-01  2.026e-01  -1.649  0.09917 .  
pH                  6.368e-03  7.930e-03   0.803  0.42195    
Sulphates           1.363e-03  5.772e-03   0.236  0.81328    
Alcohol             7.214e-03  1.427e-03   5.055 4.30e-07 ***
LabelAppeal         2.466e-01  6.428e-03  38.369  < 2e-16 ***
AcidIndex          -1.589e-02  4.983e-03  -3.188  0.00143 ** 
STARS               9.578e-02  6.294e-03  15.217  < 2e-16 ***

Zero-inflation model coefficients (binomial with logit link):
                     Estimate Std. Error z value Pr(>|z|)    
(Intercept)        -5.9321816  1.0390266  -5.709 1.13e-08 ***
FixedAcidity        0.0026326  0.0043097   0.611 0.541298    
VolatileAcidity     0.2346703  0.0347026   6.762 1.36e-11 ***
CitricAcid         -0.0803708  0.0314027  -2.559 0.010486 *  
ResidualSugar      -0.0013743  0.0008004  -1.717 0.085974 .  
Chlorides           0.2554050  0.0850914   3.002 0.002686 ** 
FreeSulfurDioxide  -0.0006658  0.0001796  -3.708 0.000209 ***
TotalSulfurDioxide -0.0008123  0.0001156  -7.026 2.12e-12 ***
Density             0.8564562  1.0224154   0.838 0.402211    
pH                  0.1938946  0.0400659   4.839 1.30e-06 ***
Sulphates           0.0911598  0.0290604   3.137 0.001707 ** 
Alcohol             0.0109721  0.0072459   1.514 0.129960    
LabelAppeal         0.3333973  0.0331303  10.063  < 2e-16 ***
AcidIndex           0.4764164  0.0197806  24.085  < 2e-16 ***
STARS              -0.5334391  0.0347708 -15.342  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Number of iterations in BFGS optimization: 37 
Log-likelihood: -2.254e+04 on 30 Df

Code

set.seed(100)
library(randomForest)
model10 <- randomForest(TARGET ~., final_impute1)

model10


Call:
 randomForest(formula = TARGET ~ ., data = final_impute1) 
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 4

          Mean of squared residuals: 2.554585
                    % Var explained: 31.15

Code

set.seed(100)
hurd.mod2 <- hurdle(TARGET ~ AcidIndex + Alcohol + LabelAppeal + STARS |
                        VolatileAcidity + FreeSulfurDioxide +
                        TotalSulfurDioxide + pH + Sulphates + Alcohol +
                        LabelAppeal + AcidIndex + STARS,
                    data = final_impute1)
hurd.summ <- summary(hurd.mod2)

hurd.summ


Call:
hurdle(formula = TARGET ~ AcidIndex + Alcohol + LabelAppeal + STARS | 
    VolatileAcidity + FreeSulfurDioxide + TotalSulfurDioxide + pH + Sulphates + 
        Alcohol + LabelAppeal + AcidIndex + STARS, data = final_impute1)

Pearson residuals:
    Min      1Q  Median      3Q     Max 
-2.2746 -0.3759  0.1674  0.5034  3.6135 

Count model coefficients (truncated poisson with log link):
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)  1.138239   0.043803  25.986  < 2e-16 ***
AcidIndex   -0.017645   0.004924  -3.584 0.000339 ***
Alcohol      0.007507   0.001439   5.217 1.81e-07 ***
LabelAppeal  0.249872   0.006512  38.371  < 2e-16 ***
STARS        0.097081   0.006264  15.499  < 2e-16 ***
Zero hurdle model coefficients (binomial with logit link):
                     Estimate Std. Error z value Pr(>|z|)    
(Intercept)         4.370e+00  2.030e-01  21.525  < 2e-16 ***
VolatileAcidity    -2.145e-01  2.938e-02  -7.299 2.90e-13 ***
FreeSulfurDioxide   6.149e-04  1.540e-04   3.993 6.52e-05 ***
TotalSulfurDioxide  6.756e-04  9.873e-05   6.843 7.76e-12 ***
pH                 -1.645e-01  3.381e-02  -4.866 1.14e-06 ***
Sulphates          -7.999e-02  2.453e-02  -3.261  0.00111 ** 
Alcohol            -6.325e-03  6.163e-03  -1.026  0.30476    
LabelAppeal        -1.570e-01  2.710e-02  -5.793 6.91e-09 ***
AcidIndex          -4.270e-01  1.688e-02 -25.305  < 2e-16 ***
STARS               5.063e-01  2.931e-02  17.272  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Number of iterations in BFGS optimization: 11 
Log-likelihood: -2.256e+04 on 15 Df

Code

plot(fitted(hurd.mod2), fitted(model9),xlab="Hurdle pred",ylab="Zero infl poisson")
abline(0,1)

Code

set.seed(100)
model11 <- zeroinfl(TARGET ~ AcidIndex+ Alcohol + LabelAppeal + STARS | VolatileAcidity + FreeSulfurDioxide + TotalSulfurDioxide  + pH + Sulphates +    Alcohol + LabelAppeal + AcidIndex + STARS, data = final_impute1)

summary(model11)


Call:
zeroinfl(formula = TARGET ~ AcidIndex + Alcohol + LabelAppeal + STARS | 
    VolatileAcidity + FreeSulfurDioxide + TotalSulfurDioxide + pH + Sulphates + 
        Alcohol + LabelAppeal + AcidIndex + STARS, data = final_impute1)

Pearson residuals:
    Min      1Q  Median      3Q     Max 
-2.2910 -0.3650  0.1668  0.4996  4.1425 

Count model coefficients (poisson with log link):
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)  1.138204   0.043655  26.073  < 2e-16 ***
AcidIndex   -0.016738   0.004915  -3.406  0.00066 ***
Alcohol      0.007265   0.001426   5.096 3.47e-07 ***
LabelAppeal  0.247012   0.006423  38.460  < 2e-16 ***
STARS        0.095769   0.006290  15.226  < 2e-16 ***

Zero-inflation model coefficients (binomial with logit link):
                     Estimate Std. Error z value Pr(>|z|)    
(Intercept)        -5.0456291  0.2381556 -21.186  < 2e-16 ***
VolatileAcidity     0.2468154  0.0343463   7.186 6.67e-13 ***
FreeSulfurDioxide  -0.0007116  0.0001780  -3.997 6.42e-05 ***
TotalSulfurDioxide -0.0007974  0.0001146  -6.955 3.52e-12 ***
pH                  0.1876248  0.0396506   4.732 2.22e-06 ***
Sulphates           0.0923097  0.0287160   3.215  0.00131 ** 
Alcohol             0.0104456  0.0072259   1.446  0.14829    
LabelAppeal         0.3323864  0.0330600  10.054  < 2e-16 ***
AcidIndex           0.4755215  0.0193439  24.583  < 2e-16 ***
STARS              -0.5340782  0.0346984 -15.392  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Number of iterations in BFGS optimization: 22 
Log-likelihood: -2.256e+04 on 15 Df

Code

(lrt <- 2* (model9$loglik-model11$loglik))

[1] 31.94336

Code

1-pchisq(31.82583,15)

[1] 0.006798843

Code

exp(coef(model11))

      count_(Intercept)         count_AcidIndex           count_Alcohol 
            3.121157914             0.983401152             1.007291473 
      count_LabelAppeal             count_STARS        zero_(Intercept) 
            1.280194333             1.100504780             0.006437409 
   zero_VolatileAcidity  zero_FreeSulfurDioxide zero_TotalSulfurDioxide 
            1.279942812             0.999288666             0.999202910 
                zero_pH          zero_Sulphates            zero_Alcohol 
            1.206380833             1.096704402             1.010500378 
       zero_LabelAppeal          zero_AcidIndex              zero_STARS 
            1.394291527             1.608852970             0.586209398

SELECT MODELS

For the count regression model, will you use a metric such as AIC, average squared error, etc.?

Be sure to explain how you can make inferences from the model, and discuss other relevant model output.

If you like the multiple linear regression model the best, please say why.

However, you must select a count regression model for model deployment.

Using the training data set, evaluate the performance of the count regression model. Make predictions using the evaluation data set.

Code

set.seed(100)

predtrain <- data.frame(TARGET=final_impute1$TARGET,
                         model2=model2$fitted.values,
                         model3=model3$fitted.values,
                         model4=model4$fitted.values,
                         model5=model5$fitted.values,
                         model6=model6$fitted.values,
                         model7=model7$fitted.values,
                         model8=lm8$fitted.values,
                         model9=model9$fitted.values,
                        #model10=model10$fitted.values)
                        model11=model11$fitted.values,
                        model12=hurd.mod2$fitted.values)
predtrain <- round(predtrain, 0)
colnames(predtrain) <- c("TARGET","Poisson-Imputed" ,"Quasipoisson", "negative.binom", "negative.binom remove insig", "negative.binom reduce insig/sig v","linear", "linear step AIC", "Zero Inflated","zeroinfl with reduced", "hurdle model")

predtrain %>%
  gather() %>%
  ggplot(aes(value)) +
  facet_wrap(~key, scale = "free",  ncol = 5) +
  geom_bar(fill="pink") +
  theme_minimal() + labs(x="Cases Bought", y = "Count", title = "Prediction Histogram")

Code

hist(hurd.mod2$fitted.values)

Code

set.seed(100)

aic2 <- model2$aic
aic3 <- model3$aic
aic4 <- model4$aic
aic5 <- model5$aic
aic6 <- model6$aic
aic7 <- model7$aic
aic8 <- lm8$aic


bic2 <- model2$bic
bic3 <- model3$bic
bic4 <- model4$bic
bic5 <- model5$bic
bic6 <- model6$bic
bic7 <- model7$bic
bic8 <- lm8$bic

#aic9 <- model9@aic 
#mse1 <- mean((final_impute1$TARGET - predict(model1))^2)
mse2 <- mean((final_impute1$TARGET - predict(model2))^2)
mse3 <- mean((final_impute1$TARGET - predict(model3))^2)
mse4 <- mean((final_impute1$TARGET - predict(model4))^2)
mse5 <- mean((final_impute1$TARGET - predict(model5))^2)
mse6 <- mean((final_impute1$TARGET - predict(model6))^2)
mse7 <- mean((final_impute1$TARGET - predict(model7))^2)
mse8 <- mean((final_impute1$TARGET - predict(lm8))^2)
#mse9 <- mean((final_impute1$TARGET - predict(model9))^2)
MSE <- c(mse2, mse3, mse4, mse5, mse6, mse7, mse8)
AIC <- c(aic2, aic3, aic4, aic5, aic6, aic7, aic8)
compare_df <- cbind(MSE,AIC)
rownames(compare_df) <- c("M2Poisson-Imputed" ,"M3Quasipoisson", "M4negative.binom", "M5negative.binom remove insig", "M6negative.binom reduce insig/sig v","M7linear", "M8linear step AIC")
DT::datatable(compare_df)

Code

final_impute1$PoissonImputed = predict(model2,final_impute1,type = "response")
final_impute1$Quasipoisson= predict(model3,final_impute1,type = "response")
final_impute1$NB_M1= predict(model4,final_impute1,type = "response")
final_impute1$NB_RI= predict(model5,final_impute1,type = "response")
final_impute1$NB_SIS= predict(model6,final_impute1,type = "response")
final_impute1$Linear_M1= predict(model7,final_impute1)
final_impute1$Linear_STEP= predict(lm8,final_impute1)
final_impute1$ZeroInf_M1= predict(model9,final_impute1,type = "response")
final_impute1$RF= predict(model10,final_impute1,type = "response")
final_impute1$ZeroInf_M2= predict(model11,final_impute1,type = "response")
final_impute1$Hurdle= predict(hurd.mod2,final_impute1,type = "response")


Final_test_df = data.frame(
  Models =c("M2Poisson-Imputed" ,"M3Quasipoisson", "M4negative.binom", "M5negative.binom remove insig", "M6negative.binom reduce insig/sig v","M7linear", "M8linear step AIC", "Zero Inflated","Random Forest","zeroinfl with reduced", "hurdle model"),
RMSE=c( sqrt(mean((final_impute1$TARGET - final_impute1$PoissonImputed)^2)),
 sqrt(mean((final_impute1$TARGET - final_impute1$Quasipoisson)^2)),
 sqrt(mean((final_impute1$TARGET - final_impute1$NB_M1)^2)),
 sqrt(mean((final_impute1$TARGET - final_impute1$NB_RI)^2)),
 sqrt(mean((final_impute1$TARGET - final_impute1$NB_SIS)^2)),
 sqrt(mean((final_impute1$TARGET - final_impute1$Linear_M1)^2)),
 sqrt(mean((final_impute1$TARGET - final_impute1$Linear_STEP)^2)),
sqrt(mean((final_impute1$TARGET - final_impute1$ZeroInf_M1)^2)),
sqrt(mean((final_impute1$TARGET - final_impute1$RF)^2)),
 sqrt(mean((final_impute1$TARGET - final_impute1$ZeroInf_M2)^2)),
sqrt(mean((final_impute1$TARGET - final_impute1$Hurdle)^2))))


( sqrt(mean((final_impute1$TARGET - final_impute1$PoissonImputed)^2)))

[1] 1.659053

Code

Final_test_df

                                Models      RMSE
1                    M2Poisson-Imputed 1.6590528
2                       M3Quasipoisson 1.6590528
3                     M4negative.binom 1.6659795
4        M5negative.binom remove insig 1.6668970
5  M6negative.binom reduce insig/sig v 1.6726298
6                             M7linear 1.6363296
7                    M8linear step AIC 1.6363996
8                        Zero Inflated 1.6399992
9                        Random Forest 0.6974797
10               zeroinfl with reduced 1.6429352
11                        hurdle model 1.6452958

Code

set.seed(100)

model2_fitted.values <- factor(round(model2$fitted.values),levels=rev(0:9))
lm8_fitted.values <- factor(round(lm8$fitted.values),levels=rev(0:9))
model9_fitted.values <- factor(round(model9$fitted.values),levels=rev(0:9))
model11_fitted.values <- factor(round(model11$fitted.values),levels=rev(0:9))
model12_fitted.values <- factor(round(model9$fitted.values),levels=rev(0:9))


m2_cfm <- confusionMatrix(model2_fitted.values, factor(final_impute1$TARGET,levels=rev(0:9)))
m8_cfm <- confusionMatrix(lm8_fitted.values, factor(final_impute1$TARGET,levels=rev(0:9)))
m9_cfm <- confusionMatrix(model9_fitted.values, factor(final_impute1$TARGET,levels=rev(0:9)))
m11_cfm <- confusionMatrix(model11_fitted.values, factor(final_impute1$TARGET,levels=rev(0:9)))
m12_cfm <- confusionMatrix(model12_fitted.values, factor(final_impute1$TARGET,levels=rev(0:9)))

models_sum <- data.frame(m2_cfm$overall, m8_cfm$overall, m9_cfm$overall,m11_cfm$overall,m12_cfm$overall)
colnames(models_sum) <- c("Poisson (Imputed)" ,"Linear (stepaic)", "Zero Inflated","zeroinfl with reduced", "hurdle model")
round(models_sum, 2)

               Poisson (Imputed) Linear (stepaic) Zero Inflated
Accuracy                    0.25             0.27          0.27
Kappa                       0.10             0.12          0.12
AccuracyLower               0.24             0.27          0.26
AccuracyUpper               0.25             0.28          0.28
AccuracyNull                0.25             0.25          0.25
AccuracyPValue              0.62             0.00          0.00
McnemarPValue                NaN              NaN           NaN
               zeroinfl with reduced hurdle model
Accuracy                        0.27         0.27
Kappa                           0.12         0.12
AccuracyLower                   0.26         0.26
AccuracyUpper                   0.27         0.28
AccuracyNull                    0.25         0.25
AccuracyPValue                  0.00         0.00
McnemarPValue                    NaN          NaN

Make Predictions on the Test Evaluation Data set

Code

imputetest <- mice(wine_test)


 iter imp variable
  1   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  1   2  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  1   3  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  1   4  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  1   5  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  2   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  2   2  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  2   3  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  2   4  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  2   5  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  3   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  3   2  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  3   3  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  3   4  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  3   5  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  4   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  4   2  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  4   3  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  4   4  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  4   5  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  5   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  5   2  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  5   3  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  5   4  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  5   5  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS

Code

meth <- imputetest$method
predM <- imputetest$predictorMatrix
predM[, c("TARGET")] <- 0
data_test_impute <- mice(wine_test, method = 'rf', predictorMatrix=predM)


 iter imp variable
  1   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  1   2  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  1   3  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  1   4  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  1   5  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  2   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  2   2  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  2   3  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  2   4  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  2   5  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  3   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  3   2  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  3   3  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  3   4  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  3   5  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  4   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  4   2  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  4   3  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  4   4  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  4   5  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  5   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  5   2  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  5   3  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  5   4  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
  5   5  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS

Code

final_testimpute <- complete(data_test_impute)
print(paste0("Missing value after imputation: ", sum(is.na(final_testimpute))))

[1] "Missing value after imputation: 3335"

Code

test_predict <- predict(model9, newdata=final_testimpute)
test_predict <- round(test_predict,0)
data_pred <- data.frame(TARGET=test_predict)

ggplot(data_pred, aes(x=TARGET)) + geom_bar(fill="light pink") + theme_minimal() +
  labs(y="Count", title = "Prediction: Zero Inflated Model (Model9)") + 
    scale_x_discrete(name = "Cases Bought", limits=c("1","2","3","4", "5", "6", "7", "8"))

Code

ggplot(wine_train, aes(x=TARGET)) + geom_bar(fill="light blue") + theme_minimal() +
  labs(y="Count", title = "Actual") + 
    scale_x_discrete(name = "Training dataset Cases Bought", limits=c("1","2","3","4", "5", "6", "7", "8"))

Code

write.csv(test_predict, "WinePredictionsFinal.csv")