DATA 621 - Homework5 - Wine Sales Prediction Count Regression Model

Libraries

library(kableExtra)
library(tidyverse)
library(ggplot2)
library(dplyr)
library(MASS)
library(corrplot)
library(RColorBrewer)
library(GGally)
library(ggResidpanel)
library(psych)
library(mice)
library(reshape2)
library(cowplot)
library(car)
library(caTools)
library(VIM)
library(broom)
library(pROC)
library(caret)
library(geoR)
library(moments)
library(glmulti)
library(pscl)

Introduction

In this homework assignment, we will explore, analyze and model a data set containing information on approximately 12,000 commercially available wines. The variables are mostly related to the chemical properties of the wine being sold. The response variable is the number of sample cases of wine that were purchased by wine distribution companies after sampling a wine. These cases would be used to provide tasting samples to restaurants and wine stores around the United States. The more sample cases purchased, the more likely is a wine to be sold at a high end restaurant. A large wine manufacturer is studying the data in order to predict the number of wine cases ordered based upon the wine characteristics. If the wine manufacturer can predict the number of cases, then that manufacturer will be able to adjust their wine offering to maximize sales.

Our dataset consists of 15 variables about different qualities. A wine producer might be able to use this data, along with the target variable, number of cases purchased by restaurants, to determine what qualities consumers are looking for in wines and to be able to plan accordingly.

vr <- c("INDEX", "TARGET", "AcidIndex", "Alcohol", "Chlorides", "CitricAcid", "Density", "FixedAcidity", "FreeSulfurDioxide", "LabelAppeal", "ResidualSugar", "STARS", "Sulphates", "TotalSulfurDioxide", "VolatileAcidity", "pH")

def <- c("Identification Variable (do not use)", "Number of Cases Purchased", "Proprietary method of testing total acidity of wine by using a weighted average", "Alcohol Content", "Chloride content of wine", "Citric Acid Content", "Density of Wine", "Fixed Acidity of Wine", "Sulfur Dioxide content of wine", "Marketing Score indicating the appeal of label design for consumers. High numbers suggest customers like the label design. Negative numbers suggest customes don't like the design.", "Residual Sugar of wine", "Wine rating by a team of experts. 4 Stars = Excellent, 1 Star = Poor", "Sulfate conten of wine", "Total Sulfur Dioxide of Wine", "Volatile Acid content of wine", "pH of wine")

te <- c("None", "None", "", "",  "", "", "", "",  "", "Many consumers purchase based on the visual appeal of the wine label design. Higher numbers suggest better sales", "", "A high number of stars suggests high sales",  "", "", "", "")

kable(cbind(vr, def, te), col.names = c("Variable Name", "Definition", "Theoretical Effect")) %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="300px")

Variable Name	Definition	Theoretical Effect
INDEX	Identification Variable (do not use)	None
TARGET	Number of Cases Purchased	None
AcidIndex	Proprietary method of testing total acidity of wine by using a weighted average
Alcohol	Alcohol Content
Chlorides	Chloride content of wine
CitricAcid	Citric Acid Content
Density	Density of Wine
FixedAcidity	Fixed Acidity of Wine
FreeSulfurDioxide	Sulfur Dioxide content of wine
LabelAppeal	Marketing Score indicating the appeal of label design for consumers. High numbers suggest customers like the label design. Negative numbers suggest customes don’t like the design.	Many consumers purchase based on the visual appeal of the wine label design. Higher numbers suggest better sales
ResidualSugar	Residual Sugar of wine
STARS	Wine rating by a team of experts. 4 Stars = Excellent, 1 Star = Poor	A high number of stars suggests high sales
Sulphates	Sulfate conten of wine
TotalSulfurDioxide	Total Sulfur Dioxide of Wine
VolatileAcidity	Volatile Acid content of wine
pH	pH of wine

Objective

Our objective is to build a count regression model to predict the number of cases of wine that will be sold given certain properties of the wine. HINT: Sometimes, the fact that a variable is missing is actually predictive of the target. You can only use the variables given to you (or variables that you derive from the variables provided).

Data Load

Loaded Training and Evalutaion data sets into respective data frames.

train_df <- read.csv("https://raw.githubusercontent.com/soumya2g/CUNYDataMiningHomeWork/master/HomeWork_5/DataFiles/wine-training-data.csv")

eval_df <- read.csv("https://raw.githubusercontent.com/soumya2g/CUNYDataMiningHomeWork/master/HomeWork_5/DataFiles/wine-evaluation-data.csv")

Training Data

Sample snapshot of training data frame -

head(train_df, 20) %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="300px")

ï..INDEX	TARGET	FixedAcidity	VolatileAcidity	CitricAcid	ResidualSugar	Chlorides	FreeSulfurDioxide	TotalSulfurDioxide	Density	pH	Sulphates	Alcohol	LabelAppeal	AcidIndex	STARS
1	3	3.2	1.160	-0.98	54.20	-0.567	NA	268	0.99280	3.33	-0.59	9.9	0	8	2
2	3	4.5	0.160	-0.81	26.10	-0.425	15	-327	1.02792	3.38	0.70	NA	-1	7	3
4	5	7.1	2.640	-0.88	14.80	0.037	214	142	0.99518	3.12	0.48	22.0	-1	8	3
5	3	5.7	0.385	0.04	18.80	-0.425	22	115	0.99640	2.24	1.83	6.2	-1	6	1
6	4	8.0	0.330	-1.26	9.40	NA	-167	108	0.99457	3.12	1.77	13.7	0	9	2
7	0	11.3	0.320	0.59	2.20	0.556	-37	15	0.99940	3.20	1.29	15.4	0	11	NA
8	0	7.7	0.290	-0.40	21.50	0.060	287	156	0.99572	3.49	1.21	10.3	0	8	NA
11	4	6.5	-1.220	0.34	1.40	0.040	523	551	1.03236	3.20	NA	11.6	1	7	3
12	3	14.8	0.270	1.05	11.25	-0.007	-213	NA	0.99620	4.93	0.26	15.0	0	6	NA
13	6	5.5	-0.220	0.39	1.80	-0.277	62	180	0.94724	3.09	0.75	12.6	0	8	4
14	0	-17.2	0.520	0.15	-33.80	-0.022	551	65	0.99340	4.31	0.56	13.1	1	5	1
15	4	9.0	0.220	0.49	10.40	-0.150	NA	195	1.05918	3.01	-0.40	10.2	0	10	2
16	3	6.0	0.330	-1.06	3.00	0.518	5	378	0.96643	3.55	-0.86	3.9	1	7	2
17	7	14.9	0.320	0.34	1.30	0.042	20	NA	0.95523	2.93	2.16	12.0	2	8	3
19	4	21.8	0.510	1.83	2.90	-0.165	-49	45	0.99740	6.02	2.23	12.1	0	9	NA
20	0	5.7	0.500	-0.11	6.10	0.071	234	-331	0.99780	3.35	NA	6.8	0	8	NA
22	0	7.9	0.600	0.06	50.40	0.069	15	339	0.99640	3.30	0.46	3.9	1	9	NA
23	4	-1.3	0.220	2.95	-53.00	0.541	-85	-266	0.99672	3.61	0.82	10.0	0	8	3
24	5	10.0	0.230	0.27	14.10	0.033	-188	229	0.99880	3.14	0.88	11.0	1	11	2
25	4	6.8	0.475	-0.20	-50.75	0.047	-88	508	0.99403	3.23	0.35	18.3	-1	8	2

str(train_df)

## 'data.frame':    12795 obs. of  16 variables:
##  $ ï..INDEX          : int  1 2 4 5 6 7 8 11 12 13 ...
##  $ TARGET            : int  3 3 5 3 4 0 0 4 3 6 ...
##  $ FixedAcidity      : num  3.2 4.5 7.1 5.7 8 11.3 7.7 6.5 14.8 5.5 ...
##  $ VolatileAcidity   : num  1.16 0.16 2.64 0.385 0.33 0.32 0.29 -1.22 0.27 -0.22 ...
##  $ CitricAcid        : num  -0.98 -0.81 -0.88 0.04 -1.26 0.59 -0.4 0.34 1.05 0.39 ...
##  $ ResidualSugar     : num  54.2 26.1 14.8 18.8 9.4 ...
##  $ Chlorides         : num  -0.567 -0.425 0.037 -0.425 NA 0.556 0.06 0.04 -0.007 -0.277 ...
##  $ FreeSulfurDioxide : num  NA 15 214 22 -167 -37 287 523 -213 62 ...
##  $ TotalSulfurDioxide: num  268 -327 142 115 108 15 156 551 NA 180 ...
##  $ Density           : num  0.993 1.028 0.995 0.996 0.995 ...
##  $ pH                : num  3.33 3.38 3.12 2.24 3.12 3.2 3.49 3.2 4.93 3.09 ...
##  $ Sulphates         : num  -0.59 0.7 0.48 1.83 1.77 1.29 1.21 NA 0.26 0.75 ...
##  $ Alcohol           : num  9.9 NA 22 6.2 13.7 15.4 10.3 11.6 15 12.6 ...
##  $ LabelAppeal       : int  0 -1 -1 -1 0 0 0 1 0 0 ...
##  $ AcidIndex         : int  8 7 8 6 9 11 8 7 6 8 ...
##  $ STARS             : int  2 3 3 1 2 NA NA 3 NA 4 ...

PART I: Data Exploration

Data Cleansing & Transoformation

We wanted to start off data exploration process with cleaning up the issues that we observed above -

# Exclude INDEX attribute from the data frame
train_df <- train_df %>% dplyr::select(-"ï..INDEX")

Descriptive Statistical Summary

Next step of data exploration process involves high level descriptive statistical summary and missing/exception value analysis.

stat_summary <- function(df){
  df %>%
    summary() %>%
    kable() %>%
    kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% 
    scroll_box(width="100%",height="400px")
}
stat_summary(train_df)

`TARGET </th>`	FixedAcidity	VolatileAcidity	CitricAcid	ResidualSugar	Chlorides	FreeSulfurDioxide	TotalSulfurDioxide	`Density </th>`	`pH </th>`	Sulphates	`Alcohol </th>`	LabelAppeal	AcidIndex	`STARS </th>`
Min. :0.000	Min. :-18.100	Min. :-2.7900	Min. :-3.2400	Min. :-127.800	Min. :-1.1710	Min. :-555.00	Min. :-823.0	Min. :0.8881	Min. :0.480	Min. :-3.1300	Min. :-4.70	Min. :-2.000000	Min. : 4.000	Min. :1.000
1st Qu.:2.000	1st Qu.: 5.200	1st Qu.: 0.1300	1st Qu.: 0.0300	1st Qu.: -2.000	1st Qu.:-0.0310	1st Qu.: 0.00	1st Qu.: 27.0	1st Qu.:0.9877	1st Qu.:2.960	1st Qu.: 0.2800	1st Qu.: 9.00	1st Qu.:-1.000000	1st Qu.: 7.000	1st Qu.:1.000
Median :3.000	Median : 6.900	Median : 0.2800	Median : 0.3100	Median : 3.900	Median : 0.0460	Median : 30.00	Median : 123.0	Median :0.9945	Median :3.200	Median : 0.5000	Median :10.40	Median : 0.000000	Median : 8.000	Median :2.000
Mean :3.029	Mean : 7.076	Mean : 0.3241	Mean : 0.3084	Mean : 5.419	Mean : 0.0548	Mean : 30.85	Mean : 120.7	Mean :0.9942	Mean :3.208	Mean : 0.5271	Mean :10.49	Mean :-0.009066	Mean : 7.773	Mean :2.042
3rd Qu.:4.000	3rd Qu.: 9.500	3rd Qu.: 0.6400	3rd Qu.: 0.5800	3rd Qu.: 15.900	3rd Qu.: 0.1530	3rd Qu.: 70.00	3rd Qu.: 208.0	3rd Qu.:1.0005	3rd Qu.:3.470	3rd Qu.: 0.8600	3rd Qu.:12.40	3rd Qu.: 1.000000	3rd Qu.: 8.000	3rd Qu.:3.000
Max. :8.000	Max. : 34.400	Max. : 3.6800	Max. : 3.8600	Max. : 141.150	Max. : 1.3510	Max. : 623.00	Max. :1057.0	Max. :1.0992	Max. :6.130	Max. : 4.2400	Max. :26.50	Max. : 2.000000	Max. :17.000	Max. :4.000
NA	NA	NA	NA	NA’s :616	NA’s :638	NA’s :647	NA’s :682	NA	NA’s :395	NA’s :1210	NA’s :653	NA	NA	NA’s :3359

From the above we can see that there are certain features with missing values. We will handle these issues in a further step towards improving data quality.

We also used describe() function of ‘psych’ package to summarize additional statistical measurements like Standard Deviation, Skewness, Kurtois, Standard Error etc.

stat_desc <- function(df){
df %>% 
    describe() %>%
    kable() %>%
    kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%  scroll_box(width="100%",height="300px")
}

stat_desc(train_df)

	vars	n	mean	sd	median	trimmed	mad	min	max	range	skew	kurtosis	se
TARGET	1	12795	3.0290739	1.9263682	3.00000	3.0538244	1.4826000	0.00000	8.00000	8.00000	-0.3263010	-0.8772457	0.0170302
FixedAcidity	2	12795	7.0757171	6.3176435	6.90000	7.0736739	3.2617200	-18.10000	34.40000	52.50000	-0.0225860	1.6749987	0.0558515
VolatileAcidity	3	12795	0.3241039	0.7840142	0.28000	0.3243890	0.4299540	-2.79000	3.68000	6.47000	0.0203800	1.8322106	0.0069311
CitricAcid	4	12795	0.3084127	0.8620798	0.31000	0.3102520	0.4151280	-3.24000	3.86000	7.10000	-0.0503070	1.8379401	0.0076213
ResidualSugar	5	12179	5.4187331	33.7493790	3.90000	5.5800410	15.7155600	-127.80000	141.15000	268.95000	-0.0531229	1.8846917	0.3058158
Chlorides	6	12157	0.0548225	0.3184673	0.04600	0.0540159	0.1349166	-1.17100	1.35100	2.52200	0.0304272	1.7886044	0.0028884
FreeSulfurDioxide	7	12148	30.8455713	148.7145577	30.00000	30.9334877	56.3388000	-555.00000	623.00000	1178.00000	0.0063930	1.8364966	1.3492769
TotalSulfurDioxide	8	12113	120.7142326	231.9132105	123.00000	120.8895367	134.9166000	-823.00000	1057.00000	1880.00000	-0.0071794	1.6746665	2.1071703
Density	9	12795	0.9942027	0.0265376	0.99449	0.9942130	0.0093552	0.88809	1.09924	0.21115	-0.0186938	1.8999592	0.0002346
pH	10	12400	3.2076282	0.6796871	3.20000	3.2055706	0.3854760	0.48000	6.13000	5.65000	0.0442880	1.6462681	0.0061038
Sulphates	11	11585	0.5271118	0.9321293	0.50000	0.5271453	0.4447800	-3.13000	4.24000	7.37000	0.0059119	1.7525655	0.0086602
Alcohol	12	12142	10.4892363	3.7278190	10.40000	10.5018255	2.3721600	-4.70000	26.50000	31.20000	-0.0307158	1.5394949	0.0338306
LabelAppeal	13	12795	-0.0090660	0.8910892	0.00000	-0.0099639	1.4826000	-2.00000	2.00000	4.00000	0.0084295	-0.2622916	0.0078777
AcidIndex	14	12795	7.7727237	1.3239264	8.00000	7.6431572	1.4826000	4.00000	17.00000	13.00000	1.6484959	5.1900925	0.0117043
STARS	15	9436	2.0417550	0.9025400	2.00000	1.9711258	1.4826000	1.00000	4.00000	3.00000	0.4472353	-0.6925343	0.0092912

Missing Value Analysis

Below we have applied imputation technique for the features that has NA values.

## Counts of missing data per feature
train_na_df <- data.frame(apply(train_df, 2, function(x) length(which(is.na(x)))))
train_na_df1 <- data.frame(apply(train_df, 2,function(x) {sum(is.na(x)) / length(x) * 100}))

train_na_df <- cbind(Feature = rownames(train_na_df), train_na_df, train_na_df1)
colnames(train_na_df) <- c('Feature Name','No. of NA Recocrds','Percentage of NA Records')
rownames(train_na_df) <- NULL


train_na_df%>% filter(`No. of NA Recocrds` != 0) %>% arrange(desc(`No. of NA Recocrds`)) %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="300px")

Feature Name	No. of NA Recocrds	Percentage of NA Records
STARS	3359	26.252442
Sulphates	1210	9.456819
TotalSulfurDioxide	682	5.330207
Alcohol	653	5.103556
FreeSulfurDioxide	647	5.056663
Chlorides	638	4.986323
ResidualSugar	616	4.814381
pH	395	3.087143

Imputation

Imputing the missing data for the other 4 features, which we are keeping for our analysis. We have used the ‘Predictive Mean Matching’(pmm) method included in MICE package for imputation purposes.

impute_data <- function(df){
  
  df <- mice(data = df, m = 1, method = "pmm", maxit = 5, seed = 500)
  df <- mice::complete(df, 1)
}

train_df <- impute_data(train_df)

## 
##  iter imp variable
##   1   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   2   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   3   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   4   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   5   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS

eval_df <- impute_data(eval_df)

## 
##  iter imp variable
##   1   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   2   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   3   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   4   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   5   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS

Below we have applied imputation technique for the features that has NA values per the table above.

Descriptive Statistical Plots

Box Plots

## Box plots:
gb1 <- ggplot(data = train_df, aes(y = TARGET)) + geom_boxplot()
gb2 <- ggplot(data = train_df, aes(y = FixedAcidity)) + geom_boxplot()
gb3 <- ggplot(data = train_df, aes(y = VolatileAcidity)) + geom_boxplot()
gb4 <- ggplot(data = train_df, aes(y = CitricAcid)) + geom_boxplot()
gb5 <- ggplot(data = train_df, aes(y = ResidualSugar)) + geom_boxplot()
gb6 <- ggplot(data = train_df, aes(y = Chlorides)) + geom_boxplot()
gb7 <- ggplot(data = train_df, aes(y = FreeSulfurDioxide)) + geom_boxplot()
gb8 <- ggplot(data = train_df, aes(y = TotalSulfurDioxide)) + geom_boxplot()
gb9 <- ggplot(data = train_df, aes(y = Density)) + geom_boxplot()
gb10 <- ggplot(data = train_df, aes(y = pH)) + geom_boxplot()
gb11 <- ggplot(data = train_df, aes(y = Sulphates)) + geom_boxplot()
gb12 <- ggplot(data = train_df, aes(y = Alcohol)) + geom_boxplot()
gb13 <- ggplot(data = train_df, aes(y = LabelAppeal)) + geom_boxplot()
gb14 <- ggplot(data = train_df, aes(y = AcidIndex)) + geom_boxplot()
gb15 <- ggplot(data = train_df, aes(y = STARS)) + geom_boxplot()


plot_grid(gb1, gb2, gb3, gb4, gb5, gb6, gb7, gb8, gb9, gb10,
          gb11, gb12, gb13, gb14, gb15, labels = "AUTO, scale = 8")

Density Plots

train_df %>% dplyr::select(TARGET,FixedAcidity,VolatileAcidity,CitricAcid,ResidualSugar,Chlorides,FreeSulfurDioxide,TotalSulfurDioxide,Density,pH,Sulphates,Alcohol,LabelAppeal,AcidIndex,STARS) %>%
  gather(variable, value) %>%
  ggplot(., aes(value)) + 
  geom_density(fill = "dodgerblue4", color="dodgerblue4") + 
  facet_wrap(~variable, scales ="free", ncol = 4) +
  labs(x = element_blank(), y = element_blank())

Observations Summary

The above shows that most of the data is more or less normally distributed. There are some values such as Sulphates, Chlorides, VolatileAcidity, CitricAcid, ResidualSugar, FixedAcid, FreeSulfurDioxide, and TotalSulfurDioxide that have negative values. After taking a closer look at these data points, it is likely that the creator of this dataset ended up standardizing the data. Because the data is already normally distributed and the test (eval) dataset is also likely structured in this way, we will not be making any adjustments to this data.

It would be interesting to see if there were any correlations between the independent variables to independent variables, and independent variables to the dependent variable.

PART II: Data Preparation

Correlation Plot

trainnum_df <- dplyr::select_if(train_df, is.numeric)
corrMatrix <- round(cor(trainnum_df),4)

corrMatrix %>% corrplot(., method = "color", outline = T, addgrid.col = "darkgray", order="hclust", addrect = 4, rect.col = "black", rect.lwd = 5,cl.pos = "b", tl.col = "indianred4", tl.cex = 1.0, cl.cex = 1.0, addCoef.col = "white", number.digits = 2, number.cex = 0.8, col = colorRampPalette(c("darkred","white","dodgerblue4"))(100))

In the correlation plot above, we see, STARS and LabelAppeal are most positively correlated variables with the response variable. We expected this because our variable description mentions these variable’s theoretical effect are higher than other variables. Also, we some mild negative correlation between the response variable and AcidIndex variable.

Feature Engineering

In binary logistic regression, it is desirable to have predictor variables that are normally distributed, whenever possible. The data in the crime dataset presents some factors that would lead us to have to perform some “Transformations” on the data. These transformation include adding categorical variables, log of variables, and adding power transformations etc.

PART III: Building Models

Test train approach

We have divided the traning dataset into training and test sets using a 80/20 split. We will build our models on the training set and evaluate it on the test set.

set.seed(123)
split <- sample.split(train_df$TARGET, SplitRatio = 0.8)
training_set <- subset(train_df, split == TRUE)
test_set <- subset(train_df, split == FALSE)

Below is the list of complete list of variables before we embark on Model building process -

str(training_set)

## 'data.frame':    10237 obs. of  15 variables:
##  $ TARGET            : int  3 3 5 3 4 0 0 4 6 4 ...
##  $ FixedAcidity      : num  3.2 4.5 7.1 5.7 8 11.3 7.7 6.5 5.5 21.8 ...
##  $ VolatileAcidity   : num  1.16 0.16 2.64 0.385 0.33 0.32 0.29 -1.22 -0.22 0.51 ...
##  $ CitricAcid        : num  -0.98 -0.81 -0.88 0.04 -1.26 0.59 -0.4 0.34 0.39 1.83 ...
##  $ ResidualSugar     : num  54.2 26.1 14.8 18.8 9.4 2.2 21.5 1.4 1.8 2.9 ...
##  $ Chlorides         : num  -0.567 -0.425 0.037 -0.425 0.049 0.556 0.06 0.04 -0.277 -0.165 ...
##  $ FreeSulfurDioxide : num  -8 15 214 22 -167 -37 287 523 62 -49 ...
##  $ TotalSulfurDioxide: num  268 -327 142 115 108 15 156 551 180 45 ...
##  $ Density           : num  0.993 1.028 0.995 0.996 0.995 ...
##  $ pH                : num  3.33 3.38 3.12 2.24 3.12 3.2 3.49 3.2 3.09 6.02 ...
##  $ Sulphates         : num  -0.59 0.7 0.48 1.83 1.77 1.29 1.21 0.83 0.75 2.23 ...
##  $ Alcohol           : num  9.9 13.5 22 6.2 13.7 15.4 10.3 11.6 12.6 12.1 ...
##  $ LabelAppeal       : int  0 -1 -1 -1 0 0 0 1 0 0 ...
##  $ AcidIndex         : int  8 7 8 6 9 11 8 7 8 9 ...
##  $ STARS             : int  2 3 3 1 2 1 1 3 4 1 ...

We will build two different Poisson regression models using dataset with and without imputed values, two different negative binomial regression models using stepwise variables selection and imputed variables, and two multiple linear regression models using stepwise variables selection and imputed variables to see which model yields the best performance.

Build Poisson Regression Models

Model 1 (Poisson Regression):

We will start off with a model with all the original variables excluding any derived features -

poisson_model_1 <- glm(TARGET ~ ., family=poisson, data=training_set)
poisson_model_1 <- step(poisson_model_1, direction="backward")

## Start:  AIC=38426.61
## TARGET ~ FixedAcidity + VolatileAcidity + CitricAcid + ResidualSugar + 
##     Chlorides + FreeSulfurDioxide + TotalSulfurDioxide + Density + 
##     pH + Sulphates + Alcohol + LabelAppeal + AcidIndex + STARS
## 
##                      Df Deviance   AIC
## - ResidualSugar       1    12839 38425
## - FixedAcidity        1    12839 38425
## - Alcohol             1    12840 38426
## <none>                     12839 38427
## - Density             1    12841 38427
## - CitricAcid          1    12841 38427
## - Sulphates           1    12843 38429
## - Chlorides           1    12844 38430
## - FreeSulfurDioxide   1    12846 38432
## - TotalSulfurDioxide  1    12846 38433
## - pH                  1    12847 38433
## - VolatileAcidity     1    12874 38460
## - AcidIndex           1    13247 38833
## - LabelAppeal         1    13307 38893
## - STARS               1    15645 41231
## 
## Step:  AIC=38425.09
## TARGET ~ FixedAcidity + VolatileAcidity + CitricAcid + Chlorides + 
##     FreeSulfurDioxide + TotalSulfurDioxide + Density + pH + Sulphates + 
##     Alcohol + LabelAppeal + AcidIndex + STARS
## 
##                      Df Deviance   AIC
## - FixedAcidity        1    12840 38424
## - Alcohol             1    12840 38424
## <none>                     12839 38425
## - Density             1    12841 38425
## - CitricAcid          1    12842 38426
## - Sulphates           1    12844 38428
## - Chlorides           1    12845 38429
## - FreeSulfurDioxide   1    12847 38431
## - TotalSulfurDioxide  1    12847 38431
## - pH                  1    12847 38431
## - VolatileAcidity     1    12874 38458
## - AcidIndex           1    13247 38831
## - LabelAppeal         1    13307 38891
## - STARS               1    15648 41232
## 
## Step:  AIC=38423.92
## TARGET ~ VolatileAcidity + CitricAcid + Chlorides + FreeSulfurDioxide + 
##     TotalSulfurDioxide + Density + pH + Sulphates + Alcohol + 
##     LabelAppeal + AcidIndex + STARS
## 
##                      Df Deviance   AIC
## - Alcohol             1    12841 38423
## <none>                     12840 38424
## - Density             1    12842 38424
## - CitricAcid          1    12842 38424
## - Sulphates           1    12845 38427
## - Chlorides           1    12846 38428
## - FreeSulfurDioxide   1    12848 38430
## - TotalSulfurDioxide  1    12848 38430
## - pH                  1    12848 38430
## - VolatileAcidity     1    12875 38457
## - AcidIndex           1    13264 38846
## - LabelAppeal         1    13308 38891
## - STARS               1    15648 41230
## 
## Step:  AIC=38423.15
## TARGET ~ VolatileAcidity + CitricAcid + Chlorides + FreeSulfurDioxide + 
##     TotalSulfurDioxide + Density + pH + Sulphates + LabelAppeal + 
##     AcidIndex + STARS
## 
##                      Df Deviance   AIC
## <none>                     12841 38423
## - Density             1    12843 38423
## - CitricAcid          1    12844 38424
## - Sulphates           1    12846 38426
## - Chlorides           1    12847 38427
## - FreeSulfurDioxide   1    12849 38429
## - TotalSulfurDioxide  1    12849 38429
## - pH                  1    12849 38429
## - VolatileAcidity     1    12876 38456
## - AcidIndex           1    13267 38847
## - LabelAppeal         1    13309 38889
## - STARS               1    15677 41257

summary(poisson_model_1)

## 
## Call:
## glm(formula = TARGET ~ VolatileAcidity + CitricAcid + Chlorides + 
##     FreeSulfurDioxide + TotalSulfurDioxide + Density + pH + Sulphates + 
##     LabelAppeal + AcidIndex + STARS, family = poisson, data = training_set)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.0644  -0.6898   0.1274   0.6338   2.6622  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         1.581e+00  2.170e-01   7.283 3.27e-13 ***
## VolatileAcidity    -4.351e-02  7.326e-03  -5.940 2.86e-09 ***
## CitricAcid          1.067e-02  6.561e-03   1.627  0.10378    
## Chlorides          -4.355e-02  1.774e-02  -2.455  0.01410 *  
## FreeSulfurDioxide   1.066e-04  3.828e-05   2.785  0.00535 ** 
## TotalSulfurDioxide  6.950e-05  2.474e-05   2.809  0.00498 ** 
## Density            -3.228e-01  2.139e-01  -1.509  0.13131    
## pH                 -2.381e-02  8.332e-03  -2.858  0.00426 ** 
## Sulphates          -1.348e-02  6.133e-03  -2.198  0.02794 *  
## LabelAppeal         1.472e-01  6.804e-03  21.631  < 2e-16 ***
## AcidIndex          -1.001e-01  4.991e-03 -20.053  < 2e-16 ***
## STARS               3.371e-01  6.256e-03  53.882  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 18291  on 10236  degrees of freedom
## Residual deviance: 12841  on 10225  degrees of freedom
## AIC: 38423
## 
## Number of Fisher Scoring iterations: 5

All of these variables appear to have minor but have statistically significant impact. Overall, it does appear that the more citric, sulfur dioxide containing, alcohol, label appeal, and stars that the observation had, the more likely the case was going to sell. Likewise, the less acid, more dense, more sulphates, and more chlorides, the less likely the wine was going to sell cases.

RMSE Calculation

rmse_calc <- function(actual, predicted) {
  rmse_val <- sqrt(sum((actual - predicted)^2) / length(actual))

  return(rmse_val)
}

### RMSE of first model - training dataset
model1_rmse_train <- mean(rmse_calc(training_set$TARGET, predict(poisson_model_1, newdata = training_set)))
### RMSE of first model - test dataset
model1_rmse_test <- mean(rmse_calc(test_set$TARGET, predict(poisson_model_1, newdata = test_set)))

model1_aic <- poisson_model_1$aic

Model 2 (Poisson Regression Model 2 - Zero-inflated):

“Zero-inflated poisson regression is used to model count data that has an excess of zero counts. Further, theory suggests that the excess zeros are generated by a separate process from the count values and that the excess zeros can be modeled independently. Thus, the zip model has two parts, a poisson count model and the logit model for predicting excess zeros.”

Reference: https://www.theanalysisfactor.com/zero-inflated-poisson-models-for-count-outcomes/

“The Poisson distribution assumes that each count is the result of each Poisson process - a random process that says each counted event is independent and equally likely. If this count variable is used as the outcome of a regression model, we can use Poisson regression to estimate how predictors affect the number of times the devent occurred.”

“But sometimes, it’s just a matter of having too many zeros that a Poisson would predict. In this case, a better solution is often the Zero-Inflated Poisson (ZIP) model.ZIP models assume that some zeros occurred by a Poisson process.The ZIP model fits, simultaneously, two separate regression models. One is a logistic model that models the probability of being eligible for a non-zero count. The other models the size of that count.”

“Both models use the same predictor variables, but estimate their coefficients separately. So the predictors can have vastly different effects on the two processes.”

Below is the second model utilizing ZIP.

poisson_model_2 <- zeroinfl(TARGET ~ ., data = training_set)
summary(poisson_model_2)

## 
## Call:
## zeroinfl(formula = TARGET ~ ., data = training_set)
## 
## Pearson residuals:
##      Min       1Q   Median       3Q      Max 
## -2.15710 -0.45312  0.02585  0.42737  6.98368 
## 
## Count model coefficients (poisson with log link):
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         1.434e+00  2.253e-01   6.363 1.97e-10 ***
## FixedAcidity        1.840e-04  9.405e-04   0.196 0.844929    
## VolatileAcidity    -1.669e-02  7.561e-03  -2.207 0.027308 *  
## CitricAcid         -7.852e-04  6.748e-03  -0.116 0.907369    
## ResidualSugar      -6.918e-05  1.743e-04  -0.397 0.691419    
## Chlorides          -1.993e-02  1.838e-02  -1.084 0.278182    
## FreeSulfurDioxide   1.138e-05  3.878e-05   0.293 0.769256    
## TotalSulfurDioxide -1.849e-05  2.471e-05  -0.749 0.454148    
## Density            -3.276e-01  2.212e-01  -1.481 0.138666    
## pH                  5.086e-03  8.642e-03   0.589 0.556164    
## Sulphates           5.984e-04  6.366e-03   0.094 0.925109    
## Alcohol             7.100e-03  1.560e-03   4.552 5.31e-06 ***
## LabelAppeal         2.402e-01  7.092e-03  33.873  < 2e-16 ***
## AcidIndex          -1.859e-02  5.500e-03  -3.379 0.000727 ***
## STARS               1.150e-01  6.978e-03  16.477  < 2e-16 ***
## 
## Zero-inflation model coefficients (binomial with logit link):
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        -3.0501383  1.3524316  -2.255 0.024115 *  
## FixedAcidity        0.0036236  0.0055406   0.654 0.513108    
## VolatileAcidity     0.2054621  0.0439780   4.672 2.98e-06 ***
## CitricAcid         -0.0536164  0.0406526  -1.319 0.187205    
## ResidualSugar      -0.0009446  0.0010396  -0.909 0.363543    
## Chlorides           0.2022919  0.1127347   1.794 0.072748 .  
## FreeSulfurDioxide  -0.0006176  0.0002423  -2.548 0.010819 *  
## TotalSulfurDioxide -0.0006515  0.0001525  -4.273 1.93e-05 ***
## Density             1.0260037  1.3274080   0.773 0.439559    
## pH                  0.2267552  0.0517110   4.385 1.16e-05 ***
## Sulphates           0.1351741  0.0383336   3.526 0.000421 ***
## Alcohol             0.0342709  0.0095144   3.602 0.000316 ***
## LabelAppeal         0.7046050  0.0451280  15.613  < 2e-16 ***
## AcidIndex           0.4725343  0.0272423  17.346  < 2e-16 ***
## STARS              -3.1492875  0.1155593 -27.253  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Number of iterations in BFGS optimization: 37 
## Log-likelihood: -1.68e+04 on 30 Df

poisson_model_2$coefficients

## $count
##        (Intercept)       FixedAcidity    VolatileAcidity 
##       1.433536e+00       1.839594e-04      -1.668885e-02 
##         CitricAcid      ResidualSugar          Chlorides 
##      -7.852329e-04      -6.917652e-05      -1.992622e-02 
##  FreeSulfurDioxide TotalSulfurDioxide            Density 
##       1.137619e-05      -1.849410e-05      -3.276113e-01 
##                 pH          Sulphates            Alcohol 
##       5.086326e-03       5.984220e-04       7.100222e-03 
##        LabelAppeal          AcidIndex              STARS 
##       2.402199e-01      -1.858544e-02       1.149713e-01 
## 
## $zero
##        (Intercept)       FixedAcidity    VolatileAcidity 
##      -3.0501383371       0.0036235817       0.2054620822 
##         CitricAcid      ResidualSugar          Chlorides 
##      -0.0536164184      -0.0009446201       0.2022918617 
##  FreeSulfurDioxide TotalSulfurDioxide            Density 
##      -0.0006175706      -0.0006514559       1.0260036862 
##                 pH          Sulphates            Alcohol 
##       0.2267551540       0.1351740966       0.0342708587 
##        LabelAppeal          AcidIndex              STARS 
##       0.7046050190       0.4725343007      -3.1492874622

RMSE Calculation

### RMSE of first model - training dataset
model2_rmse_train <- mean(rmse_calc(training_set$TARGET, predict(poisson_model_2, newdata = training_set)))
### RMSE of first model - test dataset
model2_rmse_test <- mean(rmse_calc(test_set$TARGET, predict(poisson_model_2, newdata = test_set)))

model2_aic <- poisson_model_2$aic

Poisson models comparison statistic

It is interesting to compare the zero inflated Poisson model to the original Poisson model. There does appear to be some differences, but even more interesting is the differences in direction (positive and negative) of the independent variables. For instance, the density of the wine is positive in $count, but negative in $zero (any many other variables). It is unclear why this is, but we will certainly keep this in the back of our mind.

Is the zero-inflated model better than the standard Poisson regression model? We will perform the Vuong test between the two models.

# Vuong
vuong(poisson_model_1, poisson_model_2) %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="300px")

## Vuong Non-Nested Hypothesis Test-Statistic: 
## (test-statistic is asymptotically distributed N(0,1) under the
##  null that the models are indistinguishible)
## -------------------------------------------------------------
##               Vuong z-statistic             H_A    p-value
## Raw                   -40.02174 model2 > model1 < 2.22e-16
## AIC-corrected         -39.72126 model2 > model1 < 2.22e-16
## BIC-corrected         -38.63443 model2 > model1 < 2.22e-16

“The Vuong test compares the zero-inflated model with an ordinary Poisson regression model.” The test statistic is significant indicating that the zero-inflated model is superior to the standard Poisson model.

“A Poisson distribution is parameterized by $\lambda$, which happens to be both its mean and variance. While convenient, it’s not often realistic. A distribution of counts will usually have a variance that’s not equal to its mean. When we see this happen with data that we assume is (or hope) is Poisson distributed, we say we have under- or overdispersion, depending on if the variance is smaller or larger than the mean. Performing Poisson regression on count data that exhibits this behavior results in a model that doesn’t fit well.”

“One approach that addresses this issue is the Negative Binomial Regresion. The negative binomial distribution describes the probabilities of the occurrence of whole numbers greater than or equal to 0. Unlike the Poisson distribution, the variance and the mean are not equivalent. This suggests it might serve as a useful approximation for modeling counts with variability different from its mean. The variance of a negative binomial distribution is a function of its mean and has an additional parameter, k, called the dispersion parameter. Say our count is a random variable Y from a negative binomial distribution, when the variance of Y is:”

$var(Y)\quad =\quad \mu \quad +\quad { \mu }^{ 2 }lk$

“As the dispersion parameter gets larger and larger, the variance converges to the same value as the mean, and the negative binomial turns into a Poisson distribution.”

# Mean/Variance
print(paste0("TARGET mean: ", round(mean(training_set$TARGET, 3))))

## [1] "TARGET mean: 3"

print(paste0("TARGET variance: ", round(var(training_set$TARGET),3)))

## [1] "TARGET variance: 3.712"

Model 3: Negative Binomial Regression

So there appears to be a slight overdispersion with the variance greater than the mean. Let’s apply a negative binomial model for model 3.

negative_binomial_3 <- glm.nb(TARGET ~ ., data=training_set)

## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached

## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached

summary(negative_binomial_3)

## 
## Call:
## glm.nb(formula = TARGET ~ ., data = training_set, init.theta = 48494.48001, 
##     link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.0880  -0.6896   0.1272   0.6344   2.6624  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         1.561e+00  2.177e-01   7.172 7.42e-13 ***
## FixedAcidity       -8.289e-04  9.170e-04  -0.904  0.36607    
## VolatileAcidity    -4.345e-02  7.326e-03  -5.930 3.02e-09 ***
## CitricAcid          1.057e-02  6.565e-03   1.610  0.10734    
## ResidualSugar       1.169e-04  1.693e-04   0.691  0.48984    
## Chlorides          -4.288e-02  1.775e-02  -2.416  0.01570 *  
## FreeSulfurDioxide   1.075e-04  3.829e-05   2.808  0.00498 ** 
## TotalSulfurDioxide  6.965e-05  2.476e-05   2.813  0.00491 ** 
## Density            -3.219e-01  2.140e-01  -1.505  0.13245    
## pH                 -2.380e-02  8.334e-03  -2.855  0.00430 ** 
## Sulphates          -1.332e-02  6.136e-03  -2.171  0.02996 *  
## Alcohol             1.708e-03  1.528e-03   1.118  0.26348    
## LabelAppeal         1.473e-01  6.805e-03  21.642  < 2e-16 ***
## AcidIndex          -9.926e-02  5.048e-03 -19.663  < 2e-16 ***
## STARS               3.365e-01  6.279e-03  53.581  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(48494.48) family taken to be 1)
## 
##     Null deviance: 18290  on 10236  degrees of freedom
## Residual deviance: 12838  on 10222  degrees of freedom
## AIC: 38429
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  48494 
##           Std. Err.:  62679 
## Warning while fitting theta: iteration limit reached 
## 
##  2 x log-likelihood:  -38396.79

RMSE Calculation

### RMSE of first model - training dataset
model3_rmse_train <- mean(rmse_calc(training_set$TARGET, predict(negative_binomial_3, newdata = training_set)))
### RMSE of first model - test dataset
model3_rmse_test <- mean(rmse_calc(test_set$TARGET, predict(negative_binomial_3, newdata = test_set)))

model3_aic <- negative_binomial_3$aic

Model 4: Negative Binomial

In our negative binomial regression model below, we use forward and backward step-wise variables selection algorithm. This model is only slightly better with a lower AIC score.

# negative binomial regression with stepwise variable selection
negative_binomial_4 <- stepAIC(negative_binomial_3, direction = "both", trace = FALSE)
summary(negative_binomial_4)

## 
## Call:
## glm.nb(formula = TARGET ~ VolatileAcidity + CitricAcid + Chlorides + 
##     FreeSulfurDioxide + TotalSulfurDioxide + Density + pH + Sulphates + 
##     LabelAppeal + AcidIndex + STARS, data = training_set, init.theta = 48441.73451, 
##     link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.0644  -0.6898   0.1273   0.6338   2.6621  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         1.581e+00  2.170e-01   7.283 3.28e-13 ***
## VolatileAcidity    -4.351e-02  7.326e-03  -5.939 2.86e-09 ***
## CitricAcid          1.067e-02  6.561e-03   1.627  0.10379    
## Chlorides          -4.355e-02  1.774e-02  -2.455  0.01410 *  
## FreeSulfurDioxide   1.066e-04  3.828e-05   2.785  0.00536 ** 
## TotalSulfurDioxide  6.950e-05  2.475e-05   2.809  0.00498 ** 
## Density            -3.228e-01  2.140e-01  -1.509  0.13132    
## pH                 -2.381e-02  8.332e-03  -2.858  0.00426 ** 
## Sulphates          -1.348e-02  6.133e-03  -2.198  0.02794 *  
## LabelAppeal         1.472e-01  6.804e-03  21.630  < 2e-16 ***
## AcidIndex          -1.001e-01  4.992e-03 -20.053  < 2e-16 ***
## STARS               3.371e-01  6.256e-03  53.880  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(48441.73) family taken to be 1)
## 
##     Null deviance: 18290  on 10236  degrees of freedom
## Residual deviance: 12841  on 10225  degrees of freedom
## AIC: 38425
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  48442 
##           Std. Err.:  62616 
## Warning while fitting theta: iteration limit reached 
## 
##  2 x log-likelihood:  -38399.32

Not surprisingly, this negative binomial model performed slightly worse than the previous negative binomial model. I suspect that the transformation had lost some key information. (Again, notable is that the beta values of the independent variables are similar to the beta values of the prior negative binomial model.)

RMSE Calculation

### RMSE of first model - training dataset
model4_rmse_train <- mean(rmse_calc(training_set$TARGET, predict(negative_binomial_4, newdata = training_set)))
### RMSE of first model - test dataset
model4_rmse_test <- mean(rmse_calc(test_set$TARGET, predict(negative_binomial_4, newdata = test_set)))

model4_aic <- negative_binomial_4$aic

Model 5: Multiple Linear Regression

In our multiple linear regression model below, r-squared is 0.4594, which means this model explains 45.94% of the data’s variation. As seen with previous models, FixedAcidity and ResidualSugar seem to have have no impact in this model. So far none of the model adequately explains the dataset.

model5 <- lm(TARGET ~ ., data=training_set)
summary(model5)

## 
## Call:
## lm(formula = TARGET ~ ., data = training_set)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5443 -1.0241  0.1758  1.0316  4.4430 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         3.887e+00  5.367e-01   7.243 4.70e-13 ***
## FixedAcidity       -1.781e-03  2.258e-03  -0.789  0.43022    
## VolatileAcidity    -1.272e-01  1.800e-02  -7.068 1.68e-12 ***
## CitricAcid          2.572e-02  1.625e-02   1.583  0.11344    
## ResidualSugar       3.070e-04  4.183e-04   0.734  0.46296    
## Chlorides          -1.426e-01  4.374e-02  -3.259  0.00112 ** 
## FreeSulfurDioxide   2.798e-04  9.458e-05   2.959  0.00310 ** 
## TotalSulfurDioxide  1.795e-04  6.073e-05   2.955  0.00313 ** 
## Density            -9.352e-01  5.283e-01  -1.770  0.07676 .  
## pH                 -5.729e-02  2.055e-02  -2.788  0.00532 ** 
## Sulphates          -3.619e-02  1.511e-02  -2.396  0.01661 *  
## Alcohol             8.322e-03  3.764e-03   2.211  0.02706 *  
## LabelAppeal         4.509e-01  1.654e-02  27.268  < 2e-16 ***
## AcidIndex          -2.495e-01  1.099e-02 -22.701  < 2e-16 ***
## STARS               1.155e+00  1.676e-02  68.927  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.42 on 10222 degrees of freedom
## Multiple R-squared:  0.4578, Adjusted R-squared:  0.4571 
## F-statistic: 616.6 on 14 and 10222 DF,  p-value: < 2.2e-16

Model Diagnostic Plots

resid_panel(model5, plots='default', smoother = TRUE)

RMSE Calculation

### RMSE of first model - training dataset
model5_rmse_train <- mean(rmse_calc(training_set$TARGET, predict(model5, newdata = training_set)))
### RMSE of first model - test dataset
model5_rmse_test <- mean(rmse_calc(test_set$TARGET, predict(model5, newdata = test_set)))

model5_aic <- model5$aic

Model 6: Multiple Linear Regression

In our last model using multiple linear regression with forward and backward step-wise variables selection algorithm, we see a similar output as model 5. R-squared is 0.4594, which means this model explains 45.94% of the data’s variation.

# multiple linear regression with stepwise variable selection
model6 <- stepAIC(model5, direction = "both", trace = FALSE)
summary(model6)

## 
## Call:
## lm(formula = TARGET ~ VolatileAcidity + CitricAcid + Chlorides + 
##     FreeSulfurDioxide + TotalSulfurDioxide + Density + pH + Sulphates + 
##     Alcohol + LabelAppeal + AcidIndex + STARS, data = training_set)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.527 -1.021  0.177  1.033  4.440 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         3.886e+00  5.367e-01   7.241 4.78e-13 ***
## VolatileAcidity    -1.273e-01  1.800e-02  -7.074 1.60e-12 ***
## CitricAcid          2.548e-02  1.624e-02   1.569  0.11673    
## Chlorides          -1.430e-01  4.374e-02  -3.270  0.00108 ** 
## FreeSulfurDioxide   2.800e-04  9.456e-05   2.961  0.00308 ** 
## TotalSulfurDioxide  1.810e-04  6.071e-05   2.981  0.00288 ** 
## Density            -9.334e-01  5.283e-01  -1.767  0.07732 .  
## pH                 -5.715e-02  2.055e-02  -2.781  0.00542 ** 
## Sulphates          -3.660e-02  1.510e-02  -2.424  0.01538 *  
## Alcohol             8.283e-03  3.762e-03   2.202  0.02771 *  
## LabelAppeal         4.510e-01  1.654e-02  27.271  < 2e-16 ***
## AcidIndex          -2.510e-01  1.082e-02 -23.191  < 2e-16 ***
## STARS               1.155e+00  1.675e-02  68.961  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.42 on 10224 degrees of freedom
## Multiple R-squared:  0.4578, Adjusted R-squared:  0.4571 
## F-statistic: 719.3 on 12 and 10224 DF,  p-value: < 2.2e-16

Model Diagnostic Plots

resid_panel(model6, plots='default', smoother = TRUE)

RMSE Calculation

### RMSE of first model - training dataset
model6_rmse_train <- mean(rmse_calc(training_set$TARGET, predict(model6, newdata = training_set)))
### RMSE of first model - test dataset
model6_rmse_test <- mean(rmse_calc(test_set$TARGET, predict(model6, newdata = test_set)))

model6_aic <- model6$aic

PART IV: Selecting Models

Compare Key Regression Model statistics

The table below summarizes the model statistics for all 3 of our Regression models. The models are listed from left to right in accordance with the order in which they were described in Part III.

# metrics
Train_RMSE <- list(model1_rmse_train, model2_rmse_train, model3_rmse_train, model4_rmse_train, model5_rmse_train, model6_rmse_train)
Test_RMSE <- list(model1_rmse_test, model2_rmse_test, model3_rmse_test, model4_rmse_test, model5_rmse_test, model6_rmse_test)

AIC <- list(model1_aic, model2_aic, model3_aic, model4_aic, model5_aic, model6_aic)

kable(rbind(Train_RMSE, Test_RMSE, AIC), col.names = c("Model 1", "Model 2", "Model 3", "Model 4", "Model 5", "Model 6"))  %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%  scroll_box(width="100%",height="300px")

	Model 1	Model 2	Model 3	Model 4	Model 5	Model 6
Train_RMSE	2.61749084465551	1.35432902629987	2.61743685476711	2.61749006996924	1.41854806293486	1.4186294587951
Test_RMSE	2.6020239659855	1.35247459709401	2.60206289256156	2.60202304945068	1.40469654988496	1.40437495721672
AIC	38423.1462242275	NULL	38428.7871733393	38425.3235609569	NULL	NULL

To make prediction, we will select one of our count regression model. The criteria for our selection for the best count regression model will be the AIC score and mean squared error of the model. Based on the table above, model 1 is our best model.

Model Prediction

lm_predicted <- round(predict(poisson_model_1, newdata = eval_df),0)

lm_predicted_df <- as.data.frame(cbind(eval_df$IN, lm_predicted))

colnames(lm_predicted_df) <- c('INDEX','TARGET_WINS')

lm_predicted_df %>% kable() %>%
    kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%  scroll_box(width="100%",height="300px")

INDEX	TARGET_WINS
3	1
9	1
10	1
18	1
21	1
30	2
31	1
37	2
39	0
47	1
60	1
62	1
63	1
64	1
68	1
75	1
76	1
83	1
87	1
92	2
98	1
106	1
107	1
113	1
120	1
123	2
125	1
126	2
128	2
129	1
131	1
135	1
141	2
147	1
148	1
151	1
156	1
157	1
174	1
186	1
193	1
195	1
212	1
213	1
217	1
223	1
226	1
228	1
230	1
241	1
243	1
249	1
281	1
288	0
294	0
295	1
300	2
302	1
303	2
308	1
319	1
320	1
324	1
331	1
343	1
347	1
348	1
350	1
357	0
358	1
360	2
366	1
367	1
368	2
376	0
380	1
388	0
396	1
398	2
403	1
410	0
412	2
420	1
434	1
440	1
450	1
453	1
464	2
465	1
466	2
473	1
476	2
478	1
479	1
493	1
497	1
503	1
504	1
505	1
507	1
513	1
519	1
521	1
522	1
545	1
549	1
551	2
556	2
557	2
559	1
560	1
566	1
569	1
573	1
578	1
579	2
582	1
596	1
598	0
599	1
602	1
605	1
617	1
619	2
630	1
634	1
643	1
645	1
647	1
649	1
656	1
657	1
658	1
667	1
692	1
693	1
698	1
699	1
700	2
704	1
707	1
708	1
709	1
713	0
714	1
716	0
718	1
722	1
729	2
731	1
733	1
746	1
747	1
748	1
753	1
757	1
763	1
767	2
774	1
776	1
788	1
794	1
799	1
803	1
806	1
807	1
811	2
816	2
818	1
819	1
831	2
835	2
837	1
841	1
846	1
856	2
861	1
862	2
863	1
865	1
871	1
879	1
880	1
881	1
885	1
887	1
892	1
898	1
900	0
904	1
906	2
910	1
912	1
913	1
919	2
924	1
925	1
930	1
940	1
941	1
946	1
949	2
951	1
962	1
966	1
967	2
971	1
981	1
982	1
983	1
984	1
989	1
990	1
992	1
995	2
996	1
998	1
1001	2
1007	1
1008	1
1016	1
1022	1
1027	2
1032	1
1033	1
1041	1
1065	1
1074	1
1075	1
1081	1
1094	1
1099	1
1105	1
1123	2
1135	1
1142	1
1155	1
1169	1
1176	1
1178	1
1180	1
1184	1
1185	1
1193	1
1196	1
1199	1
1203	1
1205	1
1207	1
1208	1
1212	0
1213	1
1222	1
1223	1
1226	1
1227	2
1229	1
1230	2
1231	1
1241	0
1243	1
1244	2
1246	2
1248	1
1249	1
1252	1
1261	1
1275	1
1281	1
1285	1
1288	1
1290	1
1291	1
1304	1
1305	1
1323	1
1342	1
1348	1
1353	1
1363	1
1371	1
1372	1
1378	1
1381	1
1382	1
1393	1
1394	2
1398	2
1404	1
1405	1
1419	1
1421	1
1426	1
1431	1
1435	1
1437	1
1438	2
1442	0
1464	1
1471	1
1473	1
1476	1
1478	0
1479	1
1487	2
1492	1
1496	1
1497	1
1515	1
1519	1
1522	1
1526	1
1537	1
1538	2
1540	0
1543	1
1548	1
1549	1
1556	1
1564	1
1570	1
1577	1
1585	2
1590	1
1592	1
1594	1
1596	2
1598	2
1603	2
1607	1
1612	2
1627	1
1629	1
1630	1
1640	2
1641	1
1646	1
1662	1
1668	1
1671	1
1672	2
1673	2
1686	1
1688	1
1696	1
1701	2
1707	1
1708	1
1713	1
1715	1
1717	1
1721	1
1724	1
1725	1
1730	1
1731	1
1734	1
1740	1
1748	1
1749	1
1750	2
1763	1
1768	2
1773	1
1777	2
1778	1
1780	1
1782	1
1784	1
1786	1
1787	1
1792	1
1800	1
1801	1
1803	1
1804	1
1807	1
1818	2
1821	1
1822	2
1828	1
1833	1
1844	1
1847	1
1850	1
1854	1
1858	1
1864	1
1867	0
1876	1
1880	1
1881	0
1891	1
1894	1
1895	1
1901	2
1905	1
1912	2
1918	1
1921	1
1923	1
1924	2
1931	1
1941	2
1950	1
1951	2
1954	1
1961	1
1966	1
1979	1
1982	1
1987	1
1997	1
2004	1
2011	2
2015	1
2025	2
2033	1
2034	1
2035	1
2036	1
2053	1
2059	1
2060	1
2073	1
2084	1
2089	1
2092	1
2109	2
2129	1
2134	2
2135	2
2148	1
2149	1
2150	1
2165	1
2166	1
2168	2
2170	0
2171	1
2172	1
2176	2
2182	1
2189	0
2191	1
2197	1
2202	1
2203	1
2204	0
2206	1
2218	0
2219	1
2221	1
2226	1
2228	1
2232	1
2236	0
2241	1
2245	2
2251	1
2255	2
2256	1
2259	1
2263	1
2264	1
2267	1
2273	0
2277	1
2287	1
2289	1
2291	1
2296	0
2299	1
2306	1
2314	1
2317	1
2318	1
2321	1
2324	1
2340	1
2343	1
2349	1
2352	2
2353	1
2365	1
2370	1
2378	1
2390	0
2399	1
2402	1
2403	1
2404	2
2414	2
2422	1
2424	1
2430	1
2435	1
2439	2
2442	1
2445	1
2449	2
2451	1
2461	1
2464	1
2465	1
2472	1
2476	1
2482	1
2487	2
2498	1
2501	1
2504	1
2511	1
2518	2
2521	2
2530	1
2543	1
2545	1
2561	2
2566	1
2572	1
2577	1
2578	2
2580	1
2581	2
2582	1
2584	1
2590	2
2598	1
2602	2
2605	2
2616	1
2618	1
2619	1
2624	2
2632	1
2640	1
2646	2
2651	1
2660	1
2661	1
2668	1
2670	2
2680	1
2681	1
2689	1
2694	1
2695	1
2696	1
2702	1
2704	1
2708	2
2709	1
2714	1
2716	0
2723	1
2725	1
2738	1
2750	1
2756	1
2758	1
2766	1
2767	1
2771	1
2775	1
2776	1
2779	1
2780	1
2781	1
2782	2
2783	1
2796	1
2798	1
2800	1
2803	2
2806	2
2813	1
2818	1
2821	2
2825	2
2829	0
2830	1
2833	1
2839	1
2843	2
2846	1
2847	1
2848	2
2856	1
2863	1
2867	2
2869	1
2873	2
2874	1
2875	1
2880	1
2886	1
2887	1
2888	1
2889	1
2890	2
2892	1
2901	1
2902	0
2905	0
2917	0
2922	1
2924	1
2930	1
2931	1
2946	1
2955	1
2962	1
2964	1
2965	1
2967	1
2970	2
2973	2
2974	1
2976	1
2977	1
2978	1
2986	2
2988	2
2989	2
2995	1
3005	1
3011	1
3013	1
3019	1
3021	1
3022	2
3029	1
3037	1
3042	1
3043	1
3049	1
3050	2
3053	1
3058	1
3062	2
3063	1
3065	1
3080	1
3088	1
3093	1
3096	2
3101	2
3103	1
3107	2
3109	2
3111	2
3113	1
3116	1
3132	1
3141	2
3153	1
3154	1
3160	1
3167	1
3170	1
3173	1
3174	1
3177	2
3179	1
3184	1
3190	1
3193	2
3199	1
3201	1
3202	1
3203	1
3206	1
3209	1
3210	1
3217	2
3220	1
3228	1
3232	1
3239	1
3243	1
3245	1
3246	1
3251	2
3253	2
3257	1
3260	1
3261	1
3263	2
3278	1
3281	1
3283	2
3290	1
3297	1
3304	1
3305	1
3307	1
3308	1
3313	1
3314	1
3317	1
3348	1
3350	0
3359	1
3367	1
3376	1
3378	1
3384	1
3386	1
3387	1
3388	1
3390	1
3391	1
3396	0
3398	1
3404	2
3406	2
3407	1
3414	1
3419	1
3423	1
3427	1
3432	1
3434	1
3438	1
3442	0
3443	1
3448	0
3456	1
3464	2
3470	1
3475	1
3477	1
3490	1
3493	1
3502	1
3508	2
3516	1
3517	1
3525	1
3532	1
3535	1
3536	1
3540	1
3547	1
3550	1
3557	1
3562	1
3563	1
3564	1
3570	1
3573	1
3577	1
3579	1
3581	1
3587	1
3602	1
3609	1
3612	1
3621	1
3642	1
3647	1
3649	1
3654	0
3660	1
3665	1
3669	1
3673	1
3675	1
3678	1
3680	1
3686	2
3693	1
3710	1
3713	1
3718	2
3725	1
3726	1
3747	1
3753	1
3754	2
3760	2
3763	1
3765	1
3769	2
3771	1
3784	1
3787	1
3794	1
3796	1
3798	1
3809	1
3812	2
3819	1
3828	1
3831	2
3833	1
3837	2
3839	1
3843	1
3846	1
3854	2
3861	1
3864	1
3868	1
3869	2
3870	1
3883	1
3886	1
3889	1
3894	0
3907	1
3910	1
3913	1
3914	1
3921	2
3923	1
3929	1
3931	1
3932	2
3937	1
3943	1
3956	1
3957	1
3961	2
3971	1
4004	0
4005	1
4006	1
4011	1
4013	2
4014	2
4016	0
4017	2
4020	1
4022	1
4026	1
4032	1
4043	1
4045	1
4048	2
4051	2
4052	1
4056	1
4059	0
4069	2
4074	1
4076	1
4077	1
4079	0
4081	1
4088	2
4105	1
4125	1
4134	1
4139	0
4146	1
4149	2
4151	0
4155	1
4157	1
4168	2
4170	1
4174	1
4179	2
4185	1
4199	1
4205	1
4208	1
4211	1
4212	0
4215	1
4217	1
4219	1
4226	1
4227	1
4229	1
4231	0
4233	1
4237	1
4243	1
4248	2
4255	2
4262	1
4266	1
4268	2
4270	1
4273	2
4276	1
4277	1
4279	0
4299	1
4313	2
4322	1
4324	1
4328	1
4331	1
4335	0
4337	1
4338	1
4343	0
4347	0
4355	1
4357	1
4359	2
4362	1
4368	1
4374	2
4375	2
4378	1
4381	1
4387	1
4400	0
4423	1
4424	1
4428	1
4433	2
4436	1
4437	1
4439	2
4449	1
4456	1
4463	2
4467	1
4468	1
4469	1
4472	1
4473	2
4476	1
4500	1
4509	1
4513	1
4521	1
4527	1
4530	0
4532	1
4533	1
4535	1
4536	1
4542	1
4551	1
4554	1
4555	0
4564	1
4572	1
4573	2
4577	1
4579	2
4583	1
4584	2
4596	1
4599	1
4607	1
4609	0
4610	1
4616	1
4617	1
4633	1
4638	1
4641	1
4653	2
4655	1
4659	1
4669	1
4678	1
4685	1
4686	1
4691	1
4695	1
4698	1
4700	2
4711	1
4722	1
4727	1
4756	2
4762	1
4763	1
4766	2
4770	1
4784	1
4791	1
4795	2
4799	1
4802	1
4805	1
4814	1
4816	1
4817	1
4822	1
4827	1
4833	2
4836	1
4842	1
4844	1
4845	1
4849	1
4850	1
4860	1
4863	1
4871	1
4878	1
4881	1
4888	1
4900	2
4906	1
4909	1
4916	1
4918	2
4926	1
4928	1
4941	1
4946	2
4949	1
4956	1
4966	1
4969	1
4973	1
4978	2
4982	1
4985	1
4991	1
4998	1
5000	1
5004	1
5005	1
5011	1
5016	0
5018	1
5034	1
5038	1
5042	2
5046	1
5051	1
5054	1
5057	1
5062	1
5063	1
5065	1
5066	1
5076	1
5089	1
5092	1
5093	1
5094	1
5098	1
5102	1
5112	2
5117	1
5127	1
5130	1
5131	0
5132	1
5135	1
5136	1
5147	1
5157	1
5160	1
5165	1
5166	1
5172	1
5173	1
5179	1
5184	2
5187	1
5191	1
5193	1
5194	1
5199	1
5212	1
5213	1
5224	1
5226	1
5239	1
5252	1
5264	1
5266	1
5271	2
5273	1
5276	1
5278	2
5281	0
5283	1
5291	2
5294	2
5296	1
5297	1
5313	1
5314	1
5321	1
5325	1
5326	1
5328	1
5334	1
5338	1
5344	0
5348	1
5352	1
5353	1
5354	1
5361	1
5364	1
5365	1
5367	1
5379	1
5382	1
5386	2
5395	1
5410	1
5411	1
5416	1
5424	1
5426	1
5428	1
5430	2
5433	1
5437	1
5440	1
5442	2
5445	1
5449	1
5452	1
5460	1
5461	0
5465	1
5467	1
5471	1
5474	1
5475	1
5480	1
5481	1
5484	1
5494	2
5495	1
5497	1
5499	1
5507	1
5510	1
5515	1
5516	1
5517	1
5524	1
5530	2
5534	1
5543	1
5545	1
5558	1
5562	1
5573	2
5581	1
5583	2
5587	1
5589	0
5591	2
5596	1
5606	1
5608	1
5611	1
5612	1
5614	1
5620	1
5623	2
5624	1
5626	2
5633	1
5635	1
5640	1
5643	1
5644	2
5653	2
5663	1
5664	2
5667	1
5671	1
5673	1
5676	1
5678	0
5698	1
5700	2
5705	1
5706	1
5711	2
5712	2
5716	1
5719	1
5725	1
5728	2
5734	0
5735	1
5743	1
5754	1
5755	1
5756	1
5766	1
5770	1
5774	1
5775	1
5776	1
5778	2
5786	1
5787	1
5791	2
5794	1
5803	1
5804	1
5808	1
5810	1
5813	1
5828	1
5839	2
5842	1
5843	0
5844	1
5847	1
5851	1
5854	1
5857	1
5866	1
5874	1
5886	1
5895	0
5897	1
5898	1
5900	2
5902	1
5908	1
5909	1
5912	2
5913	1
5917	2
5918	1
5921	1
5931	1
5942	1
5943	1
5950	1
5954	1
5983	1
5995	1
6002	1
6005	1
6009	2
6011	1
6012	2
6019	1
6021	1
6029	2
6036	1
6037	1
6038	0
6043	0
6045	1
6047	1
6048	1
6061	1
6063	1
6064	2
6068	2
6069	1
6070	1
6071	2
6074	1
6079	1
6082	1
6088	1
6094	1
6095	1
6098	1
6102	1
6105	1
6113	1
6116	1
6120	1
6121	1
6126	1
6144	1
6145	1
6153	1
6156	1
6159	1
6162	1
6184	1
6188	1
6189	1
6191	0
6211	1
6216	1
6218	1
6222	1
6235	1
6245	1
6248	1
6253	1
6256	1
6257	1
6259	0
6266	1
6268	1
6275	1
6280	1
6283	1
6288	2
6289	1
6301	1
6308	1
6314	1
6315	0
6316	1
6317	1
6318	1
6323	1
6329	2
6336	1
6341	1
6348	1
6349	1
6365	1
6372	1
6376	1
6378	1
6379	1
6382	1
6383	1
6389	1
6390	0
6392	2
6394	1
6402	1
6404	1
6405	1
6406	1
6409	2
6410	2
6411	1
6421	1
6428	2
6429	1
6432	1
6436	1
6437	1
6438	2
6445	1
6447	1
6450	1
6462	1
6467	1
6478	1
6484	1
6492	1
6497	2
6504	1
6505	1
6513	1
6525	2
6526	1
6528	1
6540	1
6542	1
6544	2
6548	1
6552	1
6558	1
6567	1
6569	2
6572	2
6577	2
6581	1
6588	1
6591	1
6594	1
6600	2
6602	1
6604	1
6605	1
6614	1
6616	1
6621	1
6640	2
6641	1
6643	1
6644	1
6649	1
6650	1
6655	2
6661	0
6672	2
6677	1
6688	1
6689	2
6691	1
6692	1
6694	1
6702	1
6714	1
6716	2
6724	1
6725	1
6730	1
6735	1
6738	1
6739	1
6743	1
6747	1
6750	2
6751	1
6753	1
6754	1
6755	1
6762	1
6764	1
6772	2
6774	1
6787	1
6789	1
6793	1
6798	2
6799	1
6800	1
6802	1
6808	1
6809	1
6812	1
6814	1
6816	1
6822	1
6829	1
6834	1
6836	2
6839	1
6840	1
6843	1
6846	0
6848	1
6852	1
6856	2
6860	1
6866	2
6870	1
6878	2
6880	2
6885	1
6897	2
6902	1
6904	1
6907	1
6909	1
6914	1
6915	1
6922	1
6924	1
6933	1
6934	1
6941	1
6957	1
6960	1
6969	0
6975	1
6980	1
6983	1
6987	1
6994	1
6997	1
7002	1
7010	1
7015	1
7019	1
7022	1
7025	1
7029	1
7031	1
7037	1
7038	1
7043	1
7049	1
7052	1
7053	2
7056	1
7057	2
7080	1
7086	1
7087	1
7105	1
7108	1
7121	0
7122	1
7125	1
7132	1
7134	1
7151	1
7152	1
7157	1
7159	1
7166	2
7167	1
7177	1
7179	1
7181	2
7183	1
7186	1
7193	2
7205	1
7207	1
7209	1
7216	1
7232	0
7235	1
7238	1
7240	1
7243	1
7252	1
7269	1
7275	1
7281	0
7283	1
7287	2
7289	2
7291	1
7294	2
7304	2
7308	1
7313	1
7319	1
7325	1
7326	1
7330	2
7332	1
7337	1
7341	1
7346	1
7353	0
7354	1
7361	1
7366	1
7368	2
7372	1
7375	1
7377	2
7380	2
7382	2
7385	1
7392	1
7395	1
7397	1
7403	1
7406	1
7409	1
7410	1
7412	1
7419	1
7425	1
7435	2
7438	1
7440	1
7447	1
7449	1
7456	1
7464	1
7478	1
7480	1
7481	2
7483	1
7484	1
7491	1
7494	1
7501	1
7503	1
7509	1
7517	1
7518	2
7519	0
7521	2
7522	1
7536	1
7539	1
7547	2
7549	1
7552	1
7554	1
7556	1
7564	1
7566	1
7570	1
7571	1
7572	1
7575	0
7586	1
7589	2
7590	1
7597	1
7602	1
7604	1
7605	1
7612	0
7615	1
7617	1
7624	1
7632	1
7639	1
7642	1
7643	1
7649	1
7650	1
7653	1
7654	1
7657	2
7662	1
7669	1
7671	1
7675	1
7678	1
7682	1
7688	1
7689	1
7690	2
7692	1
7699	2
7705	1
7712	1
7726	1
7728	0
7735	1
7737	1
7739	1
7743	2
7744	1
7746	1
7749	1
7750	1
7752	1
7755	2
7756	1
7762	2
7764	1
7769	1
7770	1
7776	2
7778	2
7784	1
7786	1
7789	1
7793	1
7794	1
7804	1
7811	1
7813	1
7815	1
7817	1
7818	1
7821	2
7825	1
7830	2
7832	1
7835	1
7839	2
7842	2
7849	1
7856	1
7857	1
7863	1
7866	2
7871	1
7875	1
7882	0
7887	0
7888	1
7891	0
7895	1
7901	1
7906	1
7908	2
7917	0
7924	1
7948	1
7950	2
7955	1
7957	1
7959	1
7967	2
7969	1
7971	1
7974	1
7976	1
7986	2
7987	0
7993	1
7996	1
7998	1
8018	1
8019	1
8027	1
8036	1
8040	1
8044	1
8050	1
8052	1
8054	1
8057	1
8058	2
8059	1
8066	1
8070	2
8072	2
8078	1
8079	1
8080	1
8081	1
8088	1
8091	1
8094	1
8095	1
8099	2
8101	2
8102	2
8116	2
8125	2
8134	1
8139	1
8141	1
8147	1
8158	2
8160	1
8165	1
8187	1
8205	1
8209	1
8211	1
8232	1
8236	1
8237	2
8238	2
8245	1
8256	1
8268	1
8269	1
8270	1
8286	1
8289	1
8301	1
8305	1
8310	1
8312	0
8318	2
8321	1
8328	1
8331	1
8334	1
8344	1
8345	0
8352	2
8358	1
8359	0
8360	1
8365	1
8366	1
8369	1
8373	2
8378	1
8392	1
8397	1
8399	1
8400	1
8405	1
8406	1
8410	1
8413	1
8414	1
8416	2
8426	1
8434	1
8439	2
8440	1
8475	2
8480	2
8497	0
8499	1
8500	1
8501	1
8502	1
8518	1
8520	1
8523	1
8525	0
8532	1
8535	1
8543	1
8554	1
8560	2
8561	2
8563	1
8566	1
8570	1
8572	1
8582	0
8583	1
8587	1
8592	0
8593	1
8607	1
8609	1
8610	1
8614	1
8616	2
8622	2
8623	1
8624	1
8633	2
8641	2
8644	2
8649	1
8653	1
8657	2
8658	1
8663	1
8672	0
8680	1
8684	1
8687	2
8688	1
8690	1
8712	0
8717	1
8730	2
8739	1
8744	1
8747	2
8748	1
8751	1
8758	1
8761	1
8763	2
8764	1
8765	2
8773	2
8780	1
8781	1
8782	1
8785	0
8786	1
8797	1
8799	1
8807	1
8816	1
8817	1
8826	2
8833	1
8834	1
8835	1
8840	2
8843	1
8849	1
8855	1
8861	1
8862	1
8865	1
8868	2
8870	1
8880	1
8885	1
8894	1
8895	2
8899	2
8912	1
8922	1
8924	1
8928	1
8932	1
8943	2
8945	2
8946	1
8954	0
8958	1
8960	1
8965	0
8966	1
8967	1
8969	1
8980	1
8984	1
8985	1
8988	1
8989	1
8995	1
9004	1
9010	0
9012	1
9018	1
9036	0
9037	0
9040	1
9041	2
9044	2
9045	1
9047	1
9049	1
9061	2
9062	1
9076	1
9079	1
9081	1
9082	1
9089	1
9092	1
9094	2
9115	1
9117	1
9118	1
9120	1
9124	1
9128	1
9135	1
9136	1
9138	1
9157	1
9176	2
9183	2
9187	1
9188	1
9190	1
9197	1
9200	1
9201	1
9203	1
9212	1
9213	1
9214	2
9217	1
9219	1
9220	2
9221	2
9237	1
9240	1
9241	1
9248	1
9253	2
9259	1
9267	1
9271	1
9273	0
9285	2
9290	1
9291	1
9293	1
9294	1
9301	1
9302	1
9312	1
9316	1
9319	1
9328	1
9331	1
9338	0
9350	1
9356	1
9359	0
9362	1
9364	1
9370	1
9380	1
9386	1
9394	1
9407	1
9411	1
9422	1
9423	1
9429	1
9433	1
9439	1
9451	1
9452	1
9453	1
9460	2
9465	1
9470	1
9476	1
9485	1
9486	1
9488	1
9507	2
9508	1
9517	2
9521	2
9528	1
9532	1
9536	1
9540	2
9542	1
9546	1
9548	1
9549	2
9554	2
9555	1
9558	1
9573	2
9575	2
9584	1
9586	1
9588	2
9591	1
9592	2
9597	2
9600	1
9603	1
9605	1
9614	1
9616	1
9622	1
9624	1
9629	1
9633	1
9640	1
9644	0
9645	1
9646	1
9648	1
9649	0
9660	1
9664	1
9675	1
9679	1
9680	1
9682	1
9697	1
9701	1
9704	1
9705	2
9707	1
9714	1
9718	1
9722	2
9739	1
9747	2
9751	1
9757	0
9759	1
9760	1
9764	1
9776	1
9778	1
9786	1
9803	1
9804	1
9815	2
9824	0
9825	0
9826	1
9827	1
9833	1
9835	0
9860	1
9865	1
9871	1
9874	1
9880	1
9882	1
9885	1
9888	1
9892	1
9893	2
9896	1
9902	1
9906	1
9910	1
9914	2
9918	1
9920	1
9926	1
9931	2
9935	1
9945	1
9953	2
9957	1
9963	1
9972	1
9976	2
9979	1
9980	1
9982	1
9991	1
10000	1
10003	1
10005	1
10014	1
10032	1
10034	1
10041	1
10042	1
10044	2
10045	0
10054	1
10061	2
10062	1
10073	0
10081	2
10084	2
10086	1
10093	1
10101	2
10105	1
10110	1
10113	1
10115	1
10119	1
10121	1
10124	1
10126	2
10127	1
10145	1
10147	1
10148	1
10162	0
10163	1
10166	1
10172	1
10173	1
10175	0
10180	1
10186	1
10192	1
10199	1
10209	1
10210	2
10214	1
10215	1
10216	2
10232	1
10239	1
10249	1
10253	2
10255	1
10262	1
10264	1
10266	1
10268	1
10271	1
10272	2
10276	1
10277	1
10279	1
10281	0
10285	1
10294	0
10300	1
10304	1
10307	1
10309	2
10310	1
10312	1
10321	1
10332	1
10336	1
10368	1
10369	1
10375	1
10376	1
10379	1
10380	1
10383	0
10385	2
10387	1
10397	1
10412	1
10413	2
10418	1
10420	1
10426	1
10427	1
10428	1
10430	1
10435	2
10436	0
10446	2
10448	1
10449	1
10463	1
10469	1
10470	2
10471	1
10473	2
10476	1
10482	1
10500	1
10511	2
10512	1
10514	1
10515	2
10526	1
10546	1
10549	1
10553	1
10558	1
10575	1
10581	1
10583	0
10584	1
10585	1
10610	1
10611	1
10616	1
10618	1
10628	0
10632	1
10642	1
10648	1
10649	1
10650	1
10654	0
10656	2
10661	2
10663	2
10672	1
10678	2
10685	1
10690	1
10702	1
10706	1
10708	1
10716	1
10717	2
10720	2
10729	1
10730	2
10745	2
10753	1
10754	1
10762	1
10766	1
10776	1
10783	1
10789	1
10790	2
10797	1
10807	1
10810	1
10817	1
10820	1
10822	1
10828	1
10829	1
10830	1
10831	2
10841	2
10847	1
10856	0
10860	1
10861	2
10863	1
10875	1
10884	1
10895	1
10897	1
10898	1
10903	1
10908	2
10924	1
10926	1
10927	1
10928	1
10933	0
10939	2
10942	1
10945	1
10949	1
10950	1
10958	2
10963	1
10967	1
10971	1
10972	1
10974	1
10976	2
10980	1
10991	1
10995	2
11014	2
11017	1
11019	1
11022	1
11030	2
11031	1
11041	2
11042	1
11044	1
11047	1
11048	1
11049	1
11052	1
11058	1
11069	1
11070	1
11073	2
11074	1
11078	2
11079	1
11085	1
11088	1
11106	1
11110	2
11114	1
11118	1
11129	1
11130	1
11131	1
11133	1
11138	1
11143	1
11146	2
11153	1
11162	1
11170	2
11171	1
11201	1
11216	1
11219	1
11222	1
11234	1
11238	1
11244	1
11246	1
11248	1
11250	1
11256	1
11259	0
11263	1
11264	1
11270	1
11274	1
11281	1
11285	1
11300	0
11305	1
11317	1
11319	1
11330	2
11334	1
11335	2
11336	1
11356	1
11358	1
11360	1
11364	1
11373	2
11379	1
11382	1
11383	1
11385	1
11387	1
11391	1
11397	1
11404	0
11405	1
11409	1
11419	1
11430	2
11434	2
11436	1
11440	1
11443	1
11449	1
11452	1
11453	1
11456	1
11457	1
11459	1
11471	0
11476	1
11479	1
11481	1
11485	1
11486	1
11487	1
11488	0
11498	2
11506	1
11511	2
11515	1
11518	1
11521	0
11523	1
11524	2
11525	1
11528	1
11530	1
11531	1
11533	1
11535	1
11537	1
11538	1
11541	1
11548	1
11552	1
11558	1
11560	1
11566	1
11572	1
11573	2
11582	1
11586	1
11590	1
11591	1
11601	1
11611	1
11617	1
11619	1
11624	1
11626	2
11644	0
11652	1
11656	1
11658	1
11659	2
11663	1
11665	1
11683	1
11685	1
11691	1
11694	1
11698	1
11700	1
11703	1
11705	1
11710	1
11711	1
11714	1
11731	1
11732	1
11742	1
11744	1
11745	1
11749	2
11756	1
11761	0
11762	2
11766	2
11767	2
11769	1
11770	1
11771	1
11777	1
11778	2
11779	1
11788	1
11790	1
11794	1
11801	1
11807	1
11812	1
11817	2
11818	1
11825	1
11828	1
11833	2
11837	2
11838	0
11842	2
11853	1
11857	1
11858	0
11860	2
11867	1
11868	2
11871	1
11875	1
11881	2
11890	1
11892	2
11894	1
11896	1
11903	1
11905	1
11907	1
11909	2
11911	1
11915	1
11918	1
11920	2
11923	1
11924	1
11926	0
11931	1
11933	1
11940	1
11951	1
11953	1
11973	1
11984	1
11985	1
11991	1
12002	1
12006	1
12008	2
12013	1
12015	1
12016	1
12023	1
12029	2
12036	1
12038	1
12041	2
12049	1
12050	1
12054	1
12060	1
12062	2
12065	1
12079	1
12083	2
12090	1
12091	1
12094	1
12099	1
12101	1
12110	1
12116	2
12122	1
12127	2
12133	1
12142	2
12147	1
12156	1
12157	2
12158	2
12161	1
12163	1
12166	1
12170	1
12174	2
12183	1
12188	1
12189	2
12192	1
12201	1
12204	1
12207	1
12208	0
12209	1
12210	2
12217	1
12227	1
12231	1
12232	1
12239	2
12240	1
12251	1
12256	1
12261	0
12263	1
12266	1
12267	1
12268	1
12279	1
12280	1
12283	1
12284	1
12285	2
12286	2
12292	1
12295	1
12301	1
12314	1
12315	0
12318	1
12332	2
12334	2
12337	1
12338	1
12349	2
12350	1
12359	2
12360	2
12373	1
12374	1
12380	1
12382	1
12383	1
12390	1
12398	1
12405	1
12407	1
12410	2
12418	2
12421	1
12422	1
12439	1
12444	1
12463	2
12465	1
12470	1
12471	1
12480	1
12482	1
12484	1
12487	1
12491	1
12503	1
12507	1
12526	1
12533	1
12540	1
12543	1
12552	1
12555	2
12556	1
12570	1
12579	1
12588	1
12600	1
12615	2
12624	1
12629	1
12634	1
12638	1
12646	1
12650	1
12665	1
12674	1
12676	1
12678	1
12685	1
12690	1
12698	1
12702	1
12704	1
12705	1
12710	1
12715	2
12720	1
12734	1
12744	1
12747	1
12757	1
12758	0
12766	1
12782	1
12787	1
12799	1
12804	1
12809	1
12813	1
12816	1
12821	1
12826	1
12831	1
12832	1
12833	1
12835	1
12842	1
12844	1
12847	0
12852	1
12856	1
12857	1
12858	2
12861	1
12869	1
12876	1
12877	2
12879	1
12882	1
12883	2
12887	1
12889	1
12891	2
12894	1
12895	1
12899	2
12905	2
12913	1
12916	1
12917	1
12925	1
12934	2
12939	1
12943	2
12950	2
12961	0
12963	1
12973	2
12979	1
12980	0
12981	1
12982	1
12992	1
12994	1
12999	1
13002	1
13004	1
13010	1
13013	1
13015	1
13019	1
13030	0
13031	2
13036	1
13037	2
13042	1
13054	1
13060	2
13072	1
13073	1
13079	2
13081	1
13086	1
13087	1
13090	1
13098	1
13100	1
13105	1
13106	1
13107	1
13113	2
13115	2
13117	1
13118	1
13121	1
13137	0
13146	1
13150	1
13151	1
13152	1
13156	1
13165	2
13169	1
13178	1
13180	1
13183	1
13184	1
13188	1
13191	1
13196	1
13203	1
13206	1
13211	1
13219	1
13223	2
13226	1
13228	1
13230	1
13240	1
13249	2
13250	1
13256	1
13261	2
13263	1
13268	1
13275	2
13277	2
13283	1
13284	1
13285	1
13286	1
13287	1
13290	1
13291	2
13294	2
13295	1
13303	1
13306	0
13311	1
13322	1
13331	1
13337	1
13344	1
13362	1
13364	1
13366	1
13368	1
13370	0
13377	2
13378	1
13388	1
13392	1
13398	1
13403	1
13404	1
13409	1
13416	2
13422	0
13427	1
13433	2
13438	1
13441	2
13449	2
13450	1
13453	1
13460	1
13461	1
13465	1
13468	1
13481	1
13485	1
13487	2
13490	1
13493	1
13497	1
13508	2
13516	1
13525	1
13533	2
13535	2
13538	1
13545	1
13566	1
13581	1
13584	1
13588	1
13596	1
13600	2
13604	1
13608	1
13611	1
13612	1
13615	1
13616	1
13618	1
13625	1
13628	1
13629	1
13630	2
13633	1
13637	1
13640	1
13641	1
13651	1
13674	1
13684	1
13690	1
13707	1
13709	2
13710	1
13713	1
13724	2
13725	2
13731	1
13736	0
13740	2
13745	2
13748	1
13751	1
13758	1
13762	1
13764	2
13765	1
13769	0
13770	1
13774	2
13787	1
13791	0
13802	0
13807	1
13808	0
13809	1
13810	1
13822	1
13823	1
13825	1
13826	1
13833	1
13837	1
13842	1
13846	0
13852	1
13853	0
13858	1
13860	1
13866	1
13886	1
13887	2
13890	1
13891	1
13893	1
13902	2
13903	1
13908	1
13912	1
13924	1
13928	1
13929	1
13938	1
13939	1
13941	1
13951	0
13962	0
13964	1
13967	1
13971	1
13972	1
13975	1
13977	1
13979	2
13983	1
13984	1
13987	1
13994	1
13999	1
14003	2
14008	1
14011	1
14012	1
14016	1
14017	1
14020	1
14027	2
14038	1
14040	2
14042	1
14055	2
14057	1
14060	1
14081	1
14091	1
14111	1
14117	2
14121	1
14122	2
14125	1
14129	1
14135	1
14148	1
14157	1
14161	1
14163	1
14172	1
14180	1
14182	1
14188	2
14191	1
14201	2
14202	1
14213	1
14220	1
14224	1
14231	1
14241	2
14243	1
14245	1
14247	1
14248	1
14252	1
14254	1
14260	1
14269	1
14272	2
14274	1
14279	1
14280	2
14290	1
14298	2
14308	1
14313	1
14316	1
14319	2
14322	1
14323	1
14325	2
14337	1
14339	1
14341	1
14342	1
14346	2
14351	1
14354	1
14355	1
14358	2
14359	1
14364	1
14374	1
14376	0
14382	1
14384	1
14393	1
14398	1
14403	1
14406	1
14408	1
14411	2
14414	1
14418	1
14423	1
14442	1
14443	1
14444	1
14446	1
14455	1
14456	2
14458	1
14464	1
14466	1
14467	1
14469	1
14483	1
14484	1
14490	1
14491	1
14494	1
14496	1
14503	1
14504	1
14505	1
14506	1
14507	1
14512	1
14520	1
14527	0
14531	1
14532	2
14535	1
14543	1
14554	1
14556	2
14557	1
14561	1
14562	1
14567	0
14568	1
14574	1
14575	1
14579	1
14581	1
14582	1
14586	1
14591	1
14598	1
14599	1
14600	1
14612	2
14613	2
14624	1
14626	1
14630	2
14633	1
14639	1
14642	1
14643	2
14649	1
14650	1
14653	1
14655	1
14656	1
14662	1
14663	1
14673	1
14674	1
14676	1
14682	1
14685	2
14689	1
14693	1
14697	1
14700	1
14704	1
14710	1
14719	1
14724	1
14728	1
14735	1
14736	1
14741	1
14744	0
14753	1
14756	1
14762	2
14765	2
14783	2
14784	1
14786	0
14790	0
14793	1
14796	2
14801	1
14807	1
14812	1
14815	2
14831	1
14833	2
14836	2
14856	1
14859	1
14861	1
14863	0
14865	1
14880	1
14881	1
14883	1
14884	1
14894	1
14896	1
14899	1
14900	1
14901	1
14906	1
14907	1
14915	2
14919	1
14926	2
14927	1
14933	1
14937	1
14939	1
14940	1
14943	1
14953	1
14954	1
14969	1
14999	2
15008	1
15009	1
15018	1
15023	1
15025	1
15034	1
15036	1
15051	2
15052	1
15064	1
15070	1
15074	1
15077	1
15081	1
15086	2
15093	0
15094	2
15103	1
15104	1
15110	1
15112	0
15115	1
15131	1
15139	1
15141	1
15148	1
15154	2
15156	1
15161	0
15167	1
15178	1
15205	2
15207	2
15222	1
15223	2
15225	1
15228	1
15239	1
15241	1
15246	1
15247	1
15249	1
15255	1
15257	1
15267	1
15277	1
15280	2
15289	1
15297	1
15302	1
15304	0
15312	1
15321	0
15325	1
15326	1
15333	1
15337	1
15338	1
15340	1
15342	0
15344	1
15347	1
15349	2
15355	1
15359	1
15366	1
15367	0
15368	1
15369	1
15380	1
15381	1
15387	0
15388	1
15389	1
15392	2
15400	1
15405	1
15407	1
15408	2
15411	1
15413	1
15418	2
15419	2
15421	1
15425	1
15436	1
15438	1
15440	1
15443	1
15460	1
15464	1
15465	1
15473	1
15475	1
15483	1
15494	2
15495	2
15498	1
15499	1
15500	1
15501	1
15510	0
15512	1
15516	1
15518	2
15519	1
15524	1
15527	1
15529	1
15530	1
15538	1
15539	1
15541	1
15546	1
15547	1
15548	2
15552	0
15556	2
15567	1
15572	1
15573	1
15574	1
15577	1
15579	1
15581	1
15589	1
15596	1
15598	1
15599	1
15605	2
15606	2
15608	1
15616	1
15618	1
15621	1
15626	2
15638	1
15639	0
15642	1
15644	1
15646	1
15649	1
15656	1
15659	1
15680	1
15686	1
15693	2
15697	2
15699	2
15701	1
15705	1
15714	1
15722	2
15728	1
15734	1
15752	0
15756	1
15760	1
15762	1
15767	1
15768	1
15773	2
15774	1
15781	1
15782	1
15784	2
15791	1
15796	1
15798	2
15806	1
15814	1
15819	1
15825	1
15826	1
15831	1
15835	2
15836	1
15839	2
15845	1
15858	1
15859	1
15876	1
15878	2
15880	2
15886	2
15888	1
15891	1
15900	0
15902	1
15904	1
15908	0
15910	1
15917	1
15919	2
15924	1
15927	1
15937	2
15946	1
15949	1
15957	1
15961	1
15964	1
15965	1
15966	1
15978	1
15983	1
15987	1
15988	1
15998	1
16004	1
16008	1
16011	1
16023	1
16024	1
16025	1
16048	1
16050	1
16051	1
16057	2
16059	2
16060	1
16075	2
16094	2
16096	1
16116	1
16118	1
16121	1
16122	1
16124	2
16125	1
16126	1
16130	1

Model output

# export to .csv for submission
write.csv(lm_predicted_df, file = "C:/CUNY/Semester4(Spring)/DATA 621/Assignments/Homework5/Output/Wine_Sales_Prediction.csv",row.names = FALSE)

Our model prediction output can be found in the below GitHub location -

Model Output

DATA 621 - Homework5 - Wine Sales Prediction Count Regression Model

Soumya Ghosh

May 10, 2020

Libraries

Introduction

Objective

Data Load

Training Data

PART I: Data Exploration

Data Cleansing & Transoformation

Descriptive Statistical Summary

Missing Value Analysis

Imputation

Descriptive Statistical Plots

Box Plots

Density Plots

Observations Summary

PART II: Data Preparation

Correlation Plot

Feature Engineering

PART III: Building Models

Test train approach

Build Poisson Regression Models

Model 1 (Poisson Regression):

RMSE Calculation

Model 2 (Poisson Regression Model 2 - Zero-inflated):

RMSE Calculation

Poisson models comparison statistic

Model 3: Negative Binomial Regression

RMSE Calculation

Model 4: Negative Binomial

RMSE Calculation

Model 5: Multiple Linear Regression

Model Diagnostic Plots

RMSE Calculation

Model 6: Multiple Linear Regression

Model Diagnostic Plots

RMSE Calculation

PART IV: Selecting Models

Compare Key Regression Model statistics

Model Prediction

Model output