Libraries

library(kableExtra)
library(tidyverse)
library(ggplot2)
library(dplyr)
library(MASS)
library(corrplot)
library(RColorBrewer)
library(GGally)
library(ggResidpanel)
library(psych)
library(mice)
library(reshape2)
library(cowplot)
library(car)
library(caTools)
library(VIM)
library(broom)
library(pROC)
library(caret)
library(geoR)
library(moments)
library(glmulti)
library(pscl)

Introduction

In this homework assignment, we will explore, analyze and model a data set containing information on approximately 12,000 commercially available wines. The variables are mostly related to the chemical properties of the wine being sold. The response variable is the number of sample cases of wine that were purchased by wine distribution companies after sampling a wine. These cases would be used to provide tasting samples to restaurants and wine stores around the United States. The more sample cases purchased, the more likely is a wine to be sold at a high end restaurant. A large wine manufacturer is studying the data in order to predict the number of wine cases ordered based upon the wine characteristics. If the wine manufacturer can predict the number of cases, then that manufacturer will be able to adjust their wine offering to maximize sales.

Our dataset consists of 15 variables about different qualities. A wine producer might be able to use this data, along with the target variable, number of cases purchased by restaurants, to determine what qualities consumers are looking for in wines and to be able to plan accordingly.

vr <- c("INDEX", "TARGET", "AcidIndex", "Alcohol", "Chlorides", "CitricAcid", "Density", "FixedAcidity", "FreeSulfurDioxide", "LabelAppeal", "ResidualSugar", "STARS", "Sulphates", "TotalSulfurDioxide", "VolatileAcidity", "pH")

def <- c("Identification Variable (do not use)", "Number of Cases Purchased", "Proprietary method of testing total acidity of wine by using a weighted average", "Alcohol Content", "Chloride content of wine", "Citric Acid Content", "Density of Wine", "Fixed Acidity of Wine", "Sulfur Dioxide content of wine", "Marketing Score indicating the appeal of label design for consumers. High numbers suggest customers like the label design. Negative numbers suggest customes don't like the design.", "Residual Sugar of wine", "Wine rating by a team of experts. 4 Stars = Excellent, 1 Star = Poor", "Sulfate conten of wine", "Total Sulfur Dioxide of Wine", "Volatile Acid content of wine", "pH of wine")

te <- c("None", "None", "", "",  "", "", "", "",  "", "Many consumers purchase based on the visual appeal of the wine label design. Higher numbers suggest better sales", "", "A high number of stars suggests high sales",  "", "", "", "")

kable(cbind(vr, def, te), col.names = c("Variable Name", "Definition", "Theoretical Effect")) %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="300px")
Variable Name Definition Theoretical Effect
INDEX Identification Variable (do not use) None
TARGET Number of Cases Purchased None
AcidIndex Proprietary method of testing total acidity of wine by using a weighted average
Alcohol Alcohol Content
Chlorides Chloride content of wine
CitricAcid Citric Acid Content
Density Density of Wine
FixedAcidity Fixed Acidity of Wine
FreeSulfurDioxide Sulfur Dioxide content of wine
LabelAppeal Marketing Score indicating the appeal of label design for consumers. High numbers suggest customers like the label design. Negative numbers suggest customes don’t like the design. Many consumers purchase based on the visual appeal of the wine label design. Higher numbers suggest better sales
ResidualSugar Residual Sugar of wine
STARS Wine rating by a team of experts. 4 Stars = Excellent, 1 Star = Poor A high number of stars suggests high sales
Sulphates Sulfate conten of wine
TotalSulfurDioxide Total Sulfur Dioxide of Wine
VolatileAcidity Volatile Acid content of wine
pH pH of wine

Objective

Our objective is to build a count regression model to predict the number of cases of wine that will be sold given certain properties of the wine. HINT: Sometimes, the fact that a variable is missing is actually predictive of the target. You can only use the variables given to you (or variables that you derive from the variables provided).

Data Load

Loaded Training and Evalutaion data sets into respective data frames.

train_df <- read.csv("https://raw.githubusercontent.com/soumya2g/CUNYDataMiningHomeWork/master/HomeWork_5/DataFiles/wine-training-data.csv")

eval_df <- read.csv("https://raw.githubusercontent.com/soumya2g/CUNYDataMiningHomeWork/master/HomeWork_5/DataFiles/wine-evaluation-data.csv")

Training Data

Sample snapshot of training data frame -

head(train_df, 20) %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="300px")
ï..INDEX TARGET FixedAcidity VolatileAcidity CitricAcid ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide Density pH Sulphates Alcohol LabelAppeal AcidIndex STARS
1 3 3.2 1.160 -0.98 54.20 -0.567 NA 268 0.99280 3.33 -0.59 9.9 0 8 2
2 3 4.5 0.160 -0.81 26.10 -0.425 15 -327 1.02792 3.38 0.70 NA -1 7 3
4 5 7.1 2.640 -0.88 14.80 0.037 214 142 0.99518 3.12 0.48 22.0 -1 8 3
5 3 5.7 0.385 0.04 18.80 -0.425 22 115 0.99640 2.24 1.83 6.2 -1 6 1
6 4 8.0 0.330 -1.26 9.40 NA -167 108 0.99457 3.12 1.77 13.7 0 9 2
7 0 11.3 0.320 0.59 2.20 0.556 -37 15 0.99940 3.20 1.29 15.4 0 11 NA
8 0 7.7 0.290 -0.40 21.50 0.060 287 156 0.99572 3.49 1.21 10.3 0 8 NA
11 4 6.5 -1.220 0.34 1.40 0.040 523 551 1.03236 3.20 NA 11.6 1 7 3
12 3 14.8 0.270 1.05 11.25 -0.007 -213 NA 0.99620 4.93 0.26 15.0 0 6 NA
13 6 5.5 -0.220 0.39 1.80 -0.277 62 180 0.94724 3.09 0.75 12.6 0 8 4
14 0 -17.2 0.520 0.15 -33.80 -0.022 551 65 0.99340 4.31 0.56 13.1 1 5 1
15 4 9.0 0.220 0.49 10.40 -0.150 NA 195 1.05918 3.01 -0.40 10.2 0 10 2
16 3 6.0 0.330 -1.06 3.00 0.518 5 378 0.96643 3.55 -0.86 3.9 1 7 2
17 7 14.9 0.320 0.34 1.30 0.042 20 NA 0.95523 2.93 2.16 12.0 2 8 3
19 4 21.8 0.510 1.83 2.90 -0.165 -49 45 0.99740 6.02 2.23 12.1 0 9 NA
20 0 5.7 0.500 -0.11 6.10 0.071 234 -331 0.99780 3.35 NA 6.8 0 8 NA
22 0 7.9 0.600 0.06 50.40 0.069 15 339 0.99640 3.30 0.46 3.9 1 9 NA
23 4 -1.3 0.220 2.95 -53.00 0.541 -85 -266 0.99672 3.61 0.82 10.0 0 8 3
24 5 10.0 0.230 0.27 14.10 0.033 -188 229 0.99880 3.14 0.88 11.0 1 11 2
25 4 6.8 0.475 -0.20 -50.75 0.047 -88 508 0.99403 3.23 0.35 18.3 -1 8 2
str(train_df)
## 'data.frame':    12795 obs. of  16 variables:
##  $ ï..INDEX          : int  1 2 4 5 6 7 8 11 12 13 ...
##  $ TARGET            : int  3 3 5 3 4 0 0 4 3 6 ...
##  $ FixedAcidity      : num  3.2 4.5 7.1 5.7 8 11.3 7.7 6.5 14.8 5.5 ...
##  $ VolatileAcidity   : num  1.16 0.16 2.64 0.385 0.33 0.32 0.29 -1.22 0.27 -0.22 ...
##  $ CitricAcid        : num  -0.98 -0.81 -0.88 0.04 -1.26 0.59 -0.4 0.34 1.05 0.39 ...
##  $ ResidualSugar     : num  54.2 26.1 14.8 18.8 9.4 ...
##  $ Chlorides         : num  -0.567 -0.425 0.037 -0.425 NA 0.556 0.06 0.04 -0.007 -0.277 ...
##  $ FreeSulfurDioxide : num  NA 15 214 22 -167 -37 287 523 -213 62 ...
##  $ TotalSulfurDioxide: num  268 -327 142 115 108 15 156 551 NA 180 ...
##  $ Density           : num  0.993 1.028 0.995 0.996 0.995 ...
##  $ pH                : num  3.33 3.38 3.12 2.24 3.12 3.2 3.49 3.2 4.93 3.09 ...
##  $ Sulphates         : num  -0.59 0.7 0.48 1.83 1.77 1.29 1.21 NA 0.26 0.75 ...
##  $ Alcohol           : num  9.9 NA 22 6.2 13.7 15.4 10.3 11.6 15 12.6 ...
##  $ LabelAppeal       : int  0 -1 -1 -1 0 0 0 1 0 0 ...
##  $ AcidIndex         : int  8 7 8 6 9 11 8 7 6 8 ...
##  $ STARS             : int  2 3 3 1 2 NA NA 3 NA 4 ...

PART I: Data Exploration

Data Cleansing & Transoformation

We wanted to start off data exploration process with cleaning up the issues that we observed above -

# Exclude INDEX attribute from the data frame
train_df <- train_df %>% dplyr::select(-"ï..INDEX")

Descriptive Statistical Summary

Next step of data exploration process involves high level descriptive statistical summary and missing/exception value analysis.

stat_summary <- function(df){
  df %>%
    summary() %>%
    kable() %>%
    kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% 
    scroll_box(width="100%",height="400px")
}
stat_summary(train_df)
 TARGET </th>
FixedAcidity VolatileAcidity CitricAcid ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide
Density </th>
   pH </th>
Sulphates
Alcohol </th>
LabelAppeal AcidIndex
 STARS </th>
Min. :0.000 Min. :-18.100 Min. :-2.7900 Min. :-3.2400 Min. :-127.800 Min. :-1.1710 Min. :-555.00 Min. :-823.0 Min. :0.8881 Min. :0.480 Min. :-3.1300 Min. :-4.70 Min. :-2.000000 Min. : 4.000 Min. :1.000
1st Qu.:2.000 1st Qu.: 5.200 1st Qu.: 0.1300 1st Qu.: 0.0300 1st Qu.: -2.000 1st Qu.:-0.0310 1st Qu.: 0.00 1st Qu.: 27.0 1st Qu.:0.9877 1st Qu.:2.960 1st Qu.: 0.2800 1st Qu.: 9.00 1st Qu.:-1.000000 1st Qu.: 7.000 1st Qu.:1.000
Median :3.000 Median : 6.900 Median : 0.2800 Median : 0.3100 Median : 3.900 Median : 0.0460 Median : 30.00 Median : 123.0 Median :0.9945 Median :3.200 Median : 0.5000 Median :10.40 Median : 0.000000 Median : 8.000 Median :2.000
Mean :3.029 Mean : 7.076 Mean : 0.3241 Mean : 0.3084 Mean : 5.419 Mean : 0.0548 Mean : 30.85 Mean : 120.7 Mean :0.9942 Mean :3.208 Mean : 0.5271 Mean :10.49 Mean :-0.009066 Mean : 7.773 Mean :2.042
3rd Qu.:4.000 3rd Qu.: 9.500 3rd Qu.: 0.6400 3rd Qu.: 0.5800 3rd Qu.: 15.900 3rd Qu.: 0.1530 3rd Qu.: 70.00 3rd Qu.: 208.0 3rd Qu.:1.0005 3rd Qu.:3.470 3rd Qu.: 0.8600 3rd Qu.:12.40 3rd Qu.: 1.000000 3rd Qu.: 8.000 3rd Qu.:3.000
Max. :8.000 Max. : 34.400 Max. : 3.6800 Max. : 3.8600 Max. : 141.150 Max. : 1.3510 Max. : 623.00 Max. :1057.0 Max. :1.0992 Max. :6.130 Max. : 4.2400 Max. :26.50 Max. : 2.000000 Max. :17.000 Max. :4.000
NA NA NA NA NA’s :616 NA’s :638 NA’s :647 NA’s :682 NA NA’s :395 NA’s :1210 NA’s :653 NA NA NA’s :3359

From the above we can see that there are certain features with missing values. We will handle these issues in a further step towards improving data quality.

We also used describe() function of ‘psych’ package to summarize additional statistical measurements like Standard Deviation, Skewness, Kurtois, Standard Error etc.

stat_desc <- function(df){
df %>% 
    describe() %>%
    kable() %>%
    kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%  scroll_box(width="100%",height="300px")
}

stat_desc(train_df)
vars n mean sd median trimmed mad min max range skew kurtosis se
TARGET 1 12795 3.0290739 1.9263682 3.00000 3.0538244 1.4826000 0.00000 8.00000 8.00000 -0.3263010 -0.8772457 0.0170302
FixedAcidity 2 12795 7.0757171 6.3176435 6.90000 7.0736739 3.2617200 -18.10000 34.40000 52.50000 -0.0225860 1.6749987 0.0558515
VolatileAcidity 3 12795 0.3241039 0.7840142 0.28000 0.3243890 0.4299540 -2.79000 3.68000 6.47000 0.0203800 1.8322106 0.0069311
CitricAcid 4 12795 0.3084127 0.8620798 0.31000 0.3102520 0.4151280 -3.24000 3.86000 7.10000 -0.0503070 1.8379401 0.0076213
ResidualSugar 5 12179 5.4187331 33.7493790 3.90000 5.5800410 15.7155600 -127.80000 141.15000 268.95000 -0.0531229 1.8846917 0.3058158
Chlorides 6 12157 0.0548225 0.3184673 0.04600 0.0540159 0.1349166 -1.17100 1.35100 2.52200 0.0304272 1.7886044 0.0028884
FreeSulfurDioxide 7 12148 30.8455713 148.7145577 30.00000 30.9334877 56.3388000 -555.00000 623.00000 1178.00000 0.0063930 1.8364966 1.3492769
TotalSulfurDioxide 8 12113 120.7142326 231.9132105 123.00000 120.8895367 134.9166000 -823.00000 1057.00000 1880.00000 -0.0071794 1.6746665 2.1071703
Density 9 12795 0.9942027 0.0265376 0.99449 0.9942130 0.0093552 0.88809 1.09924 0.21115 -0.0186938 1.8999592 0.0002346
pH 10 12400 3.2076282 0.6796871 3.20000 3.2055706 0.3854760 0.48000 6.13000 5.65000 0.0442880 1.6462681 0.0061038
Sulphates 11 11585 0.5271118 0.9321293 0.50000 0.5271453 0.4447800 -3.13000 4.24000 7.37000 0.0059119 1.7525655 0.0086602
Alcohol 12 12142 10.4892363 3.7278190 10.40000 10.5018255 2.3721600 -4.70000 26.50000 31.20000 -0.0307158 1.5394949 0.0338306
LabelAppeal 13 12795 -0.0090660 0.8910892 0.00000 -0.0099639 1.4826000 -2.00000 2.00000 4.00000 0.0084295 -0.2622916 0.0078777
AcidIndex 14 12795 7.7727237 1.3239264 8.00000 7.6431572 1.4826000 4.00000 17.00000 13.00000 1.6484959 5.1900925 0.0117043
STARS 15 9436 2.0417550 0.9025400 2.00000 1.9711258 1.4826000 1.00000 4.00000 3.00000 0.4472353 -0.6925343 0.0092912

Missing Value Analysis

Below we have applied imputation technique for the features that has NA values.

## Counts of missing data per feature
train_na_df <- data.frame(apply(train_df, 2, function(x) length(which(is.na(x)))))
train_na_df1 <- data.frame(apply(train_df, 2,function(x) {sum(is.na(x)) / length(x) * 100}))

train_na_df <- cbind(Feature = rownames(train_na_df), train_na_df, train_na_df1)
colnames(train_na_df) <- c('Feature Name','No. of NA Recocrds','Percentage of NA Records')
rownames(train_na_df) <- NULL


train_na_df%>% filter(`No. of NA Recocrds` != 0) %>% arrange(desc(`No. of NA Recocrds`)) %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="300px")
Feature Name No. of NA Recocrds Percentage of NA Records
STARS 3359 26.252442
Sulphates 1210 9.456819
TotalSulfurDioxide 682 5.330207
Alcohol 653 5.103556
FreeSulfurDioxide 647 5.056663
Chlorides 638 4.986323
ResidualSugar 616 4.814381
pH 395 3.087143

Imputation

Imputing the missing data for the other 4 features, which we are keeping for our analysis. We have used the ‘Predictive Mean Matching’(pmm) method included in MICE package for imputation purposes.

impute_data <- function(df){
  
  df <- mice(data = df, m = 1, method = "pmm", maxit = 5, seed = 500)
  df <- mice::complete(df, 1)
}

train_df <- impute_data(train_df)
## 
##  iter imp variable
##   1   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   2   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   3   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   4   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   5   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
eval_df <- impute_data(eval_df)
## 
##  iter imp variable
##   1   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   2   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   3   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   4   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   5   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS

Below we have applied imputation technique for the features that has NA values per the table above.

Descriptive Statistical Plots

Box Plots

## Box plots:
gb1 <- ggplot(data = train_df, aes(y = TARGET)) + geom_boxplot()
gb2 <- ggplot(data = train_df, aes(y = FixedAcidity)) + geom_boxplot()
gb3 <- ggplot(data = train_df, aes(y = VolatileAcidity)) + geom_boxplot()
gb4 <- ggplot(data = train_df, aes(y = CitricAcid)) + geom_boxplot()
gb5 <- ggplot(data = train_df, aes(y = ResidualSugar)) + geom_boxplot()
gb6 <- ggplot(data = train_df, aes(y = Chlorides)) + geom_boxplot()
gb7 <- ggplot(data = train_df, aes(y = FreeSulfurDioxide)) + geom_boxplot()
gb8 <- ggplot(data = train_df, aes(y = TotalSulfurDioxide)) + geom_boxplot()
gb9 <- ggplot(data = train_df, aes(y = Density)) + geom_boxplot()
gb10 <- ggplot(data = train_df, aes(y = pH)) + geom_boxplot()
gb11 <- ggplot(data = train_df, aes(y = Sulphates)) + geom_boxplot()
gb12 <- ggplot(data = train_df, aes(y = Alcohol)) + geom_boxplot()
gb13 <- ggplot(data = train_df, aes(y = LabelAppeal)) + geom_boxplot()
gb14 <- ggplot(data = train_df, aes(y = AcidIndex)) + geom_boxplot()
gb15 <- ggplot(data = train_df, aes(y = STARS)) + geom_boxplot()


plot_grid(gb1, gb2, gb3, gb4, gb5, gb6, gb7, gb8, gb9, gb10,
          gb11, gb12, gb13, gb14, gb15, labels = "AUTO, scale = 8")

Density Plots

train_df %>% dplyr::select(TARGET,FixedAcidity,VolatileAcidity,CitricAcid,ResidualSugar,Chlorides,FreeSulfurDioxide,TotalSulfurDioxide,Density,pH,Sulphates,Alcohol,LabelAppeal,AcidIndex,STARS) %>%
  gather(variable, value) %>%
  ggplot(., aes(value)) + 
  geom_density(fill = "dodgerblue4", color="dodgerblue4") + 
  facet_wrap(~variable, scales ="free", ncol = 4) +
  labs(x = element_blank(), y = element_blank())

Observations Summary

The above shows that most of the data is more or less normally distributed. There are some values such as Sulphates, Chlorides, VolatileAcidity, CitricAcid, ResidualSugar, FixedAcid, FreeSulfurDioxide, and TotalSulfurDioxide that have negative values. After taking a closer look at these data points, it is likely that the creator of this dataset ended up standardizing the data. Because the data is already normally distributed and the test (eval) dataset is also likely structured in this way, we will not be making any adjustments to this data.

It would be interesting to see if there were any correlations between the independent variables to independent variables, and independent variables to the dependent variable.

PART II: Data Preparation

Correlation Plot

trainnum_df <- dplyr::select_if(train_df, is.numeric)
corrMatrix <- round(cor(trainnum_df),4)

corrMatrix %>% corrplot(., method = "color", outline = T, addgrid.col = "darkgray", order="hclust", addrect = 4, rect.col = "black", rect.lwd = 5,cl.pos = "b", tl.col = "indianred4", tl.cex = 1.0, cl.cex = 1.0, addCoef.col = "white", number.digits = 2, number.cex = 0.8, col = colorRampPalette(c("darkred","white","dodgerblue4"))(100))

In the correlation plot above, we see, STARS and LabelAppeal are most positively correlated variables with the response variable. We expected this because our variable description mentions these variable’s theoretical effect are higher than other variables. Also, we some mild negative correlation between the response variable and AcidIndex variable.

Feature Engineering

In binary logistic regression, it is desirable to have predictor variables that are normally distributed, whenever possible. The data in the crime dataset presents some factors that would lead us to have to perform some “Transformations” on the data. These transformation include adding categorical variables, log of variables, and adding power transformations etc.

PART III: Building Models

Test train approach

We have divided the traning dataset into training and test sets using a 80/20 split. We will build our models on the training set and evaluate it on the test set.

set.seed(123)
split <- sample.split(train_df$TARGET, SplitRatio = 0.8)
training_set <- subset(train_df, split == TRUE)
test_set <- subset(train_df, split == FALSE)

Below is the list of complete list of variables before we embark on Model building process -

str(training_set)
## 'data.frame':    10237 obs. of  15 variables:
##  $ TARGET            : int  3 3 5 3 4 0 0 4 6 4 ...
##  $ FixedAcidity      : num  3.2 4.5 7.1 5.7 8 11.3 7.7 6.5 5.5 21.8 ...
##  $ VolatileAcidity   : num  1.16 0.16 2.64 0.385 0.33 0.32 0.29 -1.22 -0.22 0.51 ...
##  $ CitricAcid        : num  -0.98 -0.81 -0.88 0.04 -1.26 0.59 -0.4 0.34 0.39 1.83 ...
##  $ ResidualSugar     : num  54.2 26.1 14.8 18.8 9.4 2.2 21.5 1.4 1.8 2.9 ...
##  $ Chlorides         : num  -0.567 -0.425 0.037 -0.425 0.049 0.556 0.06 0.04 -0.277 -0.165 ...
##  $ FreeSulfurDioxide : num  -8 15 214 22 -167 -37 287 523 62 -49 ...
##  $ TotalSulfurDioxide: num  268 -327 142 115 108 15 156 551 180 45 ...
##  $ Density           : num  0.993 1.028 0.995 0.996 0.995 ...
##  $ pH                : num  3.33 3.38 3.12 2.24 3.12 3.2 3.49 3.2 3.09 6.02 ...
##  $ Sulphates         : num  -0.59 0.7 0.48 1.83 1.77 1.29 1.21 0.83 0.75 2.23 ...
##  $ Alcohol           : num  9.9 13.5 22 6.2 13.7 15.4 10.3 11.6 12.6 12.1 ...
##  $ LabelAppeal       : int  0 -1 -1 -1 0 0 0 1 0 0 ...
##  $ AcidIndex         : int  8 7 8 6 9 11 8 7 8 9 ...
##  $ STARS             : int  2 3 3 1 2 1 1 3 4 1 ...

We will build two different Poisson regression models using dataset with and without imputed values, two different negative binomial regression models using stepwise variables selection and imputed variables, and two multiple linear regression models using stepwise variables selection and imputed variables to see which model yields the best performance.

Build Poisson Regression Models

Model 1 (Poisson Regression):

We will start off with a model with all the original variables excluding any derived features -

poisson_model_1 <- glm(TARGET ~ ., family=poisson, data=training_set)
poisson_model_1 <- step(poisson_model_1, direction="backward")
## Start:  AIC=38426.61
## TARGET ~ FixedAcidity + VolatileAcidity + CitricAcid + ResidualSugar + 
##     Chlorides + FreeSulfurDioxide + TotalSulfurDioxide + Density + 
##     pH + Sulphates + Alcohol + LabelAppeal + AcidIndex + STARS
## 
##                      Df Deviance   AIC
## - ResidualSugar       1    12839 38425
## - FixedAcidity        1    12839 38425
## - Alcohol             1    12840 38426
## <none>                     12839 38427
## - Density             1    12841 38427
## - CitricAcid          1    12841 38427
## - Sulphates           1    12843 38429
## - Chlorides           1    12844 38430
## - FreeSulfurDioxide   1    12846 38432
## - TotalSulfurDioxide  1    12846 38433
## - pH                  1    12847 38433
## - VolatileAcidity     1    12874 38460
## - AcidIndex           1    13247 38833
## - LabelAppeal         1    13307 38893
## - STARS               1    15645 41231
## 
## Step:  AIC=38425.09
## TARGET ~ FixedAcidity + VolatileAcidity + CitricAcid + Chlorides + 
##     FreeSulfurDioxide + TotalSulfurDioxide + Density + pH + Sulphates + 
##     Alcohol + LabelAppeal + AcidIndex + STARS
## 
##                      Df Deviance   AIC
## - FixedAcidity        1    12840 38424
## - Alcohol             1    12840 38424
## <none>                     12839 38425
## - Density             1    12841 38425
## - CitricAcid          1    12842 38426
## - Sulphates           1    12844 38428
## - Chlorides           1    12845 38429
## - FreeSulfurDioxide   1    12847 38431
## - TotalSulfurDioxide  1    12847 38431
## - pH                  1    12847 38431
## - VolatileAcidity     1    12874 38458
## - AcidIndex           1    13247 38831
## - LabelAppeal         1    13307 38891
## - STARS               1    15648 41232
## 
## Step:  AIC=38423.92
## TARGET ~ VolatileAcidity + CitricAcid + Chlorides + FreeSulfurDioxide + 
##     TotalSulfurDioxide + Density + pH + Sulphates + Alcohol + 
##     LabelAppeal + AcidIndex + STARS
## 
##                      Df Deviance   AIC
## - Alcohol             1    12841 38423
## <none>                     12840 38424
## - Density             1    12842 38424
## - CitricAcid          1    12842 38424
## - Sulphates           1    12845 38427
## - Chlorides           1    12846 38428
## - FreeSulfurDioxide   1    12848 38430
## - TotalSulfurDioxide  1    12848 38430
## - pH                  1    12848 38430
## - VolatileAcidity     1    12875 38457
## - AcidIndex           1    13264 38846
## - LabelAppeal         1    13308 38891
## - STARS               1    15648 41230
## 
## Step:  AIC=38423.15
## TARGET ~ VolatileAcidity + CitricAcid + Chlorides + FreeSulfurDioxide + 
##     TotalSulfurDioxide + Density + pH + Sulphates + LabelAppeal + 
##     AcidIndex + STARS
## 
##                      Df Deviance   AIC
## <none>                     12841 38423
## - Density             1    12843 38423
## - CitricAcid          1    12844 38424
## - Sulphates           1    12846 38426
## - Chlorides           1    12847 38427
## - FreeSulfurDioxide   1    12849 38429
## - TotalSulfurDioxide  1    12849 38429
## - pH                  1    12849 38429
## - VolatileAcidity     1    12876 38456
## - AcidIndex           1    13267 38847
## - LabelAppeal         1    13309 38889
## - STARS               1    15677 41257
summary(poisson_model_1)
## 
## Call:
## glm(formula = TARGET ~ VolatileAcidity + CitricAcid + Chlorides + 
##     FreeSulfurDioxide + TotalSulfurDioxide + Density + pH + Sulphates + 
##     LabelAppeal + AcidIndex + STARS, family = poisson, data = training_set)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.0644  -0.6898   0.1274   0.6338   2.6622  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         1.581e+00  2.170e-01   7.283 3.27e-13 ***
## VolatileAcidity    -4.351e-02  7.326e-03  -5.940 2.86e-09 ***
## CitricAcid          1.067e-02  6.561e-03   1.627  0.10378    
## Chlorides          -4.355e-02  1.774e-02  -2.455  0.01410 *  
## FreeSulfurDioxide   1.066e-04  3.828e-05   2.785  0.00535 ** 
## TotalSulfurDioxide  6.950e-05  2.474e-05   2.809  0.00498 ** 
## Density            -3.228e-01  2.139e-01  -1.509  0.13131    
## pH                 -2.381e-02  8.332e-03  -2.858  0.00426 ** 
## Sulphates          -1.348e-02  6.133e-03  -2.198  0.02794 *  
## LabelAppeal         1.472e-01  6.804e-03  21.631  < 2e-16 ***
## AcidIndex          -1.001e-01  4.991e-03 -20.053  < 2e-16 ***
## STARS               3.371e-01  6.256e-03  53.882  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 18291  on 10236  degrees of freedom
## Residual deviance: 12841  on 10225  degrees of freedom
## AIC: 38423
## 
## Number of Fisher Scoring iterations: 5

All of these variables appear to have minor but have statistically significant impact. Overall, it does appear that the more citric, sulfur dioxide containing, alcohol, label appeal, and stars that the observation had, the more likely the case was going to sell. Likewise, the less acid, more dense, more sulphates, and more chlorides, the less likely the wine was going to sell cases.

RMSE Calculation

rmse_calc <- function(actual, predicted) {
  rmse_val <- sqrt(sum((actual - predicted)^2) / length(actual))

  return(rmse_val)
}

### RMSE of first model - training dataset
model1_rmse_train <- mean(rmse_calc(training_set$TARGET, predict(poisson_model_1, newdata = training_set)))
### RMSE of first model - test dataset
model1_rmse_test <- mean(rmse_calc(test_set$TARGET, predict(poisson_model_1, newdata = test_set)))

model1_aic <- poisson_model_1$aic

Model 2 (Poisson Regression Model 2 - Zero-inflated):

“Zero-inflated poisson regression is used to model count data that has an excess of zero counts. Further, theory suggests that the excess zeros are generated by a separate process from the count values and that the excess zeros can be modeled independently. Thus, the zip model has two parts, a poisson count model and the logit model for predicting excess zeros.”

Reference: https://www.theanalysisfactor.com/zero-inflated-poisson-models-for-count-outcomes/

“The Poisson distribution assumes that each count is the result of each Poisson process - a random process that says each counted event is independent and equally likely. If this count variable is used as the outcome of a regression model, we can use Poisson regression to estimate how predictors affect the number of times the devent occurred.”

“But sometimes, it’s just a matter of having too many zeros that a Poisson would predict. In this case, a better solution is often the Zero-Inflated Poisson (ZIP) model.ZIP models assume that some zeros occurred by a Poisson process.The ZIP model fits, simultaneously, two separate regression models. One is a logistic model that models the probability of being eligible for a non-zero count. The other models the size of that count.”

“Both models use the same predictor variables, but estimate their coefficients separately. So the predictors can have vastly different effects on the two processes.”

Below is the second model utilizing ZIP.

poisson_model_2 <- zeroinfl(TARGET ~ ., data = training_set)
summary(poisson_model_2)
## 
## Call:
## zeroinfl(formula = TARGET ~ ., data = training_set)
## 
## Pearson residuals:
##      Min       1Q   Median       3Q      Max 
## -2.15710 -0.45312  0.02585  0.42737  6.98368 
## 
## Count model coefficients (poisson with log link):
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         1.434e+00  2.253e-01   6.363 1.97e-10 ***
## FixedAcidity        1.840e-04  9.405e-04   0.196 0.844929    
## VolatileAcidity    -1.669e-02  7.561e-03  -2.207 0.027308 *  
## CitricAcid         -7.852e-04  6.748e-03  -0.116 0.907369    
## ResidualSugar      -6.918e-05  1.743e-04  -0.397 0.691419    
## Chlorides          -1.993e-02  1.838e-02  -1.084 0.278182    
## FreeSulfurDioxide   1.138e-05  3.878e-05   0.293 0.769256    
## TotalSulfurDioxide -1.849e-05  2.471e-05  -0.749 0.454148    
## Density            -3.276e-01  2.212e-01  -1.481 0.138666    
## pH                  5.086e-03  8.642e-03   0.589 0.556164    
## Sulphates           5.984e-04  6.366e-03   0.094 0.925109    
## Alcohol             7.100e-03  1.560e-03   4.552 5.31e-06 ***
## LabelAppeal         2.402e-01  7.092e-03  33.873  < 2e-16 ***
## AcidIndex          -1.859e-02  5.500e-03  -3.379 0.000727 ***
## STARS               1.150e-01  6.978e-03  16.477  < 2e-16 ***
## 
## Zero-inflation model coefficients (binomial with logit link):
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        -3.0501383  1.3524316  -2.255 0.024115 *  
## FixedAcidity        0.0036236  0.0055406   0.654 0.513108    
## VolatileAcidity     0.2054621  0.0439780   4.672 2.98e-06 ***
## CitricAcid         -0.0536164  0.0406526  -1.319 0.187205    
## ResidualSugar      -0.0009446  0.0010396  -0.909 0.363543    
## Chlorides           0.2022919  0.1127347   1.794 0.072748 .  
## FreeSulfurDioxide  -0.0006176  0.0002423  -2.548 0.010819 *  
## TotalSulfurDioxide -0.0006515  0.0001525  -4.273 1.93e-05 ***
## Density             1.0260037  1.3274080   0.773 0.439559    
## pH                  0.2267552  0.0517110   4.385 1.16e-05 ***
## Sulphates           0.1351741  0.0383336   3.526 0.000421 ***
## Alcohol             0.0342709  0.0095144   3.602 0.000316 ***
## LabelAppeal         0.7046050  0.0451280  15.613  < 2e-16 ***
## AcidIndex           0.4725343  0.0272423  17.346  < 2e-16 ***
## STARS              -3.1492875  0.1155593 -27.253  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Number of iterations in BFGS optimization: 37 
## Log-likelihood: -1.68e+04 on 30 Df
poisson_model_2$coefficients
## $count
##        (Intercept)       FixedAcidity    VolatileAcidity 
##       1.433536e+00       1.839594e-04      -1.668885e-02 
##         CitricAcid      ResidualSugar          Chlorides 
##      -7.852329e-04      -6.917652e-05      -1.992622e-02 
##  FreeSulfurDioxide TotalSulfurDioxide            Density 
##       1.137619e-05      -1.849410e-05      -3.276113e-01 
##                 pH          Sulphates            Alcohol 
##       5.086326e-03       5.984220e-04       7.100222e-03 
##        LabelAppeal          AcidIndex              STARS 
##       2.402199e-01      -1.858544e-02       1.149713e-01 
## 
## $zero
##        (Intercept)       FixedAcidity    VolatileAcidity 
##      -3.0501383371       0.0036235817       0.2054620822 
##         CitricAcid      ResidualSugar          Chlorides 
##      -0.0536164184      -0.0009446201       0.2022918617 
##  FreeSulfurDioxide TotalSulfurDioxide            Density 
##      -0.0006175706      -0.0006514559       1.0260036862 
##                 pH          Sulphates            Alcohol 
##       0.2267551540       0.1351740966       0.0342708587 
##        LabelAppeal          AcidIndex              STARS 
##       0.7046050190       0.4725343007      -3.1492874622

RMSE Calculation

### RMSE of first model - training dataset
model2_rmse_train <- mean(rmse_calc(training_set$TARGET, predict(poisson_model_2, newdata = training_set)))
### RMSE of first model - test dataset
model2_rmse_test <- mean(rmse_calc(test_set$TARGET, predict(poisson_model_2, newdata = test_set)))

model2_aic <- poisson_model_2$aic

Poisson models comparison statistic

It is interesting to compare the zero inflated Poisson model to the original Poisson model. There does appear to be some differences, but even more interesting is the differences in direction (positive and negative) of the independent variables. For instance, the density of the wine is positive in $count, but negative in $zero (any many other variables). It is unclear why this is, but we will certainly keep this in the back of our mind.

Is the zero-inflated model better than the standard Poisson regression model? We will perform the Vuong test between the two models.

# Vuong
vuong(poisson_model_1, poisson_model_2) %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="300px")
## Vuong Non-Nested Hypothesis Test-Statistic: 
## (test-statistic is asymptotically distributed N(0,1) under the
##  null that the models are indistinguishible)
## -------------------------------------------------------------
##               Vuong z-statistic             H_A    p-value
## Raw                   -40.02174 model2 > model1 < 2.22e-16
## AIC-corrected         -39.72126 model2 > model1 < 2.22e-16
## BIC-corrected         -38.63443 model2 > model1 < 2.22e-16

“The Vuong test compares the zero-inflated model with an ordinary Poisson regression model.” The test statistic is significant indicating that the zero-inflated model is superior to the standard Poisson model.

“A Poisson distribution is parameterized by \(\lambda\), which happens to be both its mean and variance. While convenient, it’s not often realistic. A distribution of counts will usually have a variance that’s not equal to its mean. When we see this happen with data that we assume is (or hope) is Poisson distributed, we say we have under- or overdispersion, depending on if the variance is smaller or larger than the mean. Performing Poisson regression on count data that exhibits this behavior results in a model that doesn’t fit well.”

“One approach that addresses this issue is the Negative Binomial Regresion. The negative binomial distribution describes the probabilities of the occurrence of whole numbers greater than or equal to 0. Unlike the Poisson distribution, the variance and the mean are not equivalent. This suggests it might serve as a useful approximation for modeling counts with variability different from its mean. The variance of a negative binomial distribution is a function of its mean and has an additional parameter, k, called the dispersion parameter. Say our count is a random variable Y from a negative binomial distribution, when the variance of Y is:”

\(var(Y)\quad =\quad \mu \quad +\quad { \mu }^{ 2 }lk\)

“As the dispersion parameter gets larger and larger, the variance converges to the same value as the mean, and the negative binomial turns into a Poisson distribution.”

# Mean/Variance
print(paste0("TARGET mean: ", round(mean(training_set$TARGET, 3))))
## [1] "TARGET mean: 3"
print(paste0("TARGET variance: ", round(var(training_set$TARGET),3)))
## [1] "TARGET variance: 3.712"

Model 3: Negative Binomial Regression

So there appears to be a slight overdispersion with the variance greater than the mean. Let’s apply a negative binomial model for model 3.

negative_binomial_3 <- glm.nb(TARGET ~ ., data=training_set)
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached

## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
summary(negative_binomial_3)
## 
## Call:
## glm.nb(formula = TARGET ~ ., data = training_set, init.theta = 48494.48001, 
##     link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.0880  -0.6896   0.1272   0.6344   2.6624  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         1.561e+00  2.177e-01   7.172 7.42e-13 ***
## FixedAcidity       -8.289e-04  9.170e-04  -0.904  0.36607    
## VolatileAcidity    -4.345e-02  7.326e-03  -5.930 3.02e-09 ***
## CitricAcid          1.057e-02  6.565e-03   1.610  0.10734    
## ResidualSugar       1.169e-04  1.693e-04   0.691  0.48984    
## Chlorides          -4.288e-02  1.775e-02  -2.416  0.01570 *  
## FreeSulfurDioxide   1.075e-04  3.829e-05   2.808  0.00498 ** 
## TotalSulfurDioxide  6.965e-05  2.476e-05   2.813  0.00491 ** 
## Density            -3.219e-01  2.140e-01  -1.505  0.13245    
## pH                 -2.380e-02  8.334e-03  -2.855  0.00430 ** 
## Sulphates          -1.332e-02  6.136e-03  -2.171  0.02996 *  
## Alcohol             1.708e-03  1.528e-03   1.118  0.26348    
## LabelAppeal         1.473e-01  6.805e-03  21.642  < 2e-16 ***
## AcidIndex          -9.926e-02  5.048e-03 -19.663  < 2e-16 ***
## STARS               3.365e-01  6.279e-03  53.581  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(48494.48) family taken to be 1)
## 
##     Null deviance: 18290  on 10236  degrees of freedom
## Residual deviance: 12838  on 10222  degrees of freedom
## AIC: 38429
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  48494 
##           Std. Err.:  62679 
## Warning while fitting theta: iteration limit reached 
## 
##  2 x log-likelihood:  -38396.79

RMSE Calculation

### RMSE of first model - training dataset
model3_rmse_train <- mean(rmse_calc(training_set$TARGET, predict(negative_binomial_3, newdata = training_set)))
### RMSE of first model - test dataset
model3_rmse_test <- mean(rmse_calc(test_set$TARGET, predict(negative_binomial_3, newdata = test_set)))

model3_aic <- negative_binomial_3$aic

Model 4: Negative Binomial

In our negative binomial regression model below, we use forward and backward step-wise variables selection algorithm. This model is only slightly better with a lower AIC score.

# negative binomial regression with stepwise variable selection
negative_binomial_4 <- stepAIC(negative_binomial_3, direction = "both", trace = FALSE)
summary(negative_binomial_4)
## 
## Call:
## glm.nb(formula = TARGET ~ VolatileAcidity + CitricAcid + Chlorides + 
##     FreeSulfurDioxide + TotalSulfurDioxide + Density + pH + Sulphates + 
##     LabelAppeal + AcidIndex + STARS, data = training_set, init.theta = 48441.73451, 
##     link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.0644  -0.6898   0.1273   0.6338   2.6621  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         1.581e+00  2.170e-01   7.283 3.28e-13 ***
## VolatileAcidity    -4.351e-02  7.326e-03  -5.939 2.86e-09 ***
## CitricAcid          1.067e-02  6.561e-03   1.627  0.10379    
## Chlorides          -4.355e-02  1.774e-02  -2.455  0.01410 *  
## FreeSulfurDioxide   1.066e-04  3.828e-05   2.785  0.00536 ** 
## TotalSulfurDioxide  6.950e-05  2.475e-05   2.809  0.00498 ** 
## Density            -3.228e-01  2.140e-01  -1.509  0.13132    
## pH                 -2.381e-02  8.332e-03  -2.858  0.00426 ** 
## Sulphates          -1.348e-02  6.133e-03  -2.198  0.02794 *  
## LabelAppeal         1.472e-01  6.804e-03  21.630  < 2e-16 ***
## AcidIndex          -1.001e-01  4.992e-03 -20.053  < 2e-16 ***
## STARS               3.371e-01  6.256e-03  53.880  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(48441.73) family taken to be 1)
## 
##     Null deviance: 18290  on 10236  degrees of freedom
## Residual deviance: 12841  on 10225  degrees of freedom
## AIC: 38425
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  48442 
##           Std. Err.:  62616 
## Warning while fitting theta: iteration limit reached 
## 
##  2 x log-likelihood:  -38399.32

Not surprisingly, this negative binomial model performed slightly worse than the previous negative binomial model. I suspect that the transformation had lost some key information. (Again, notable is that the beta values of the independent variables are similar to the beta values of the prior negative binomial model.)

RMSE Calculation

### RMSE of first model - training dataset
model4_rmse_train <- mean(rmse_calc(training_set$TARGET, predict(negative_binomial_4, newdata = training_set)))
### RMSE of first model - test dataset
model4_rmse_test <- mean(rmse_calc(test_set$TARGET, predict(negative_binomial_4, newdata = test_set)))

model4_aic <- negative_binomial_4$aic

Model 5: Multiple Linear Regression

In our multiple linear regression model below, r-squared is 0.4594, which means this model explains 45.94% of the data’s variation. As seen with previous models, FixedAcidity and ResidualSugar seem to have have no impact in this model. So far none of the model adequately explains the dataset.

model5 <- lm(TARGET ~ ., data=training_set)
summary(model5)
## 
## Call:
## lm(formula = TARGET ~ ., data = training_set)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5443 -1.0241  0.1758  1.0316  4.4430 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         3.887e+00  5.367e-01   7.243 4.70e-13 ***
## FixedAcidity       -1.781e-03  2.258e-03  -0.789  0.43022    
## VolatileAcidity    -1.272e-01  1.800e-02  -7.068 1.68e-12 ***
## CitricAcid          2.572e-02  1.625e-02   1.583  0.11344    
## ResidualSugar       3.070e-04  4.183e-04   0.734  0.46296    
## Chlorides          -1.426e-01  4.374e-02  -3.259  0.00112 ** 
## FreeSulfurDioxide   2.798e-04  9.458e-05   2.959  0.00310 ** 
## TotalSulfurDioxide  1.795e-04  6.073e-05   2.955  0.00313 ** 
## Density            -9.352e-01  5.283e-01  -1.770  0.07676 .  
## pH                 -5.729e-02  2.055e-02  -2.788  0.00532 ** 
## Sulphates          -3.619e-02  1.511e-02  -2.396  0.01661 *  
## Alcohol             8.322e-03  3.764e-03   2.211  0.02706 *  
## LabelAppeal         4.509e-01  1.654e-02  27.268  < 2e-16 ***
## AcidIndex          -2.495e-01  1.099e-02 -22.701  < 2e-16 ***
## STARS               1.155e+00  1.676e-02  68.927  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.42 on 10222 degrees of freedom
## Multiple R-squared:  0.4578, Adjusted R-squared:  0.4571 
## F-statistic: 616.6 on 14 and 10222 DF,  p-value: < 2.2e-16

Model Diagnostic Plots

resid_panel(model5, plots='default', smoother = TRUE)

RMSE Calculation

### RMSE of first model - training dataset
model5_rmse_train <- mean(rmse_calc(training_set$TARGET, predict(model5, newdata = training_set)))
### RMSE of first model - test dataset
model5_rmse_test <- mean(rmse_calc(test_set$TARGET, predict(model5, newdata = test_set)))

model5_aic <- model5$aic

Model 6: Multiple Linear Regression

In our last model using multiple linear regression with forward and backward step-wise variables selection algorithm, we see a similar output as model 5. R-squared is 0.4594, which means this model explains 45.94% of the data’s variation.

# multiple linear regression with stepwise variable selection
model6 <- stepAIC(model5, direction = "both", trace = FALSE)
summary(model6)
## 
## Call:
## lm(formula = TARGET ~ VolatileAcidity + CitricAcid + Chlorides + 
##     FreeSulfurDioxide + TotalSulfurDioxide + Density + pH + Sulphates + 
##     Alcohol + LabelAppeal + AcidIndex + STARS, data = training_set)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.527 -1.021  0.177  1.033  4.440 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         3.886e+00  5.367e-01   7.241 4.78e-13 ***
## VolatileAcidity    -1.273e-01  1.800e-02  -7.074 1.60e-12 ***
## CitricAcid          2.548e-02  1.624e-02   1.569  0.11673    
## Chlorides          -1.430e-01  4.374e-02  -3.270  0.00108 ** 
## FreeSulfurDioxide   2.800e-04  9.456e-05   2.961  0.00308 ** 
## TotalSulfurDioxide  1.810e-04  6.071e-05   2.981  0.00288 ** 
## Density            -9.334e-01  5.283e-01  -1.767  0.07732 .  
## pH                 -5.715e-02  2.055e-02  -2.781  0.00542 ** 
## Sulphates          -3.660e-02  1.510e-02  -2.424  0.01538 *  
## Alcohol             8.283e-03  3.762e-03   2.202  0.02771 *  
## LabelAppeal         4.510e-01  1.654e-02  27.271  < 2e-16 ***
## AcidIndex          -2.510e-01  1.082e-02 -23.191  < 2e-16 ***
## STARS               1.155e+00  1.675e-02  68.961  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.42 on 10224 degrees of freedom
## Multiple R-squared:  0.4578, Adjusted R-squared:  0.4571 
## F-statistic: 719.3 on 12 and 10224 DF,  p-value: < 2.2e-16

Model Diagnostic Plots

resid_panel(model6, plots='default', smoother = TRUE)

RMSE Calculation

### RMSE of first model - training dataset
model6_rmse_train <- mean(rmse_calc(training_set$TARGET, predict(model6, newdata = training_set)))
### RMSE of first model - test dataset
model6_rmse_test <- mean(rmse_calc(test_set$TARGET, predict(model6, newdata = test_set)))

model6_aic <- model6$aic

PART IV: Selecting Models

Compare Key Regression Model statistics

The table below summarizes the model statistics for all 3 of our Regression models. The models are listed from left to right in accordance with the order in which they were described in Part III.

# metrics
Train_RMSE <- list(model1_rmse_train, model2_rmse_train, model3_rmse_train, model4_rmse_train, model5_rmse_train, model6_rmse_train)
Test_RMSE <- list(model1_rmse_test, model2_rmse_test, model3_rmse_test, model4_rmse_test, model5_rmse_test, model6_rmse_test)

AIC <- list(model1_aic, model2_aic, model3_aic, model4_aic, model5_aic, model6_aic)

kable(rbind(Train_RMSE, Test_RMSE, AIC), col.names = c("Model 1", "Model 2", "Model 3", "Model 4", "Model 5", "Model 6"))  %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%  scroll_box(width="100%",height="300px")
Model 1 Model 2 Model 3 Model 4 Model 5 Model 6
Train_RMSE 2.61749084465551 1.35432902629987 2.61743685476711 2.61749006996924 1.41854806293486 1.4186294587951
Test_RMSE 2.6020239659855 1.35247459709401 2.60206289256156 2.60202304945068 1.40469654988496 1.40437495721672
AIC 38423.1462242275 NULL 38428.7871733393 38425.3235609569 NULL NULL

To make prediction, we will select one of our count regression model. The criteria for our selection for the best count regression model will be the AIC score and mean squared error of the model. Based on the table above, model 1 is our best model.

Model Prediction

lm_predicted <- round(predict(poisson_model_1, newdata = eval_df),0)

lm_predicted_df <- as.data.frame(cbind(eval_df$IN, lm_predicted))

colnames(lm_predicted_df) <- c('INDEX','TARGET_WINS')

lm_predicted_df %>% kable() %>%
    kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%  scroll_box(width="100%",height="300px")
INDEX TARGET_WINS
3 1
9 1
10 1
18 1
21 1
30 2
31 1
37 2
39 0
47 1
60 1
62 1
63 1
64 1
68 1
75 1
76 1
83 1
87 1
92 2
98 1
106 1
107 1
113 1
120 1
123 2
125 1
126 2
128 2
129 1
131 1
135 1
141 2
147 1
148 1
151 1
156 1
157 1
174 1
186 1
193 1
195 1
212 1
213 1
217 1
223 1
226 1
228 1
230 1
241 1
243 1
249 1
281 1
288 0
294 0
295 1
300 2
302 1
303 2
308 1
319 1
320 1
324 1
331 1
343 1
347 1
348 1
350 1
357 0
358 1
360 2
366 1
367 1
368 2
376 0
380 1
388 0
396 1
398 2
403 1
410 0
412 2
420 1
434 1
440 1
450 1
453 1
464 2
465 1
466 2
473 1
476 2
478 1
479 1
493 1
497 1
503 1
504 1
505 1
507 1
513 1
519 1
521 1
522 1
545 1
549 1
551 2
556 2
557 2
559 1
560 1
566 1
569 1
573 1
578 1
579 2
582 1
596 1
598 0
599 1
602 1
605 1
617 1
619 2
630 1
634 1
643 1
645 1
647 1
649 1
656 1
657 1
658 1
667 1
692 1
693 1
698 1
699 1
700 2
704 1
707 1
708 1
709 1
713 0
714 1
716 0
718 1
722 1
729 2
731 1
733 1
746 1
747 1
748 1
753 1
757 1
763 1
767 2
774 1
776 1
788 1
794 1
799 1
803 1
806 1
807 1
811 2
816 2
818 1
819 1
831 2
835 2
837 1
841 1
846 1
856 2
861 1
862 2
863 1
865 1
871 1
879 1
880 1
881 1
885 1
887 1
892 1
898 1
900 0
904 1
906 2
910 1
912 1
913 1
919 2
924 1
925 1
930 1
940 1
941 1
946 1
949 2
951 1
962 1
966 1
967 2
971 1
981 1
982 1
983 1
984 1
989 1
990 1
992 1
995 2
996 1
998 1
1001 2
1007 1
1008 1
1016 1
1022 1
1027 2
1032 1
1033 1
1041 1
1065 1
1074 1
1075 1
1081 1
1094 1
1099 1
1105 1
1123 2
1135 1
1142 1
1155 1
1169 1
1176 1
1178 1
1180 1
1184 1
1185 1
1193 1
1196 1
1199 1
1203 1
1205 1
1207 1
1208 1
1212 0
1213 1
1222 1
1223 1
1226 1
1227 2
1229 1
1230 2
1231 1
1241 0
1243 1
1244 2
1246 2
1248 1
1249 1
1252 1
1261 1
1275 1
1281 1
1285 1
1288 1
1290 1
1291 1
1304 1
1305 1
1323 1
1342 1
1348 1
1353 1
1363 1
1371 1
1372 1
1378 1
1381 1
1382 1
1393 1
1394 2
1398 2
1404 1
1405 1
1419 1
1421 1
1426 1
1431 1
1435 1
1437 1
1438 2
1442 0
1464 1
1471 1
1473 1
1476 1
1478 0
1479 1
1487 2
1492 1
1496 1
1497 1
1515 1
1519 1
1522 1
1526 1
1537 1
1538 2
1540 0
1543 1
1548 1
1549 1
1556 1
1564 1
1570 1
1577 1
1585 2
1590 1
1592 1
1594 1
1596 2
1598 2
1603 2
1607 1
1612 2
1627 1
1629 1
1630 1
1640 2
1641 1
1646 1
1662 1
1668 1
1671 1
1672 2
1673 2
1686 1
1688 1
1696 1
1701 2
1707 1
1708 1
1713 1
1715 1
1717 1
1721 1
1724 1
1725 1
1730 1
1731 1
1734 1
1740 1
1748 1
1749 1
1750 2
1763 1
1768 2
1773 1
1777 2
1778 1
1780 1
1782 1
1784 1
1786 1
1787 1
1792 1
1800 1
1801 1
1803 1
1804 1
1807 1
1818 2
1821 1
1822 2
1828 1
1833 1
1844 1
1847 1
1850 1
1854 1
1858 1
1864 1
1867 0
1876 1
1880 1
1881 0
1891 1
1894 1
1895 1
1901 2
1905 1
1912 2
1918 1
1921 1
1923 1
1924 2
1931 1
1941 2
1950 1
1951 2
1954 1
1961 1
1966 1
1979 1
1982 1
1987 1
1997 1
2004 1
2011 2
2015 1
2025 2
2033 1
2034 1
2035 1
2036 1
2053 1
2059 1
2060 1
2073 1
2084 1
2089 1
2092 1
2109 2
2129 1
2134 2
2135 2
2148 1
2149 1
2150 1
2165 1
2166 1
2168 2
2170 0
2171 1
2172 1
2176 2
2182 1
2189 0
2191 1
2197 1
2202 1
2203 1
2204 0
2206 1
2218 0
2219 1
2221 1
2226 1
2228 1
2232 1
2236 0
2241 1
2245 2
2251 1
2255 2
2256 1
2259 1
2263 1
2264 1
2267 1
2273 0
2277 1
2287 1
2289 1
2291 1
2296 0
2299 1
2306 1
2314 1
2317 1
2318 1
2321 1
2324 1
2340 1
2343 1
2349 1
2352 2
2353 1
2365 1
2370 1
2378 1
2390 0
2399 1
2402 1
2403 1
2404 2
2414 2
2422 1
2424 1
2430 1
2435 1
2439 2
2442 1
2445 1
2449 2
2451 1
2461 1
2464 1
2465 1
2472 1
2476 1
2482 1
2487 2
2498 1
2501 1
2504 1
2511 1
2518 2
2521 2
2530 1
2543 1
2545 1
2561 2
2566 1
2572 1
2577 1
2578 2
2580 1
2581 2
2582 1
2584 1
2590 2
2598 1
2602 2
2605 2
2616 1
2618 1
2619 1
2624 2
2632 1
2640 1
2646 2
2651 1
2660 1
2661 1
2668 1
2670 2
2680 1
2681 1
2689 1
2694 1
2695 1
2696 1
2702 1
2704 1
2708 2
2709 1
2714 1
2716 0
2723 1
2725 1
2738 1
2750 1
2756 1
2758 1
2766 1
2767 1
2771 1
2775 1
2776 1
2779 1
2780 1
2781 1
2782 2
2783 1
2796 1
2798 1
2800 1
2803 2
2806 2
2813 1
2818 1
2821 2
2825 2
2829 0
2830 1
2833 1
2839 1
2843 2
2846 1
2847 1
2848 2
2856 1
2863 1
2867 2
2869 1
2873 2
2874 1
2875 1
2880 1
2886 1
2887 1
2888 1
2889 1
2890 2
2892 1
2901 1
2902 0
2905 0
2917 0
2922 1
2924 1
2930 1
2931 1
2946 1
2955 1
2962 1
2964 1
2965 1
2967 1
2970 2
2973 2
2974 1
2976 1
2977 1
2978 1
2986 2
2988 2
2989 2
2995 1
3005 1
3011 1
3013 1
3019 1
3021 1
3022 2
3029 1
3037 1
3042 1
3043 1
3049 1
3050 2
3053 1
3058 1
3062 2
3063 1
3065 1
3080 1
3088 1
3093 1
3096 2
3101 2
3103 1
3107 2
3109 2
3111 2
3113 1
3116 1
3132 1
3141 2
3153 1
3154 1
3160 1
3167 1
3170 1
3173 1
3174 1
3177 2
3179 1
3184 1
3190 1
3193 2
3199 1
3201 1
3202 1
3203 1
3206 1
3209 1
3210 1
3217 2
3220 1
3228 1
3232 1
3239 1
3243 1
3245 1
3246 1
3251 2
3253 2
3257 1
3260 1
3261 1
3263 2
3278 1
3281 1
3283 2
3290 1
3297 1
3304 1
3305 1
3307 1
3308 1
3313 1
3314 1
3317 1
3348 1
3350 0
3359 1
3367 1
3376 1
3378 1
3384 1
3386 1
3387 1
3388 1
3390 1
3391 1
3396 0
3398 1
3404 2
3406 2
3407 1
3414 1
3419 1
3423 1
3427 1
3432 1
3434 1
3438 1
3442 0
3443 1
3448 0
3456 1
3464 2
3470 1
3475 1
3477 1
3490 1
3493 1
3502 1
3508 2
3516 1
3517 1
3525 1
3532 1
3535 1
3536 1
3540 1
3547 1
3550 1
3557 1
3562 1
3563 1
3564 1
3570 1
3573 1
3577 1
3579 1
3581 1
3587 1
3602 1
3609 1
3612 1
3621 1
3642 1
3647 1
3649 1
3654 0
3660 1
3665 1
3669 1
3673 1
3675 1
3678 1
3680 1
3686 2
3693 1
3710 1
3713 1
3718 2
3725 1
3726 1
3747 1
3753 1
3754 2
3760 2
3763 1
3765 1
3769 2
3771 1
3784 1
3787 1
3794 1
3796 1
3798 1
3809 1
3812 2
3819 1
3828 1
3831 2
3833 1
3837 2
3839 1
3843 1
3846 1
3854 2
3861 1
3864 1
3868 1
3869 2
3870 1
3883 1
3886 1
3889 1
3894 0
3907 1
3910 1
3913 1
3914 1
3921 2
3923 1
3929 1
3931 1
3932 2
3937 1
3943 1
3956 1
3957 1
3961 2
3971 1
4004 0
4005 1
4006 1
4011 1
4013 2
4014 2
4016 0
4017 2
4020 1
4022 1
4026 1
4032 1
4043 1
4045 1
4048 2
4051 2
4052 1
4056 1
4059 0
4069 2
4074 1
4076 1
4077 1
4079 0
4081 1
4088 2
4105 1
4125 1
4134 1
4139 0
4146 1
4149 2
4151 0
4155 1
4157 1
4168 2
4170 1
4174 1
4179 2
4185 1
4199 1
4205 1
4208 1
4211 1
4212 0
4215 1
4217 1
4219 1
4226 1
4227 1
4229 1
4231 0
4233 1
4237 1
4243 1
4248 2
4255 2
4262 1
4266 1
4268 2
4270 1
4273 2
4276 1
4277 1
4279 0
4299 1
4313 2
4322 1
4324 1
4328 1
4331 1
4335 0
4337 1
4338 1
4343 0
4347 0
4355 1
4357 1
4359 2
4362 1
4368 1
4374 2
4375 2
4378 1
4381 1
4387 1
4400 0
4423 1
4424 1
4428 1
4433 2
4436 1
4437 1
4439 2
4449 1
4456 1
4463 2
4467 1
4468 1
4469 1
4472 1
4473 2
4476 1
4500 1
4509 1
4513 1
4521 1
4527 1
4530 0
4532 1
4533 1
4535 1
4536 1
4542 1
4551 1
4554 1
4555 0
4564 1
4572 1
4573 2
4577 1
4579 2
4583 1
4584 2
4596 1
4599 1
4607 1
4609 0
4610 1
4616 1
4617 1
4633 1
4638 1
4641 1
4653 2
4655 1
4659 1
4669 1
4678 1
4685 1
4686 1
4691 1
4695 1
4698 1
4700 2
4711 1
4722 1
4727 1
4756 2
4762 1
4763 1
4766 2
4770 1
4784 1
4791 1
4795 2
4799 1
4802 1
4805 1
4814 1
4816 1
4817 1
4822 1
4827 1
4833 2
4836 1
4842 1
4844 1
4845 1
4849 1
4850 1
4860 1
4863 1
4871 1
4878 1
4881 1
4888 1
4900 2
4906 1
4909 1
4916 1
4918 2
4926 1
4928 1
4941 1
4946 2
4949 1
4956 1
4966 1
4969 1
4973 1
4978 2
4982 1
4985 1
4991 1
4998 1
5000 1
5004 1
5005 1
5011 1
5016 0
5018 1
5034 1
5038 1
5042 2
5046 1
5051 1
5054 1
5057 1
5062 1
5063 1
5065 1
5066 1
5076 1
5089 1
5092 1
5093 1
5094 1
5098 1
5102 1
5112 2
5117 1
5127 1
5130 1
5131 0
5132 1
5135 1
5136 1
5147 1
5157 1
5160 1
5165 1
5166 1
5172 1
5173 1
5179 1
5184 2
5187 1
5191 1
5193 1
5194 1
5199 1
5212 1
5213 1
5224 1
5226 1
5239 1
5252 1
5264 1
5266 1
5271 2
5273 1
5276 1
5278 2
5281 0
5283 1
5291 2
5294 2
5296 1
5297 1
5313 1
5314 1
5321 1
5325 1
5326 1
5328 1
5334 1
5338 1
5344 0
5348 1
5352 1
5353 1
5354 1
5361 1
5364 1
5365 1
5367 1
5379 1
5382 1
5386 2
5395 1
5410 1
5411 1
5416 1
5424 1
5426 1
5428 1
5430 2
5433 1
5437 1
5440 1
5442 2
5445 1
5449 1
5452 1
5460 1
5461 0
5465 1
5467 1
5471 1
5474 1
5475 1
5480 1
5481 1
5484 1
5494 2
5495 1
5497 1
5499 1
5507 1
5510 1
5515 1
5516 1
5517 1
5524 1
5530 2
5534 1
5543 1
5545 1
5558 1
5562 1
5573 2
5581 1
5583 2
5587 1
5589 0
5591 2
5596 1
5606 1
5608 1
5611 1
5612 1
5614 1
5620 1
5623 2
5624 1
5626 2
5633 1
5635 1
5640 1
5643 1
5644 2
5653 2
5663 1
5664 2
5667 1
5671 1
5673 1
5676 1
5678 0
5698 1
5700 2
5705 1
5706 1
5711 2
5712 2
5716 1
5719 1
5725 1
5728 2
5734 0
5735 1
5743 1
5754 1
5755 1
5756 1
5766 1
5770 1
5774 1
5775 1
5776 1
5778 2
5786 1
5787 1
5791 2
5794 1
5803 1
5804 1
5808 1
5810 1
5813 1
5828 1
5839 2
5842 1
5843 0
5844 1
5847 1
5851 1
5854 1
5857 1
5866 1
5874 1
5886 1
5895 0
5897 1
5898 1
5900 2
5902 1
5908 1
5909 1
5912 2
5913 1
5917 2
5918 1
5921 1
5931 1
5942 1
5943 1
5950 1
5954 1
5983 1
5995 1
6002 1
6005 1
6009 2
6011 1
6012 2
6019 1
6021 1
6029 2
6036 1
6037 1
6038 0
6043 0
6045 1
6047 1
6048 1
6061 1
6063 1
6064 2
6068 2
6069 1
6070 1
6071 2
6074 1
6079 1
6082 1
6088 1
6094 1
6095 1
6098 1
6102 1
6105 1
6113 1
6116 1
6120 1
6121 1
6126 1
6144 1
6145 1
6153 1
6156 1
6159 1
6162 1
6184 1
6188 1
6189 1
6191 0
6211 1
6216 1
6218 1
6222 1
6235 1
6245 1
6248 1
6253 1
6256 1
6257 1
6259 0
6266 1
6268 1
6275 1
6280 1
6283 1
6288 2
6289 1
6301 1
6308 1
6314 1
6315 0
6316 1
6317 1
6318 1
6323 1
6329 2
6336 1
6341 1
6348 1
6349 1
6365 1
6372 1
6376 1
6378 1
6379 1
6382 1
6383 1
6389 1
6390 0
6392 2
6394 1
6402 1
6404 1
6405 1
6406 1
6409 2
6410 2
6411 1
6421 1
6428 2
6429 1
6432 1
6436 1
6437 1
6438 2
6445 1
6447 1
6450 1
6462 1
6467 1
6478 1
6484 1
6492 1
6497 2
6504 1
6505 1
6513 1
6525 2
6526 1
6528 1
6540 1
6542 1
6544 2
6548 1
6552 1
6558 1
6567 1
6569 2
6572 2
6577 2
6581 1
6588 1
6591 1
6594 1
6600 2
6602 1
6604 1
6605 1
6614 1
6616 1
6621 1
6640 2
6641 1
6643 1
6644 1
6649 1
6650 1
6655 2
6661 0
6672 2
6677 1
6688 1
6689 2
6691 1
6692 1
6694 1
6702 1
6714 1
6716 2
6724 1
6725 1
6730 1
6735 1
6738 1
6739 1
6743 1
6747 1
6750 2
6751 1
6753 1
6754 1
6755 1
6762 1
6764 1
6772 2
6774 1
6787 1
6789 1
6793 1
6798 2
6799 1
6800 1
6802 1
6808 1
6809 1
6812 1
6814 1
6816 1
6822 1
6829 1
6834 1
6836 2
6839 1
6840 1
6843 1
6846 0
6848 1
6852 1
6856 2
6860 1
6866 2
6870 1
6878 2
6880 2
6885 1
6897 2
6902 1
6904 1
6907 1
6909 1
6914 1
6915 1
6922 1
6924 1
6933 1
6934 1
6941 1
6957 1
6960 1
6969 0
6975 1
6980 1
6983 1
6987 1
6994 1
6997 1
7002 1
7010 1
7015 1
7019 1
7022 1
7025 1
7029 1
7031 1
7037 1
7038 1
7043 1
7049 1
7052 1
7053 2
7056 1
7057 2
7080 1
7086 1
7087 1
7105 1
7108 1
7121 0
7122 1
7125 1
7132 1
7134 1
7151 1
7152 1
7157 1
7159 1
7166 2
7167 1
7177 1
7179 1
7181 2
7183 1
7186 1
7193 2
7205 1
7207 1
7209 1
7216 1
7232 0
7235 1
7238 1
7240 1
7243 1
7252 1
7269 1
7275 1
7281 0
7283 1
7287 2
7289 2
7291 1
7294 2
7304 2
7308 1
7313 1
7319 1
7325 1
7326 1
7330 2
7332 1
7337 1
7341 1
7346 1
7353 0
7354 1
7361 1
7366 1
7368 2
7372 1
7375 1
7377 2
7380 2
7382 2
7385 1
7392 1
7395 1
7397 1
7403 1
7406 1
7409 1
7410 1
7412 1
7419 1
7425 1
7435 2
7438 1
7440 1
7447 1
7449 1
7456 1
7464 1
7478 1
7480 1
7481 2
7483 1
7484 1
7491 1
7494 1
7501 1
7503 1
7509 1
7517 1
7518 2
7519 0
7521 2
7522 1
7536 1
7539 1
7547 2
7549 1
7552 1
7554 1
7556 1
7564 1
7566 1
7570 1
7571 1
7572 1
7575 0
7586 1
7589 2
7590 1
7597 1
7602 1
7604 1
7605 1
7612 0
7615 1
7617 1
7624 1
7632 1
7639 1
7642 1
7643 1
7649 1
7650 1
7653 1
7654 1
7657 2
7662 1
7669 1
7671 1
7675 1
7678 1
7682 1
7688 1
7689 1
7690 2
7692 1
7699 2
7705 1
7712 1
7726 1
7728 0
7735 1
7737 1
7739 1
7743 2
7744 1
7746 1
7749 1
7750 1
7752 1
7755 2
7756 1
7762 2
7764 1
7769 1
7770 1
7776 2
7778 2
7784 1
7786 1
7789 1
7793 1
7794 1
7804 1
7811 1
7813 1
7815 1
7817 1
7818 1
7821 2
7825 1
7830 2
7832 1
7835 1
7839 2
7842 2
7849 1
7856 1
7857 1
7863 1
7866 2
7871 1
7875 1
7882 0
7887 0
7888 1
7891 0
7895 1
7901 1
7906 1
7908 2
7917 0
7924 1
7948 1
7950 2
7955 1
7957 1
7959 1
7967 2
7969 1
7971 1
7974 1
7976 1
7986 2
7987 0
7993 1
7996 1
7998 1
8018 1
8019 1
8027 1
8036 1
8040 1
8044 1
8050 1
8052 1
8054 1
8057 1
8058 2
8059 1
8066 1
8070 2
8072 2
8078 1
8079 1
8080 1
8081 1
8088 1
8091 1
8094 1
8095 1
8099 2
8101 2
8102 2
8116 2
8125 2
8134 1
8139 1
8141 1
8147 1
8158 2
8160 1
8165 1
8187 1
8205 1
8209 1
8211 1
8232 1
8236 1
8237 2
8238 2
8245 1
8256 1
8268 1
8269 1
8270 1
8286 1
8289 1
8301 1
8305 1
8310 1
8312 0
8318 2
8321 1
8328 1
8331 1
8334 1
8344 1
8345 0
8352 2
8358 1
8359 0
8360 1
8365 1
8366 1
8369 1
8373 2
8378 1
8392 1
8397 1
8399 1
8400 1
8405 1
8406 1
8410 1
8413 1
8414 1
8416 2
8426 1
8434 1
8439 2
8440 1
8475 2
8480 2
8497 0
8499 1
8500 1
8501 1
8502 1
8518 1
8520 1
8523 1
8525 0
8532 1
8535 1
8543 1
8554 1
8560 2
8561 2
8563 1
8566 1
8570 1
8572 1
8582 0
8583 1
8587 1
8592 0
8593 1
8607 1
8609 1
8610 1
8614 1
8616 2
8622 2
8623 1
8624 1
8633 2
8641 2
8644 2
8649 1
8653 1
8657 2
8658 1
8663 1
8672 0
8680 1
8684 1
8687 2
8688 1
8690 1
8712 0
8717 1
8730 2
8739 1
8744 1
8747 2
8748 1
8751 1
8758 1
8761 1
8763 2
8764 1
8765 2
8773 2
8780 1
8781 1
8782 1
8785 0
8786 1
8797 1
8799 1
8807 1
8816 1
8817 1
8826 2
8833 1
8834 1
8835 1
8840 2
8843 1
8849 1
8855 1
8861 1
8862 1
8865 1
8868 2
8870 1
8880 1
8885 1
8894 1
8895 2
8899 2
8912 1
8922 1
8924 1
8928 1
8932 1
8943 2
8945 2
8946 1
8954 0
8958 1
8960 1
8965 0
8966 1
8967 1
8969 1
8980 1
8984 1
8985 1
8988 1
8989 1
8995 1
9004 1
9010 0
9012 1
9018 1
9036 0
9037 0
9040 1
9041 2
9044 2
9045 1
9047 1
9049 1
9061 2
9062 1
9076 1
9079 1
9081 1
9082 1
9089 1
9092 1
9094 2
9115 1
9117 1
9118 1
9120 1
9124 1
9128 1
9135 1
9136 1
9138 1
9157 1
9176 2
9183 2
9187 1
9188 1
9190 1
9197 1
9200 1
9201 1
9203 1
9212 1
9213 1
9214 2
9217 1
9219 1
9220 2
9221 2
9237 1
9240 1
9241 1
9248 1
9253 2
9259 1
9267 1
9271 1
9273 0
9285 2
9290 1
9291 1
9293 1
9294 1
9301 1
9302 1
9312 1
9316 1
9319 1
9328 1
9331 1
9338 0
9350 1
9356 1
9359 0
9362 1
9364 1
9370 1
9380 1
9386 1
9394 1
9407 1
9411 1
9422 1
9423 1
9429 1
9433 1
9439 1
9451 1
9452 1
9453 1
9460 2
9465 1
9470 1
9476 1
9485 1
9486 1
9488 1
9507 2
9508 1
9517 2
9521 2
9528 1
9532 1
9536 1
9540 2
9542 1
9546 1
9548 1
9549 2
9554 2
9555 1
9558 1
9573 2
9575 2
9584 1
9586 1
9588 2
9591 1
9592 2
9597 2
9600 1
9603 1
9605 1
9614 1
9616 1
9622 1
9624 1
9629 1
9633 1
9640 1
9644 0
9645 1
9646 1
9648 1
9649 0
9660 1
9664 1
9675 1
9679 1
9680 1
9682 1
9697 1
9701 1
9704 1
9705 2
9707 1
9714 1
9718 1
9722 2
9739 1
9747 2
9751 1
9757 0
9759 1
9760 1
9764 1
9776 1
9778 1
9786 1
9803 1
9804 1
9815 2
9824 0
9825 0
9826 1
9827 1
9833 1
9835 0
9860 1
9865 1
9871 1
9874 1
9880 1
9882 1
9885 1
9888 1
9892 1
9893 2
9896 1
9902 1
9906 1
9910 1
9914 2
9918 1
9920 1
9926 1
9931 2
9935 1
9945 1
9953 2
9957 1
9963 1
9972 1
9976 2
9979 1
9980 1
9982 1
9991 1
10000 1
10003 1
10005 1
10014 1
10032 1
10034 1
10041 1
10042 1
10044 2
10045 0
10054 1
10061 2
10062 1
10073 0
10081 2
10084 2
10086 1
10093 1
10101 2
10105 1
10110 1
10113 1
10115 1
10119 1
10121 1
10124 1
10126 2
10127 1
10145 1
10147 1
10148 1
10162 0
10163 1
10166 1
10172 1
10173 1
10175 0
10180 1
10186 1
10192 1
10199 1
10209 1
10210 2
10214 1
10215 1
10216 2
10232 1
10239 1
10249 1
10253 2
10255 1
10262 1
10264 1
10266 1
10268 1
10271 1
10272 2
10276 1
10277 1
10279 1
10281 0
10285 1
10294 0
10300 1
10304 1
10307 1
10309 2
10310 1
10312 1
10321 1
10332 1
10336 1
10368 1
10369 1
10375 1
10376 1
10379 1
10380 1
10383 0
10385 2
10387 1
10397 1
10412 1
10413 2
10418 1
10420 1
10426 1
10427 1
10428 1
10430 1
10435 2
10436 0
10446 2
10448 1
10449 1
10463 1
10469 1
10470 2
10471 1
10473 2
10476 1
10482 1
10500 1
10511 2
10512 1
10514 1
10515 2
10526 1
10546 1
10549 1
10553 1
10558 1
10575 1
10581 1
10583 0
10584 1
10585 1
10610 1
10611 1
10616 1
10618 1
10628 0
10632 1
10642 1
10648 1
10649 1
10650 1
10654 0
10656 2
10661 2
10663 2
10672 1
10678 2
10685 1
10690 1
10702 1
10706 1
10708 1
10716 1
10717 2
10720 2
10729 1
10730 2
10745 2
10753 1
10754 1
10762 1
10766 1
10776 1
10783 1
10789 1
10790 2
10797 1
10807 1
10810 1
10817 1
10820 1
10822 1
10828 1
10829 1
10830 1
10831 2
10841 2
10847 1
10856 0
10860 1
10861 2
10863 1
10875 1
10884 1
10895 1
10897 1
10898 1
10903 1
10908 2
10924 1
10926 1
10927 1
10928 1
10933 0
10939 2
10942 1
10945 1
10949 1
10950 1
10958 2
10963 1
10967 1
10971 1
10972 1
10974 1
10976 2
10980 1
10991 1
10995 2
11014 2
11017 1
11019 1
11022 1
11030 2
11031 1
11041 2
11042 1
11044 1
11047 1
11048 1
11049 1
11052 1
11058 1
11069 1
11070 1
11073 2
11074 1
11078 2
11079 1
11085 1
11088 1
11106 1
11110 2
11114 1
11118 1
11129 1
11130 1
11131 1
11133 1
11138 1
11143 1
11146 2
11153 1
11162 1
11170 2
11171 1
11201 1
11216 1
11219 1
11222 1
11234 1
11238 1
11244 1
11246 1
11248 1
11250 1
11256 1
11259 0
11263 1
11264 1
11270 1
11274 1
11281 1
11285 1
11300 0
11305 1
11317 1
11319 1
11330 2
11334 1
11335 2
11336 1
11356 1
11358 1
11360 1
11364 1
11373 2
11379 1
11382 1
11383 1
11385 1
11387 1
11391 1
11397 1
11404 0
11405 1
11409 1
11419 1
11430 2
11434 2
11436 1
11440 1
11443 1
11449 1
11452 1
11453 1
11456 1
11457 1
11459 1
11471 0
11476 1
11479 1
11481 1
11485 1
11486 1
11487 1
11488 0
11498 2
11506 1
11511 2
11515 1
11518 1
11521 0
11523 1
11524 2
11525 1
11528 1
11530 1
11531 1
11533 1
11535 1
11537 1
11538 1
11541 1
11548 1
11552 1
11558 1
11560 1
11566 1
11572 1
11573 2
11582 1
11586 1
11590 1
11591 1
11601 1
11611 1
11617 1
11619 1
11624 1
11626 2
11644 0
11652 1
11656 1
11658 1
11659 2
11663 1
11665 1
11683 1
11685 1
11691 1
11694 1
11698 1
11700 1
11703 1
11705 1
11710 1
11711 1
11714 1
11731 1
11732 1
11742 1
11744 1
11745 1
11749 2
11756 1
11761 0
11762 2
11766 2
11767 2
11769 1
11770 1
11771 1
11777 1
11778 2
11779 1
11788 1
11790 1
11794 1
11801 1
11807 1
11812 1
11817 2
11818 1
11825 1
11828 1
11833 2
11837 2
11838 0
11842 2
11853 1
11857 1
11858 0
11860 2
11867 1
11868 2
11871 1
11875 1
11881 2
11890 1
11892 2
11894 1
11896 1
11903 1
11905 1
11907 1
11909 2
11911 1
11915 1
11918 1
11920 2
11923 1
11924 1
11926 0
11931 1
11933 1
11940 1
11951 1
11953 1
11973 1
11984 1
11985 1
11991 1
12002 1
12006 1
12008 2
12013 1
12015 1
12016 1
12023 1
12029 2
12036 1
12038 1
12041 2
12049 1
12050 1
12054 1
12060 1
12062 2
12065 1
12079 1
12083 2
12090 1
12091 1
12094 1
12099 1
12101 1
12110 1
12116 2
12122 1
12127 2
12133 1
12142 2
12147 1
12156 1
12157 2
12158 2
12161 1
12163 1
12166 1
12170 1
12174 2
12183 1
12188 1
12189 2
12192 1
12201 1
12204 1
12207 1
12208 0
12209 1
12210 2
12217 1
12227 1
12231 1
12232 1
12239 2
12240 1
12251 1
12256 1
12261 0
12263 1
12266 1
12267 1
12268 1
12279 1
12280 1
12283 1
12284 1
12285 2
12286 2
12292 1
12295 1
12301 1
12314 1
12315 0
12318 1
12332 2
12334 2
12337 1
12338 1
12349 2
12350 1
12359 2
12360 2
12373 1
12374 1
12380 1
12382 1
12383 1
12390 1
12398 1
12405 1
12407 1
12410 2
12418 2
12421 1
12422 1
12439 1
12444 1
12463 2
12465 1
12470 1
12471 1
12480 1
12482 1
12484 1
12487 1
12491 1
12503 1
12507 1
12526 1
12533 1
12540 1
12543 1
12552 1
12555 2
12556 1
12570 1
12579 1
12588 1
12600 1
12615 2
12624 1
12629 1
12634 1
12638 1
12646 1
12650 1
12665 1
12674 1
12676 1
12678 1
12685 1
12690 1
12698 1
12702 1
12704 1
12705 1
12710 1
12715 2
12720 1
12734 1
12744 1
12747 1
12757 1
12758 0
12766 1
12782 1
12787 1
12799 1
12804 1
12809 1
12813 1
12816 1
12821 1
12826 1
12831 1
12832 1
12833 1
12835 1
12842 1
12844 1
12847 0
12852 1
12856 1
12857 1
12858 2
12861 1
12869 1
12876 1
12877 2
12879 1
12882 1
12883 2
12887 1
12889 1
12891 2
12894 1
12895 1
12899 2
12905 2
12913 1
12916 1
12917 1
12925 1
12934 2
12939 1
12943 2
12950 2
12961 0
12963 1
12973 2
12979 1
12980 0
12981 1
12982 1
12992 1
12994 1
12999 1
13002 1
13004 1
13010 1
13013 1
13015 1
13019 1
13030 0
13031 2
13036 1
13037 2
13042 1
13054 1
13060 2
13072 1
13073 1
13079 2
13081 1
13086 1
13087 1
13090 1
13098 1
13100 1
13105 1
13106 1
13107 1
13113 2
13115 2
13117 1
13118 1
13121 1
13137 0
13146 1
13150 1
13151 1
13152 1
13156 1
13165 2
13169 1
13178 1
13180 1
13183 1
13184 1
13188 1
13191 1
13196 1
13203 1
13206 1
13211 1
13219 1
13223 2
13226 1
13228 1
13230 1
13240 1
13249 2
13250 1
13256 1
13261 2
13263 1
13268 1
13275 2
13277 2
13283 1
13284 1
13285 1
13286 1
13287 1
13290 1
13291 2
13294 2
13295 1
13303 1
13306 0
13311 1
13322 1
13331 1
13337 1
13344 1
13362 1
13364 1
13366 1
13368 1
13370 0
13377 2
13378 1
13388 1
13392 1
13398 1
13403 1
13404 1
13409 1
13416 2
13422 0
13427 1
13433 2
13438 1
13441 2
13449 2
13450 1
13453 1
13460 1
13461 1
13465 1
13468 1
13481 1
13485 1
13487 2
13490 1
13493 1
13497 1
13508 2
13516 1
13525 1
13533 2
13535 2
13538 1
13545 1
13566 1
13581 1
13584 1
13588 1
13596 1
13600 2
13604 1
13608 1
13611 1
13612 1
13615 1
13616 1
13618 1
13625 1
13628 1
13629 1
13630 2
13633 1
13637 1
13640 1
13641 1
13651 1
13674 1
13684 1
13690 1
13707 1
13709 2
13710 1
13713 1
13724 2
13725 2
13731 1
13736 0
13740 2
13745 2
13748 1
13751 1
13758 1
13762 1
13764 2
13765 1
13769 0
13770 1
13774 2
13787 1
13791 0
13802 0
13807 1
13808 0
13809 1
13810 1
13822 1
13823 1
13825 1
13826 1
13833 1
13837 1
13842 1
13846 0
13852 1
13853 0
13858 1
13860 1
13866 1
13886 1
13887 2
13890 1
13891 1
13893 1
13902 2
13903 1
13908 1
13912 1
13924 1
13928 1
13929 1
13938 1
13939 1
13941 1
13951 0
13962 0
13964 1
13967 1
13971 1
13972 1
13975 1
13977 1
13979 2
13983 1
13984 1
13987 1
13994 1
13999 1
14003 2
14008 1
14011 1
14012 1
14016 1
14017 1
14020 1
14027 2
14038 1
14040 2
14042 1
14055 2
14057 1
14060 1
14081 1
14091 1
14111 1
14117 2
14121 1
14122 2
14125 1
14129 1
14135 1
14148 1
14157 1
14161 1
14163 1
14172 1
14180 1
14182 1
14188 2
14191 1
14201 2
14202 1
14213 1
14220 1
14224 1
14231 1
14241 2
14243 1
14245 1
14247 1
14248 1
14252 1
14254 1
14260 1
14269 1
14272 2
14274 1
14279 1
14280 2
14290 1
14298 2
14308 1
14313 1
14316 1
14319 2
14322 1
14323 1
14325 2
14337 1
14339 1
14341 1
14342 1
14346 2
14351 1
14354 1
14355 1
14358 2
14359 1
14364 1
14374 1
14376 0
14382 1
14384 1
14393 1
14398 1
14403 1
14406 1
14408 1
14411 2
14414 1
14418 1
14423 1
14442 1
14443 1
14444 1
14446 1
14455 1
14456 2
14458 1
14464 1
14466 1
14467 1
14469 1
14483 1
14484 1
14490 1
14491 1
14494 1
14496 1
14503 1
14504 1
14505 1
14506 1
14507 1
14512 1
14520 1
14527 0
14531 1
14532 2
14535 1
14543 1
14554 1
14556 2
14557 1
14561 1
14562 1
14567 0
14568 1
14574 1
14575 1
14579 1
14581 1
14582 1
14586 1
14591 1
14598 1
14599 1
14600 1
14612 2
14613 2
14624 1
14626 1
14630 2
14633 1
14639 1
14642 1
14643 2
14649 1
14650 1
14653 1
14655 1
14656 1
14662 1
14663 1
14673 1
14674 1
14676 1
14682 1
14685 2
14689 1
14693 1
14697 1
14700 1
14704 1
14710 1
14719 1
14724 1
14728 1
14735 1
14736 1
14741 1
14744 0
14753 1
14756 1
14762 2
14765 2
14783 2
14784 1
14786 0
14790 0
14793 1
14796 2
14801 1
14807 1
14812 1
14815 2
14831 1
14833 2
14836 2
14856 1
14859 1
14861 1
14863 0
14865 1
14880 1
14881 1
14883 1
14884 1
14894 1
14896 1
14899 1
14900 1
14901 1
14906 1
14907 1
14915 2
14919 1
14926 2
14927 1
14933 1
14937 1
14939 1
14940 1
14943 1
14953 1
14954 1
14969 1
14999 2
15008 1
15009 1
15018 1
15023 1
15025 1
15034 1
15036 1
15051 2
15052 1
15064 1
15070 1
15074 1
15077 1
15081 1
15086 2
15093 0
15094 2
15103 1
15104 1
15110 1
15112 0
15115 1
15131 1
15139 1
15141 1
15148 1
15154 2
15156 1
15161 0
15167 1
15178 1
15205 2
15207 2
15222 1
15223 2
15225 1
15228 1
15239 1
15241 1
15246 1
15247 1
15249 1
15255 1
15257 1
15267 1
15277 1
15280 2
15289 1
15297 1
15302 1
15304 0
15312 1
15321 0
15325 1
15326 1
15333 1
15337 1
15338 1
15340 1
15342 0
15344 1
15347 1
15349 2
15355 1
15359 1
15366 1
15367 0
15368 1
15369 1
15380 1
15381 1
15387 0
15388 1
15389 1
15392 2
15400 1
15405 1
15407 1
15408 2
15411 1
15413 1
15418 2
15419 2
15421 1
15425 1
15436 1
15438 1
15440 1
15443 1
15460 1
15464 1
15465 1
15473 1
15475 1
15483 1
15494 2
15495 2
15498 1
15499 1
15500 1
15501 1
15510 0
15512 1
15516 1
15518 2
15519 1
15524 1
15527 1
15529 1
15530 1
15538 1
15539 1
15541 1
15546 1
15547 1
15548 2
15552 0
15556 2
15567 1
15572 1
15573 1
15574 1
15577 1
15579 1
15581 1
15589 1
15596 1
15598 1
15599 1
15605 2
15606 2
15608 1
15616 1
15618 1
15621 1
15626 2
15638 1
15639 0
15642 1
15644 1
15646 1
15649 1
15656 1
15659 1
15680 1
15686 1
15693 2
15697 2
15699 2
15701 1
15705 1
15714 1
15722 2
15728 1
15734 1
15752 0
15756 1
15760 1
15762 1
15767 1
15768 1
15773 2
15774 1
15781 1
15782 1
15784 2
15791 1
15796 1
15798 2
15806 1
15814 1
15819 1
15825 1
15826 1
15831 1
15835 2
15836 1
15839 2
15845 1
15858 1
15859 1
15876 1
15878 2
15880 2
15886 2
15888 1
15891 1
15900 0
15902 1
15904 1
15908 0
15910 1
15917 1
15919 2
15924 1
15927 1
15937 2
15946 1
15949 1
15957 1
15961 1
15964 1
15965 1
15966 1
15978 1
15983 1
15987 1
15988 1
15998 1
16004 1
16008 1
16011 1
16023 1
16024 1
16025 1
16048 1
16050 1
16051 1
16057 2
16059 2
16060 1
16075 2
16094 2
16096 1
16116 1
16118 1
16121 1
16122 1
16124 2
16125 1
16126 1
16130 1

Model output

# export to .csv for submission
write.csv(lm_predicted_df, file = "C:/CUNY/Semester4(Spring)/DATA 621/Assignments/Homework5/Output/Wine_Sales_Prediction.csv",row.names = FALSE)

Our model prediction output can be found in the below GitHub location -

Model Output