DATA 621: Business Analytics And Data Mining

Overview

Purpose of this assignment is to explore, analyze and model a dataset containing approximately 12,000 commercially available wines. The variables are mostly related to the chemical properties of the wine being sold. The response variable(TARGET) is the number of sample cases of wine that were purchased by wine distribution companies after sampling a bottle of wine. These cases would be used to provide tasting samples to restaurants and wine stores around the United States. The more sample cases purchased, the more likely is a bottle of wine to be sold at a high-end restaurant. A large wine manufacturer wants to study the data to predict the number of wine cases that would be ordered based on the wine characteristics.

If the wine manufacturer can predict the number of cases, then that manufacturer will be able to adjust their wine offering to maximize sales. Our objective is to build a count regression model to predict the number of cases of wine that will be sold under given properties of the wine. Sometimes, the fact that a variable is missing is predictive of the target. For building various models, we will be using only the variables that are part of the dataset(or variables that are derived from the variables).

Below is a short description of the variables of interest in the dataset:

I have uploaded dataset to Github location, training and evaluation.

Data Exploration

Since data is observational, good data is the basis for constructing decent regression model. Let’s explore missing data along with other parameters including mean(\(\mu\)), standard deviation(\(\sigma\)), etc. For the scope of the project, a value of 0 in the response variable TARGET is considered as the wine was never sold in the market and value greater than 0 is treated as at least one case is being sold. Also, variable INDEX is observation indicator and won’t be part of the model building or data analysis.

Summary of the dataset.

Dataset contains twelve variables that express chemical properties of the wine and two variables related to wine marketing. In other words twelve variables may be treated as quantitative variables and other two variables as qualitative variables. However, based on the information from the website SUNY Oswego variables such as LabelAppeal and STARS may change roles between quantitative and categorical during analysis.

It is noticeable from the summary that quantitative variables ResidualSugar, Chlorides, FreeSulfurDioxide, TotalSulfurDioxide, pH, Sulphates, Alcohol have missing values(NA). Except Sulphates all other quantitative variables have approximately 5% of missing values. Variable Sulphates has about 10% missing values. Quantitive variable STARS has about 26% missing values.

Quantitative variables FixedAcidity, VolatileAcidity, CitricAcid, ResidualSugar, Chlorides, FreeSulfurDioxide, TotalSulfurDioxide, Sulphates, Alcohol have negative observations. According to studies posted on various websites, the chemical properties of the wine may not be measured in negative values. Only positive values describe if certain chemicals exist in wine. Hence their measurement has to be either positive indicating chemical’s presence or zero for non-existence.

Websites also suggest chemical properties we are working with always exist in any wine in minimum quantities if they are not printed explicitly on the wine bottle label. However, the article posted on Cornell.edu website shows a graph Change in Acetaldehyde mg/l with negative values. This is because the value is calculated as a change. Though the variable has nothing to do with our dataset, it gives an idea that negative values in our dataset may be calculated using some base value.

Website morewinemaking.com, suggests Sulfur-di-oxide(\(SO_2\)) content in a wine is pH dependent. As the pH goes up, higer levels of free \(SO_2\) are needed to protect the wine.

Table pH Value Vs Sulfur-Di-Oxide Content shows, for observation with INDEX value 12052 has pH value as 3.52 and has negative TotalSulfurDioxide value(-793). On the other hand observation with INDEX value 14870 has pH value as 3.34 and has positive TotalSulfurDioxide value(1041). It means condition mentioned in the article on the website morewinemaking.com does not hold, leading us to assume error was made during data collection.

Qualitative variable LabelAppeal has approximately 28% negative values. Since variable has no relation to wine quality, it seems valid. It provides a rating of wine bottle label and may be treated as a categorical variable. It is safe to assume the value of -2 is worst, and 2 is best as the variable only has five values.

Boxplots and histograms show the spread of the data of each variable. Both plots are mapped using rows that have complete data in every variable. Dataset contains 6436 rows that have no missing values. Other 6359 rows have at least one variable with a missing value.

Boxplot suggests all the variables that describe chemical properties of the wine has outliers and are evenly spread on both ends or quartiles. Variable TARGET has mean and median values same.

Histogram suggests data is evenly spread and all plots are unimodal except for variable AcidIndex, it shows some right-skewness.

Both boxplot and histogram of each variable describing chemical properties of the wine suggest we can create bins and convert variables into categorical values.

Correlation

Let’s check the relation between the variables. I will be using Pearson Correlation method to analyze the relationship. If the coefficient is 1 that indicates positive, perfect direct linear relationship. On the other hand, if the coefficient is -1 that indicates perfect decreasing (inverse) linear relationship.

The correlation plot suggests,

Correlation plot does not show any strong relationship between TARGET and other variables.
Variables TARGET and LabelAppeal have a mild positive relation, 0.36, suggesting as LabelAppeal rating increase, wine sales increase.
Variables TARGET and STARS have a medium positive relation, 0.56, this is the only relation with a value above 0.50. It suggests as wine rating increase, sales of that wine also increase.
Variables LabelAppeal and STARS have some week positive relation, and it may be due to rating, based on wine bottle presentation.
Variables TARGET and AcidIndex have week negative relation, suggesting AcidIndex has a negative impact on wine sales, as AcidIndex of wine increase, wine sales decrease.
Other variables that have week positive relation are AcidIndex and FixedAcidity.
During correlation analysis, variables LabelAppeal and STARS are treated as quantitative variables.

Conclusion of Data Exploration

There are approximately 50% observations with missing values. To build the decent model we cannot exclude missing data from the dataset. Hence we need to impute the dataset for filling missing values.
Variable STARS has approximately 26% missing values; this could be due to wine not released in the market.
Variable TARGET has 2734 observations with zero values; this could be due to two reasons, one could be manufacturer may not have released the wine into the market, and another reason could be based on the wine characteristics there were no orders placed. We may need to perform zero-inflated analysis to understand the condition.
The dataset has negative values in variables describing chemical properties of the wine. Since there is no way to know if the variables were normalized, I will be using two methods to prepare data for building models. The first method is, converting the negative values to positive assuming there was an error made during data collection stage. The second method is binning the continuous variables based on boxplot and histogram ranges.

Data Preparation

Data Conversion Method

First approach is to convert the variables FixedAcidity, VolatileAcidity, CitricAcid, ResidualSugar, Chlorides, FreeSulfurDioxide, TotalSulfurDioxide, Sulphates, and Alcohol to positive values.

wine_c.df <- wine.df[complete.cases(wine.df), ]
wine_c_abs.df <- abs(wine_c.df)

Let`s see individual histogram after data conversion.

Correlation Plot

After data conversion, histogram of each variable FixedAcidity, VolatileAcidity, ResidualSugar, Chlorides, FreeSulfurDioxide, TotalSulfurDioxide, and Sulphates show skewness in data distribution. All of them are right skewed.

However, change in sign of the variables, negative to positive had no impact on the correlation between variables.

Imputing Missing Values

Variables ResidualSugar, Chlorides, FreeSulfurDioxide, TotalSulfurDioxide, pH, Sulphates, and Alcohol have some missing variables. To fill in missing values, I will be using function knnImputation from DMwR package. Imputing missing values is a two-step process. In the first step I will impute chemical properties of the wine, and in the second step, I will impute variable STARS.

Since LabelAppeal and STARS has no direct relation to chemical properties of the wine, these variable are not used in the first step of imputation process.

Process of imputation is split into following steps,

Select chemical properties of the wine, FixedAcidity, VolatileAcidity, CitricAcid, ResidualSugar, Chlorides, FreeSulfurDioxide, TotalSulfurDioxide, Density, pH, Sulphates, Alcohol, and AcidIndex into seperate dataset.
Remove all observations with missing values.
Update remainder dataset variables with NA value on the random basis, approximately 300 observations.
Impute the dataset using function knnImputation.
Check for mape or rmse. Due to the zero observation in actual values(e.g. ResidualSugar), mape reports NA. For the variables with mape as NA, I have used rmse to obtain best imputation k value. Wherek is the number of neighbors to be compared. I will be using 30 for this project. ``

The second step, let’s impute 3359 missing values for variable STARS, using knnImputation function. Variable holds values between 1 and 4. All the variables from the dataset are used to impute STARS except TARGET variable.

Binning Method

As part of this process, we will be making an assumption that dataset is vaild and no errors were made during data collection. Since observation have negative numerical values we will bin them into ranges. Example,

wine.asis.df$AlcoholR<-cut(wine.asis.df$Alcohol,seq(-9.7,31.5,by=5))

Variables with maximum value less than 10 will have a class interval of 1.
Variables with maximum value less than 50 will have a class interval of 5,
Variables with maximum value less than 100 will have a class interval of 10,
Variables with maximum value less than 500 will have a class interval of 50,

Sample data after binning, we will add two extra variables for each variable with negative values. Example, ResidualSugar has negative values, we will add ResidualSugarR that holds bin range and ResidualSugarC holds a sequential number of the bin. Both variables are categorical.

Imputation Process

Data will be imputed using two steps process after binning the data,

Step one will be using kNN algorithm to impute missing values for all chemical properties. Variables TARGET, LabelAppeal and STARS won’t be part of the process.
Step two, variables LabelAppeal and STARS will be applied to resultant dataset from step one before variable STARS is imputed. The process will use knnImputation function of DMwR package. The resultant dataset will have complete rows.

Conclusion of Data Preparation

All the variables that have negative values describing chemical properties of the wine are converted to positive values. Two-step imputation process will be applied to keep all the observations. The resultant dataset will be first dataset, with 12795 observations.
The second dataset will be created by excluding observations that have missing values in variable STARS from first dataset. The resultant dataset will have a total of 9436 observations.
The third dataset will be created by binning each variable having negative values that describe chemical properties of the wine. Observations will be left as is, no conversion to absolute values. Sample data is shown in the table Binning Variables With Negative Values. Imputation process is applied after binning the data. The resultant dataset will have 12795 observations.
The fourth dataset will be created by excluding observations that have missing values in variable STARS from third dataset. The resultant dataset will have a total of 9436 observations.
Change of sign of an observation, negative to positive has no impact on the correlation.
Change of sign of an observation, negative to positive has impacted histograms; it shows right-skewness.
Total of 4 datasets will be generated to build various models.

Build Models

Using four datasets we will be generating four regression models of,

Poisson
Quasi-Poisson
Negative Binomial
Zero inflated Poisson Model
Multiple Linear Regression Model

Poisson Model

Poisson Model is defined as

\[ln(\mathbf{E}(\mathbf{Y}|\mathbf{x})) = \alpha + {\beta}^{'} \mathbf{x}\],

Sometimes written as,

\[ln(\mathbf{E}(\mathbf{Y}|\mathbf{x})) = {\theta}^{'} \mathbf{x}\]

Thus predicted mean of the associated Poisson distribution is given by,

\[\mathbf{E}(\mathbf{Y}|\mathbf{x}) = \mathbf{e}^{{\theta}^{'} \mathbf{x}}\]

Hence, coefficients(\(\beta\)) obtained from the model cannot be used for interpretation directly. However, sign(+ or -) associated with coefficient indicates increasing(+) or decreasing(-) log count of the dependent variable for a unit change in the independent variable.

The mean of the variable TARGET, number of wine cases sold(full dataset) is 3.03 and variance is 3.71. Ratio between variance and mean 1.23 and indicates some over dispersion exists.

On the other hand The mean of the variable TARGET, number of wine cases sold(smaller dataset) is 3.69 and variance is 2.41. Ratio between variance and mean is 0.65 and indicates some under dispersion exists.

However, this is based on the observation of single variable from both datasets and we cannot conclude the same of entire dataset.

Let’s build Poisson models

#Poisson models
#Stars imputed
wine.miss.stars.p <- glm(TARGET ~ ., data = wine.miss.stars.df, family = poisson)
summary(wine.miss.stars.p)

#Stars removed
wine.nomiss.stars.p <- glm(TARGET ~ ., data = wine.nomiss.stars.df, family = poisson)
summary(wine.nomiss.stars.p)

#Negative values converted to categories Stars included
wine.cat.miss.stars.p <- glm(TARGET ~ ., data = wine.cat.miss.stars.df, family = poisson)
summary(wine.cat.miss.stars.p)

#Negative values converted to categories Stars excluded
wine.cat.nomiss.stars.p <- glm(TARGET ~ ., data = wine.cat.nomiss.stars.df, family = poisson)
summary(wine.cat.nomiss.stars.p)

Mode-1

The dataset used for building Model-1 contains all the rows. Missing values are imputed using kNN algorithm. This model may be considered as a saturated model.

Interpretation

The model suggests variables FixedAcidity, and ResidualSugar have high p-value hence they are not statistically significant to the model for TARGET estimation. Variables Chlorides, FreeSulfurDioxide, Density, and pH are statistically significant at the 5% level. And all other variables are statistically significant at 1% level. Null deviance is 22861 If a model is built with just one variable TARGET, the sum of errors is 22861. Residual deviance is 18258.

As LabelAppeal and STARS are converted to categorical variables before building the model; they have more than one entry in the output. Both variables are missing first entry LabelAppeal1 and STARS1 as they are considered as reference group respectively.

Estimates explain change in log count for a unit change in the predictor variable.

The coefficient of variable VolatileAcidity suggests, for one unit increase in VolatileAcidity log count of TARGET decreases by 0.0527. In other words, if the volatile acidity of the wine increases by one unit, log cases of wine sold will decrease by a factor of 0.0527. It may be interpreted as there will be 5% decrease in the number of wine cases sold if the volatile acidity of wine goes up by one unit and all other variables held constant.

\[1 - e^{(-0.0527)} = 0.05\]

The coefficient of variable CitricAcid suggests, for one unit increase in CitricAcid log count of TARGET increases by 0.0184. In other words, if the citric acid content of the wine increases by one unit, log cases of the wine sold increases by a factor of 0.0184. It may be interpreted as there will be 2% increase in the number of wine cases sold if the citric acid content of the wine goes up by one unit and all other variables held constant. Since coefficient is positive, we subtract 1 from exponentiated \(\beta_i\).

\[e^{(0.0184)} - 1 = 0.02\]

The coefficient of variable Chlorides suggests, for one unit increase in Chlorides log count of TARGET decreases by 0.049. In other words, holding all other variables constant, if the chloride content of the wine increases by one unit, log cases of the wine sold decreases by a factor of 0.049. The model suggests, increase in chloride content of the wine has a negative impact on wine cases sold.
The coefficient of variables FreeSulfurDioxide and TotalSulfurDioxide suggests, for one unit increase, log count of TARGET increases by 0.0001. The model suggests, holding all other variables constant, increase in the sulfur dioxide content of the wine in any form has a positive impact on wine cases sold.
The coefficient of variables Density, pH, Sulphates, and AcidIndex suggests, for one unit increase in the variables will impact cases of wine sold inversely.
The coefficient of variable LabelAppeal2 is 0.3529 and is interpreted, with respected to base reference group LabelAppeal1, that is wine bottle’s label rating -2. The expected log count for LabelAppeal2 is 0.3529 greater than the expected log count for LabelAppeal1. In other words, as the rating of bottle’s label changes from -2 to -1, there is a higher chance of selling more cases of wine with bottle’s label rating -2 compared to bottle’s label rating -1. The base reference group is always denoted by value 1. When the coefficient is positive, we subtract 1 from the exponential value of the coefficient otherwise subtract the exponential value of the coefficient from 1. Holding all other variables constant, there is 42% higher chance of selling more cases of wine when bottle’s label ratings equal -1 compared to bottle’s label ratings -2.

\[e^{0.3529} -1 = 0.42\]

The coefficient of variable LabelAppeal5 is 0.8502, rules of interpretation remain same as LabelAppeal2. It is compared to base reference group LabelAppeal1, that is wine bottle label rating -2. In other words, holding all other variables constant when wine bottle’s label rating change from -2 to 2 there is 134% higher chance of selling more cases of wine with bottle’s label ratings 2 compared to bottle’s label rating -2. A similar interpretation may be applied to coefficients of variables LabelAppeal3 and LabelAppeal4, comparing them to LabelAppeal1.

\[e^{0.8502} -1 = 1.34\]

The coefficient of variable STARS2 is 0.1997, rules of interpretation remain same as LabelAppeal2. It is compared to base reference group STARS1. In this case, holding all other variables constant when overall wine rating increases from 1 star to 2 stars there is 22% higher chance of selling more 2 stars cases of wine compared to 1 star. A similar interpretation may be applied to coefficients of variable STARS3 and STARS4, comparing them to STARS1.

\[e^{0.1997} -1 = 0.22\]

Mode-2

For Model-2, I have removed rows that had NA values for variables STARS. I have done this to study how a change in the number of rows will impact the model. Also, variable description suggests that if the wine does not have STARS value, it may be indicative of weak sales.

Interpretation of coefficients of variables remains same as Model-1. The model suggests variables VolatileAcidity, Alcohol, AcidIndex, LabelAppealand STARS are statistically significant to the model for TARGET estimation.

Mode-3

For Model-3, I have binned all negative values based on maximum and minimum values. Dataset is imputed to fill in missing values. As no rows are removed from the dataset, it may be considered as a saturated model for categorical values.

Model suggest Density, pH, Alcohol, AcidIndex, LabelAppeal and STARS are statistically significant to the model.

Mode-4

For Model-4, I have binned all negative values based on maximum and minimum values. Dataset is imputed to fill in missing values. However, rows with missing STARS were removed from the dataset.

Model suggest AcidIndex, LabelAppeal and STARS are statistically significant to the model.

Analysis Of Models

Model Stats Interpretation

Deviance, df- degrees of freedom are taken from the summary of each model. Poisson regression is based on Poisson distribution, where \(\mathbf{E(Y)} = \mathbf{Var{Y}}\). This condition may be checked by dividing the sum of Pearson chi-squared and residual degrees of freedom.

\[proportion = \frac{sum(resudials(model.\#, "pearson")^2)}{df.resudial(model.\#)}\]

If the proportion is greater than 1, that implies there is more variability around the model’s fitted values and is known as overdispersion. On the other hand, if the value is less than 1 there is less variability around the model’s fitted values and is known as underdispersion. In both cases, it violates Poisson distribution assumption. And in general terms, the model is referred as overdispersed or underdispersed model.

Parameter overdispersed or underdispersed is calculated as,

\[adj.factor.phi = \sqrt{proportion} = \sqrt{\frac{sum(resudials(model.\#, "pearson")^2)}{df.resudial(model.\#)}}\]

Dispersion parameter adj.factor.phi is used in adjusting standard errors. If the adj.factor.phi value is greater than 1, then standard errors increase in value and it impacts variable contribution to the model. In other words, if standard error is large variable may not be statistically significant to the model.

When the adj.factor.phi value is less than 1, then standard errors decrease in value and it impacts variable contribution to the model. In other words, if standard error is small, variable may be statistically significant to the model. It means the variable can explain the change in the dependent variable.

-- chi-square test
-- where # is model number
p-value <- 1 - pchisq(summary(model.#)$deviance, summary(model.#)$df.residual)

--goodness of fit tests

--where # is model number
pr <- resudials(model.#, "pearson")

--give ratio between variance and mean
proportion <- sum(pr^2)/df.resudials(model.#) 

--used for correcting standard errors
adj.factor.phi <- sqrt(proportion)

Following are the results of goodness of fit(chi-squared) test to each model

p-value of Model-1 is 0, fails goodness of fit as it is less than 0.05. Model-1 does not fit the data.
p-value of Model-2 is 1, suggests it passes goodness of fit as value is greater than 0.05. However, the ratio between the sum of Pearson's chi-squared and residual degrees of freedom is less than 1. It suggests variance is less than mean by approximately 35%(1- 0.6472). Hence, model violates Poisson distribution assumption and is underdispersed.
p-value of Model-3 is 0, fails goodness of fit as it is less than 0.05. Model-3 does not fit the data.
p-value of Model-4 is 1, suggests it passes goodness of fit as value is greater than 0.05. However, the ratio between the sum of Pearson's chi-squared and residual degrees of freedom is less than 1. It suggests variance is less than mean by approximately 35%(1- 0.6472). Hence, model violates Poisson distribution assumption and is underdispersed.

To conclude, all four Poisson models fail the goodness of fit test. Under-dispersion is identified in all models where variance is smaller than mean. Two common approaches used for correcting overdispersion are Quasi-Poisson and Negative binomial regression.

However, in case of underdispersion Poisson and Negative binomial regressions result in same coefficients.

Quasi-Poisson Models

Let’s built Quasi-Poisson models using same datasets and see if standard errors change. Following shows summary of Model-1.

Coefficients are same as Poisson model but standard errors changed. We can see standard errors reduced by decimal points. Same is the case with all four models. Following tables show the comparison between Poisson and Quasi-Poisson models. Along with estimate, I have added extra column ratio, it explains differences in standard errors. Since values are smaller than 1 we subtract ratio from 1 to get percentage difference.

Model-1’s standard errors are smaller by approximately 2.5%.
Model-2’s standard errors are lower by about 35.2%.
Model-3’s standard errors are smaller by around 2.9%.
Model-4’s standard errors are smaller by approximately 35.6%.

In conclusion, quasi-poisson models have proved that under-dispersion exists, but the model did not address it.

Negative Binomial Models

Negative Binomial Distribution is defined as,

Suppose there is a sequence of independent Bernoulli trials. Thus, each trial has two potential outcomes called “success” and “failure”. In each trial the probability of success is \(p\) and of failure is \((1-p)\). We are observing this sequence until a predefined number \(r\) of failures has occurred. Then the random number of successes we have seen, \(\mathbf{X}\), will have the negative binomial distribution \(\mathbf{X}\tilde{}\mathbf{NB}(r,p)\)

Then mean is defined as \[\mu = \mathbf{E}[X] = \frac{pr}{1-p}\]

And variance as \[\sigma^2 = \mathbf{Var(X)} = \frac{pr}{(1-p)^2} = \mu + \frac{\mu^2}{r}\]

I will be using same datasets to build negative binomial model.

Model-1

Model is similar to Poisson model in terms coefficients, standard errors, deviance. Following summary table shows comparision between poisson, quasipoisson and negative binomial models.

Model-2

Model-3

Model-4

Due to under-dispersion poisson and negative binomial models have similar coefficients and significance to the model.

Probability mass function the negative binomial distribution is,

\[\sigma^2 = var(X) = \mu + \frac{\mu^2}{r}\]

where \(r\), is dispersion parameter. When \(r\) is large, \(\frac{1}{r}\) tends to be zero, making \(\sigma^2 = \mu\). When \(\sigma^2 = \mu\), it violates negative binomial distribution assumptions.

theta is the dispersion parameter. I have pre-calculated \(\frac{1}{\theta}\) and is shown as proportion in the table.

Following are the results of the goodness of fit(chi-squared) test of each negative binomial model.

p-value of Model-1 is 0, fails goodness of fit test as the value is less than 0.05. Model-1 does not fit the data.
p-value of Model-2 is 1, suggests it passes goodness of fit test as the value is greater than 0.05. However, \(\frac{1}{\theta}\), value is very small making \(\sigma^2 = \mu\). Thus, the model fails goodness of fit test. In fact, in this case, the negative binomial model is same as Poisson model.
p-value of Model-3 is 0, fails goodness of fit test as the value is less than 0.05. Model-3 does not fit the data.
p-value of Model-4 is 1, suggests it passes goodness of fit as value is greater than 0.05. However, \(\frac{1}{\theta}\), value is very small making \(\sigma^2 = \mu\). Thus, the model fails goodness of fit test.

Article from stackexchange.com, suggest to identify the reason behind underdispersion, fix data and apply Negative Binomial regression methods.

Let’s examine the data distribution between various count ranges using rootogram plot.

The plot indicates if a bar is hanging below zero means underfitting and hanging over zero means overfitting. In our case, all four models have underfitting for zero bucket by a large amount. For this dataset, there is no difference between poisson or negative binomial models as zero is one of the outputs. A common approach is using Zero-Inflated Poisson or negative binomial regression. Since there is not much difference between Poisson or negative binomial regression for our dataset, I will be using Zero-Inflated Poisson regression.

Zero-Inflated Models

Let’s build Zero-Inflated Poisson models.

Zero-inflated model is created using zeroinfl function from pscl package. Summary of the model contains two sets of coefficients.

First set is for count data, and the model predicts mean(\(\mu\)) number of wine cases sold, where TARGET is not always zero. This output can be customized based on prior analysis, and we can use poisson, quasipoisson, negative binomial.
The second set of coefficients are in logit form and represent probabilities of always zero. The output may be interpreted as log odds or odds ratio.

Interpretation

Count Coefficients

The coefficients of variables FixedAcidity, VolatileAcidity, CitricAcid, ResidualSugar, Chlorides, FreeSulfurDioxide, TotalSulfurDioxide, Density, pH, and Sulphates are not statistically significant to the model.
The coefficient of variable Alcohol suggests, for one unit increase in Alcohol log count of TARGET increases by 0.0077. In other words, if the alcohol content of the wine increases by one unit, log cases of wine sold will increase by a factor of 0.0077. It may be interpreted as there will be 0.7% increase in the number of wine cases sold if the alcohol content of wine goes up by one unit and all other variables held constant.

\[e^{0.0077} - 1 = 0.0077\]

The coefficient of variable AcidIndex suggests, for one unit increase in AcidIndex log count of TARGET decreases by 0.0168. In other words, if the acidic index of the wine increases by one unit, log cases of wine sold will decrease by a factor of 0.0168. It may be interpreted as there will be 1.67% decrease in the number of wine cases sold if the acidic index of wine goes up by one unit and all other variables held constant.

\[1 - e^{-0.0168} = 0.0167\]

The coefficient of variable LabelAppeal2 is 0.4949 and is interpreted, with respected to base reference group LabelAppeal1, that is wine bottle’s label rating -2. The expected log count for LabelAppeal2 is 0.4949 greater than the expected log count for LabelAppeal1. In other words, as the rating of bottle’s label changes from -2 to -1, there is a higher chance of selling more cases of wine with bottle’s label rating -1 compared to bottle’s label rating -2. The base reference group is always denoted by value 1. When the coefficient is positive, we subtract 1 from the exponential value of the coefficient otherwise subtract the exponential value of the coefficient from 1. Holding all other variables constant, there is 64% higher chance of selling more cases of wine when bottle’s label ratings equal -1 compared to bottle’s label ratings -2.

\[e^{0.4949} -1 = 0.6403\]

The coefficient of variable LabelAppeal5 is 1.1610, rules of interpretation remain same as LabelAppeal2. It is compared to base reference group LabelAppeal1, that is wine bottle label rating -2. In other words, holding all other variables constant when wine bottle’s label rating change from -2 to 2 there is 219% higher chance of selling more cases of wine with bottle’s label ratings 2 compared to bottle’s label rating -2. A similar interpretation may be applied to coefficients of variables LabelAppeal3 and LabelAppeal4, comparing them to LabelAppeal1.

\[e^{1.1610} -1 = 2.19\]

The coefficient of variable STARS2 is 0.0976, rules of interpretation remain same as LabelAppeal2. It is compared to base reference group STARS1. In this case, holding all other variables constant when overall wine rating increases from 1 star to 2 stars there is 10.25% higher chance of selling more 2 stars cases of wine compared to 1 star. A similar interpretation may be applied to coefficients of variable STARS3 and STARS4, comparing them to STARS1.

\[e^{0.0976} -1 = 0.1025\]

Zero coefficients

The coefficients of variables FixedAcidity, ResidualSugar, Chlorides, Density, and Alcohol are not statistically significant to the model.
For one unit increase VolatileAcidity, the log odds of TARGET = 0 increase by 0.2364. In other words as the volatile acidity of wine increase by one unit, the odds of wine case not being sold increase by 26.67% keeping all other predictors constant.

\[e^{0.2364} - 1 = 0.2667\]

For one unit increase CitricAcid, the log odds of TARGET = 0 decrease by 0.1484. In other words as citric acid content in the wine increase by one unit, the odds of wine case not being sold decrease by 13.797% keeping all other predictors constant.

\[1 - e^{0.2364} = 0.1379\]

The coefficients of variables FreeSulfurDioxide and TotalSulfurDioxide suggest, for every one unit increase, chances of wine not being sold decreases by log odds of 0.0006 and 0.0010 respectively.
The coefficient of variable LabelAppeal2 is 1.6214 and is interpreted, with respected to base reference group LabelAppeal1, that is wine bottle’s label rating -2. As the rating of bottle’s label changes from -2 to -1, there is a higher chance of wine not being sold when bottle’s label rating -1 compared to bottle’s label rating -2. The base reference group is always denoted by value 1. When the coefficient is positive, we subtract 1 from the exponential value of the coefficient otherwise subtract the exponential value of the coefficient from 1. Holding all other variables constant, there is 406% higher chance of wine not being sold when bottle’s label ratings equal -1 compared to bottle’s label ratings -2. A similar interpretation may be applied to coefficients of variables LabelAppeal3, LabelAppeal4 and LabelAppeal5, comparing them to LabelAppeal1.

\[e^{1.6214} - 1 = 4.0602\]

The coefficient of variable STARS2 is -0.4337, rules of interpretation remain same as LabelAppeal2. It is compared to base reference group STARS1. In this case, holding all other variables constant when overall wine rating increases from 1 star to 2 stars there is 35% lower chance that 2 stars case of wine not being sold compared to 1 star. A similar interpretation may be applied to coefficients of variable STARS3 and STARS4, comparing them to STARS1.

\[1 - e^{-0.4337} = 0.3519\]

Multiple Linear Regression Models

I will using same data to build multiple linear regression model.

Model-1

Model-1 suggests variables FixedAcidity and ResidualSugar are not statistically significant.
It is very similar to Poisson Model-1.
R² value 0.2975, indicates model can explain 29% of the variance in the data.
The blue line indicates the fit line on residuals Vs. fitted plot does not show any pattern.
Q-Q plot show some deviation at lower left corner. It might be due to outliers presence in the dataset.
Cook’s distance plot identifies observations 5170, 5562 and 11289 as outliers. Further analysis is required.

Model-2

Model-2 suggests variables VolatileAcidity, Density, Alcohol, AcidIndex, LabelAppeal and STARS are statistically significant.
It is very similar to Poisson Model-2.
R² value 0.4595, indicates model can explain 46% of the variance in the data.
The blue line indicates the fit line on residuals Vs. fitted plot does not show any pattern.
Q-Q plot show some deviation at lower left corner. It might be due to outliers presence in the dataset.
Cook’s distance plot identifies observations 369, 4922 and 10090 as outliers. Further analysis is required.

Model-3

Model-3 suggests variables Chlorides, Density, pH, Alcohol, AcidIndex, LabelAppeal and STARS are statistically significant.
TotalSulfurDioxide is significant at 10% level.
It is very similar to Poisson Model-3.
R² value 0.3184, indicates model can explain 32% of the variance in the data.
The blue line indicates the fit line on residuals Vs. fitted plot does not show any pattern.
Q-Q plot show some deviation at lower left corner. It might be due to outliers presence in the dataset.
Cook’s distance plot identifies observations 6837 as an outlier. Further analysis is required.

Model-4

Model-4 suggests variables VolatileAcidity, Chlorides, Alcohol, AcidIndex, LabelAppeal and STARS are statistically significant.
Density is significant at 10% level.
It is very similar to Poisson Model-4.
R² value 0.474, indicates model can explain 47% of the variance in the data.
The blue line indicates the fit line on residuals Vs. fitted plot does not show any pattern.
Q-Q plot show some deviation at lower left corner. It might be due to outliers presence in the dataset.
Cook’s distance plot identifies observations 3067 as an outlier. Further analysis is required.

Select Models

After looking at the summary of four models,

Poisson Model-1 and Model-3 failed goodness of fit test. Model-2 and Model-4 violates Poisson distribution assumption.
Quasi-Poission models resulted in smaller standard errors and indicated underdispersion.
Negative Binomial Model-1 and Model-3 failed goodness of fit test. Model-2 and Model-4 violates Negative Binomial distribution assumption.

Reduced Multiple Linear Model-3

Zero-Inflated Model-3

Comparing statistics of Multiple Linear models, I lean towards Model-3, \(R^2\) value of full model is 0.3184 indicating model explains 31% of data variation. Model is developed by creating categories without changing actual value. Also, the model is based on an entire dataset without excluding any observation.
Reduced Multiple Linear Model-3 it is developed using less number of variables and \(R^2\) value is 0.2936. lm(formula = TARGET ~ ChloridesC + Density + pH + AcidIndex + LabelAppeal + STARS, data = wine.cat.miss.stars.df)
Comparing statistics of Zero-Inflated, I lean towards Model-3, AIC value is 44860.26. AIC is higher compared to other Zero-Inflated models. However, Model-3 is developed by creating categories without changing actual value. Also, the model is based on an entire dataset without excluding any observation.
Reduced Zero-Inflated Model-3 it is developed using less number of variables and AIC value is 44858.04, lower than full model. zeroinfl(formula = TARGET ~ AcidIndex + LabelAppeal + STARS | pH + AcidIndex + +LabelAppeal + STARS, data = wine.cat.miss.stars.df)
Multiple Linear Model-3, produces negative count values for TARGET. Due to negative predicted values, I am more likely to reject the model.
Based on the predicted values, I am more likely to select reduced Zero-Inflated Model-3.

Conclusion

Analysing the dataset, I would conclude wine sales depend on acidic index, pH, label appeal and rating in terms of stars.
Plots suggest, optimal acidic index is between 6 - 9 units and pH is 2 - 4.5 units. Wine meeting the qualities will have better sales.
Removing observations that have NA values in variable STARS had no impact on building models, Zero-Inflated reduced models, Model-3 and Model-4 resulted in same variables. However, further analysis is needed for outlier analysis.

References

Appendix

R Code:

options(scipen=4)
library(VIM)
library(fBasics)
library(knitr)    # Report display, table format
library(kableExtra)
library(reshape2)
library(dplyr)
library(tidyverse)
library(ggplot2)
library(gridExtra)
library(grid)
library(magrittr)
library(mice)
library(DMwR)
library(caret)
library(lmtest)
library(pscl)
library(car)
library(arm)
library(MASS)
library(ggfortify)

wine.df <- as.data.frame(read.csv("https://raw.githubusercontent.com/akulapa/Akula-DATA621-Project04/master/wine-training-data.csv", header=T, stringsAsFactors = F, na.strings=c("","NA")))

wine.eval.df <- as.data.frame(read.csv("https://raw.githubusercontent.com/akulapa/Akula-DATA621-Project04/master/wine-evaluation-data.csv", header=T, stringsAsFactors = F, na.strings=c("","NA")))

str(wine.df)
summary(wine.df)

#missing variables plot
aggr_plot <- aggr(wine.df, numbers=F, sortVars=F, labels=names(wine.df), cex.axis=.45, gap=3, ylab=c("Missing data","Pattern"))

summary(aggr_plot)$missings %>%  
  data.frame() %>% filter(Count > 0) %>% 
  mutate(Percentage = Count*100/nrow(wine.df)) %>% 
  mutate(Percentage = paste0(round(Percentage,2),'%')) %>% 
  kable("html",caption = "Variables With Missing Values", row.names = F, digits = 2) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 12) 
tmp <- wine.df %>%
  dplyr::select (-INDEX) %>% 
  basicStats()

tmp <- data.frame(t(tmp))
tmp <- tmp[ , -which(names(tmp) %in% c("SE.Mean","LCL.Mean","UCL.Mean"))]
colnames(tmp)[which(names(tmp) == "X1..Quartile")] <- "1st. Quartile"
colnames(tmp)[which(names(tmp) == "X3..Quartile")] <- "3rd. Quartile"
colnames(tmp)[which(names(tmp) == "nobs")] <- "Observations"

tmp %>%
  kable(format="html", digits= 2, caption = "Wine Dataset Variable Summary") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left")

melt(wine.df) %>% 
  data.frame() %>% 
  filter(value < 0) %>% 
  dplyr::select(Variable = variable, value) %>% 
  group_by(Variable) %>% 
  summarize(Count = n(), Percentage = n()*100/nrow(wine.df)) %>% 
  dplyr::select(Variable, Count, Percentage) %>% 
  mutate(Percentage = paste0(round(Percentage,2),'%')) %>% 
  kable("html",caption = "Variables With Negative Values", row.names = F, digits = 2) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 12)

wine.df %>% 
  filter(TotalSulfurDioxide < 0) %>% 
  arrange(TotalSulfurDioxide) %>% 
  head(5) %>% 
  rbind(wine.df %>% filter(TotalSulfurDioxide > 0) %>% arrange(desc(TotalSulfurDioxide)) %>% head(5)) %>% 
  kable("html",caption = "pH Value Vs Sulphar-Di-Oxide Content", row.names = F, digits = 2) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 12)

table(wine.df$TARGET)

qplot(wine.df$TARGET, geom="histogram", binwidth = 0.5, xlab = 'TARGET', ylab = 'Count', main = 'Number Of Observations')

factor(unique(wine.df$LabelAppeal))

# Get lower triangle of the correlation matrix
get_lower_tri<-function(cormat){
  cormat[upper.tri(cormat)] <- NA
  return(cormat)
}

# Get upper triangle of the correlation matrix
get_upper_tri <- function(cormat){
  cormat[lower.tri(cormat)]<- NA
  return(cormat)
}

#Get all chemical process variables
wine_c.test.df <- wine.df %>% 
  dplyr::select(TARGET,FixedAcidity, VolatileAcidity, CitricAcid, ResidualSugar, Chlorides, FreeSulfurDioxide, TotalSulfurDioxide, Density, pH, Sulphates, Alcohol, AcidIndex, LabelAppeal, STARS) %>% mutate(LabelAppeal = LabelAppeal + 3) #Make label apple positive

cormat <- round(cor(wine_c.test.df, method = "pearson", use = "pairwise.complete.obs"),2)
upper_tri <- get_upper_tri(cormat)
melted_cormat <- melt(upper_tri, na.rm = TRUE)

ggheatmap <- ggplot(melted_cormat, aes(Var2, Var1, fill = value))+
 geom_tile(color = "white")+
 scale_fill_gradient2(low = "red", high = "steelblue", mid = "white", 
   midpoint = 0, limit = c(-1,1), space = "Lab", 
    name="Pearson\nCorrelation") +
  theme_minimal()+ # minimal theme
 theme(axis.text.x = element_text(angle = 45, vjust = 1, size = 7, hjust = 1))+
 theme(axis.text.y = element_text(vjust = 1, size = 7, hjust = 1))+
 coord_fixed()

ggheatmap + 
geom_text(aes(Var2, Var1, label = value), color = "black", size = 2) +
theme(
  axis.title.x = element_blank(),
  axis.title.y = element_blank(),
  panel.grid.major = element_blank(),
  panel.border = element_blank(),
  panel.background = element_blank(),
  axis.ticks = element_blank(),
  legend.justification = c(1, 0),
  legend.position = c(0.6, 0.7),
  legend.direction = "horizontal")+
  guides(fill = guide_colorbar(barwidth = 7, barheight = 1,
                title.position = "top", title.hjust = 0.5))

#Convert negative values to positive and test again
wine_c.test.df <- abs(wine_c.test.df)
cormat <- round(cor(wine_c.test.df, method = "pearson", use = "pairwise.complete.obs"),2)
upper_tri <- get_upper_tri(cormat)
melted_cormat <- melt(upper_tri, na.rm = TRUE)

ggheatmap <- ggplot(melted_cormat, aes(Var2, Var1, fill = value))+
 geom_tile(color = "white")+
 scale_fill_gradient2(low = "red", high = "steelblue", mid = "white", 
   midpoint = 0, limit = c(-1,1), space = "Lab", 
    name="Pearson\nCorrelation") +
  theme_minimal()+ # minimal theme
 theme(axis.text.x = element_text(angle = 45, vjust = 1, size = 7, hjust = 1))+
 theme(axis.text.y = element_text(vjust = 1, size = 7, hjust = 1))+
 coord_fixed()

ggheatmap + 
geom_text(aes(Var2, Var1, label = value), color = "black", size = 2) +
theme(
  axis.title.x = element_blank(),
  axis.title.y = element_blank(),
  panel.grid.major = element_blank(),
  panel.border = element_blank(),
  panel.background = element_blank(),
  axis.ticks = element_blank(),
  legend.justification = c(1, 0),
  legend.position = c(0.6, 0.7),
  legend.direction = "horizontal")+
  guides(fill = guide_colorbar(barwidth = 7, barheight = 1,
                title.position = "top", title.hjust = 0.5))


wine_c.df <- wine.df[complete.cases(wine.df), ]
nrow(wine_c.df)

#Boxplots
wine_c.df %>% 
  dplyr::select(-INDEX) %>% 
  melt() %>% 
  ggplot(aes(factor(variable), value)) +
    geom_boxplot() + facet_wrap(~variable, scale="free", nrow = 8, ncol = 2) +
    theme(axis.text.x = element_text(vjust = 0.5, size = 6, hjust = 0.5, colour = 'black')) +
    theme(axis.text.y = element_text(vjust = 0.5, size = 6, hjust = 0.5, colour = 'black')) +
    labs(title="Boxplot: Each Variable",x="", y="") +
    theme(strip.background = element_blank(),  strip.text.x = element_blank()) + coord_flip()

#Histogram
wine_c.df %>% 
  dplyr::select(-INDEX) %>% 
  melt() %>% 
  ggplot(aes(x = value)) + 
      geom_histogram() + facet_wrap(~variable,scales = "free", nrow = 4, ncol = 4) +
    theme(axis.text.x = element_text(vjust = 0.5, size = 6, hjust = 0.5, colour = 'black')) +
    theme(axis.text.y = element_text(vjust = 0.5, size = 6, hjust = 0.5, colour = 'black')) +
    labs(title="Histogram: Each Variable",x="", y="") +
    theme(strip.background = element_blank(),  strip.text.x = element_text(vjust = 0.5, size = 6, hjust = 0.5, colour = 'black'))

#Lets test impute process
wine_c_abs.df <- wine_c.df %>% dplyr::select(-INDEX, -LabelAppeal) %>% abs()
wine_c_abs.df <- cbind(wine_c_abs.df, LabelAppeal = wine_c.df$LabelAppeal)

wine_c_abs.df %>% 
  melt() %>% 
  ggplot(aes(x = value)) + 
      geom_histogram() + facet_wrap(~variable,scales = "free", nrow = 4, ncol = 4) +
    theme(axis.text.x = element_text(vjust = 0.5, size = 6, hjust = 0.5, colour = 'black')) +
    theme(axis.text.y = element_text(vjust = 0.5, size = 6, hjust = 0.5, colour = 'black')) +
    labs(title="Histogram: Each Variable",x="", y="") +
    theme(strip.background = element_blank(),  strip.text.x = element_text(vjust = 0.5, size = 6, hjust = 0.5, colour = 'black'))

wine_c_abs.df <- wine_c.df %>% dplyr::select(-INDEX) %>% mutate(LabelAppeal = LabelAppeal + 3) %>% abs()
wine_c.test.df <- wine_c_abs.df

#Get all chemical process variables
wine_c.test.df <- wine_c.test.df %>% 
  dplyr::select(FixedAcidity, VolatileAcidity, CitricAcid, ResidualSugar, Chlorides, FreeSulfurDioxide, TotalSulfurDioxide, Density, pH, Sulphates, Alcohol, AcidIndex)

#Replace with NA values
set.seed(7374)
wine_c.test.df[sample(1:nrow(wine_c.test.df), 300), "ResidualSugar"] <- NA
wine_c.test.df[sample(1:nrow(wine_c.test.df), 300), "Chlorides"] <- NA
wine_c.test.df[sample(1:nrow(wine_c.test.df), 300), "FreeSulfurDioxide"] <- NA
wine_c.test.df[sample(1:nrow(wine_c.test.df), 300), "TotalSulfurDioxide"] <- NA
wine_c.test.df[sample(1:nrow(wine_c.test.df), 150), "pH"] <- NA
wine_c.test.df[sample(1:nrow(wine_c.test.df), 600), "Sulphates"] <- NA
wine_c.test.df[sample(1:nrow(wine_c.test.df), 300), "Alcohol"] <- NA

#kNN imputation
wine_c.test.knn <- knnImputation(wine_c.test.df, 10, meth='weighAvg')

misscols <- c('ResidualSugar', 'Chlorides', 'FreeSulfurDioxide', 'TotalSulfurDioxide', 'pH', 'Sulphates', 'Alcohol')

#Check best imputation method, using mape or rmse
Accuracy.df <- data.frame()
i = 1
for(cn in misscols){
  f <- paste0('wine_c_abs.df$',cn,'[is.na(wine_c.test.df$',cn,')]')
  actual <- eval(parse(text=f))
  
  f <- paste0('wine_c.test.knn$',cn,'[is.na(wine_c.test.df$',cn,')]')
  predicts <- eval(parse(text=f))
  
  error.rate <- regr.eval(actual, predicts)
  if(i==1){
    Accuracy.df <- data.frame(error.rate, stringsAsFactors = F) %>% 
      t() %>% 
      data.frame()
    i = i + 1
  } else {
    
    A <- data.frame(error.rate, stringsAsFactors = F) %>% 
                  t() %>% 
                  data.frame()
    Accuracy.df <- rbind(Accuracy.df, A)
  }

}

#impute STARS
wine_c.test.df <- cbind(wine_c.test.knn, LabelAppeal = wine_c_abs.df$LabelAppeal, STARS=wine_c_abs.df$STARS)
wine_c.test.df[sample(1:nrow(wine_c.test.df), 600), "STARS"] <- NA

wine_c.test.df$LabelAppeal <- factor(wine_c.test.df$LabelAppeal, levels = c(1,2,3,4,5))
wine_c.test.df$STARS <- factor(wine_c.test.df$STARS, levels = c(1,2,3,4))

#kNN imputation
wine_c.test.knn <- knnImputation(wine_c.test.df, 10, meth='weighAvg')

actual <- wine_c_abs.df$STARS[is.na(wine_c.test.df$STARS)]
predicts <- wine_c.test.knn$STARS[is.na(wine_c.test.df$STARS)]
predicts <- as.numeric(predicts)
error.rate <- regr.eval(actual, predicts)

A <- data.frame(error.rate, stringsAsFactors = F) %>% 
              t() %>% 
              data.frame()
Accuracy.df <- rbind(Accuracy.df, A)

Accuracy.df$mae <- as.character(Accuracy.df$mae)
Accuracy.df$mse <- as.character(Accuracy.df$mse)
Accuracy.df$rmse <- as.character(Accuracy.df$rmse)
Accuracy.df$mape <- as.character(Accuracy.df$mape)
Accuracy.df$mape <- ifelse(Accuracy.df$mape == "Inf", NA, Accuracy.df$mape)
Accuracy.df$mae <- as.numeric(Accuracy.df$mae)
Accuracy.df$mse <- as.numeric(Accuracy.df$mse)
Accuracy.df$rmse <- as.numeric(Accuracy.df$rmse)
Accuracy.df$mape <- as.numeric(Accuracy.df$mape)
Accuracy.df <- data.frame(Accuracy.df, stringsAsFactors=F)
row.names(Accuracy.df) <- c('ResidualSugar', 'Chlorides', 'FreeSulfurDioxide', 'TotalSulfurDioxide', 'pH', 'Sulphates', 'Alcohol','STARS')

Accuracy.df %>% 
  kable("html",caption = "Imputation Accuracy Using kNN - k=10", row.names = T, digits = 4) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 12)

#**************Imputation - abs dataset*******************#
#Step 1
#Imput wine chemical properties
wine_cp.df <- wine.df %>% 
  dplyr::select(FixedAcidity, VolatileAcidity, CitricAcid, ResidualSugar, Chlorides, FreeSulfurDioxide, TotalSulfurDioxide, Density, pH, Sulphates, Alcohol, AcidIndex)

wine_cp.df <- abs(wine_cp.df)

#kNN Imputation
wine_cp.knn <- knnImputation(wine_cp.df, 30, meth='weighAvg')

#Bind leftout variables after imputation to make it complete dataset, convert LabelAppeal to positive
wine_cp.knn <- cbind(wine_cp.knn, LabelAppeal = wine.df$LabelAppeal + 3, STARS = wine.df$STARS)
wine_cp.knn$LabelAppeal <- factor(wine_cp.knn$LabelAppeal, levels = c(1,2,3,4,5))
wine_cp.knn$STARS <- factor(wine_cp.knn$STARS, levels = c(1,2,3,4))

#kNN imputation
wine.miss.stars.df <- knnImputation(wine_cp.knn, 30, meth='weighAvg')
wine.miss.stars.df <- cbind(TARGET = wine.df$TARGET, wine.miss.stars.df) #1st dataset for the project

#Remove Observations that have missing STARS
wine.nomiss.stars.df <- wine.miss.stars.df[complete.cases(wine_cp.knn), ] #2nd dataset for the project

#**************Imputation - with negative observations in dataset*******************#
#Get all chemical process variables
wine.asis.df <- wine.df %>% 
  dplyr::select(FixedAcidity, VolatileAcidity, CitricAcid, ResidualSugar, Chlorides, FreeSulfurDioxide, TotalSulfurDioxide, Density, pH, Sulphates, Alcohol, AcidIndex)

#Convert negative variables in category
negcols <- c('FixedAcidity', 'VolatileAcidity', 'CitricAcid', 'ResidualSugar', 'Chlorides', 'FreeSulfurDioxide', 'TotalSulfurDioxide', 'Sulphates', 'Alcohol')
i = 1
for(nc in negcols){
  #Get min value of the column
  f <- paste0('min(wine.asis.df$',nc,', na.rm=T)')
  minVal <- eval(parse(text=f))

  #Get max value of the column
  f <- paste0('max(wine.asis.df$',nc,', na.rm=T)')
  maxVal <- eval(parse(text=f))
  
  #Get incremental value
  if (maxVal < 10){incVal = 1}
  if (maxVal >= 10 && maxVal < 50){incVal = 5}
  if (maxVal >= 50 && maxVal < 100){incVal = 10}
  if (maxVal >= 100 && maxVal < 500){incVal = 50}
  if (maxVal >= 500 && maxVal < 1000){incVal = 100}
  seqRange <- seq(minVal-incVal,maxVal+incVal,by=incVal)
  
  #Apply range to dataset
  f <- paste0('wine.asis.df$',nc,'R<-cut(wine.asis.df$',nc, ',seq(', minVal-incVal,',', maxVal+incVal, ',by=', incVal, '))')
  eval(parse(text=f))
  
  #Assign range a category 1:max value and apply to dataset
  f <- paste0('wine.asis.df$',nc,'C<-cut(wine.asis.df$',nc, ',seq(', minVal-incVal,',', maxVal+incVal, ',by=', incVal, '), labels=c(1:', length(seqRange)-1,'))')
  eval(parse(text=f))
}

#Sample dataset
wine.asis.df %>% 
  dplyr::select (FixedAcidity, FixedAcidityR, FixedAcidityC, 
           ResidualSugar, ResidualSugarR, ResidualSugarC,
           TotalSulfurDioxide, TotalSulfurDioxideR, TotalSulfurDioxideC) %>% 
  filter(TotalSulfurDioxide < 0 & FixedAcidity < 0 & ResidualSugar <0) %>% 
  head(5) %>% 
  rbind(wine.asis.df %>% 
  dplyr::select (FixedAcidity, FixedAcidityR, FixedAcidityC, 
           ResidualSugar, ResidualSugarR, ResidualSugarC,
           TotalSulfurDioxide, TotalSulfurDioxideR, TotalSulfurDioxideC) %>% 
  filter(TotalSulfurDioxide > 0 & FixedAcidity > 0 & ResidualSugar >0) %>% head(5)) %>% 
  kable("html",caption = "Binning Variables With Negative Values", row.names = F, digits = 2) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "center", font_size = 12)

#Get all newly created categorical variables
wine.asis_c.df <- wine.asis.df %>% 
  dplyr::select(FixedAcidityC, VolatileAcidityC, CitricAcidC, ResidualSugarC, ChloridesC, FreeSulfurDioxideC, TotalSulfurDioxideC, Density, pH, SulphatesC, AlcoholC, AcidIndex)

#kNN imputation
wine.test.knn <- knnImputation(wine.asis_c.df, 30, meth='weighAvg')

wine.asis_c.df <- cbind(wine.test.knn, LabelAppeal = wine.df$LabelAppeal + 3, STARS = wine.df$STARS)
wine.asis_c.df$LabelAppeal <- factor(wine.asis_c.df$LabelAppeal, levels = c(1,2,3,4,5))
wine.asis_c.df$STARS <- factor(wine.asis_c.df$STARS, levels = c(1,2,3,4))

wine.cat.miss.stars.df <- knnImputation(wine.asis_c.df, 30, meth='weighAvg') 

wine.cat.miss.stars.df <- cbind(TARGET = wine.df$TARGET, wine.cat.miss.stars.df) #3rd dataset for the project

#Remove Observations that have missing STARS
wine.cat.nomiss.stars.df <- wine.cat.miss.stars.df[complete.cases(wine.asis_c.df), ] #4th dataset for the project

#Sample binning data before imputation
table(wine.asis.df$AlcoholR)
ggplot(wine.asis.df, aes(x=AlcoholR)) +  geom_bar() + labs(title="Binning Alcohol Content",x="Alcohol", y="Count")

#wine.miss.stars.df - abs all, missing stars included
#wine.nomiss.stars.df -abs all, missing stars excluded
#wine.cat.miss.stars.df - categorical all, missing stars included
#wine.cat.nomiss.stars.df -categorical all, missing stars excluded

#With Missing STARS
wine.mean.var <- c(mean = mean(wine.df$TARGET), variance = var(wine.df$TARGET), ratio = var(wine.df$TARGET)/mean(wine.df$TARGET))
wine.mean.var

#Without Missing STARS
wine.nomiss.mean.var <- c(mean = mean(wine.df$TARGET[complete.cases(wine.df$STARS)]), variance = var(wine.df$TARGET[complete.cases(wine.df$STARS)]), ratio = var(wine.df$TARGET[complete.cases(wine.df$STARS)])/mean(wine.df$TARGET[complete.cases(wine.df$STARS)]))
wine.nomiss.mean.var

#Poisson model building
#Stars imputed
wine.miss.stars.p <- glm(TARGET ~ ., data = wine.miss.stars.df, family = poisson)
summary(wine.miss.stars.p)

#Stars removed
wine.nomiss.stars.p <- glm(TARGET ~ ., data = wine.nomiss.stars.df, family = poisson)
summary(wine.nomiss.stars.p)

#Negative values converted to categories Stars included
wine.cat.miss.stars.p <- glm(TARGET ~ ., data = wine.cat.miss.stars.df, family = poisson)
summary(wine.cat.miss.stars.p)

#Negative values converted to categories Stars excluded
wine.cat.nomiss.stars.p <- glm(TARGET ~ ., data = wine.cat.nomiss.stars.df, family = poisson)
summary(wine.cat.nomiss.stars.p)

#Comparew stats of each model
wine.fit <- data.frame(`Model-1` = wine.miss.stars.p$deviance,
                       `Model-2` = wine.nomiss.stars.p$deviance,
                       `Model-3` = wine.cat.miss.stars.p$deviance,
                       `Model-4` = wine.cat.nomiss.stars.p$deviance
                       )

wine.fit <- rbind(wine.fit, data.frame(`Model-1` = wine.miss.stars.p$df.residual,
                       `Model-2` = wine.nomiss.stars.p$df.residual,
                       `Model-3` = wine.cat.miss.stars.p$df.residual,
                       `Model-4` = wine.cat.nomiss.stars.p$df.residual
                       ))

wine.fit <- rbind(wine.fit, data.frame(`Model-1` = (qchisq(0.95, wine.miss.stars.p$df.residual)),
                       `Model-2` = (qchisq(0.95, wine.nomiss.stars.p$df.residual)),
                       `Model-3` = (qchisq(0.95, wine.cat.miss.stars.p$df.residual)),
                       `Model-4` = (qchisq(0.95, wine.cat.nomiss.stars.p$df.residual))
                       ))

wine.fit <- rbind(wine.fit, data.frame(`Model-1` = (1-pchisq(wine.miss.stars.p$deviance, wine.miss.stars.p$df.residual)),
                       `Model-2` = (1-pchisq(wine.nomiss.stars.p$deviance, wine.nomiss.stars.p$df.residual)),
                       `Model-3` = (1-pchisq(wine.cat.miss.stars.p$deviance, wine.cat.miss.stars.p$df.residual)),
                       `Model-4` = (1-pchisq(wine.cat.nomiss.stars.p$deviance, wine.cat.nomiss.stars.p$df.residual))
                       ))

wine.fit <- rbind(wine.fit, data.frame(`Model-1` = sum(residuals(wine.miss.stars.p, 'pearson')^2),
                       `Model-2` = sum(residuals(wine.nomiss.stars.p, 'pearson')^2),
                       `Model-3` = sum(residuals(wine.cat.miss.stars.p, 'pearson')^2),
                       `Model-4` = sum(residuals(wine.cat.nomiss.stars.p, 'pearson')^2)
                       ))

wine.fit <- rbind(wine.fit, data.frame(`Model-1` = sum(residuals(wine.miss.stars.p, 'pearson')^2)/wine.miss.stars.p$df.residual,
                       `Model-2` = sum(residuals(wine.nomiss.stars.p, 'pearson')^2)/wine.nomiss.stars.p$df.residual,
                       `Model-3` = sum(residuals(wine.cat.miss.stars.p, 'pearson')^2)/wine.cat.miss.stars.p$df.residual,
                       `Model-4` = sum(residuals(wine.cat.nomiss.stars.p, 'pearson')^2)/wine.cat.nomiss.stars.p$df.residual
                       ))

wine.fit <- rbind(wine.fit, data.frame(`Model-1` = sqrt(sum(residuals(wine.miss.stars.p, 'pearson')^2)/wine.miss.stars.p$df.residual),
                       `Model-2` = sqrt(sum(residuals(wine.nomiss.stars.p, 'pearson')^2)/wine.nomiss.stars.p$df.residual),
                       `Model-3` = sqrt(sum(residuals(wine.cat.miss.stars.p, 'pearson')^2)/wine.cat.miss.stars.p$df.residual),
                       `Model-4` = sqrt(sum(residuals(wine.cat.nomiss.stars.p, 'pearson')^2)/wine.cat.nomiss.stars.p$df.residual)
                       ))

row.names(wine.fit) <- c('deviance', 'df', 'chisq', 'p-value', 'pearson\'s chi-sq', 'proportion', 'adj.factor phi')

wine.fit%>% 
  kable("html",caption = "Models Stats", row.names = T, digits = 4) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "center", font_size = 12)
  
#Poisson model with no significance variables removed 
#Stars imputed
wine.miss.stars.p2 <- update(wine.miss.stars.p, .~.-FixedAcidity-ResidualSugar)
summary(wine.miss.stars.p2)

#Poisson model with no significance variables removed 
#Stars removed
#wine.nomiss.stars.p2 <- update(wine.nomiss.stars.p, .~.-FixedAcidity -CitricAcid -ResidualSugar -Chlorides -FreeSulfurDioxide -TotalSulfurDioxide -Density -pH -Sulphates)
wine.nomiss.stars.p2 <- update(wine.nomiss.stars.p, .~.-FixedAcidity -CitricAcid -ResidualSugar -TotalSulfurDioxide -Density -pH -Sulphates)
summary(wine.nomiss.stars.p2)

#Poisson model with no significance variables removed 
#Negative values converted to categories Stars included
wine.cat.miss.stars.p2 <- glm(TARGET ~ Density + pH + AcidIndex + LabelAppeal + STARS, data = wine.cat.miss.stars.df, family = poisson)
summary(wine.cat.miss.stars.p2)

#Poisson model with no significance variables removed 
#Negative values converted to categories Stars excluded
wine.cat.nomiss.stars.p2 <-glm(TARGET ~ AcidIndex + LabelAppeal + STARS, data = wine.cat.nomiss.stars.df, family = poisson)
summary(wine.cat.nomiss.stars.p2)

#Stats of models without non signifiance variables
wine.fit <- data.frame(`Model-1` =  wine.miss.stars.p2$deviance,
                       `Model-2` =  wine.nomiss.stars.p2$deviance,
                       `Model-3` =  wine.cat.miss.stars.p2$deviance,
                       `Model-4` =  wine.cat.nomiss.stars.p2$deviance
                       )

wine.fit <- rbind(wine.fit, data.frame(`Model-1` =  wine.miss.stars.p2$df.residual,
                       `Model-2` =  wine.nomiss.stars.p2$df.residual,
                       `Model-3` =  wine.cat.miss.stars.p2$df.residual,
                       `Model-4` =  wine.cat.nomiss.stars.p2$df.residual
                       ))

wine.fit <- rbind(wine.fit, data.frame(`Model-1` =  (qchisq(0.95, wine.miss.stars.p2$df.residual)),
                       `Model-2` =  (qchisq(0.95, wine.nomiss.stars.p2$df.residual)),
                       `Model-3` =  (qchisq(0.95, wine.cat.miss.stars.p2$df.residual)),
                       `Model-4` =  (qchisq(0.95, wine.cat.nomiss.stars.p2$df.residual))
                       ))

wine.fit <- rbind(wine.fit, data.frame(`Model-1` =  (1-pchisq(wine.miss.stars.p2$deviance, wine.miss.stars.p2$df.residual)),
                       `Model-2` =  (1-pchisq(wine.nomiss.stars.p2$deviance, wine.nomiss.stars.p2$df.residual)),
                       `Model-3` =  (1-pchisq(wine.cat.miss.stars.p2$deviance, wine.cat.miss.stars.p2$df.residual)),
                       `Model-4` =  (1-pchisq(wine.cat.nomiss.stars.p2$deviance, wine.cat.nomiss.stars.p2$df.residual))
                       ))

wine.fit <- rbind(wine.fit, data.frame(`Model-1` =  sum(residuals(wine.miss.stars.p2, 'pearson')^2),
                       `Model-2` =  sum(residuals(wine.nomiss.stars.p2, 'pearson')^2),
                       `Model-3` =  sum(residuals(wine.cat.miss.stars.p2, 'pearson')^2),
                       `Model-4` =  sum(residuals(wine.cat.nomiss.stars.p2, 'pearson')^2)
                       ))

wine.fit <- rbind(wine.fit, data.frame(`Model-1` =  sum(residuals(wine.miss.stars.p2, 'pearson')^2)/wine.miss.stars.p2$df.residual,
                       `Model-2` =  sum(residuals(wine.nomiss.stars.p2, 'pearson')^2)/wine.nomiss.stars.p2$df.residual,
                       `Model-3` =  sum(residuals(wine.cat.miss.stars.p2, 'pearson')^2)/wine.cat.miss.stars.p2$df.residual,
                       `Model-4` =  sum(residuals(wine.cat.nomiss.stars.p2, 'pearson')^2)/wine.cat.nomiss.stars.p2$df.residual
                       ))

wine.fit <- rbind(wine.fit, data.frame(`Model-1` =  sqrt(sum(residuals(wine.miss.stars.p2, 'pearson')^2)/wine.miss.stars.p2$df.residual),
                       `Model-2` =  sqrt(sum(residuals(wine.nomiss.stars.p2, 'pearson')^2)/wine.nomiss.stars.p2$df.residual),
                       `Model-3` =  sqrt(sum(residuals(wine.cat.miss.stars.p2, 'pearson')^2)/wine.cat.miss.stars.p2$df.residual),
                       `Model-4` =  sqrt(sum(residuals(wine.cat.nomiss.stars.p2, 'pearson')^2)/wine.cat.nomiss.stars.p2$df.residual)
                       ))

row.names(wine.fit) <- c('deviance', 'df', 'chisq', 'p-value', 'pearson\'s chi-sq', 'proportion', 'adj.factor phi')

wine.fit%>% 
  kable("html",caption = "Models Stats", row.names = T, digits = 4) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "center", font_size = 12)

#quasipoisson model building
#Stars imputed
wine.miss.stars.q <- glm(TARGET ~ ., data = wine.miss.stars.df, family = quasipoisson)
summary(wine.miss.stars.q)

#Stars removed
wine.nomiss.stars.q <- glm(TARGET ~ ., data = wine.nomiss.stars.df, family = quasipoisson)
summary(wine.nomiss.stars.q)

#Negative values converted to categories Stars included
wine.cat.miss.stars.q <- glm(TARGET ~ ., data = wine.cat.miss.stars.df, family = quasipoisson)
summary(wine.cat.miss.stars.q)

#Negative values converted to categories Stars excluded
wine.cat.nomiss.stars.q <- glm(TARGET ~ ., data = wine.cat.nomiss.stars.df, family = quasipoisson)
summary(wine.cat.nomiss.stars.q)

#Stats comparision between Poission and QuasiPoisson
c.df <- data.frame(Poisson.Inc = coef(wine.miss.stars.p), 
                   Quasi.Inc = coef(wine.miss.stars.q), 
                   Poisson.SE = se.coef(wine.miss.stars.p), 
                   Quasi.SE = se.coef(wine.miss.stars.q), 
                   Poisson.pVal = coef(summary(wine.miss.stars.p))[,'Pr(>|z|)'], 
                   Poisson.Sig = ifelse(coef(summary(wine.miss.stars.p))[,'Pr(>|z|)']<=0.05,'*',''),
                   Quasi.pVal = coef(summary(wine.miss.stars.q))[,'Pr(>|t|)'], 
                   Quasi.Sig = ifelse(coef(summary(wine.miss.stars.q))[,'Pr(>|t|)']<=0.05,'*',''),
                   ratio = se.coef(wine.miss.stars.q)/se.coef(wine.miss.stars.p)
                   )

c.df%>% 
  kable("html",caption = "Poission Vs QuasiPoisson - Model-1", row.names = T, digits = 4) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "center", font_size = 12)

c.df <- data.frame(Poisson.Inc = coef(wine.nomiss.stars.p), 
                   Quasi.Inc = coef(wine.nomiss.stars.q), 
                   Poisson.SE = se.coef(wine.nomiss.stars.p), 
                   Quasi.SE = se.coef(wine.nomiss.stars.q), 
                   Poisson.pVal = coef(summary(wine.nomiss.stars.p))[,'Pr(>|z|)'], 
                   Poisson.Sig = ifelse(coef(summary(wine.nomiss.stars.p))[,'Pr(>|z|)']<=0.05,'*',''),
                   Quasi.pVal = coef(summary(wine.nomiss.stars.q))[,'Pr(>|t|)'], 
                   Quasi.Sig = ifelse(coef(summary(wine.nomiss.stars.q))[,'Pr(>|t|)']<=0.05,'*',''),
                   ratio = se.coef(wine.nomiss.stars.q)/se.coef(wine.nomiss.stars.p)
                   )

c.df%>% 
  kable("html",caption = "Poission Vs QuasiPoisson - Model-2", row.names = T, digits = 4) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "center", font_size = 12)

c.df <- data.frame(Poisson.Inc = coef(wine.cat.miss.stars.p), 
                   Quasi.Inc = coef(wine.cat.miss.stars.q), 
                   Poisson.SE = se.coef(wine.cat.miss.stars.p), 
                   Quasi.SE = se.coef(wine.cat.miss.stars.q), 
                   Poisson.pVal = coef(summary(wine.cat.miss.stars.p))[,'Pr(>|z|)'], 
                   Poisson.Sig = ifelse(coef(summary(wine.cat.miss.stars.p))[,'Pr(>|z|)']<=0.05,'*',''),
                   Quasi.pVal = coef(summary(wine.cat.miss.stars.q))[,'Pr(>|t|)'], 
                   Quasi.Sig = ifelse(coef(summary(wine.cat.miss.stars.q))[,'Pr(>|t|)']<=0.05,'*',''),
                   ratio = se.coef(wine.cat.miss.stars.q)/se.coef(wine.cat.miss.stars.p)
                   )

c.df%>% 
  kable("html",caption = "Poission Vs QuasiPoisson - Model-3", row.names = T, digits = 4) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "center", font_size = 12)

c.df <- data.frame(Poisson.Inc = coef(wine.cat.nomiss.stars.p), 
                   Quasi.Inc = coef(wine.cat.nomiss.stars.q), 
                   Poisson.SE = se.coef(wine.cat.nomiss.stars.p), 
                   Quasi.SE = se.coef(wine.cat.nomiss.stars.q), 
                   Poisson.pVal = coef(summary(wine.cat.nomiss.stars.p))[,'Pr(>|z|)'], 
                   Poisson.Sig = ifelse(coef(summary(wine.cat.nomiss.stars.p))[,'Pr(>|z|)']<=0.05,'*',''), 
                   Quasi.pVal = coef(summary(wine.cat.nomiss.stars.q))[,'Pr(>|t|)'], 
                   Quasi.Sig = ifelse(coef(summary(wine.cat.nomiss.stars.q))[,'Pr(>|t|)']<=0.05,'*',''),
                   ratio = se.coef(wine.cat.nomiss.stars.q)/se.coef(wine.cat.nomiss.stars.p)
                   )

c.df%>% 
  kable("html",caption = "Poission Vs QuasiPoisson - Model-4", row.names = T, digits = 4) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "center", font_size = 12)

#QuasiPoisson model with no significance variables removed 
#Stars imputed
wine.miss.stars.q2 <- update(wine.miss.stars.q, .~.-FixedAcidity-ResidualSugar)
summary(wine.miss.stars.q2)

#QuasiPoisson model with no significance variables removed 
#Stars removed
wine.nomiss.stars.q2 <- update(wine.nomiss.stars.q, .~.-FixedAcidity -CitricAcid -ResidualSugar -TotalSulfurDioxide -Density -pH -Sulphates)
summary(wine.nomiss.stars.q2)

#QuasiPoisson model with no significance variables removed 
#Negative values converted to categories Stars included
wine.cat.miss.stars.q2 <- glm(TARGET ~ Density + pH + AcidIndex + LabelAppeal + STARS, data = wine.cat.miss.stars.df, family = quasipoisson)
summary(wine.cat.miss.stars.q2)

#QuasiPoisson model with no significance variables removed 
#Negative values converted to categories Stars excluded
wine.cat.nomiss.stars.q2 <-glm(TARGET ~ AcidIndex + LabelAppeal + STARS, data = wine.cat.miss.stars.df, family = quasipoisson)
summary(wine.cat.nomiss.stars.q2)

#Comparision without non significant variables
#Stats comparision between Poission and QuasiPoisson
c.df <- data.frame(Poisson.Inc = coef(wine.miss.stars.p2), 
                   Quasi.Inc = coef(wine.miss.stars.q2), 
                   Poisson.SE = se.coef(wine.miss.stars.p2), 
                   Quasi.SE = se.coef(wine.miss.stars.q2), 
                   Poisson.Sig = coef(summary(wine.miss.stars.p2))[,'Pr(>|z|)'], 
                   Quasi.Sig = coef(summary(wine.miss.stars.q2))[,'Pr(>|t|)'], 
                   ratio = se.coef(wine.miss.stars.q2)/se.coef(wine.miss.stars.p2))

c.df%>% 
  kable("html",caption = "Poission Vs QuasiPoisson - Model-1", row.names = T, digits = 4) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "center", font_size = 12)

c.df <- round(data.frame(Poisson.Inc = coef(wine.nomiss.stars.p2), Quasi.Inc = coef(wine.nomiss.stars.q2), Poisson.SE = se.coef(wine.nomiss.stars.p2), Quasi.SE = se.coef(wine.nomiss.stars.q2), Poisson.Sig = coef(summary(wine.nomiss.stars.p2))[,'Pr(>|z|)'], Quasi.Sig = coef(summary(wine.nomiss.stars.q2))[,'Pr(>|t|)'], ratio = se.coef(wine.nomiss.stars.q2)/se.coef(wine.nomiss.stars.p2)),4)

c.df%>% 
  kable("html",caption = "Poission Vs QuasiPoisson - Model-2", row.names = T, digits = 4) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "center", font_size = 12)


c.df <- round(data.frame(Poisson.Inc = coef(wine.cat.miss.stars.p2), Quasi.Inc = coef(wine.cat.miss.stars.q2), Poisson.SE = se.coef(wine.cat.miss.stars.p2), Quasi.SE = se.coef(wine.cat.miss.stars.q2), Poisson.Sig = coef(summary(wine.cat.miss.stars.p2))[,'Pr(>|z|)'], Quasi.Sig = coef(summary(wine.cat.miss.stars.q2))[,'Pr(>|t|)'], ratio = se.coef(wine.cat.miss.stars.q2)/se.coef(wine.cat.miss.stars.p2)),4)

c.df%>% 
  kable("html",caption = "Poission Vs QuasiPoisson - Model-3", row.names = T, digits = 4) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "center", font_size = 12)

c.df <- round(data.frame(Poisson.Inc = coef(wine.cat.nomiss.stars.p2), Quasi.Inc = coef(wine.cat.nomiss.stars.q2), Poisson.SE = se.coef(wine.cat.nomiss.stars.p2), Quasi.SE = se.coef(wine.cat.nomiss.stars.q2), Poisson.Sig = coef(summary(wine.cat.nomiss.stars.p2))[,'Pr(>|z|)'], Quasi.Sig = coef(summary(wine.cat.nomiss.stars.q2))[,'Pr(>|t|)'], ratio = se.coef(wine.cat.nomiss.stars.q2)/se.coef(wine.cat.nomiss.stars.p2)),4)

c.df%>% 
  kable("html",caption = "Poission Vs QuasiPoisson - Model-4", row.names = T, digits = 4) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "center", font_size = 12)

#Negative Binomial
#Missing Stars included
wine.miss.stars.nb <- glm.nb(TARGET ~ ., data = wine.miss.stars.df)
summary(wine.miss.stars.nb)

#Negative Binomial
#Missing stars removed
wine.nomiss.stars.nb <- glm.nb(TARGET ~ ., data = wine.nomiss.stars.df)
summary(wine.nomiss.stars.nb)

#Negative Binomial
#Negative values converted to categories Stars included
wine.cat.miss.stars.nb <- glm.nb(TARGET ~ ., data = wine.cat.miss.stars.df)
summary(wine.cat.miss.stars.nb)

#Negative Binomial
#Negative values converted to categories Stars excluded
wine.cat.nomiss.stars.nb <- glm.nb(TARGET ~ ., data = wine.cat.nomiss.stars.df)
summary(wine.cat.nomiss.stars.nb)

#Comparision without non significant variables
#Stats comparision between Poission and QuasiPoisson and Negative Binomial
c.df <- data.frame(Poisson.Inc = coef(wine.miss.stars.p),
               Quasi.Inc = coef(wine.miss.stars.q), 
               Neg.Binom.Inc = coef(wine.miss.stars.nb), 
               Poisson.SE = se.coef(wine.miss.stars.p), 
               Quasi.SE = se.coef(wine.miss.stars.q), 
               Neg.Binom.SE = coef(summary(wine.miss.stars.nb))[,'Std. Error'], 
               Poisson.pVal = coef(summary(wine.miss.stars.p))[,'Pr(>|z|)'], 
               Poisson.Sig = ifelse(coef(summary(wine.miss.stars.p))[,'Pr(>|z|)']<= 0.05,'*',''), 
               Quasi.pVal = coef(summary(wine.miss.stars.q))[,'Pr(>|t|)'], 
               Quasi.Sig = ifelse(coef(summary(wine.miss.stars.q))[,'Pr(>|t|)']<= 0.05,'*',''), 
               Neg.Binom.pVal = coef(summary(wine.miss.stars.nb))[,'Pr(>|z|)'],
               Neg.Binom.Sig = ifelse(coef(summary(wine.miss.stars.nb))[,'Pr(>|z|)']<= 0.05,'*','')
               )

c.df%>% 
  kable("html",caption = "Poission Vs QuasiPoisson Vs Negative Binomial - Model-1", row.names = T, digits = 4) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "center", font_size = 12)
 
#Comparision without non significant variables
#Stats comparision between Poission and QuasiPoisson and Negative Binomial
c.df <- data.frame(Poisson.Inc = coef(wine.nomiss.stars.p), 
                   Quasi.Inc = coef(wine.nomiss.stars.q), 
                   Neg.Binom.Inc = coef(wine.nomiss.stars.nb), 
                   Poisson.SE = se.coef(wine.nomiss.stars.p), 
                   Quasi.SE = se.coef(wine.nomiss.stars.q), 
                   Neg.Binom.SE = coef(summary(wine.nomiss.stars.nb))[,'Std. Error'], 
                   Poisson.pVal = coef(summary(wine.nomiss.stars.p))[,'Pr(>|z|)'], 
                   Poisson.Sig = ifelse(coef(summary(wine.nomiss.stars.p))[,'Pr(>|z|)']<= 0.05,'*',''),
                   Quasi.pVal = coef(summary(wine.nomiss.stars.q))[,'Pr(>|t|)'], 
                   Quasi.Sig = ifelse(coef(summary(wine.nomiss.stars.q))[,'Pr(>|t|)']<= 0.05,'*',''),
                   Neg.Binom.pVal = coef(summary(wine.nomiss.stars.nb))[,'Pr(>|z|)'],
                   Neg.Binom.Sig = ifelse(coef(summary(wine.nomiss.stars.nb))[,'Pr(>|z|)']<= 0.05,'*','')
                   )

c.df%>% 
  kable("html",caption = "Poission Vs QuasiPoisson Vs Negative Binomial - Model-2", row.names = T, digits = 4) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "center", font_size = 12)
 
#Comparision without non significant variables
#Stats comparision between Poission and QuasiPoisson and Negative Binomial
c.df <- data.frame(Poisson.Inc = coef(wine.cat.miss.stars.p), 
                   Quasi.Inc = coef(wine.cat.miss.stars.q), 
                   Neg.Binom.Inc = coef(wine.cat.miss.stars.nb), 
                   Poisson.SE = se.coef(wine.cat.miss.stars.p), 
                   Quasi.SE = se.coef(wine.cat.miss.stars.q), 
                   Neg.Binom.SE = round(coef(summary(wine.cat.miss.stars.nb))[,'Std. Error'],4), 
                   Poisson.pVal = coef(summary(wine.cat.miss.stars.p))[,'Pr(>|z|)'], 
                   Poisson.Sig = ifelse(coef(summary(wine.cat.miss.stars.p))[,'Pr(>|z|)']<=0.05,'*',''), 
                   Quasi.pVal = coef(summary(wine.cat.miss.stars.q))[,'Pr(>|t|)'], 
                   Quasi.Sig = ifelse(coef(summary(wine.cat.miss.stars.q))[,'Pr(>|t|)']<=0.05,'*',''), 
                   Neg.Binom.pVal = coef(summary(wine.cat.miss.stars.nb))[,'Pr(>|z|)'],
                   Neg.Binom.Sig = ifelse(coef(summary(wine.cat.miss.stars.nb))[,'Pr(>|z|)']<=0.05,'*','')
                   )

c.df%>% 
  kable("html",caption = "Poission Vs QuasiPoisson Vs Negative Binomial - Model-3", row.names = T, digits = 4) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "center", font_size = 12)
 
#Comparision without non significant variables
#Stats comparision between Poission and QuasiPoisson and Negative Binomial
c.df <- data.frame(Poisson.Inc = coef(wine.cat.nomiss.stars.p), 
                   Quasi.Inc = coef(wine.cat.nomiss.stars.q), 
                   Neg.Binom.Inc = coef(wine.cat.nomiss.stars.nb), 
                   Poisson.SE = se.coef(wine.cat.nomiss.stars.p), 
                   Quasi.SE = se.coef(wine.cat.nomiss.stars.q), 
                   Neg.Binom.SE = coef(summary(wine.cat.nomiss.stars.nb))[,'Std. Error'], 
                   Poisson.pVal = coef(summary(wine.cat.nomiss.stars.p))[,'Pr(>|z|)'], 
                   Poisson.Sig = ifelse(coef(summary(wine.cat.nomiss.stars.p))[,'Pr(>|z|)']<=0.05,'*',''), 
                   Quasi.pVal = coef(summary(wine.cat.nomiss.stars.q))[,'Pr(>|t|)'], 
                   Quasi.Sig = ifelse(coef(summary(wine.cat.nomiss.stars.q))[,'Pr(>|t|)']<=0.05,'*',''), 
                   Neg.Binom.pVal = coef(summary(wine.cat.nomiss.stars.nb))[,'Pr(>|z|)'],
                   Neg.Binom.Sig = ifelse(coef(summary(wine.cat.nomiss.stars.nb))[,'Pr(>|z|)']<=0.05,'*','')
                   )

c.df%>% 
  kable("html",caption = "Poission Vs QuasiPoisson Vs Negative Binomial - Model-4", row.names = T, digits = 4) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "center", font_size = 12)
 
#Negative Binomial
wine.miss.stars.nb2 <- update(wine.miss.stars.nb, .~. -FixedAcidity -ResidualSugar)
summary(wine.miss.stars.nb2)

#Negative Binomial
#Stars imputed
#wine.nomiss.stars.nb2 <- update(wine.nomiss.stars.nb, .~.-FixedAcidity -CitricAcid -ResidualSugar -Chlorides -FreeSulfurDioxide -TotalSulfurDioxide -Density -pH -Sulphates)
wine.nomiss.stars.nb2 <- update(wine.nomiss.stars.nb, .~. -FixedAcidity -CitricAcid -ResidualSugar -TotalSulfurDioxide -Density -pH -Sulphates)
summary(wine.nomiss.stars.nb2)

#Negative Binomial
#Stars removed
wine.cat.miss.stars.nb2 <- glm.nb(TARGET ~ Density + pH + AcidIndex + LabelAppeal + STARS, data = wine.cat.miss.stars.df)
summary(wine.cat.miss.stars.nb2)

#Negative Binomial
#Negative values converted to categories Stars included
wine.cat.nomiss.stars.nb2 <-glm.nb(TARGET ~ AcidIndex + LabelAppeal + STARS, data = wine.cat.nomiss.stars.df)
summary(wine.cat.nomiss.stars.nb2)

#Comparision without non significant variables
#Stats comparision between Poission and QuasiPoisson and Negative Binomial
c.df <- round(data.frame(Poisson.Inc = coef(wine.miss.stars.p2), Quasi.Inc = coef(wine.miss.stars.q2), Neg.Binom.Inc = coef(wine.miss.stars.nb2), Poisson.SE = se.coef(wine.miss.stars.p2), Quasi.SE = se.coef(wine.miss.stars.q2), Neg.Binom.SE = coef(summary(wine.miss.stars.nb2))[,'Std. Error'], Poisson.Sig = coef(summary(wine.miss.stars.p2))[,'Pr(>|z|)'], Quasi.Sig = coef(summary(wine.miss.stars.q2))[,'Pr(>|t|)'], Neg.Binom.Sig = coef(summary(wine.miss.stars.nb2))[,'Pr(>|z|)']),4)

c.df%>% 
  kable("html",caption = "Poission Vs QuasiPoisson Vs Negative Binomial - Model-1", row.names = T, digits = 4) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "center", font_size = 12)

#Comparision without non significant variables
#Stats comparision between Poission and QuasiPoisson and Negative Binomial
c.df <- round(data.frame(Poisson.Inc = coef(wine.nomiss.stars.p2), Quasi.Inc = coef(wine.nomiss.stars.q2), Neg.Binom.Inc = coef(wine.nomiss.stars.nb2), Poisson.SE = se.coef(wine.nomiss.stars.p2), Quasi.SE = se.coef(wine.nomiss.stars.q2), Neg.Binom.SE = coef(summary(wine.nomiss.stars.nb2))[,'Std. Error'], Poisson.Sig = coef(summary(wine.nomiss.stars.p2))[,'Pr(>|z|)'], Quasi.Sig = coef(summary(wine.nomiss.stars.q2))[,'Pr(>|t|)'], Neg.Binom.Sig = coef(summary(wine.nomiss.stars.nb2))[,'Pr(>|z|)']),4)

c.df%>% 
  kable("html",caption = "Poission Vs QuasiPoisson Vs Negative Binomial - Model-2", row.names = T, digits = 4) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "center", font_size = 12)

#Comparision without non significant variables
#Stats comparision between Poission and QuasiPoisson and Negative Binomial
c.df <- round(data.frame(Poisson.Inc = coef(wine.cat.miss.stars.p2), Quasi.Inc = coef(wine.cat.miss.stars.q2), Neg.Binom.Inc = coef(wine.cat.miss.stars.nb2), Poisson.SE = se.coef(wine.cat.miss.stars.p2), Quasi.SE = se.coef(wine.cat.miss.stars.q2), Neg.Binom.SE = coef(summary(wine.cat.miss.stars.nb2))[,'Std. Error'], Poisson.Sig = coef(summary(wine.cat.miss.stars.p2))[,'Pr(>|z|)'], Quasi.Sig = coef(summary(wine.cat.miss.stars.q2))[,'Pr(>|t|)'], Neg.Binom.Sig = coef(summary(wine.cat.miss.stars.nb2))[,'Pr(>|z|)']),4)

c.df%>% 
  kable("html",caption = "Poission Vs QuasiPoisson Vs Negative Binomial - Model-3", row.names = T, digits = 4) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "center", font_size = 12)

#Comparision without non significant variables
#Stats comparision between Poission and QuasiPoisson and Negative Binomial
c.df <- round(data.frame(Poisson.Inc = coef(wine.cat.nomiss.stars.p2), Quasi.Inc = coef(wine.cat.nomiss.stars.q2), Neg.Binom.Inc = coef(wine.cat.nomiss.stars.nb2), Poisson.SE = se.coef(wine.cat.nomiss.stars.p2), Quasi.SE = se.coef(wine.cat.nomiss.stars.q2), Neg.Binom.SE = coef(summary(wine.cat.nomiss.stars.nb2))[,'Std. Error'], Poisson.Sig = coef(summary(wine.cat.nomiss.stars.p2))[,'Pr(>|z|)'], Quasi.Sig = coef(summary(wine.cat.nomiss.stars.q2))[,'Pr(>|t|)'], Neg.Binom.Sig = coef(summary(wine.cat.nomiss.stars.nb2))[,'Pr(>|z|)']),4)

c.df%>% 
  kable("html",caption = "Poission Vs QuasiPoisson Vs Negative Binomial - Model-4", row.names = T, digits = 4) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "center", font_size = 12)

#Stats of nb models 
wine.fit <- data.frame(`Model-1` = wine.miss.stars.nb$deviance,
                       `Model-2` = wine.nomiss.stars.nb$deviance,
                       `Model-3` = wine.cat.miss.stars.nb$deviance,
                       `Model-4` = wine.cat.nomiss.stars.nb$deviance
                       )

wine.fit <- rbind(wine.fit, data.frame(`Model-1` = wine.miss.stars.nb$df.residual,
                       `Model-2` = wine.nomiss.stars.nb$df.residual,
                       `Model-3` = wine.cat.miss.stars.nb$df.residual,
                       `Model-4` = wine.cat.nomiss.stars.nb$df.residual
                       ))

wine.fit <- rbind(wine.fit, data.frame(`Model-1` = (qchisq(0.95, wine.miss.stars.nb$df.residual)),
                       `Model-2` = (qchisq(0.95, wine.nomiss.stars.nb$df.residual)),
                       `Model-3` = (qchisq(0.95, wine.cat.miss.stars.nb$df.residual)),
                       `Model-4` = (qchisq(0.95, wine.cat.nomiss.stars.nb$df.residual))
                       ))

wine.fit <- rbind(wine.fit, data.frame(`Model-1` = (1-pchisq(wine.miss.stars.nb$deviance, wine.miss.stars.nb$df.residual)),
                       `Model-2` = (1-pchisq(wine.nomiss.stars.nb$deviance, wine.nomiss.stars.nb$df.residual)),
                       `Model-3` = (1-pchisq(wine.cat.miss.stars.nb$deviance, wine.cat.miss.stars.nb$df.residual)),
                       `Model-4` = (1-pchisq(wine.cat.nomiss.stars.nb$deviance, wine.cat.nomiss.stars.nb$df.residual))
                       ))

wine.fit <- rbind(wine.fit, data.frame(`Model-1` =  wine.miss.stars.nb$theta,
                       `Model-2` = wine.nomiss.stars.nb$theta,
                       `Model-3` = wine.cat.miss.stars.nb$theta,
                       `Model-4` = wine.cat.nomiss.stars.nb$theta
                       ))

wine.fit <- rbind(wine.fit, data.frame(`Model-1` = 1/wine.miss.stars.nb$theta,
                       `Model-2` = 1/wine.nomiss.stars.nb$theta,
                       `Model-3` = 1/wine.cat.miss.stars.nb$theta,
                       `Model-4` = 1/wine.cat.nomiss.stars.nb$theta
                       ))
row.names(wine.fit) <- c('deviance', 'df', 'chisq', 'p-value', 'theta', 'proportion')

wine.fit%>% 
  kable("html",caption = "Models Stats - Negative Binomial", row.names = T, digits = 4) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "center", font_size = 12)

par(mfrow=c(2, 2),oma = c(0, 0, 2, 0))
countreg::rootogram(wine.miss.stars.p, main = "Model-1")
countreg::rootogram(wine.nomiss.stars.p, main = "Model-2")
countreg::rootogram(wine.cat.miss.stars.p, main = "Model-3")
countreg::rootogram(wine.cat.nomiss.stars.p, main = "Model-4")
mtext('RootOGram - Poission Regression', outer = TRUE, cex = 1.5)

par(mfrow=c(2, 2),oma = c(0, 0, 2, 0))
countreg::rootogram(wine.miss.stars.nb, main = "Model-1")
countreg::rootogram(wine.nomiss.stars.nb, main = "Model-2")
countreg::rootogram(wine.cat.miss.stars.nb, main = "Model-3")
countreg::rootogram(wine.cat.nomiss.stars.nb, main = "Model-4")
mtext('RootOGram - Negative Binomial Regression', outer = TRUE, cex = 1.5)

#Zero Inflated model
#Missing stars included
wine.miss.stars.zip <- zeroinfl(TARGET ~ ., data = wine.miss.stars.df)
summary(wine.miss.stars.zip)

#Zero Inflated model
#Missing stars excluded
wine.nomiss.stars.zip <- zeroinfl(TARGET ~ ., data = wine.nomiss.stars.df)
summary(wine.nomiss.stars.zip)

#Zero Inflated model
#Negative values converted to categories Stars included
wine.cat.miss.stars.zip <- zeroinfl(TARGET ~ Density + pH + AcidIndex + LabelAppeal + STARS, data = wine.cat.miss.stars.df)
summary(wine.cat.miss.stars.zip)

#Zero Inflated model
#Negative values converted to categories Stars excluded
wine.cat.nomiss.stars.zip <- zeroinfl(TARGET ~ AcidIndex + LabelAppeal + STARS, data = wine.cat.nomiss.stars.df)
summary(wine.cat.nomiss.stars.zip)

#Zero Inflated model
#Missing stars included
wine.miss.stars.zip <- zeroinfl(TARGET ~ ., data = wine.miss.stars.df)
summary(wine.miss.stars.zip)

#Zero Inflated model
#Missing stars excluded
wine.nomiss.stars.zip <- zeroinfl(TARGET ~ ., data = wine.nomiss.stars.df)
summary(wine.nomiss.stars.zip)

#Zero Inflated model
#Negative values converted to categories Stars included
wine.cat.miss.stars.zip <- zeroinfl(TARGET ~ Density + pH + AcidIndex + LabelAppeal + STARS, data = wine.cat.miss.stars.df)
summary(wine.cat.miss.stars.zip)

#Zero Inflated model
#Negative values converted to categories Stars excluded
wine.cat.nomiss.stars.zip <- zeroinfl(TARGET ~ AcidIndex + LabelAppeal + STARS, data = wine.cat.nomiss.stars.df)
summary(wine.cat.nomiss.stars.zip)

#Comparision without non significant variables
#Stats comparision between Poission and QuasiPoisson and Negative Binomial
c.df <- data.frame(Coefficient = summary(wine.miss.stars.zip)$coef$count[,1],
                   SE = summary(wine.miss.stars.zip)$coef$count[,2],
                   pVal = summary(wine.miss.stars.zip)$coef$count[,4],
                   Sig = ifelse(summary(wine.miss.stars.zip)$coef$count[,4]<=0.05,'*',''),
                   Exponentiated.coeff = exp(summary(wine.miss.stars.zip)$coef$count[,1])
                   )

c.df%>% 
  kable("html",caption = "Zero-Inflated Count Coefficients - Model-1", row.names = T, digits = 4) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "center", font_size = 12)

#Comparision without non significant variables
#Stats comparision between Poission and QuasiPoisson and Negative Binomial
c.df <- data.frame(Coefficient = summary(wine.miss.stars.zip)$coef$zero[,1], 
                   SE = summary(wine.miss.stars.zip)$coef$zero[,2],
                   pVal = summary(wine.miss.stars.zip)$coef$zero[,4],
                   Sig = ifelse(summary(wine.miss.stars.zip)$coef$zero[,4]<=0.05,'*',''),
                   `Odds Ratio` = exp(summary(wine.miss.stars.zip)$coef$zero[,1])
                   )
c.df%>% 
  kable("html",caption = "Zero-Inflated Logit Coefficients - Model-1", row.names = T, digits = 4) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "center", font_size = 12)

#Stats of nb models 
wine.fit <- data.frame(`Model-1` = AIC(wine.miss.stars.zip),
                       `Model-2` = AIC(wine.nomiss.stars.zip),
                       `Model-3` = AIC(wine.cat.miss.stars.zip),
                       `Model-4` = AIC(wine.cat.nomiss.stars.zip)
                       )

wine.fit <- rbind(wine.fit, data.frame(`Model-1` = mean(predict(wine.miss.stars.zip, type = "zero") + (1 - predict(wine.miss.stars.zip, type = "zero")) * exp(-predict(wine.miss.stars.zip, type = "count"))),
                       `Model-2` = mean(predict(wine.nomiss.stars.zip, type = "zero") + (1 - predict(wine.nomiss.stars.zip, type = "zero")) * exp(-predict(wine.nomiss.stars.zip, type = "count"))),
                       `Model-3` = mean(predict(wine.cat.miss.stars.zip, type = "zero") + (1 - predict(wine.cat.miss.stars.zip, type = "zero")) * exp(-predict(wine.cat.miss.stars.zip, type = "count"))),
                       `Model-4` = mean(predict(wine.cat.nomiss.stars.zip, type = "zero") + (1 - predict(wine.cat.nomiss.stars.zip, type = "zero")) * exp(-predict(wine.cat.nomiss.stars.zip, type = "count")))
                       ))

wine.fit <- rbind(wine.fit, data.frame(`Model-1` = mean(wine.miss.stars.df$TARGET==0),
                       `Model-2` = mean(wine.nomiss.stars.df$TARGET==0),
                       `Model-3` = mean(wine.cat.miss.stars.df$TARGET==0),
                       `Model-4` = mean(wine.cat.nomiss.stars.df$TARGET==0)
                       ))

row.names(wine.fit) <- c('AIC', 'Probability-Zero', 'Zeros.Dataset')

wine.fit%>% 
  kable("html",caption = "Zero Inflated Models Stats", row.names = T, digits = 4) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "center", font_size = 12)

#Linear Models
#Missing stars included
wine.miss.stars.lm <- lm(TARGET ~ ., data = wine.miss.stars.df)
summary(wine.miss.stars.lm)

#Linear Models
#Missing stars excluded
wine.nomiss.stars.lm <- lm(TARGET ~ ., data = wine.nomiss.stars.df)
summary(wine.nomiss.stars.lm)

#Linear Models
#Negative values converted to categories Stars included
wine.cat.miss.stars.lm <- lm(TARGET ~ ., data = wine.cat.miss.stars.df)
summary(wine.cat.miss.stars.lm)

#Linear Models
#Negative values converted to categories Stars excluded
wine.cat.nomiss.stars.lm <- lm(TARGET ~ ., data = wine.cat.nomiss.stars.df)
summary(wine.cat.nomiss.stars.lm)

#model1
autoplot(wine.miss.stars.lm, which = 1:6, ncol = 3, label.size = 2)

#model2
autoplot(wine.nomiss.stars.lm, which = 1:6, ncol = 3, label.size = 2)

#model3
autoplot(wine.cat.miss.stars.lm, which = 1:6, ncol = 3, label.size = 2)

#model4
autoplot(wine.cat.nomiss.stars.lm, which = 1:6, ncol = 3, label.size = 2)

#Linear Models
#Negative values converted to categories Stars included
wine.cat.miss.stars.lm2 <- lm(TARGET ~ ChloridesC + Density + pH + AcidIndex + LabelAppeal + STARS, data = wine.cat.miss.stars.df)
summary(wine.cat.miss.stars.lm2)

wine.cat.miss.stars.zip2 <- zeroinfl(TARGET ~ AcidIndex + LabelAppeal + STARS| 
                                    pH + AcidIndex + + LabelAppeal + STARS, 
                                    data = wine.cat.miss.stars.df)
summary(wine.cat.miss.stars.zip2)

#Stats of nb models 
wine.fit <- data.frame(`Model-1` = summary(wine.miss.stars.lm)$r.squared,
                       `Model-2` = summary(wine.nomiss.stars.lm)$r.squared,
                       `Model-3` = summary(wine.cat.miss.stars.lm)$r.squared,
                       `Model-4` = summary(wine.cat.nomiss.stars.lm)$r.squared
                       )

row.names(wine.fit) <- c('R-Squared')

wine.fit%>% 
  kable("html",caption = "Multiple Linear Regression Models", row.names = T, digits = 4) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "center", font_size = 12)

wine.asis.df <- wine.eval.df %>% 
  dplyr::select(FixedAcidity, VolatileAcidity, CitricAcid, ResidualSugar, Chlorides, FreeSulfurDioxide, TotalSulfurDioxide, Density, pH, Sulphates, Alcohol, AcidIndex)

#Convert negative variables in category
negcols <- c('FixedAcidity', 'VolatileAcidity', 'CitricAcid', 'ResidualSugar', 'Chlorides', 'FreeSulfurDioxide', 'TotalSulfurDioxide', 'Sulphates', 'Alcohol')
i = 1
for(nc in negcols){
  #Get min value of the column
  f <- paste0('min(wine.asis.df$',nc,', na.rm=T)')
  minVal <- eval(parse(text=f))

  #Get max value of the column
  f <- paste0('max(wine.asis.df$',nc,', na.rm=T)')
  maxVal <- eval(parse(text=f))
  
  #Get incremental value
  if (maxVal < 10){incVal = 1}
  if (maxVal >= 10 && maxVal < 50){incVal = 5}
  if (maxVal >= 50 && maxVal < 100){incVal = 10}
  if (maxVal >= 100 && maxVal < 500){incVal = 50}
  if (maxVal >= 500 && maxVal < 1000){incVal = 100}
  seqRange <- seq(minVal-incVal,maxVal+incVal,by=incVal)
  
  #Apply range to dataset
  f <- paste0('wine.asis.df$',nc,'R<-cut(wine.asis.df$',nc, ',seq(', minVal-incVal,',', maxVal+incVal, ',by=', incVal, '))')
  eval(parse(text=f))
  
  #Assign range a category 1:max value and apply to dataset
  f <- paste0('wine.asis.df$',nc,'C<-cut(wine.asis.df$',nc, ',seq(', minVal-incVal,',', maxVal+incVal, ',by=', incVal, '), labels=c(1:', length(seqRange)-1,'))')
  eval(parse(text=f))
}

#Get all newly created categorical variables
wine.asis_c.df <- wine.asis.df %>% 
  dplyr::select(FixedAcidityC, VolatileAcidityC, CitricAcidC, ResidualSugarC, ChloridesC, FreeSulfurDioxideC, TotalSulfurDioxideC, Density, pH, SulphatesC, AlcoholC, AcidIndex)

#kNN imputation
wine.eval.knn <- knnImputation(wine.asis_c.df, 30, meth='weighAvg')

wine.asis_c.df <- cbind(wine.eval.knn, LabelAppeal = wine.eval.df$LabelAppeal + 3, STARS = wine.eval.df$STARS)
wine.asis_c.df$LabelAppeal <- factor(wine.asis_c.df$LabelAppeal, levels = c(1,2,3,4,5))
wine.asis_c.df$STARS <- factor(wine.asis_c.df$STARS, levels = c(1,2,3,4))

wine.cat.eval.df <- knnImputation(wine.asis_c.df, 30, meth='weighAvg') 
wine.cat.eval.df <- cbind(TARGET = wine.eval.df$TARGET + 3, wine.cat.eval.df)

wine.eval.lm.df <- wine.cat.eval.df %>% 
  dplyr::select(TARGET,ChloridesC,Density,pH,AcidIndex,LabelAppeal,STARS)

wine.eval.lm.df$TARGET <- predict(wine.cat.miss.stars.lm2, newdata = wine.eval.lm.df, type = "response")
wine.eval.lm.df$LabelAppeal <- wine.eval.df$LabelAppeal
wine.eval.lm.df$LabelAppeal <- factor(wine.eval.lm.df$LabelAppeal, levels = c(-2:2))
wine.eval.lm.df <- wine.eval.lm.df[with(wine.eval.lm.df, order(TARGET)), ]

summary(wine.cat.miss.stars.lm2)
wine.eval.lm.df%>%
  head(25) %>% 
  kable("html",caption = "Predicted Wine Case Sales - Multiple Linear Reduced Model-3", row.names = F, digits = 4) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "center", font_size = 12)

wine.eval.zip.df <- wine.cat.eval.df %>% 
  dplyr::select(TARGET,AcidIndex,pH,LabelAppeal,STARS)

wine.eval.zip.df$TARGET <- predict(wine.cat.miss.stars.zip2, newdata = wine.eval.zip.df, type = "response")

wine.eval.zip.df$LabelAppeal <- wine.eval.df$LabelAppeal
wine.eval.zip.df$LabelAppeal <- factor(wine.eval.zip.df$LabelAppeal, levels = c(-2:2))
wine.eval.zip.df$STARS <- factor(wine.eval.zip.df$STARS, levels = c(1:4))

wine.eval.zip.df%>%
  head(25) %>% 
  kable("html",caption = "Predicted Wine Case Sales - Zero-Inflated Reduced Model-3", row.names = F, digits = 0) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "center", font_size = 12)

ggplot(wine.eval.zip.df, aes(x = AcidIndex, y = TARGET, colour = LabelAppeal)) +
  geom_point(aes(y = TARGET), alpha=.45, position=position_jitter(h=.2)) +
  labs(title="Predicted Sales - Acid Index", x = "Acid Index", y = "Predicted Case Sales")

ggplot(wine.eval.zip.df, aes(x = pH, y = TARGET, colour = LabelAppeal)) +
  geom_point(aes(y = TARGET), alpha=.45, position=position_jitter(h=.2)) +
  labs(title="Predicted Sales - pH", x = "pH", y = "Predicted Case Sales")

ggplot(wine.eval.zip.df, aes(x = AcidIndex, y = TARGET, colour = STARS)) +
  geom_point(aes(y = TARGET), alpha=.45, position=position_jitter(h=.2)) +
  labs(title="Predicted Sales - Acid Index", x = "Acid Index", y = "Predicted Case Sales")

ggplot(wine.eval.zip.df, aes(x = pH, y = TARGET, colour = STARS)) +
  geom_point(aes(y = TARGET), alpha=.45, position=position_jitter(h=.2)) +
  labs(title="Predicted Sales - pH", x = "pH", y = "Predicted Case Sales")

DATA 621: Business Analytics And Data Mining - HW5

Pavan Akula

May 1, 2018

Overview

Data Exploration

Correlation

Conclusion of Data Exploration

Data Preparation

Data Conversion Method

Correlation Plot

Imputing Missing Values

Binning Method

Imputation Process

Conclusion of Data Preparation

Build Models

Poisson Model

Mode-1

Interpretation

Mode-2

Mode-3

Mode-4

Analysis Of Models

Model Stats Interpretation

Quasi-Poisson Models

Negative Binomial Models

Model-1

Model-2

Model-3

Model-3

Model-3

Model-4

Zero-Inflated Models

Interpretation

Count Coefficients

Zero coefficients

Multiple Linear Regression Models

Model-1

Model-2

Model-3

Model-4

Select Models

Reduced Multiple Linear Model-3

Zero-Inflated Model-3

Conclusion

References

Appendix