We are given a dataset containing information on commercially avilable wines. The variables describe the chemical attributes of wine being sold. Our response variable in this case is the number of wine cases purchased by wine distribution companies. Each record in the data represents the number of cases sold for a wine with the specific chemical attributes.

Our goal is to model the data and predict the number of cases sold as a function of wine’s chemical attributes. The use case is for a wine manufacturer to adjust their wine offerings in order to maximize sales. The model of use for this case study will be a count regression model.

Read the data in

##    INDEX TARGET FixedAcidity VolatileAcidity CitricAcid ResidualSugar
## 1      1      3          3.2           1.160      -0.98         54.20
## 2      2      3          4.5           0.160      -0.81         26.10
## 3      4      5          7.1           2.640      -0.88         14.80
## 4      5      3          5.7           0.385       0.04         18.80
## 5      6      4          8.0           0.330      -1.26          9.40
## 6      7      0         11.3           0.320       0.59          2.20
## 7      8      0          7.7           0.290      -0.40         21.50
## 8     11      4          6.5          -1.220       0.34          1.40
## 9     12      3         14.8           0.270       1.05         11.25
## 10    13      6          5.5          -0.220       0.39          1.80
##    Chlorides FreeSulfurDioxide TotalSulfurDioxide Density   pH Sulphates
## 1     -0.567                NA                268 0.99280 3.33     -0.59
## 2     -0.425                15               -327 1.02792 3.38      0.70
## 3      0.037               214                142 0.99518 3.12      0.48
## 4     -0.425                22                115 0.99640 2.24      1.83
## 5         NA              -167                108 0.99457 3.12      1.77
## 6      0.556               -37                 15 0.99940 3.20      1.29
## 7      0.060               287                156 0.99572 3.49      1.21
## 8      0.040               523                551 1.03236 3.20        NA
## 9     -0.007              -213                 NA 0.99620 4.93      0.26
## 10    -0.277                62                180 0.94724 3.09      0.75
##    Alcohol LabelAppeal AcidIndex STARS
## 1      9.9           0         8     2
## 2       NA          -1         7     3
## 3     22.0          -1         8     3
## 4      6.2          -1         6     1
## 5     13.7           0         9     2
## 6     15.4           0        11    NA
## 7     10.3           0         8    NA
## 8     11.6           1         7     3
## 9     15.0           0         6    NA
## 10    12.6           0         8     4
  1. EDA

How many records and variables?

##  [1] "TARGET"             "FixedAcidity"       "VolatileAcidity"   
##  [4] "CitricAcid"         "ResidualSugar"      "Chlorides"         
##  [7] "FreeSulfurDioxide"  "TotalSulfurDioxide" "Density"           
## [10] "pH"                 "Sulphates"          "Alcohol"           
## [13] "LabelAppeal"        "AcidIndex"          "STARS"
## 'data.frame':    12795 obs. of  15 variables:
##  $ TARGET            : int  3 3 5 3 4 0 0 4 3 6 ...
##  $ FixedAcidity      : num  3.2 4.5 7.1 5.7 8 11.3 7.7 6.5 14.8 5.5 ...
##  $ VolatileAcidity   : num  1.16 0.16 2.64 0.385 0.33 0.32 0.29 -1.22 0.27 -0.22 ...
##  $ CitricAcid        : num  -0.98 -0.81 -0.88 0.04 -1.26 0.59 -0.4 0.34 1.05 0.39 ...
##  $ ResidualSugar     : num  54.2 26.1 14.8 18.8 9.4 ...
##  $ Chlorides         : num  -0.567 -0.425 0.037 -0.425 NA 0.556 0.06 0.04 -0.007 -0.277 ...
##  $ FreeSulfurDioxide : num  NA 15 214 22 -167 -37 287 523 -213 62 ...
##  $ TotalSulfurDioxide: num  268 -327 142 115 108 15 156 551 NA 180 ...
##  $ Density           : num  0.993 1.028 0.995 0.996 0.995 ...
##  $ pH                : num  3.33 3.38 3.12 2.24 3.12 3.2 3.49 3.2 4.93 3.09 ...
##  $ Sulphates         : num  -0.59 0.7 0.48 1.83 1.77 1.29 1.21 NA 0.26 0.75 ...
##  $ Alcohol           : num  9.9 NA 22 6.2 13.7 15.4 10.3 11.6 15 12.6 ...
##  $ LabelAppeal       : int  0 -1 -1 -1 0 0 0 1 0 0 ...
##  $ AcidIndex         : int  8 7 8 6 9 11 8 7 6 8 ...
##  $ STARS             : int  2 3 3 1 2 NA NA 3 NA 4 ...

For this particular study, we will be implimenting new features into our EDA using a package called DataExplorer. More information can be found here https://datascienceplus.com/blazing-fast-eda-in-r-with-dataexplorer/

Using DataExplorer, we can examine the dimensions of each variable.

## Warning: package 'DataExplorer' was built under R version 3.4.4

This interactive chart details the data type for each variable in addition to number or rows and variables within the dataset.

There are 12,795 records and 15 variables. The index variable can be removed all together since it only serves as a row number. The data types seem have correct data types.

Missing value analysis

STARS is missing over 25% of its data. Sulphates has 9% missing data. The missing variables will be treated in the data preperation step.

Lets examine the spread of each variable

It looks like the spread of most variables are even, with a near normal distribution. AcidIndex has a right skew. Most variables have significant peaks. This will be closely analyzed with an outlier analysis.

Lets closely examine the distribution of the response variable

The distribution of the response variable is close to normal, however there are a significant number of records that have the number 0.

Overall Summary of the data

##      TARGET       FixedAcidity     VolatileAcidity     CitricAcid     
##  Min.   :0.000   Min.   :-18.100   Min.   :-2.7900   Min.   :-3.2400  
##  1st Qu.:2.000   1st Qu.:  5.200   1st Qu.: 0.1300   1st Qu.: 0.0300  
##  Median :3.000   Median :  6.900   Median : 0.2800   Median : 0.3100  
##  Mean   :3.029   Mean   :  7.076   Mean   : 0.3241   Mean   : 0.3084  
##  3rd Qu.:4.000   3rd Qu.:  9.500   3rd Qu.: 0.6400   3rd Qu.: 0.5800  
##  Max.   :8.000   Max.   : 34.400   Max.   : 3.6800   Max.   : 3.8600  
##                                                                       
##  ResidualSugar        Chlorides       FreeSulfurDioxide TotalSulfurDioxide
##  Min.   :-127.800   Min.   :-1.1710   Min.   :-555.00   Min.   :-823.0    
##  1st Qu.:  -2.000   1st Qu.:-0.0310   1st Qu.:   0.00   1st Qu.:  27.0    
##  Median :   3.900   Median : 0.0460   Median :  30.00   Median : 123.0    
##  Mean   :   5.419   Mean   : 0.0548   Mean   :  30.85   Mean   : 120.7    
##  3rd Qu.:  15.900   3rd Qu.: 0.1530   3rd Qu.:  70.00   3rd Qu.: 208.0    
##  Max.   : 141.150   Max.   : 1.3510   Max.   : 623.00   Max.   :1057.0    
##  NA's   :616        NA's   :638       NA's   :647       NA's   :682       
##     Density             pH          Sulphates          Alcohol     
##  Min.   :0.8881   Min.   :0.480   Min.   :-3.1300   Min.   :-4.70  
##  1st Qu.:0.9877   1st Qu.:2.960   1st Qu.: 0.2800   1st Qu.: 9.00  
##  Median :0.9945   Median :3.200   Median : 0.5000   Median :10.40  
##  Mean   :0.9942   Mean   :3.208   Mean   : 0.5271   Mean   :10.49  
##  3rd Qu.:1.0005   3rd Qu.:3.470   3rd Qu.: 0.8600   3rd Qu.:12.40  
##  Max.   :1.0992   Max.   :6.130   Max.   : 4.2400   Max.   :26.50  
##                   NA's   :395     NA's   :1210      NA's   :653    
##   LabelAppeal          AcidIndex          STARS      
##  Min.   :-2.000000   Min.   : 4.000   Min.   :1.000  
##  1st Qu.:-1.000000   1st Qu.: 7.000   1st Qu.:1.000  
##  Median : 0.000000   Median : 8.000   Median :2.000  
##  Mean   :-0.009066   Mean   : 7.773   Mean   :2.042  
##  3rd Qu.: 1.000000   3rd Qu.: 8.000   3rd Qu.:3.000  
##  Max.   : 2.000000   Max.   :17.000   Max.   :4.000  
##                                       NA's   :3359

There are several variables that should not contain any negative entires. This brings the problem of data collection into consideration. This will have to be furthur investigated in the data perperation step.

Lets point out the summary of the response variable

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   2.000   3.000   3.029   4.000   8.000
## [1] 3.710895

The mean is nearly equal to the variance. This is a strong indicator

How many negative values does each variable contain?

##             TARGET       FixedAcidity    VolatileAcidity 
##                  0               1621               2827 
##         CitricAcid      ResidualSugar          Chlorides 
##               2966                 NA                 NA 
##  FreeSulfurDioxide TotalSulfurDioxide            Density 
##                 NA                 NA                  0 
##                 pH          Sulphates            Alcohol 
##                 NA                 NA                 NA 
##        LabelAppeal          AcidIndex              STARS 
##               3640                  0                 NA

Label Appeal can have negative values in its domain. A negative value means customers do not like the design. Several chemical attributes have negative numbers. 23% of the values in Citric Acid are negative. In our data prep, we will confirm if negative numbers are in the domain of these variables based on the definition of their chemical attributes.

How does each variable correlate to the response variable?

##             TARGET       FixedAcidity    VolatileAcidity 
##        1.000000000       -0.049010939       -0.088793212 
##         CitricAcid      ResidualSugar          Chlorides 
##        0.008684633                 NA                 NA 
##  FreeSulfurDioxide TotalSulfurDioxide            Density 
##                 NA                 NA       -0.035517502 
##                 pH          Sulphates            Alcohol 
##                 NA                 NA                 NA 
##        LabelAppeal          AcidIndex              STARS 
##        0.356500469       -0.246049449                 NA

There are several variables that do not have a correlation with the number of sales. The highest positive correlation is label appeal which is the marketing score. This correlation makes sense as one would imagine a higher marketing score leads to more sales. Acid index has the strongest negative correlation however it is difficult to see the connection without prior knowledge of wine attributes.

How do the other variables correlate with each other?

There are numerous variables that do not have any correlation at all with any other variable. Through modeling, we will determine if they are significant. STARS has no correlation with any variable except its self. STARS also has the highest percentage of missing data. It’s hard to say if the missing data is a reason why it does not correlate with other predictors. This only gives evidence in support of removing STARS all together.

Outlier Analysis

Target

## Outliers identified: 17 nPropotion (%) of outliers: 0.1 nMean of the outliers: 8 nMean without removing outliers: 3.03 nMean if we remove outliers: 3.02 nDo you want to remove outliers and to replace with NA? [yes/no]: 
## Nothing changed n

Chlorides

## Outliers identified: 3021 nPropotion (%) of outliers: 33.1 nMean of the outliers: 0.05 nMean without removing outliers: 0.05 nMean if we remove outliers: 0.06 nDo you want to remove outliers and to replace with NA? [yes/no]: 
## Nothing changed n

Alcohol

## Outliers identified: 928 nPropotion (%) of outliers: 8.3 nMean of the outliers: 9.47 nMean without removing outliers: 10.49 nMean if we remove outliers: 10.57 nDo you want to remove outliers and to replace with NA? [yes/no]: 
## Nothing changed n

Fixed Acidity

## Outliers identified: 2455 nPropotion (%) of outliers: 23.7 nMean of the outliers: 6.76 nMean without removing outliers: 7.08 nMean if we remove outliers: 7.15 nDo you want to remove outliers and to replace with NA? [yes/no]: 
## Nothing changed n

Free Sulfur Dioxide

## Outliers identified: 3712 nPropotion (%) of outliers: 44 nMean of the outliers: 24.17 nMean without removing outliers: 30.85 nMean if we remove outliers: 33.78 nDo you want to remove outliers and to replace with NA? [yes/no]: 
## Nothing changed n

Label Appeal

## Outliers identified: 0 nPropotion (%) of outliers: 0 nMean of the outliers: NaN nMean without removing outliers: -0.01 nMean if we remove outliers: -0.01 nDo you want to remove outliers and to replace with NA? [yes/no]: 
## Nothing changed n

Volatile Acidity

## Outliers identified: 2599 nPropotion (%) of outliers: 25.5 nMean of the outliers: 0.21 nMean without removing outliers: 0.32 nMean if we remove outliers: 0.35 nDo you want to remove outliers and to replace with NA? [yes/no]: 
## Nothing changed n

Total Sulfur Dioxide

## Outliers identified: 1590 nPropotion (%) of outliers: 15.1 nMean of the outliers: 127.61 nMean without removing outliers: 120.71 nMean if we remove outliers: 119.67 nDo you want to remove outliers and to replace with NA? [yes/no]: 
## Nothing changed n

Acid Index

## Outliers identified: 1151 nPropotion (%) of outliers: 9.9 nMean of the outliers: 10.55 nMean without removing outliers: 7.77 nMean if we remove outliers: 7.5 nDo you want to remove outliers and to replace with NA? [yes/no]: 
## Nothing changed n

Citric Acid

## Outliers identified: 2688 nPropotion (%) of outliers: 26.6 nMean of the outliers: 0.29 nMean without removing outliers: 0.31 nMean if we remove outliers: 0.31 nDo you want to remove outliers and to replace with NA? [yes/no]: 
## Nothing changed n

Denisty

## Outliers identified: 3823 nPropotion (%) of outliers: 42.6 nMean of the outliers: 0.99 nMean without removing outliers: 0.99 nMean if we remove outliers: 0.99 nDo you want to remove outliers and to replace with NA? [yes/no]: 
## Nothing changed n

Residual Sugar

## Outliers identified: 3298 nPropotion (%) of outliers: 37.1 nMean of the outliers: 3.5 nMean without removing outliers: 5.42 nMean if we remove outliers: 6.13 nDo you want to remove outliers and to replace with NA? [yes/no]: 
## Nothing changed n

pH

## Outliers identified: 1864 nPropotion (%) of outliers: 17.7 nMean of the outliers: 3.2 nMean without removing outliers: 3.21 nMean if we remove outliers: 3.21 nDo you want to remove outliers and to replace with NA? [yes/no]: 
## Nothing changed n

STARS

## Outliers identified: 0 nPropotion (%) of outliers: 0 nMean of the outliers: NaN nMean without removing outliers: 2.04 nMean if we remove outliers: 2.04 nDo you want to remove outliers and to replace with NA? [yes/no]: 
## Nothing changed n

Sulphates

outlierKD(wine_training2, Sulphates)

## Outliers identified: 2606 nPropotion (%) of outliers: 29 nMean of the outliers: 0.46 nMean without removing outliers: 0.53 nMean if we remove outliers: 0.55 nDo you want to remove outliers and to replace with NA? [yes/no]: 
## Nothing changed n

The removal of outliers seems to tighten the overall spread of certain variables such as pH. Removing outliers for other variables does not really make a change to the overall distribution. Through modeling, we will have a better idea if we should remove outliers or not.

  1. Data Preperation Recall some of our findings regarding the data Number of missing values
##             TARGET       FixedAcidity    VolatileAcidity 
##                  0                  0                  0 
##         CitricAcid      ResidualSugar          Chlorides 
##                  0                616                638 
##  FreeSulfurDioxide TotalSulfurDioxide            Density 
##                647                682                  0 
##                 pH          Sulphates            Alcohol 
##                395               1210                653 
##        LabelAppeal          AcidIndex              STARS 
##                  0                  0               3359

Stars is defined as the rating given by wine experts. One could assume that a high rating should be correlated or related to the response variable. We are going to impute the STARS variable with its median value and recheck the correlation number.

Summary of STARS after we impute missing values with the median vs non imputed values

## 
##  3359 values imputed to 2
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   2.000   2.031   2.000   4.000
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   1.000   1.000   2.000   2.042   3.000   4.000    3359

Plots of Spread

Before we impute the other variables, we need to check if negative values belong in their domin. If negative values do not make sense in the context of the variable, then some potential fixes are to consider the absolute value or drop them from the data frame all together.

##             TARGET       FixedAcidity    VolatileAcidity 
##                  0               1621               2827 
##         CitricAcid      ResidualSugar          Chlorides 
##               2966                 NA                 NA 
##  FreeSulfurDioxide TotalSulfurDioxide            Density 
##                 NA                 NA                  0 
##                 pH          Sulphates            Alcohol 
##                 NA                 NA                 NA 
##        LabelAppeal          AcidIndex              STARS 
##               3640                  0                  0
##             TARGET       FixedAcidity    VolatileAcidity 
##                  0                  0                  0 
##         CitricAcid      ResidualSugar          Chlorides 
##                  0                616                638 
##  FreeSulfurDioxide TotalSulfurDioxide            Density 
##                647                682                  0 
##                 pH          Sulphates            Alcohol 
##                395               1210                653 
##        LabelAppeal          AcidIndex              STARS 
##                  0                  0                  0

After scanning several documents regarding the chemical properties of wine,is it reasonable to transform the variables with negative values into positive for chemical attributes?

We need to provide justification for transforming chemical attributes into positive only values.

Residual sugar concentration is a measure of the amount of sugar solids in a given volume of wine following the end of fermentation and any sugar addition when making a sweet wine.

Volatile Acidity and citric is also a measurement that can be expressed in g/l or mg/l.

Fixed Acidity levels found in wine can vary greatly but in general one would expect to see 1,000 to 4,000 mg/L tartaric acid, 0 to 8,000 mg/L malic acid, 0 to 500 mg/L citric acid, and 500 to 2,000 mg/L succinic acid

Based on some understanding on these chemical attributes, it makes sense to treat negative values. Any acidity variable is a measure and measures cannot be negative unless we examine a rate of change. However rate of change is not needed for this case study.

Documentation: https://winemakermag.com/501-measuring-residual-sugar-techniques http://waterhouse.ucdavis.edu/whats-in-wine/volatile-acidity https://en.wikipedia.org/wiki/Acids_in_wine

If positve plus imputed values has a deterimental effect on the model, then we will use the unaltered data wine_training2 and consider some alternate transformation.

## 
##  616 values imputed to 12.9 
## 
## 
##  638 values imputed to 0.098 
## 
## 
##  647 values imputed to 56 
## 
## 
##  682 values imputed to 154 
## 
## 
##  395 values imputed to 3.2 
## 
## 
##  1210 values imputed to 0.59 
## 
## 
##  653 values imputed to 10.4 
## 
## 
##  3359 values imputed to 2
##      TARGET       FixedAcidity    VolatileAcidity    CitricAcid    
##  Min.   :0.000   Min.   : 0.000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:2.000   1st Qu.: 5.600   1st Qu.:0.2500   1st Qu.:0.2800  
##  Median :3.000   Median : 7.000   Median :0.4100   Median :0.4400  
##  Mean   :3.029   Mean   : 8.063   Mean   :0.6411   Mean   :0.6863  
##  3rd Qu.:4.000   3rd Qu.: 9.800   3rd Qu.:0.9100   3rd Qu.:0.9700  
##  Max.   :8.000   Max.   :34.400   Max.   :3.6800   Max.   :3.8600  
##  ResidualSugar      Chlorides      FreeSulfurDioxide TotalSulfurDioxide
##  Min.   :  0.00   Min.   :0.0000   Min.   :  0.0     Min.   :   0.0    
##  1st Qu.:  4.00   1st Qu.:0.0460   1st Qu.: 29.0     1st Qu.: 102.0    
##  Median : 12.90   Median :0.0980   Median : 56.0     Median : 154.0    
##  Mean   : 22.86   Mean   :0.2163   Mean   :104.1     Mean   : 201.6    
##  3rd Qu.: 37.20   3rd Qu.:0.3530   3rd Qu.:164.0     3rd Qu.: 251.0    
##  Max.   :141.15   Max.   :1.3510   Max.   :623.0     Max.   :1057.0    
##     Density             pH          Sulphates         Alcohol     
##  Min.   :0.8881   Min.   :0.480   Min.   :0.0000   Min.   : 0.00  
##  1st Qu.:0.9877   1st Qu.:2.970   1st Qu.:0.4500   1st Qu.: 9.10  
##  Median :0.9945   Median :3.200   Median :0.5900   Median :10.40  
##  Mean   :0.9942   Mean   :3.207   Mean   :0.8224   Mean   :10.52  
##  3rd Qu.:1.0005   3rd Qu.:3.450   3rd Qu.:1.0000   3rd Qu.:12.20  
##  Max.   :1.0992   Max.   :6.130   Max.   :4.2400   Max.   :26.50  
##   LabelAppeal          AcidIndex          STARS      
##  Min.   :-2.000000   Min.   : 4.000   Min.   :1.000  
##  1st Qu.:-1.000000   1st Qu.: 7.000   1st Qu.:2.000  
##  Median : 0.000000   Median : 8.000   Median :2.000  
##  Mean   :-0.009066   Mean   : 7.773   Mean   :2.031  
##  3rd Qu.: 1.000000   3rd Qu.: 8.000   3rd Qu.:2.000  
##  Max.   : 2.000000   Max.   :17.000   Max.   :4.000

Howdid the distributions change?

How Does correlation change? (wine_training3)

We start to see evidence of more relationships within the data after we have transformed the data. Once we do modeling, we will be able to determine if the data has strong integrity. To reiterate, the entires in the variables relating to chemical properties were made positive because we discovered that they pertain to measurments in units such as mg and l. It does not make sense for measurments to be negative in the physical world. I assume the negative values were a result of a data collection error.

  1. Build Models

Objective: Using the training data set, build at least two different poisson regression models, at least two different negative binomial regression models, and at least two multiple linear regression models, using different variables (or the same variables with different transformations)

Full Poisson Regression Model on Transformed Data

## Loading required package: grid
## 
## Attaching package: 'faraway'
## The following object is masked from 'package:survival':
## 
##     rats
## The following object is masked from 'package:lattice':
## 
##     melanoma
## Loading required package: car
## Warning: package 'car' was built under R version 3.4.4
## Loading required package: carData
## Warning: package 'carData' was built under R version 3.4.4
## 
## Attaching package: 'car'
## The following objects are masked from 'package:faraway':
## 
##     logit, vif
## Loading required package: lmtest
## Warning: package 'lmtest' was built under R version 3.4.4
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## Loading required package: sandwich
## 
## Attaching package: 'boot'
## The following object is masked from 'package:car':
## 
##     logit
## The following objects are masked from 'package:faraway':
## 
##     logit, melanoma
## The following object is masked from 'package:survival':
## 
##     aml
## The following object is masked from 'package:lattice':
## 
##     melanoma
## 
## Call:
## glm(formula = TARGET ~ ., family = "poisson", data = wine_training3)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2792  -0.5084   0.1987   0.6366   2.7592  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         2.053e+00  1.962e-01  10.461  < 2e-16 ***
## FixedAcidity       -7.866e-04  1.046e-03  -0.752 0.452212    
## VolatileAcidity    -5.881e-02  9.416e-03  -6.246 4.21e-10 ***
## CitricAcid          1.734e-02  8.292e-03   2.091 0.036540 *  
## ResidualSugar       1.821e-05  2.078e-04   0.088 0.930182    
## Chlorides          -4.456e-02  2.230e-02  -1.998 0.045703 *  
## FreeSulfurDioxide   9.112e-05  4.782e-05   1.906 0.056709 .  
## TotalSulfurDioxide  1.167e-04  3.165e-05   3.688 0.000226 ***
## Density            -4.520e-01  1.922e-01  -2.352 0.018658 *  
## pH                 -2.349e-02  7.635e-03  -3.077 0.002093 ** 
## Sulphates          -2.297e-02  8.229e-03  -2.791 0.005247 ** 
## Alcohol             5.905e-03  1.445e-03   4.087 4.37e-05 ***
## LabelAppeal         1.963e-01  6.020e-03  32.606  < 2e-16 ***
## AcidIndex          -1.233e-01  4.453e-03 -27.681  < 2e-16 ***
## STARS               2.211e-01  6.466e-03  34.202  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 22861  on 12794  degrees of freedom
## Residual deviance: 18475  on 12780  degrees of freedom
## AIC: 50447
## 
## Number of Fisher Scoring iterations: 5
## Analysis of Deviance Table
## 
## Model: poisson, link: log
## 
## Response: TARGET
## 
## Terms added sequentially (first to last)
## 
## 
##                    Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## NULL                               12794      22861              
## FixedAcidity        1    44.59     12793      22816 2.428e-11 ***
## VolatileAcidity     1    77.89     12792      22738 < 2.2e-16 ***
## CitricAcid          1     2.84     12791      22736 0.0916712 .  
## ResidualSugar       1     0.15     12790      22735 0.6973282    
## Chlorides           1    11.55     12789      22724 0.0006771 ***
## FreeSulfurDioxide   1     8.03     12788      22716 0.0045888 ** 
## TotalSulfurDioxide  1    13.89     12787      22702 0.0001938 ***
## Density             1    20.14     12786      22682 7.199e-06 ***
## pH                  1     1.03     12785      22681 0.3099732    
## Sulphates           1    13.56     12784      22667 0.0002316 ***
## Alcohol             1    62.20     12783      22605 3.110e-15 ***
## LabelAppeal         1  1975.12     12782      20630 < 2.2e-16 ***
## AcidIndex           1  1006.22     12781      19624 < 2.2e-16 ***
## STARS               1  1148.95     12780      18475 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

##      res.deviance    df             p
## [1,]     18474.73 12780 1.430684e-216
## [1] 0.1918551
## [1] 1.430684e-216

Our first model tells us that unit decrases in attributes such as volatile acidity cause the expected number of sales to increase. Residual sugars and fixed acidity have high p values. Residal Sugars have almost no correlation with any of the variables even after transformation. Based on the p-value from the chi square goodness of fit test, we are unable to conclude that the full model is a good fit. The G-statistic is our substitue for r square and in this case, it comes out to .19.

Lets consider some variable selection using Aic

## Stepwise Model Path 
## Analysis of Deviance Table
## 
## Initial Model:
## TARGET ~ FixedAcidity + VolatileAcidity + CitricAcid + ResidualSugar + 
##     Chlorides + FreeSulfurDioxide + TotalSulfurDioxide + Density + 
##     pH + Sulphates + Alcohol + LabelAppeal + AcidIndex + STARS
## 
## Final Model:
## TARGET ~ VolatileAcidity + CitricAcid + Chlorides + FreeSulfurDioxide + 
##     TotalSulfurDioxide + Density + pH + Sulphates + Alcohol + 
##     LabelAppeal + AcidIndex + STARS
## 
## 
##              Step Df   Deviance Resid. Df Resid. Dev      AIC
## 1                                   12780   18474.73 50446.75
## 2 - ResidualSugar  1 0.00767479     12781   18474.74 50444.76
## 3  - FixedAcidity  1 0.56501594     12782   18475.31 50443.33

AIC suggested a model with Residual Sugar and Fixed Acidity removed. Lets formulate the generated model.

## 
## Call:
## glm(formula = TARGET ~ VolatileAcidity + CitricAcid + Chlorides + 
##     FreeSulfurDioxide + TotalSulfurDioxide + Density + pH + Sulphates + 
##     Alcohol + LabelAppeal + AcidIndex + STARS, family = "poisson", 
##     data = wine_training3)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2775  -0.5077   0.1990   0.6363   2.7576  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         2.050e+00  1.961e-01  10.453  < 2e-16 ***
## VolatileAcidity    -5.887e-02  9.416e-03  -6.252 4.06e-10 ***
## CitricAcid          1.742e-02  8.290e-03   2.101 0.035644 *  
## Chlorides          -4.451e-02  2.230e-02  -1.996 0.045905 *  
## FreeSulfurDioxide   9.127e-05  4.782e-05   1.909 0.056285 .  
## TotalSulfurDioxide  1.167e-04  3.164e-05   3.689 0.000225 ***
## Density            -4.509e-01  1.922e-01  -2.347 0.018942 *  
## pH                 -2.356e-02  7.635e-03  -3.086 0.002028 ** 
## Sulphates          -2.303e-02  8.228e-03  -2.799 0.005118 ** 
## Alcohol             5.908e-03  1.445e-03   4.089 4.33e-05 ***
## LabelAppeal         1.963e-01  6.020e-03  32.612  < 2e-16 ***
## AcidIndex          -1.238e-01  4.400e-03 -28.135  < 2e-16 ***
## STARS               2.211e-01  6.466e-03  34.203  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 22861  on 12794  degrees of freedom
## Residual deviance: 18475  on 12782  degrees of freedom
## AIC: 50443
## 
## Number of Fisher Scoring iterations: 5
## Analysis of Deviance Table
## 
## Model: poisson, link: log
## 
## Response: TARGET
## 
## Terms added sequentially (first to last)
## 
## 
##                    Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## NULL                               12794      22861              
## VolatileAcidity     1    79.20     12793      22782 < 2.2e-16 ***
## CitricAcid          1     2.95     12792      22779 0.0858538 .  
## Chlorides           1    11.59     12791      22767 0.0006616 ***
## FreeSulfurDioxide   1     8.15     12790      22759 0.0043023 ** 
## TotalSulfurDioxide  1    14.46     12789      22744 0.0001430 ***
## Density             1    20.14     12788      22724 7.185e-06 ***
## pH                  1     1.00     12787      22723 0.3183278    
## Sulphates           1    14.43     12786      22709 0.0001455 ***
## Alcohol             1    63.20     12785      22646 1.869e-15 ***
## LabelAppeal         1  1976.00     12784      20670 < 2.2e-16 ***
## AcidIndex           1  1045.43     12783      19624 < 2.2e-16 ***
## STARS               1  1149.04     12782      18475 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

##      res.deviance    df             p
## [1,]     18475.31 12782 1.893996e-216
## [1] 0.1918551
## [1] 1.893996e-216

According to this second iteration, volitile acidity and pH are features that do not show evidence of a gooffit according to chi square test. We can build a third model with these additional variables removed.

## 
## Call:
## glm(formula = TARGET ~ CitricAcid + Chlorides + FreeSulfurDioxide + 
##     TotalSulfurDioxide + Density + Sulphates + Alcohol + LabelAppeal + 
##     AcidIndex + STARS, family = "poisson", data = wine_training3)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.3320  -0.4995   0.2007   0.6325   2.7482  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         1.926e+00  1.942e-01   9.915  < 2e-16 ***
## CitricAcid          1.766e-02  8.284e-03   2.132 0.032993 *  
## Chlorides          -4.576e-02  2.230e-02  -2.052 0.040160 *  
## FreeSulfurDioxide   9.448e-05  4.780e-05   1.977 0.048080 *  
## TotalSulfurDioxide  1.227e-04  3.161e-05   3.882 0.000103 ***
## Density            -4.434e-01  1.920e-01  -2.309 0.020946 *  
## Sulphates          -2.370e-02  8.232e-03  -2.879 0.003992 ** 
## Alcohol             5.783e-03  1.444e-03   4.004 6.24e-05 ***
## LabelAppeal         1.964e-01  6.017e-03  32.645  < 2e-16 ***
## AcidIndex          -1.235e-01  4.384e-03 -28.164  < 2e-16 ***
## STARS               2.222e-01  6.462e-03  34.393  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 22861  on 12794  degrees of freedom
## Residual deviance: 18525  on 12784  degrees of freedom
## AIC: 50489
## 
## Number of Fisher Scoring iterations: 5
## Analysis of Deviance Table
## 
## Model: poisson, link: log
## 
## Response: TARGET
## 
## Terms added sequentially (first to last)
## 
## 
##                    Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## NULL                               12794      22861              
## CitricAcid          1     3.04     12793      22858 0.0813918 .  
## Chlorides           1    12.19     12792      22846 0.0004797 ***
## FreeSulfurDioxide   1     8.63     12791      22837 0.0033041 ** 
## TotalSulfurDioxide  1    16.85     12790      22820 4.036e-05 ***
## Density             1    19.79     12789      22800 8.642e-06 ***
## Sulphates           1    14.96     12788      22785 0.0001098 ***
## Alcohol             1    61.60     12787      22724 4.213e-15 ***
## LabelAppeal         1  1986.11     12786      20738 < 2.2e-16 ***
## AcidIndex           1  1050.67     12785      19687 < 2.2e-16 ***
## STARS               1  1161.79     12784      18525 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

##      res.deviance    df            p
## [1,]     18525.26 12784 1.20718e-219
## [1] 0.1918551
## [1] 1.20718e-219

Citric Acid does not appear to have any evidence of being a good fit for the model. Lets build an additional model with citric acid removed.

We will also take the diagnostic plots one step furthur. We want to estimate the variance for the target given the mean.The variance appears to be much smaller than the mean

## 
## Call:
## glm(formula = TARGET ~ Chlorides + FreeSulfurDioxide + TotalSulfurDioxide + 
##     Density + Sulphates + Alcohol + LabelAppeal + AcidIndex + 
##     STARS, family = "poisson", data = wine_training3)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.3519  -0.4936   0.2012   0.6351   2.7432  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         1.938e+00  1.941e-01   9.982  < 2e-16 ***
## Chlorides          -4.603e-02  2.230e-02  -2.064   0.0390 *  
## FreeSulfurDioxide   9.474e-05  4.780e-05   1.982   0.0475 *  
## TotalSulfurDioxide  1.234e-04  3.161e-05   3.902 9.52e-05 ***
## Density            -4.457e-01  1.920e-01  -2.322   0.0203 *  
## Sulphates          -2.344e-02  8.231e-03  -2.848   0.0044 ** 
## Alcohol             5.768e-03  1.445e-03   3.993 6.53e-05 ***
## LabelAppeal         1.966e-01  6.017e-03  32.677  < 2e-16 ***
## AcidIndex          -1.232e-01  4.381e-03 -28.111  < 2e-16 ***
## STARS               2.223e-01  6.462e-03  34.394  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 22861  on 12794  degrees of freedom
## Residual deviance: 18530  on 12785  degrees of freedom
## AIC: 50492
## 
## Number of Fisher Scoring iterations: 5
## Analysis of Deviance Table
## 
## Model: poisson, link: log
## 
## Response: TARGET
## 
## Terms added sequentially (first to last)
## 
## 
##                    Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## NULL                               12794      22861              
## Chlorides           1    12.25     12793      22849 0.0004649 ***
## FreeSulfurDioxide   1     8.69     12792      22840 0.0032076 ** 
## TotalSulfurDioxide  1    16.98     12791      22823 3.783e-05 ***
## Density             1    19.95     12790      22803 7.969e-06 ***
## Sulphates           1    14.77     12789      22788 0.0001217 ***
## Alcohol             1    61.45     12788      22727 4.532e-15 ***
## LabelAppeal         1  1988.15     12787      20739 < 2.2e-16 ***
## AcidIndex           1  1046.99     12786      19692 < 2.2e-16 ***
## STARS               1  1161.90     12785      18530 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

##      res.deviance    df             p
## [1,]     18529.78 12785 7.209302e-220
##                    Odds ratio     2.5 %     97.5 %
## (Intercept)         6.9420251 4.7453397 10.1555874
## Chlorides           0.9550140 0.9141708  0.9976819
## FreeSulfurDioxide   1.0000947 1.0000011  1.0001884
## TotalSulfurDioxide  1.0001234 1.0000614  1.0001853
## Density             0.6403568 0.4395335  0.9329366
## Sulphates           0.9768312 0.9611996  0.9927171
## Alcohol             1.0057845 1.0029409  1.0086362
## LabelAppeal         1.2172588 1.2029888  1.2316981
## AcidIndex           0.8841256 0.8765664  0.8917499
## STARS               1.2488944 1.2331763  1.2648128
## [1] 0.1918551

## [1] 7.209302e-220

This model tells us that average number of wine sales increases when wines have a lower concentration of chlorides. This makes sense considering that high chloride makes the wine taste salty and not as good according to certain documentation. A higher alcohol conentration is a good indicator of a better quality wine so it makes sense to increase as number of wine units sold increases. The same story is reflected when we convert exponents to odds ratios. With the odds ratio, we can see that as wine sales are more likely to increase when wine rating increases. We managed to make a simpler model that still has the same G statistic, meaning only .20 of the proportion of deviance is explained by this model. http://www.scielo.br/scielo.php?script=sci_arttext&pid=S0101-20612015000100095

Lets take model five and build a poisson regression model using smoothing splines. We will also use predictors that had a higher correlation with the response variable. These predictors are labelappeal, STARS, and acidindex.

## Loading required package: splines
## Loading required package: foreach
## Loaded gam 1.15
## 
## Call: gam(formula = TARGET ~ s(LabelAppeal) + s(AcidIndex) + s(STARS), 
##     family = poisson(link = log), data = wine_training3)
## Deviance Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8576 -0.5427  0.1516  0.6300  2.8615 
## 
## (Dispersion Parameter for poisson family taken to be 1)
## 
##     Null Deviance: 22860.89 on 12794 degrees of freedom
## Residual Deviance: 17735.95 on 12783 degrees of freedom
## AIC: 49701.97 
## 
## Number of Local Scoring Iterations: 6 
## 
## Anova for Parametric Effects
##                   Df  Sum Sq Mean Sq F value    Pr(>F)    
## s(LabelAppeal)     1  1954.2 1954.19 2049.64 < 2.2e-16 ***
## s(AcidIndex)       1   812.3  812.31  851.98 < 2.2e-16 ***
## s(STARS)           1  1305.1 1305.12 1368.87 < 2.2e-16 ***
## Residuals      12783 12187.7    0.95                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Anova for Nonparametric Effects
##                Npar Df Npar Chisq    P(Chi)    
## (Intercept)                                    
## s(LabelAppeal)       3      58.53 1.212e-12 ***
## s(AcidIndex)         3     184.12 < 2.2e-16 ***
## s(STARS)             2     578.91 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Anova for Nonparametric Effects
##                Npar Df Npar Chisq    P(Chi)    
## (Intercept)                                    
## s(LabelAppeal)       3      58.53 1.212e-12 ***
## s(AcidIndex)         3     184.12 < 2.2e-16 ***
## s(STARS)             2     578.91 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

##      res.deviance    df             p
## [1,]     17735.95 12783 3.975734e-169
##                Odds ratio     2.5 %    97.5 %
## (Intercept)     4.5606391 4.2257272 4.9220947
## s(LabelAppeal)  1.2076470 1.1931274 1.2223432
## s(AcidIndex)    0.8902623 0.8822326 0.8983651
## s(STARS)        1.2511907 1.2360710 1.2664953
## [1] 0.2241794

## [1] 3.975734e-169

The poisson model built with smoothing splines yielded the best psuedo r squared value. The predictors are significant with low p values and the smoothing parameters all yield low p values as well. This indicates that there is evidence that the selected predictors form a good overall fit. Using splines yields a marginally better psuedo r square but has better proportion. The odds ratios in this simple model are also the most interpretable. We see that wine sales are 6 times more likley to increase with a unit increase in label appeal and stars. This makes sense since label appeal measures how desirable a wine looks to a customer. Stars is a quality rating. Both stars and label appeal lead to increased sales.

We proceed to building negative binomial models and optimizing said models.

## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached

## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## 
## Call:
## glm.nb(formula = TARGET ~ FixedAcidity + VolatileAcidity + CitricAcid + 
##     ResidualSugar + Chlorides + FreeSulfurDioxide + TotalSulfurDioxide + 
##     Density + pH + Sulphates + Alcohol + LabelAppeal + AcidIndex + 
##     STARS, data = wine_training3, init.theta = 39167.5272, link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2791  -0.5084   0.1987   0.6366   2.7592  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         2.053e+00  1.963e-01  10.460  < 2e-16 ***
## FixedAcidity       -7.867e-04  1.046e-03  -0.752 0.452214    
## VolatileAcidity    -5.881e-02  9.416e-03  -6.246 4.22e-10 ***
## CitricAcid          1.734e-02  8.292e-03   2.091 0.036549 *  
## ResidualSugar       1.821e-05  2.078e-04   0.088 0.930181    
## Chlorides          -4.456e-02  2.230e-02  -1.998 0.045711 *  
## FreeSulfurDioxide   9.112e-05  4.782e-05   1.905 0.056717 .  
## TotalSulfurDioxide  1.167e-04  3.165e-05   3.688 0.000226 ***
## Density            -4.520e-01  1.922e-01  -2.352 0.018661 *  
## pH                 -2.349e-02  7.636e-03  -3.077 0.002094 ** 
## Sulphates          -2.297e-02  8.229e-03  -2.791 0.005248 ** 
## Alcohol             5.905e-03  1.445e-03   4.087 4.37e-05 ***
## LabelAppeal         1.963e-01  6.021e-03  32.605  < 2e-16 ***
## AcidIndex          -1.233e-01  4.454e-03 -27.681  < 2e-16 ***
## STARS               2.211e-01  6.466e-03  34.200  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(39167.53) family taken to be 1)
## 
##     Null deviance: 22860  on 12794  degrees of freedom
## Residual deviance: 18474  on 12780  degrees of freedom
## AIC: 50449
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  39168 
##           Std. Err.:  59671 
## Warning while fitting theta: iteration limit reached 
## 
##  2 x log-likelihood:  -50416.88
## Warning in anova.negbin(nmod1, test = "Chisq"): tests made without re-
## estimating 'theta'
## Analysis of Deviance Table
## 
## Model: Negative Binomial(39167.53), link: log
## 
## Response: TARGET
## 
## Terms added sequentially (first to last)
## 
## 
##                    Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## NULL                               12794      22860              
## FixedAcidity        1    44.59     12793      22815 2.432e-11 ***
## VolatileAcidity     1    77.88     12792      22737 < 2.2e-16 ***
## CitricAcid          1     2.84     12791      22734 0.0916850 .  
## ResidualSugar       1     0.15     12790      22734 0.6973393    
## Chlorides           1    11.55     12789      22723 0.0006774 ***
## FreeSulfurDioxide   1     8.03     12788      22715 0.0045905 ** 
## TotalSulfurDioxide  1    13.89     12787      22701 0.0001939 ***
## Density             1    20.14     12786      22681 7.205e-06 ***
## pH                  1     1.03     12785      22680 0.3099861    
## Sulphates           1    13.55     12784      22666 0.0002317 ***
## Alcohol             1    62.19     12783      22604 3.117e-15 ***
## LabelAppeal         1  1974.97     12782      20629 < 2.2e-16 ***
## AcidIndex           1  1006.15     12781      19623 < 2.2e-16 ***
## STARS               1  1148.84     12780      18474 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

##      res.deviance    df             p
## [1,]     18473.87 12780 1.634151e-216
##                    Odds ratio     2.5 %     97.5 %
## (Intercept)         7.7905888 5.3029584 11.4451727
## FixedAcidity        0.9992136 0.9971663  1.0012652
## VolatileAcidity     0.9428829 0.9256408  0.9604462
## CitricAcid          1.0174879 1.0010852  1.0341593
## ResidualSugar       1.0000182 0.9996110  1.0004256
## Chlorides           0.9564211 0.9155186  0.9991509
## FreeSulfurDioxide   1.0000911 0.9999974  1.0001849
## TotalSulfurDioxide  1.0001167 1.0000547  1.0001788
## Density             0.6363314 0.4366222  0.9273867
## pH                  0.9767818 0.9622725  0.9915099
## Sulphates           0.9772904 0.9616539  0.9931812
## Alcohol             1.0059225 1.0030779  1.0087752
## LabelAppeal         1.2168961 1.2026209  1.2313408
## AcidIndex           0.8840196 0.8763368  0.8917697
## STARS               1.2475000 1.2317897  1.2634108

## [1] 1.634151e-216

According to this model, fixed acidity, residual sugar and free sulfur dioxide are not significant. Citric acid and ph also do not show evidence of being a good model fit. Lets remove those variables and refit the model. This model has the same fit as the poisson model.

## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached

## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## 
## Call:
## glm.nb(formula = TARGET ~ VolatileAcidity + Chlorides + TotalSulfurDioxide + 
##     Density + Alcohol + LabelAppeal + AcidIndex + STARS, data = wine_training3, 
##     init.theta = 39133.88486, link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2950  -0.4980   0.2044   0.6381   2.7688  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         1.980e+00  1.942e-01  10.192  < 2e-16 ***
## VolatileAcidity    -5.985e-02  9.417e-03  -6.356 2.07e-10 ***
## Chlorides          -4.641e-02  2.230e-02  -2.081 0.037428 *  
## TotalSulfurDioxide  1.184e-04  3.163e-05   3.744 0.000181 ***
## Density            -4.590e-01  1.921e-01  -2.389 0.016889 *  
## Alcohol             5.916e-03  1.445e-03   4.095 4.23e-05 ***
## LabelAppeal         1.964e-01  6.018e-03  32.638  < 2e-16 ***
## AcidIndex          -1.230e-01  4.383e-03 -28.069  < 2e-16 ***
## STARS               2.213e-01  6.466e-03  34.223  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(39133.88) family taken to be 1)
## 
##     Null deviance: 22860  on 12794  degrees of freedom
## Residual deviance: 18500  on 12786  degrees of freedom
## AIC: 50463
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  39134 
##           Std. Err.:  59912 
## Warning while fitting theta: iteration limit reached 
## 
##  2 x log-likelihood:  -50442.87
## Warning in anova.negbin(nmod2, test = "Chisq"): tests made without re-
## estimating 'theta'
## Analysis of Deviance Table
## 
## Model: Negative Binomial(39133.88), link: log
## 
## Response: TARGET
## 
## Terms added sequentially (first to last)
## 
## 
##                    Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## NULL                               12794      22860              
## VolatileAcidity     1    79.19     12793      22780 < 2.2e-16 ***
## Chlorides           1    11.64     12792      22769 0.0006439 ***
## TotalSulfurDioxide  1    14.82     12791      22754 0.0001184 ***
## Density             1    20.17     12790      22734 7.093e-06 ***
## Alcohol             1    62.67     12789      22671 2.449e-15 ***
## LabelAppeal         1  1980.48     12788      20691 < 2.2e-16 ***
## AcidIndex           1  1040.51     12787      19650 < 2.2e-16 ***
## STARS               1  1150.36     12786      18500 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

##      res.deviance    df             p
## [1,]     18499.85 12786 8.948867e-218
##                    Odds ratio     2.5 %     97.5 %
## (Intercept)         7.2409441 4.9482358 10.5959526
## VolatileAcidity     0.9419023 0.9246764  0.9594490
## Chlorides           0.9546526 0.9138260  0.9973032
## TotalSulfurDioxide  1.0001184 1.0000564  1.0001804
## Density             0.6319319 0.4336569  0.9208613
## Alcohol             1.0059340 1.0030893  1.0087868
## LabelAppeal         1.2170376 1.2027667  1.2314778
## AcidIndex           0.8842408 0.8766772  0.8918697
## STARS               1.2476934 1.2319803  1.2636069

## [1] 8.948867e-218

Our negative binomial models so far indicate there is not much difference between the earlier iterations of our posisson negative binomial models.

Lets build a standard negative binomial with the three variables from the last iteration of the poisson model.

## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached

## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## 
## Call:
## glm.nb(formula = TARGET ~ Alcohol + LabelAppeal + AcidIndex + 
##     STARS, data = wine_training3, init.theta = 38811.50732, link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2844  -0.4895   0.2118   0.6315   2.7140  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.513600   0.040623  37.260  < 2e-16 ***
## Alcohol      0.005629   0.001444   3.898 9.72e-05 ***
## LabelAppeal  0.196719   0.006015  32.704  < 2e-16 ***
## AcidIndex   -0.124695   0.004369 -28.539  < 2e-16 ***
## STARS        0.222389   0.006461  34.421  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(38811.51) family taken to be 1)
## 
##     Null deviance: 22860  on 12794  degrees of freedom
## Residual deviance: 18566  on 12790  degrees of freedom
## AIC: 50521
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  38812 
##           Std. Err.:  60163 
## Warning while fitting theta: iteration limit reached 
## 
##  2 x log-likelihood:  -50509.49
## Warning in anova.negbin(nmod4, test = "Chisq"): tests made without re-
## estimating 'theta'
## Analysis of Deviance Table
## 
## Model: Negative Binomial(38811.51), link: log
## 
## Response: TARGET
## 
## Terms added sequentially (first to last)
## 
## 
##             Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## NULL                        12794      22860              
## Alcohol      1    59.54     12793      22800 1.201e-14 ***
## LabelAppeal  1  1990.44     12792      20810 < 2.2e-16 ***
## AcidIndex    1  1079.60     12791      19730 < 2.2e-16 ***
## STARS        1  1163.62     12790      18566 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

##      res.deviance    df             p
## [1,]     18566.47 12790 6.094005e-222
##             Odds ratio     2.5 %    97.5 %
## (Intercept)  4.5430578 4.1953678 4.9195625
## Alcohol      1.0056448 1.0028022 1.0084955
## LabelAppeal  1.2174018 1.2031337 1.2318391
## AcidIndex    0.8827663 0.8752389 0.8903585
## STARS        1.2490567 1.2333395 1.2649741

## [1] 6.094005e-222

Finall, we can attempt to build linear regression models. There are some challenges to building linear regression models. They include the fact that the response variable is ero inflated. We can see that from our histogram.

## 
## Attaching package: 'olsrr'
## The following object is masked from 'package:MASS':
## 
##     cement
## The following object is masked from 'package:faraway':
## 
##     hsb
## The following object is masked from 'package:datasets':
## 
##     rivers
## 
## Call:
## lm(formula = TARGET ~ FixedAcidity + VolatileAcidity + CitricAcid + 
##     ResidualSugar + Chlorides + FreeSulfurDioxide + TotalSulfurDioxide + 
##     Density + pH + Sulphates + Alcohol + LabelAppeal + AcidIndex + 
##     STARS, data = wine_training3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.8571 -0.7435  0.3683  1.1240  4.7336 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         5.467e+00  5.544e-01   9.861  < 2e-16 ***
## FixedAcidity       -1.900e-03  2.935e-03  -0.647 0.517368    
## VolatileAcidity    -1.686e-01  2.600e-02  -6.483 9.29e-11 ***
## CitricAcid          5.345e-02  2.382e-02   2.244 0.024870 *  
## ResidualSugar      -1.419e-04  5.901e-04  -0.240 0.809961    
## Chlorides          -1.435e-01  6.275e-02  -2.287 0.022220 *  
## FreeSulfurDioxide   2.681e-04  1.362e-04   1.969 0.049008 *  
## TotalSulfurDioxide  3.515e-04  9.083e-05   3.870 0.000109 ***
## Density            -1.340e+00  5.441e-01  -2.463 0.013808 *  
## pH                 -6.306e-02  2.160e-02  -2.920 0.003505 ** 
## Sulphates          -6.802e-02  2.297e-02  -2.961 0.003072 ** 
## Alcohol             2.039e-02  4.090e-03   4.985 6.29e-07 ***
## LabelAppeal         5.938e-01  1.691e-02  35.116  < 2e-16 ***
## AcidIndex          -3.292e-01  1.118e-02 -29.454  < 2e-16 ***
## STARS               7.507e-01  1.950e-02  38.488  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.631 on 12780 degrees of freedom
## Multiple R-squared:  0.2841, Adjusted R-squared:  0.2833 
## F-statistic: 362.2 on 14 and 12780 DF,  p-value: < 2.2e-16

## 
##  Breusch Pagan Test for Heteroskedasticity
##  -----------------------------------------
##  Ho: the variance is constant            
##  Ha: the variance is not constant        
## 
##                Data                
##  ----------------------------------
##  Response : TARGET 
##  Variables: fitted values of TARGET 
## 
##          Test Summary          
##  ------------------------------
##  DF            =    1 
##  Chi2          =    9.602864 
##  Prob > Chi2   =    0.001942741
##       FixedAcidity    VolatileAcidity         CitricAcid 
##           1.034051           1.003856           1.002462 
##      ResidualSugar          Chlorides  FreeSulfurDioxide 
##           1.000737           1.001894           1.001163 
## TotalSulfurDioxide            Density                 pH 
##           1.004620           1.002760           1.004413 
##          Sulphates            Alcohol        LabelAppeal 
##           1.002218           1.005986           1.092379 
##          AcidIndex              STARS 
##           1.053469           1.099862

Lets see what step AIC produces

## Stepwise Model Path 
## Analysis of Deviance Table
## 
## Initial Model:
## TARGET ~ FixedAcidity + VolatileAcidity + CitricAcid + ResidualSugar + 
##     Chlorides + FreeSulfurDioxide + TotalSulfurDioxide + Density + 
##     pH + Sulphates + Alcohol + LabelAppeal + AcidIndex + STARS
## 
## Final Model:
## TARGET ~ VolatileAcidity + CitricAcid + Chlorides + FreeSulfurDioxide + 
##     TotalSulfurDioxide + Density + pH + Sulphates + Alcohol + 
##     LabelAppeal + AcidIndex + STARS
## 
## 
##              Step Df  Deviance Resid. Df Resid. Dev      AIC
## 1                                  12780   33990.22 12530.95
## 2 - ResidualSugar  1 0.1538105     12781   33990.37 12529.01
## 3  - FixedAcidity  1 1.1201094     12782   33991.49 12527.43

Formulate the model generated by step

lmod2 <- lm(TARGET ~ VolatileAcidity + CitricAcid + 
    Chlorides + FreeSulfurDioxide + TotalSulfurDioxide + Density + 
    pH + Sulphates + Alcohol + LabelAppeal + AcidIndex + STARS
 ,  data=wine_training3);

summary(lmod2);
## 
## Call:
## lm(formula = TARGET ~ VolatileAcidity + CitricAcid + Chlorides + 
##     FreeSulfurDioxide + TotalSulfurDioxide + Density + pH + Sulphates + 
##     Alcohol + LabelAppeal + AcidIndex + STARS, data = wine_training3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.8568 -0.7428  0.3693  1.1235  4.7350 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         5.455e+00  5.540e-01   9.847  < 2e-16 ***
## VolatileAcidity    -1.686e-01  2.600e-02  -6.487 9.08e-11 ***
## CitricAcid          5.369e-02  2.382e-02   2.254 0.024182 *  
## Chlorides          -1.433e-01  6.275e-02  -2.283 0.022426 *  
## FreeSulfurDioxide   2.685e-04  1.362e-04   1.972 0.048683 *  
## TotalSulfurDioxide  3.515e-04  9.081e-05   3.870 0.000109 ***
## Density            -1.337e+00  5.440e-01  -2.457 0.014022 *  
## pH                 -6.317e-02  2.159e-02  -2.926 0.003444 ** 
## Sulphates          -6.818e-02  2.297e-02  -2.969 0.002996 ** 
## Alcohol             2.040e-02  4.090e-03   4.989 6.16e-07 ***
## LabelAppeal         5.939e-01  1.691e-02  35.122  < 2e-16 ***
## AcidIndex          -3.305e-01  1.100e-02 -30.055  < 2e-16 ***
## STARS               7.507e-01  1.950e-02  38.491  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.631 on 12782 degrees of freedom
## Multiple R-squared:  0.284,  Adjusted R-squared:  0.2834 
## F-statistic: 422.6 on 12 and 12782 DF,  p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(lmod2)

hist(resid(lmod2), main="Histogram of Residuals");
ols_test_breusch_pagan(lmod2);
## 
##  Breusch Pagan Test for Heteroskedasticity
##  -----------------------------------------
##  Ho: the variance is constant            
##  Ha: the variance is not constant        
## 
##                Data                
##  ----------------------------------
##  Response : TARGET 
##  Variables: fitted values of TARGET 
## 
##          Test Summary          
##  ------------------------------
##  DF            =    1 
##  Chi2          =    9.600033 
##  Prob > Chi2   =    0.001945739
vif(lmod2);
##    VolatileAcidity         CitricAcid          Chlorides 
##           1.003832           1.002177           1.001865 
##  FreeSulfurDioxide TotalSulfurDioxide            Density 
##           1.001064           1.004416           1.002690 
##                 pH          Sulphates            Alcohol 
##           1.004345           1.002012           1.005931 
##        LabelAppeal          AcidIndex              STARS 
##           1.092339           1.019707           1.099804
plot(predict(lmod2),wine_training3$TARGET,
      xlab="predicted",ylab="actual")
 abline(a=0,b=1)

The results of the constant variance test indicate that there is non constant variance,however residuals are closely normal in the qq plot. The VIF numbers are mostly around one, meaning there is not indication of strong multi-colinearity.

We conclude the model building by constructing an additive linear model

## 
## Attaching package: 'ISLR'
## The following object is masked from 'package:vcd':
## 
##     Hitters
## 
## Call:
## lm(formula = TARGET ~ VolatileAcidity + CitricAcid + Chlorides + 
##     FreeSulfurDioxide + TotalSulfurDioxide + Density + pH + Sulphates + 
##     Alcohol + s(LabelAppeal) + AcidIndex + s(STARS), data = wine_training3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.8568 -0.7428  0.3693  1.1235  4.7350 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         5.455e+00  5.540e-01   9.847  < 2e-16 ***
## VolatileAcidity    -1.686e-01  2.600e-02  -6.487 9.08e-11 ***
## CitricAcid          5.369e-02  2.382e-02   2.254 0.024182 *  
## Chlorides          -1.433e-01  6.275e-02  -2.283 0.022426 *  
## FreeSulfurDioxide   2.685e-04  1.362e-04   1.972 0.048683 *  
## TotalSulfurDioxide  3.515e-04  9.081e-05   3.870 0.000109 ***
## Density            -1.337e+00  5.440e-01  -2.457 0.014022 *  
## pH                 -6.317e-02  2.159e-02  -2.926 0.003444 ** 
## Sulphates          -6.818e-02  2.297e-02  -2.969 0.002996 ** 
## Alcohol             2.040e-02  4.090e-03   4.989 6.16e-07 ***
## s(LabelAppeal)      5.939e-01  1.691e-02  35.122  < 2e-16 ***
## AcidIndex          -3.305e-01  1.100e-02 -30.055  < 2e-16 ***
## s(STARS)            7.507e-01  1.950e-02  38.491  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.631 on 12782 degrees of freedom
## Multiple R-squared:  0.284,  Adjusted R-squared:  0.2834 
## F-statistic: 422.6 on 12 and 12782 DF,  p-value: < 2.2e-16

## 
##  Breusch Pagan Test for Heteroskedasticity
##  -----------------------------------------
##  Ho: the variance is constant            
##  Ha: the variance is not constant        
## 
##                Data                
##  ----------------------------------
##  Response : TARGET 
##  Variables: fitted values of TARGET 
## 
##          Test Summary          
##  ------------------------------
##  DF            =    1 
##  Chi2          =    9.600033 
##  Prob > Chi2   =    0.001945739
##    VolatileAcidity         CitricAcid          Chlorides 
##           1.003832           1.002177           1.001865 
##  FreeSulfurDioxide TotalSulfurDioxide            Density 
##           1.001064           1.004416           1.002690 
##                 pH          Sulphates            Alcohol 
##           1.004345           1.002012           1.005931 
##     s(LabelAppeal)          AcidIndex           s(STARS) 
##           1.092339           1.019707           1.099804

Building an additive linear model does not seem to improve what we already know from the existing linear models.

In order have a large enough pool of models to pick from, we should consider the case of zero inflation models. This model type can be addapted for poisson regression or negative binomial regression, which are two model types we have considered till this point. Right off the bat, we can disregard the linear models due to the nature of the response variable.

The provided documentation states “Zero-inflated poisson regression is used to model count data that has an excess of zero counts. Further, theory suggests that the excess zeros are generated by a separate process from the count values and that the excess zeros can be modeled independently” https://stats.idre.ucla.edu/r/dae/zip/

Just how many zeroes are present in our dataset?

##             TARGET       FixedAcidity    VolatileAcidity 
##               2734                 39                 18 
##         CitricAcid      ResidualSugar          Chlorides 
##                115                  6                  5 
##  FreeSulfurDioxide TotalSulfurDioxide            Density 
##                 11                  7                  0 
##                 pH          Sulphates            Alcohol 
##                  0                 22                  2 
##        LabelAppeal          AcidIndex              STARS 
##               5617                  0                  0

There is a substantial amount of data that contains zero values, hence we are more than justified to use zero inflation model types. We also have some logical arguments to consider. First, lets understand our data here. We have a count of the number of wine cases sold based on marketing and chemical attributes associated with that wine. A use case could be that a stakeholder also wants to predict the probability of a wine having a zero label appeal or a zero quality rating. This could be telling of how wine sales are impacted. We also have a goodness of fit motivation to try something different. The goodness of fit tests suggest our models are not good fits.

Poisson Zero Inflated Model-We will use only the variables from the last poisson model with our lofit predictors being STARS and LabelAppeal

## Loading required package: pscl
## Classes and Methods for R developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University
## Simon Jackman
## hurdle and zeroinfl functions by Achim Zeileis
## 
## Call:
## zeroinfl(formula = TARGET ~ Alcohol + AcidIndex | STARS + LabelAppeal, 
##     data = wine_training3)
## 
## Pearson residuals:
##     Min      1Q  Median      3Q     Max 
## -1.5470 -0.5324  0.1649  0.6139  2.3798 
## 
## Count model coefficients (poisson with log link):
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.535087   0.044167  34.756  < 2e-16 ***
## Alcohol      0.010018   0.001480   6.767 1.31e-11 ***
## AcidIndex   -0.041976   0.005369  -7.818 5.37e-15 ***
## 
## Zero-inflation model coefficients (binomial with logit link):
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.16377    0.06729  -2.434   0.0149 *  
## STARS       -0.65848    0.03445 -19.114  < 2e-16 ***
## LabelAppeal  0.18011    0.02862   6.294  3.1e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Number of iterations in BFGS optimization: 12 
## Log-likelihood: -2.424e+04 on 6 Df

## Vuong Non-Nested Hypothesis Test-Statistic: 
## (test-statistic is asymptotically distributed N(0,1) under the
##  null that the models are indistinguishible)
## -------------------------------------------------------------
##               Vuong z-statistic             H_A    p-value
## Raw                    12.26657 model1 > model2 < 2.22e-16
## AIC-corrected          12.31571 model1 > model2 < 2.22e-16
## BIC-corrected          12.49891 model1 > model2 < 2.22e-16

The vuong test indicates that zero inflated poisson model is better than the regular poisson model due to the small p value. Our predictors in the count and inflation portions of the model are significant.

Lets see how this model compares to the null model

## 'log Lik.' 3.404314e-112 (df=6)

We can conclude that our model is staistically significant based on this hypothesis test.

  1. Model Selection

We need to parition a test and control data set from our larger training subset in order to predict model accuracy before we deploy on the evaluation data.

Lets show why the zero inflation poisson regression model is our best bet.

## structure(c(1.5350871583595, 0.0100184890625652, -0.0419759809109108
## ), .Names = c("(Intercept)", "Alcohol", "AcidIndex"))
## structure(c(-0.163773933205731, -0.658480007306658, 0.180109965808006
## ), .Names = c("(Intercept)", "STARS", "LabelAppeal"))

We extract the logit portion of linear portion of our model.

## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = wine_training3, statistic = f, R = 100, parallel = "snow", 
##     ncpus = 4)
## 
## 
## Bootstrap Statistics :
##          original        bias     std. error
## t1*   1.535087158 -2.650632e-03 3.572330e-02
## t2*   0.044167429 -7.028079e-06 6.540012e-04
## t3*   0.010018488 -9.778092e-05 1.007721e-03
## t4*   0.001480428 -3.339340e-06 1.416369e-05
## t5*  -0.041975982  5.950387e-04 4.414965e-03
## t6*   0.005369237  1.801399e-07 9.323353e-05
## t7*  -0.163773933  1.171076e-02 5.037472e-02
## t8*   0.067289071  3.238981e-06 5.425711e-04
## t9*  -0.658480007 -6.222192e-03 2.352005e-02
## t10*  0.034451020  3.138112e-05 3.216014e-04
## t11*  0.180109967  3.838638e-03 2.974772e-02
## t12*  0.028617846  1.927628e-05 3.328356e-04

The output here are alternating parameter estimates. tw pertains to parameter estimates,tw has the standard error, and t3 contains the bootstrap standard errors.

Confidence intervals

##                          2.5 %      97.5 %
## count_(Intercept)  1.448520589  1.62165373
## count_Alcohol      0.007116903  0.01292008
## count_AcidIndex   -0.052499493 -0.03145247
## zero_(Intercept)  -0.295658088 -0.03188978
## zero_STARS        -0.726002765 -0.59095725
## zero_LabelAppeal   0.124020018  0.23619991

How well does it predict values in our test data? Lets deploy our model on the evaluation data and look at some descriptives to compare to the training data.Before that, We can also partition our training data into a smaller subset and see actuals vs predicted

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   2.000   3.000   3.032   4.000   8.000
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.709   2.779   3.024   3.016   3.251   4.225

Deploy to production on evaluation data and compare

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   1.838   2.720   3.030   3.014   3.292   4.175     841
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   2.000   3.000   3.029   4.000   8.000

It looks like the distribution of our predicted values are roughly the same as the distirbution of the actuals. We can conclude that the zero inflated poisson model is our best model to predict the number of wine sales.

## 
## Call:
## zeroinfl(formula = TARGET ~ Alcohol + AcidIndex | STARS + LabelAppeal, 
##     data = wine_training3)
## 
## Pearson residuals:
##     Min      1Q  Median      3Q     Max 
## -1.5470 -0.5324  0.1649  0.6139  2.3798 
## 
## Count model coefficients (poisson with log link):
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.535087   0.044167  34.756  < 2e-16 ***
## Alcohol      0.010018   0.001480   6.767 1.31e-11 ***
## AcidIndex   -0.041976   0.005369  -7.818 5.37e-15 ***
## 
## Zero-inflation model coefficients (binomial with logit link):
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.16377    0.06729  -2.434   0.0149 *  
## STARS       -0.65848    0.03445 -19.114  < 2e-16 ***
## LabelAppeal  0.18011    0.02862   6.294  3.1e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Number of iterations in BFGS optimization: 12 
## Log-likelihood: -2.424e+04 on 6 Df

Appendix)

url <- 'https://raw.githubusercontent.com/vindication09/DATA-621-Week-5/master/wine-training-data.csv'
url2<-'https://raw.githubusercontent.com/vindication09/DATA-621-Week-5/master/wine-evaluation-data.csv'

wine_training <- read.csv(url, header = TRUE)
wine_evaluation <- read.csv(url2, header = TRUE)


head(wine_training,10)
wine_training2<-subset(wine_training, select=-c(INDEX))
#wine_training2<-subset(wine_training, select=-c(INDEX))
wine_evaluation2<-subset(wine_evaluation, select=-c(IN))
names(wine_training2)
str(wine_training2)
#install.packages('DataExplorer) 
library(DataExplorer)
plot_str(wine_training2)
plot_missing(wine_training2)
plot_histogram(wine_training2);plot_density(wine_training2)
barplot(table(wine_training2$TARGET), ylim=c(0, 5000), xlab="Result", ylab="N", col="black",
        main="Distribution of Target(Response)")
summary(wine_training2)
summary(wine_training2$TARGET);var(wine_training2$TARGET)
#12,795
colSums(wine_training2 < 0)
#has.neg <- apply(wine_training2, 1, function(row) any(row < 0))
#which(has.neg)
apply(wine_training2,2,  function(col)cor(col, wine_training2$TARGET))
#correlation matrix and visualization 
correlation_matrix <- round(cor(wine_training2),2)

# Get lower triangle of the correlation matrix
  get_lower_tri<-function(correlation_matrix){
    correlation_matrix[upper.tri(correlation_matrix)] <- NA
    return(correlation_matrix)
  }
  # Get upper triangle of the correlation matrix
  get_upper_tri <- function(correlation_matrix){
    correlation_matrix[lower.tri(correlation_matrix)]<- NA
    return(correlation_matrix)
  }
  
  upper_tri <- get_upper_tri(correlation_matrix)



library(reshape2)

# Melt the correlation matrix
melted_correlation_matrix <- melt(upper_tri, na.rm = TRUE)

# Heatmap
library(ggplot2)

ggheatmap <- ggplot(data = melted_correlation_matrix, aes(Var2, Var1, fill = value))+
 geom_tile(color = "white")+
 scale_fill_gradient2(low = "blue", high = "red", mid = "white", 
   midpoint = 0, limit = c(-1,1), space = "Lab", 
   name="Pearson\nCorrelation") +
  theme_minimal()+ 
 theme(axis.text.x = element_text(angle = 45, vjust = 1, 
    size = 15, hjust = 1))+
 coord_fixed()


#add nice labels 
ggheatmap + 
geom_text(aes(Var2, Var1, label = value), color = "black", size = 3) +
theme(
  axis.title.x = element_blank(),
  axis.title.y = element_blank(),
  axis.text.x=element_text(size=rel(0.8), angle=90),
  axis.text.y=element_text(size=rel(0.8)),
  panel.grid.major = element_blank(),
  panel.border = element_blank(),
  panel.background = element_blank(),
  axis.ticks = element_blank(),
  legend.justification = c(1, 0),
  legend.position = c(0.6, 0.7),
  legend.direction = "horizontal")+
  guides(fill = guide_colorbar(barwiwine_training3h = 7, barheight = 1,
                title.position = "top", title.hjust = 0.5))
outlierKD<-function(wine_training2, var) {
     var_name <- eval(substitute(var),eval(wine_training2))
     na1 <- sum(is.na(var_name))
     m1 <- mean(var_name, na.rm = T)
     par(mfrow=c(2, 2), oma=c(0,0,3,0))
     boxplot(var_name, main="With outliers")
     hist(var_name, main="With outliers", xlab=NA, ylab=NA)
     outlier <- boxplot.stats(var_name)$out
     mo <- mean(outlier)
     var_name <- ifelse(var_name %in% outlier, NA, var_name)
     boxplot(var_name, main="Without outliers")
     hist(var_name, main="Without outliers", xlab=NA, ylab=NA)
     title("Outlier Check", outer=TRUE)
     na2 <- sum(is.na(var_name))
     cat("Outliers identified:", na2 - na1, "n")
     cat("Propotion (%) of outliers:", round((na2 - na1) / sum(!is.na(var_name))*100, 1), "n")
     cat("Mean of the outliers:", round(mo, 2), "n")
     m2 <- mean(var_name, na.rm = T)
     cat("Mean without removing outliers:", round(m1, 2), "n")
     cat("Mean if we remove outliers:", round(m2, 2), "n")
     response <- readline(prompt="Do you want to remove outliers and to replace with NA? [yes/no]: ")
     if(response == "y" | response == "yes"){
          wine_training3[as.character(substitute(var))] <- invisible(var_name)
          assign(as.character(as.list(match.call())$wine_training2), wine_training2, envir = .GlobalEnv)
          cat("Outliers successfully removed", "n")
          return(invisible(wine_training2))
     } else{
          cat("Nothing changed", "n")
          return(invisible(var_name))
     }
}
outlierKD(wine_training2, TARGET)
outlierKD(wine_training2, Chlorides)
outlierKD(wine_training2, Alcohol)
outlierKD(wine_training2, FixedAcidity)
outlierKD(wine_training2, FreeSulfurDioxide)
outlierKD(wine_training2, LabelAppeal)
outlierKD(wine_training2, VolatileAcidity)
outlierKD(wine_training2, TotalSulfurDioxide)
outlierKD(wine_training2, AcidIndex)
outlierKD(wine_training2, CitricAcid)
outlierKD(wine_training2, Density)
outlierKD(wine_training2, ResidualSugar)
outlierKD(wine_training2, pH)
outlierKD(wine_training2, STARS)
outlierKD(wine_training2, Sulphates)
colSums(is.na(wine_training2))
library(Hmisc)

wine_training3<-wine_training2

wine_training3$STARS<-impute(wine_training3$STARS, median)

#make an additional subset that retains the same values but simply removes negative values (not possible)
wine_training_redux <- wine_training2[wine_training2$Alcohol >= 0 && wine_training2$Sulphates >= 0
                                      && wine_training2$Sulphates >= 0
                                      && wine_training2$TotalSulfurDioxide >= 0
                                      && wine_training2$FreeSulfurDioxide >= 0
                                      && wine_training2$Chlorides >= 0
                                      && wine_training2$ResidualSugar
                                      && wine_training2$CitricAcid
                                      && wine_training2$VolatileAcidity >= 0
                                      && wine_training2$FixedAcidity >= 0,]

#wine_training_redux <- wine_training2[wine_training2$Sulphates >= 0,] 
#wine_training_redux <- wine_training2[wine_training2$TotalSulfurDioxide >= 0,]
#wine_training_redux <- wine_training2[wine_training2$FreeSulfurDioxide >= 0, ]
#wine_training_redux <- wine_training2[wine_training2$Chlorides >= 0, ]
#wine_training_redux <- wine_training2[wine_training2$ResidualSugar >= 0,]
#wine_training_redux <- wine_training2[wine_training2$CitricAcid >= 0,]
#wine_training_redux <- wine_training2[wine_training2$VolatileAcidity >= 0,]
#wine_training_redux <- wine_training2[wine_training2$FixedAcidity >= 0,]
summary(wine_training3$STARS);summary(wine_training2$STARS)
barplot(table(wine_training3$STARS), ylim=c(0, 7000), xlab="Rating (post impute)", ylab="N", col="black");
barplot(table(wine_training2$STARS), ylim=c(0, 7000), xlab="Rating (pre impute)", ylab="N", col="black")

colSums(wine_training3<0);colSums(is.na(wine_training3))
wine_training3$Sulphates<-abs(wine_training3$Sulphates)
wine_training3$pH<-abs(wine_training3$pH)
wine_training3$ResidualSugar<-abs(wine_training3$ResidualSugar)
wine_training3$Chlorides<-abs(wine_training3$Chlorides)
wine_training3$FreeSulfurDioxide<-abs(wine_training3$FreeSulfurDioxide)
wine_training3$TotalSulfurDioxide<-abs(wine_training3$TotalSulfurDioxide)
wine_training3$VolatileAcidity<-abs(wine_training3$VolatileAcidity)
wine_training3$Alcohol<-abs(wine_training3$ Alcohol)
wine_training3$CitricAcid<-abs(wine_training3$CitricAcid)
wine_training3$FixedAcidity<-abs(wine_training3$FixedAcidity)

wine_evaluation3<-wine_evaluation

wine_evaluation3$Sulphates<-abs(wine_evaluation3$Sulphates)
wine_evaluation3$pH<-abs(wine_evaluation3$pH)
wine_evaluation3$ResidualSugar<-abs(wine_evaluation3$ResidualSugar)
wine_evaluation3$Chlorides<-abs(wine_evaluation3$Chlorides)
wine_evaluation3$FreeSulfurDioxide<-abs(wine_evaluation3$FreeSulfurDioxide)
wine_evaluation3$TotalSulfurDioxide<-abs(wine_evaluation3$TotalSulfurDioxide)
wine_evaluation3$VolatileAcidity<-abs(wine_evaluation3$VolatileAcidity)
wine_evaluation3$Alcohol<-abs(wine_evaluation3$ Alcohol)
wine_evaluation3$CitricAcid<-abs(wine_evaluation3$CitricAcid)
wine_evaluation3$FixedAcidity<-abs(wine_evaluation3$FixedAcidity)
wine_training3$Sulphates<-impute(wine_training3$Sulphates, median)
wine_training3$pH<-impute(wine_training3$pH, median)
wine_training3$ResidualSugar<-impute(wine_training3$ResidualSugar, median)
wine_training3$Chlorides<-impute(wine_training3$Chlorides, median)
wine_training3$FreeSulfurDioxide<-impute(wine_training3$FreeSulfurDioxide, median)
wine_training3$TotalSulfurDioxide<-impute(wine_training3$TotalSulfurDioxide, median)
wine_training3$Alcohol<-impute(wine_training3$Alcohol, median)

wine_evaluation3$Sulphates<-impute(wine_evaluation3$Sulphates, median)
wine_evaluation3$pH<-impute(wine_evaluation3$pH, median)
wine_evaluation3$ResidualSugar<-impute(wine_evaluation3$ResidualSugar, median)
wine_evaluation3$Chlorides<-impute(wine_evaluation3$Chlorides, median)
wine_evaluation3$FreeSulfurDioxide<-impute(wine_evaluation3$FreeSulfurDioxide, median)
wine_evaluation3$TotalSulfurDioxide<-impute(wine_evaluation3$TotalSulfurDioxide, median)
wine_evaluation3$Alcohol<-impute(wine_evaluation3$Alcohol, median)
#wine_evaluation3$TARGET<-impute(wine_evaluation3$Alcohol, median)



summary(wine_training3)
plot_density(wine_training3)
#testing
wine_training4<-wine_training3

wine_training4$Sulphates<-log(wine_training4$Sulphates+1)
wine_training4$pH<-log(wine_training4$pH+1)
wine_training4$ResidualSugar<-log(wine_training4$ResidualSugar+1)
wine_training4$Chlorides<-log(wine_training4$Chlorides+1)
wine_training4$FreeSulfurDioxide<-log(wine_training4$FreeSulfurDioxide+1)
wine_training4$TotalSulfurDioxide<-log(wine_training4$TotalSulfurDioxide+1)
wine_training4$Alcohol<-log(wine_training4$Alcohol+1)


plot_density(wine_training4);plot_density(wine_training2)

#correlation matrix and visualization 
correlation_matrix <- round(cor(wine_training3),2)

# Get lower triangle of the correlation matrix
  get_lower_tri<-function(correlation_matrix){
    correlation_matrix[upper.tri(correlation_matrix)] <- NA
    return(correlation_matrix)
  }
  # Get upper triangle of the correlation matrix
  get_upper_tri <- function(correlation_matrix){
    correlation_matrix[lower.tri(correlation_matrix)]<- NA
    return(correlation_matrix)
  }
  
  upper_tri <- get_upper_tri(correlation_matrix)



library(reshape2)

# Melt the correlation matrix
melted_correlation_matrix <- melt(upper_tri, na.rm = TRUE)

# Heatmap
library(ggplot2)

ggheatmap <- ggplot(data = melted_correlation_matrix, aes(Var2, Var1, fill = value))+
 geom_tile(color = "white")+
 scale_fill_gradient2(low = "blue", high = "red", mid = "white", 
   midpoint = 0, limit = c(-1,1), space = "Lab", 
   name="Pearson\nCorrelation") +
  theme_minimal()+ 
 theme(axis.text.x = element_text(angle = 45, vjust = 1, 
    size = 15, hjust = 1))+
 coord_fixed()


#add nice labels 
ggheatmap + 
geom_text(aes(Var2, Var1, label = value), color = "black", size = 3) +
theme(
  axis.title.x = element_blank(),
  axis.title.y = element_blank(),
  axis.text.x=element_text(size=rel(0.8), angle=90),
  axis.text.y=element_text(size=rel(0.8)),
  panel.grid.major = element_blank(),
  panel.border = element_blank(),
  panel.background = element_blank(),
  axis.ticks = element_blank(),
  legend.justification = c(1, 0),
  legend.position = c(0.6, 0.7),
  legend.direction = "horizontal")+
  guides(fill = guide_colorbar(barwiwine_training3h = 7, barheight = 1,
                title.position = "top", title.hjust = 0.5))
#correlation matrix and visualization 
correlation_matrix <- round(cor(wine_training4),2)

# Get lower triangle of the correlation matrix
  get_lower_tri<-function(correlation_matrix){
    correlation_matrix[upper.tri(correlation_matrix)] <- NA
    return(correlation_matrix)
  }
  # Get upper triangle of the correlation matrix
  get_upper_tri <- function(correlation_matrix){
    correlation_matrix[lower.tri(correlation_matrix)]<- NA
    return(correlation_matrix)
  }
  
  upper_tri <- get_upper_tri(correlation_matrix)



library(reshape2)

# Melt the correlation matrix
melted_correlation_matrix <- melt(upper_tri, na.rm = TRUE)

# Heatmap
library(ggplot2)

ggheatmap <- ggplot(data = melted_correlation_matrix, aes(Var2, Var1, fill = value))+
 geom_tile(color = "white")+
 scale_fill_gradient2(low = "blue", high = "red", mid = "white", 
   midpoint = 0, limit = c(-1,1), space = "Lab", 
   name="Pearson\nCorrelation") +
  theme_minimal()+ 
 theme(axis.text.x = element_text(angle = 45, vjust = 1, 
    size = 15, hjust = 1))+
 coord_fixed()


#add nice labels 
ggheatmap + 
geom_text(aes(Var2, Var1, label = value), color = "black", size = 3) +
theme(
  axis.title.x = element_blank(),
  axis.title.y = element_blank(),
  axis.text.x=element_text(size=rel(0.8), angle=90),
  axis.text.y=element_text(size=rel(0.8)),
  panel.grid.major = element_blank(),
  panel.border = element_blank(),
  panel.background = element_blank(),
  axis.ticks = element_blank(),
  legend.justification = c(1, 0),
  legend.position = c(0.6, 0.7),
  legend.direction = "horizontal")+
  guides(fill = guide_colorbar(barwiwine_training3h = 7, barheight = 1,
                title.position = "top", title.hjust = 0.5))
library(vcd)
library(faraway)
library(AER)
library(boot)

pmod <- glm(TARGET~., family="poisson", data=wine_training3)
summary(pmod);



#goodness of fit
anova(pmod, test="Chisq");

glm.diag.plots(pmod, glmdiag = glm.diag(pmod), subset = NULL,
               iden = FALSE, labels = NULL, ret = FALSE)

#cov.pmod <- vcovHC(pmod, type="HC0")
#std.err <- sqrt(diag(cov.pmod))
#r.est <- cbind(Estimate= coef(pmod), "Robust SE" = std.err,
#"Pr(>|z|)" = 2 * pnorm(abs(coef(pmod)/std.err), lower.tail=FALSE),
#LL = coef(pmod) - 1.96 * std.err,
#UL = coef(pmod) + 1.96 * std.err);

#r.est;

with(pmod, cbind(res.deviance = deviance, df = df.residual,p = pchisq(deviance, df.residual, lower.tail=FALSE)));

p1<-1-(18475/22861)
p1;

pchisq(pmod$deviance, df=pmod$df.residual, lower.tail=FALSE)

#dispersion test

#deviance(pmod)/pmod$df.residual
#dispersiontest(pmod);

#halfnorm(residuals(pmod))
library(MASS)
step<-stepAIC(pmod, trace=FALSE)
step$anova

pmod2<-glm(TARGET ~ VolatileAcidity + CitricAcid + Chlorides + FreeSulfurDioxide + 
    TotalSulfurDioxide + Density + pH + Sulphates + Alcohol + 
    LabelAppeal + AcidIndex + STARS, family="poisson",  data=wine_training3)

summary(pmod2);

#goodness of fit
anova(pmod2, test="Chisq");

glm.diag.plots(pmod2, glmdiag = glm.diag(pmod2), subset = NULL,
               iden = FALSE, labels = NULL, ret = FALSE)

#cov.pmod <- vcovHC(pmod, type="HC0")
#std.err <- sqrt(diag(cov.pmod))
#r.est <- cbind(Estimate= coef(pmod), "Robust SE" = std.err,
#"Pr(>|z|)" = 2 * pnorm(abs(coef(pmod)/std.err), lower.tail=FALSE),
#LL = coef(pmod) - 1.96 * std.err,
#UL = coef(pmod) + 1.96 * std.err);

#r.est;

with(pmod2, cbind(res.deviance = deviance, df = df.residual,p = pchisq(deviance, df.residual, lower.tail=FALSE)));

p2<-1-(18475/22861)
p2;

pchisq(pmod2$deviance, df=pmod2$df.residual, lower.tail=FALSE)

#dispersion test

#deviance(pmod)/pmod$df.residual
#dispersiontest(pmod);

#halfnorm(residuals(pmod))
pmod3<-glm(TARGET ~ CitricAcid + Chlorides + FreeSulfurDioxide + 
    TotalSulfurDioxide + Density + Sulphates + Alcohol + 
    LabelAppeal + AcidIndex + STARS, family="poisson",  data=wine_training3)

summary(pmod3);

#goodness of fit
anova(pmod3, test="Chisq");

glm.diag.plots(pmod3, glmdiag = glm.diag(pmod3), subset = NULL,
               iden = FALSE, labels = NULL, ret = FALSE)

#cov.pmod <- vcovHC(pmod, type="HC0")
#std.err <- sqrt(diag(cov.pmod))
#r.est <- cbind(Estimate= coef(pmod), "Robust SE" = std.err,
#"Pr(>|z|)" = 2 * pnorm(abs(coef(pmod)/std.err), lower.tail=FALSE),
#LL = coef(pmod) - 1.96 * std.err,
#UL = coef(pmod) + 1.96 * std.err);

#r.est;

with(pmod3, cbind(res.deviance = deviance, df = df.residual,p = pchisq(deviance, df.residual, lower.tail=FALSE)));

p3<-1-(18475/22861)
p3

pchisq(pmod3$deviance, df=pmod3$df.residual, lower.tail=FALSE)
#dispersion test

#deviance(pmod)/pmod$df.residual
#dispersiontest(pmod);

#halfnorm(residuals(pmod))
pmod4<-glm(TARGET ~ Chlorides + FreeSulfurDioxide + 
    TotalSulfurDioxide + Density + Sulphates + Alcohol + 
    LabelAppeal + AcidIndex + STARS, family="poisson",  data=wine_training3)

summary(pmod4);

#goodness of fit
anova(pmod4, test="Chisq");

glm.diag.plots(pmod4, glmdiag = glm.diag(pmod4), subset = NULL,
               iden = FALSE, labels = NULL, ret = FALSE)

#cov.pmod <- vcovHC(pmod, type="HC0")
#std.err <- sqrt(diag(cov.pmod))
#r.est <- cbind(Estimate= coef(pmod), "Robust SE" = std.err,
#"Pr(>|z|)" = 2 * pnorm(abs(coef(pmod)/std.err), lower.tail=FALSE),
#LL = coef(pmod) - 1.96 * std.err,
#UL = coef(pmod) + 1.96 * std.err);

#r.est;

with(pmod4, cbind(res.deviance = deviance, df = df.residual,p = pchisq(deviance, df.residual, lower.tail=FALSE)));

exp(cbind("Odds ratio" = coef(pmod4), confint.default(pmod4, level = 0.95)));

p4<-1-(18475/22861)
p4;

plot(log(fitted(pmod4)), log((wine_training3$TARGET-fitted(pmod4))^2), xlab=expression(hat(mu)), ylab=expression((y-hat(mu))^2))
abline(0, 1);

#goodness of fit
pchisq(pmod4$deviance, df=pmod4$df.residual, lower.tail=FALSE)

#dispersion test

#deviance(pmod)/pmod$df.residual
#dispersiontest(pmod);

#halfnorm(residuals(pmod))
library(gam)

pmod_smooth<-gam(TARGET ~ s(LabelAppeal)+s(AcidIndex)+s(STARS) , family=poisson(link=log),  data=wine_training3)

#pmod_smooth<-gam(TARGET ~ LabelAppeal+AcidIndex+STARS , family=poisson(link=log),  data=wine_training3)

summary(pmod_smooth);

#goodness of fit
anova(pmod_smooth, test="Chisq");

glm.diag.plots(pmod_smooth, glmdiag = glm.diag(pmod_smooth), subset = NULL,
               iden = FALSE, labels = NULL, ret = FALSE)

#cov.pmod <- vcovHC(pmod, type="HC0")
#std.err <- sqrt(diag(cov.pmod))
#r.est <- cbind(Estimate= coef(pmod), "Robust SE" = std.err,
#"Pr(>|z|)" = 2 * pnorm(abs(coef(pmod)/std.err), lower.tail=FALSE),
#LL = coef(pmod) - 1.96 * std.err,
#UL = coef(pmod) + 1.96 * std.err);

#r.est;

with(pmod_smooth, cbind(res.deviance = deviance, df = df.residual,p = pchisq(deviance, df.residual, lower.tail=FALSE)));

exp(cbind("Odds ratio" = coef(pmod_smooth), confint.default(pmod_smooth, level = 0.95)));

p_smooth<-1-(17735.95/22860.89)
p_smooth;

plot(log(fitted(pmod_smooth)), log((wine_training3$TARGET-fitted(pmod_smooth))^2), xlab=expression(hat(mu)), ylab=expression((y-hat(mu))^2))
abline(0, 1);

#goodness of fit
# 1-pchisq(summary(pmod_smooth)$deviance, summary(pmod_smooth)$df.residual)
pchisq(pmod_smooth$deviance, df=pmod_smooth$df.residual, lower.tail=FALSE)

#dispersion test

#deviance(pmod)/pmod$df.residual
#dispersiontest(pmod);

#halfnorm(residuals(pmod))
#plot(p_smooth, pages = 1, scheme = 1, all.terms = TRUE, seWithMean = TRUE)
library(MASS)
nmod1 <- glm.nb(TARGET ~ FixedAcidity + VolatileAcidity + CitricAcid + ResidualSugar + 
    Chlorides + FreeSulfurDioxide + TotalSulfurDioxide + Density + 
    pH + Sulphates + Alcohol + LabelAppeal + AcidIndex + STARS
 ,  data=wine_training3)

summary(nmod1);

#goodness of fit
anova(nmod1, test="Chisq");

glm.diag.plots(nmod1, glmdiag = glm.diag(nmod1), subset = NULL,
               iden = FALSE, labels = NULL, ret = FALSE)

#cov.pmod <- vcovHC(pmod, type="HC0")
#std.err <- sqrt(diag(cov.pmod))
#r.est <- cbind(Estimate= coef(pmod), "Robust SE" = std.err,
#"Pr(>|z|)" = 2 * pnorm(abs(coef(pmod)/std.err), lower.tail=FALSE),
#LL = coef(pmod) - 1.96 * std.err,
#UL = coef(pmod) + 1.96 * std.err);

#r.est;

with(nmod1, cbind(res.deviance = deviance, df = df.residual,p = pchisq(deviance, df.residual, lower.tail=FALSE)));

exp(cbind("Odds ratio" = coef(nmod1), confint.default(nmod1, level = 0.95)));

#n<-1-(18474/22860)
#n;

plot(log(fitted(nmod1)), log((wine_training3$TARGET-fitted(nmod1))^2), xlab=expression(hat(mu)), ylab=expression((y-hat(mu))^2))
abline(0, 1);

#goodness of fit
pchisq(nmod1$deviance, df=nmod1$df.residual, lower.tail=FALSE)

nmod2 <- glm.nb(TARGET ~ VolatileAcidity  + 
    Chlorides + TotalSulfurDioxide + Density + Alcohol + LabelAppeal + AcidIndex + STARS
 ,  data=wine_training3)

summary(nmod2);

#goodness of fit
anova(nmod2, test="Chisq");

glm.diag.plots(nmod2, glmdiag = glm.diag(nmod2), subset = NULL,
               iden = FALSE, labels = NULL, ret = FALSE)

#cov.pmod <- vcovHC(pmod, type="HC0")
#std.err <- sqrt(diag(cov.pmod))
#r.est <- cbind(Estimate= coef(pmod), "Robust SE" = std.err,
#"Pr(>|z|)" = 2 * pnorm(abs(coef(pmod)/std.err), lower.tail=FALSE),
#LL = coef(pmod) - 1.96 * std.err,
#UL = coef(pmod) + 1.96 * std.err);

#r.est;

with(nmod2, cbind(res.deviance = deviance, df = df.residual,p = pchisq(deviance, df.residual, lower.tail=FALSE)));

exp(cbind("Odds ratio" = coef(nmod2), confint.default(nmod2, level = 0.95)));

#n2<-1-(18500/22860)
#n2;

plot(log(fitted(nmod2)), log((wine_training3$TARGET-fitted(nmod2))^2), xlab=expression(hat(mu)), ylab=expression((y-hat(mu))^2))
abline(0, 1);

#goodness of fit
pchisq(nmod2$deviance, df=nmod2$df.residual, lower.tail=FALSE)
nmod4 <- glm.nb(TARGET ~ Alcohol+LabelAppeal + AcidIndex + STARS
 ,  data=wine_training3)

summary(nmod4);

#goodness of fit
anova(nmod4, test="Chisq");

glm.diag.plots(nmod4, glmdiag = glm.diag(nmod4), subset = NULL,
               iden = FALSE, labels = NULL, ret = FALSE)

#cov.pmod <- vcovHC(pmod, type="HC0")
#std.err <- sqrt(diag(cov.pmod))
#r.est <- cbind(Estimate= coef(pmod), "Robust SE" = std.err,
#"Pr(>|z|)" = 2 * pnorm(abs(coef(pmod)/std.err), lower.tail=FALSE),
#LL = coef(pmod) - 1.96 * std.err,
#UL = coef(pmod) + 1.96 * std.err);

#r.est;

with(nmod4, cbind(res.deviance = deviance, df = df.residual,p = pchisq(deviance, df.residual, lower.tail=FALSE)));

exp(cbind("Odds ratio" = coef(nmod4), confint.default(nmod4, level = 0.95)));

#n2<-1-(18566/22860)
#n2;

plot(log(fitted(nmod4)), log((wine_training3$TARGET-fitted(nmod4))^2), xlab=expression(hat(mu)), ylab=expression((y-hat(mu))^2))
abline(0, 1);

plot(predict(nmod4),wine_training3$TARGET,
      xlab="predicted",ylab="actual")
 abline(a=0,b=1);
 
 #goodness of fit
pchisq(nmod4$deviance, df=nmod4$df.residual, lower.tail=FALSE)
library(olsrr)

lmod1 <- lm(TARGET ~ FixedAcidity + VolatileAcidity + CitricAcid + ResidualSugar + 
    Chlorides + FreeSulfurDioxide + TotalSulfurDioxide + Density + 
    pH + Sulphates + Alcohol + LabelAppeal + AcidIndex + STARS
 ,  data=wine_training3);

summary(lmod1);

par(mfrow=c(2,2))
plot(lmod1)
hist(resid(lmod1), main="Histogram of Residuals");
ols_test_breusch_pagan(lmod1);
vif(lmod1);

plot(predict(lmod1),wine_training3$TARGET,
      xlab="predicted",ylab="actual")
 abline(a=0,b=1)
lstep<-stepAIC(lmod1, trace=FALSE)
lstep$anova
lmod2 <- lm(TARGET ~ VolatileAcidity + CitricAcid + 
    Chlorides + FreeSulfurDioxide + TotalSulfurDioxide + Density + 
    pH + Sulphates + Alcohol + LabelAppeal + AcidIndex + STARS
 ,  data=wine_training3);

summary(lmod2);

par(mfrow=c(2,2))
plot(lmod2)
hist(resid(lmod2), main="Histogram of Residuals");
ols_test_breusch_pagan(lmod2);
vif(lmod2);

plot(predict(lmod2),wine_training3$TARGET,
      xlab="predicted",ylab="actual")
 abline(a=0,b=1)
library(ISLR)

lmod3<-lm(TARGET ~ VolatileAcidity + CitricAcid + 
    Chlorides + FreeSulfurDioxide + TotalSulfurDioxide + Density + 
    pH + Sulphates + Alcohol + s(LabelAppeal) + AcidIndex + s(STARS)
 ,  data=wine_training3)

summary(lmod3);

par(mfrow=c(2,2))
plot(lmod3)
hist(resid(lmod3), main="Histogram of Residuals");
ols_test_breusch_pagan(lmod2);
vif(lmod3);

plot(predict(lmod3),wine_training3$TARGET,
      xlab="predicted",ylab="actual")
 abline(a=0,b=1);
 
 #goodness of fit
 #pchisq(summary(pmod7)$deviance, 
  #         summary(pmod7)$df.residual
   #        )

colSums(wine_training3==0)
require(ggplot2)
require(pscl)
require(MASS)
require(boot)

pmod7 <- zeroinfl(TARGET ~   Alcohol  + AcidIndex | STARS+LabelAppeal,
  data = wine_training3)
summary(pmod7);


#glm.diag.plots(nmod3, glmdiag = glm.diag(nmod3), subset = NULL,
 #              iden = FALSE, labels = NULL, ret = FALSE)

#cov.pmod <- vcovHC(pmod, type="HC0")
#std.err <- sqrt(diag(cov.pmod))
#r.est <- cbind(Estimate= coef(pmod), "Robust SE" = std.err,
#"Pr(>|z|)" = 2 * pnorm(abs(coef(pmod)/std.err), lower.tail=FALSE),
#LL = coef(pmod) - 1.96 * std.err,
#UL = coef(pmod) + 1.96 * std.err);

#r.est;

#with(nmod3, cbind(res.deviance = deviance, df = df.residual,p = pchisq(deviance, df.residual, lower.tail=FALSE)));

#exp(cbind("Odds ratio" = coef(pmod7), confint.default(pmod7, level = 0.95)));

#n2<-1-(18500/22860)
#n2;

plot(log(fitted(pmod7)), log((wine_training3$TARGET-fitted(pmod7))^2), xlab=expression(hat(mu)), ylab=expression((y-hat(mu))^2))
abline(0, 1)

vuong(pmod7, pmod4);

plot(predict(pmod7),wine_training3$TARGET,
      xlab="predicted",ylab="actual")
 abline(a=0,b=1);
 
#goodness of fit
# pchisq(summary(pmod7)$deviance, 
 #          summary(pmod7)$df.residual
  #         )
pnull <- update(pmod7, . ~ 1)

pchisq(2 * (logLik(pmod7) - logLik(pnull)), df = 3, lower.tail = FALSE)
library(caret)
Train <- createDataPartition(wine_training3$TARGET, p=0.7, list=FALSE)
train <- wine_training3[Train, ]
test <- wine_training3[-Train, ]
dput(coef(pmod7, "count"));dput(coef(pmod7, "zero"))
f <- function(data, i) 
  {
  require(pscl)
  m <- zeroinfl(TARGET ~   Alcohol  + AcidIndex | STARS+LabelAppeal, data = data[i, ],
    start = list(count = c(1.5350871583595, 0.0100184890625652, -0.0419759809109108
), zero = c(-0.163773933205731, -0.658480007306658, 0.180109965808006
)))
  as.vector(t(do.call(rbind, coef(summary(m)))[, 1:2]))
 }

set.seed(10)
res <- boot(wine_training3, f, R = 100, parallel = "snow", ncpus = 4)

## print results
res
confint(pmod7)
#gather predicted
test_results2<-predict(pmod7, newdata=test, type = "response")
target_pred<-data.frame(test_results2)

actuals<-subset(test,select=c(TARGET))

#plot
results<-data.frame(target_pred, actuals)

xyplot(TARGET ~ test_results2, data = results,
  xlab = "Predicted ",
  ylab = "Actuals",
  main = "Predicted Wine Sales vs Actual Wine Sales on Test Data");

plot_density(results);

summary(results$TARGET);summary(results$test_results2)
test_results<-predict(pmod7, newdata=wine_evaluation3, type = "response")
test.df<-data.frame(test_results)
summary(test_results);summary(wine_training3$TARGET)

summary(pmod7)
#read in data 
url <- 'https://raw.githubusercontent.com/vindication09/DATA-621-Week-5/master/wine-training-data.csv'
url2<-'https://raw.githubusercontent.com/vindication09/DATA-621-Week-5/master/wine-evaluation-data.csv'

wine_training <- read.csv(url, header = TRUE)
wine_evaluation <- read.csv(url2, header = TRUE)


head(wine_training,10)

wine_training2<-subset(wine_training, select=-c(INDEX))
#wine_training2<-subset(wine_training, select=-c(INDEX))
wine_evaluation2<-subset(wine_evaluation, select=-c(IN))
names(wine_training2)
str(wine_training2)

#eda
#install.packages('DataExplorer) 
library(DataExplorer)
plot_str(wine_training2)
plot_missing(wine_training2)
plot_histogram(wine_training2);plot_density(wine_training2)


barplot(table(wine_training2$TARGET), ylim=c(0, 5000), xlab="Result", ylab="N", col="black",
        main="Distribution of Target(Response)")

summary(wine_training2)

summary(wine_training2$TARGET);var(wine_training2$TARGET)

#12,795
colSums(wine_training2 < 0)
#has.neg <- apply(wine_training2, 1, function(row) any(row < 0))
#which(has.neg)

apply(wine_training2,2,  function(col)cor(col, wine_training2$TARGET))


#correlation matrix and visualization 
correlation_matrix <- round(cor(wine_training2),2)

# Get lower triangle of the correlation matrix
  get_lower_tri<-function(correlation_matrix){
    correlation_matrix[upper.tri(correlation_matrix)] <- NA
    return(correlation_matrix)
  }
  # Get upper triangle of the correlation matrix
  get_upper_tri <- function(correlation_matrix){
    correlation_matrix[lower.tri(correlation_matrix)]<- NA
    return(correlation_matrix)
  }
  
  upper_tri <- get_upper_tri(correlation_matrix)



library(reshape2)

# Melt the correlation matrix
melted_correlation_matrix <- melt(upper_tri, na.rm = TRUE)

# Heatmap
library(ggplot2)

ggheatmap <- ggplot(data = melted_correlation_matrix, aes(Var2, Var1, fill = value))+
 geom_tile(color = "white")+
 scale_fill_gradient2(low = "blue", high = "red", mid = "white", 
   midpoint = 0, limit = c(-1,1), space = "Lab", 
   name="Pearson\nCorrelation") +
  theme_minimal()+ 
 theme(axis.text.x = element_text(angle = 45, vjust = 1, 
    size = 15, hjust = 1))+
 coord_fixed()


#add nice labels 
ggheatmap + 
geom_text(aes(Var2, Var1, label = value), color = "black", size = 3) +
theme(
  axis.title.x = element_blank(),
  axis.title.y = element_blank(),
  axis.text.x=element_text(size=rel(0.8), angle=90),
  axis.text.y=element_text(size=rel(0.8)),
  panel.grid.major = element_blank(),
  panel.border = element_blank(),
  panel.background = element_blank(),
  axis.ticks = element_blank(),
  legend.justification = c(1, 0),
  legend.position = c(0.6, 0.7),
  legend.direction = "horizontal")+
  guides(fill = guide_colorbar(barwiwine_training3h = 7, barheight = 1,
                title.position = "top", title.hjust = 0.5))




  outlierKD<-function(wine_training2, var) {
     var_name <- eval(substitute(var),eval(wine_training3))
     na1 <- sum(is.na(var_name))
     m1 <- mean(var_name, na.rm = T)
     par(mfrow=c(2, 2), oma=c(0,0,3,0))
     boxplot(var_name, main="With outliers")
     hist(var_name, main="With outliers", xlab=NA, ylab=NA)
     outlier <- boxplot.stats(var_name)$out
     mo <- mean(outlier)
     var_name <- ifelse(var_name %in% outlier, NA, var_name)
     boxplot(var_name, main="Without outliers")
     hist(var_name, main="Without outliers", xlab=NA, ylab=NA)
     title("Outlier Check", outer=TRUE)
     na2 <- sum(is.na(var_name))
     cat("Outliers identified:", na2 - na1, "n")
     cat("Propotion (%) of outliers:", round((na2 - na1) / sum(!is.na(var_name))*100, 1), "n")
     cat("Mean of the outliers:", round(mo, 2), "n")
     m2 <- mean(var_name, na.rm = T)
     cat("Mean without removing outliers:", round(m1, 2), "n")
     cat("Mean if we remove outliers:", round(m2, 2), "n")
     response <- readline(prompt="Do you want to remove outliers and to replace with NA? [yes/no]: ")
     if(response == "y" | response == "yes"){
          wine_training3[as.character(substitute(var))] <- invisible(var_name)
          assign(as.character(as.list(match.call())$wine_training2), wine_training2, envir = .GlobalEnv)
          cat("Outliers successfully removed", "n")
          return(invisible(wine_training2))
     } else{
          cat("Nothing changed", "n")
          return(invisible(var_name))
     }
}





outlierKD(wine_training2, TARGET)



outlierKD(wine_training2, Chlorides)




outlierKD(wine_training2, Alcohol)




outlierKD(wine_training2, FixedAcidity)




outlierKD(wine_training2, FreeSulfurDioxide)




outlierKD(wine_training2, LabelAppeal)



outlierKD(wine_training2, VolatileAcidity)



outlierKD(wine_training2, TotalSulfurDioxide)




outlierKD(wine_training2, AcidIndex)



outlierKD(wine_training2, CitricAcid)




outlierKD(wine_training2, Density)



outlierKD(wine_training2, ResidualSugar)



outlierKD(wine_training2, pH)




outlierKD(wine_training2, STARS)



outlierKD(wine_training2, Sulphates)



#Data prep

library(Hmisc)

wine_training3<-wine_training2

wine_training3$STARS<-impute(wine_training3$STARS, median)

#make an additional subset that retains the same values but simply removes negative values (not possible)
wine_training_redux <- wine_training2[wine_training2$Alcohol >= 0 && wine_training2$Sulphates >= 0
                                      && wine_training2$Sulphates >= 0
                                      && wine_training2$TotalSulfurDioxide >= 0
                                      && wine_training2$FreeSulfurDioxide >= 0
                                      && wine_training2$Chlorides >= 0
                                      && wine_training2$ResidualSugar
                                      && wine_training2$CitricAcid
                                      && wine_training2$VolatileAcidity >= 0
                                      && wine_training2$FixedAcidity >= 0,]

#wine_training_redux <- wine_training2[wine_training2$Sulphates >= 0,] 
#wine_training_redux <- wine_training2[wine_training2$TotalSulfurDioxide >= 0,]
#wine_training_redux <- wine_training2[wine_training2$FreeSulfurDioxide >= 0, ]
#wine_training_redux <- wine_training2[wine_training2$Chlorides >= 0, ]
#wine_training_redux <- wine_training2[wine_training2$ResidualSugar >= 0,]
#wine_training_redux <- wine_training2[wine_training2$CitricAcid >= 0,]
#wine_training_redux <- wine_training2[wine_training2$VolatileAcidity >= 0,]
#wine_training_redux <- wine_training2[wine_training2$FixedAcidity >= 0,]


barplot(table(wine_training3$STARS), ylim=c(0, 7000), xlab="Rating (post impute)", ylab="N", col="black");
barplot(table(wine_training2$STARS), ylim=c(0, 7000), xlab="Rating (pre impute)", ylab="N", col="black")



colSums(wine_training3<0);colSums(is.na(wine_training3))



wine_training3$Sulphates<-abs(wine_training3$Sulphates)
wine_training3$pH<-abs(wine_training3$pH)
wine_training3$ResidualSugar<-abs(wine_training3$ResidualSugar)
wine_training3$Chlorides<-abs(wine_training3$Chlorides)
wine_training3$FreeSulfurDioxide<-abs(wine_training3$FreeSulfurDioxide)
wine_training3$TotalSulfurDioxide<-abs(wine_training3$TotalSulfurDioxide)
wine_training3$VolatileAcidity<-abs(wine_training3$VolatileAcidity)
wine_training3$Alcohol<-abs(wine_training3$ Alcohol)
wine_training3$CitricAcid<-abs(wine_training3$CitricAcid)
wine_training3$FixedAcidity<-abs(wine_training3$FixedAcidity)

wine_evaluation3<-wine_evaluation

wine_evaluation3$Sulphates<-abs(wine_evaluation3$Sulphates)
wine_evaluation3$pH<-abs(wine_evaluation3$pH)
wine_evaluation3$ResidualSugar<-abs(wine_evaluation3$ResidualSugar)
wine_evaluation3$Chlorides<-abs(wine_evaluation3$Chlorides)
wine_evaluation3$FreeSulfurDioxide<-abs(wine_evaluation3$FreeSulfurDioxide)
wine_evaluation3$TotalSulfurDioxide<-abs(wine_evaluation3$TotalSulfurDioxide)
wine_evaluation3$VolatileAcidity<-abs(wine_evaluation3$VolatileAcidity)
wine_evaluation3$Alcohol<-abs(wine_evaluation3$ Alcohol)
wine_evaluation3$CitricAcid<-abs(wine_evaluation3$CitricAcid)
wine_evaluation3$FixedAcidity<-abs(wine_evaluation3$FixedAcidity)



wine_training3$Sulphates<-impute(wine_training3$Sulphates, median)
wine_training3$pH<-impute(wine_training3$pH, median)
wine_training3$ResidualSugar<-impute(wine_training3$ResidualSugar, median)
wine_training3$Chlorides<-impute(wine_training3$Chlorides, median)
wine_training3$FreeSulfurDioxide<-impute(wine_training3$FreeSulfurDioxide, median)
wine_training3$TotalSulfurDioxide<-impute(wine_training3$TotalSulfurDioxide, median)
wine_training3$Alcohol<-impute(wine_training3$Alcohol, median)

wine_evaluation3$Sulphates<-impute(wine_evaluation3$Sulphates, median)
wine_evaluation3$pH<-impute(wine_evaluation3$pH, median)
wine_evaluation3$ResidualSugar<-impute(wine_evaluation3$ResidualSugar, median)
wine_evaluation3$Chlorides<-impute(wine_evaluation3$Chlorides, median)
wine_evaluation3$FreeSulfurDioxide<-impute(wine_evaluation3$FreeSulfurDioxide, median)
wine_evaluation3$TotalSulfurDioxide<-impute(wine_evaluation3$TotalSulfurDioxide, median)
wine_evaluation3$Alcohol<-impute(wine_evaluation3$Alcohol, median)
#wine_evaluation3$TARGET<-impute(wine_evaluation3$Alcohol, median)


wine_training_redux$Sulphates<-impute(wine_training_redux$Sulphates, median)
wine_training_redux$pH<-impute(wine_training_redux$pH, median)
wine_training_redux$ResidualSugar<-impute(wine_training_redux$ResidualSugar, median)
wine_training_redux$Chlorides<-impute(wine_training_redux$Chlorides, median)
wine_training_redux$FreeSulfurDioxide<-impute(wine_training_redux$FreeSulfurDioxide, median)
wine_training_redux$TotalSulfurDioxide<-impute(wine_training_redux$TotalSulfurDioxide, median)
wine_training_redux$Alcohol<-impute(wine_training_redux$Alcohol, median)

summary(wine_training3)



#Modeling 
library(vcd)
library(faraway)
library(AER)
library(boot)

pmod <- glm(TARGET~., family="poisson", data=wine_training3)
summary(pmod);



#goodness of fit
anova(pmod, test="Chisq");

glm.diag.plots(pmod, glmdiag = glm.diag(pmod), subset = NULL,
               iden = FALSE, labels = NULL, ret = FALSE)

#cov.pmod <- vcovHC(pmod, type="HC0")
#std.err <- sqrt(diag(cov.pmod))
#r.est <- cbind(Estimate= coef(pmod), "Robust SE" = std.err,
#"Pr(>|z|)" = 2 * pnorm(abs(coef(pmod)/std.err), lower.tail=FALSE),
#LL = coef(pmod) - 1.96 * std.err,
#UL = coef(pmod) + 1.96 * std.err);

#r.est;

with(pmod, cbind(res.deviance = deviance, df = df.residual,p = pchisq(deviance, df.residual, lower.tail=FALSE)));

p1<-1-(18475/22861)
p1;

pchisq(pmod$deviance, df=pmod$df.residual, lower.tail=FALSE)

#dispersion test

#deviance(pmod)/pmod$df.residual
#dispersiontest(pmod);

#halfnorm(residuals(pmod))


library(MASS)
step<-stepAIC(pmod, trace=FALSE)
step$anova


pmod2<-glm(TARGET ~ VolatileAcidity + CitricAcid + Chlorides + FreeSulfurDioxide + 
    TotalSulfurDioxide + Density + pH + Sulphates + Alcohol + 
    LabelAppeal + AcidIndex + STARS, family="poisson",  data=wine_training3)

summary(pmod2);

#goodness of fit
anova(pmod2, test="Chisq");

glm.diag.plots(pmod2, glmdiag = glm.diag(pmod2), subset = NULL,
               iden = FALSE, labels = NULL, ret = FALSE)

#cov.pmod <- vcovHC(pmod, type="HC0")
#std.err <- sqrt(diag(cov.pmod))
#r.est <- cbind(Estimate= coef(pmod), "Robust SE" = std.err,
#"Pr(>|z|)" = 2 * pnorm(abs(coef(pmod)/std.err), lower.tail=FALSE),
#LL = coef(pmod) - 1.96 * std.err,
#UL = coef(pmod) + 1.96 * std.err);

#r.est;

with(pmod2, cbind(res.deviance = deviance, df = df.residual,p = pchisq(deviance, df.residual, lower.tail=FALSE)));

p2<-1-(18475/22861)
p2;

pchisq(pmod2$deviance, df=pmod2$df.residual, lower.tail=FALSE)

#dispersion test

#deviance(pmod)/pmod$df.residual
#dispersiontest(pmod);

#halfnorm(residuals(pmod))


pmod3<-glm(TARGET ~ CitricAcid + Chlorides + FreeSulfurDioxide + 
    TotalSulfurDioxide + Density + Sulphates + Alcohol + 
    LabelAppeal + AcidIndex + STARS, family="poisson",  data=wine_training3)

summary(pmod3);

#goodness of fit
anova(pmod3, test="Chisq");

glm.diag.plots(pmod3, glmdiag = glm.diag(pmod3), subset = NULL,
               iden = FALSE, labels = NULL, ret = FALSE)

#cov.pmod <- vcovHC(pmod, type="HC0")
#std.err <- sqrt(diag(cov.pmod))
#r.est <- cbind(Estimate= coef(pmod), "Robust SE" = std.err,
#"Pr(>|z|)" = 2 * pnorm(abs(coef(pmod)/std.err), lower.tail=FALSE),
#LL = coef(pmod) - 1.96 * std.err,
#UL = coef(pmod) + 1.96 * std.err);

#r.est;

with(pmod3, cbind(res.deviance = deviance, df = df.residual,p = pchisq(deviance, df.residual, lower.tail=FALSE)));

p3<-1-(18475/22861)
p3

pchisq(pmod3$deviance, df=pmod3$df.residual, lower.tail=FALSE)
#dispersion test



pmod4<-glm(TARGET ~ Chlorides + FreeSulfurDioxide + 
    TotalSulfurDioxide + Density + Sulphates + Alcohol + 
    LabelAppeal + AcidIndex + STARS, family="poisson",  data=wine_training4)

summary(pmod4);

#goodness of fit
anova(pmod4, test="Chisq");

glm.diag.plots(pmod4, glmdiag = glm.diag(pmod4), subset = NULL,
               iden = FALSE, labels = NULL, ret = FALSE)

#cov.pmod <- vcovHC(pmod, type="HC0")
#std.err <- sqrt(diag(cov.pmod))
#r.est <- cbind(Estimate= coef(pmod), "Robust SE" = std.err,
#"Pr(>|z|)" = 2 * pnorm(abs(coef(pmod)/std.err), lower.tail=FALSE),
#LL = coef(pmod) - 1.96 * std.err,
#UL = coef(pmod) + 1.96 * std.err);

#r.est;

with(pmod4, cbind(res.deviance = deviance, df = df.residual,p = pchisq(deviance, df.residual, lower.tail=FALSE)));

exp(cbind("Odds ratio" = coef(pmod4), confint.default(pmod4, level = 0.95)));

p4<-1-(18475/22861)
p4;

plot(log(fitted(pmod4)), log((wine_training4$TARGET-fitted(pmod4))^2), xlab=expression(hat(mu)), ylab=expression((y-hat(mu))^2))
abline(0, 1);

#goodness of fit
pchisq(pmod4$deviance, df=pmod4$df.residual, lower.tail=FALSE)

#dispersion test

#deviance(pmod)/pmod$df.residual
#dispersiontest(pmod);

#halfnorm(residuals(pmod))


library(gam)

pmod_smooth<-gam(TARGET ~ s(LabelAppeal)+s(AcidIndex)+s(STARS) , family=poisson(link=log),  data=wine_training3)

#pmod_smooth<-gam(TARGET ~ LabelAppeal+AcidIndex+STARS , family=poisson(link=log),  data=wine_training3)

summary(pmod_smooth);

#goodness of fit
anova(pmod_smooth, test="Chisq");

glm.diag.plots(pmod_smooth, glmdiag = glm.diag(pmod_smooth), subset = NULL,
               iden = FALSE, labels = NULL, ret = FALSE)

#cov.pmod <- vcovHC(pmod, type="HC0")
#std.err <- sqrt(diag(cov.pmod))
#r.est <- cbind(Estimate= coef(pmod), "Robust SE" = std.err,
#"Pr(>|z|)" = 2 * pnorm(abs(coef(pmod)/std.err), lower.tail=FALSE),
#LL = coef(pmod) - 1.96 * std.err,
#UL = coef(pmod) + 1.96 * std.err);

#r.est;

with(pmod_smooth, cbind(res.deviance = deviance, df = df.residual,p = pchisq(deviance, df.residual, lower.tail=FALSE)));

exp(cbind("Odds ratio" = coef(pmod_smooth), confint.default(pmod_smooth, level = 0.95)));

p_smooth<-1-(17735.95/22860.89)
p_smooth;

plot(log(fitted(pmod_smooth)), log((wine_training3$TARGET-fitted(pmod_smooth))^2), xlab=expression(hat(mu)), ylab=expression((y-hat(mu))^2))
abline(0, 1);

#goodness of fit
# 1-pchisq(summary(pmod_smooth)$deviance, summary(pmod_smooth)$df.residual)
pchisq(pmod_smooth$deviance, df=pmod_smooth$df.residual, lower.tail=FALSE)

#dispersion test

#deviance(pmod)/pmod$df.residual
#dispersiontest(pmod);

#halfnorm(residuals(pmod))
#plot(p_smooth, pages = 1, scheme = 1, all.terms = TRUE, seWithMean = TRUE)


library(MASS)
nmod1 <- glm.nb(TARGET ~ FixedAcidity + VolatileAcidity + CitricAcid + ResidualSugar + 
    Chlorides + FreeSulfurDioxide + TotalSulfurDioxide + Density + 
    pH + Sulphates + Alcohol + LabelAppeal + AcidIndex + STARS
 ,  data=wine_training3)

summary(nmod1);

#goodness of fit
anova(nmod1, test="Chisq");

glm.diag.plots(nmod1, glmdiag = glm.diag(nmod1), subset = NULL,
               iden = FALSE, labels = NULL, ret = FALSE)

#cov.pmod <- vcovHC(pmod, type="HC0")
#std.err <- sqrt(diag(cov.pmod))
#r.est <- cbind(Estimate= coef(pmod), "Robust SE" = std.err,
#"Pr(>|z|)" = 2 * pnorm(abs(coef(pmod)/std.err), lower.tail=FALSE),
#LL = coef(pmod) - 1.96 * std.err,
#UL = coef(pmod) + 1.96 * std.err);

#r.est;

with(nmod1, cbind(res.deviance = deviance, df = df.residual,p = pchisq(deviance, df.residual, lower.tail=FALSE)));

exp(cbind("Odds ratio" = coef(nmod1), confint.default(nmod1, level = 0.95)));

#n<-1-(18474/22860)
#n;

plot(log(fitted(nmod1)), log((wine_training3$TARGET-fitted(nmod1))^2), xlab=expression(hat(mu)), ylab=expression((y-hat(mu))^2))
abline(0, 1);

#goodness of fit
pchisq(nmod1$deviance, df=nmod1$df.residual, lower.tail=FALSE)


nmod2 <- glm.nb(TARGET ~ VolatileAcidity  + 
    Chlorides + TotalSulfurDioxide + Density + Alcohol + LabelAppeal + AcidIndex + STARS
 ,  data=wine_training3)

summary(nmod2);

#goodness of fit
anova(nmod2, test="Chisq");

glm.diag.plots(nmod2, glmdiag = glm.diag(nmod2), subset = NULL,
               iden = FALSE, labels = NULL, ret = FALSE)

#cov.pmod <- vcovHC(pmod, type="HC0")
#std.err <- sqrt(diag(cov.pmod))
#r.est <- cbind(Estimate= coef(pmod), "Robust SE" = std.err,
#"Pr(>|z|)" = 2 * pnorm(abs(coef(pmod)/std.err), lower.tail=FALSE),
#LL = coef(pmod) - 1.96 * std.err,
#UL = coef(pmod) + 1.96 * std.err);

#r.est;

with(nmod2, cbind(res.deviance = deviance, df = df.residual,p = pchisq(deviance, df.residual, lower.tail=FALSE)));

exp(cbind("Odds ratio" = coef(nmod2), confint.default(nmod2, level = 0.95)));

#n2<-1-(18500/22860)
#n2;

plot(log(fitted(nmod2)), log((wine_training3$TARGET-fitted(nmod2))^2), xlab=expression(hat(mu)), ylab=expression((y-hat(mu))^2))
abline(0, 1);

#goodness of fit
pchisq(nmod2$deviance, df=nmod2$df.residual, lower.tail=FALSE)





nmod4 <- glm.nb(TARGET ~ Alcohol+LabelAppeal + AcidIndex + STARS
 ,  data=wine_training3)

summary(nmod4);

#goodness of fit
anova(nmod4, test="Chisq");

glm.diag.plots(nmod4, glmdiag = glm.diag(nmod4), subset = NULL,
               iden = FALSE, labels = NULL, ret = FALSE)

#cov.pmod <- vcovHC(pmod, type="HC0")
#std.err <- sqrt(diag(cov.pmod))
#r.est <- cbind(Estimate= coef(pmod), "Robust SE" = std.err,
#"Pr(>|z|)" = 2 * pnorm(abs(coef(pmod)/std.err), lower.tail=FALSE),
#LL = coef(pmod) - 1.96 * std.err,
#UL = coef(pmod) + 1.96 * std.err);

#r.est;

with(nmod4, cbind(res.deviance = deviance, df = df.residual,p = pchisq(deviance, df.residual, lower.tail=FALSE)));

exp(cbind("Odds ratio" = coef(nmod4), confint.default(nmod4, level = 0.95)));

#n2<-1-(18566/22860)
#n2;

plot(log(fitted(nmod4)), log((wine_training3$TARGET-fitted(nmod4))^2), xlab=expression(hat(mu)), ylab=expression((y-hat(mu))^2))
abline(0, 1);

plot(predict(nmod4),wine_training3$TARGET,
      xlab="predicted",ylab="actual")
 abline(a=0,b=1);
 
 #goodness of fit
pchisq(nmod4$deviance, df=nmod4$df.residual, lower.tail=FALSE)



library(olsrr)

lmod1 <- lm(TARGET ~ FixedAcidity + VolatileAcidity + CitricAcid + ResidualSugar + 
    Chlorides + FreeSulfurDioxide + TotalSulfurDioxide + Density + 
    pH + Sulphates + Alcohol + LabelAppeal + AcidIndex + STARS
 ,  data=wine_training3);

summary(lmod1);

par(mfrow=c(2,2))
plot(lmod1)
hist(resid(lmod1), main="Histogram of Residuals");
ols_test_breusch_pagan(lmod1);
vif(lmod1);

plot(predict(lmod1),wine_training3$TARGET,
      xlab="predicted",ylab="actual")
 abline(a=0,b=1)


 lstep<-stepAIC(lmod1, trace=FALSE)
lstep$anova


lmod2 <- lm(TARGET ~ VolatileAcidity + CitricAcid + 
    Chlorides + FreeSulfurDioxide + TotalSulfurDioxide + Density + 
    pH + Sulphates + Alcohol + LabelAppeal + AcidIndex + STARS
 ,  data=wine_training3);

summary(lmod2);

par(mfrow=c(2,2))
plot(lmod2)
hist(resid(lmod2), main="Histogram of Residuals");
ols_test_breusch_pagan(lmod2);
vif(lmod2);

plot(predict(lmod2),wine_training3$TARGET,
      xlab="predicted",ylab="actual")
 abline(a=0,b=1)


 library(ISLR)

lmod3<-lm(TARGET ~ VolatileAcidity + CitricAcid + 
    Chlorides + FreeSulfurDioxide + TotalSulfurDioxide + Density + 
    pH + Sulphates + Alcohol + s(LabelAppeal) + AcidIndex + s(STARS)
 ,  data=wine_training3)

summary(lmod3);

par(mfrow=c(2,2))
plot(lmod3)
hist(resid(lmod3), main="Histogram of Residuals");
ols_test_breusch_pagan(lmod2);
vif(lmod3);

plot(predict(lmod3),wine_training3$TARGET,
      xlab="predicted",ylab="actual")
 abline(a=0,b=1);
 
 #goodness of fit
 #pchisq(summary(pmod7)$deviance, 
  #         summary(pmod7)$df.residual
   #        )



require(ggplot2)
require(pscl)
require(MASS)
require(boot)

pmod7 <- zeroinfl(TARGET ~   Alcohol  + AcidIndex | STARS+LabelAppeal,
  data = wine_training3)
summary(pmod7);


#glm.diag.plots(nmod3, glmdiag = glm.diag(nmod3), subset = NULL,
 #              iden = FALSE, labels = NULL, ret = FALSE)

#cov.pmod <- vcovHC(pmod, type="HC0")
#std.err <- sqrt(diag(cov.pmod))
#r.est <- cbind(Estimate= coef(pmod), "Robust SE" = std.err,
#"Pr(>|z|)" = 2 * pnorm(abs(coef(pmod)/std.err), lower.tail=FALSE),
#LL = coef(pmod) - 1.96 * std.err,
#UL = coef(pmod) + 1.96 * std.err);

#r.est;

#with(nmod3, cbind(res.deviance = deviance, df = df.residual,p = pchisq(deviance, df.residual, lower.tail=FALSE)));

#exp(cbind("Odds ratio" = coef(pmod7), confint.default(pmod7, level = 0.95)));

#n2<-1-(18500/22860)
#n2;

plot(log(fitted(pmod7)), log((wine_training3$TARGET-fitted(pmod7))^2), xlab=expression(hat(mu)), ylab=expression((y-hat(mu))^2))
abline(0, 1)

vuong(pmod7, pmod4);

plot(predict(pmod7),wine_training3$TARGET,
      xlab="predicted",ylab="actual")
 abline(a=0,b=1);
 
#goodness of fit
# pchisq(summary(pmod7)$deviance, 
 #          summary(pmod7)$df.residual
  #         )



#model selection

  pnull <- update(pmod7, . ~ 1)

pchisq(2 * (logLik(pmod7) - logLik(pnull)), df = 3, lower.tail = FALSE)



library(caret)
Train <- createDataPartition(wine_training3$TARGET, p=0.7, list=FALSE)
train <- wine_training3[Train, ]
test <- wine_training3[-Train, ]


dput(coef(pmod7, "count"));dput(coef(pmod7, "zero"))


f <- function(data, i) 
  {
  require(pscl)
  m <- zeroinfl(TARGET ~   Alcohol  + AcidIndex | STARS+LabelAppeal, data = data[i, ],
    start = list(count = c(1.5350871583595, 0.0100184890625652, -0.0419759809109108
), zero = c(-0.163773933205731, -0.658480007306658, 0.180109965808006
)))
  as.vector(t(do.call(rbind, coef(summary(m)))[, 1:2]))
 }

set.seed(10)
res <- boot(wine_training3, f, R = 100, parallel = "snow", ncpus = 4)

## print results
res



#conclusion
#gather predicted
test_results2<-predict(pmod7, newdata=test, type = "response")
target_pred<-data.frame(test_results2)

actuals<-subset(test,select=c(TARGET))

#plot
results<-data.frame(target_pred, actuals)

xyplot(TARGET ~ test_results2, data = results,
  xlab = "Predicted ",
  ylab = "Actuals",
  main = "Predicted Wine Sales vs Actual Wine Sales on Test Data");

plot_density(results);

summary(results$TARGET);summary(results$test_results2)