Introduction

In this homework assignment, you will explore, analyze and model a data set containing information on approximately 12,000 commercially available wines. The variables are mostly related to the chemical properties of the wine being sold. The response variable is the number of sample cases of wine that were purchased by wine distribution companies after sampling a wine. These cases would be used to provide tasting samples to restaurants and wine stores around the United States. The more sample cases purchased, the more likely is a wine to be sold at a high end restaurant. A large wine manufacturer is studying the data in order to predict the number of wine cases ordered based upon the wine characteristics. If the wine manufacturer can predict the number of cases, then that manufacturer will be able to adjust their wine offering to maximize sales.
Our objective is to build a count regression model to predict the number of cases of wine that will be sold given certain properties of the wine. Below is a short description of the variables of interest in the data set:
Variable Name Definition Theoretical Effect
INDEX Identification Variable (do not use) None
TARGET Number of Cases Purchased None
AcidIndex Proprietary method of testing total acidity of wine by using a weighted average
Alcohol Alcohol Content
Chlorides Chloride content of wine
CitricAcid Citric Acid Content
Density Density of Wine
FixedAcidity Fixed Acidity of Wine
FreeSulfurDioxide Sulfur Dioxide content of wine
LabelAppeal Marketing Score indicating the appeal of label design for consumers. High numbers suggest customers like the label design. Negative numbers suggest customes don’t like the design. Many consumers purchase based on the visual appeal of the wine label design. Higher numbers suggest better sales
ResidualSugar Residual Sugar of wine
STARS Wine rating by a team of experts. 4 Stars = Excellent, 1 Star = Poor A high number of stars suggests high sales
Sulphates Sulfate conten of wine
TotalSulfurDioxide Total Sulfur Dioxide of Wine
VolatileAcidity Volatile Acid content of wine
pH pH of wine

Data Exploration

Our dataset consists of 15 variables and 12,795 observations with ResidualSugar, Chlorides, FreeSulfurDioxide, TotalSulfurDioxide, pH, Sulphates, Alcohol, and STARS variables containing several missing values. As stated previously, TARGET is our response variable. Lastly, LabelAppeal, AcidIndex, and STARS are discrete variables and the rest are continuous. As such we will perform the appropriate visualization in the following figures.

## Observations: 12,795
## Variables: 15
## $ TARGET             <int> 3, 3, 5, 3, 4, 0, 0, 4, 3, 6, 0, 4, 3, 7, 4...
## $ FixedAcidity       <dbl> 3.2, 4.5, 7.1, 5.7, 8.0, 11.3, 7.7, 6.5, 14...
## $ VolatileAcidity    <dbl> 1.160, 0.160, 2.640, 0.385, 0.330, 0.320, 0...
## $ CitricAcid         <dbl> -0.98, -0.81, -0.88, 0.04, -1.26, 0.59, -0....
## $ ResidualSugar      <dbl> 54.20, 26.10, 14.80, 18.80, 9.40, 2.20, 21....
## $ Chlorides          <dbl> -0.567, -0.425, 0.037, -0.425, NA, 0.556, 0...
## $ FreeSulfurDioxide  <dbl> NA, 15, 214, 22, -167, -37, 287, 523, -213,...
## $ TotalSulfurDioxide <dbl> 268, -327, 142, 115, 108, 15, 156, 551, NA,...
## $ Density            <dbl> 0.99280, 1.02792, 0.99518, 0.99640, 0.99457...
## $ pH                 <dbl> 3.33, 3.38, 3.12, 2.24, 3.12, 3.20, 3.49, 3...
## $ Sulphates          <dbl> -0.59, 0.70, 0.48, 1.83, 1.77, 1.29, 1.21, ...
## $ Alcohol            <dbl> 9.9, NA, 22.0, 6.2, 13.7, 15.4, 10.3, 11.6,...
## $ LabelAppeal        <int> 0, -1, -1, -1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 2...
## $ AcidIndex          <int> 8, 7, 8, 6, 9, 11, 8, 7, 6, 8, 5, 10, 7, 8,...
## $ STARS              <int> 2, 3, 3, 1, 2, NA, NA, 3, NA, 4, 1, 2, 2, 3...
##      TARGET       FixedAcidity     VolatileAcidity     CitricAcid     
##  Min.   :0.000   Min.   :-18.100   Min.   :-2.7900   Min.   :-3.2400  
##  1st Qu.:2.000   1st Qu.:  5.200   1st Qu.: 0.1300   1st Qu.: 0.0300  
##  Median :3.000   Median :  6.900   Median : 0.2800   Median : 0.3100  
##  Mean   :3.029   Mean   :  7.076   Mean   : 0.3241   Mean   : 0.3084  
##  3rd Qu.:4.000   3rd Qu.:  9.500   3rd Qu.: 0.6400   3rd Qu.: 0.5800  
##  Max.   :8.000   Max.   : 34.400   Max.   : 3.6800   Max.   : 3.8600  
##                                                                       
##  ResidualSugar        Chlorides       FreeSulfurDioxide TotalSulfurDioxide
##  Min.   :-127.800   Min.   :-1.1710   Min.   :-555.00   Min.   :-823.0    
##  1st Qu.:  -2.000   1st Qu.:-0.0310   1st Qu.:   0.00   1st Qu.:  27.0    
##  Median :   3.900   Median : 0.0460   Median :  30.00   Median : 123.0    
##  Mean   :   5.419   Mean   : 0.0548   Mean   :  30.85   Mean   : 120.7    
##  3rd Qu.:  15.900   3rd Qu.: 0.1530   3rd Qu.:  70.00   3rd Qu.: 208.0    
##  Max.   : 141.150   Max.   : 1.3510   Max.   : 623.00   Max.   :1057.0    
##  NA's   :616        NA's   :638       NA's   :647       NA's   :682       
##     Density             pH          Sulphates          Alcohol     
##  Min.   :0.8881   Min.   :0.480   Min.   :-3.1300   Min.   :-4.70  
##  1st Qu.:0.9877   1st Qu.:2.960   1st Qu.: 0.2800   1st Qu.: 9.00  
##  Median :0.9945   Median :3.200   Median : 0.5000   Median :10.40  
##  Mean   :0.9942   Mean   :3.208   Mean   : 0.5271   Mean   :10.49  
##  3rd Qu.:1.0005   3rd Qu.:3.470   3rd Qu.: 0.8600   3rd Qu.:12.40  
##  Max.   :1.0992   Max.   :6.130   Max.   : 4.2400   Max.   :26.50  
##                   NA's   :395     NA's   :1210      NA's   :653    
##   LabelAppeal          AcidIndex          STARS      
##  Min.   :-2.000000   Min.   : 4.000   Min.   :1.000  
##  1st Qu.:-1.000000   1st Qu.: 7.000   1st Qu.:1.000  
##  Median : 0.000000   Median : 8.000   Median :2.000  
##  Mean   :-0.009066   Mean   : 7.773   Mean   :2.042  
##  3rd Qu.: 1.000000   3rd Qu.: 8.000   3rd Qu.:3.000  
##  Max.   : 2.000000   Max.   :17.000   Max.   :4.000  
##                                       NA's   :3359

In the box plot below, we plot several variables in to two panels. TotalSulfurDioxide, FreeSulfurDioxide, and ResidualSugar variables have large ranges compared to other variables. Therefore, we separated those variables in to a different panel to view their distribution. From both panel, we can tell a high number of variables have numerous outliers. However, almost all of the variables are centered around zero. Lastly, we also see several variables have negative values. In the bar char below, AcidIndex tells us large quantity of wine were sold with the index number 7 and 8. LabelAppeal tells us generic labeled wine sells the most; however, better label does yield higher number of wine samples per order. Lastly, STARS tells us excellent quality does not result in high wine orders. It could be due to high star wine bottle’s high price tag. In the histogram plot below, we see all the continous variable are normally distributed. However, several variables have negatives value in them. We will have to transform these values to positive for our analysis.

In the correlation table and plot below, we see, STARS and LabelAppeal are most positively correlated variables with the response variable. We expected this because our variable description mentions these variable’s theoretical effect are higher than other variables. Also, we some mild negative correlation between the response variable and AcidIndex variable.
Correlation
TARGET 1.0000000
STARS 0.5546857
LabelAppeal 0.4979465
Alcohol 0.0737771
FreeSulfurDioxide 0.0226398
TotalSulfurDioxide 0.0216021
ResidualSugar 0.0035196
CitricAcid 0.0023450
pH 0.0002199
FixedAcidity -0.0125381
Sulphates -0.0212204
Chlorides -0.0304301
Density -0.0475989
VolatileAcidity -0.0759979
AcidIndex -0.1676431

Data Preparation

Using the aggr function from VIM package, We see several variables have missing values. According to UCI Machine Learning, who published this dataset, all wine contain some natural sulfites. Therefore, we will impute the missing values for sulfite chemical properties. Also, it’s rare to find wines with less than 1 gram/liter of sugar. We will impute this as well. Matter of fact, since all missing values are missing at random, we will impute all the missing values using the mice package and random forest method. Mice uses multivariate imputations to estimate the missing values. Using multiple imputations helps in resolving the uncertainty for the missingness. Our target variable will be removed as a predictor variable but still will be imputed. Our response variables will be removed as predictor variables but still will be imputed.

## 
##  Variables sorted by number of missings: 
##            Variable Count
##               STARS  3359
##           Sulphates  1210
##  TotalSulfurDioxide   682
##             Alcohol   653
##   FreeSulfurDioxide   647
##           Chlorides   638
##       ResidualSugar   616
##                  pH   395
##              TARGET     0
##        FixedAcidity     0
##     VolatileAcidity     0
##          CitricAcid     0
##             Density     0
##         LabelAppeal     0
##           AcidIndex     0
## 
##  iter imp variable
##   1   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   1   2  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   1   3  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   1   4  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   1   5  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   2   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   2   2  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   2   3  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   2   4  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   2   5  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   3   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   3   2  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   3   3  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   3   4  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   3   5  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   4   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   4   2  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   4   3  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   4   4  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   4   5  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   5   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   5   2  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   5   3  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   5   4  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   5   5  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
## 
##  iter imp variable
##   1   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   1   2  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   1   3  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   1   4  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   1   5  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   2   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   2   2  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   2   3  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   2   4  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   2   5  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   3   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   3   2  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   3   3  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   3   4  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   3   5  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   4   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   4   2  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   4   3  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   4   4  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   4   5  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   5   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   5   2  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   5   3  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   5   4  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
##   5   5  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
## [1] "Missing value after imputation: 0"

Build Model

We will build two different Poisson regression models using dataset with and without imputed values, two different negative binomial regression models using stepwise variables selection and imputed variables, and two multiple linear regression models using stepwise variables selection and imputed variables to see which model yields the best performance.

Poisson Regression: Model 1

In our Poisson regression model below, we see the deviance residuals is quite symmetrical. This means the the predicted points are close to actual observed points. As expected and seen in our correlation table, STARS, LabelAppeal and AcidIndex are significant variables. And the variation in standard error is low. We will compare the AIC with other models.

## 
## Call:
## glm(formula = TARGET ~ ., family = poisson, data = wine_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2158  -0.2734   0.0616   0.3732   1.6830  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         1.593e+00  2.506e-01   6.359 2.03e-10 ***
## FixedAcidity        3.293e-04  1.053e-03   0.313  0.75447    
## VolatileAcidity    -2.560e-02  8.353e-03  -3.065  0.00218 ** 
## CitricAcid         -7.259e-04  7.575e-03  -0.096  0.92365    
## ResidualSugar      -6.141e-05  1.941e-04  -0.316  0.75165    
## Chlorides          -3.007e-02  2.056e-02  -1.463  0.14346    
## FreeSulfurDioxide   6.734e-05  4.404e-05   1.529  0.12620    
## TotalSulfurDioxide  2.081e-05  2.855e-05   0.729  0.46618    
## Density            -3.725e-01  2.462e-01  -1.513  0.13026    
## pH                 -4.661e-03  9.598e-03  -0.486  0.62722    
## Sulphates          -5.164e-03  7.051e-03  -0.732  0.46398    
## Alcohol             3.948e-03  1.771e-03   2.229  0.02579 *  
## LabelAppeal         1.771e-01  7.954e-03  22.271  < 2e-16 ***
## AcidIndex          -4.870e-02  5.903e-03  -8.251  < 2e-16 ***
## STARS               1.871e-01  7.487e-03  24.993  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 5844.1  on 6435  degrees of freedom
## Residual deviance: 4009.1  on 6421  degrees of freedom
##   (6359 observations deleted due to missingness)
## AIC: 23172
## 
## Number of Fisher Scoring iterations: 5

Poisson Regression: Model 2

In our Poisson regression model below, we see the deviance residuals is quite symmetrical. This means the the predicted points are close to actual observed points. With the imputed data, we have more significant variables than the previous model. However, the AIC score is score is increased drastically. A low residual deviance implies that the model we have trained is appropriate.

## 
## Call:
## glm(formula = TARGET ~ ., family = poisson, data = wine_train_imputed)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.8449  -0.5193   0.2089   0.6305   2.5305  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         2.039e+00  1.953e-01  10.444  < 2e-16 ***
## FixedAcidity       -2.934e-04  8.194e-04  -0.358 0.720269    
## VolatileAcidity    -5.090e-02  6.489e-03  -7.844 4.36e-15 ***
## CitricAcid          1.338e-02  5.891e-03   2.271 0.023124 *  
## ResidualSugar       1.649e-04  1.513e-04   1.089 0.275961    
## Chlorides          -5.657e-02  1.613e-02  -3.507 0.000453 ***
## FreeSulfurDioxide   1.332e-04  3.457e-05   3.854 0.000116 ***
## TotalSulfurDioxide  1.106e-04  2.204e-05   5.019 5.19e-07 ***
## Density            -4.090e-01  1.917e-01  -2.133 0.032893 *  
## pH                 -2.336e-02  7.553e-03  -3.093 0.001979 ** 
## Sulphates          -1.713e-02  5.536e-03  -3.094 0.001978 ** 
## Alcohol             4.966e-03  1.387e-03   3.581 0.000343 ***
## LabelAppeal         1.974e-01  6.024e-03  32.768  < 2e-16 ***
## AcidIndex          -1.199e-01  4.470e-03 -26.816  < 2e-16 ***
## STARS               1.928e-01  5.816e-03  33.149  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 22861  on 12794  degrees of freedom
## Residual deviance: 18474  on 12780  degrees of freedom
## AIC: 50446
## 
## Number of Fisher Scoring iterations: 5

Negative Binomial Regression: Model 3

In our negative binomial regression model below, we see the deviance residuals is quite symmetrical. This means the the predicted points are close to actual observed points. Sometimes Poisson and negative binomial regression models give the same results. We can see that by comparing model 3 and model 2’s coefficients score. They are exactly same. However, fishering scoring iteration is only 1.

## 
## Call:
## glm.nb(formula = TARGET ~ ., data = wine_train_imputed, init.theta = 37200.03811, 
##     link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.8447  -0.5193   0.2089   0.6305   2.5304  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         2.039e+00  1.953e-01  10.443  < 2e-16 ***
## FixedAcidity       -2.935e-04  8.195e-04  -0.358 0.720264    
## VolatileAcidity    -5.090e-02  6.490e-03  -7.844 4.37e-15 ***
## CitricAcid          1.338e-02  5.892e-03   2.271 0.023130 *  
## ResidualSugar       1.649e-04  1.513e-04   1.089 0.275963    
## Chlorides          -5.657e-02  1.613e-02  -3.507 0.000453 ***
## FreeSulfurDioxide   1.332e-04  3.457e-05   3.854 0.000116 ***
## TotalSulfurDioxide  1.106e-04  2.204e-05   5.019 5.19e-07 ***
## Density            -4.090e-01  1.917e-01  -2.133 0.032899 *  
## pH                 -2.336e-02  7.553e-03  -3.093 0.001980 ** 
## Sulphates          -1.713e-02  5.536e-03  -3.094 0.001978 ** 
## Alcohol             4.966e-03  1.387e-03   3.580 0.000343 ***
## LabelAppeal         1.974e-01  6.024e-03  32.767  < 2e-16 ***
## AcidIndex          -1.199e-01  4.471e-03 -26.815  < 2e-16 ***
## STARS               1.928e-01  5.816e-03  33.148  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(37200.04) family taken to be 1)
## 
##     Null deviance: 22860  on 12794  degrees of freedom
## Residual deviance: 18473  on 12780  degrees of freedom
## AIC: 50448
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  37200 
##           Std. Err.:  60018 
## Warning while fitting theta: iteration limit reached 
## 
##  2 x log-likelihood:  -50416.39

Negative Binomial Regression: Model 4

In our negative binomial regression model below, we use forward and backward step-wise variables selection algorithm. This model is only slightly better with a lower AIC score.

## 
## Call:
## glm.nb(formula = TARGET ~ VolatileAcidity + CitricAcid + Chlorides + 
##     FreeSulfurDioxide + TotalSulfurDioxide + Density + pH + Sulphates + 
##     Alcohol + LabelAppeal + AcidIndex + STARS, data = wine_train_imputed, 
##     init.theta = 37205.32973, link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.8469  -0.5183   0.2103   0.6292   2.5379  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         2.039e+00  1.953e-01  10.441  < 2e-16 ***
## VolatileAcidity    -5.101e-02  6.489e-03  -7.860 3.83e-15 ***
## CitricAcid          1.330e-02  5.891e-03   2.257 0.023979 *  
## Chlorides          -5.646e-02  1.613e-02  -3.501 0.000464 ***
## FreeSulfurDioxide   1.335e-04  3.456e-05   3.862 0.000113 ***
## TotalSulfurDioxide  1.112e-04  2.203e-05   5.045 4.54e-07 ***
## Density            -4.081e-01  1.917e-01  -2.128 0.033300 *  
## pH                 -2.322e-02  7.551e-03  -3.076 0.002101 ** 
## Sulphates          -1.721e-02  5.535e-03  -3.108 0.001881 ** 
## Alcohol             4.942e-03  1.387e-03   3.563 0.000367 ***
## LabelAppeal         1.974e-01  6.024e-03  32.774  < 2e-16 ***
## AcidIndex          -1.201e-01  4.419e-03 -27.186  < 2e-16 ***
## STARS               1.928e-01  5.816e-03  33.157  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(37205.33) family taken to be 1)
## 
##     Null deviance: 22860  on 12794  degrees of freedom
## Residual deviance: 18475  on 12782  degrees of freedom
## AIC: 50446
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  37205 
##           Std. Err.:  60034 
## Warning while fitting theta: iteration limit reached 
## 
##  2 x log-likelihood:  -50417.72

Multiple Linear Regression: Model 5

In our multiple linear regression model below, r-squared is 0.29, which means this model explains 29% of the data’s variation. As seen with previous models, FixedAcidity and ResidualSugar seem to have have no impact in this model. So far none of the model adequately explains the dataset.

## 
## Call:
## lm(formula = TARGET ~ ., data = wine_train_imputed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.8654 -0.7149  0.3674  1.1120  4.9647 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         5.455e+00  5.541e-01   9.844  < 2e-16 ***
## FixedAcidity       -5.024e-04  2.326e-03  -0.216  0.82901    
## VolatileAcidity    -1.550e-01  1.847e-02  -8.393  < 2e-16 ***
## CitricAcid          4.054e-02  1.681e-02   2.411  0.01592 *  
## ResidualSugar       5.220e-04  4.308e-04   1.212  0.22568    
## Chlorides          -1.787e-01  4.574e-02  -3.908 9.36e-05 ***
## FreeSulfurDioxide   4.005e-04  9.824e-05   4.077 4.59e-05 ***
## TotalSulfurDioxide  3.232e-04  6.252e-05   5.170 2.37e-07 ***
## Density            -1.192e+00  5.454e-01  -2.185  0.02887 *  
## pH                 -6.102e-02  2.140e-02  -2.852  0.00436 ** 
## Sulphates          -4.885e-02  1.574e-02  -3.103  0.00192 ** 
## Alcohol             1.739e-02  3.927e-03   4.429 9.54e-06 ***
## LabelAppeal         5.976e-01  1.699e-02  35.176  < 2e-16 ***
## AcidIndex          -3.157e-01  1.126e-02 -28.039  < 2e-16 ***
## STARS               6.308e-01  1.724e-02  36.592  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.635 on 12780 degrees of freedom
## Multiple R-squared:  0.2808, Adjusted R-squared:   0.28 
## F-statistic: 356.4 on 14 and 12780 DF,  p-value: < 2.2e-16

Multiple Linear Regression: Model 6

In our last model using multiple linear regression with forward and backward step-wise variables selection algorithm, we see a similar output as model 5. R-squared is 0.29, which means this model explains 29% of the data’s variation.

## 
## Call:
## lm(formula = TARGET ~ VolatileAcidity + CitricAcid + Chlorides + 
##     FreeSulfurDioxide + TotalSulfurDioxide + Density + pH + Sulphates + 
##     Alcohol + LabelAppeal + AcidIndex + STARS, data = wine_train_imputed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.8687 -0.7184  0.3665  1.1119  4.9712 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         5.452e+00  5.541e-01   9.840  < 2e-16 ***
## VolatileAcidity    -1.552e-01  1.847e-02  -8.400  < 2e-16 ***
## CitricAcid          4.041e-02  1.681e-02   2.404  0.01624 *  
## Chlorides          -1.786e-01  4.573e-02  -3.906 9.45e-05 ***
## FreeSulfurDioxide   4.021e-04  9.822e-05   4.094 4.26e-05 ***
## TotalSulfurDioxide  3.248e-04  6.250e-05   5.198 2.05e-07 ***
## Density            -1.187e+00  5.454e-01  -2.176  0.02956 *  
## pH                 -6.069e-02  2.140e-02  -2.836  0.00457 ** 
## Sulphates          -4.905e-02  1.574e-02  -3.117  0.00183 ** 
## Alcohol             1.730e-02  3.926e-03   4.406 1.06e-05 ***
## LabelAppeal         5.976e-01  1.699e-02  35.177  < 2e-16 ***
## AcidIndex          -3.162e-01  1.109e-02 -28.526  < 2e-16 ***
## STARS               6.310e-01  1.724e-02  36.608  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.635 on 12782 degrees of freedom
## Multiple R-squared:  0.2807, Adjusted R-squared:   0.28 
## F-statistic: 415.7 on 12 and 12782 DF,  p-value: < 2.2e-16

Select Models

To make prediction, we will select one of our count regression model. The criteria for our selection for the best count regression model will be the AIC score and mean squared error of the model. Based on the table below, model 1 is our best model. I reckon the imputed values for the missing values did not explain the variation in our dataset appropriately.
Model 1 Model 2 Model 3 Model 4
MSE 6.90228375567207 7.04844870624825 7.04844726212522 7.04855179988549
AIC 23171.7476259535 50446.2749506101 50448.3887510875 50445.7160907331
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.5847  1.0685  1.2508  1.2705  1.4527  2.1245    1624
# knitr settings
knitr::opts_chunk$set(warning = F, 
                      message = F,
                      echo = F,
                      fig.align = "center")
# load libraries
library(kableExtra)
library(knitr)
library(dplyr)
library(ggplot2)
library(psych)
library(tidyr)
library(ggpubr)
library(corrplot)
library(RColorBrewer)
library(VIM)
library(mice)
library(MASS)
library(caret)
library(pROC)
library(Metrics)
vr <- c("INDEX", "TARGET", "AcidIndex", "Alcohol", "Chlorides", "CitricAcid", "Density", "FixedAcidity", "FreeSulfurDioxide", "LabelAppeal", "ResidualSugar", "STARS", "Sulphates", "TotalSulfurDioxide", "VolatileAcidity", "pH")
def <- c("Identification Variable (do not use)", "Number of Cases Purchased", "Proprietary method of testing total acidity of wine by using a weighted average", "Alcohol Content", "Chloride content of wine", "Citric Acid Content", "Density of Wine", "Fixed Acidity of Wine", "Sulfur Dioxide content of wine", "Marketing Score indicating the appeal of label design for consumers. High numbers suggest customers like the label design. Negative numbers suggest customes don't like the design.", "Residual Sugar of wine", "Wine rating by a team of experts. 4 Stars = Excellent, 1 Star = Poor", "Sulfate conten of wine", "Total Sulfur Dioxide of Wine", "Volatile Acid content of wine", "pH of wine")
te <- c("None", "None", "", "",  "", "", "", "",  "", "Many consumers purchase based on the visual appeal of the wine label design. Higher numbers suggest better sales", "", "A high number of stars suggests high sales",  "", "", "", "")
kable(cbind(vr, def, te), col.names = c("Variable Name", "Definition", "Theoretical Effect")) %>% 
  kable_styling()
# load data
url1 <- "https://raw.githubusercontent.com/saayedalam/Data/master/wine-training-data.csv"
url2 <- "https://raw.githubusercontent.com/saayedalam/Data/master/wine-evaluation-data.csv"
wine_train <- read.csv(url1)
wine_test <- read.csv(url2)

# remove index column as it is not needed
wine_train <- wine_train %>% 
  dplyr::select(-"ï..INDEX")

wine_test <- wine_test %>% 
  dplyr::select(-"IN")

# summary statistics
wine_train %>% 
  glimpse() %>% 
  summary()
# boxplot
p1 <- wine_train %>% 
  dplyr::select(-c("TotalSulfurDioxide", "FreeSulfurDioxide", "ResidualSugar")) %>% 
  gather(na.rm = TRUE) %>% 
  ggplot(aes(factor(key), value)) +
  geom_boxplot(outlier.colour = "#e281cf", outlier.shape = 1,  color = "#5aa1ed") +
  coord_flip() +
  labs(title = "Boxplot of Chemical Properties of Wine", x = "Chemical Properties", y = "Values") +
  theme_minimal()

p2 <- wine_train %>% 
  dplyr::select(c("TotalSulfurDioxide", "FreeSulfurDioxide", "ResidualSugar")) %>% 
  gather(na.rm = TRUE) %>% 
  ggplot(aes(factor(key), value)) +
  geom_boxplot(outlier.colour = "#e281cf", outlier.shape = 1, color = "#5aa1ed") +
  #labs(title = "Boxplot of Chemical Properties of Wine", x = "Chemical Properties", y = "Values") +
  theme_minimal()

ggarrange(p1, p2)
# barchart
p3 <- wine_train %>% 
  dplyr::select(TARGET, STARS) %>% 
  mutate(STARS = as.factor(STARS),
         TARGET = as.factor(TARGET)) %>% 
  ggplot(aes(STARS)) +
  geom_bar(aes(fill = TARGET)) +
  theme_minimal()

p4 <- wine_train %>%
  dplyr::select(TARGET, LabelAppeal) %>% 
  mutate(STARS = as.factor(LabelAppeal),
         TARGET = as.factor(TARGET)) %>% 
  ggplot(aes(LabelAppeal)) +
  geom_bar(aes(fill = TARGET)) +
  theme_minimal()

p5 <- wine_train %>% 
  dplyr::select(TARGET, AcidIndex) %>% 
  mutate(STARS = as.factor(AcidIndex),
         TARGET = as.factor(TARGET)) %>% 
  ggplot(aes(AcidIndex)) +
  geom_bar(aes(fill = TARGET)) +
  theme_minimal()

ggarrange(p5, ggarrange(p3, p4, ncol = 2, nrow = 1, legend = "none"), nrow = 2, common.legend = TRUE)
# histogram
wine_train %>% 
  dplyr::select(-c("AcidIndex", "STARS", "TARGET", "LabelAppeal")) %>% 
  gather() %>% 
  ggplot(aes(value)) +
  facet_wrap(~key, scale = "free",  ncol = 3) +
  geom_histogram(binwidth = function(x) 2 * IQR(x) / (length(x)^(1/3)), fill="#56B4E9") +
  theme_minimal()
# top correlation
wine_train_corr <- wine_train %>% 
  drop_na() %>% 
  cor()
kable(sort(wine_train_corr[,1], decreasing = T), col.names = c("Correlation")) %>% 
  kable_styling(full_width = F)

# correlation plot
corrplot(wine_train_corr, 
         method = "number", 
         type = "lower",
         col = brewer.pal(n = 15, name = "Blues"),
         number.cex = .7, tl.cex = .7,
         tl.col = "black", tl.srt = 45)
# missing value columns
aggr(wine_train, 
     sortVars=TRUE, 
     labels=names(wine_train), 
     cex.axis=.5, 
     bars = FALSE, 
     col = c("#efeaee", "#5aa1ed"),
     combined = TRUE,
     #border = NA,
     ylab = "Missing Values") 

# imputating train data
init <- mice(wine_train)
meth <- init$method
predM <- init$predictorMatrix
predM[, c("TARGET")] <- 0 #this code will remove the variable as a predictor but still will be imputed
wine_train_impute <- mice(wine_train, method = 'rf', predictorMatrix=predM)
wine_train_imputed <- complete(wine_train_impute)
print(paste0("Missing value after imputation: ", sum(is.na(wine_train_imputed))))
# poisson model with the missing values
model1 <- glm(TARGET ~ ., family = poisson, wine_train)
summary(model1)
# poisson model with the imputed values
model2 <- glm(TARGET ~ ., family = poisson, wine_train_imputed)
summary(model2)
# negative binomial regression with the imputed values
model3 <- glm.nb(TARGET ~ ., wine_train_imputed)
summary(model3)
# negative binomial regression with stepwise variable selection
model4 <- stepAIC(model3, direction = "both", trace = FALSE)
summary(model4)
# multiple linear regression with the imputed values
model5 <- lm(TARGET ~ ., wine_train_imputed)
summary(model5)
# multiple linear regression with stepwise variable selection
model6 <- stepAIC(model5, direction = "both", trace = FALSE)
summary(model6)
# metrics
aic1 <- model1$aic
aic2 <- model2$aic
aic3 <- model3$aic
aic4 <- model4$aic
mse1 <- mean((wine_train$TARGET - predict(model1))^2)
mse2 <- mean((wine_train_imputed$TARGET - predict(model2))^2)
mse3 <- mean((wine_train_imputed$TARGET - predict(model3))^2)
mse4 <- mean((wine_train_imputed$TARGET - predict(model4))^2)
MSE <- list(mse1, mse2, mse3, mse4)
AIC <- list(aic1, aic2, aic3, aic4)

kable(rbind(MSE, AIC), col.names = c("Model 1", "Model 2", "Model 3", "Model 4"))  %>% 
  kable_styling(full_width = T)

# predict
predict <- predict(model1, wine_test, interval = "predict")
summary(predict)