1. Data Exploration

a. Read Data In

The dataset of interest for this report is the wine purchase orders from sample sale to stores in USA.
The dataset is stored in a github repostiory and read into R programing environment directly. The first 6 rows of the data given are as below. The target variable is under “SalePrice” column that shows the dollar values of the houses last sold.

##   ï..INDEX TARGET FixedAcidity VolatileAcidity CitricAcid ResidualSugar
## 1        1      3          3.2           1.160      -0.98          54.2
## 2        2      3          4.5           0.160      -0.81          26.1
## 3        4      5          7.1           2.640      -0.88          14.8
## 4        5      3          5.7           0.385       0.04          18.8
## 5        6      4          8.0           0.330      -1.26           9.4
## 6        7      0         11.3           0.320       0.59           2.2

A corresponding short description of the predictors variables are given as below.

## # A tibble: 16 x 2
##    Columns_Name       Short_Desc                    
##    <fct>              <fct>                         
##  1 ï..INDEX           Row Index                     
##  2 TARGET             Number of Cases Purchased     
##  3 FixedAcidity       Fixed Acidity of Wine         
##  4 VolatileAcidity    Volatile Acid content of wine 
##  5 CitricAcid         Citric Acid Content           
##  6 ResidualSugar      Residual Sugar of wine        
##  7 Chlorides          Chloride content of wine      
##  8 FreeSulfurDioxide  Sulfur Dioxide content of wine
##  9 TotalSulfurDioxide Total Sulfur Dioxide of Wine  
## 10 Density            Density of Wine               
## 11 pH                 pH of wine                    
## 12 Sulphates          Sulfate conten of wine        
## 13 Alcohol            Alcohol Content               
## # ... with 3 more rows

b. Check for missing values

The data is checked to see if it contains any blank or NA values. The missing values are quite numerous as shown in below table.

##   na_count          col_names
## 8     3359              STARS
## 6     1210          Sulphates
## 4      682 TotalSulfurDioxide
## 7      653            Alcohol
## 3      647  FreeSulfurDioxide
## 2      638          Chlorides
## 1      616      ResidualSugar
## 5      395                 pH

Quite a few number of columns have impossible negative values. For example, a certain wine may have residual sugar content but it cannot have negative sugar level. First the missing values are replaced with zeros and the negative values with zero values after imputation is carried out.

2 Data Preparation

a. Impute missing value using MICE

The imputation process involves generating a chart to review the general distribution of missing values. Below plot shows there are missing values in “STARS”, “Sulphates”, “Alcohol” and other variables.

## 
##  Variables sorted by number of missings: 
##            Variable      Count
##               STARS 0.26252442
##           Sulphates 0.09456819
##  TotalSulfurDioxide 0.05330207
##             Alcohol 0.05103556
##   FreeSulfurDioxide 0.05056663
##           Chlorides 0.04986323
##       ResidualSugar 0.04814381
##                  pH 0.03087143
##              TARGET 0.00000000
##        FixedAcidity 0.00000000
##     VolatileAcidity 0.00000000
##          CitricAcid 0.00000000
##             Density 0.00000000
##         LabelAppeal 0.00000000
##           AcidIndex 0.00000000

“MICE” package is used to calculate the missing value, specifically, predictive mean matching was used as an imputation method. The graph now show there is no more missing value.

## 
##  Variables sorted by number of missings: 
##            Variable Count
##              TARGET     0
##        FixedAcidity     0
##     VolatileAcidity     0
##          CitricAcid     0
##       ResidualSugar     0
##           Chlorides     0
##   FreeSulfurDioxide     0
##  TotalSulfurDioxide     0
##             Density     0
##                  pH     0
##           Sulphates     0
##             Alcohol     0
##         LabelAppeal     0
##           AcidIndex     0
##               STARS     0

Replace impossible negative values with zero.

b. Examine distribution of variables

Below is a mosaic of small bar charts that explore the distribution of each variables. Prior to replacing missing values with imputed values and negative values with zeros, the below graph would have been very different.

c. Correlation to target variable plot

Below table shows the correlation and statistical significant indication between variables in term of numerical values. We see STARS rating, LabelAppeal have positive correlation while AcideIndex have negative correlation with the target variable.

##                          TARGET FixedAcidity
## TARGET              1.000000000 -0.053298570
## FixedAcidity       -0.053298570  1.000000000
## VolatileAcidity    -0.096957000  0.016633522
## CitricAcid          0.013334588  0.008432220
## ResidualSugar       0.012578403 -0.015704545
## Chlorides          -0.047054191  0.001814482
## FreeSulfurDioxide   0.049026860  0.002798944
## TotalSulfurDioxide  0.055211925 -0.020429749
## Density            -0.035517502  0.003770637
## pH                 -0.008409828 -0.006778542
## Sulphates          -0.039629602  0.034405905
## Alcohol             0.062594620 -0.010176501
## LabelAppeal         0.291168558 -0.003598874
## AcidIndex          -0.246049449  0.188025951
## STARS               0.623393323 -0.030710769

Box plot was also used to explore potential outlier within the dataset. There is no appearant outliers in each features but some extreme values are present.

3. Model Building

Although this is a count regression model, I would like to set up baseline model using simple linear regression. This way we can compare how this

The next step is to use “poisson” family from generalized linear models to build the first count regression model. The model summary includes AIC, Akaike Information Criterion. It is a proportional comparison among various models; the lower AIC value, the better the model. Residual deviance is similar to variance in that var(y|x) = sigma^2*E(Y|X). Residual deviance of 16059 divided by 12780 degrees of freedom gives a value larger than 1. Therefore there is over dispersion.

## 
## Call:
## glm(formula = TARGET ~ ., family = "poisson", data = housetrain4)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.99796  -0.75725   0.08663   0.64523   2.79540  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         1.465e+00  1.960e-01   7.475 7.75e-14 ***
## FixedAcidity       -5.101e-04  9.688e-04  -0.527 0.598521    
## VolatileAcidity    -6.505e-02  9.540e-03  -6.819 9.19e-12 ***
## CitricAcid          2.104e-02  8.404e-03   2.504 0.012279 *  
## ResidualSugar       8.066e-05  2.274e-04   0.355 0.722855    
## Chlorides          -8.260e-02  2.446e-02  -3.377 0.000732 ***
## FreeSulfurDioxide   2.241e-04  5.006e-05   4.477 7.55e-06 ***
## TotalSulfurDioxide  1.154e-04  2.947e-05   3.915 9.04e-05 ***
## Density            -3.605e-01  1.925e-01  -1.873 0.061119 .  
## pH                 -1.867e-02  7.517e-03  -2.484 0.012980 *  
## Sulphates          -1.862e-02  7.437e-03  -2.503 0.012313 *  
## Alcohol             3.467e-03  1.394e-03   2.487 0.012901 *  
## LabelAppeal         1.491e-01  8.837e-03  16.870  < 2e-16 ***
## AcidIndex          -9.657e-02  4.529e-03 -21.324  < 2e-16 ***
## STARS               3.537e-01  5.524e-03  64.029  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 22861  on 12794  degrees of freedom
## Residual deviance: 16302  on 12780  degrees of freedom
## AIC: 48274
## 
## Number of Fisher Scoring iterations: 5

In the next model, “quasipoisson” technique was used. There is no AIC value from this model but overdispersion still persist.

Lets compare the coefficient of the two models. The STARS variable and Density variable both have bigger effect on the dependent variable. They have over 39% and 33% impact on the sales of the wine if everything else being equal. We have to note Density negatively impact the sales; the higher the Density, the lower the sale. In order to correctly interpret the coefficient, the exponential value of them needs to be calculated. The exponent value of STARS shows 39% increase in sale for every STARS rating.

##                    m1_Coefficient standardError1 m1a_Coefficient
## (Intercept)          1.465351e+00   1.960466e-01    1.465351e+00
## FixedAcidity        -5.100978e-04   9.687934e-04   -5.100978e-04
## VolatileAcidity     -6.505394e-02   9.540492e-03   -6.505394e-02
## CitricAcid           2.104279e-02   8.403591e-03    2.104279e-02
## ResidualSugar        8.065591e-05   2.274263e-04    8.065591e-05
## Chlorides           -8.260081e-02   2.445726e-02   -8.260081e-02
## FreeSulfurDioxide    2.241327e-04   5.005809e-05    2.241327e-04
## TotalSulfurDioxide   1.153873e-04   2.947214e-05    1.153873e-04
## Density             -3.604702e-01   1.924935e-01   -3.604702e-01
## pH                  -1.867367e-02   7.516643e-03   -1.867367e-02
## Sulphates           -1.861510e-02   7.437004e-03   -1.861510e-02
## Alcohol              3.466838e-03   1.394263e-03    3.466838e-03
## LabelAppeal          1.490786e-01   8.837033e-03    1.490786e-01
## AcidIndex           -9.656767e-02   4.528552e-03   -9.656767e-02
## STARS                3.537062e-01   5.524144e-03    3.537062e-01
##                    standardError1a
## (Intercept)           1.878047e-01
## FixedAcidity          9.280649e-04
## VolatileAcidity       9.139405e-03
## CitricAcid            8.050300e-03
## ResidualSugar         2.178652e-04
## Chlorides             2.342907e-02
## FreeSulfurDioxide     4.795362e-05
## TotalSulfurDioxide    2.823312e-05
## Density               1.844010e-01
## pH                    7.200640e-03
## Sulphates             7.124349e-03
## Alcohol               1.335648e-03
## LabelAppeal           8.465520e-03
## AcidIndex             4.338169e-03
## STARS                 5.291907e-03
## [1] 0.6789465
## [1] 1.39808

In poisson family of models, a better way to measure the model is by setting cross validation. That is to compare the predicted values from the model and the actual values of the dataset. In below chart, it is simply shown the discrepancy between the “actual” number of sales and predicted number of sales.

Due to the nature of zeros sales for some wine brands, a “hurlde” or “zero-inflated poisson regression” model is created to negate the effect of zero sales. A “hurlde” model is very similar to stacking two different models for two subpopulation within the dataset.

Based on the below rootogram, the hurdle model has effectively taken into zero sales account. However, the possion hurdle model (m2) is over fitting count of 1, 2, 6, 7 ,8 and under-fitting the rest of the variables except at 0. By improving the hurdle model, our goal of selecting a count regression model will be closer. Reference: http://data.library.virginia.edu/getting-started-with-hurdle-models/

In model six, ‘negative binomial’ distribution is assumed. However, the rootogram shows the model does not have improved goodness of fit when predicting the sale figures. Sale count of ‘1’ is still very overfitted.

4. Model Selection & Predictions

## # A tibble: 6 x 2
##   Models AIC_values
##   <chr>  <chr>     
## 1 Model1 48274.26  
## 2 Model2 <NA>      
## 3 Model3 42821.88  
## 4 Model4 42823.88  
## 5 Model5 45595.48  
## # ... with 1 more row