Section 1 - Data Exploration

The training dataset contains 12795 rows and 16 columns related to properties of commercial wines. A summary and boxplots of all variables suggest some columns have missing data, and some may contain outliers. The variable histogram suggests normal data distribution.

A visual correlation matrix reveals little relation among independent variables. The predictors LabelAppeal and STARS do have some connections with the target dependent variable.

1.1 Data Dimension and Structure

dim(df.train.orig)
## [1] 12795    16
str(df.train.orig)
## 'data.frame':    12795 obs. of  16 variables:
##  $ INDEX             : int  1 2 4 5 6 7 8 11 12 13 ...
##  $ TARGET            : int  3 3 5 3 4 0 0 4 3 6 ...
##  $ FixedAcidity      : num  3.2 4.5 7.1 5.7 8 11.3 7.7 6.5 14.8 5.5 ...
##  $ VolatileAcidity   : num  1.16 0.16 2.64 0.385 0.33 0.32 0.29 -1.22 0.27 -0.22 ...
##  $ CitricAcid        : num  -0.98 -0.81 -0.88 0.04 -1.26 0.59 -0.4 0.34 1.05 0.39 ...
##  $ ResidualSugar     : num  54.2 26.1 14.8 18.8 9.4 ...
##  $ Chlorides         : num  -0.567 -0.425 0.037 -0.425 NA 0.556 0.06 0.04 -0.007 -0.277 ...
##  $ FreeSulfurDioxide : num  NA 15 214 22 -167 -37 287 523 -213 62 ...
##  $ TotalSulfurDioxide: num  268 -327 142 115 108 15 156 551 NA 180 ...
##  $ Density           : num  0.993 1.028 0.995 0.996 0.995 ...
##  $ pH                : num  3.33 3.38 3.12 2.24 3.12 3.2 3.49 3.2 4.93 3.09 ...
##  $ Sulphates         : num  -0.59 0.7 0.48 1.83 1.77 1.29 1.21 NA 0.26 0.75 ...
##  $ Alcohol           : num  9.9 NA 22 6.2 13.7 15.4 10.3 11.6 15 12.6 ...
##  $ LabelAppeal       : int  0 -1 -1 -1 0 0 0 1 0 0 ...
##  $ AcidIndex         : int  8 7 8 6 9 11 8 7 6 8 ...
##  $ STARS             : int  2 3 3 1 2 NA NA 3 NA 4 ...

Displaying first few rows of the dataset

head(df.train.orig)
##   INDEX TARGET FixedAcidity VolatileAcidity CitricAcid ResidualSugar
## 1     1      3          3.2           1.160      -0.98          54.2
## 2     2      3          4.5           0.160      -0.81          26.1
## 3     4      5          7.1           2.640      -0.88          14.8
## 4     5      3          5.7           0.385       0.04          18.8
## 5     6      4          8.0           0.330      -1.26           9.4
## 6     7      0         11.3           0.320       0.59           2.2
##   Chlorides FreeSulfurDioxide TotalSulfurDioxide Density   pH Sulphates
## 1    -0.567                NA                268 0.99280 3.33     -0.59
## 2    -0.425                15               -327 1.02792 3.38      0.70
## 3     0.037               214                142 0.99518 3.12      0.48
## 4    -0.425                22                115 0.99640 2.24      1.83
## 5        NA              -167                108 0.99457 3.12      1.77
## 6     0.556               -37                 15 0.99940 3.20      1.29
##   Alcohol LabelAppeal AcidIndex STARS
## 1     9.9           0         8     2
## 2      NA          -1         7     3
## 3    22.0          -1         8     3
## 4     6.2          -1         6     1
## 5    13.7           0         9     2
## 6    15.4           0        11    NA

1.2 Basic Summary and Plots

summary(df.train.orig)
##      INDEX           TARGET       FixedAcidity     VolatileAcidity  
##  Min.   :    1   Min.   :0.000   Min.   :-18.100   Min.   :-2.7900  
##  1st Qu.: 4038   1st Qu.:2.000   1st Qu.:  5.200   1st Qu.: 0.1300  
##  Median : 8110   Median :3.000   Median :  6.900   Median : 0.2800  
##  Mean   : 8070   Mean   :3.029   Mean   :  7.076   Mean   : 0.3241  
##  3rd Qu.:12106   3rd Qu.:4.000   3rd Qu.:  9.500   3rd Qu.: 0.6400  
##  Max.   :16129   Max.   :8.000   Max.   : 34.400   Max.   : 3.6800  
##                                                                     
##    CitricAcid      ResidualSugar        Chlorides       FreeSulfurDioxide
##  Min.   :-3.2400   Min.   :-127.800   Min.   :-1.1710   Min.   :-555.00  
##  1st Qu.: 0.0300   1st Qu.:  -2.000   1st Qu.:-0.0310   1st Qu.:   0.00  
##  Median : 0.3100   Median :   3.900   Median : 0.0460   Median :  30.00  
##  Mean   : 0.3084   Mean   :   5.419   Mean   : 0.0548   Mean   :  30.85  
##  3rd Qu.: 0.5800   3rd Qu.:  15.900   3rd Qu.: 0.1530   3rd Qu.:  70.00  
##  Max.   : 3.8600   Max.   : 141.150   Max.   : 1.3510   Max.   : 623.00  
##                    NA's   :616        NA's   :638       NA's   :647      
##  TotalSulfurDioxide    Density             pH          Sulphates      
##  Min.   :-823.0     Min.   :0.8881   Min.   :0.480   Min.   :-3.1300  
##  1st Qu.:  27.0     1st Qu.:0.9877   1st Qu.:2.960   1st Qu.: 0.2800  
##  Median : 123.0     Median :0.9945   Median :3.200   Median : 0.5000  
##  Mean   : 120.7     Mean   :0.9942   Mean   :3.208   Mean   : 0.5271  
##  3rd Qu.: 208.0     3rd Qu.:1.0005   3rd Qu.:3.470   3rd Qu.: 0.8600  
##  Max.   :1057.0     Max.   :1.0992   Max.   :6.130   Max.   : 4.2400  
##  NA's   :682                         NA's   :395     NA's   :1210     
##     Alcohol       LabelAppeal          AcidIndex          STARS      
##  Min.   :-4.70   Min.   :-2.000000   Min.   : 4.000   Min.   :1.000  
##  1st Qu.: 9.00   1st Qu.:-1.000000   1st Qu.: 7.000   1st Qu.:1.000  
##  Median :10.40   Median : 0.000000   Median : 8.000   Median :2.000  
##  Mean   :10.49   Mean   :-0.009066   Mean   : 7.773   Mean   :2.042  
##  3rd Qu.:12.40   3rd Qu.: 1.000000   3rd Qu.: 8.000   3rd Qu.:3.000  
##  Max.   :26.50   Max.   : 2.000000   Max.   :17.000   Max.   :4.000  
##  NA's   :653                                          NA's   :3359

Box plots showing graphically the summaries of the variables

Frequency plots of the variables showing nearly normal distribution

1.3 Correlations of Variables

1.4 Missing Values

aggr_plot <- aggr(df.train.orig, col=c('navyblue','red'), numbers=TRUE, sortVars=TRUE, labels=names(df.train.orig), cex.axis=.7, gap=5, ylab=c("Histogram of missing data","Pattern"))

## 
##  Variables sorted by number of missings: 
##            Variable      Count
##               STARS 0.26252442
##           Sulphates 0.09456819
##  TotalSulfurDioxide 0.05330207
##             Alcohol 0.05103556
##   FreeSulfurDioxide 0.05056663
##           Chlorides 0.04986323
##       ResidualSugar 0.04814381
##                  pH 0.03087143
##               INDEX 0.00000000
##              TARGET 0.00000000
##        FixedAcidity 0.00000000
##     VolatileAcidity 0.00000000
##          CitricAcid 0.00000000
##             Density 0.00000000
##         LabelAppeal 0.00000000
##           AcidIndex 0.00000000

Section 2 - Data Preparation

2.1 Impute Missing Values

Use the MICE package to impute missing data. The resulting density plot suggest the imputed values are reasonable compared to the existing distribution.

tempData <- mice(df.train.orig,m=1,maxit=1)
## 
##  iter imp variable
##   1   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS
densityplot(tempData)

df.train.nomissing <- complete(tempData)

2.2 Cap Outliers

Any outliers outside of lower 1.5IQR would be capped at 5th %ile, and observations above the upper 1.5IQR would be capped at 95th %ile.

df.train.nooutliers <- df.train.nomissing

id <- c(2:16)
for (val in id) {
  qnt <- quantile(df.train.nooutliers[,val], probs=c(.25, .75), na.rm = T)
  caps <- quantile(df.train.nooutliers[,val], probs=c(.05, .95), na.rm = T)
  H <- 1.5 * IQR(df.train.nooutliers[,val], na.rm = T)
  df.train.nooutliers[,val][df.train.nooutliers[,val] < (qnt[1] - H)] <- caps[1]
  df.train.nooutliers[,val][df.train.nooutliers[,val] > (qnt[2] + H)] <- caps[2]
}

df.train.transformed <- df.train.nooutliers

Section 3 - Build and Evaluate Models

Split the training data set such that 80% of the observations will be used to train the model and 20% will be used to evaluate.

Evaluate performance of models based on comparing the predicted target values with holdout’s mean and standard error below.

## [1] 3.026417
## [1] 0.01698166

3.1 Poisson - all variables

p1 <- glm(TARGET ~ . -INDEX, data=ntraining, family="poisson")

# summary of the model
summary(p1)
## 
## Call:
## glm(formula = TARGET ~ . - INDEX, family = "poisson", data = ntraining)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.9133  -0.6860   0.1377   0.6367   2.5006  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         1.447e+00  2.238e-01   6.463 1.03e-10 ***
## FixedAcidity       -3.288e-04  1.034e-03  -0.318 0.750568    
## VolatileAcidity    -4.153e-02  8.097e-03  -5.129 2.91e-07 ***
## CitricAcid          1.421e-02  7.376e-03   1.927 0.053993 .  
## ResidualSugar       7.329e-05  1.784e-04   0.411 0.681215    
## Chlorides          -3.590e-02  1.909e-02  -1.881 0.060003 .  
## FreeSulfurDioxide   1.363e-04  3.905e-05   3.491 0.000481 ***
## TotalSulfurDioxide  1.161e-04  2.867e-05   4.049 5.15e-05 ***
## Density            -2.411e-01  2.182e-01  -1.105 0.269252    
## pH                 -1.748e-02  9.625e-03  -1.816 0.069398 .  
## Sulphates          -1.121e-02  6.704e-03  -1.673 0.094378 .  
## Alcohol             5.051e-03  1.774e-03   2.847 0.004415 ** 
## LabelAppeal         1.417e-01  6.821e-03  20.769  < 2e-16 ***
## AcidIndex          -1.054e-01  5.804e-03 -18.158  < 2e-16 ***
## STARS               3.391e-01  6.286e-03  53.940  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 18262  on 10235  degrees of freedom
## Residual deviance: 12897  on 10221  degrees of freedom
## AIC: 38456
## 
## Number of Fisher Scoring iterations: 5

Evaluate the model’s mean and standard error using holdout dataset:

mean(floor(predict(p1, ntesting, type = "response")))
## [1] 2.51739
std.error(floor(predict(p1, ntesting, type = "response")))
## [1] 0.01219045

3.1 Poisson - significant variables

p2 <- glm(TARGET ~ . -INDEX -FixedAcidity -ResidualSugar -Density,
          data=ntraining, family="poisson")
summary(p2)
## 
## Call:
## glm(formula = TARGET ~ . - INDEX - FixedAcidity - ResidualSugar - 
##     Density, family = "poisson", data = ntraining)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.9317  -0.6870   0.1380   0.6371   2.5059  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         1.209e+00  6.177e-02  19.568  < 2e-16 ***
## VolatileAcidity    -4.163e-02  8.097e-03  -5.142 2.72e-07 ***
## CitricAcid          1.418e-02  7.376e-03   1.923 0.054466 .  
## Chlorides          -3.660e-02  1.908e-02  -1.918 0.055066 .  
## FreeSulfurDioxide   1.362e-04  3.905e-05   3.487 0.000488 ***
## TotalSulfurDioxide  1.158e-04  2.865e-05   4.044 5.26e-05 ***
## pH                 -1.751e-02  9.622e-03  -1.820 0.068771 .  
## Sulphates          -1.111e-02  6.701e-03  -1.658 0.097321 .  
## Alcohol             5.041e-03  1.774e-03   2.841 0.004494 ** 
## LabelAppeal         1.418e-01  6.820e-03  20.788  < 2e-16 ***
## AcidIndex          -1.059e-01  5.730e-03 -18.477  < 2e-16 ***
## STARS               3.392e-01  6.284e-03  53.986  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 18262  on 10235  degrees of freedom
## Residual deviance: 12898  on 10224  degrees of freedom
## AIC: 38451
## 
## Number of Fisher Scoring iterations: 5

Evaluate the model’s mean and standard error using holdout dataset:

mean(floor(predict(p2, ntesting, type = "response")))
## [1] 2.517155
std.error(floor(predict(p2, ntesting, type = "response")))
## [1] 0.01218848

3.2 Negative Binomial - all variables

nb1 <- glm.nb(TARGET ~ . -INDEX, data=ntraining)
summary(nb1)
## 
## Call:
## glm.nb(formula = TARGET ~ . - INDEX, data = ntraining, init.theta = 48733.23342, 
##     link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.9132  -0.6860   0.1377   0.6367   2.5006  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         1.447e+00  2.238e-01   6.463 1.03e-10 ***
## FixedAcidity       -3.288e-04  1.034e-03  -0.318 0.750561    
## VolatileAcidity    -4.153e-02  8.097e-03  -5.129 2.91e-07 ***
## CitricAcid          1.421e-02  7.376e-03   1.927 0.054000 .  
## ResidualSugar       7.330e-05  1.784e-04   0.411 0.681209    
## Chlorides          -3.590e-02  1.909e-02  -1.881 0.060006 .  
## FreeSulfurDioxide   1.363e-04  3.906e-05   3.491 0.000481 ***
## TotalSulfurDioxide  1.161e-04  2.867e-05   4.049 5.15e-05 ***
## Density            -2.411e-01  2.182e-01  -1.105 0.269264    
## pH                 -1.748e-02  9.626e-03  -1.816 0.069396 .  
## Sulphates          -1.121e-02  6.704e-03  -1.673 0.094383 .  
## Alcohol             5.051e-03  1.774e-03   2.847 0.004416 ** 
## LabelAppeal         1.417e-01  6.822e-03  20.768  < 2e-16 ***
## AcidIndex          -1.054e-01  5.805e-03 -18.158  < 2e-16 ***
## STARS               3.391e-01  6.287e-03  53.938  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(48733.23) family taken to be 1)
## 
##     Null deviance: 18262  on 10235  degrees of freedom
## Residual deviance: 12896  on 10221  degrees of freedom
## AIC: 38458
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  48733 
##           Std. Err.:  63499 
## Warning while fitting theta: iteration limit reached 
## 
##  2 x log-likelihood:  -38425.91

Evaluate the model’s mean and standard error using holdout dataset:

mean(floor(predict(nb1, ntesting, type = "response")))
## [1] 2.51739
std.error(floor(predict(nb1, ntesting, type = "response")))
## [1] 0.01219045

3.2 Negative Binomial - significant variables

nb2 <- glm.nb(TARGET ~ . -INDEX -FixedAcidity -ResidualSugar -Density,
          data=ntraining)
summary(nb2)
## 
## Call:
## glm.nb(formula = TARGET ~ . - INDEX - FixedAcidity - ResidualSugar - 
##     Density, data = ntraining, init.theta = 48721.75002, link = log)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -2.932  -0.687   0.138   0.637   2.506  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         1.209e+00  6.177e-02  19.567  < 2e-16 ***
## VolatileAcidity    -4.164e-02  8.097e-03  -5.142 2.72e-07 ***
## CitricAcid          1.418e-02  7.376e-03   1.923 0.054472 .  
## Chlorides          -3.660e-02  1.908e-02  -1.918 0.055069 .  
## FreeSulfurDioxide   1.362e-04  3.905e-05   3.487 0.000488 ***
## TotalSulfurDioxide  1.158e-04  2.865e-05   4.044 5.26e-05 ***
## pH                 -1.751e-02  9.623e-03  -1.820 0.068770 .  
## Sulphates          -1.111e-02  6.701e-03  -1.658 0.097326 .  
## Alcohol             5.041e-03  1.774e-03   2.841 0.004496 ** 
## LabelAppeal         1.418e-01  6.820e-03  20.787  < 2e-16 ***
## AcidIndex          -1.059e-01  5.731e-03 -18.477  < 2e-16 ***
## STARS               3.392e-01  6.284e-03  53.985  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(48721.75) family taken to be 1)
## 
##     Null deviance: 18262  on 10235  degrees of freedom
## Residual deviance: 12898  on 10224  degrees of freedom
## AIC: 38453
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  48722 
##           Std. Err.:  63485 
## Warning while fitting theta: iteration limit reached 
## 
##  2 x log-likelihood:  -38427.42

Evaluate the model’s mean and standard error using holdout dataset:

mean(floor(predict(nb2, ntesting, type = "response")))
## [1] 2.517155
std.error(floor(predict(nb2, ntesting, type = "response")))
## [1] 0.01218848

3.3 Zero-Inflated - all variables

zi1 <- zeroinfl(TARGET ~ . -INDEX |STARS, data=ntraining, dist="negbin")
summary(zi1)
## 
## Call:
## zeroinfl(formula = TARGET ~ . - INDEX | STARS, data = ntraining, 
##     dist = "negbin")
## 
## Pearson residuals:
##      Min       1Q   Median       3Q      Max 
## -2.12705 -0.49934  0.06438  0.49034  2.08559 
## 
## Count model coefficients (negbin with log link):
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         1.520e+00  2.311e-01   6.576 4.82e-11 ***
## FixedAcidity        2.373e-04  1.066e-03   0.223   0.8238    
## VolatileAcidity    -1.734e-02  8.375e-03  -2.071   0.0384 *  
## CitricAcid          3.897e-03  7.570e-03   0.515   0.6067    
## ResidualSugar      -1.156e-04  1.836e-04  -0.630   0.5289    
## Chlorides          -1.911e-02  1.970e-02  -0.970   0.3319    
## FreeSulfurDioxide   3.943e-05  4.003e-05   0.985   0.3247    
## TotalSulfurDioxide -1.411e-05  2.906e-05  -0.485   0.6274    
## Density            -2.770e-01  2.250e-01  -1.231   0.2183    
## pH                  5.195e-03  9.961e-03   0.522   0.6020    
## Sulphates          -2.062e-03  6.937e-03  -0.297   0.7663    
## Alcohol             8.079e-03  1.827e-03   4.422 9.77e-06 ***
## LabelAppeal         2.259e-01  7.143e-03  31.619  < 2e-16 ***
## AcidIndex          -3.848e-02  6.312e-03  -6.096 1.09e-09 ***
## STARS               1.164e-01  7.018e-03  16.588  < 2e-16 ***
## Log(theta)          2.301e+01         NA      NA       NA    
## 
## Zero-inflation model coefficients (binomial with logit link):
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   2.4699     0.1171   21.09   <2e-16 ***
## STARS        -2.8004     0.1039  -26.95   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Theta = 9835737329.486 
## Number of iterations in BFGS optimization: 67 
## Log-likelihood: -1.72e+04 on 18 Df

Evaluate the model’s mean and standard error using holdout dataset:

mean(floor(predict(zi1, ntesting, type = "response")))
## [1] 2.538492
std.error(floor(predict(zi1, ntesting, type = "response")))
## [1] 0.01246765

3.3 Zero-Inflated - significant variables

zi2 <- zeroinfl(TARGET ~ . -INDEX -FixedAcidity -ResidualSugar -Density -CitricAcid -Chlorides -FreeSulfurDioxide -TotalSulfurDioxide -pH -Sulphates |STARS, data=ntraining, dist="negbin")
summary(zi2)
## 
## Call:
## zeroinfl(formula = TARGET ~ . - INDEX - FixedAcidity - ResidualSugar - 
##     Density - CitricAcid - Chlorides - FreeSulfurDioxide - TotalSulfurDioxide - 
##     pH - Sulphates | STARS, data = ntraining, dist = "negbin")
## 
## Pearson residuals:
##      Min       1Q   Median       3Q      Max 
## -2.13221 -0.49523  0.06362  0.49137  2.07155 
## 
## Count model coefficients (negbin with log link):
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      1.263002   0.053928  23.420  < 2e-16 ***
## VolatileAcidity -0.017495   0.008365  -2.091   0.0365 *  
## Alcohol          0.008133   0.001825   4.457 8.32e-06 ***
## LabelAppeal      0.226149   0.007136  31.693  < 2e-16 ***
## AcidIndex       -0.038786   0.006222  -6.234 4.56e-10 ***
## STARS            0.116290   0.007010  16.588  < 2e-16 ***
## Log(theta)      17.344846   1.442677  12.023  < 2e-16 ***
## 
## Zero-inflation model coefficients (binomial with logit link):
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   2.4719     0.1172   21.09   <2e-16 ***
## STARS        -2.8023     0.1040  -26.93   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Theta = 34101294.411 
## Number of iterations in BFGS optimization: 39 
## Log-likelihood: -1.721e+04 on 9 Df

Evaluate the model’s mean and standard error using holdout dataset:

mean(floor(predict(zi2, ntesting, type = "response")))
## [1] 2.538726
std.error(floor(predict(zi2, ntesting, type = "response")))
## [1] 0.01249403

Section 4 - Select Models

Although not a big difference in the performance of all models, the last model performs slightly better with closer values to the baseline’s mean and standard error.

The Zero-Inflated with Significant variables model will be used to predict the evaluation dataset.

## 
##  iter imp variable
##   1   1  ResidualSugar  Chlorides  FreeSulfurDioxide  TotalSulfurDioxide  pH  Sulphates  Alcohol  STARS

Export to working directory the evaluation dataset with predicted TARGET to working directory

# Export
  write.csv(eval_df,file="HW5_Wine.csv")