Cover Page

Data 621 - Week 5 HW

Baron Curtin

CUNY School of Professional Studies

Introduction

The purpose of this assignment is to build a count regression model to predict the number of cases of wine that will be sold given certain properties of the wine

Data Exploration

Non Visual Exploration

Variables

  • Response Variable: TARGET
    • Number of cases of wine expected the sell
  • Explanatory Variables
    • 12 Variables describing the chemical composition of the wine
    • 2 Variables describing non-chemical chracteristics of the wine
  • Identification Variable: INDEX
    • Will not be used in the analysis
## Observations: 12,795
## Variables: 15
## $ TARGET             <int> 3, 3, 5, 3, 4, 0, 0, 4, 3, 6, 0, 4, 3, 7, 4...
## $ FixedAcidity       <dbl> 3.2, 4.5, 7.1, 5.7, 8.0, 11.3, 7.7, 6.5, 14...
## $ VolatileAcidity    <dbl> 1.160, 0.160, 2.640, 0.385, 0.330, 0.320, 0...
## $ CitricAcid         <dbl> -0.98, -0.81, -0.88, 0.04, -1.26, 0.59, -0....
## $ ResidualSugar      <dbl> 54.20, 26.10, 14.80, 18.80, 9.40, 2.20, 21....
## $ Chlorides          <dbl> -0.567, -0.425, 0.037, -0.425, NA, 0.556, 0...
## $ FreeSulfurDioxide  <dbl> NA, 15, 214, 22, -167, -37, 287, 523, -213,...
## $ TotalSulfurDioxide <dbl> 268, -327, 142, 115, 108, 15, 156, 551, NA,...
## $ Density            <dbl> 0.99280, 1.02792, 0.99518, 0.99640, 0.99457...
## $ pH                 <dbl> 3.33, 3.38, 3.12, 2.24, 3.12, 3.20, 3.49, 3...
## $ Sulphates          <dbl> -0.59, 0.70, 0.48, 1.83, 1.77, 1.29, 1.21, ...
## $ Alcohol            <dbl> 9.9, NA, 22.0, 6.2, 13.7, 15.4, 10.3, 11.6,...
## $ LabelAppeal        <int> 0, -1, -1, -1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 2...
## $ AcidIndex          <int> 8, 7, 8, 6, 9, 11, 8, 7, 6, 8, 5, 10, 7, 8,...
## $ STARS              <int> 2, 3, 3, 1, 2, NA, NA, 3, NA, 4, 1, 2, 2, 3...

The glimpse function of dplyr shows that there are 12795 observations and 15 variables (without INDEX) * All of the variables are numeric * Various fields have missing values

  • 11 variables are of the double type
  • 5 variables of the integer type
##      TARGET       FixedAcidity     VolatileAcidity     CitricAcid     
##  Min.   :0.000   Min.   :-18.100   Min.   :-2.7900   Min.   :-3.2400  
##  1st Qu.:2.000   1st Qu.:  5.200   1st Qu.: 0.1300   1st Qu.: 0.0300  
##  Median :3.000   Median :  6.900   Median : 0.2800   Median : 0.3100  
##  Mean   :3.029   Mean   :  7.076   Mean   : 0.3241   Mean   : 0.3084  
##  3rd Qu.:4.000   3rd Qu.:  9.500   3rd Qu.: 0.6400   3rd Qu.: 0.5800  
##  Max.   :8.000   Max.   : 34.400   Max.   : 3.6800   Max.   : 3.8600  
##                                                                       
##  ResidualSugar        Chlorides       FreeSulfurDioxide TotalSulfurDioxide
##  Min.   :-127.800   Min.   :-1.1710   Min.   :-555.00   Min.   :-823.0    
##  1st Qu.:  -2.000   1st Qu.:-0.0310   1st Qu.:   0.00   1st Qu.:  27.0    
##  Median :   3.900   Median : 0.0460   Median :  30.00   Median : 123.0    
##  Mean   :   5.419   Mean   : 0.0548   Mean   :  30.85   Mean   : 120.7    
##  3rd Qu.:  15.900   3rd Qu.: 0.1530   3rd Qu.:  70.00   3rd Qu.: 208.0    
##  Max.   : 141.150   Max.   : 1.3510   Max.   : 623.00   Max.   :1057.0    
##  NA's   :616        NA's   :638       NA's   :647       NA's   :682       
##     Density             pH          Sulphates          Alcohol     
##  Min.   :0.8881   Min.   :0.480   Min.   :-3.1300   Min.   :-4.70  
##  1st Qu.:0.9877   1st Qu.:2.960   1st Qu.: 0.2800   1st Qu.: 9.00  
##  Median :0.9945   Median :3.200   Median : 0.5000   Median :10.40  
##  Mean   :0.9942   Mean   :3.208   Mean   : 0.5271   Mean   :10.49  
##  3rd Qu.:1.0005   3rd Qu.:3.470   3rd Qu.: 0.8600   3rd Qu.:12.40  
##  Max.   :1.0992   Max.   :6.130   Max.   : 4.2400   Max.   :26.50  
##                   NA's   :395     NA's   :1210      NA's   :653    
##   LabelAppeal          AcidIndex          STARS      
##  Min.   :-2.000000   Min.   : 4.000   Min.   :1.000  
##  1st Qu.:-1.000000   1st Qu.: 7.000   1st Qu.:1.000  
##  Median : 0.000000   Median : 8.000   Median :2.000  
##  Mean   :-0.009066   Mean   : 7.773   Mean   :2.042  
##  3rd Qu.: 1.000000   3rd Qu.: 8.000   3rd Qu.:3.000  
##  Max.   : 2.000000   Max.   :17.000   Max.   :4.000  
##                                       NA's   :3359
  • The summary function provides information on 8 fields that are missing values
  • The pH range is from ~0.5 to ~6.1
    • These values indicate that all of these wines are closer to acids rather than bases
    • Being alcohol, one would expect the pHs to be more basic

Basic Stats

## # A tibble: 15 x 16
##    variables     n na_vals   obs     mean      min      q1  median      q3
##    <chr>     <dbl>   <dbl> <dbl>    <dbl>    <dbl>   <dbl>   <dbl>   <dbl>
##  1 AcidIndex 12795       0 12795  7.77e+0    4       7       8       8    
##  2 Alcohol   12795     653 12142  1.05e+1   -4.7     9      10.4    12.4  
##  3 Chlorides 12795     638 12157  5.48e-2   -1.17   -0.031   0.046   0.153
##  4 CitricAc~ 12795       0 12795  3.08e-1   -3.24    0.03    0.31    0.580
##  5 Density   12795       0 12795  9.94e-1    0.888   0.988   0.994   1.00 
##  6 FixedAci~ 12795       0 12795  7.08e+0  -18.1     5.2     6.9     9.5  
##  7 FreeSulf~ 12795     647 12148  3.08e+1 -555       0      30      70    
##  8 LabelApp~ 12795       0 12795 -9.07e-3   -2      -1       0       1    
##  9 pH        12795     395 12400  3.21e+0    0.48    2.96    3.2     3.47 
## 10 Residual~ 12795     616 12179  5.42e+0 -128.     -2       3.9    15.9  
## 11 STARS     12795    3359  9436  2.04e+0    1       1       2       3    
## 12 Sulphates 12795    1210 11585  5.27e-1   -3.13    0.28    0.5     0.86 
## 13 TARGET    12795       0 12795  3.03e+0    0       2       3       4    
## 14 TotalSul~ 12795     682 12113  1.21e+2 -823      27     123     208    
## 15 Volatile~ 12795       0 12795  3.24e-1   -2.79    0.13    0.28    0.64 
## # ... with 7 more variables: max <dbl>, sd <dbl>, var <dbl>, range <dbl>,
## #   iqr <dbl>, skewness <dbl>, kurtosis <dbl>
  • basisStats further confirms the existence of missing values in the aforemention variables
  • No variable appears to suffer from severe skew or kurtosis
  • There are a number of variables with negative values that should not contain negatives
    • These will be addressed in data preparation
    • All of the variables below except for LabelAppeal are chemical properties yet all should not contain negatives

Correlation

  • The correlation table above reveals that STARS and LabelAppeal have the greatest correlation with selling cases of wine
  • AcidIndex has the highest negative correlation indicating that the higher the acidity, the less cases will be sold

  • The correlation matrix further confirms that values from the correlation table
  • LabelAppeal and STARS show a weak positive correlation
    • It would appear that LabelAppeal has an impact on STARS

Visual Inspection

Density Plots

## No id variables; using all as measure variables

The density plots reveal some of the variables are approximately normally distributed and others that are not: * Normally Distributed Variables: + FixedAcidity, VolatileAcidity, CitricAcid, ResidualSugar, Chlorides, FreeSulfurDioxide, TotalSulfurDioxide, Density, pH, Sulphates, Alcohol * Non-normal Variables: + TARGET, LabelAppeal, AcidIndex, STARS

Histograms

  • The histograms are able to refute some of the conclusions made in the density plots showing that Alcohol, Density, and pH as that approximately normal distributions
  • The histograms also reveal right skew in Sulphates, VolatileAcidity, TotalSulfurDioxide, ResidualSugar, FixedAcidity, Chlorides

Box Plots

  • The box plots are able to confirm the existence of skew in VolatileAcidity, CitricAcid, Sulphates, LabelAppeal, AcidIndex
  • The box plots further confirm that high variability that exists ResidualSugar, TotalSulfurDioxide

Data Preparation

INDEX removed from datasets

As INDEX is just an identification variable, it will be removed

Fields with Negatives Set to 0

## # A tibble: 10 x 1
##    variables         
##    <chr>             
##  1 Alcohol           
##  2 Chlorides         
##  3 CitricAcid        
##  4 FixedAcidity      
##  5 FreeSulfurDioxide 
##  6 LabelAppeal       
##  7 ResidualSugar     
##  8 Sulphates         
##  9 TotalSulfurDioxide
## 10 VolatileAcidity

The variables above contained values lower then zero. All of them are chemical properties of the wine (except LabelAppeal) and should not be 0. They have been changed so that their lowest value is 0. LabelAppeal also receives the same treatment because there aren’t many rating scales that begin in the negatives.

Missing Value Imputation

THe missing values will need to be handled prior to model development. To do so, the missForest package will be used

##   missForest iteration 1 in progress...done!
##   missForest iteration 2 in progress...done!
##   missForest iteration 3 in progress...done!
##   missForest iteration 4 in progress...done!
##   removed variable(s) 1 due to the missingness of all entries
##   missForest iteration 1 in progress...done!
##   missForest iteration 2 in progress...done!
##   missForest iteration 3 in progress...done!
##   missForest iteration 4 in progress...done!

Build Models

Poisson Regression Model

The glmulti package provides a convinient way to generate the best models. To generate the models, we will use that package

Best Formulas

## [[1]]
## TARGET ~ 1 + VolatileAcidity + CitricAcid + Chlorides + FreeSulfurDioxide + 
##     TotalSulfurDioxide + Density + pH + Sulphates + Alcohol + 
##     LabelAppeal + AcidIndex + STARS
## <environment: 0x0000000027a16e10>
## 
## [[2]]
## TARGET ~ 1 + VolatileAcidity + CitricAcid + Chlorides + FreeSulfurDioxide + 
##     TotalSulfurDioxide + pH + Sulphates + Alcohol + LabelAppeal + 
##     AcidIndex + STARS
## <environment: 0x0000000027a16e10>
## 
## [[3]]
## TARGET ~ 1 + FixedAcidity + VolatileAcidity + CitricAcid + Chlorides + 
##     FreeSulfurDioxide + TotalSulfurDioxide + Density + pH + Sulphates + 
##     Alcohol + LabelAppeal + AcidIndex + STARS
## <environment: 0x0000000027a16e10>
## 
## [[4]]
## TARGET ~ 1 + VolatileAcidity + CitricAcid + ResidualSugar + Chlorides + 
##     FreeSulfurDioxide + TotalSulfurDioxide + Density + pH + Sulphates + 
##     Alcohol + LabelAppeal + AcidIndex + STARS
## <environment: 0x0000000027a16e10>
## 
## [[5]]
## TARGET ~ 1 + VolatileAcidity + CitricAcid + Chlorides + FreeSulfurDioxide + 
##     TotalSulfurDioxide + Density + pH + Sulphates + LabelAppeal + 
##     AcidIndex + STARS
## <environment: 0x0000000027a16e10>

Model 1: 13 Variables

## 
## Call:
## glm(formula = TARGET ~ 1 + VolatileAcidity + CitricAcid + Chlorides + 
##     FreeSulfurDioxide + TotalSulfurDioxide + Density + pH + Sulphates + 
##     Alcohol + LabelAppeal + AcidIndex + STARS, family = "poisson", 
##     data = imputed$train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.9377  -0.7520   0.1119   0.6389   2.8292  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         1.457e+00  1.960e-01   7.436 1.04e-13 ***
## VolatileAcidity    -6.774e-02  9.523e-03  -7.113 1.14e-12 ***
## CitricAcid          1.925e-02  8.398e-03   2.292  0.02189 *  
## Chlorides          -9.086e-02  2.519e-02  -3.607  0.00031 ***
## FreeSulfurDioxide   2.087e-04  5.139e-05   4.061 4.88e-05 ***
## TotalSulfurDioxide  1.177e-04  3.024e-05   3.892 9.95e-05 ***
## Density            -3.509e-01  1.921e-01  -1.826  0.06778 .  
## pH                 -1.772e-02  7.620e-03  -2.326  0.02003 *  
## Sulphates          -2.647e-02  7.861e-03  -3.367  0.00076 ***
## Alcohol             2.990e-03  1.428e-03   2.094  0.03625 *  
## LabelAppeal         1.458e-01  8.896e-03  16.389  < 2e-16 ***
## AcidIndex          -9.822e-02  4.469e-03 -21.977  < 2e-16 ***
## STARS               3.614e-01  5.862e-03  61.652  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 22861  on 12794  degrees of freedom
## Residual deviance: 16644  on 12782  degrees of freedom
## AIC: 48612
## 
## Number of Fisher Scoring iterations: 5
  • The AIC for this model is 48651
  • The model includes 11 significant variables at the 5% level
  • AcidIndex has the highest negative impact on TARGET
  • STARS has the highest positive impact on TARGET
    • This aligns with conventional wisdom as one would expect previous reviews to influence future buyers

Model 2: All Variables

## 
## Call:
## glm(formula = TARGET ~ 1 + ., family = "poisson", data = imputed$train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.9375  -0.7516   0.1103   0.6390   2.8308  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         1.457e+00  1.960e-01   7.436 1.04e-13 ***
## FixedAcidity       -5.757e-04  9.689e-04  -0.594 0.552384    
## VolatileAcidity    -6.768e-02  9.523e-03  -7.107 1.19e-12 ***
## CitricAcid          1.926e-02  8.399e-03   2.293 0.021853 *  
## ResidualSugar       3.705e-05  2.336e-04   0.159 0.873990    
## Chlorides          -9.095e-02  2.519e-02  -3.610 0.000306 ***
## FreeSulfurDioxide   2.089e-04  5.139e-05   4.064 4.82e-05 ***
## TotalSulfurDioxide  1.175e-04  3.025e-05   3.884 0.000103 ***
## Density            -3.507e-01  1.921e-01  -1.825 0.067940 .  
## pH                 -1.771e-02  7.621e-03  -2.323 0.020162 *  
## Sulphates          -2.637e-02  7.863e-03  -3.353 0.000799 ***
## Alcohol             2.994e-03  1.428e-03   2.097 0.036015 *  
## LabelAppeal         1.457e-01  8.897e-03  16.381  < 2e-16 ***
## AcidIndex          -9.780e-02  4.525e-03 -21.612  < 2e-16 ***
## STARS               3.614e-01  5.863e-03  61.638  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 22861  on 12794  degrees of freedom
## Residual deviance: 16644  on 12780  degrees of freedom
## AIC: 48616
## 
## Number of Fisher Scoring iterations: 5
  • The AIC value ias 48654, slightly higher than model 1
  • It reintroduces some insignificant variables that may overfit the data
  • The highest positive impactful variable is STARS, aligning with conventional wisdom
  • The highest negative variable is AcidIndex

Negative Binomial Regression

The glmulti package provides a convinient way to generate the best models. To generate the models, we will use that package

Best Formulas

## [[1]]
## TARGET ~ 1 + VolatileAcidity + Chlorides + FreeSulfurDioxide + 
##     TotalSulfurDioxide + pH + Sulphates + LabelAppeal + AcidIndex + 
##     STARS
## <environment: 0x000000005c280218>
## 
## [[2]]
## TARGET ~ 1 + VolatileAcidity + CitricAcid + Chlorides + FreeSulfurDioxide + 
##     TotalSulfurDioxide + pH + Sulphates + LabelAppeal + AcidIndex + 
##     STARS
## <environment: 0x000000005c280218>
## 
## [[3]]
## TARGET ~ 1 + VolatileAcidity + Chlorides + FreeSulfurDioxide + 
##     TotalSulfurDioxide + Density + pH + Sulphates + LabelAppeal + 
##     AcidIndex + STARS
## <environment: 0x000000005c280218>
## 
## [[4]]
## TARGET ~ 1 + VolatileAcidity + Chlorides + FreeSulfurDioxide + 
##     TotalSulfurDioxide + Sulphates + LabelAppeal + AcidIndex + 
##     STARS
## <environment: 0x000000005c280218>
## 
## [[5]]
## TARGET ~ 1 + VolatileAcidity + CitricAcid + Chlorides + FreeSulfurDioxide + 
##     TotalSulfurDioxide + Density + pH + Sulphates + LabelAppeal + 
##     AcidIndex + STARS
## <environment: 0x000000005c280218>

Model 3: 9 Variables

## 
## Call:
## glm.nb(formula = TARGET ~ 1 + VolatileAcidity + Chlorides + FreeSulfurDioxide + 
##     TotalSulfurDioxide + pH + Sulphates + LabelAppeal + AcidIndex + 
##     STARS, data = imputed$train, init.theta = 44857.248, link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.9310  -0.7486   0.1083   0.6398   2.8329  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         1.149e+00  4.716e-02  24.368  < 2e-16 ***
## VolatileAcidity    -6.782e-02  9.523e-03  -7.121 1.07e-12 ***
## Chlorides          -9.352e-02  2.518e-02  -3.714 0.000204 ***
## FreeSulfurDioxide   2.067e-04  5.139e-05   4.023 5.75e-05 ***
## TotalSulfurDioxide  1.154e-04  3.022e-05   3.819 0.000134 ***
## pH                 -1.784e-02  7.619e-03  -2.342 0.019172 *  
## Sulphates          -2.623e-02  7.860e-03  -3.337 0.000847 ***
## LabelAppeal         1.457e-01  8.895e-03  16.377  < 2e-16 ***
## AcidIndex          -9.836e-02  4.455e-03 -22.076  < 2e-16 ***
## STARS               3.627e-01  5.843e-03  62.076  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(44857.25) family taken to be 1)
## 
##     Null deviance: 22860  on 12794  degrees of freedom
## Residual deviance: 16657  on 12785  degrees of freedom
## AIC: 48622
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  44857 
##           Std. Err.:  53571 
## Warning while fitting theta: iteration limit reached 
## 
##  2 x log-likelihood:  -48599.69
  • This model has an AIC of 48661, higher than the poisson models
  • All of the variables are significant
  • The highest positive coefficient is STARS, aligning again with conventional wisdom
  • The lowest negative coefficient is AcidIndex, similar to the poisson regressions

Model 4: Log Transform 9 Variables

## 
## Call:
## glm.nb(formula = TARGET ~ 1 + VolatileAcidity + Chlorides + FreeSulfurDioxide + 
##     TotalSulfurDioxide + pH + Sulphates + LabelAppeal + AcidIndex + 
##     STARS, data = ., init.theta = 52851.92895, link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7821  -0.3463   0.1144   0.3702   1.3945  
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         0.956417   0.163722   5.842 5.17e-09 ***
## VolatileAcidity    -0.116464   0.026142  -4.455 8.38e-06 ***
## Chlorides          -0.114019   0.052053  -2.190 0.028491 *  
## FreeSulfurDioxide   0.014240   0.004190   3.399 0.000677 ***
## TotalSulfurDioxide  0.010643   0.003837   2.774 0.005536 ** 
## pH                 -0.081187   0.047520  -1.708 0.087548 .  
## Sulphates          -0.060223   0.023680  -2.543 0.010985 *  
## LabelAppeal         0.046602   0.023306   2.000 0.045548 *  
## AcidIndex          -0.799153   0.062611 -12.764  < 2e-16 ***
## STARS               0.986068   0.030149  32.707  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(52851.93) family taken to be 1)
## 
##     Null deviance: 8120.9  on 12794  degrees of freedom
## Residual deviance: 6491.5  on 12785  degrees of freedom
## AIC: 30175
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  52852 
##           Std. Err.:  61681 
## Warning while fitting theta: iteration limit reached 
## 
##  2 x log-likelihood:  -30152.53
  • The log transformation resulted in a much lower AIC at 30191
  • Many of the variables that were highly significant previously, became less significant
  • The highest positive coefficient is STARS, aligning with conventional wisdom
  • The lowest negative coefficient is AcidIndex

Multiple Linear Regression

For multiple linear regression, we can use the leaps package to return the best subset

## Subset selection object
## Call: regsubsets.formula(TARGET ~ ., data = imputed$train, method = "exhaustive", 
##     nvmax = NULL, nbest = 1)
## 14 Variables  (and intercept)
##                    Forced in Forced out
## FixedAcidity           FALSE      FALSE
## VolatileAcidity        FALSE      FALSE
## CitricAcid             FALSE      FALSE
## ResidualSugar          FALSE      FALSE
## Chlorides              FALSE      FALSE
## FreeSulfurDioxide      FALSE      FALSE
## TotalSulfurDioxide     FALSE      FALSE
## Density                FALSE      FALSE
## pH                     FALSE      FALSE
## Sulphates              FALSE      FALSE
## Alcohol                FALSE      FALSE
## LabelAppeal            FALSE      FALSE
## AcidIndex              FALSE      FALSE
## STARS                  FALSE      FALSE
## 1 subsets of each size up to 14
## Selection Algorithm: exhaustive
##           FixedAcidity VolatileAcidity CitricAcid ResidualSugar Chlorides
## 1  ( 1 )  " "          " "             " "        " "           " "      
## 2  ( 1 )  " "          " "             " "        " "           " "      
## 3  ( 1 )  " "          " "             " "        " "           " "      
## 4  ( 1 )  " "          "*"             " "        " "           " "      
## 5  ( 1 )  " "          "*"             " "        " "           "*"      
## 6  ( 1 )  " "          "*"             " "        " "           "*"      
## 7  ( 1 )  " "          "*"             " "        " "           "*"      
## 8  ( 1 )  " "          "*"             " "        " "           "*"      
## 9  ( 1 )  " "          "*"             " "        " "           "*"      
## 10  ( 1 ) " "          "*"             "*"        " "           "*"      
## 11  ( 1 ) " "          "*"             "*"        " "           "*"      
## 12  ( 1 ) " "          "*"             "*"        " "           "*"      
## 13  ( 1 ) "*"          "*"             "*"        " "           "*"      
## 14  ( 1 ) "*"          "*"             "*"        "*"           "*"      
##           FreeSulfurDioxide TotalSulfurDioxide Density pH  Sulphates
## 1  ( 1 )  " "               " "                " "     " " " "      
## 2  ( 1 )  " "               " "                " "     " " " "      
## 3  ( 1 )  " "               " "                " "     " " " "      
## 4  ( 1 )  " "               " "                " "     " " " "      
## 5  ( 1 )  " "               " "                " "     " " " "      
## 6  ( 1 )  "*"               " "                " "     " " " "      
## 7  ( 1 )  "*"               "*"                " "     " " " "      
## 8  ( 1 )  "*"               "*"                " "     " " "*"      
## 9  ( 1 )  "*"               "*"                " "     " " "*"      
## 10  ( 1 ) "*"               "*"                " "     " " "*"      
## 11  ( 1 ) "*"               "*"                " "     "*" "*"      
## 12  ( 1 ) "*"               "*"                "*"     "*" "*"      
## 13  ( 1 ) "*"               "*"                "*"     "*" "*"      
## 14  ( 1 ) "*"               "*"                "*"     "*" "*"      
##           Alcohol LabelAppeal AcidIndex STARS
## 1  ( 1 )  " "     " "         " "       "*"  
## 2  ( 1 )  " "     " "         "*"       "*"  
## 3  ( 1 )  " "     "*"         "*"       "*"  
## 4  ( 1 )  " "     "*"         "*"       "*"  
## 5  ( 1 )  " "     "*"         "*"       "*"  
## 6  ( 1 )  " "     "*"         "*"       "*"  
## 7  ( 1 )  " "     "*"         "*"       "*"  
## 8  ( 1 )  " "     "*"         "*"       "*"  
## 9  ( 1 )  "*"     "*"         "*"       "*"  
## 10  ( 1 ) "*"     "*"         "*"       "*"  
## 11  ( 1 ) "*"     "*"         "*"       "*"  
## 12  ( 1 ) "*"     "*"         "*"       "*"  
## 13  ( 1 ) "*"     "*"         "*"       "*"  
## 14  ( 1 ) "*"     "*"         "*"       "*"

Determine Best Subset

  • From the Cp plot, 12 variables is the best subset
  • These variables are
##        (Intercept)       FixedAcidity    VolatileAcidity 
##               TRUE              FALSE               TRUE 
##         CitricAcid      ResidualSugar          Chlorides 
##               TRUE              FALSE               TRUE 
##  FreeSulfurDioxide TotalSulfurDioxide            Density 
##               TRUE               TRUE               TRUE 
##                 pH          Sulphates            Alcohol 
##               TRUE               TRUE               TRUE 
##        LabelAppeal          AcidIndex              STARS 
##               TRUE               TRUE               TRUE

Model 5: 12 Variables

## 
## Call:
## lm(formula = TARGET ~ . - ResidualSugar - FixedAcidity, data = imputed$train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.3425 -1.1117  0.1344  1.0604  4.6378 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         3.465e+00  4.963e-01   6.983 3.04e-12 ***
## VolatileAcidity    -1.884e-01  2.314e-02  -8.145 4.15e-16 ***
## CitricAcid          5.612e-02  2.167e-02   2.590 0.009603 ** 
## Chlorides          -2.913e-01  6.239e-02  -4.669 3.06e-06 ***
## FreeSulfurDioxide   5.854e-04  1.334e-04   4.387 1.16e-05 ***
## TotalSulfurDioxide  3.140e-04  7.816e-05   4.017 5.92e-05 ***
## Density            -1.019e+00  4.876e-01  -2.089 0.036706 *  
## pH                 -4.204e-02  1.934e-02  -2.174 0.029731 *  
## Sulphates          -7.560e-02  1.960e-02  -3.858 0.000115 ***
## Alcohol             1.227e-02  3.619e-03   3.390 0.000702 ***
## LabelAppeal         5.395e-01  2.484e-02  21.721  < 2e-16 ***
## AcidIndex          -2.443e-01  9.967e-03 -24.510  < 2e-16 ***
## STARS               1.249e+00  1.641e-02  76.117  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.461 on 12782 degrees of freedom
## Multiple R-squared:  0.4254, Adjusted R-squared:  0.4249 
## F-statistic: 788.6 on 12 and 12782 DF,  p-value: < 2.2e-16
  • All of the variables in Model 5 are significant at the 5% level
  • The most significant predictors are VolatileAcidity, STARS, AcidIndex, LabelAppeal, Alcohol, Sulphates, TotalSulfurDioxide, FreeSulfurDioxide, Chlorides
  • FreeSulfurDioxide has the highest positive impact on TARGET, contrary to the belief about STARS and LabelAppeal
  • Sulphates has the highest negative impacto on TARGET
  • The adjusted R^2 is .4219 leaving room for improvement

Model 6: All Variables

## 
## Call:
## lm(formula = TARGET ~ ., data = imputed$train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.3452 -1.1145  0.1333  1.0606  4.6370 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         3.4698486  0.4963990   6.990 2.88e-12 ***
## FixedAcidity       -0.0011419  0.0024425  -0.468 0.640130    
## VolatileAcidity    -0.1883262  0.0231386  -8.139 4.35e-16 ***
## CitricAcid          0.0559966  0.0216699   2.584 0.009775 ** 
## ResidualSugar      -0.0001492  0.0005963  -0.250 0.802467    
## Chlorides          -0.2917076  0.0624020  -4.675 2.97e-06 ***
## FreeSulfurDioxide   0.0005864  0.0001335   4.394 1.12e-05 ***
## TotalSulfurDioxide  0.0003142  0.0000782   4.018 5.90e-05 ***
## Density            -1.0193362  0.4875970  -2.091 0.036590 *  
## pH                 -0.0419544  0.0193413  -2.169 0.030089 *  
## Sulphates          -0.0753582  0.0196068  -3.843 0.000122 ***
## Alcohol             0.0122428  0.0036201   3.382 0.000722 ***
## LabelAppeal         0.5394189  0.0248408  21.715  < 2e-16 ***
## AcidIndex          -0.2434396  0.0101406 -24.006  < 2e-16 ***
## STARS               1.2491617  0.0164156  76.096  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.461 on 12780 degrees of freedom
## Multiple R-squared:  0.4254, Adjusted R-squared:  0.4248 
## F-statistic: 675.8 on 14 and 12780 DF,  p-value: < 2.2e-16
  • The full model reintroduces insignificant variables
  • FreeSulfurDioxide has the highest positive impact on TARGET, contract to the belief about STARS and LabelAppeal
  • Sulphates have the highest negative impact on TARGET
  • The adjusted R^2 is .4218, lower than the subsetted model

Select Models

The model we will use for prediction is Model 3. It had the highest AIC, all variables were significant, and the coefficients aligned with conventional thinking

Model 3

  • The histogram of the residuals do not show a normal distribution
  • The qqplot shows a fairly linear relationship with the tails of the plot venturing away from the line
  • The residual plot does not display contant variance

Test Model

Code Appendix

knitr::opts_chunk$set(echo = FALSE)
knitr::opts_chunk$set(tidy = TRUE)
knitr::opts_chunk$set(warning = FALSE)

loadPkg <- function(x) {
    if (!require(x, character.only = T)) 
        install.packages(x, dependencies = T)
    require(x, character.only = T)
}

libs <- c("knitr", "magrittr", "data.table", "kableExtra", "caret", "pROC", 
    "missForest", "zoo", "ISLR", "leaps", "fBasics", "reshape2", "GGally", "gridExtra", 
    "ROCR", "dummies", "pscl", "bestglm", "glmulti", "meifly", "MASS", "tidyverse")

lapply(libs, loadPkg)
train <- fread("https://raw.githubusercontent.com/baroncurtin2/data621/master/week5/wine-training-data.csv") %>% 
    as_tibble()
test <- fread("https://raw.githubusercontent.com/baroncurtin2/data621/master/week5/wine-evaluation-data.csv") %>% 
    as_tibble() %>% rename(INDEX = IN)

# list of datasets
data <- list(train = train, test = test)
data_frame(exp_vars = names(train)) %>% filter(!exp_vars %in% c("TARGET", "INDEX")) %>% 
    arrange(exp_vars) %>% mutate(tag = if_else(exp_vars %in% c("STARS", "LabelAppeal"), 
    "Non-Chemical", "Chemical"))
# remove INDEX from datasets
data %<>% map(function(df) {
    df %<>% dplyr::select(-INDEX)
    
    return(df)
})
glimpse(data$train)
data$train %>% sapply(typeof) %>% as.data.frame() %>% rownames_to_column(var = "variable") %>% 
    rename(vartype = 2) %>% group_by(vartype) %>% summarise(count = n())
summary(data$train)
stats <- data$train %>% basicStats() %>% as_tibble() %>% rownames_to_column() %>% 
    gather(var, value, -rowname) %>% spread(rowname, value) %>% dplyr::rename_all(str_to_lower) %>% 
    dplyr::rename_all(str_trim) %>% dplyr::rename(variables = "var", q1 = `1. quartile`, 
    q3 = `3. quartile`, max = maximum, min = minimum, na_vals = nas, n = nobs, 
    sd = stdev, var = variance) %>% mutate(obs = n - na_vals, range = max - 
    min, iqr = q3 - q1) %>% dplyr::select(variables, n, na_vals, obs, mean, 
    min, q1, median, q3, max, sd, var, range, iqr, skewness, kurtosis) %>% print
stats %>% filter(min < 0) %>% select(variables)
# helper function
setToZero <- function(x) {
    if (is.double(x)) {
        if_else(!is.na(x) & x < 0, as.double(0), as.double(x))
    } else {
        if_else(!is.na(x) & x < 0, 0, x)
    }
}

data %<>% map(function(df) {
    # vars with negatives
    negs <- stats %>% filter(min < 0) %>% select(variables)
    
    # mutate values below 0 to be 0
    df %<>% mutate_all(as.numeric) %>% mutate_at(negs[["variables"]], setToZero)
    
    # return df
    return(df)
})
data$train %>% cor(use = "na.or.complete") %>% as.data.frame() %>% rownames_to_column(var = "exp_var") %>% 
    as_data_frame() %>% select(exp_var, TARGET) %>% filter(!exp_var %in% c("TARGET", 
    "INDEX")) %>% arrange(desc(TARGET))
ggcorr(data$train, palette = "RdBu", label = T, geom = "tile", size = 2)
vis <- data$train %>% melt()
ggplot(vis, aes(value)) + geom_density(fill = "skyblue") + facet_wrap(~variable, 
    scales = "free")
data$train %>% gather(key = "variable", value = "value") %>% ggplot(aes(value)) + 
    geom_histogram(fill = "red", binwidth = function(x) 2 * IQR(x)/(length(x)^(1/3))) + 
    facet_wrap(~variable, scales = "free")
ggplot(vis, aes(x = 1, y = value)) + geom_boxplot(show.legend = T) + stat_summary(fun.y = mean, 
    color = "red", geom = "point", shape = 18, size = 3) + facet_wrap(~variable, 
    scales = "free") + coord_flip()
negs <- stats %>% filter(min < 0) %>% select(variables) %>% print
forests <- data %>% map(function(df) {
    df %<>% as.data.frame() %>% missForest()
    
    # return df
    return(df)
})
imputed <- forests %>% map(function(x) {
    return(x$ximp)
})
prm <- glmulti(TARGET ~ ., data = imputed$train, level = 1, method = "h", confsetsize = 5, 
    plotty = F, report = F, fitfunction = "glm", family = "poisson")
prm@formulas
m1 <- glm(TARGET ~ 1 + VolatileAcidity + CitricAcid + Chlorides + FreeSulfurDioxide + 
    TotalSulfurDioxide + Density + pH + Sulphates + Alcohol + LabelAppeal + 
    AcidIndex + STARS, data = imputed$train, family = "poisson")
summary(m1)
m2 <- glm(TARGET ~ 1 + ., data = imputed$train, family = "poisson")
summary(m2)
nbr <- glmulti(TARGET ~ ., data = imputed$train, level = 1, method = "h", confsetsize = 5, 
    plotty = F, report = F, family = negative.binomial(1))
nbr@formulas
m3 <- glm.nb(TARGET ~ 1 + VolatileAcidity + Chlorides + FreeSulfurDioxide + 
    TotalSulfurDioxide + pH + Sulphates + LabelAppeal + AcidIndex + STARS, data = imputed$train)
summary(m3)
m4 <- imputed$train %>% mutate_all(~log(1 + .x)) %>% glm.nb(TARGET ~ 1 + VolatileAcidity + 
    Chlorides + FreeSulfurDioxide + TotalSulfurDioxide + pH + Sulphates + LabelAppeal + 
    AcidIndex + STARS, data = .)
summary(m4)
mlr <- regsubsets(TARGET ~ ., data = imputed$train, method = "exhaustive", nvmax = NULL, 
    nbest = 1)
mlr.summary <- summary(mlr)
print(mlr.summary)
# determine best subset
plot(mlr.summary$cp, xlab = "Number of Variables", ylab = "Cp")
points(which.min(mlr.summary$cp), mlr.summary$cp[which.min(mlr.summary$cp)], 
    pch = 20, col = "red")

# cp plot par(mfrow=c(1,2))
plot(mlr, scale = "Cp", main = "Cp")

# r^2 splot
plot(mlr, scale = "adjr2", main = "Adjusted R^2")
(summary(mlr))$which[12, ]
m5 <- lm(TARGET ~ . - ResidualSugar - FixedAcidity, data = imputed$train)
summary(m5)
m6 <- lm(TARGET ~ ., data = imputed$train)
summary(m6)
par(mfrow = c(2, 2))

plot(m3)
hist(m3$residuals)
qqnorm(m3$residuals)
qqline(m3$residuals)
m3_results <- predict(m3, newdata = imputed$test)
test.final <- imputed$test %>% bind_cols(data_frame(predicted = m3_results))

m3.training <- predict(m3, newdata = imputed$train)
training.final <- imputed$train %>% bind_cols(data_frame(predicted = m3.training)) %>% 
    mutate(log_target = log(1 + TARGET))