CUNY MSDS Data 621 - HW 5

Nicholas Schettini

July 9, 2018

Work in Progress

Cover Page

CUNY MSDS HW5 -

Nicholas Schettini

CUNY School of Professional Studies

Abstract

In this research assignment, we investigated data on the number of wine cases sold. The data contains one response variable: TARGET. The explanatory variables in this dataset are: AcidIndex, Alcohol, Chlorides, CitricAcid, Density, FixedAcidity, FreeSulfurDioxide, LabelAppeal, ResidualSugar, STARS, Sulphates, TotalSulfurDioxide, VolatileAcidity, and pH. The data consists of 12,795 observations and 14 explanatory variables. The research was organized into four stages: data exploration, data preparation, model building, and model selection. The data was visualized using several methods, including histograms and boxplots. The data was prepared by imputing NA values with the mice package in R. Several models were built using different approaches (for example, Poisson and zero-inflated models), and finally the best model was selected. The research shows that certain variables in the dataset were better predictors than others.

Overview

In this homework assignment, you will explore, analyze and model a data set containing information on approximately 12,000 commercially available wines. The variables are mostly related to the chemical properties of the wine being sold. The response variable is the number of sample cases of wine that were purchased by wine distribution companies after sampling a wine. These cases would be used to provide tasting samples to restaurants and wine stores around the United States. The more sample cases purchased, the more likely is a wine to be sold at a high end restaurant. A large wine manufacturer is studying the data in order to predict the number of wine cases ordered based upon the wine characteristics. If the wine manufacturer can predict the number of cases, then that manufacturer will be able to adjust their wine offering to maximize sales.

Your objective is to build a count regression model to predict the number of cases of wine that will be sold given certain properties of the wine. HINT: Sometimes, the fact that a variable is missing is actually predictive of the target. You can only use the variables given to you (or variables that you derive from the variables provided). Below is a short description of the variables of interest in the data set:

Data Exploration

The summary below shows missing values in many of the variables in the wine dataset. The TARGET variable (the number of wine cases sold) appears to be discrete rather than continuous.
vars n mean sd median trimmed mad min max range skew kurtosis se na_count
INDEX 1 12795 8069.9803048 4656.9051071 8110.00000 8071.0294031 5977.8432000 1.00000 16129.00000 1.6128e+04 -0.0032496 -1.2005027 41.1696565 0
TARGET 2 12795 3.0290739 1.9263682 3.00000 3.0538244 1.4826000 0.00000 8.00000 8.0000e+00 -0.3263010 -0.8772457 0.0170302 0
FixedAcidity 3 12795 7.0757171 6.3176435 6.90000 7.0736739 3.2617200 -18.10000 34.40000 5.2500e+01 -0.0225860 1.6749987 0.0558515 0
VolatileAcidity 4 12795 0.3241039 0.7840142 0.28000 0.3243890 0.4299540 -2.79000 3.68000 6.4700e+00 0.0203800 1.8322106 0.0069311 0
CitricAcid 5 12795 0.3084127 0.8620798 0.31000 0.3102520 0.4151280 -3.24000 3.86000 7.1000e+00 -0.0503070 1.8379401 0.0076213 0
ResidualSugar 6 12179 5.4187331 33.7493790 3.90000 5.5800410 15.7155600 -127.80000 141.15000 2.6895e+02 -0.0531229 1.8846917 0.3058158 616
Chlorides 7 12157 0.0548225 0.3184673 0.04600 0.0540159 0.1349166 -1.17100 1.35100 2.5220e+00 0.0304272 1.7886044 0.0028884 638
FreeSulfurDioxide 8 12148 30.8455713 148.7145577 30.00000 30.9334877 56.3388000 -555.00000 623.00000 1.1780e+03 0.0063930 1.8364966 1.3492769 647
TotalSulfurDioxide 9 12113 120.7142326 231.9132105 123.00000 120.8895367 134.9166000 -823.00000 1057.00000 1.8800e+03 -0.0071794 1.6746665 2.1071703 682
Density 10 12795 0.9942027 0.0265376 0.99449 0.9942130 0.0093552 0.88809 1.09924 2.1115e-01 -0.0186938 1.8999592 0.0002346 0
pH 11 12400 3.2076282 0.6796871 3.20000 3.2055706 0.3854760 0.48000 6.13000 5.6500e+00 0.0442880 1.6462681 0.0061038 395
Sulphates 12 11585 0.5271118 0.9321293 0.50000 0.5271453 0.4447800 -3.13000 4.24000 7.3700e+00 0.0059119 1.7525655 0.0086602 1210
Alcohol 13 12142 10.4892363 3.7278190 10.40000 10.5018255 2.3721600 -4.70000 26.50000 3.1200e+01 -0.0307158 1.5394949 0.0338306 653
LabelAppeal 14 12795 -0.0090660 0.8910892 0.00000 -0.0099639 1.4826000 -2.00000 2.00000 4.0000e+00 0.0084295 -0.2622916 0.0078777 0
AcidIndex 15 12795 7.7727237 1.3239264 8.00000 7.6431572 1.4826000 4.00000 17.00000 1.3000e+01 1.6484959 5.1900925 0.0117043 0
STARS 16 9436 2.0417550 0.9025400 2.00000 1.9711258 1.4826000 1.00000 4.00000 3.0000e+00 0.4472353 -0.6925343 0.0092912 3359
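
The na_count column in the summary can be reproduced with base R. The sketch below uses a small made-up data frame (wine_demo is a hypothetical name, not the assignment data) and only shows the general idea of counting NA values per column.

```r
# Hypothetical toy data standing in for the wine training set.
wine_demo <- data.frame(
  TARGET = c(3, 0, 4, 2, 0),
  pH     = c(3.2, NA, 3.1, 3.4, NA),
  STARS  = c(2, NA, NA, 1, NA)
)

# Count missing values per column, then express them as a share of all rows.
na_count <- colSums(is.na(wine_demo))
na_share <- round(na_count / nrow(wine_demo), 2)

print(data.frame(na_count, na_share))
```

Applied to the figures in the summary above, STARS has 3,359 missing values out of 12,795 rows, which is roughly 26% of the data.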

Visual Exploration

Boxplots

The below boxplots show all of the variables listed in the dataset. This visualization will assist in showing how the data is spread for each variable.

The boxplots show

The target variable, number of cases, is shown below. The data shows a large number of zero values.

The distribution resembles a Poisson distribution, but with a large number of zero values.
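
To judge whether the number of zeros is unusual, it helps to know how many zeros a plain Poisson distribution with the same mean would produce. The sketch below is a rough check that uses only the row count and the TARGET mean from the summary table. It ignores the explanatory variables entirely, so it is a first-pass comparison rather than a formal test.

```r
# Values taken from the summary table in this report.
n_obs       <- 12795
target_mean <- 3.0290739

# Under a plain Poisson model with this mean, P(Y = 0) = exp(-lambda).
p_zero_poisson <- dpois(0, lambda = target_mean)
expected_zeros <- n_obs * p_zero_poisson

cat("Expected share of zeros:", round(p_zero_poisson, 4), "\n")
cat("Expected number of zeros:", round(expected_zeros), "\n")
```

A plain Poisson with this mean would put a little under 5% of the observations at zero. If the observed share of zeros in TARGET is well above that, a zero-inflated model is worth trying.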

## Warning: Removed 3359 rows containing non-finite values (stat_count).

AcidIndex looks more like a Poisson distribution, with a slight right skew. LabelAppeal and STARS seem to be more categorical.

## Warning: Removed 4841 rows containing non-finite values (stat_bin).

The other variables seem to be more normally distributed with high kurtosis.

Correlation

The correlation plot below shows how variables in the dataset are related to each other. Looking at the plot, we can see that certain variables are more related than others.

For this project, it makes sense to break down the correlation by target - since that’s what we’re trying to predict.
x
INDEX 0.0314911
TARGET 0.4979465
FixedAcidity 0.0113760
VolatileAcidity -0.0202420
CitricAcid 0.0153316
ResidualSugar -0.0045793
Chlorides -0.0063870
FreeSulfurDioxide 0.0149601
TotalSulfurDioxide -0.0027237
Density -0.0180944
pH 0.0002182
Sulphates 0.0037687
Alcohol -0.0006449
LabelAppeal 1.0000000
AcidIndex 0.0103010
STARS 0.3188970

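Looking at the correlations, very few variables look correlated at all. Note that LabelAppeal has a value of 1 in the column above, which indicates that the column holds correlations with LabelAppeal rather than with TARGET. Read that way, TARGET (0.50) and STARS (0.32) show a moderate positive correlation with LabelAppeal, while every chemical property is close to zero. A negative relationship between AcidIndex and TARGET is not visible in this column, but it does appear as a negative AcidIndex coefficient in the models below.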
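
As a sketch of how the correlations with TARGET specifically could be pulled out, the code below builds a correlation matrix and keeps only the TARGET column. The data frame demo is simulated for illustration (the column names mirror the report, the values are made up), and pairwise deletion is an assumption about how missing values would be handled.

```r
set.seed(621)

# Simulated stand-in for the wine data (column names mirror the report,
# values are made up).
n <- 200
demo <- data.frame(
  STARS       = sample(c(1:4, NA), n, replace = TRUE),
  LabelAppeal = sample(-2:2, n, replace = TRUE),
  AcidIndex   = sample(5:12, n, replace = TRUE)
)
demo$TARGET <- pmax(0, round(1 + ifelse(is.na(demo$STARS), 0, demo$STARS) +
                             0.5 * demo$LabelAppeal -
                             0.2 * (demo$AcidIndex - 8)))

# Full correlation matrix; pairwise deletion keeps the rows where both
# variables in a pair are observed.
cor_mat <- cor(demo, use = "pairwise.complete.obs")

# Keep only the TARGET column and sort it, so the self-correlation of 1
# sits on the TARGET row.
cor_with_target <- sort(cor_mat[, "TARGET"], decreasing = TRUE)
print(round(cor_with_target, 3))
```

A quick sanity check on any such table is that the row with a correlation of exactly 1 should be the variable being predicted.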

Missing Values

According to the graph, multiple variables have missing values. The STARS variable has the most NA values. These missing values will be imputed later, during data preparation, using the mice package.
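
The assignment hint notes that missingness itself can be predictive of the target. This report does not pursue that idea, but one way it could be explored is to derive an indicator column before imputation. The sketch below uses toy data, and STARS_missing is a hypothetical column name that does not appear in the original dataset.

```r
# Toy data; values are made up for illustration.
demo <- data.frame(
  TARGET = c(0, 0, 4, 3, 0, 5, 2, 0),
  STARS  = c(NA, NA, 3, 2, NA, 4, 1, 2)
)

# Derived indicator: 1 when STARS was missing, 0 otherwise.
demo$STARS_missing <- as.integer(is.na(demo$STARS))

# Compare the average TARGET in the two groups.
mean_by_flag <- tapply(demo$TARGET, demo$STARS_missing, mean)
print(mean_by_flag)
```

The flag has to be created before imputation, because the imputed data no longer records which values were originally missing.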

Data Prep

Imputation of Missing (NA) values

The data exploration revealed multiple variables with numerous NA values. There are multiple ways to handle NA data: deleting the observations, deleting the variables, imputation with the mean/median/mode, or imputation with a prediction.

Imputation with the mean/median/mode is an easy way to fill in missing values; however, it reduces the variance in the dataset and shrinks standard errors, which can invalidate hypothesis tests.
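
The variance reduction from mean imputation can be seen on simulated data. The sketch below is only an illustration: the variable is randomly generated and the 30% missing rate is an arbitrary choice.

```r
set.seed(621)

# Simulated variable with 30% of the values knocked out at random.
x_full <- rnorm(1000, mean = 10, sd = 4)
x_obs  <- x_full
x_obs[sample(length(x_obs), 300)] <- NA

# Mean imputation: every NA gets the same observed mean.
x_imp <- ifelse(is.na(x_obs), mean(x_obs, na.rm = TRUE), x_obs)

sd_observed <- sd(x_obs, na.rm = TRUE)
sd_imputed  <- sd(x_imp)

cat("SD of observed values:   ", round(sd_observed, 3), "\n")
cat("SD after mean imputation:", round(sd_imputed, 3), "\n")
```

Every imputed value sits exactly at the mean, so the sum of squared deviations stays the same while the sample size grows from 700 to 1000. The standard deviation therefore shrinks by a factor of sqrt(699/999), roughly 0.84.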

In this case, the data will be imputed via prediction using the mice (Multivariate Imputation by Chained Equations) library with a random forest prediction method.

Since the data has many missing values across multiple variables, the MICE algorithm takes some computing time.

vars n mean sd median trimmed mad min max range skew kurtosis se
TARGET 1 12795 3.0290739 1.9263682 3.00000 3.0538244 1.4826000 0.00000 8.00000 8.00000 -0.3263010 -0.8772457 0.0170302
FixedAcidity 2 12795 7.0757171 6.3176435 6.90000 7.0736739 3.2617200 -18.10000 34.40000 52.50000 -0.0225860 1.6749987 0.0558515
VolatileAcidity 3 12795 0.3241039 0.7840142 0.28000 0.3243890 0.4299540 -2.79000 3.68000 6.47000 0.0203800 1.8322106 0.0069311
CitricAcid 4 12795 0.3084127 0.8620798 0.31000 0.3102520 0.4151280 -3.24000 3.86000 7.10000 -0.0503070 1.8379401 0.0076213
ResidualSugar 5 12795 5.4560688 33.5479209 3.80000 5.5931328 15.5673000 -127.80000 141.15000 268.95000 -0.0418962 1.9063668 0.2965825
Chlorides 6 12795 0.0539703 0.3159624 0.04600 0.0533848 0.1275036 -1.17100 1.35100 2.52200 0.0204250 1.8584778 0.0027933
FreeSulfurDioxide 7 12795 31.2710434 148.0337336 30.00000 31.4126209 53.3736000 -555.00000 623.00000 1178.00000 0.0016334 1.8837818 1.3087013
TotalSulfurDioxide 8 12795 120.3816335 230.8142427 124.00000 120.7789880 133.4340000 -823.00000 1057.00000 1880.00000 -0.0177176 1.6983299 2.0405275
Density 9 12795 0.9942027 0.0265376 0.99449 0.9942130 0.0093552 0.88809 1.09924 0.21115 -0.0186938 1.8999592 0.0002346
pH 10 12795 3.2073834 0.6769933 3.20000 3.2054889 0.3854760 0.48000 6.13000 5.65000 0.0426209 1.6611990 0.0059850
Sulphates 11 12795 0.5277061 0.9207721 0.50000 0.5272687 0.4003020 -3.13000 4.24000 7.37000 0.0114220 1.8670775 0.0081401
Alcohol 12 12795 10.4818189 3.7032024 10.40000 10.4915128 2.3721600 -4.70000 26.50000 31.20000 -0.0199093 1.5761104 0.0327384
LabelAppeal 13 12795 -0.0090660 0.8910892 0.00000 -0.0099639 1.4826000 -2.00000 2.00000 4.00000 0.0084295 -0.2622916 0.0078777
AcidIndex 14 12795 7.7727237 1.3239264 8.00000 7.6431572 1.4826000 4.00000 17.00000 13.00000 1.6484959 5.1900925 0.0117043
STARS 15 12795 1.9802267 0.8855040 2.00000 1.9059295 1.4826000 1.00000 4.00000 3.00000 0.5180282 -0.5978441 0.0078284

Absolute value of variables

Classmates have discussed taking the absolute value of the variables in the dataset, because the negative values in multiple variables have been a point of debate.

In this case I will apply an absolute value transformation to the data and refit the top-performing model on it.

It seems, however, that taking the absolute value introduces a right skew where the variable would otherwise have been approximately normal.

If this data is then log-transformed, it seems to become ‘more’ normal, but this might be introducing overfitting into the data?

## Warning: Transformation introduced infinite values in continuous x-axis
## Warning: Removed 5804 rows containing non-finite values (stat_bin).
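
The warnings above are consistent with the log being applied to values that are exactly zero, since log(0) is -Inf. The sketch below shows this on a few toy values. Using log1p() as an alternative is a suggestion of this sketch, not something that was done in the analysis.

```r
# Toy values similar in spirit to the wine data: negatives and an exact zero.
x <- c(-2.5, -0.3, 0, 0.4, 1.8)

abs_x <- abs(x)

# log(0) is -Inf, which is the kind of value that triggers a
# "non-finite values" warning when plotting on a log scale.
log_abs <- log(abs_x)

# log1p(x) computes log(1 + x) and stays finite at zero.
log1p_abs <- log1p(abs_x)

print(is.finite(log_abs))
print(is.finite(log1p_abs))
```

Rows that become non-finite are silently dropped from the histogram, so the plotted distribution may look more normal than the full data actually is.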

Build Models

Throughout this section, various models will be created to determine which gives the best “fit” for predicting the number of wine cases sold, as given by the dataset. In this assignment, I will try various models, such as linear, negative binomial, and Poisson models, as suggested by the homework instructions.

Model 1 - Poisson with imputed data

As per the homework videos, the Poisson distribution works well with count data.

## 
## Call:
## glm(formula = TARGET ~ ., family = poisson, data = imputed)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.9784  -0.5298   0.2051   0.6296   2.5442  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         2.035e+00  1.956e-01  10.403  < 2e-16 ***
## FixedAcidity       -2.602e-04  8.201e-04  -0.317 0.751045    
## VolatileAcidity    -5.162e-02  6.494e-03  -7.949 1.87e-15 ***
## CitricAcid          1.431e-02  5.891e-03   2.429 0.015134 *  
## ResidualSugar       1.163e-04  1.517e-04   0.767 0.443001    
## Chlorides          -5.287e-02  1.615e-02  -3.272 0.001066 ** 
## FreeSulfurDioxide   1.405e-04  3.442e-05   4.081 4.48e-05 ***
## TotalSulfurDioxide  9.744e-05  2.215e-05   4.399 1.09e-05 ***
## Density            -4.109e-01  1.921e-01  -2.139 0.032433 *  
## pH                 -2.290e-02  7.550e-03  -3.033 0.002420 ** 
## Sulphates          -1.939e-02  5.520e-03  -3.513 0.000443 ***
## Alcohol             5.155e-03  1.382e-03   3.731 0.000190 ***
## LabelAppeal         1.937e-01  6.022e-03  32.158  < 2e-16 ***
## AcidIndex          -1.217e-01  4.463e-03 -27.259  < 2e-16 ***
## STARS               2.027e-01  5.788e-03  35.027  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 22861  on 12794  degrees of freedom
## Residual deviance: 18351  on 12780  degrees of freedom
## AIC: 50323
## 
## Number of Fisher Scoring iterations: 5
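
The coefficients in the Poisson summary are on the log scale, so they are easier to read after exponentiating. The sketch below does this for the three strongest predictors, with the estimates copied by hand from the Model 1 output above.

```r
# Coefficients copied from the Model 1 summary above (log scale).
coefs <- c(STARS = 2.027e-01, LabelAppeal = 1.937e-01, AcidIndex = -1.217e-01)

# A Poisson GLM with a log link models log(mean count), so exp(coefficient)
# is the multiplicative change in the expected count per one-unit increase.
rate_ratios <- exp(coefs)
pct_change  <- 100 * (rate_ratios - 1)

print(round(rate_ratios, 3))
print(round(pct_change, 1))
```

Holding the other variables fixed, each additional star is associated with roughly a 22% higher expected number of cases, each LabelAppeal step with about 21% more, and each AcidIndex unit with about 11% fewer.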

Model 2 - Poisson without imputed data

## 
## Call:
## glm(formula = TARGET ~ ., family = poisson, data = wine_train1)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2158  -0.2734   0.0616   0.3732   1.6830  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         1.593e+00  2.506e-01   6.359 2.03e-10 ***
## FixedAcidity        3.293e-04  1.053e-03   0.313  0.75447    
## VolatileAcidity    -2.560e-02  8.353e-03  -3.065  0.00218 ** 
## CitricAcid         -7.259e-04  7.575e-03  -0.096  0.92365    
## ResidualSugar      -6.141e-05  1.941e-04  -0.316  0.75165    
## Chlorides          -3.007e-02  2.056e-02  -1.463  0.14346    
## FreeSulfurDioxide   6.734e-05  4.404e-05   1.529  0.12620    
## TotalSulfurDioxide  2.081e-05  2.855e-05   0.729  0.46618    
## Density            -3.725e-01  2.462e-01  -1.513  0.13026    
## pH                 -4.661e-03  9.598e-03  -0.486  0.62722    
## Sulphates          -5.164e-03  7.051e-03  -0.732  0.46398    
## Alcohol             3.948e-03  1.771e-03   2.229  0.02579 *  
## LabelAppeal         1.771e-01  7.954e-03  22.271  < 2e-16 ***
## AcidIndex          -4.870e-02  5.903e-03  -8.251  < 2e-16 ***
## STARS               1.871e-01  7.487e-03  24.993  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 5844.1  on 6435  degrees of freedom
## Residual deviance: 4009.1  on 6421  degrees of freedom
##   (6359 observations deleted due to missingness)
## AIC: 23172
## 
## Number of Fisher Scoring iterations: 5

Model 3 - Negative Binomial

## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached

## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## 
## Call:
## glm.nb(formula = TARGET ~ ., data = imputed, init.theta = 38344.98616, 
##     link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.9782  -0.5297   0.2051   0.6296   2.5442  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         2.035e+00  1.956e-01  10.402  < 2e-16 ***
## FixedAcidity       -2.602e-04  8.202e-04  -0.317 0.751042    
## VolatileAcidity    -5.162e-02  6.494e-03  -7.949 1.88e-15 ***
## CitricAcid          1.431e-02  5.891e-03   2.429 0.015139 *  
## ResidualSugar       1.163e-04  1.517e-04   0.767 0.442988    
## Chlorides          -5.287e-02  1.616e-02  -3.272 0.001066 ** 
## FreeSulfurDioxide   1.405e-04  3.442e-05   4.081 4.49e-05 ***
## TotalSulfurDioxide  9.744e-05  2.215e-05   4.398 1.09e-05 ***
## Density            -4.109e-01  1.921e-01  -2.139 0.032438 *  
## pH                 -2.290e-02  7.550e-03  -3.033 0.002420 ** 
## Sulphates          -1.939e-02  5.521e-03  -3.513 0.000444 ***
## Alcohol             5.155e-03  1.382e-03   3.731 0.000191 ***
## LabelAppeal         1.937e-01  6.022e-03  32.156  < 2e-16 ***
## AcidIndex          -1.217e-01  4.464e-03 -27.258  < 2e-16 ***
## STARS               2.027e-01  5.788e-03  35.025  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(38344.99) family taken to be 1)
## 
##     Null deviance: 22860  on 12794  degrees of freedom
## Residual deviance: 18350  on 12780  degrees of freedom
## AIC: 50325
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  38345 
##           Std. Err.:  59918 
## Warning while fitting theta: iteration limit reached 
## 
##  2 x log-likelihood:  -50293.1

Model 4 - Linear Model

## 
## Call:
## lm(formula = TARGET ~ ., data = imputed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.1455 -0.7398  0.3661  1.1045  4.4181 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         5.401e+00  5.507e-01   9.807  < 2e-16 ***
## FixedAcidity       -4.122e-04  2.312e-03  -0.178 0.858533    
## VolatileAcidity    -1.580e-01  1.836e-02  -8.605  < 2e-16 ***
## CitricAcid          4.299e-02  1.671e-02   2.573 0.010107 *  
## ResidualSugar       3.892e-04  4.287e-04   0.908 0.363882    
## Chlorides          -1.675e-01  4.551e-02  -3.679 0.000235 ***
## FreeSulfurDioxide   4.209e-04  9.721e-05   4.330 1.51e-05 ***
## TotalSulfurDioxide  2.823e-04  6.236e-05   4.526 6.06e-06 ***
## Density            -1.159e+00  5.422e-01  -2.137 0.032591 *  
## pH                 -5.957e-02  2.126e-02  -2.801 0.005096 ** 
## Sulphates          -5.669e-02  1.562e-02  -3.629 0.000286 ***
## Alcohol             1.840e-02  3.892e-03   4.729 2.28e-06 ***
## LabelAppeal         5.868e-01  1.689e-02  34.752  < 2e-16 ***
## AcidIndex          -3.218e-01  1.117e-02 -28.817  < 2e-16 ***
## STARS               6.645e-01  1.708e-02  38.913  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.625 on 12780 degrees of freedom
## Multiple R-squared:  0.2895, Adjusted R-squared:  0.2887 
## F-statistic:   372 on 14 and 12780 DF,  p-value: < 2.2e-16

Model 5 - Zero inflation

## Classes and Methods for R developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University
## Simon Jackman
## hurdle and zeroinfl functions by Achim Zeileis
## 
## Call:
## zeroinfl(formula = TARGET ~ . | STARS, data = imputed, dist = "negbin")
## 
## Pearson residuals:
##     Min      1Q  Median      3Q     Max 
## -2.2585 -0.3095  0.1807  0.5111  2.2104 
## 
## Count model coefficients (negbin with log link):
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         1.647e+00  2.060e-01   7.998 1.27e-15 ***
## FixedAcidity        1.766e-04  8.540e-04   0.207  0.83613    
## VolatileAcidity    -1.959e-02  6.848e-03  -2.861  0.00423 ** 
## CitricAcid          2.508e-03  6.121e-03   0.410  0.68203    
## ResidualSugar      -3.782e-05  1.580e-04  -0.239  0.81089    
## Chlorides          -2.587e-02  1.689e-02  -1.532  0.12561    
## FreeSulfurDioxide   4.816e-05  3.521e-05   1.368  0.17136    
## TotalSulfurDioxide -6.033e-06  2.243e-05  -0.269  0.78793    
## Density            -3.043e-01  2.016e-01  -1.509  0.13131    
## pH                  1.714e-03  7.903e-03   0.217  0.82831    
## Sulphates          -3.187e-03  5.788e-03  -0.550  0.58198    
## Alcohol             6.822e-03  1.434e-03   4.758 1.96e-06 ***
## LabelAppeal         2.423e-01  6.375e-03  38.004  < 2e-16 ***
## AcidIndex          -4.319e-02  5.403e-03  -7.994 1.31e-15 ***
## STARS               9.568e-02  6.313e-03  15.155  < 2e-16 ***
## Log(theta)          1.798e+01  1.981e+00   9.080  < 2e-16 ***
## 
## Zero-inflation model coefficients (binomial with logit link):
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.33119    0.06228  -5.317 1.05e-07 ***
## STARS       -0.62932    0.03362 -18.720  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Theta = 64638340.4478 
## Number of iterations in BFGS optimization: 65 
## Log-likelihood: -2.296e+04 on 18 Df
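
The zero-inflation part of the model is a logistic regression for the probability that an observation is a structural zero. The sketch below converts the two coefficients from the summary above into probabilities for each STARS level, with the values copied by hand from that output.

```r
# Zero-inflation coefficients copied from the Model 5 summary (logit scale).
zi_intercept <- -0.33119
zi_stars     <- -0.62932

stars <- 1:4

# plogis() is the inverse logit, turning the linear predictor into the
# probability of the "always zero" component.
p_structural_zero <- plogis(zi_intercept + zi_stars * stars)

print(data.frame(STARS = stars,
                 p_structural_zero = round(p_structural_zero, 3)))
```

The estimated probability of a structural zero falls from about 28% for a one-star wine to about 5% for a four-star wine, so low-rated wines are far more likely to sell no cases at all.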

Model 6 - glmulti Package

The glmulti package is an “automated model selection and model averaging” tool. The package automatically generates all possible models “with the specified response and explanatory variables”. The tool is basically used to find the “best” model.

glmmodel <- glm(imputed$TARGET ~ 1 + VolatileAcidity + CitricAcid + Chlorides + 
    FreeSulfurDioxide + TotalSulfurDioxide + Density + pH + Sulphates + 
    Alcohol + LabelAppeal + AcidIndex + STARS, data = imputed)

summary(glmmodel)
## 
## Call:
## glm(formula = imputed$TARGET ~ 1 + VolatileAcidity + CitricAcid + 
##     Chlorides + FreeSulfurDioxide + TotalSulfurDioxide + Density + 
##     pH + Sulphates + Alcohol + LabelAppeal + AcidIndex + STARS, 
##     data = imputed)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -6.1320  -0.7369   0.3636   1.1058   4.4248  
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         5.398e+00  5.507e-01   9.803  < 2e-16 ***
## VolatileAcidity    -1.581e-01  1.836e-02  -8.613  < 2e-16 ***
## CitricAcid          4.285e-02  1.671e-02   2.564 0.010348 *  
## Chlorides          -1.677e-01  4.551e-02  -3.684 0.000231 ***
## FreeSulfurDioxide   4.223e-04  9.718e-05   4.345 1.40e-05 ***
## TotalSulfurDioxide  2.835e-04  6.234e-05   4.548 5.46e-06 ***
## Density            -1.154e+00  5.422e-01  -2.129 0.033291 *  
## pH                 -5.937e-02  2.126e-02  -2.793 0.005238 ** 
## Sulphates          -5.693e-02  1.561e-02  -3.647 0.000267 ***
## Alcohol             1.833e-02  3.891e-03   4.710 2.50e-06 ***
## LabelAppeal         5.869e-01  1.688e-02  34.757  < 2e-16 ***
## AcidIndex          -3.223e-01  1.100e-02 -29.308  < 2e-16 ***
## STARS               6.647e-01  1.707e-02  38.935  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 2.639244)
## 
##     Null deviance: 47477  on 12794  degrees of freedom
## Residual deviance: 33735  on 12782  degrees of freedom
## AIC: 48743
## 
## Number of Fisher Scoring iterations: 2
glmmodelabs <- glm(absdata$TARGET ~ 1 + VolatileAcidity + CitricAcid + Chlorides + 
    FreeSulfurDioxide + TotalSulfurDioxide + Density + pH + Sulphates + 
    Alcohol + LabelAppeal + AcidIndex + STARS, data = absdata)

summary(glmmodelabs)
## 
## Call:
## glm(formula = absdata$TARGET ~ 1 + VolatileAcidity + CitricAcid + 
##     Chlorides + FreeSulfurDioxide + TotalSulfurDioxide + Density + 
##     pH + Sulphates + Alcohol + LabelAppeal + AcidIndex + STARS, 
##     data = absdata)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -5.3054  -1.1095   0.2816   1.1696   5.8865  
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         5.113e+00  5.782e-01   8.842  < 2e-16 ***
## VolatileAcidity    -1.785e-01  2.716e-02  -6.571 5.18e-11 ***
## CitricAcid          6.520e-02  2.488e-02   2.621  0.00879 ** 
## Chlorides          -1.562e-01  6.468e-02  -2.415  0.01577 *  
## FreeSulfurDioxide   3.176e-04  1.397e-04   2.274  0.02297 *  
## TotalSulfurDioxide  2.708e-04  9.312e-05   2.908  0.00364 ** 
## Density            -1.301e+00  5.684e-01  -2.288  0.02215 *  
## pH                 -5.466e-02  2.230e-02  -2.451  0.01425 *  
## Sulphates          -6.526e-02  2.317e-02  -2.816  0.00486 ** 
## Alcohol             1.808e-02  4.187e-03   4.318 1.59e-05 ***
## LabelAppeal        -2.969e-02  2.425e-02  -1.224  0.22084    
## AcidIndex          -3.065e-01  1.148e-02 -26.692  < 2e-16 ***
## STARS               8.412e-01  1.710e-02  49.179  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 2.902446)
## 
##     Null deviance: 47477  on 12794  degrees of freedom
## Residual deviance: 37099  on 12782  degrees of freedom
## AIC: 49959
## 
## Number of Fisher Scoring iterations: 2

Select Models

Predictions

Similar to the train data, the evaluation data also needs some prep work. As was done for the train data, the eval data has had columns removed, and NA values imputed using the MICE random forest method to predict what the NA values could be.

## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning
## Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning
## -Inf
vars n mean sd median trimmed mad min max range skew kurtosis se
IN 1 3335 8048.3109445 4655.4790369 7906.0000 8044.2832522 5960.0520000 3.00000 16130.00000 1.6127e+04 0.0124697 -1.2000392 80.6151110
TARGET* 2 0 NaN NA NA NaN NA Inf -Inf -Inf NA NA NA
FixedAcidity 3 3335 6.8638081 6.3184313 6.9000 6.9142750 2.8169400 -18.20000 33.50000 5.1700e+01 -0.1172460 2.0399637 0.1094111
VolatileAcidity 4 3335 0.3102714 0.8068341 0.2800 0.3129730 0.4596060 -2.83000 3.61000 6.4400e+00 -0.0437301 1.6171958 0.0139713
CitricAcid 5 3335 0.3124288 0.8709938 0.3100 0.3110978 0.4447800 -3.12000 3.76000 6.8800e+00 -0.0284898 1.6564422 0.0150823
ResidualSugar 6 3335 5.1860570 34.1602735 3.5000 5.3186774 16.9016400 -128.30000 145.40000 2.7370e+02 -0.0492818 2.0015002 0.5915254
Chlorides 7 3335 0.0593979 0.3135205 0.0460 0.0607958 0.1156428 -1.15000 1.26300 2.4130e+00 -0.0455098 1.7219311 0.0054290
FreeSulfurDioxide 8 3335 33.9872564 148.8929259 29.0000 33.2615212 57.0801000 -563.00000 617.00000 1.1800e+03 0.0730972 1.8678963 2.5782566
TotalSulfurDioxide 9 3335 123.4229385 224.5781463 124.0000 124.0080555 136.3992000 -769.00000 1004.00000 1.7730e+03 -0.0437972 1.4893654 3.8888355
Density 10 3335 0.9946698 0.0261905 0.9946 0.9946690 0.0090290 0.88975 1.09983 2.1008e-01 -0.0296593 1.9359398 0.0004535
pH 11 3335 3.2342819 0.6740613 3.2100 3.2306669 0.3558240 0.60000 6.21000 5.6100e+00 0.1100501 1.7179640 0.0116722
Sulphates 12 3335 0.5409265 0.8949812 0.5000 0.5413713 0.3706500 -3.07000 4.18000 7.2500e+00 -0.0029721 1.8453436 0.0154977
Alcohol 13 3335 10.6136552 3.7589939 10.4000 10.6018946 2.5204200 -4.20000 25.60000 2.9800e+01 0.0820618 1.5780691 0.0650914
LabelAppeal 14 3335 0.0134933 0.8885718 0.0000 0.0063694 1.4826000 -2.00000 2.00000 4.0000e+00 0.0454887 -0.2601115 0.0153867
AcidIndex 15 3335 7.7478261 1.3154203 8.0000 7.6212064 1.4826000 5.00000 17.00000 1.2000e+01 1.5066589 4.2794836 0.0227781
STARS 16 3335 1.9985007 0.8933858 2.0000 1.9280629 1.4826000 1.00000 4.00000 3.0000e+00 0.4747543 -0.6880260 0.0154700

Evaluating the model

The model will be evaluated by looking at the MSE.

Comparison of model MSE:
Model MSE
Linear Model 2.636385
Poisson Model 2 6.455605
Poisson Model 1 7.038349
Negative BinomMod 7.038348
Zero Inflation 2.731825
GLmulti 2.636562
ABS 2.899497
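
The report does not show how these values were computed. One thing worth checking for the Poisson and negative binomial rows is the prediction scale: predict() on a glm object defaults to type = "link", which returns the log of the expected count. Whether that happened here is an assumption and is not confirmed by the output. The sketch below shows the effect on simulated data, and all names in it are illustrative.

```r
set.seed(621)

# Simulated count data; names and values are illustrative only.
n     <- 500
stars <- sample(1:4, n, replace = TRUE)
y     <- rpois(n, lambda = exp(0.3 + 0.3 * stars))
demo  <- data.frame(y = y, stars = stars)

fit <- glm(y ~ stars, family = poisson, data = demo)

mse <- function(actual, predicted) mean((actual - predicted)^2)

# predict.glm() defaults to type = "link", i.e. the log of the expected count.
mse_link <- mse(demo$y, predict(fit))
# type = "response" returns predictions on the count scale.
mse_response <- mse(demo$y, predict(fit, type = "response"))

cat("MSE using link-scale predictions:    ", round(mse_link, 3), "\n")
cat("MSE using response-scale predictions:", round(mse_response, 3), "\n")
```

If the count models were scored on the link scale, recomputing with type = "response" would give a fairer comparison against the linear model.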

The linear model and the GLmulti model have very close MSE values. Both models' predictions are shown below:

Model 4

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.9126  2.3506  3.0063  3.0654  3.7495  6.5648

Model 6

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.9137  2.3497  3.0036  3.0655  3.7519  6.5873
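
Both prediction summaries include negative and fractional values, which cannot occur for a count of wine cases. One possible post-processing step, not part of the original analysis, is to round the predictions and floor them at zero. The sketch below uses the six summary values from the Model 4 output as stand-in predictions.

```r
# Summary values copied from the Model 4 output above, used here as
# stand-in predictions.
pred <- c(-0.9126, 2.3506, 3.0063, 3.0654, 3.7495, 6.5648)

# Counts cannot be negative or fractional: round, then floor at zero.
pred_count <- pmax(round(pred), 0)

print(pred_count)
```

This keeps the linear model's predictions inside the valid range, although a count model that respects the range by construction would avoid the need for such a fix.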

References

All subset regression with leaps, bestglm, glmulti, and meifly. (n.d.). Retrieved from https://rstudio-pubs-static.s3.amazonaws.com/2897_9220b21cfc0c43a396ff9abf122bb351.html

Model selection and multimodel inference made easy. (n.d.). Retrieved from https://cran.r-project.org/web/packages/glmulti/glmulti.pdf

Best subset model selection with R. (n.d.). Retrieved from http://jadianes.me/best-subset-model-selection-with-R

Zero-inflated Poisson regression | R data analysis examples. (n.d.). Retrieved from https://stats.idre.ucla.edu/r/dae/zip/