Data 621 - HW 5

Murali Kunissery

May 1, 2019

Overview

In this homework assignment, you will explore, analyze and model a data set containing information on approximately 12,000 commercially available wines. The variables are mostly related to the chemical properties of the wine being sold. The response variable is the number of sample cases of wine that were purchased by wine distribution companies after sampling a wine. These cases would be used to provide tasting samples to restaurants and wine stores around the United States. The more sample cases purchased, the more likely is a wine to be sold at a high end restaurant. A large wine manufacturer is studying the data in order to predict the number of wine cases ordered based upon the wine characteristics. If the wine manufacturer can predict the number of cases, then that manufacturer will be able to adjust their wine offering to maximize sales.

Your objective is to build a count regression model to predict the number of cases of wine that will be sold given certain properties of the wine. HINT: Sometimes, the fact that a variable is missing is actually predictive of the target. You can only use the variables given to you (or variables that you derive from the variables provided). Below is a short description of the variables of interest in the data set:

Data Exploration

The summary below shows missing values in eight of the variables in the wine dataset (see the na_count column). The TARGET variable, the number of wine cases sold, is discrete rather than continuous.
variable vars n mean sd median trimmed mad min max range skew kurtosis se na_count
INDEX 1 12795 8069.9803048 4656.9051071 8110.00000 8071.0294031 5977.8432000 1.00000 16129.00000 1.6128e+04 -0.0032496 -1.2005027 41.1696565 0
TARGET 2 12795 3.0290739 1.9263682 3.00000 3.0538244 1.4826000 0.00000 8.00000 8.0000e+00 -0.3263010 -0.8772457 0.0170302 0
FixedAcidity 3 12795 7.0757171 6.3176435 6.90000 7.0736739 3.2617200 -18.10000 34.40000 5.2500e+01 -0.0225860 1.6749987 0.0558515 0
VolatileAcidity 4 12795 0.3241039 0.7840142 0.28000 0.3243890 0.4299540 -2.79000 3.68000 6.4700e+00 0.0203800 1.8322106 0.0069311 0
CitricAcid 5 12795 0.3084127 0.8620798 0.31000 0.3102520 0.4151280 -3.24000 3.86000 7.1000e+00 -0.0503070 1.8379401 0.0076213 0
ResidualSugar 6 12179 5.4187331 33.7493790 3.90000 5.5800410 15.7155600 -127.80000 141.15000 2.6895e+02 -0.0531229 1.8846917 0.3058158 616
Chlorides 7 12157 0.0548225 0.3184673 0.04600 0.0540159 0.1349166 -1.17100 1.35100 2.5220e+00 0.0304272 1.7886044 0.0028884 638
FreeSulfurDioxide 8 12148 30.8455713 148.7145577 30.00000 30.9334877 56.3388000 -555.00000 623.00000 1.1780e+03 0.0063930 1.8364966 1.3492769 647
TotalSulfurDioxide 9 12113 120.7142326 231.9132105 123.00000 120.8895367 134.9166000 -823.00000 1057.00000 1.8800e+03 -0.0071794 1.6746665 2.1071703 682
Density 10 12795 0.9942027 0.0265376 0.99449 0.9942130 0.0093552 0.88809 1.09924 2.1115e-01 -0.0186938 1.8999592 0.0002346 0
pH 11 12400 3.2076282 0.6796871 3.20000 3.2055706 0.3854760 0.48000 6.13000 5.6500e+00 0.0442880 1.6462681 0.0061038 395
Sulphates 12 11585 0.5271118 0.9321293 0.50000 0.5271453 0.4447800 -3.13000 4.24000 7.3700e+00 0.0059119 1.7525655 0.0086602 1210
Alcohol 13 12142 10.4892363 3.7278190 10.40000 10.5018255 2.3721600 -4.70000 26.50000 3.1200e+01 -0.0307158 1.5394949 0.0338306 653
LabelAppeal 14 12795 -0.0090660 0.8910892 0.00000 -0.0099639 1.4826000 -2.00000 2.00000 4.0000e+00 0.0084295 -0.2622916 0.0078777 0
AcidIndex 15 12795 7.7727237 1.3239264 8.00000 7.6431572 1.4826000 4.00000 17.00000 1.3000e+01 1.6484959 5.1900925 0.0117043 0
STARS 16 9436 2.0417550 0.9025400 2.00000 1.9711258 1.4826000 1.00000 4.00000 3.0000e+00 0.4472353 -0.6925343 0.0092912 3359

Visual Exploration

Boxplots

The boxplots below cover every variable in the dataset and show how the data are spread for each one.

The target variable, the number of cases, is shown below. The data show a large number of zero values.

Apart from the spike at zero, the distribution resembles a Poisson distribution.
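A quick numeric check of this impression is possible with only the TARGET mean and standard deviation from the summary table. Under a Poisson distribution the variance equals the mean and the share of zeros is exp(-mean). The sketch below is illustrative only; the observed share of zeros is not reported in this write-up, and `wine_train` is a hypothetical name for the training data frame.

```r
# Values copied from the TARGET row of the summary table
target_mean <- 3.0290739
target_sd   <- 1.9263682

# Poisson implies variance == mean; a ratio above 1 hints at overdispersion
dispersion_ratio <- target_sd^2 / target_mean

# Share of zeros that a Poisson distribution with this mean would produce
expected_zero_share <- dpois(0, lambda = target_mean)

cat("variance / mean:    ", round(dispersion_ratio, 3), "\n")
cat("expected zero share:", round(expected_zero_share, 4), "\n")

# With the data loaded, the observed share would be (hypothetical object name):
# mean(wine_train$TARGET == 0)
```

If the observed share of zeros is well above the expected share, a zero-inflated model is worth trying alongside the plain Poisson.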

## Warning: Removed 3359 rows containing non-finite values (stat_count).

AcidIndex looks more like a Poisson distribution, with a right skew (about 1.65 in the summary table). LabelAppeal and STARS seem to be more categorical.

## Warning: Removed 4841 rows containing non-finite values (stat_bin).

The other variables look roughly symmetric and bell-shaped but heavy-tailed, with kurtosis between about 1.5 and 1.9 in the summary table.

Correlation

The correlation plot below shows how variables in the dataset are related to each other. Looking at the plot, we can see that certain variables are more related than others.

For this project, it makes sense to break down the correlation by target - since that’s what we’re trying to predict.
Variable Correlation
INDEX 0.0314911
TARGET 0.4979465
FixedAcidity 0.0113760
VolatileAcidity -0.0202420
CitricAcid 0.0153316
ResidualSugar -0.0045793
Chlorides -0.0063870
FreeSulfurDioxide 0.0149601
TotalSulfurDioxide -0.0027237
Density -0.0180944
pH 0.0002182
Sulphates 0.0037687
Alcohol -0.0006449
LabelAppeal 1.0000000
AcidIndex 0.0103010
STARS 0.3188970

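Looking at the correlations, very few variables are correlated at all. Note that the column printed above has LabelAppeal at exactly 1.0, so it reports correlations with LabelAppeal rather than with TARGET. In that column only TARGET (0.50) and STARS (0.32) show a moderate positive correlation, and every chemical property is within about 0.03 of zero. The small negative correlation between AcidIndex and TARGET is not visible in this column.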
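To break the correlation matrix down by the response, the TARGET column can be selected by name, which avoids picking a neighbouring column by position. The sketch below uses a small simulated data frame, not the wine data; the column names mirror the dataset but the values and the effect size are assumptions made for illustration.

```r
set.seed(621)
n <- 200
# Toy stand-in for the wine data (simulated, not the real values)
toy <- data.frame(
  LabelAppeal = sample(-2:2, n, replace = TRUE),
  STARS       = sample(c(1:4, NA), n, replace = TRUE),
  AcidIndex   = rpois(n, 8)
)
toy$TARGET <- rpois(n, lambda = exp(0.8 + 0.2 * toy$LabelAppeal))

# Full correlation matrix, skipping NA pairs rather than dropping whole rows
cor_mat <- cor(toy, use = "pairwise.complete.obs")

# Pull the TARGET column by name so the wrong column cannot be picked by position
cor_with_target <- sort(cor_mat[, "TARGET"], decreasing = TRUE)
print(round(cor_with_target, 3))
```

A correct column always shows the response itself at 1.0, which is a simple sanity check on the printed table.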

Missing Values

According to the graph, several variables have missing values. STARS has the most, with 3,359 NAs. These missing values will be imputed later, during data preparation, using the MICE package.
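The same counts can be produced numerically, and the assignment hint (missingness itself may be predictive) suggests keeping a flag before imputing. The sketch below runs on a tiny made-up data frame; `STARS_missing` is a hypothetical column name, not one used elsewhere in this report.

```r
# Toy data frame with missing values (made up; column names mirror the wine data)
toy <- data.frame(
  STARS     = c(2, NA, 3, NA, 1, NA),
  Sulphates = c(0.5, 0.4, NA, 0.6, 0.5, 0.7),
  Density   = c(0.99, 1.00, 0.99, 0.98, 1.01, 0.99)
)

# Missing values per column, largest first
na_count <- sort(colSums(is.na(toy)), decreasing = TRUE)
print(na_count)

# Indicator that survives imputation: 1 when STARS was originally missing
toy$STARS_missing <- as.integer(is.na(toy$STARS))
```

Creating the flag before imputation matters, because afterwards the imputed values are indistinguishable from observed ones.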

Data Prep

Imputation of Missing (NA) values

The data exploration revealed several variables with numerous NA values. There are multiple ways to handle NA data: deleting the observations, deleting the variables, imputation with the mean/median/mode, or imputation with a prediction.

Imputation with the mean/median/mode is an easy way to fill in the missing values; however, it reduces the variance in the dataset and shrinks standard errors, which can invalidate hypothesis tests.
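The variance reduction is easy to see on simulated data. The sketch below is not run on the wine data: it draws a normal variable, removes 30% of the values at random, fills them with the observed mean, and compares standard deviations. All numbers in it are assumptions chosen for illustration.

```r
set.seed(621)
x <- rnorm(1000, mean = 10, sd = 4)   # simulated "complete" variable

# Knock out 30% of the values at random
x_missing <- x
x_missing[sample(length(x), 300)] <- NA

# Mean imputation: every NA gets the observed mean
x_mean_imputed <- ifelse(is.na(x_missing), mean(x_missing, na.rm = TRUE), x_missing)

sd_observed <- sd(x_missing, na.rm = TRUE)
sd_imputed  <- sd(x_mean_imputed)

cat("sd of observed values:   ", round(sd_observed, 3), "\n")
cat("sd after mean imputation:", round(sd_imputed, 3), "\n")
```

Because every imputed value sits exactly at the mean, the sum of squared deviations is unchanged while the sample size grows, so the standard deviation shrinks by a factor of sqrt(699/999) here.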

In this case, the missing values will be imputed by prediction, using the MICE (Multivariate Imputation by Chained Equations) package with a random forest method.

Because the data have many missing values spread over several variables, the MICE algorithm takes some computing time.

variable vars n mean sd median trimmed mad min max range skew kurtosis se
TARGET 1 12795 3.0290739 1.9263682 3.00000 3.0538244 1.4826000 0.00000 8.00000 8.00000 -0.3263010 -0.8772457 0.0170302
FixedAcidity 2 12795 7.0757171 6.3176435 6.90000 7.0736739 3.2617200 -18.10000 34.40000 52.50000 -0.0225860 1.6749987 0.0558515
VolatileAcidity 3 12795 0.3241039 0.7840142 0.28000 0.3243890 0.4299540 -2.79000 3.68000 6.47000 0.0203800 1.8322106 0.0069311
CitricAcid 4 12795 0.3084127 0.8620798 0.31000 0.3102520 0.4151280 -3.24000 3.86000 7.10000 -0.0503070 1.8379401 0.0076213
ResidualSugar 5 12795 5.4139977 33.4843877 3.80000 5.5908225 15.4190400 -127.80000 141.15000 268.95000 -0.0637217 1.9596143 0.2960208
Chlorides 6 12795 0.0541082 0.3165439 0.04600 0.0531466 0.1260210 -1.17100 1.35100 2.52200 0.0389921 1.8544078 0.0027984
FreeSulfurDioxide 7 12795 30.5899179 147.3925457 30.00000 30.6372472 51.8910000 -555.00000 623.00000 1178.00000 0.0101973 1.8785178 1.3030329
TotalSulfurDioxide 8 12795 121.1465807 231.4060347 123.00000 121.2777669 133.4340000 -823.00000 1057.00000 1880.00000 -0.0058771 1.7067421 2.0457593
Density 9 12795 0.9942027 0.0265376 0.99449 0.9942130 0.0093552 0.88809 1.09924 0.21115 -0.0186938 1.8999592 0.0002346
pH 10 12795 3.2082532 0.6767616 3.20000 3.2059431 0.3706500 0.48000 6.13000 5.65000 0.0539591 1.6746911 0.0059830
Sulphates 11 12795 0.5238093 0.9178227 0.50000 0.5238244 0.4151280 -3.13000 4.24000 7.37000 0.0078332 1.8619842 0.0081141
Alcohol 12 12795 10.4918776 3.7011782 10.40000 10.5064635 2.3721600 -4.70000 26.50000 31.20000 -0.0375788 1.5885671 0.0327205
LabelAppeal 13 12795 -0.0090660 0.8910892 0.00000 -0.0099639 1.4826000 -2.00000 2.00000 4.00000 0.0084295 -0.2622916 0.0078777
AcidIndex 14 12795 7.7727237 1.3239264 8.00000 7.6431572 1.4826000 4.00000 17.00000 13.00000 1.6484959 5.1900925 0.0117043
STARS 15 12795 1.9859320 0.8880368 2.00000 1.9125720 1.4826000 1.00000 4.00000 3.00000 0.5062041 -0.6228959 0.0078507

Absolute value of variables

Classmates have discussed taking the absolute value of the variables in the dataset, because several variables contain negative numbers whose validity is debated.

In this case I will apply an absolute-value (ABS) transformation and refit the top-performing model on the transformed data.

It seems, however, that taking the absolute value of the values in the dataset introduces a right skew in variables that would otherwise have been approximately normal.

If the data are then log-transformed, they appear ‘more’ normal, though it is unclear whether this would introduce overfitting.

## Warning: Transformation introduced infinite values in continuous x-axis
## Warning: Removed 5807 rows containing non-finite values (stat_bin).
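The right skew introduced by the absolute value can be reproduced on simulated data. The sketch below is an illustration only, assuming a symmetric variable centred on zero, which is an idealised stand-in for the chemical variables; the `skewness` helper is defined here and is not taken from any package.

```r
set.seed(621)
# Simulated symmetric variable centred on zero (not the real wine data)
x <- rnorm(10000, mean = 0, sd = 1)

# Sample skewness: third central moment over the cubed standard deviation
skewness <- function(v) {
  d <- v - mean(v)
  mean(d^3) / mean(d^2)^1.5
}

skew_raw <- skewness(x)
skew_abs <- skewness(abs(x))

cat("skew of x:     ", round(skew_raw, 3), "\n")
cat("skew of abs(x):", round(skew_abs, 3), "\n")

# log() of an exact zero is -Inf, the kind of value the plotting warning reports
log(abs(0))
```

Folding a symmetric distribution at zero stacks the left tail on top of the right one, so the result is bounded below and skewed to the right. The effect is weaker for variables whose mean is far from zero relative to their spread.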

Build Models

Throughout this section, various models will be created to determine which gives the best “fit” for predicting the number of wine cases purchased. In this assignment, I will try linear, negative binomial, and Poisson models, as suggested by the homework instructions.

Model 1 - Poisson with imputed data

As per the homework videos, the Poisson distribution works well with count data.

## 
## Call:
## glm(formula = TARGET ~ ., family = poisson, data = imputed)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.9778  -0.5321   0.2094   0.6267   2.5356  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         2.107e+00  1.956e-01  10.772  < 2e-16 ***
## FixedAcidity       -3.080e-04  8.200e-04  -0.376 0.707210    
## VolatileAcidity    -5.003e-02  6.490e-03  -7.709 1.27e-14 ***
## CitricAcid          1.233e-02  5.889e-03   2.094 0.036224 *  
## ResidualSugar       1.899e-04  1.519e-04   1.251 0.211021    
## Chlorides          -5.336e-02  1.614e-02  -3.307 0.000943 ***
## FreeSulfurDioxide   1.360e-04  3.453e-05   3.940 8.15e-05 ***
## TotalSulfurDioxide  1.041e-04  2.208e-05   4.715 2.41e-06 ***
## Density            -4.816e-01  1.922e-01  -2.506 0.012208 *  
## pH                 -2.406e-02  7.554e-03  -3.185 0.001449 ** 
## Sulphates          -1.761e-02  5.550e-03  -3.173 0.001509 ** 
## Alcohol             5.253e-03  1.384e-03   3.796 0.000147 ***
## LabelAppeal         1.948e-01  6.026e-03  32.322  < 2e-16 ***
## AcidIndex          -1.205e-01  4.460e-03 -27.017  < 2e-16 ***
## STARS               1.975e-01  5.792e-03  34.088  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 22861  on 12794  degrees of freedom
## Residual deviance: 18409  on 12780  degrees of freedom
## AIC: 50381
## 
## Number of Fisher Scoring iterations: 5
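Poisson coefficients are on the log scale, so exponentiating them gives the multiplicative change in expected cases for a one-unit increase in the predictor. The sketch below does this for three of the estimates printed above and also forms the residual deviance over its degrees of freedom, a rough overdispersion check. The choice of which coefficients to show is mine.

```r
# Estimates copied from the Model 1 summary (log scale)
coefs <- c(LabelAppeal = 0.1948, STARS = 0.1975, AcidIndex = -0.1205)

# exp(coefficient) is the multiplicative change in expected cases per one-unit increase
rate_ratios <- exp(coefs)
print(round(rate_ratios, 3))

# Rough overdispersion check: residual deviance over its degrees of freedom
deviance_ratio <- 18409 / 12780
cat("residual deviance / df:", round(deviance_ratio, 3), "\n")
```

A ratio near 1 is what a well-specified Poisson model would give; values noticeably above 1 are one motivation for trying the negative binomial and zero-inflated models that follow.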

Model 2 - Poisson without imputed data

## 
## Call:
## glm(formula = TARGET ~ ., family = poisson, data = wine_train1)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2158  -0.2734   0.0616   0.3732   1.6830  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         1.593e+00  2.506e-01   6.359 2.03e-10 ***
## FixedAcidity        3.293e-04  1.053e-03   0.313  0.75447    
## VolatileAcidity    -2.560e-02  8.353e-03  -3.065  0.00218 ** 
## CitricAcid         -7.259e-04  7.575e-03  -0.096  0.92365    
## ResidualSugar      -6.141e-05  1.941e-04  -0.316  0.75165    
## Chlorides          -3.007e-02  2.056e-02  -1.463  0.14346    
## FreeSulfurDioxide   6.734e-05  4.404e-05   1.529  0.12620    
## TotalSulfurDioxide  2.081e-05  2.855e-05   0.729  0.46618    
## Density            -3.725e-01  2.462e-01  -1.513  0.13026    
## pH                 -4.661e-03  9.598e-03  -0.486  0.62722    
## Sulphates          -5.164e-03  7.051e-03  -0.732  0.46398    
## Alcohol             3.948e-03  1.771e-03   2.229  0.02579 *  
## LabelAppeal         1.771e-01  7.954e-03  22.271  < 2e-16 ***
## AcidIndex          -4.870e-02  5.903e-03  -8.251  < 2e-16 ***
## STARS               1.871e-01  7.487e-03  24.993  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 5844.1  on 6435  degrees of freedom
## Residual deviance: 4009.1  on 6421  degrees of freedom
##   (6359 observations deleted due to missingness)
## AIC: 23172
## 
## Number of Fisher Scoring iterations: 5

Model 3 - Negative Binomial

## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached

## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## 
## Call:
## glm.nb(formula = TARGET ~ ., data = imputed, init.theta = 37593.81225, 
##     link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.9776  -0.5321   0.2094   0.6267   2.5355  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         2.107e+00  1.956e-01  10.771  < 2e-16 ***
## FixedAcidity       -3.080e-04  8.200e-04  -0.376 0.707203    
## VolatileAcidity    -5.003e-02  6.490e-03  -7.708 1.28e-14 ***
## CitricAcid          1.233e-02  5.889e-03   2.094 0.036234 *  
## ResidualSugar       1.899e-04  1.519e-04   1.251 0.211025    
## Chlorides          -5.336e-02  1.614e-02  -3.307 0.000944 ***
## FreeSulfurDioxide   1.360e-04  3.453e-05   3.940 8.16e-05 ***
## TotalSulfurDioxide  1.041e-04  2.208e-05   4.715 2.42e-06 ***
## Density            -4.816e-01  1.922e-01  -2.506 0.012211 *  
## pH                 -2.406e-02  7.554e-03  -3.185 0.001449 ** 
## Sulphates          -1.761e-02  5.550e-03  -3.173 0.001509 ** 
## Alcohol             5.253e-03  1.384e-03   3.795 0.000147 ***
## LabelAppeal         1.948e-01  6.026e-03  32.320  < 2e-16 ***
## AcidIndex          -1.205e-01  4.460e-03 -27.016  < 2e-16 ***
## STARS               1.975e-01  5.793e-03  34.087  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(37593.81) family taken to be 1)
## 
##     Null deviance: 22860  on 12794  degrees of freedom
## Residual deviance: 18408  on 12780  degrees of freedom
## AIC: 50383
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  37594 
##           Std. Err.:  59621 
## Warning while fitting theta: iteration limit reached 
## 
##  2 x log-likelihood:  -50351.49
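The coefficients above match Model 1 almost digit for digit, and the estimated theta (37594, with a larger standard error and an iteration-limit warning) explains why. In the parameterisation used by glm.nb, the variance is mu + mu^2 / theta, so a very large theta leaves essentially the Poisson variance. The sketch below plugs in the printed theta and, as an assumption, the overall TARGET mean as a representative mu.

```r
# Values taken from the output above and from the TARGET summary row
theta <- 37594
mu    <- 3.0290739

# Negative binomial variance is mu + mu^2 / theta; Poisson variance is just mu
nb_variance     <- mu + mu^2 / theta
extra_over_pois <- nb_variance / mu - 1

cat("NB variance:                 ", nb_variance, "\n")
cat("relative excess over Poisson:", signif(extra_over_pois, 3), "\n")
```

With an excess this small, the negative binomial fit adds a parameter without changing the predictions, which is consistent with its AIC being about 2 higher than the Poisson model's.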

Linear Model

## 
## Call:
## lm(formula = TARGET ~ ., data = imputed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.1183 -0.7321  0.3690  1.1089  4.9856 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         5.699e+00  5.520e-01  10.324  < 2e-16 ***
## FixedAcidity       -7.640e-04  2.318e-03  -0.330 0.741743    
## VolatileAcidity    -1.520e-01  1.841e-02  -8.256  < 2e-16 ***
## CitricAcid          3.480e-02  1.676e-02   2.077 0.037832 *  
## ResidualSugar       6.234e-04  4.305e-04   1.448 0.147609    
## Chlorides          -1.718e-01  4.556e-02  -3.771 0.000163 ***
## FreeSulfurDioxide   4.084e-04  9.788e-05   4.173 3.03e-05 ***
## TotalSulfurDioxide  3.045e-04  6.238e-05   4.882 1.06e-06 ***
## Density            -1.434e+00  5.436e-01  -2.638 0.008359 ** 
## pH                 -6.452e-02  2.133e-02  -3.025 0.002494 ** 
## Sulphates          -5.004e-02  1.571e-02  -3.185 0.001453 ** 
## Alcohol             1.820e-02  3.906e-03   4.660 3.19e-06 ***
## LabelAppeal         5.907e-01  1.694e-02  34.868  < 2e-16 ***
## AcidIndex          -3.191e-01  1.121e-02 -28.467  < 2e-16 ***
## STARS               6.469e-01  1.709e-02  37.842  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.629 on 12780 degrees of freedom
## Multiple R-squared:  0.2856, Adjusted R-squared:  0.2848 
## F-statistic:   365 on 14 and 12780 DF,  p-value: < 2.2e-16

Zero inflation

## 
## Call:
## zeroinfl(formula = TARGET ~ . | STARS, data = imputed, dist = "negbin")
## 
## Pearson residuals:
##     Min      1Q  Median      3Q     Max 
## -2.2471 -0.3092  0.1839  0.5108  2.2256 
## 
## Count model coefficients (negbin with log link):
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         1.682e+00  2.061e-01   8.162 3.28e-16 ***
## FixedAcidity        1.662e-04  8.542e-04   0.195  0.84578    
## VolatileAcidity    -1.894e-02  6.845e-03  -2.767  0.00567 ** 
## CitricAcid          1.580e-03  6.121e-03   0.258  0.79627    
## ResidualSugar      -1.627e-05  1.589e-04  -0.102  0.91844    
## Chlorides          -2.498e-02  1.691e-02  -1.478  0.13943    
## FreeSulfurDioxide   4.829e-05  3.532e-05   1.367  0.17158    
## TotalSulfurDioxide -3.512e-06  2.235e-05  -0.157  0.87512    
## Density            -3.368e-01  2.015e-01  -1.671  0.09465 .  
## pH                  1.824e-04  7.915e-03   0.023  0.98161    
## Sulphates          -3.111e-03  5.823e-03  -0.534  0.59313    
## Alcohol             7.102e-03  1.434e-03   4.952 7.36e-07 ***
## LabelAppeal         2.434e-01  6.368e-03  38.216  < 2e-16 ***
## AcidIndex          -4.254e-02  5.406e-03  -7.870 3.55e-15 ***
## STARS               9.273e-02  6.307e-03  14.703  < 2e-16 ***
## Log(theta)          1.759e+01  1.117e+00  15.747  < 2e-16 ***
## 
## Zero-inflation model coefficients (binomial with logit link):
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.34111    0.06205  -5.497 3.85e-08 ***
## STARS       -0.62108    0.03332 -18.642  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Theta = 43780234.4711 
## Number of iterations in BFGS optimization: 79 
## Log-likelihood: -2.297e+04 on 18 Df
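The zero-inflation part of the output is a logistic model for the probability that an observation is a structural zero. Applying the inverse logit to the two printed coefficients gives that probability at each STARS level. The sketch below is a reading aid for the output above and does not refit anything.

```r
# Zero-inflation coefficients copied from the output above (logit scale)
zi_intercept <- -0.34111
zi_stars     <- -0.62108

stars <- 1:4
# plogis() is the inverse logit: probability of a structural zero at each STARS level
p_structural_zero <- plogis(zi_intercept + zi_stars * stars)

print(data.frame(STARS = stars, p_structural_zero = round(p_structural_zero, 3)))
```

The negative STARS coefficient means higher-rated wines are less likely to receive no orders at all. Note also that the printed theta is extremely large, so the count part of this model behaves like a Poisson.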

Model - glmulti Package

The glmulti package is an “automated model selection and model averaging” tool. The package automatically generates all possible models “with the specified response and explanatory variables”. The tool is basically used to find the “best” model.

# No family is supplied, so glm() defaults to gaussian; this fit is an ordinary linear model.
glmmodel <- glm(imputed$TARGET ~ 1 + VolatileAcidity + CitricAcid + Chlorides + 
    FreeSulfurDioxide + TotalSulfurDioxide + Density + pH + Sulphates + 
    Alcohol + LabelAppeal + AcidIndex + STARS, data = imputed)

summary(glmmodel)
## 
## Call:
## glm(formula = imputed$TARGET ~ 1 + VolatileAcidity + CitricAcid + 
##     Chlorides + FreeSulfurDioxide + TotalSulfurDioxide + Density + 
##     pH + Sulphates + Alcohol + LabelAppeal + AcidIndex + STARS, 
##     data = imputed)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -6.0970  -0.7356   0.3684   1.1067   4.9945  
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         5.698e+00  5.520e-01  10.323  < 2e-16 ***
## VolatileAcidity    -1.522e-01  1.841e-02  -8.265  < 2e-16 ***
## CitricAcid          3.464e-02  1.676e-02   2.067 0.038738 *  
## Chlorides          -1.719e-01  4.556e-02  -3.773 0.000162 ***
## FreeSulfurDioxide   4.103e-04  9.787e-05   4.193 2.77e-05 ***
## TotalSulfurDioxide  3.063e-04  6.236e-05   4.911 9.15e-07 ***
## Density            -1.430e+00  5.436e-01  -2.630 0.008547 ** 
## pH                 -6.417e-02  2.133e-02  -3.009 0.002630 ** 
## Sulphates          -5.029e-02  1.571e-02  -3.202 0.001369 ** 
## Alcohol             1.810e-02  3.905e-03   4.634 3.62e-06 ***
## LabelAppeal         5.907e-01  1.694e-02  34.872  < 2e-16 ***
## AcidIndex          -3.199e-01  1.104e-02 -28.985  < 2e-16 ***
## STARS               6.470e-01  1.709e-02  37.854  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 2.653908)
## 
##     Null deviance: 47477  on 12794  degrees of freedom
## Residual deviance: 33922  on 12782  degrees of freedom
## AIC: 48814
## 
## Number of Fisher Scoring iterations: 2

# Same formula refit on the absolute-value data (absdata)
glmmodelabs <- glm(absdata$TARGET ~ 1 + VolatileAcidity + CitricAcid + Chlorides + 
    FreeSulfurDioxide + TotalSulfurDioxide + Density + pH + Sulphates + 
    Alcohol + LabelAppeal + AcidIndex + STARS, data = absdata)

summary(glmmodelabs)
## 
## Call:
## glm(formula = absdata$TARGET ~ 1 + VolatileAcidity + CitricAcid + 
##     Chlorides + FreeSulfurDioxide + TotalSulfurDioxide + Density + 
##     pH + Sulphates + Alcohol + LabelAppeal + AcidIndex + STARS, 
##     data = absdata)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -5.5940  -1.1216   0.2798   1.1736   5.8825  
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         5.463e+00  5.795e-01   9.428  < 2e-16 ***
## VolatileAcidity    -1.817e-01  2.722e-02  -6.674 2.60e-11 ***
## CitricAcid          5.816e-02  2.495e-02   2.331  0.01974 *  
## Chlorides          -1.758e-01  6.465e-02  -2.719  0.00655 ** 
## FreeSulfurDioxide   3.193e-04  1.407e-04   2.270  0.02323 *  
## TotalSulfurDioxide  2.886e-04  9.275e-05   3.112  0.00186 ** 
## Density            -1.629e+00  5.698e-01  -2.858  0.00426 ** 
## pH                 -5.824e-02  2.237e-02  -2.604  0.00923 ** 
## Sulphates          -6.274e-02  2.339e-02  -2.683  0.00731 ** 
## Alcohol             1.724e-02  4.205e-03   4.101 4.14e-05 ***
## LabelAppeal        -3.692e-02  2.431e-02  -1.519  0.12890    
## AcidIndex          -3.029e-01  1.152e-02 -26.281  < 2e-16 ***
## STARS               8.276e-01  1.711e-02  48.378  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 2.917411)
## 
##     Null deviance: 47477  on 12794  degrees of freedom
## Residual deviance: 37290  on 12782  degrees of freedom
## AIC: 50025
## 
## Number of Fisher Scoring iterations: 2

Select Models

Predictions

Like the training data, the evaluation data also needs some prep work. As was done for the training data, columns have been removed and NA values imputed using the MICE random forest method.

## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning
## Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning
## -Inf
variable vars n mean sd median trimmed mad min max range skew kurtosis se
IN 1 3335 8048.3109445 4655.4790369 7906.0000 8044.2832522 5960.0520000 3.00000 16130.00000 1.6127e+04 0.0124697 -1.2000392 80.6151110
TARGET 2 0 NaN NA NA NaN NA Inf -Inf -Inf NA NA NA
FixedAcidity 3 3335 6.8638081 6.3184313 6.9000 6.9142750 2.8169400 -18.20000 33.50000 5.1700e+01 -0.1172460 2.0399637 0.1094111
VolatileAcidity 4 3335 0.3102714 0.8068341 0.2800 0.3129730 0.4596060 -2.83000 3.61000 6.4400e+00 -0.0437301 1.6171958 0.0139713
CitricAcid 5 3335 0.3124288 0.8709938 0.3100 0.3110978 0.4447800 -3.12000 3.76000 6.8800e+00 -0.0284898 1.6564422 0.0150823
ResidualSugar 6 3335 5.2461319 34.3476127 3.5000 5.3553203 16.9016400 -128.30000 145.40000 2.7370e+02 -0.0403317 1.9591596 0.5947694
Chlorides 7 3335 0.0620654 0.3136001 0.0460 0.0633162 0.1141602 -1.15000 1.26300 2.4130e+00 -0.0323702 1.7410859 0.0054304
FreeSulfurDioxide 8 3335 34.6649175 148.7904450 30.0000 34.2701386 56.3388000 -563.00000 617.00000 1.1800e+03 0.0406824 1.9560211 2.5764821
TotalSulfurDioxide 9 3335 123.4229385 226.0029862 124.0000 123.8049831 137.8818000 -769.00000 1004.00000 1.7730e+03 -0.0382917 1.4363477 3.9135083
Density 10 3335 0.9946698 0.0261905 0.9946 0.9946690 0.0090290 0.88975 1.09983 2.1008e-01 -0.0296593 1.9359398 0.0004535
pH 11 3335 3.2358021 0.6707997 3.2100 3.2318471 0.3558240 0.60000 6.21000 5.6100e+00 0.1180977 1.7105032 0.0116157
Sulphates 12 3335 0.5277631 0.9037564 0.5000 0.5282203 0.3706500 -3.07000 4.18000 7.2500e+00 0.0039835 1.9231225 0.0156496
Alcohol 13 3335 10.5898321 3.7352246 10.4000 10.5818809 2.5204200 -4.20000 25.60000 2.9800e+01 0.0627340 1.6219438 0.0646798
LabelAppeal 14 3335 0.0134933 0.8885718 0.0000 0.0063694 1.4826000 -2.00000 2.00000 4.0000e+00 0.0454887 -0.2601115 0.0153867
AcidIndex 15 3335 7.7478261 1.3154203 8.0000 7.6212064 1.4826000 5.00000 17.00000 1.2000e+01 1.5066589 4.2794836 0.0227781
STARS 16 3335 1.9880060 0.8964906 2.0000 1.9142001 1.4826000 1.00000 4.00000 3.0000e+00 0.4954129 -0.6776145 0.0155238

Evaluating the model

The models will be evaluated by their mean squared error (MSE); lower is better.

Comparison of model MSE.
Model               MSE
Linear Model        2.650751
Poisson Model 2     6.454063
Poisson Model 1     7.043009
Negative Binomial   7.043008
Zero Inflation      2.748571
GLmulti             2.651212
ABS                 2.914447
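The Poisson and negative binomial figures are far larger than the others. The source does not show how the predictions were generated, so the cause is not known. One possibility worth ruling out is that predictions were taken on the link (log) scale, which is the default for predict() on a glm object, instead of the count scale. The sketch below illustrates the difference on simulated data; it is an assumption about a possible cause, not a reconstruction of the code behind this table.

```r
set.seed(621)
# Simulated count data (not the wine data) with a known log-linear relationship
n <- 2000
x <- rnorm(n)
y <- rpois(n, lambda = exp(1 + 0.3 * x))

fit <- glm(y ~ x, family = poisson)

mse <- function(actual, predicted) mean((actual - predicted)^2)

# predict.glm() defaults to type = "link", i.e. the log of the expected count
mse_link     <- mse(y, predict(fit))
# type = "response" returns predictions on the count scale
mse_response <- mse(y, predict(fit, type = "response"))

cat("MSE with link-scale predictions:    ", round(mse_link, 3), "\n")
cat("MSE with response-scale predictions:", round(mse_response, 3), "\n")
```

If the count models in the table were scored this way, recomputing their MSE with type = "response" would put them on the same footing as the linear models.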

The linear model and the GLmulti model have nearly identical MSE (2.6508 and 2.6512). Predictions from both models are summarized below:

Model 4

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.09173 2.36401 2.98761 3.05411 3.72062 6.50130

Model 5

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.06155 2.36607 2.98843 3.05409 3.72283 6.53702