Work in Progress
Cover Page
CUNY MSDS HW5 -
Nicholas Schettini
CUNY School of Professional Studies
Abstract
In this research assignment, we investigated data on the number of wine cases sold. The data has one response variable, TARGET. The explanatory variables are AcidIndex, Alcohol, Chlorides, CitricAcid, Density, FixedAcidity, FreeSulfurDioxide, LabelAppeal, ResidualSugar, STARS, Sulphates, TotalSulfurDioxide, VolatileAcidity, and pH. The data consists of 12,795 observations and 14 explanatory variables. The research had four stages: data exploration, data preparation, model building, and model selection. The data was visualized using multiple methods, including histograms and boxplots. The data was prepared by imputing NA values with the mice package in R. Different models were built using different approaches (for example, Poisson and zero-inflated models), and finally the best model was selected. The research shows that certain variables in the dataset, notably STARS, LabelAppeal, and AcidIndex, were better predictors than others.
Overview
In this homework assignment, you will explore, analyze and model a data set containing information on approximately 12,000 commercially available wines. The variables are mostly related to the chemical properties of the wine being sold. The response variable is the number of sample cases of wine that were purchased by wine distribution companies after sampling a wine. These cases would be used to provide tasting samples to restaurants and wine stores around the United States. The more sample cases purchased, the more likely is a wine to be sold at a high end restaurant. A large wine manufacturer is studying the data in order to predict the number of wine cases ordered based upon the wine characteristics. If the wine manufacturer can predict the number of cases, then that manufacturer will be able to adjust their wine offering to maximize sales.
Your objective is to build a count regression model to predict the number of cases of wine that will be sold given certain properties of the wine. HINT: Sometimes, the fact that a variable is missing is actually predictive of the target. You can only use the variables given to you (or variables that you derive from the variables provided). Below is a short description of the variables of interest in the data set:
Data Exploration
The summary below shows missing values in several variables of the wine dataset. TARGET, the number of wine cases sold, appears to be a discrete rather than a continuous variable.

| | vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se | na_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| INDEX | 1 | 12795 | 8069.9803048 | 4656.9051071 | 8110.00000 | 8071.0294031 | 5977.8432000 | 1.00000 | 16129.00000 | 1.6128e+04 | -0.0032496 | -1.2005027 | 41.1696565 | 0 |
| TARGET | 2 | 12795 | 3.0290739 | 1.9263682 | 3.00000 | 3.0538244 | 1.4826000 | 0.00000 | 8.00000 | 8.0000e+00 | -0.3263010 | -0.8772457 | 0.0170302 | 0 |
| FixedAcidity | 3 | 12795 | 7.0757171 | 6.3176435 | 6.90000 | 7.0736739 | 3.2617200 | -18.10000 | 34.40000 | 5.2500e+01 | -0.0225860 | 1.6749987 | 0.0558515 | 0 |
| VolatileAcidity | 4 | 12795 | 0.3241039 | 0.7840142 | 0.28000 | 0.3243890 | 0.4299540 | -2.79000 | 3.68000 | 6.4700e+00 | 0.0203800 | 1.8322106 | 0.0069311 | 0 |
| CitricAcid | 5 | 12795 | 0.3084127 | 0.8620798 | 0.31000 | 0.3102520 | 0.4151280 | -3.24000 | 3.86000 | 7.1000e+00 | -0.0503070 | 1.8379401 | 0.0076213 | 0 |
| ResidualSugar | 6 | 12179 | 5.4187331 | 33.7493790 | 3.90000 | 5.5800410 | 15.7155600 | -127.80000 | 141.15000 | 2.6895e+02 | -0.0531229 | 1.8846917 | 0.3058158 | 616 |
| Chlorides | 7 | 12157 | 0.0548225 | 0.3184673 | 0.04600 | 0.0540159 | 0.1349166 | -1.17100 | 1.35100 | 2.5220e+00 | 0.0304272 | 1.7886044 | 0.0028884 | 638 |
| FreeSulfurDioxide | 8 | 12148 | 30.8455713 | 148.7145577 | 30.00000 | 30.9334877 | 56.3388000 | -555.00000 | 623.00000 | 1.1780e+03 | 0.0063930 | 1.8364966 | 1.3492769 | 647 |
| TotalSulfurDioxide | 9 | 12113 | 120.7142326 | 231.9132105 | 123.00000 | 120.8895367 | 134.9166000 | -823.00000 | 1057.00000 | 1.8800e+03 | -0.0071794 | 1.6746665 | 2.1071703 | 682 |
| Density | 10 | 12795 | 0.9942027 | 0.0265376 | 0.99449 | 0.9942130 | 0.0093552 | 0.88809 | 1.09924 | 2.1115e-01 | -0.0186938 | 1.8999592 | 0.0002346 | 0 |
| pH | 11 | 12400 | 3.2076282 | 0.6796871 | 3.20000 | 3.2055706 | 0.3854760 | 0.48000 | 6.13000 | 5.6500e+00 | 0.0442880 | 1.6462681 | 0.0061038 | 395 |
| Sulphates | 12 | 11585 | 0.5271118 | 0.9321293 | 0.50000 | 0.5271453 | 0.4447800 | -3.13000 | 4.24000 | 7.3700e+00 | 0.0059119 | 1.7525655 | 0.0086602 | 1210 |
| Alcohol | 13 | 12142 | 10.4892363 | 3.7278190 | 10.40000 | 10.5018255 | 2.3721600 | -4.70000 | 26.50000 | 3.1200e+01 | -0.0307158 | 1.5394949 | 0.0338306 | 653 |
| LabelAppeal | 14 | 12795 | -0.0090660 | 0.8910892 | 0.00000 | -0.0099639 | 1.4826000 | -2.00000 | 2.00000 | 4.0000e+00 | 0.0084295 | -0.2622916 | 0.0078777 | 0 |
| AcidIndex | 15 | 12795 | 7.7727237 | 1.3239264 | 8.00000 | 7.6431572 | 1.4826000 | 4.00000 | 17.00000 | 1.3000e+01 | 1.6484959 | 5.1900925 | 0.0117043 | 0 |
| STARS | 16 | 9436 | 2.0417550 | 0.9025400 | 2.00000 | 1.9711258 | 1.4826000 | 1.00000 | 4.00000 | 3.0000e+00 | 0.4472353 | -0.6925343 | 0.0092912 | 3359 |
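
The `na_count` column above can be reproduced with base R. The following is a minimal sketch, assuming a data frame named `wine`; the toy values are made up for illustration and are not the real data.

```r
# Toy stand-in for the wine data; the values are made up for illustration.
wine <- data.frame(
  TARGET = c(3, 0, 4, 2, 0),
  pH     = c(3.2, NA, 3.1, 3.4, NA),
  STARS  = c(2, NA, NA, 1, NA)
)

# Number of missing values per column, largest first.
na_count <- sort(colSums(is.na(wine)), decreasing = TRUE)
print(na_count)

# Share of rows with at least one missing value.
share_incomplete <- mean(!complete.cases(wine))
print(share_incomplete)
```

Sorting the counts makes it easy to see which variable (STARS in the real data) drives most of the missingness.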
Visual Exploration
Boxplots
The boxplots below cover every variable in the dataset and show how the data is spread for each one.
The boxplots show
The target variable, number of cases, is shown below. The data shows a large number of zero values.
The distribution looks like a Poisson distribution, with a significant amount of zero values.
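One way to judge whether there are more zeros than a Poisson distribution would produce is to compare the observed share of zeros with the share a Poisson with the same mean implies, which is exp(-mean). With the TARGET mean of about 3.03 reported in the summary table, a plain Poisson would expect only about 4.8% zeros. The sketch below uses simulated counts (the vector `y` is hypothetical, not the wine data).

```r
set.seed(1)

# Simulated counts: a Poisson(3) outcome where roughly 20% of rows are forced to zero.
# These values are illustrative only and are not the wine data.
n <- 5000
y <- ifelse(runif(n) < 0.2, 0, rpois(n, lambda = 3))

observed_zero_share <- mean(y == 0)
expected_zero_share <- dpois(0, lambda = mean(y))  # equals exp(-mean(y))

print(c(observed = observed_zero_share, expected = expected_zero_share))

# For reference, the Poisson zero share implied by the reported TARGET mean.
print(dpois(0, lambda = 3.0290739))
```

An observed zero share well above the expected one is the usual motivation for trying a zero-inflated model.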
## Warning: Removed 3359 rows containing non-finite values (stat_count).
AcidIndex looks closer to a Poisson distribution, with a slight right skew. LabelAppeal and STARS seem to be more categorical.
## Warning: Removed 4841 rows containing non-finite values (stat_bin).
The other variables seem to be more normally distributed with high kurtosis.
Correlation
The correlation plot below shows how variables in the dataset are related to each other. Looking at the plot, we can see that certain variables are more related than others.
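For this project, it makes sense to break down the correlations by TARGET, since that is what we are trying to predict. Note that in the table below the entry equal to 1 belongs to LabelAppeal, so the values as printed are correlations with LabelAppeal rather than with TARGET.

| Variable | Correlation |
|---|---|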
| INDEX | 0.0314911 |
| TARGET | 0.4979465 |
| FixedAcidity | 0.0113760 |
| VolatileAcidity | -0.0202420 |
| CitricAcid | 0.0153316 |
| ResidualSugar | -0.0045793 |
| Chlorides | -0.0063870 |
| FreeSulfurDioxide | 0.0149601 |
| TotalSulfurDioxide | -0.0027237 |
| Density | -0.0180944 |
| pH | 0.0002182 |
| Sulphates | 0.0037687 |
| Alcohol | -0.0006449 |
| LabelAppeal | 1.0000000 |
| AcidIndex | 0.0103010 |
| STARS | 0.3188970 |
Very few variables appear correlated at all. STARS and LabelAppeal show a positive correlation with TARGET, while AcidIndex and TARGET have a small negative correlation.
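A per-target breakdown can be obtained by computing the full correlation matrix and keeping only the `TARGET` column (rather than, say, the `LabelAppeal` column). This is a minimal sketch on simulated data; the column names mirror the wine data, but the values and the data frame `wine` are assumptions for illustration.

```r
set.seed(2)

# Toy data with missing values; column names mirror the wine data but values are simulated.
n <- 200
wine <- data.frame(
  STARS       = sample(1:4, n, replace = TRUE),
  LabelAppeal = sample(-2:2, n, replace = TRUE),
  AcidIndex   = sample(5:12, n, replace = TRUE)
)
wine$TARGET <- rpois(n, lambda = exp(0.3 + 0.25 * wine$STARS - 0.05 * wine$AcidIndex))
wine$STARS[sample(n, 40)] <- NA

# Correlation matrix using all available pairs, then keep only the TARGET column.
cor_mat <- cor(wine, use = "pairwise.complete.obs")
cor_with_target <- sort(cor_mat[, "TARGET"], decreasing = TRUE)
print(round(cor_with_target, 3))
```

A quick sanity check is that the entry equal to 1 sits on the `TARGET` row; if it sits on another row, a different column was extracted.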
Missing Values
According to the graph, several variables have missing values. STARS has the most NA values. These missing values will be imputed later, during data preparation, using the MICE package.
Data Prep
Imputation of Missing (NA) values
The data exploration revealed multiple variables with numerous NA values. There are several ways to handle NA data: deleting the observations, deleting the variables, imputation with the mean/median/mode, or imputation with a prediction.
Imputation with the mean/median/mode is an easy way to fill in the missing values; however, it reduces the variance in the dataset and shrinks standard errors, which can invalidate hypothesis tests.
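The variance reduction is easy to see on simulated data. In this sketch, the variable and the 25% missingness rate are assumptions chosen for illustration.

```r
set.seed(3)

# Simulated variable with about 25% of values missing completely at random.
x_full <- rnorm(1000, mean = 10, sd = 4)
x_obs  <- x_full
x_obs[sample(1000, 250)] <- NA

# Mean imputation: every missing value is replaced by the observed mean.
x_mean_imp <- ifelse(is.na(x_obs), mean(x_obs, na.rm = TRUE), x_obs)

sd_observed <- sd(x_obs, na.rm = TRUE)
sd_imputed  <- sd(x_mean_imp)
print(c(observed = sd_observed, mean_imputed = sd_imputed))
```

The imputed values add observations but no spread, so the standard deviation after mean imputation is always smaller than that of the observed values.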
In this case, the data is imputed via prediction using the mice (Multivariate Imputation by Chained Equations) library with a random forest method.
Because the data has many missing values across multiple variables, the MICE algorithm takes some computing time.
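A call along the following lines would perform this kind of imputation. It is a sketch under stated assumptions, not the code used for this report: the toy data frame, the seed, `m = 1`, and the derived flag `STARS_missing` are all hypothetical. The flag illustrates the assignment's hint that missingness itself may be predictive; it was not part of the models fitted here.

```r
# Sketch only: needs the mice package plus a random forest backend
# (the "rf" method relies on an additional package such as ranger or randomForest).
set.seed(4)
wine <- data.frame(
  TARGET  = rpois(300, 3),
  Alcohol = rnorm(300, 10.5, 3.7),
  pH      = rnorm(300, 3.2, 0.7),
  STARS   = sample(1:4, 300, replace = TRUE)
)
wine$pH[sample(300, 30)]    <- NA
wine$STARS[sample(300, 80)] <- NA

# Record missingness before imputing, since the flag itself may be predictive.
wine$STARS_missing <- as.integer(is.na(wine$STARS))

if (requireNamespace("mice", quietly = TRUE)) {
  imputed <- tryCatch({
    imp <- mice::mice(wine, m = 1, method = "rf", seed = 123, printFlag = FALSE)
    mice::complete(imp)
  }, error = function(e) {
    message("Imputation skipped: ", conditionMessage(e))
    NULL
  })
  if (!is.null(imputed)) print(colSums(is.na(imputed)))
}
```

Setting a seed makes the imputed values reproducible between runs, which matters when models are compared on the imputed data.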
| | vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| TARGET | 1 | 12795 | 3.0290739 | 1.9263682 | 3.00000 | 3.0538244 | 1.4826000 | 0.00000 | 8.00000 | 8.00000 | -0.3263010 | -0.8772457 | 0.0170302 |
| FixedAcidity | 2 | 12795 | 7.0757171 | 6.3176435 | 6.90000 | 7.0736739 | 3.2617200 | -18.10000 | 34.40000 | 52.50000 | -0.0225860 | 1.6749987 | 0.0558515 |
| VolatileAcidity | 3 | 12795 | 0.3241039 | 0.7840142 | 0.28000 | 0.3243890 | 0.4299540 | -2.79000 | 3.68000 | 6.47000 | 0.0203800 | 1.8322106 | 0.0069311 |
| CitricAcid | 4 | 12795 | 0.3084127 | 0.8620798 | 0.31000 | 0.3102520 | 0.4151280 | -3.24000 | 3.86000 | 7.10000 | -0.0503070 | 1.8379401 | 0.0076213 |
| ResidualSugar | 5 | 12795 | 5.4560688 | 33.5479209 | 3.80000 | 5.5931328 | 15.5673000 | -127.80000 | 141.15000 | 268.95000 | -0.0418962 | 1.9063668 | 0.2965825 |
| Chlorides | 6 | 12795 | 0.0539703 | 0.3159624 | 0.04600 | 0.0533848 | 0.1275036 | -1.17100 | 1.35100 | 2.52200 | 0.0204250 | 1.8584778 | 0.0027933 |
| FreeSulfurDioxide | 7 | 12795 | 31.2710434 | 148.0337336 | 30.00000 | 31.4126209 | 53.3736000 | -555.00000 | 623.00000 | 1178.00000 | 0.0016334 | 1.8837818 | 1.3087013 |
| TotalSulfurDioxide | 8 | 12795 | 120.3816335 | 230.8142427 | 124.00000 | 120.7789880 | 133.4340000 | -823.00000 | 1057.00000 | 1880.00000 | -0.0177176 | 1.6983299 | 2.0405275 |
| Density | 9 | 12795 | 0.9942027 | 0.0265376 | 0.99449 | 0.9942130 | 0.0093552 | 0.88809 | 1.09924 | 0.21115 | -0.0186938 | 1.8999592 | 0.0002346 |
| pH | 10 | 12795 | 3.2073834 | 0.6769933 | 3.20000 | 3.2054889 | 0.3854760 | 0.48000 | 6.13000 | 5.65000 | 0.0426209 | 1.6611990 | 0.0059850 |
| Sulphates | 11 | 12795 | 0.5277061 | 0.9207721 | 0.50000 | 0.5272687 | 0.4003020 | -3.13000 | 4.24000 | 7.37000 | 0.0114220 | 1.8670775 | 0.0081401 |
| Alcohol | 12 | 12795 | 10.4818189 | 3.7032024 | 10.40000 | 10.4915128 | 2.3721600 | -4.70000 | 26.50000 | 31.20000 | -0.0199093 | 1.5761104 | 0.0327384 |
| LabelAppeal | 13 | 12795 | -0.0090660 | 0.8910892 | 0.00000 | -0.0099639 | 1.4826000 | -2.00000 | 2.00000 | 4.00000 | 0.0084295 | -0.2622916 | 0.0078777 |
| AcidIndex | 14 | 12795 | 7.7727237 | 1.3239264 | 8.00000 | 7.6431572 | 1.4826000 | 4.00000 | 17.00000 | 13.00000 | 1.6484959 | 5.1900925 | 0.0117043 |
| STARS | 15 | 12795 | 1.9802267 | 0.8855040 | 2.00000 | 1.9059295 | 1.4826000 | 1.00000 | 4.00000 | 3.00000 | 0.5180282 | -0.5978441 | 0.0078284 |
Absolute value of variables
Classmates have discussed taking the absolute value of the variables in the dataset, given the debate over the negative values that appear in several of them.
In this case, I apply an absolute value transformation to the data and refit the top performing model.
However, taking the absolute value appears to introduce a right skew in variables that would otherwise have been approximately normal.
A log transformation of the absolute values makes them look more normal, although it is unclear whether this introduces overfitting.
## Warning: Transformation introduced infinite values in continuous x-axis
## Warning: Removed 5804 rows containing non-finite values (stat_bin).
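Both effects described above, the right skew after taking absolute values and the non-finite values after a log transform, can be reproduced on simulated data. In this sketch the variable `x` and the hand-written `skewness` helper are assumptions for illustration.

```r
set.seed(5)

# Sample skewness computed by hand to avoid extra packages.
skewness <- function(x) {
  x <- x[is.finite(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Simulated variable centered near zero, similar in spirit to VolatileAcidity.
x <- rnorm(5000, mean = 0.3, sd = 0.8)

skew_raw <- skewness(x)
skew_abs <- skewness(abs(x))   # folding negatives onto positives adds a right tail
print(c(raw = skew_raw, abs = skew_abs))

# log() of zero is -Inf and log() of a negative number is NaN (with a warning),
# which is why plotting on a log scale drops rows as non-finite.
print(log(0))
print(suppressWarnings(log(-1)))
```

Folding a roughly symmetric variable at zero piles mass near zero and leaves a single long tail, which is the skew seen after the transformation.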
Build Models
Throughout this section, various models are fitted to determine which gives the best fit for predicting the number of wine cases sold. In this assignment, I try several model types, including linear, negative binomial, and Poisson models, as suggested by the homework instructions.
Model 1 - Poisson with imputed data
As per the homework videos, the Poisson distribution works well with count data.
##
## Call:
## glm(formula = TARGET ~ ., family = poisson, data = imputed)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.9784 -0.5298 0.2051 0.6296 2.5442
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.035e+00 1.956e-01 10.403 < 2e-16 ***
## FixedAcidity -2.602e-04 8.201e-04 -0.317 0.751045
## VolatileAcidity -5.162e-02 6.494e-03 -7.949 1.87e-15 ***
## CitricAcid 1.431e-02 5.891e-03 2.429 0.015134 *
## ResidualSugar 1.163e-04 1.517e-04 0.767 0.443001
## Chlorides -5.287e-02 1.615e-02 -3.272 0.001066 **
## FreeSulfurDioxide 1.405e-04 3.442e-05 4.081 4.48e-05 ***
## TotalSulfurDioxide 9.744e-05 2.215e-05 4.399 1.09e-05 ***
## Density -4.109e-01 1.921e-01 -2.139 0.032433 *
## pH -2.290e-02 7.550e-03 -3.033 0.002420 **
## Sulphates -1.939e-02 5.520e-03 -3.513 0.000443 ***
## Alcohol 5.155e-03 1.382e-03 3.731 0.000190 ***
## LabelAppeal 1.937e-01 6.022e-03 32.158 < 2e-16 ***
## AcidIndex -1.217e-01 4.463e-03 -27.259 < 2e-16 ***
## STARS 2.027e-01 5.788e-03 35.027 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 22861 on 12794 degrees of freedom
## Residual deviance: 18351 on 12780 degrees of freedom
## AIC: 50323
##
## Number of Fisher Scoring iterations: 5
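Because Poisson regression uses a log link, the coefficients are easier to read after exponentiating them. The sketch below uses three estimates and the deviance figures copied from the Model 1 output above; the choice of which coefficients to show is mine.

```r
# Selected estimates copied from the Model 1 summary above.
coefs <- c(STARS = 0.2027, LabelAppeal = 0.1937, AcidIndex = -0.1217)

# exp(coef) is the multiplicative change in the expected count
# for a one-unit increase in the predictor, holding the others fixed.
rate_ratios <- exp(coefs)
print(round(rate_ratios, 3))

# Rough overdispersion check: residual deviance divided by residual degrees of freedom.
# Values well above 1 hint at more variance than a Poisson model assumes.
dispersion_ratio <- 18351 / 12780
print(round(dispersion_ratio, 3))
```

Read this way, each additional star multiplies the expected number of cases by roughly 1.22, while each additional point of AcidIndex multiplies it by a factor below 1.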
Model 2 - Poisson without imputed data
##
## Call:
## glm(formula = TARGET ~ ., family = poisson, data = wine_train1)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.2158 -0.2734 0.0616 0.3732 1.6830
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.593e+00 2.506e-01 6.359 2.03e-10 ***
## FixedAcidity 3.293e-04 1.053e-03 0.313 0.75447
## VolatileAcidity -2.560e-02 8.353e-03 -3.065 0.00218 **
## CitricAcid -7.259e-04 7.575e-03 -0.096 0.92365
## ResidualSugar -6.141e-05 1.941e-04 -0.316 0.75165
## Chlorides -3.007e-02 2.056e-02 -1.463 0.14346
## FreeSulfurDioxide 6.734e-05 4.404e-05 1.529 0.12620
## TotalSulfurDioxide 2.081e-05 2.855e-05 0.729 0.46618
## Density -3.725e-01 2.462e-01 -1.513 0.13026
## pH -4.661e-03 9.598e-03 -0.486 0.62722
## Sulphates -5.164e-03 7.051e-03 -0.732 0.46398
## Alcohol 3.948e-03 1.771e-03 2.229 0.02579 *
## LabelAppeal 1.771e-01 7.954e-03 22.271 < 2e-16 ***
## AcidIndex -4.870e-02 5.903e-03 -8.251 < 2e-16 ***
## STARS 1.871e-01 7.487e-03 24.993 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 5844.1 on 6435 degrees of freedom
## Residual deviance: 4009.1 on 6421 degrees of freedom
## (6359 observations deleted due to missingness)
## AIC: 23172
##
## Number of Fisher Scoring iterations: 5
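The line "6359 observations deleted due to missingness" means Model 2 was fitted on complete cases only (12795 - 6359 = 6436 rows), so its AIC and deviance are not directly comparable with those of Model 1, which used all 12,795 rows. A small sketch of the same behavior on simulated data (the toy data frame is an assumption):

```r
set.seed(6)
wine <- data.frame(
  TARGET = rpois(100, 3),
  pH     = rnorm(100, 3.2, 0.7),
  STARS  = sample(1:4, 100, replace = TRUE)
)
wine$STARS[1:30] <- NA

fit <- glm(TARGET ~ ., family = poisson, data = wine)

# With R's default na.action (na.omit), glm() drops rows with any missing value.
n_used    <- nobs(fit)
n_dropped <- nrow(wine) - n_used
print(c(used = n_used, dropped = n_dropped))

# The same arithmetic for the report: 12795 rows minus 6359 deleted.
print(12795 - 6359)
```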
Model 3 - Negative Binomial
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
##
## Call:
## glm.nb(formula = TARGET ~ ., data = imputed, init.theta = 38344.98616,
## link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.9782 -0.5297 0.2051 0.6296 2.5442
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.035e+00 1.956e-01 10.402 < 2e-16 ***
## FixedAcidity -2.602e-04 8.202e-04 -0.317 0.751042
## VolatileAcidity -5.162e-02 6.494e-03 -7.949 1.88e-15 ***
## CitricAcid 1.431e-02 5.891e-03 2.429 0.015139 *
## ResidualSugar 1.163e-04 1.517e-04 0.767 0.442988
## Chlorides -5.287e-02 1.616e-02 -3.272 0.001066 **
## FreeSulfurDioxide 1.405e-04 3.442e-05 4.081 4.49e-05 ***
## TotalSulfurDioxide 9.744e-05 2.215e-05 4.398 1.09e-05 ***
## Density -4.109e-01 1.921e-01 -2.139 0.032438 *
## pH -2.290e-02 7.550e-03 -3.033 0.002420 **
## Sulphates -1.939e-02 5.521e-03 -3.513 0.000444 ***
## Alcohol 5.155e-03 1.382e-03 3.731 0.000191 ***
## LabelAppeal 1.937e-01 6.022e-03 32.156 < 2e-16 ***
## AcidIndex -1.217e-01 4.464e-03 -27.258 < 2e-16 ***
## STARS 2.027e-01 5.788e-03 35.025 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(38344.99) family taken to be 1)
##
## Null deviance: 22860 on 12794 degrees of freedom
## Residual deviance: 18350 on 12780 degrees of freedom
## AIC: 50325
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 38345
## Std. Err.: 59918
## Warning while fitting theta: iteration limit reached
##
## 2 x log-likelihood: -50293.1
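The very large theta explains why the negative binomial coefficients are almost identical to those of Model 1. In the parameterization used by `glm.nb`, the variance is mu + mu^2/theta, so as theta grows the extra term vanishes and the model approaches a Poisson. A short worked calculation using the TARGET mean and the fitted theta reported above (the contrasting value `theta = 2` is hypothetical):

```r
# Values taken from the outputs above: the TARGET mean and the fitted theta.
mu    <- 3.0290739
theta <- 38345

# Negative binomial variance under the MASS::glm.nb parameterisation.
nb_variance      <- mu + mu^2 / theta
poisson_variance <- mu

extra_share <- (nb_variance - poisson_variance) / poisson_variance
print(c(nb = nb_variance, poisson = poisson_variance, extra_share = extra_share))

# For contrast, a small theta (hypothetical value) would add a lot of variance.
print(mu + mu^2 / 2)
```

With the fitted theta, the negative binomial adds less than 0.01% to the Poisson variance, so the two fits are effectively the same model.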
Model 4 - Linear Model
##
## Call:
## lm(formula = TARGET ~ ., data = imputed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.1455 -0.7398 0.3661 1.1045 4.4181
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.401e+00 5.507e-01 9.807 < 2e-16 ***
## FixedAcidity -4.122e-04 2.312e-03 -0.178 0.858533
## VolatileAcidity -1.580e-01 1.836e-02 -8.605 < 2e-16 ***
## CitricAcid 4.299e-02 1.671e-02 2.573 0.010107 *
## ResidualSugar 3.892e-04 4.287e-04 0.908 0.363882
## Chlorides -1.675e-01 4.551e-02 -3.679 0.000235 ***
## FreeSulfurDioxide 4.209e-04 9.721e-05 4.330 1.51e-05 ***
## TotalSulfurDioxide 2.823e-04 6.236e-05 4.526 6.06e-06 ***
## Density -1.159e+00 5.422e-01 -2.137 0.032591 *
## pH -5.957e-02 2.126e-02 -2.801 0.005096 **
## Sulphates -5.669e-02 1.562e-02 -3.629 0.000286 ***
## Alcohol 1.840e-02 3.892e-03 4.729 2.28e-06 ***
## LabelAppeal 5.868e-01 1.689e-02 34.752 < 2e-16 ***
## AcidIndex -3.218e-01 1.117e-02 -28.817 < 2e-16 ***
## STARS 6.645e-01 1.708e-02 38.913 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.625 on 12780 degrees of freedom
## Multiple R-squared: 0.2895, Adjusted R-squared: 0.2887
## F-statistic: 372 on 14 and 12780 DF, p-value: < 2.2e-16
Model 5 - Zero inflation
## Classes and Methods for R developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University
## Simon Jackman
## hurdle and zeroinfl functions by Achim Zeileis
##
## Call:
## zeroinfl(formula = TARGET ~ . | STARS, data = imputed, dist = "negbin")
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## -2.2585 -0.3095 0.1807 0.5111 2.2104
##
## Count model coefficients (negbin with log link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.647e+00 2.060e-01 7.998 1.27e-15 ***
## FixedAcidity 1.766e-04 8.540e-04 0.207 0.83613
## VolatileAcidity -1.959e-02 6.848e-03 -2.861 0.00423 **
## CitricAcid 2.508e-03 6.121e-03 0.410 0.68203
## ResidualSugar -3.782e-05 1.580e-04 -0.239 0.81089
## Chlorides -2.587e-02 1.689e-02 -1.532 0.12561
## FreeSulfurDioxide 4.816e-05 3.521e-05 1.368 0.17136
## TotalSulfurDioxide -6.033e-06 2.243e-05 -0.269 0.78793
## Density -3.043e-01 2.016e-01 -1.509 0.13131
## pH 1.714e-03 7.903e-03 0.217 0.82831
## Sulphates -3.187e-03 5.788e-03 -0.550 0.58198
## Alcohol 6.822e-03 1.434e-03 4.758 1.96e-06 ***
## LabelAppeal 2.423e-01 6.375e-03 38.004 < 2e-16 ***
## AcidIndex -4.319e-02 5.403e-03 -7.994 1.31e-15 ***
## STARS 9.568e-02 6.313e-03 15.155 < 2e-16 ***
## Log(theta) 1.798e+01 1.981e+00 9.080 < 2e-16 ***
##
## Zero-inflation model coefficients (binomial with logit link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.33119 0.06228 -5.317 1.05e-07 ***
## STARS -0.62932 0.03362 -18.720 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Theta = 64638340.4478
## Number of iterations in BFGS optimization: 65
## Log-likelihood: -2.296e+04 on 18 Df
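The zero-inflation part of the output is a logistic model for the probability that an observation belongs to the "always zero" group. Plugging the two reported coefficients into the inverse logit gives that probability for each STARS value. This is a worked calculation from the output above rather than a refit of the model.

```r
# Zero-inflation coefficients copied from the Model 5 output above.
zi_intercept <- -0.33119
zi_stars     <- -0.62932

stars <- 1:4

# plogis() is the inverse logit: it maps the linear predictor to a probability
# of belonging to the "always zero" group.
p_structural_zero <- plogis(zi_intercept + zi_stars * stars)
names(p_structural_zero) <- paste0("STARS=", stars)
print(round(p_structural_zero, 3))
```

The probability of a structural zero falls from about 28% for a one-star wine to roughly 5% for a four-star wine. As with Model 3, the very large fitted theta suggests the count part behaves essentially like a Poisson.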
Model 6 - glmulti Package
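The glmulti package is an “automated model selection and model averaging” tool. It generates all possible models “with the specified response and explanatory variables” and is used here to find the “best” model. The selected formula is refit below with glm() and its default Gaussian family, so Model 6 is effectively a linear model without FixedAcidity and ResidualSugar.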
glmmodel <- glm(imputed$TARGET ~ 1 + VolatileAcidity + CitricAcid + Chlorides +
FreeSulfurDioxide + TotalSulfurDioxide + Density + pH + Sulphates +
Alcohol + LabelAppeal + AcidIndex + STARS, data = imputed)
summary(glmmodel)
##
## Call:
## glm(formula = imputed$TARGET ~ 1 + VolatileAcidity + CitricAcid +
## Chlorides + FreeSulfurDioxide + TotalSulfurDioxide + Density +
## pH + Sulphates + Alcohol + LabelAppeal + AcidIndex + STARS,
## data = imputed)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -6.1320 -0.7369 0.3636 1.1058 4.4248
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.398e+00 5.507e-01 9.803 < 2e-16 ***
## VolatileAcidity -1.581e-01 1.836e-02 -8.613 < 2e-16 ***
## CitricAcid 4.285e-02 1.671e-02 2.564 0.010348 *
## Chlorides -1.677e-01 4.551e-02 -3.684 0.000231 ***
## FreeSulfurDioxide 4.223e-04 9.718e-05 4.345 1.40e-05 ***
## TotalSulfurDioxide 2.835e-04 6.234e-05 4.548 5.46e-06 ***
## Density -1.154e+00 5.422e-01 -2.129 0.033291 *
## pH -5.937e-02 2.126e-02 -2.793 0.005238 **
## Sulphates -5.693e-02 1.561e-02 -3.647 0.000267 ***
## Alcohol 1.833e-02 3.891e-03 4.710 2.50e-06 ***
## LabelAppeal 5.869e-01 1.688e-02 34.757 < 2e-16 ***
## AcidIndex -3.223e-01 1.100e-02 -29.308 < 2e-16 ***
## STARS 6.647e-01 1.707e-02 38.935 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 2.639244)
##
## Null deviance: 47477 on 12794 degrees of freedom
## Residual deviance: 33735 on 12782 degrees of freedom
## AIC: 48743
##
## Number of Fisher Scoring iterations: 2
glmmodelabs <- glm(absdata$TARGET ~ 1 + VolatileAcidity + CitricAcid + Chlorides +
FreeSulfurDioxide + TotalSulfurDioxide + Density + pH + Sulphates +
Alcohol + LabelAppeal + AcidIndex + STARS, data = absdata)
summary(glmmodelabs)
##
## Call:
## glm(formula = absdata$TARGET ~ 1 + VolatileAcidity + CitricAcid +
## Chlorides + FreeSulfurDioxide + TotalSulfurDioxide + Density +
## pH + Sulphates + Alcohol + LabelAppeal + AcidIndex + STARS,
## data = absdata)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -5.3054 -1.1095 0.2816 1.1696 5.8865
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.113e+00 5.782e-01 8.842 < 2e-16 ***
## VolatileAcidity -1.785e-01 2.716e-02 -6.571 5.18e-11 ***
## CitricAcid 6.520e-02 2.488e-02 2.621 0.00879 **
## Chlorides -1.562e-01 6.468e-02 -2.415 0.01577 *
## FreeSulfurDioxide 3.176e-04 1.397e-04 2.274 0.02297 *
## TotalSulfurDioxide 2.708e-04 9.312e-05 2.908 0.00364 **
## Density -1.301e+00 5.684e-01 -2.288 0.02215 *
## pH -5.466e-02 2.230e-02 -2.451 0.01425 *
## Sulphates -6.526e-02 2.317e-02 -2.816 0.00486 **
## Alcohol 1.808e-02 4.187e-03 4.318 1.59e-05 ***
## LabelAppeal -2.969e-02 2.425e-02 -1.224 0.22084
## AcidIndex -3.065e-01 1.148e-02 -26.692 < 2e-16 ***
## STARS 8.412e-01 1.710e-02 49.179 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 2.902446)
##
## Null deviance: 47477 on 12794 degrees of freedom
## Residual deviance: 37099 on 12782 degrees of freedom
## AIC: 49959
##
## Number of Fisher Scoring iterations: 2
Select Models
Predictions
As with the training data, the evaluation data needs some prep work: columns were removed, and NA values were imputed using the MICE random forest method.
## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning
## Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning
## -Inf
| | vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| IN | 1 | 3335 | 8048.3109445 | 4655.4790369 | 7906.0000 | 8044.2832522 | 5960.0520000 | 3.00000 | 16130.00000 | 1.6127e+04 | 0.0124697 | -1.2000392 | 80.6151110 |
| TARGET* | 2 | 0 | NaN | NA | NA | NaN | NA | Inf | -Inf | -Inf | NA | NA | NA |
| FixedAcidity | 3 | 3335 | 6.8638081 | 6.3184313 | 6.9000 | 6.9142750 | 2.8169400 | -18.20000 | 33.50000 | 5.1700e+01 | -0.1172460 | 2.0399637 | 0.1094111 |
| VolatileAcidity | 4 | 3335 | 0.3102714 | 0.8068341 | 0.2800 | 0.3129730 | 0.4596060 | -2.83000 | 3.61000 | 6.4400e+00 | -0.0437301 | 1.6171958 | 0.0139713 |
| CitricAcid | 5 | 3335 | 0.3124288 | 0.8709938 | 0.3100 | 0.3110978 | 0.4447800 | -3.12000 | 3.76000 | 6.8800e+00 | -0.0284898 | 1.6564422 | 0.0150823 |
| ResidualSugar | 6 | 3335 | 5.1860570 | 34.1602735 | 3.5000 | 5.3186774 | 16.9016400 | -128.30000 | 145.40000 | 2.7370e+02 | -0.0492818 | 2.0015002 | 0.5915254 |
| Chlorides | 7 | 3335 | 0.0593979 | 0.3135205 | 0.0460 | 0.0607958 | 0.1156428 | -1.15000 | 1.26300 | 2.4130e+00 | -0.0455098 | 1.7219311 | 0.0054290 |
| FreeSulfurDioxide | 8 | 3335 | 33.9872564 | 148.8929259 | 29.0000 | 33.2615212 | 57.0801000 | -563.00000 | 617.00000 | 1.1800e+03 | 0.0730972 | 1.8678963 | 2.5782566 |
| TotalSulfurDioxide | 9 | 3335 | 123.4229385 | 224.5781463 | 124.0000 | 124.0080555 | 136.3992000 | -769.00000 | 1004.00000 | 1.7730e+03 | -0.0437972 | 1.4893654 | 3.8888355 |
| Density | 10 | 3335 | 0.9946698 | 0.0261905 | 0.9946 | 0.9946690 | 0.0090290 | 0.88975 | 1.09983 | 2.1008e-01 | -0.0296593 | 1.9359398 | 0.0004535 |
| pH | 11 | 3335 | 3.2342819 | 0.6740613 | 3.2100 | 3.2306669 | 0.3558240 | 0.60000 | 6.21000 | 5.6100e+00 | 0.1100501 | 1.7179640 | 0.0116722 |
| Sulphates | 12 | 3335 | 0.5409265 | 0.8949812 | 0.5000 | 0.5413713 | 0.3706500 | -3.07000 | 4.18000 | 7.2500e+00 | -0.0029721 | 1.8453436 | 0.0154977 |
| Alcohol | 13 | 3335 | 10.6136552 | 3.7589939 | 10.4000 | 10.6018946 | 2.5204200 | -4.20000 | 25.60000 | 2.9800e+01 | 0.0820618 | 1.5780691 | 0.0650914 |
| LabelAppeal | 14 | 3335 | 0.0134933 | 0.8885718 | 0.0000 | 0.0063694 | 1.4826000 | -2.00000 | 2.00000 | 4.0000e+00 | 0.0454887 | -0.2601115 | 0.0153867 |
| AcidIndex | 15 | 3335 | 7.7478261 | 1.3154203 | 8.0000 | 7.6212064 | 1.4826000 | 5.00000 | 17.00000 | 1.2000e+01 | 1.5066589 | 4.2794836 | 0.0227781 |
| STARS | 16 | 3335 | 1.9985007 | 0.8933858 | 2.0000 | 1.9280629 | 1.4826000 | 1.00000 | 4.00000 | 3.0000e+00 | 0.4747543 | -0.6880260 | 0.0154700 |
Evaluating the model
The models are evaluated by comparing their mean squared error (MSE).
Comparison of model MSE:

| Linear Model | Poisson Model 2 | Poisson Model 1 | Negative BinomMod | Zero Inflation | GLmulti | ABS |
|---|---|---|---|---|---|---|
| 2.636385 | 6.455605 | 7.038349 | 7.038348 | 2.731825 | 2.636562 | 2.899497 |
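
When computing these errors for count models, the prediction scale matters: `predict()` on a `glm` object returns values on the link (log) scale unless `type = "response"` is given. The Poisson and negative binomial errors in the table are roughly twice the variance of TARGET (1.926^2, about 3.71), which is the pattern one would expect if predictions were left on the log scale. That is only a guess about the original code. The sketch below shows the effect on simulated data; the `mse` helper and all variable names are assumptions.

```r
set.seed(7)

# Simulated count data; the variable names are illustrative, not the wine data.
n <- 2000
d <- data.frame(x = rnorm(n))
d$y <- rpois(n, lambda = exp(1 + 0.3 * d$x))

fit <- glm(y ~ x, family = poisson, data = d)

mse <- function(actual, predicted) mean((actual - predicted)^2)

# Default predict() output for a glm is on the link (log) scale.
mse_link     <- mse(d$y, predict(fit))
# type = "response" returns expected counts, which is the scale of y.
mse_response <- mse(d$y, predict(fit, type = "response"))

print(c(link_scale = mse_link, response_scale = mse_response))
```

If the link-scale error is much larger than the response-scale error, as here, comparing it against a linear model's MSE would unfairly penalize the count model.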
The linear model and the glmulti model have nearly identical MSE. Predictions from both models are summarized below:
Model 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.9126 2.3506 3.0063 3.0654 3.7495 6.5648
Model 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.9137 2.3497 3.0036 3.0655 3.7519 6.5873
References
All subset regression with leaps, bestglm, glmulti, and meifly. (n.d.). Retrieved from https://rstudio-pubs-static.s3.amazonaws.com/2897_9220b21cfc0c43a396ff9abf122bb351.html
Model selection and multimodel inference made easy. (n.d.). Retrieved from https://cran.r-project.org/web/packages/glmulti/glmulti.pdf
Best subset model selection with R. (n.d.). Retrieved from http://jadianes.me/best-subset-model-selection-with-R
Zero-inflated Poisson regression | R data analysis examples. (n.d.). Retrieved from https://stats.idre.ucla.edu/r/dae/zip/