Cover Page

CUNY MSDS HW5

Nicholas Schettini

CUNY School of Professional Studies

Abstract

In this research assignment, we investigated data on the number of wine cases sold. The data contains one response variable, TARGET. The explanatory variables in this dataset are: AcidIndex, Alcohol, Chlorides, CitricAcid, Density, FixedAcidity, FreeSulfurDioxide, LabelAppeal, ResidualSugar, STARS, Sulphates, TotalSulfurDioxide, VolatileAcidity, and pH. The data consists of 12,795 observations and 14 explanatory variables. The research was organized into four stages: data exploration, data preparation, model building, and model selection. The data was visualized using multiple methods, including histograms and boxplots. The data was prepared by imputing NA values with the mice package in R. Different models were created based on different approaches (for example, Poisson and zero-inflated models), and finally the best model was selected. The research shows that certain variables in the dataset were better predictors than others.

Overview

In this homework assignment, you will explore, analyze and model a data set containing information on approximately 12,000 commercially available wines. The variables are mostly related to the chemical properties of the wine being sold. The response variable is the number of sample cases of wine that were purchased by wine distribution companies after sampling a wine. These cases would be used to provide tasting samples to restaurants and wine stores around the United States. The more sample cases purchased, the more likely is a wine to be sold at a high end restaurant. A large wine manufacturer is studying the data in order to predict the number of wine cases ordered based upon the wine characteristics. If the wine manufacturer can predict the number of cases, then that manufacturer will be able to adjust their wine offering to maximize sales.

Your objective is to build a count regression model to predict the number of cases of wine that will be sold given certain properties of the wine. HINT: Sometimes, the fact that a variable is missing is actually predictive of the target. You can only use the variables given to you (or variables that you derive from the variables provided). Below is a short description of the variables of interest in the data set:

Data Exploration

The summary below shows missing values in eight of the 16 variables in the wine dataset. The TARGET variable, the number of wine cases sold, is discrete rather than continuous.

	vars	n	mean	sd	median	trimmed	mad	min	max	range	skew	kurtosis	se	na_count
INDEX	1	12795	8069.9803048	4656.9051071	8110.00000	8071.0294031	5977.8432000	1.00000	16129.00000	1.6128e+04	-0.0032496	-1.2005027	41.1696565	0
TARGET	2	12795	3.0290739	1.9263682	3.00000	3.0538244	1.4826000	0.00000	8.00000	8.0000e+00	-0.3263010	-0.8772457	0.0170302	0
FixedAcidity	3	12795	7.0757171	6.3176435	6.90000	7.0736739	3.2617200	-18.10000	34.40000	5.2500e+01	-0.0225860	1.6749987	0.0558515	0
VolatileAcidity	4	12795	0.3241039	0.7840142	0.28000	0.3243890	0.4299540	-2.79000	3.68000	6.4700e+00	0.0203800	1.8322106	0.0069311	0
CitricAcid	5	12795	0.3084127	0.8620798	0.31000	0.3102520	0.4151280	-3.24000	3.86000	7.1000e+00	-0.0503070	1.8379401	0.0076213	0
ResidualSugar	6	12179	5.4187331	33.7493790	3.90000	5.5800410	15.7155600	-127.80000	141.15000	2.6895e+02	-0.0531229	1.8846917	0.3058158	616
Chlorides	7	12157	0.0548225	0.3184673	0.04600	0.0540159	0.1349166	-1.17100	1.35100	2.5220e+00	0.0304272	1.7886044	0.0028884	638
FreeSulfurDioxide	8	12148	30.8455713	148.7145577	30.00000	30.9334877	56.3388000	-555.00000	623.00000	1.1780e+03	0.0063930	1.8364966	1.3492769	647
TotalSulfurDioxide	9	12113	120.7142326	231.9132105	123.00000	120.8895367	134.9166000	-823.00000	1057.00000	1.8800e+03	-0.0071794	1.6746665	2.1071703	682
Density	10	12795	0.9942027	0.0265376	0.99449	0.9942130	0.0093552	0.88809	1.09924	2.1115e-01	-0.0186938	1.8999592	0.0002346	0
pH	11	12400	3.2076282	0.6796871	3.20000	3.2055706	0.3854760	0.48000	6.13000	5.6500e+00	0.0442880	1.6462681	0.0061038	395
Sulphates	12	11585	0.5271118	0.9321293	0.50000	0.5271453	0.4447800	-3.13000	4.24000	7.3700e+00	0.0059119	1.7525655	0.0086602	1210
Alcohol	13	12142	10.4892363	3.7278190	10.40000	10.5018255	2.3721600	-4.70000	26.50000	3.1200e+01	-0.0307158	1.5394949	0.0338306	653
LabelAppeal	14	12795	-0.0090660	0.8910892	0.00000	-0.0099639	1.4826000	-2.00000	2.00000	4.0000e+00	0.0084295	-0.2622916	0.0078777	0
AcidIndex	15	12795	7.7727237	1.3239264	8.00000	7.6431572	1.4826000	4.00000	17.00000	1.3000e+01	1.6484959	5.1900925	0.0117043	0
STARS	16	9436	2.0417550	0.9025400	2.00000	1.9711258	1.4826000	1.00000	4.00000	3.0000e+00	0.4472353	-0.6925343	0.0092912	3359
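A summary with a per-variable NA count, like the one above, can be assembled in base R. This is only a minimal sketch: the small `wine` data frame is a hypothetical stand-in for the training data, and the report's own table appears to come from a richer summary function.

```r
# Hypothetical miniature stand-in for the wine training data.
wine <- data.frame(
  TARGET = c(3, 0, 4, 2, 0),
  pH     = c(3.2, NA, 3.4, 3.1, 3.3),
  STARS  = c(2, NA, 3, NA, 1)
)

# One row per variable: non-missing count, mean, sd, and number of NAs.
summary_tbl <- data.frame(
  n        = colSums(!is.na(wine)),
  mean     = sapply(wine, mean, na.rm = TRUE),
  sd       = sapply(wine, sd, na.rm = TRUE),
  na_count = colSums(is.na(wine))
)
print(summary_tbl)
```

Keeping `n` and `na_count` side by side makes it easy to confirm that the two always add up to the number of rows.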

Visual Exploration

Boxplots

The boxplots below show all of the variables in the dataset. This visualization helps show how the data is spread for each variable.

The boxplots show

The target variable, number of cases, is shown below. The data shows a large number of zero values.

The distribution looks like a Poisson distribution, with a significant amount of zero values.
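One way to quantify "a significant amount of zero values" is to compare the observed share of zeros with the share a plain Poisson distribution would predict at the same mean. The sketch below uses the TARGET mean from the summary table; the helper `zero_excess()` and the simulated counts are hypothetical and only illustrate the comparison.

```r
# Mean of TARGET taken from the summary table in this report.
target_mean <- 3.0290739

# Under a plain Poisson model, P(Y = 0) = exp(-lambda).
expected_zero_share <- dpois(0, lambda = target_mean)
print(expected_zero_share)

# Hypothetical helper: observed share of zeros vs. the Poisson expectation.
zero_excess <- function(y) {
  c(observed = mean(y == 0), expected = dpois(0, lambda = mean(y)))
}

# Simulated counts with extra zeros, used only to show the helper in action.
set.seed(621)
y <- ifelse(rbinom(5000, 1, 0.2) == 1, 0, rpois(5000, 4))
print(zero_excess(y))
```

If the observed share is far above the expected share, a zero-inflated or hurdle model is worth considering alongside the plain Poisson model.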

## Warning: Removed 3359 rows containing non-finite values (stat_count).

AcidIndex looks more like a Poisson distribution, with a slight right skew. LabelAppeal and STARS seem to be more categorical.

## Warning: Removed 4841 rows containing non-finite values (stat_bin).

The other variables seem to be more normally distributed with high kurtosis.

Correlation

The correlation plot below shows how variables in the dataset are related to each other. Looking at the plot, we can see that certain variables are more related than others.

For this project, it makes sense to break down the correlation by target - since that’s what we’re trying to predict.

	Correlation with LabelAppeal
INDEX	0.0314911
TARGET	0.4979465
FixedAcidity	0.0113760
VolatileAcidity	-0.0202420
CitricAcid	0.0153316
ResidualSugar	-0.0045793
Chlorides	-0.0063870
FreeSulfurDioxide	0.0149601
TotalSulfurDioxide	-0.0027237
Density	-0.0180944
pH	0.0002182
Sulphates	0.0037687
Alcohol	-0.0006449
LabelAppeal	1.0000000
AcidIndex	0.0103010
STARS	0.3188970

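Looking at the correlations, very few variables appear correlated at all. Note that the column above shows LabelAppeal at 1.0, so it lists each variable's correlation with LabelAppeal rather than with TARGET; in it, only TARGET (about 0.50) and STARS (about 0.32) show a noticeable positive correlation. STARS and LabelAppeal have a small positive correlation with TARGET, while AcidIndex and TARGET have a small negative correlation.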
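To break the correlation matrix down by the response, the TARGET column can be selected by name, which avoids picking up a neighbouring column by position. This is a minimal sketch on simulated data; the data frame and its coefficients are assumptions, not the wine data.

```r
# Simulated stand-in for the wine data (values are made up).
set.seed(1)
n <- 200
wine <- data.frame(
  LabelAppeal = sample(-2:2, n, replace = TRUE),
  STARS       = sample(c(1:4, NA), n, replace = TRUE),
  AcidIndex   = sample(6:10, n, replace = TRUE)
)
wine$TARGET <- pmax(0, round(3 + wine$LabelAppeal -
                             0.3 * (wine$AcidIndex - 8) + rnorm(n)))

# Pairwise-complete correlations so that NA rows in STARS are not dropped everywhere.
cor_mat <- cor(wine, use = "pairwise.complete.obs")

# Select the TARGET column by name and sort from strongest positive to negative.
cor_with_target <- sort(cor_mat[, "TARGET"], decreasing = TRUE)
print(cor_with_target)
```

The entry equal to 1 in the printed vector identifies which variable the column belongs to, which is a quick sanity check.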

Missing Values

According to the graph, multiple variables have missing values. The STARS variable has the most NA values. These missing values will be imputed later, during data preparation, using the MICE package.
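The assignment hint notes that missingness itself can be predictive of the target. A minimal sketch of how such a flag could be derived and checked is shown here; the flag name `STARS_missing` and the toy values are hypothetical, and this report does not use such a flag in its models.

```r
# Toy data: made-up values, only to illustrate the derived flag.
wine <- data.frame(
  TARGET = c(0, 0, 3, 4, 0, 5, 2, 0),
  STARS  = c(NA, NA, 2, 3, NA, 4, 1, 1)
)

# 1 when STARS is missing, 0 otherwise; create this before any imputation.
wine$STARS_missing <- as.integer(is.na(wine$STARS))

# Compare the average TARGET between rows with and without a STARS rating.
mean_by_flag <- tapply(wine$TARGET, wine$STARS_missing, mean)
print(mean_by_flag)
```

A large gap between the two group means would suggest keeping the flag as an extra predictor.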

Data Prep

Imputation of Missing (NA) values

The data exploration revealed multiple variables with numerous NA values. There are multiple ways to handle NA data: deleting the observations, deleting the variables, imputation with the mean/median/mode, or imputation with a prediction.

Imputation with the mean/median/mode is an easy way to fill in the missing values; however, it reduces the variance in the dataset and shrinks standard errors, which can invalidate hypothesis tests.

In this case, data will be imputed via prediction using the MICE (Multivariate Imputation by Chained Equations) library with a random forest prediction method.

Since the data has many missing values across multiple variables, the MICE algorithm takes some computing time.

	vars	n	mean	sd	median	trimmed	mad	min	max	range	skew	kurtosis	se
TARGET	1	12795	3.0290739	1.9263682	3.00000	3.0538244	1.4826000	0.00000	8.00000	8.00000	-0.3263010	-0.8772457	0.0170302
FixedAcidity	2	12795	7.0757171	6.3176435	6.90000	7.0736739	3.2617200	-18.10000	34.40000	52.50000	-0.0225860	1.6749987	0.0558515
VolatileAcidity	3	12795	0.3241039	0.7840142	0.28000	0.3243890	0.4299540	-2.79000	3.68000	6.47000	0.0203800	1.8322106	0.0069311
CitricAcid	4	12795	0.3084127	0.8620798	0.31000	0.3102520	0.4151280	-3.24000	3.86000	7.10000	-0.0503070	1.8379401	0.0076213
ResidualSugar	5	12795	5.4560688	33.5479209	3.80000	5.5931328	15.5673000	-127.80000	141.15000	268.95000	-0.0418962	1.9063668	0.2965825
Chlorides	6	12795	0.0539703	0.3159624	0.04600	0.0533848	0.1275036	-1.17100	1.35100	2.52200	0.0204250	1.8584778	0.0027933
FreeSulfurDioxide	7	12795	31.2710434	148.0337336	30.00000	31.4126209	53.3736000	-555.00000	623.00000	1178.00000	0.0016334	1.8837818	1.3087013
TotalSulfurDioxide	8	12795	120.3816335	230.8142427	124.00000	120.7789880	133.4340000	-823.00000	1057.00000	1880.00000	-0.0177176	1.6983299	2.0405275
Density	9	12795	0.9942027	0.0265376	0.99449	0.9942130	0.0093552	0.88809	1.09924	0.21115	-0.0186938	1.8999592	0.0002346
pH	10	12795	3.2073834	0.6769933	3.20000	3.2054889	0.3854760	0.48000	6.13000	5.65000	0.0426209	1.6611990	0.0059850
Sulphates	11	12795	0.5277061	0.9207721	0.50000	0.5272687	0.4003020	-3.13000	4.24000	7.37000	0.0114220	1.8670775	0.0081401
Alcohol	12	12795	10.4818189	3.7032024	10.40000	10.4915128	2.3721600	-4.70000	26.50000	31.20000	-0.0199093	1.5761104	0.0327384
LabelAppeal	13	12795	-0.0090660	0.8910892	0.00000	-0.0099639	1.4826000	-2.00000	2.00000	4.00000	0.0084295	-0.2622916	0.0078777
AcidIndex	14	12795	7.7727237	1.3239264	8.00000	7.6431572	1.4826000	4.00000	17.00000	13.00000	1.6484959	5.1900925	0.0117043
STARS	15	12795	1.9802267	0.8855040	2.00000	1.9059295	1.4826000	1.00000	4.00000	3.00000	0.5180282	-0.5978441	0.0078284
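The imputation step described in this section can be sketched as follows. The simulated data frame, the single imputation (`m = 1`), the iteration count, and the seed are all assumptions rather than the settings used for this report. The "rf" method also needs a random forest backend installed (ranger or randomForest, depending on the mice version).

```r
library(mice)

# Simulated stand-in for the training data, with NAs injected into two columns.
set.seed(621)
n <- 300
wine <- data.frame(
  TARGET  = rpois(n, 3),
  Alcohol = rnorm(n, 10.5, 3.7),
  pH      = rnorm(n, 3.2, 0.7),
  STARS   = sample(1:4, n, replace = TRUE)
)
wine$pH[sample(n, 30)]    <- NA
wine$STARS[sample(n, 80)] <- NA

# Random forest imputation; columns without NAs are left untouched.
imp <- mice(wine, m = 1, method = "rf", maxit = 5, seed = 621, printFlag = FALSE)

# Extract the completed data set and confirm that no NAs remain.
imputed <- mice::complete(imp, 1)
print(colSums(is.na(imputed)))
```

Fixing the seed makes the imputed values reproducible, which matters when the same imputed data feeds several models.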

Absolute value of variables

Classmates have debated whether to take the absolute value of the variables in the dataset, since several variables contain negative values that are hard to interpret (for example, negative acidity or alcohol content).

In this case, I will apply an absolute-value (ABS) transformation and refit the top-performing model on the transformed data.

It seems, however, that taking the absolute value introduces a right skew in variables that would otherwise have been approximately normal.

If the absolute values are then log-transformed, the distributions seem to become ‘more’ normal, but this might be introducing overfitting into the data. The log transformation also produces infinite values wherever a variable is exactly zero, which is what the warnings below report.

## Warning: Transformation introduced infinite values in continuous x-axis

## Warning: Removed 5804 rows containing non-finite values (stat_bin).
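The infinite values in these warnings come from taking the log of zero. The short sketch below reproduces the effect on a few made-up values and shows `log1p()` as one possible alternative; that alternative is a suggestion and was not used in this report.

```r
# Made-up values containing negatives and an exact zero.
x <- c(-2.79, -0.28, 0, 0.28, 3.68)

abs_x <- abs(x)

# log(0) is -Inf, so any exact zero becomes non-finite and is dropped by ggplot2.
log_x <- log(abs_x)
print(is.finite(log_x))

# log1p(x) = log(1 + x) stays finite at zero.
log1p_x <- log1p(abs_x)
print(log1p_x)
```

Counting `sum(!is.finite(log_x))` per variable shows how many rows a plot or model would silently lose.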

Build Models

Throughout this section, various models will be created to determine which provides the best “fit” for predicting the number of wine cases purchased, as given by the dataset. In this assignment, I will try various models such as linear, negative binomial, and Poisson models, as suggested by the homework instructions.

Model 1 - Poisson with imputed data

As per the homework videos, the Poisson distribution works well with count data.

## 
## Call:
## glm(formula = TARGET ~ ., family = poisson, data = imputed)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.9784  -0.5298   0.2051   0.6296   2.5442  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         2.035e+00  1.956e-01  10.403  < 2e-16 ***
## FixedAcidity       -2.602e-04  8.201e-04  -0.317 0.751045    
## VolatileAcidity    -5.162e-02  6.494e-03  -7.949 1.87e-15 ***
## CitricAcid          1.431e-02  5.891e-03   2.429 0.015134 *  
## ResidualSugar       1.163e-04  1.517e-04   0.767 0.443001    
## Chlorides          -5.287e-02  1.615e-02  -3.272 0.001066 ** 
## FreeSulfurDioxide   1.405e-04  3.442e-05   4.081 4.48e-05 ***
## TotalSulfurDioxide  9.744e-05  2.215e-05   4.399 1.09e-05 ***
## Density            -4.109e-01  1.921e-01  -2.139 0.032433 *  
## pH                 -2.290e-02  7.550e-03  -3.033 0.002420 ** 
## Sulphates          -1.939e-02  5.520e-03  -3.513 0.000443 ***
## Alcohol             5.155e-03  1.382e-03   3.731 0.000190 ***
## LabelAppeal         1.937e-01  6.022e-03  32.158  < 2e-16 ***
## AcidIndex          -1.217e-01  4.463e-03 -27.259  < 2e-16 ***
## STARS               2.027e-01  5.788e-03  35.027  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 22861  on 12794  degrees of freedom
## Residual deviance: 18351  on 12780  degrees of freedom
## AIC: 50323
## 
## Number of Fisher Scoring iterations: 5
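Because the Poisson model uses a log link, its coefficients are easier to read after exponentiating, which turns them into multiplicative effects on the expected count. The sketch below uses three estimates and the deviance figures copied from the Model 1 output above; treating residual deviance divided by degrees of freedom as a dispersion check is only a rough rule of thumb.

```r
# Estimates copied from the Model 1 summary above.
coefs <- c(LabelAppeal = 0.1937, AcidIndex = -0.1217, STARS = 0.2027)

# exp(coefficient) = multiplicative change in expected cases per one-unit increase.
rate_ratios <- exp(coefs)
print(round(rate_ratios, 3))

# Rough dispersion check: residual deviance / residual degrees of freedom.
dispersion_proxy <- 18351 / 12780
print(round(dispersion_proxy, 3))
```

Read this way, one additional star multiplies the expected number of cases by roughly 1.22, holding the other variables fixed, and a ratio noticeably above 1 hints at some overdispersion.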

Model 2 - Poisson without imputed data

## 
## Call:
## glm(formula = TARGET ~ ., family = poisson, data = wine_train1)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2158  -0.2734   0.0616   0.3732   1.6830  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         1.593e+00  2.506e-01   6.359 2.03e-10 ***
## FixedAcidity        3.293e-04  1.053e-03   0.313  0.75447    
## VolatileAcidity    -2.560e-02  8.353e-03  -3.065  0.00218 ** 
## CitricAcid         -7.259e-04  7.575e-03  -0.096  0.92365    
## ResidualSugar      -6.141e-05  1.941e-04  -0.316  0.75165    
## Chlorides          -3.007e-02  2.056e-02  -1.463  0.14346    
## FreeSulfurDioxide   6.734e-05  4.404e-05   1.529  0.12620    
## TotalSulfurDioxide  2.081e-05  2.855e-05   0.729  0.46618    
## Density            -3.725e-01  2.462e-01  -1.513  0.13026    
## pH                 -4.661e-03  9.598e-03  -0.486  0.62722    
## Sulphates          -5.164e-03  7.051e-03  -0.732  0.46398    
## Alcohol             3.948e-03  1.771e-03   2.229  0.02579 *  
## LabelAppeal         1.771e-01  7.954e-03  22.271  < 2e-16 ***
## AcidIndex          -4.870e-02  5.903e-03  -8.251  < 2e-16 ***
## STARS               1.871e-01  7.487e-03  24.993  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 5844.1  on 6435  degrees of freedom
## Residual deviance: 4009.1  on 6421  degrees of freedom
##   (6359 observations deleted due to missingness)
## AIC: 23172
## 
## Number of Fisher Scoring iterations: 5

Model 3 - Negative Binomial

## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached

## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached

## 
## Call:
## glm.nb(formula = TARGET ~ ., data = imputed, init.theta = 38344.98616, 
##     link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.9782  -0.5297   0.2051   0.6296   2.5442  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         2.035e+00  1.956e-01  10.402  < 2e-16 ***
## FixedAcidity       -2.602e-04  8.202e-04  -0.317 0.751042    
## VolatileAcidity    -5.162e-02  6.494e-03  -7.949 1.88e-15 ***
## CitricAcid          1.431e-02  5.891e-03   2.429 0.015139 *  
## ResidualSugar       1.163e-04  1.517e-04   0.767 0.442988    
## Chlorides          -5.287e-02  1.616e-02  -3.272 0.001066 ** 
## FreeSulfurDioxide   1.405e-04  3.442e-05   4.081 4.49e-05 ***
## TotalSulfurDioxide  9.744e-05  2.215e-05   4.398 1.09e-05 ***
## Density            -4.109e-01  1.921e-01  -2.139 0.032438 *  
## pH                 -2.290e-02  7.550e-03  -3.033 0.002420 ** 
## Sulphates          -1.939e-02  5.521e-03  -3.513 0.000444 ***
## Alcohol             5.155e-03  1.382e-03   3.731 0.000191 ***
## LabelAppeal         1.937e-01  6.022e-03  32.156  < 2e-16 ***
## AcidIndex          -1.217e-01  4.464e-03 -27.258  < 2e-16 ***
## STARS               2.027e-01  5.788e-03  35.025  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(38344.99) family taken to be 1)
## 
##     Null deviance: 22860  on 12794  degrees of freedom
## Residual deviance: 18350  on 12780  degrees of freedom
## AIC: 50325
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  38345 
##           Std. Err.:  59918 
## Warning while fitting theta: iteration limit reached 
## 
##  2 x log-likelihood:  -50293.1
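The very large theta explains why this model's coefficients are almost identical to those of Model 1. The negative binomial variance is mu + mu^2/theta, so as theta grows the extra term vanishes and the model approaches a Poisson. The sketch below plugs in the TARGET mean and the fitted theta from the outputs in this report; the function name `nb_variance()` is hypothetical.

```r
# Negative binomial variance function: Var(Y) = mu + mu^2 / theta.
nb_variance <- function(mu, theta) mu + mu^2 / theta

mu    <- 3.0290739   # mean of TARGET from the summary table
theta <- 38344.99    # theta reported in the Model 3 output

# Extra variance beyond the Poisson variance (which equals mu).
extra_variance <- nb_variance(mu, theta) - mu
print(extra_variance)
```

With an extra variance this close to zero, the negative binomial fit adds nothing over the Poisson fit here, which is consistent with the iteration-limit warnings.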

Model 4 - Linear Model

## 
## Call:
## lm(formula = TARGET ~ ., data = imputed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.1455 -0.7398  0.3661  1.1045  4.4181 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         5.401e+00  5.507e-01   9.807  < 2e-16 ***
## FixedAcidity       -4.122e-04  2.312e-03  -0.178 0.858533    
## VolatileAcidity    -1.580e-01  1.836e-02  -8.605  < 2e-16 ***
## CitricAcid          4.299e-02  1.671e-02   2.573 0.010107 *  
## ResidualSugar       3.892e-04  4.287e-04   0.908 0.363882    
## Chlorides          -1.675e-01  4.551e-02  -3.679 0.000235 ***
## FreeSulfurDioxide   4.209e-04  9.721e-05   4.330 1.51e-05 ***
## TotalSulfurDioxide  2.823e-04  6.236e-05   4.526 6.06e-06 ***
## Density            -1.159e+00  5.422e-01  -2.137 0.032591 *  
## pH                 -5.957e-02  2.126e-02  -2.801 0.005096 ** 
## Sulphates          -5.669e-02  1.562e-02  -3.629 0.000286 ***
## Alcohol             1.840e-02  3.892e-03   4.729 2.28e-06 ***
## LabelAppeal         5.868e-01  1.689e-02  34.752  < 2e-16 ***
## AcidIndex          -3.218e-01  1.117e-02 -28.817  < 2e-16 ***
## STARS               6.645e-01  1.708e-02  38.913  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.625 on 12780 degrees of freedom
## Multiple R-squared:  0.2895, Adjusted R-squared:  0.2887 
## F-statistic:   372 on 14 and 12780 DF,  p-value: < 2.2e-16

Model 5 - Zero inflation

## Classes and Methods for R developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University
## Simon Jackman
## hurdle and zeroinfl functions by Achim Zeileis

## 
## Call:
## zeroinfl(formula = TARGET ~ . | STARS, data = imputed, dist = "negbin")
## 
## Pearson residuals:
##     Min      1Q  Median      3Q     Max 
## -2.2585 -0.3095  0.1807  0.5111  2.2104 
## 
## Count model coefficients (negbin with log link):
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         1.647e+00  2.060e-01   7.998 1.27e-15 ***
## FixedAcidity        1.766e-04  8.540e-04   0.207  0.83613    
## VolatileAcidity    -1.959e-02  6.848e-03  -2.861  0.00423 ** 
## CitricAcid          2.508e-03  6.121e-03   0.410  0.68203    
## ResidualSugar      -3.782e-05  1.580e-04  -0.239  0.81089    
## Chlorides          -2.587e-02  1.689e-02  -1.532  0.12561    
## FreeSulfurDioxide   4.816e-05  3.521e-05   1.368  0.17136    
## TotalSulfurDioxide -6.033e-06  2.243e-05  -0.269  0.78793    
## Density            -3.043e-01  2.016e-01  -1.509  0.13131    
## pH                  1.714e-03  7.903e-03   0.217  0.82831    
## Sulphates          -3.187e-03  5.788e-03  -0.550  0.58198    
## Alcohol             6.822e-03  1.434e-03   4.758 1.96e-06 ***
## LabelAppeal         2.423e-01  6.375e-03  38.004  < 2e-16 ***
## AcidIndex          -4.319e-02  5.403e-03  -7.994 1.31e-15 ***
## STARS               9.568e-02  6.313e-03  15.155  < 2e-16 ***
## Log(theta)          1.798e+01  1.981e+00   9.080  < 2e-16 ***
## 
## Zero-inflation model coefficients (binomial with logit link):
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.33119    0.06228  -5.317 1.05e-07 ***
## STARS       -0.62932    0.03362 -18.720  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Theta = 64638340.4478 
## Number of iterations in BFGS optimization: 65 
## Log-likelihood: -2.296e+04 on 18 Df
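The zero-inflation part of this model is a logistic regression for the probability that an observation is an "excess" zero. Its coefficients can be turned into probabilities with the inverse logit, as sketched below using the two estimates from the output above.

```r
# Zero-inflation coefficients copied from the Model 5 output above.
zero_coefs <- c(intercept = -0.33119, STARS = -0.62932)

# Inverse logit of the linear predictor for each star rating.
stars <- 1:4
p_excess_zero <- plogis(zero_coefs[["intercept"]] + zero_coefs[["STARS"]] * stars)
names(p_excess_zero) <- paste0("STARS=", stars)
print(round(p_excess_zero, 3))
```

Under this fit, the estimated probability of an excess zero falls from roughly 0.28 for a one-star wine to roughly 0.05 for a four-star wine.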

Model 6 - glmulti Package

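The glmulti package is an “automated model selection and model averaging” tool. The package automatically generates all possible models “with the specified response and explanatory variables” and ranks them by an information criterion, so it can be used to find the “best” model. The resulting model is fit below with glm(); because no family is specified, glm() defaults to the Gaussian family, so this is effectively a linear model. The same formula is then fit to the absolute-value data (absdata).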

glmmodel <- glm(imputed$TARGET ~ 1 + VolatileAcidity + CitricAcid + Chlorides + 
    FreeSulfurDioxide + TotalSulfurDioxide + Density + pH + Sulphates + 
    Alcohol + LabelAppeal + AcidIndex + STARS, data = imputed)

summary(glmmodel)

## 
## Call:
## glm(formula = imputed$TARGET ~ 1 + VolatileAcidity + CitricAcid + 
##     Chlorides + FreeSulfurDioxide + TotalSulfurDioxide + Density + 
##     pH + Sulphates + Alcohol + LabelAppeal + AcidIndex + STARS, 
##     data = imputed)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -6.1320  -0.7369   0.3636   1.1058   4.4248  
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         5.398e+00  5.507e-01   9.803  < 2e-16 ***
## VolatileAcidity    -1.581e-01  1.836e-02  -8.613  < 2e-16 ***
## CitricAcid          4.285e-02  1.671e-02   2.564 0.010348 *  
## Chlorides          -1.677e-01  4.551e-02  -3.684 0.000231 ***
## FreeSulfurDioxide   4.223e-04  9.718e-05   4.345 1.40e-05 ***
## TotalSulfurDioxide  2.835e-04  6.234e-05   4.548 5.46e-06 ***
## Density            -1.154e+00  5.422e-01  -2.129 0.033291 *  
## pH                 -5.937e-02  2.126e-02  -2.793 0.005238 ** 
## Sulphates          -5.693e-02  1.561e-02  -3.647 0.000267 ***
## Alcohol             1.833e-02  3.891e-03   4.710 2.50e-06 ***
## LabelAppeal         5.869e-01  1.688e-02  34.757  < 2e-16 ***
## AcidIndex          -3.223e-01  1.100e-02 -29.308  < 2e-16 ***
## STARS               6.647e-01  1.707e-02  38.935  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 2.639244)
## 
##     Null deviance: 47477  on 12794  degrees of freedom
## Residual deviance: 33735  on 12782  degrees of freedom
## AIC: 48743
## 
## Number of Fisher Scoring iterations: 2

glmmodelabs <- glm(absdata$TARGET ~ 1 + VolatileAcidity + CitricAcid + Chlorides + 
    FreeSulfurDioxide + TotalSulfurDioxide + Density + pH + Sulphates + 
    Alcohol + LabelAppeal + AcidIndex + STARS, data = absdata)

summary(glmmodelabs)

## 
## Call:
## glm(formula = absdata$TARGET ~ 1 + VolatileAcidity + CitricAcid + 
##     Chlorides + FreeSulfurDioxide + TotalSulfurDioxide + Density + 
##     pH + Sulphates + Alcohol + LabelAppeal + AcidIndex + STARS, 
##     data = absdata)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -5.3054  -1.1095   0.2816   1.1696   5.8865  
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         5.113e+00  5.782e-01   8.842  < 2e-16 ***
## VolatileAcidity    -1.785e-01  2.716e-02  -6.571 5.18e-11 ***
## CitricAcid          6.520e-02  2.488e-02   2.621  0.00879 ** 
## Chlorides          -1.562e-01  6.468e-02  -2.415  0.01577 *  
## FreeSulfurDioxide   3.176e-04  1.397e-04   2.274  0.02297 *  
## TotalSulfurDioxide  2.708e-04  9.312e-05   2.908  0.00364 ** 
## Density            -1.301e+00  5.684e-01  -2.288  0.02215 *  
## pH                 -5.466e-02  2.230e-02  -2.451  0.01425 *  
## Sulphates          -6.526e-02  2.317e-02  -2.816  0.00486 ** 
## Alcohol             1.808e-02  4.187e-03   4.318 1.59e-05 ***
## LabelAppeal        -2.969e-02  2.425e-02  -1.224  0.22084    
## AcidIndex          -3.065e-01  1.148e-02 -26.692  < 2e-16 ***
## STARS               8.412e-01  1.710e-02  49.179  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 2.902446)
## 
##     Null deviance: 47477  on 12794  degrees of freedom
## Residual deviance: 37099  on 12782  degrees of freedom
## AIC: 49959
## 
## Number of Fisher Scoring iterations: 2
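The kind of exhaustive search that glmulti automates can be illustrated in base R. The sketch below is a swapped-in stand-in, not glmulti itself: it enumerates every subset of three simulated predictors, fits a Gaussian glm to each, and ranks the fits by AIC. The data and variable names are made up.

```r
# Simulated data: y depends on x1 and x2, while x3 is pure noise.
set.seed(621)
n <- 400
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
dat$y <- 1 + 2 * dat$x1 - 1.5 * dat$x2 + rnorm(n)

predictors <- c("x1", "x2", "x3")

# Enumerate every non-empty subset of the predictors.
subsets <- unlist(
  lapply(seq_along(predictors),
         function(k) combn(predictors, k, simplify = FALSE)),
  recursive = FALSE
)

# Build one formula per subset, fit it, and record the AIC.
results <- data.frame(
  formula = vapply(subsets,
                   function(s) paste("y ~", paste(s, collapse = " + ")),
                   character(1)),
  stringsAsFactors = FALSE
)
results$aic <- vapply(results$formula,
                      function(f) AIC(glm(as.formula(f), data = dat)),
                      numeric(1), USE.NAMES = FALSE)

# Lowest AIC first.
best <- results[which.min(results$aic), ]
print(results[order(results$aic), ])
```

With p predictors there are 2^p - 1 non-empty subsets, so this brute-force approach is only practical for a modest number of variables.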

Select Models

Predictions

Similar to the training data, the evaluation data also needs some prep work. As was done for the training data, columns were removed from the evaluation data and NA values were imputed using the MICE random forest method.

## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning
## Inf

## Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning
## -Inf

	vars	n	mean	sd	median	trimmed	mad	min	max	range	skew	kurtosis	se
IN	1	3335	8048.3109445	4655.4790369	7906.0000	8044.2832522	5960.0520000	3.00000	16130.00000	1.6127e+04	0.0124697	-1.2000392	80.6151110
TARGET*	2	0	NaN	NA	NA	NaN	NA	Inf	-Inf	-Inf	NA	NA	NA
FixedAcidity	3	3335	6.8638081	6.3184313	6.9000	6.9142750	2.8169400	-18.20000	33.50000	5.1700e+01	-0.1172460	2.0399637	0.1094111
VolatileAcidity	4	3335	0.3102714	0.8068341	0.2800	0.3129730	0.4596060	-2.83000	3.61000	6.4400e+00	-0.0437301	1.6171958	0.0139713
CitricAcid	5	3335	0.3124288	0.8709938	0.3100	0.3110978	0.4447800	-3.12000	3.76000	6.8800e+00	-0.0284898	1.6564422	0.0150823
ResidualSugar	6	3335	5.1860570	34.1602735	3.5000	5.3186774	16.9016400	-128.30000	145.40000	2.7370e+02	-0.0492818	2.0015002	0.5915254
Chlorides	7	3335	0.0593979	0.3135205	0.0460	0.0607958	0.1156428	-1.15000	1.26300	2.4130e+00	-0.0455098	1.7219311	0.0054290
FreeSulfurDioxide	8	3335	33.9872564	148.8929259	29.0000	33.2615212	57.0801000	-563.00000	617.00000	1.1800e+03	0.0730972	1.8678963	2.5782566
TotalSulfurDioxide	9	3335	123.4229385	224.5781463	124.0000	124.0080555	136.3992000	-769.00000	1004.00000	1.7730e+03	-0.0437972	1.4893654	3.8888355
Density	10	3335	0.9946698	0.0261905	0.9946	0.9946690	0.0090290	0.88975	1.09983	2.1008e-01	-0.0296593	1.9359398	0.0004535
pH	11	3335	3.2342819	0.6740613	3.2100	3.2306669	0.3558240	0.60000	6.21000	5.6100e+00	0.1100501	1.7179640	0.0116722
Sulphates	12	3335	0.5409265	0.8949812	0.5000	0.5413713	0.3706500	-3.07000	4.18000	7.2500e+00	-0.0029721	1.8453436	0.0154977
Alcohol	13	3335	10.6136552	3.7589939	10.4000	10.6018946	2.5204200	-4.20000	25.60000	2.9800e+01	0.0820618	1.5780691	0.0650914
LabelAppeal	14	3335	0.0134933	0.8885718	0.0000	0.0063694	1.4826000	-2.00000	2.00000	4.0000e+00	0.0454887	-0.2601115	0.0153867
AcidIndex	15	3335	7.7478261	1.3154203	8.0000	7.6212064	1.4826000	5.00000	17.00000	1.2000e+01	1.5066589	4.2794836	0.0227781
STARS	16	3335	1.9985007	0.8933858	2.0000	1.9280629	1.4826000	1.00000	4.00000	3.0000e+00	0.4747543	-0.6880260	0.0154700

Evaluating the model

The models will be evaluated by comparing their mean squared error (MSE).

Comparison of model MSE.

Linear Model	Poisson Model 2	Poisson Model 1	Negative BinomMod	Zero Inflation	GLmulti	ABS
2.636385	6.455605	7.038349	7.038348	2.731825	2.636562	2.899497
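When comparing MSE across linear and count models, the predictions need to be on the same scale as TARGET. For a glm with a log link, predict() returns values on the log scale unless type = "response" is given. The sketch below shows the difference on simulated data. Whether this accounts for the higher Poisson and negative binomial values in the table cannot be determined from the report itself.

```r
# Simulated Poisson data (made up, not the wine data).
set.seed(621)
n <- 1000
dat <- data.frame(x = rnorm(n))
dat$y <- rpois(n, lambda = exp(1 + 0.3 * dat$x))

fit <- glm(y ~ x, family = poisson, data = dat)

# Mean squared error between observed and predicted values.
mse <- function(actual, predicted) mean((actual - predicted)^2)

# Default predict() is on the link (log) scale; type = "response" gives counts.
mse_link     <- mse(dat$y, predict(fit))
mse_response <- mse(dat$y, predict(fit, type = "response"))
print(c(link = mse_link, response = mse_response))
```

Only the response-scale figure is comparable with the MSE of a linear model fitted to the same target.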

The linear model and the glmulti model have very similar MSE. Summaries of both models' predictions are shown below:

Model 4

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.9126  2.3506  3.0063  3.0654  3.7495  6.5648

Model 6

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.9137  2.3497  3.0036  3.0655  3.7519  6.5873

References

All subset regression with leaps, bestglm, glmulti, and meifly. (n.d.). Retrieved from https://rstudio-pubs-static.s3.amazonaws.com/2897_9220b21cfc0c43a396ff9abf122bb351.html

Model selection and multimodel inference made easy. (n.d.). Retrieved from https://cran.r-project.org/web/packages/glmulti/glmulti.pdf

Best subset model selection with R. (n.d.). Retrieved from http://jadianes.me/best-subset-model-selection-with-R

Zero-inflated Poisson regression | R data analysis examples. (n.d.). Retrieved from https://stats.idre.ucla.edu/r/dae/zip/

Appendix

https://github.com/nschettini/CUNY-MSDS-DATA-621/blob/master/HW5

CUNY MSDS Data 621 - HW 5

Nicholas Schettini

July 9, 2018