In this homework assignment, you will explore, analyze and model a data set containing information on approximately 12,000 commercially available wines. The variables are mostly related to the chemical properties of the wine being sold. The response variable is the number of sample cases of wine that were purchased by wine distribution companies after sampling a wine. These cases would be used to provide tasting samples to restaurants and wine stores around the United States. The more sample cases purchased, the more likely is a wine to be sold at a high end restaurant. A large wine manufacturer is studying the data in order to predict the number of wine cases ordered based upon the wine characteristics. If the wine manufacturer can predict the number of cases, then that manufacturer will be able to adjust their wine offering to maximize sales.
Your objective is to build a count regression model to predict the number of cases of wine that will be sold given certain properties of the wine. HINT: Sometimes, the fact that a variable is missing is actually predictive of the target. You can only use the variables given to you (or variables that you derive from the variables provided). Below is a short description of the variables of interest in the data set:
I uploaded the CSV files to my GitHub account and load them from there. The first column (INDEX in the training set, IN in the evaluation set) is only a row identifier and has no significance for our analysis, so it will be dropped.
The training dataset has 12,795 observations and 16 variables. The summary shows missing values in eight of the variables; we will handle them in the later sections.
## [1] 12795 16
## INDEX TARGET FixedAcidity VolatileAcidity
## Min. : 1 Min. :0.000 Min. :-18.100 Min. :-2.7900
## 1st Qu.: 4038 1st Qu.:2.000 1st Qu.: 5.200 1st Qu.: 0.1300
## Median : 8110 Median :3.000 Median : 6.900 Median : 0.2800
## Mean : 8070 Mean :3.029 Mean : 7.076 Mean : 0.3241
## 3rd Qu.:12106 3rd Qu.:4.000 3rd Qu.: 9.500 3rd Qu.: 0.6400
## Max. :16129 Max. :8.000 Max. : 34.400 Max. : 3.6800
##
## CitricAcid ResidualSugar Chlorides FreeSulfurDioxide
## Min. :-3.2400 Min. :-127.800 Min. :-1.1710 Min. :-555.00
## 1st Qu.: 0.0300 1st Qu.: -2.000 1st Qu.:-0.0310 1st Qu.: 0.00
## Median : 0.3100 Median : 3.900 Median : 0.0460 Median : 30.00
## Mean : 0.3084 Mean : 5.419 Mean : 0.0548 Mean : 30.85
## 3rd Qu.: 0.5800 3rd Qu.: 15.900 3rd Qu.: 0.1530 3rd Qu.: 70.00
## Max. : 3.8600 Max. : 141.150 Max. : 1.3510 Max. : 623.00
## NA's :616 NA's :638 NA's :647
## TotalSulfurDioxide Density pH Sulphates
## Min. :-823.0 Min. :0.8881 Min. :0.480 Min. :-3.1300
## 1st Qu.: 27.0 1st Qu.:0.9877 1st Qu.:2.960 1st Qu.: 0.2800
## Median : 123.0 Median :0.9945 Median :3.200 Median : 0.5000
## Mean : 120.7 Mean :0.9942 Mean :3.208 Mean : 0.5271
## 3rd Qu.: 208.0 3rd Qu.:1.0005 3rd Qu.:3.470 3rd Qu.: 0.8600
## Max. :1057.0 Max. :1.0992 Max. :6.130 Max. : 4.2400
## NA's :682 NA's :395 NA's :1210
## Alcohol LabelAppeal AcidIndex STARS
## Min. :-4.70 Min. :-2.000000 Min. : 4.000 Min. :1.000
## 1st Qu.: 9.00 1st Qu.:-1.000000 1st Qu.: 7.000 1st Qu.:1.000
## Median :10.40 Median : 0.000000 Median : 8.000 Median :2.000
## Mean :10.49 Mean :-0.009066 Mean : 7.773 Mean :2.042
## 3rd Qu.:12.40 3rd Qu.: 1.000000 3rd Qu.: 8.000 3rd Qu.:3.000
## Max. :26.50 Max. : 2.000000 Max. :17.000 Max. :4.000
## NA's :653 NA's :3359
The first column of both datasets is dropped because it is just an index number and does not help our analysis.
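As a rough illustration of this step, the sketch below drops an index column and counts missing values per column. The small `wine` data frame is a made-up stand-in (the real CSV files are not reproduced here), so only the pattern carries over, not the numbers.

```r
# Toy stand-in for the training data; values are invented for illustration.
wine <- data.frame(
  INDEX  = 1:6,
  TARGET = c(3, 0, 4, 2, 5, 0),
  pH     = c(3.2, NA, 3.5, 3.1, NA, 2.9),
  STARS  = c(2, NA, 3, 1, 4, NA)
)

# Drop the index column by name rather than by position.
wine <- wine[, setdiff(names(wine), "INDEX"), drop = FALSE]

# Count missing values in each remaining column.
na_counts <- colSums(is.na(wine))
print(na_counts)
```

Dropping by name instead of by position avoids removing the wrong column if the file layout ever changes.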
The histograms show that many of the variables are approximately normally distributed, while AcidIndex and STARS are right-skewed.
From the plots of each variable against the target variable below, we see that the target increases with STARS and LabelAppeal, so both have a positive relationship with the target variable.
From the correlation matrix below, STARS and LabelAppeal have a positive relationship; the other variables are only loosely correlated and show no apparent relationships.
Our earlier analysis showed that several variables have missing values. For STARS, we replace NA with 0. For the remaining missing data we use caret::preProcess with method = knnImpute; the same preProcess step also centers, scales and Box-Cox transforms the features. df_clean is the resulting cleaned data frame.
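A minimal sketch of the STARS step, on invented data. The `STARS_missing` flag is an illustrative addition prompted by the assignment hint that missingness can be predictive; it is not something the report itself creates. The commented caret call is one plausible form of the preprocessing described above, not necessarily the exact call that was used.

```r
# Toy data; values are invented for illustration.
df <- data.frame(
  TARGET = c(3, 0, 4, 2, 5, 0),
  STARS  = c(2, NA, 3, 1, 4, NA)
)

# Hypothetical helper column: record which ratings were missing
# before they are overwritten.
df$STARS_missing <- as.integer(is.na(df$STARS))

# Replace missing ratings with 0, as described above.
df$STARS[is.na(df$STARS)] <- 0

# Compare the mean target between rated and unrated wines.
flag_means <- tapply(df$TARGET, df$STARS_missing, mean)
print(flag_means)

# Remaining columns could then go through caret (not run here):
#   pp       <- caret::preProcess(df, method = c("knnImpute", "center", "scale", "BoxCox"))
#   df_clean <- predict(pp, df)
```

If the mean target differs clearly between the two groups, the missingness itself carries signal, which is the point of the hint.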
We now start building the models.
We split df_clean into a training set (80%) and a testing set (20%).
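One way such a split could be done in base R is sketched below. The seed, the placeholder `df_clean` and the name `testing_df` are assumptions made for illustration; the report does not show which function it used. With 12,795 rows, an 80% sample gives 10,236 training rows, which agrees with the 10,235 null degrees of freedom reported in the model summaries.

```r
set.seed(123)  # arbitrary seed, chosen only so the split is reproducible

# Placeholder with the same number of rows as the report's data.
df_clean <- data.frame(id = seq_len(12795))

# Randomly pick 80% of the row indices for training.
n_train   <- round(0.8 * nrow(df_clean))
train_idx <- sample(seq_len(nrow(df_clean)), size = n_train)

training_df <- df_clean[train_idx, , drop = FALSE]
testing_df  <- df_clean[-train_idx, , drop = FALSE]

print(c(train = nrow(training_df), test = nrow(testing_df)))
```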
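For the first model we fit a generalized linear model (glm) with the Poisson family and include all the features. LabelAppeal, AcidIndex and STARS enter as factors; because they were transformed during preprocessing, their levels appear as transformed values in the output.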
##
## Call:
## glm(formula = TARGET ~ FixedAcidity + VolatileAcidity + CitricAcid +
## ResidualSugar + Chlorides + FreeSulfurDioxide + TotalSulfurDioxide +
## Density + pH + Sulphates + Alcohol + as.factor(LabelAppeal) +
## as.factor(AcidIndex) + as.factor(STARS), family = poisson,
## data = training_df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.2254 -0.6505 -0.0054 0.4448 3.6717
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.335e-01 4.496e-01 0.964 0.334905
## FixedAcidity 7.682e-04 5.771e-03 0.133 0.894099
## VolatileAcidity -2.170e-02 5.738e-03 -3.782 0.000155
## CitricAcid 2.702e-04 5.662e-03 0.048 0.961932
## ResidualSugar 2.304e-03 5.820e-03 0.396 0.692234
## Chlorides -1.277e-02 5.871e-03 -2.175 0.029608
## FreeSulfurDioxide 1.309e-02 5.775e-03 2.267 0.023378
## TotalSulfurDioxide 1.792e-02 5.852e-03 3.063 0.002190
## Density -1.038e-02 5.697e-03 -1.821 0.068587
## pH -6.798e-03 5.829e-03 -1.166 0.243577
## Sulphates -1.048e-02 5.932e-03 -1.767 0.077181
## Alcohol 1.788e-02 5.903e-03 3.028 0.002459
## as.factor(LabelAppeal)-1.11204793733397 2.201e-01 4.277e-02 5.147 2.65e-07
## as.factor(LabelAppeal)0.0101741115806247 4.108e-01 4.172e-02 9.844 < 2e-16
## as.factor(LabelAppeal)1.13239616049522 5.502e-01 4.245e-02 12.960 < 2e-16
## as.factor(LabelAppeal)2.25461820940981 6.842e-01 4.770e-02 14.346 < 2e-16
## as.factor(AcidIndex)-3.59682937695875 -5.236e-01 4.532e-01 -1.155 0.247994
## as.factor(AcidIndex)-1.79176983045029 -5.106e-01 4.484e-01 -1.139 0.254765
## as.factor(AcidIndex)-0.545318540973785 -5.436e-01 4.481e-01 -1.213 0.225112
## as.factor(AcidIndex)0.362910765511677 -5.757e-01 4.481e-01 -1.285 0.198947
## as.factor(AcidIndex)1.05172974217783 -6.939e-01 4.484e-01 -1.547 0.121782
## as.factor(AcidIndex)1.59059728918163 -8.152e-01 4.493e-01 -1.814 0.069633
## as.factor(AcidIndex)2.02271372429848 -1.197e+00 4.526e-01 -2.644 0.008201
## as.factor(AcidIndex)2.37629509167962 -1.330e+00 4.591e-01 -2.898 0.003755
## as.factor(AcidIndex)2.67051656830802 -1.057e+00 4.610e-01 -2.292 0.021882
## as.factor(AcidIndex)2.9188445277671 -1.184e+00 4.728e-01 -2.505 0.012236
## as.factor(AcidIndex)3.13100139587667 -7.599e-01 5.330e-01 -1.426 0.153969
## as.factor(AcidIndex)3.31417429494859 -1.347e+01 1.626e+02 -0.083 0.933945
## as.factor(AcidIndex)3.47378568897179 -1.484e+00 6.335e-01 -2.342 0.019185
## as.factor(STARS)-0.42623524866846 7.568e-01 2.193e-02 34.514 < 2e-16
## as.factor(STARS)0.416552574962037 1.071e+00 2.050e-02 52.234 < 2e-16
## as.factor(STARS)1.25934039859254 1.186e+00 2.163e-02 54.804 < 2e-16
## as.factor(STARS)2.10212822222303 1.310e+00 2.724e-02 48.067 < 2e-16
##
## (Intercept)
## FixedAcidity
## VolatileAcidity ***
## CitricAcid
## ResidualSugar
## Chlorides *
## FreeSulfurDioxide *
## TotalSulfurDioxide **
## Density .
## pH
## Sulphates .
## Alcohol **
## as.factor(LabelAppeal)-1.11204793733397 ***
## as.factor(LabelAppeal)0.0101741115806247 ***
## as.factor(LabelAppeal)1.13239616049522 ***
## as.factor(LabelAppeal)2.25461820940981 ***
## as.factor(AcidIndex)-3.59682937695875
## as.factor(AcidIndex)-1.79176983045029
## as.factor(AcidIndex)-0.545318540973785
## as.factor(AcidIndex)0.362910765511677
## as.factor(AcidIndex)1.05172974217783
## as.factor(AcidIndex)1.59059728918163 .
## as.factor(AcidIndex)2.02271372429848 **
## as.factor(AcidIndex)2.37629509167962 **
## as.factor(AcidIndex)2.67051656830802 *
## as.factor(AcidIndex)2.9188445277671 *
## as.factor(AcidIndex)3.13100139587667
## as.factor(AcidIndex)3.31417429494859
## as.factor(AcidIndex)3.47378568897179 *
## as.factor(STARS)-0.42623524866846 ***
## as.factor(STARS)0.416552574962037 ***
## as.factor(STARS)1.25934039859254 ***
## as.factor(STARS)2.10212822222303 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 18300 on 10235 degrees of freedom
## Residual deviance: 10887 on 10203 degrees of freedom
## AIC: 36509
##
## Number of Fisher Scoring iterations: 10
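To help read the coefficient table, the sketch below fits a Poisson glm to simulated data (not the wine data) and exponentiates the coefficients. Under the log link, exp(coefficient) is the multiplicative change in the expected count. The last line applies the same idea to one estimate copied from the summary above.

```r
set.seed(1)

# Simulated counts whose log-mean is linear in x (true slope 0.3).
n <- 2000
x <- rnorm(n)
y <- rpois(n, lambda = exp(0.5 + 0.3 * x))

fit <- glm(y ~ x, family = poisson)

# exp(coefficient) is the multiplicative change in the expected count
# for a one-unit increase in the predictor.
rate_ratios <- exp(coef(fit))
print(rate_ratios)

# Same reading for a value from the Model 1 summary: the highest STARS
# level has an estimate of 1.310 relative to the baseline level.
stars_top_ratio <- exp(1.310)
print(stars_top_ratio)
```

Read this way, the expected number of cases for the highest STARS level is roughly 3.7 times that of the baseline level, holding the other terms fixed.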
For Model 2 we run stepAIC on Model 1 to keep the subset of features that gives the lowest AIC.
## Start: AIC=36509.07
## TARGET ~ FixedAcidity + VolatileAcidity + CitricAcid + ResidualSugar +
## Chlorides + FreeSulfurDioxide + TotalSulfurDioxide + Density +
## pH + Sulphates + Alcohol + as.factor(LabelAppeal) + as.factor(AcidIndex) +
## as.factor(STARS)
##
## Df Deviance AIC
## - CitricAcid 1 10887 36507
## - FixedAcidity 1 10887 36507
## - ResidualSugar 1 10887 36507
## - pH 1 10888 36508
## <none> 10887 36509
## - Sulphates 1 10890 36510
## - Density 1 10890 36510
## - Chlorides 1 10892 36512
## - FreeSulfurDioxide 1 10892 36512
## - Alcohol 1 10896 36516
## - TotalSulfurDioxide 1 10896 36516
## - VolatileAcidity 1 10901 36521
## - as.factor(AcidIndex) 13 11234 36831
## - as.factor(LabelAppeal) 4 11448 37063
## - as.factor(STARS) 4 15384 40998
##
## Step: AIC=36507.07
## TARGET ~ FixedAcidity + VolatileAcidity + ResidualSugar + Chlorides +
## FreeSulfurDioxide + TotalSulfurDioxide + Density + pH + Sulphates +
## Alcohol + as.factor(LabelAppeal) + as.factor(AcidIndex) +
## as.factor(STARS)
##
## Df Deviance AIC
## - FixedAcidity 1 10887 36505
## - ResidualSugar 1 10887 36505
## - pH 1 10888 36506
## <none> 10887 36507
## - Sulphates 1 10890 36508
## - Density 1 10890 36508
## + CitricAcid 1 10887 36509
## - Chlorides 1 10892 36510
## - FreeSulfurDioxide 1 10892 36510
## - Alcohol 1 10896 36514
## - TotalSulfurDioxide 1 10896 36514
## - VolatileAcidity 1 10901 36519
## - as.factor(AcidIndex) 13 11235 36829
## - as.factor(LabelAppeal) 4 11449 37061
## - as.factor(STARS) 4 15386 40998
##
## Step: AIC=36505.09
## TARGET ~ VolatileAcidity + ResidualSugar + Chlorides + FreeSulfurDioxide +
## TotalSulfurDioxide + Density + pH + Sulphates + Alcohol +
## as.factor(LabelAppeal) + as.factor(AcidIndex) + as.factor(STARS)
##
## Df Deviance AIC
## - ResidualSugar 1 10887 36503
## - pH 1 10888 36504
## <none> 10887 36505
## - Sulphates 1 10890 36506
## - Density 1 10890 36506
## + FixedAcidity 1 10887 36507
## + CitricAcid 1 10887 36507
## - Chlorides 1 10892 36508
## - FreeSulfurDioxide 1 10892 36508
## - Alcohol 1 10896 36512
## - TotalSulfurDioxide 1 10896 36512
## - VolatileAcidity 1 10901 36517
## - as.factor(AcidIndex) 13 11241 36833
## - as.factor(LabelAppeal) 4 11449 37059
## - as.factor(STARS) 4 15386 40996
##
## Step: AIC=36503.24
## TARGET ~ VolatileAcidity + Chlorides + FreeSulfurDioxide + TotalSulfurDioxide +
## Density + pH + Sulphates + Alcohol + as.factor(LabelAppeal) +
## as.factor(AcidIndex) + as.factor(STARS)
##
## Df Deviance AIC
## - pH 1 10888 36503
## <none> 10887 36503
## - Sulphates 1 10890 36504
## - Density 1 10890 36505
## + ResidualSugar 1 10887 36505
## + FixedAcidity 1 10887 36505
## + CitricAcid 1 10887 36505
## - Chlorides 1 10892 36506
## - FreeSulfurDioxide 1 10892 36506
## - Alcohol 1 10896 36510
## - TotalSulfurDioxide 1 10897 36511
## - VolatileAcidity 1 10902 36516
## - as.factor(AcidIndex) 13 11242 36832
## - as.factor(LabelAppeal) 4 11449 37057
## - as.factor(STARS) 4 15387 40995
##
## Step: AIC=36502.59
## TARGET ~ VolatileAcidity + Chlorides + FreeSulfurDioxide + TotalSulfurDioxide +
## Density + Sulphates + Alcohol + as.factor(LabelAppeal) +
## as.factor(AcidIndex) + as.factor(STARS)
##
## Df Deviance AIC
## <none> 10888 36503
## + pH 1 10887 36503
## - Sulphates 1 10892 36504
## - Density 1 10892 36504
## + ResidualSugar 1 10888 36504
## + FixedAcidity 1 10888 36505
## + CitricAcid 1 10888 36505
## - Chlorides 1 10893 36505
## - FreeSulfurDioxide 1 10894 36506
## - Alcohol 1 10898 36510
## - TotalSulfurDioxide 1 10898 36510
## - VolatileAcidity 1 10903 36515
## - as.factor(AcidIndex) 13 11242 36830
## - as.factor(LabelAppeal) 4 11450 37056
## - as.factor(STARS) 4 15394 41000
##
## Call:
## glm(formula = TARGET ~ VolatileAcidity + Chlorides + FreeSulfurDioxide +
## TotalSulfurDioxide + Density + Sulphates + Alcohol + as.factor(LabelAppeal) +
## as.factor(AcidIndex) + as.factor(STARS), family = poisson,
## data = training_df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.2383 -0.6493 -0.0062 0.4429 3.6689
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.429534 0.449207 0.956 0.338968
## VolatileAcidity -0.021755 0.005736 -3.792 0.000149
## Chlorides -0.012601 0.005868 -2.147 0.031765
## FreeSulfurDioxide 0.013217 0.005774 2.289 0.022075
## TotalSulfurDioxide 0.018007 0.005848 3.079 0.002077
## Density -0.010443 0.005696 -1.834 0.066721
## Sulphates -0.010458 0.005930 -1.764 0.077808
## Alcohol 0.017944 0.005900 3.042 0.002354
## as.factor(LabelAppeal)-1.11204793733397 0.220335 0.042769 5.152 2.58e-07
## as.factor(LabelAppeal)0.0101741115806247 0.410655 0.041722 9.843 < 2e-16
## as.factor(LabelAppeal)1.13239616049522 0.550258 0.042448 12.963 < 2e-16
## as.factor(LabelAppeal)2.25461820940981 0.683988 0.047692 14.342 < 2e-16
## as.factor(AcidIndex)-3.59682937695875 -0.521321 0.452873 -1.151 0.249674
## as.factor(AcidIndex)-1.79176983045029 -0.508192 0.448002 -1.134 0.256646
## as.factor(AcidIndex)-0.545318540973785 -0.540219 0.447745 -1.207 0.227612
## as.factor(AcidIndex)0.362910765511677 -0.571580 0.447757 -1.277 0.201765
## as.factor(AcidIndex)1.05172974217783 -0.689490 0.447998 -1.539 0.123793
## as.factor(AcidIndex)1.59059728918163 -0.810650 0.448852 -1.806 0.070910
## as.factor(AcidIndex)2.02271372429848 -1.192447 0.452154 -2.637 0.008358
## as.factor(AcidIndex)2.37629509167962 -1.326170 0.458580 -2.892 0.003829
## as.factor(AcidIndex)2.67051656830802 -1.051957 0.460499 -2.284 0.022349
## as.factor(AcidIndex)2.9188445277671 -1.180235 0.472275 -2.499 0.012453
## as.factor(AcidIndex)3.13100139587667 -0.754149 0.532630 -1.416 0.156805
## as.factor(AcidIndex)3.31417429494859 -13.466519 162.534380 -0.083 0.933968
## as.factor(AcidIndex)3.47378568897179 -1.478336 0.632950 -2.336 0.019511
## as.factor(STARS)-0.42623524866846 0.757022 0.021925 34.528 < 2e-16
## as.factor(STARS)0.416552574962037 1.071567 0.020496 52.281 < 2e-16
## as.factor(STARS)1.25934039859254 1.186170 0.021629 54.841 < 2e-16
## as.factor(STARS)2.10212822222303 1.309666 0.027240 48.078 < 2e-16
##
## (Intercept)
## VolatileAcidity ***
## Chlorides *
## FreeSulfurDioxide *
## TotalSulfurDioxide **
## Density .
## Sulphates .
## Alcohol **
## as.factor(LabelAppeal)-1.11204793733397 ***
## as.factor(LabelAppeal)0.0101741115806247 ***
## as.factor(LabelAppeal)1.13239616049522 ***
## as.factor(LabelAppeal)2.25461820940981 ***
## as.factor(AcidIndex)-3.59682937695875
## as.factor(AcidIndex)-1.79176983045029
## as.factor(AcidIndex)-0.545318540973785
## as.factor(AcidIndex)0.362910765511677
## as.factor(AcidIndex)1.05172974217783
## as.factor(AcidIndex)1.59059728918163 .
## as.factor(AcidIndex)2.02271372429848 **
## as.factor(AcidIndex)2.37629509167962 **
## as.factor(AcidIndex)2.67051656830802 *
## as.factor(AcidIndex)2.9188445277671 *
## as.factor(AcidIndex)3.13100139587667
## as.factor(AcidIndex)3.31417429494859
## as.factor(AcidIndex)3.47378568897179 *
## as.factor(STARS)-0.42623524866846 ***
## as.factor(STARS)0.416552574962037 ***
## as.factor(STARS)1.25934039859254 ***
## as.factor(STARS)2.10212822222303 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 18300 on 10235 degrees of freedom
## Residual deviance: 10889 on 10207 degrees of freedom
## AIC: 36503
##
## Number of Fisher Scoring iterations: 10
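The selection traced above can be reproduced in miniature with `MASS::stepAIC`. The data below are simulated, with two pure-noise predictors, and `direction = "both"` is an assumption inferred from the `+` and `-` rows in the trace; this is a sketch of the technique, not the report's exact call.

```r
library(MASS)  # provides stepAIC

set.seed(2)
n  <- 1000
x1 <- rnorm(n)
x2 <- rnorm(n)  # pure noise: does not influence y
x3 <- rnorm(n)  # pure noise: does not influence y
y  <- rpois(n, lambda = exp(0.4 + 0.5 * x1))
sim <- data.frame(y, x1, x2, x3)

full_fit <- glm(y ~ x1 + x2 + x3, family = poisson, data = sim)

# Stepwise search: at each step, add or drop the single term that
# lowers AIC the most, and stop when no move improves it.
step_fit <- stepAIC(full_fit, direction = "both", trace = FALSE)

print(formula(step_fit))
print(c(full = AIC(full_fit), stepped = AIC(step_fit)))
```

Because the search is driven by AIC rather than p-values, a term can survive selection without being significant at the 5% level, as Density and Sulphates do in the output above.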
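For Model 3 we fit a negative binomial model (glm.nb), this time using the predictors VolatileAcidity, FreeSulfurDioxide, TotalSulfurDioxide and Alcohol together with the LabelAppeal, AcidIndex and STARS factors.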
##
## Call:
## glm.nb(formula = TARGET ~ VolatileAcidity + FreeSulfurDioxide +
## TotalSulfurDioxide + Alcohol + as.factor(LabelAppeal) + as.factor(AcidIndex) +
## as.factor(STARS), data = training_df, init.theta = 40382.20944,
## link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.2483 -0.6519 -0.0011 0.4398 3.6855
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.316e-01 4.492e-01 0.961 0.336574
## VolatileAcidity -2.200e-02 5.737e-03 -3.835 0.000126
## FreeSulfurDioxide 1.324e-02 5.769e-03 2.296 0.021688
## TotalSulfurDioxide 1.781e-02 5.846e-03 3.047 0.002313
## Alcohol 1.819e-02 5.898e-03 3.083 0.002046
## as.factor(LabelAppeal)-1.11204793733397 2.204e-01 4.277e-02 5.154 2.55e-07
## as.factor(LabelAppeal)0.0101741115806247 4.111e-01 4.172e-02 9.854 < 2e-16
## as.factor(LabelAppeal)1.13239616049522 5.501e-01 4.245e-02 12.960 < 2e-16
## as.factor(LabelAppeal)2.25461820940981 6.831e-01 4.769e-02 14.323 < 2e-16
## as.factor(AcidIndex)-3.59682937695875 -5.203e-01 4.528e-01 -1.149 0.250608
## as.factor(AcidIndex)-1.79176983045029 -5.102e-01 4.480e-01 -1.139 0.254789
## as.factor(AcidIndex)-0.545318540973785 -5.423e-01 4.477e-01 -1.211 0.225791
## as.factor(AcidIndex)0.362910765511677 -5.744e-01 4.477e-01 -1.283 0.199516
## as.factor(AcidIndex)1.05172974217783 -6.939e-01 4.480e-01 -1.549 0.121397
## as.factor(AcidIndex)1.59059728918163 -8.164e-01 4.488e-01 -1.819 0.068900
## as.factor(AcidIndex)2.02271372429848 -1.199e+00 4.521e-01 -2.652 0.008002
## as.factor(AcidIndex)2.37629509167962 -1.332e+00 4.585e-01 -2.905 0.003674
## as.factor(AcidIndex)2.67051656830802 -1.058e+00 4.605e-01 -2.298 0.021577
## as.factor(AcidIndex)2.9188445277671 -1.184e+00 4.722e-01 -2.508 0.012145
## as.factor(AcidIndex)3.13100139587667 -7.431e-01 5.326e-01 -1.395 0.162964
## as.factor(AcidIndex)3.31417429494859 -3.814e+01 3.773e+07 0.000 0.999999
## as.factor(AcidIndex)3.47378568897179 -1.510e+00 6.328e-01 -2.386 0.017039
## as.factor(STARS)-0.42623524866846 7.580e-01 2.192e-02 34.575 < 2e-16
## as.factor(STARS)0.416552574962037 1.073e+00 2.049e-02 52.332 < 2e-16
## as.factor(STARS)1.25934039859254 1.188e+00 2.162e-02 54.951 < 2e-16
## as.factor(STARS)2.10212822222303 1.311e+00 2.724e-02 48.142 < 2e-16
##
## (Intercept)
## VolatileAcidity ***
## FreeSulfurDioxide *
## TotalSulfurDioxide **
## Alcohol **
## as.factor(LabelAppeal)-1.11204793733397 ***
## as.factor(LabelAppeal)0.0101741115806247 ***
## as.factor(LabelAppeal)1.13239616049522 ***
## as.factor(LabelAppeal)2.25461820940981 ***
## as.factor(AcidIndex)-3.59682937695875
## as.factor(AcidIndex)-1.79176983045029
## as.factor(AcidIndex)-0.545318540973785
## as.factor(AcidIndex)0.362910765511677
## as.factor(AcidIndex)1.05172974217783
## as.factor(AcidIndex)1.59059728918163 .
## as.factor(AcidIndex)2.02271372429848 **
## as.factor(AcidIndex)2.37629509167962 **
## as.factor(AcidIndex)2.67051656830802 *
## as.factor(AcidIndex)2.9188445277671 *
## as.factor(AcidIndex)3.13100139587667
## as.factor(AcidIndex)3.31417429494859
## as.factor(AcidIndex)3.47378568897179 *
## as.factor(STARS)-0.42623524866846 ***
## as.factor(STARS)0.416552574962037 ***
## as.factor(STARS)1.25934039859254 ***
## as.factor(STARS)2.10212822222303 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(40382.19) family taken to be 1)
##
## Null deviance: 18299 on 10235 degrees of freedom
## Residual deviance: 10899 on 10210 degrees of freedom
## AIC: 36510
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 40382
## Std. Err.: 37785
## Warning while fitting theta: alternation limit reached
##
## 2 x log-likelihood: -36456.02
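The very large theta (about 40,382, with a warning that the alternation limit was reached) suggests the negative binomial fit is essentially a Poisson fit: in the `glm.nb` parameterization the variance is mu + mu^2/theta, so the extra term becomes negligible as theta grows. The short calculation below uses only numbers printed in the summaries above. It checks the deviance-based dispersion of Model 1 and reconstructs the negative binomial AIC from the reported log-likelihood.

```r
# Values copied from the Model 1 (Poisson) summary.
resid_dev <- 10887
resid_df  <- 10203

# A ratio close to 1 suggests little overdispersion relative to Poisson.
dispersion_ratio <- resid_dev / resid_df
print(dispersion_ratio)

# Values copied from the negative binomial summary.
two_loglik <- -36456.02
n_coef     <- 26          # rows in the coefficient table, intercept included
n_params   <- n_coef + 1  # plus the estimated theta

# AIC = -2 * logLik + 2 * (number of estimated parameters)
aic_nb <- -two_loglik + 2 * n_params
print(aic_nb)
```

A dispersion ratio of about 1.07 and an AIC of 36510.02 are consistent with the printed output: the extra theta parameter costs 2 AIC points without improving the fit.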
For the fourth model we run stepAIC on the negative binomial model and refit with the selected predictors. stepAIC retains every term, so the resulting model is identical to Model 3.
## Start: AIC=36508.02
## TARGET ~ VolatileAcidity + FreeSulfurDioxide + TotalSulfurDioxide +
## Alcohol + as.factor(LabelAppeal) + as.factor(AcidIndex) +
## as.factor(STARS)
##
## Call:
## glm.nb(formula = TARGET ~ VolatileAcidity + FreeSulfurDioxide +
## TotalSulfurDioxide + Alcohol + as.factor(LabelAppeal) + as.factor(AcidIndex) +
## as.factor(STARS), data = training_df, init.theta = 40382.20944,
## link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.2483 -0.6519 -0.0011 0.4398 3.6855
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.316e-01 4.492e-01 0.961 0.336574
## VolatileAcidity -2.200e-02 5.737e-03 -3.835 0.000126
## FreeSulfurDioxide 1.324e-02 5.769e-03 2.296 0.021688
## TotalSulfurDioxide 1.781e-02 5.846e-03 3.047 0.002313
## Alcohol 1.819e-02 5.898e-03 3.083 0.002046
## as.factor(LabelAppeal)-1.11204793733397 2.204e-01 4.277e-02 5.154 2.55e-07
## as.factor(LabelAppeal)0.0101741115806247 4.111e-01 4.172e-02 9.854 < 2e-16
## as.factor(LabelAppeal)1.13239616049522 5.501e-01 4.245e-02 12.960 < 2e-16
## as.factor(LabelAppeal)2.25461820940981 6.831e-01 4.769e-02 14.323 < 2e-16
## as.factor(AcidIndex)-3.59682937695875 -5.203e-01 4.528e-01 -1.149 0.250608
## as.factor(AcidIndex)-1.79176983045029 -5.102e-01 4.480e-01 -1.139 0.254789
## as.factor(AcidIndex)-0.545318540973785 -5.423e-01 4.477e-01 -1.211 0.225791
## as.factor(AcidIndex)0.362910765511677 -5.744e-01 4.477e-01 -1.283 0.199516
## as.factor(AcidIndex)1.05172974217783 -6.939e-01 4.480e-01 -1.549 0.121397
## as.factor(AcidIndex)1.59059728918163 -8.164e-01 4.488e-01 -1.819 0.068900
## as.factor(AcidIndex)2.02271372429848 -1.199e+00 4.521e-01 -2.652 0.008002
## as.factor(AcidIndex)2.37629509167962 -1.332e+00 4.585e-01 -2.905 0.003674
## as.factor(AcidIndex)2.67051656830802 -1.058e+00 4.605e-01 -2.298 0.021577
## as.factor(AcidIndex)2.9188445277671 -1.184e+00 4.722e-01 -2.508 0.012145
## as.factor(AcidIndex)3.13100139587667 -7.431e-01 5.326e-01 -1.395 0.162964
## as.factor(AcidIndex)3.31417429494859 -3.814e+01 3.773e+07 0.000 0.999999
## as.factor(AcidIndex)3.47378568897179 -1.510e+00 6.328e-01 -2.386 0.017039
## as.factor(STARS)-0.42623524866846 7.580e-01 2.192e-02 34.575 < 2e-16
## as.factor(STARS)0.416552574962037 1.073e+00 2.049e-02 52.332 < 2e-16
## as.factor(STARS)1.25934039859254 1.188e+00 2.162e-02 54.951 < 2e-16
## as.factor(STARS)2.10212822222303 1.311e+00 2.724e-02 48.142 < 2e-16
##
## (Intercept)
## VolatileAcidity ***
## FreeSulfurDioxide *
## TotalSulfurDioxide **
## Alcohol **
## as.factor(LabelAppeal)-1.11204793733397 ***
## as.factor(LabelAppeal)0.0101741115806247 ***
## as.factor(LabelAppeal)1.13239616049522 ***
## as.factor(LabelAppeal)2.25461820940981 ***
## as.factor(AcidIndex)-3.59682937695875
## as.factor(AcidIndex)-1.79176983045029
## as.factor(AcidIndex)-0.545318540973785
## as.factor(AcidIndex)0.362910765511677
## as.factor(AcidIndex)1.05172974217783
## as.factor(AcidIndex)1.59059728918163 .
## as.factor(AcidIndex)2.02271372429848 **
## as.factor(AcidIndex)2.37629509167962 **
## as.factor(AcidIndex)2.67051656830802 *
## as.factor(AcidIndex)2.9188445277671 *
## as.factor(AcidIndex)3.13100139587667
## as.factor(AcidIndex)3.31417429494859
## as.factor(AcidIndex)3.47378568897179 *
## as.factor(STARS)-0.42623524866846 ***
## as.factor(STARS)0.416552574962037 ***
## as.factor(STARS)1.25934039859254 ***
## as.factor(STARS)2.10212822222303 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(40382.19) family taken to be 1)
##
## Null deviance: 18299 on 10235 degrees of freedom
## Residual deviance: 10899 on 10210 degrees of freedom
## AIC: 36510
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 40382
## Std. Err.: 37785
## Warning while fitting theta: alternation limit reached
##
## 2 x log-likelihood: -36456.02
Among the four models, Poisson 2 fares best: it has the lowest AIC, and its MSE is almost identical to that of Poisson 1, which has the lowest MSE. The two negative binomial models have both a higher AIC and a higher MSE. On this basis we choose Poisson Model 2 as our best model.
| | Poisson1 | Poisson2 | Neg binomial1 | Neg binomial2 |
|---|---|---|---|---|
| MSE | 6.759896 | 6.760038 | 7.117969 | 7.117969 |
| Predictors | 33 | 29 | 26 | 26 |
| AIC | 36509.069767 | 36502.586471 | 36510.019636 | 36510.019636 |
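A comparison row like those in the table above could be assembled roughly as follows. The data are simulated and the object names are hypothetical; the report does not show how its MSE was computed, so this is only one reasonable way to do it.

```r
set.seed(3)

# Simulated count data, split 80/20 into training and testing rows.
n <- 1500
x <- rnorm(n)
y <- rpois(n, lambda = exp(0.6 + 0.4 * x))
sim <- data.frame(y, x)

train_idx <- sample(seq_len(n), size = round(0.8 * n))
train <- sim[train_idx, ]
test  <- sim[-train_idx, ]

fit <- glm(y ~ x, family = poisson, data = train)

# type = "response" returns predicted counts rather than log-counts.
pred <- predict(fit, newdata = test, type = "response")

# Mean squared error on the held-out rows.
mse <- mean((test$y - pred)^2)

comparison <- data.frame(
  Model      = "Poisson (simulated)",
  MSE        = mse,
  Predictors = length(coef(fit)),  # counts the intercept, as in the table
  AIC        = AIC(fit)
)
print(comparison)
```

Note that `predict()` on a glm defaults to the link scale, so leaving out `type = "response"` would compare log-counts with observed counts and inflate the MSE.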