Given a set of metrics describing the quality of different wines, can we predict the number of sales of each wine? Since the target is a count response variable, we will build models appropriate for count regression. After evaluating our set of models, we will select the best one and predict the number of sales in a wine data set the model has not yet seen.
## ï..INDEX TARGET FixedAcidity VolatileAcidity
## Min. : 1 Min. :0.000 Min. :-18.100 Min. :-2.7900
## 1st Qu.: 4038 1st Qu.:2.000 1st Qu.: 5.200 1st Qu.: 0.1300
## Median : 8110 Median :3.000 Median : 6.900 Median : 0.2800
## Mean : 8070 Mean :3.029 Mean : 7.076 Mean : 0.3241
## 3rd Qu.:12106 3rd Qu.:4.000 3rd Qu.: 9.500 3rd Qu.: 0.6400
## Max. :16129 Max. :8.000 Max. : 34.400 Max. : 3.6800
##
## CitricAcid ResidualSugar Chlorides FreeSulfurDioxide
## Min. :-3.2400 Min. :-127.800 Min. :-1.1710 Min. :-555.00
## 1st Qu.: 0.0300 1st Qu.: -2.000 1st Qu.:-0.0310 1st Qu.: 0.00
## Median : 0.3100 Median : 3.900 Median : 0.0460 Median : 30.00
## Mean : 0.3084 Mean : 5.419 Mean : 0.0548 Mean : 30.85
## 3rd Qu.: 0.5800 3rd Qu.: 15.900 3rd Qu.: 0.1530 3rd Qu.: 70.00
## Max. : 3.8600 Max. : 141.150 Max. : 1.3510 Max. : 623.00
## NA's :616 NA's :638 NA's :647
## TotalSulfurDioxide Density pH Sulphates
## Min. :-823.0 Min. :0.8881 Min. :0.480 Min. :-3.1300
## 1st Qu.: 27.0 1st Qu.:0.9877 1st Qu.:2.960 1st Qu.: 0.2800
## Median : 123.0 Median :0.9945 Median :3.200 Median : 0.5000
## Mean : 120.7 Mean :0.9942 Mean :3.208 Mean : 0.5271
## 3rd Qu.: 208.0 3rd Qu.:1.0005 3rd Qu.:3.470 3rd Qu.: 0.8600
## Max. :1057.0 Max. :1.0992 Max. :6.130 Max. : 4.2400
## NA's :682 NA's :395 NA's :1210
## Alcohol LabelAppeal AcidIndex STARS
## Min. :-4.70 Min. :-2.000000 Min. : 4.000 Min. :1.000
## 1st Qu.: 9.00 1st Qu.:-1.000000 1st Qu.: 7.000 1st Qu.:1.000
## Median :10.40 Median : 0.000000 Median : 8.000 Median :2.000
## Mean :10.49 Mean :-0.009066 Mean : 7.773 Mean :2.042
## 3rd Qu.:12.40 3rd Qu.: 1.000000 3rd Qu.: 8.000 3rd Qu.:3.000
## Max. :26.50 Max. : 2.000000 Max. :17.000 Max. :4.000
## NA's :653 NA's :3359
## 'data.frame': 12795 obs. of 16 variables:
## $ ï..INDEX : int 1 2 4 5 6 7 8 11 12 13 ...
## $ TARGET : int 3 3 5 3 4 0 0 4 3 6 ...
## $ FixedAcidity : num 3.2 4.5 7.1 5.7 8 11.3 7.7 6.5 14.8 5.5 ...
## $ VolatileAcidity : num 1.16 0.16 2.64 0.385 0.33 0.32 0.29 -1.22 0.27 -0.22 ...
## $ CitricAcid : num -0.98 -0.81 -0.88 0.04 -1.26 0.59 -0.4 0.34 1.05 0.39 ...
## $ ResidualSugar : num 54.2 26.1 14.8 18.8 9.4 ...
## $ Chlorides : num -0.567 -0.425 0.037 -0.425 NA 0.556 0.06 0.04 -0.007 -0.277 ...
## $ FreeSulfurDioxide : num NA 15 214 22 -167 -37 287 523 -213 62 ...
## $ TotalSulfurDioxide: num 268 -327 142 115 108 15 156 551 NA 180 ...
## $ Density : num 0.993 1.028 0.995 0.996 0.995 ...
## $ pH : num 3.33 3.38 3.12 2.24 3.12 3.2 3.49 3.2 4.93 3.09 ...
## $ Sulphates : num -0.59 0.7 0.48 1.83 1.77 1.29 1.21 NA 0.26 0.75 ...
## $ Alcohol : num 9.9 NA 22 6.2 13.7 15.4 10.3 11.6 15 12.6 ...
## $ LabelAppeal : int 0 -1 -1 -1 0 0 0 1 0 0 ...
## $ AcidIndex : int 8 7 8 6 9 11 8 7 6 8 ...
## $ STARS : int 2 3 3 1 2 NA NA 3 NA 4 ...
TARGET, the response variable, is the number of sample
cases of wine that were purchased by wine distribution companies after
sampling a wine. Apart from STARS and LabelAppeal, the
predictors are mostly chemical metrics of the wines. STARS are ratings given to the wines by
experts, whereas values in LabelAppeal are marketing scores
that indicate the level of visual appeal of the wine label to customers. Note
that LabelAppeal is not a score given by customers
themselves, but by marketing tools that have used other sources to make
assumptions.
## ï..INDEX TARGET FixedAcidity VolatileAcidity
## 0.000000 0.000000 0.000000 0.000000
## CitricAcid ResidualSugar Chlorides FreeSulfurDioxide
## 0.000000 4.814381 4.986323 5.056663
## TotalSulfurDioxide Density pH Sulphates
## 5.330207 0.000000 3.087143 9.456819
## Alcohol LabelAppeal AcidIndex STARS
## 5.103556 0.000000 0.000000 26.252442
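The percentages above are the share of missing values in each column. A minimal sketch of how such a table can be computed in base R is shown here; `wine_toy` and its values are made up for illustration and are not the wine data.

```r
# Toy data frame standing in for the wine data (values are illustrative only).
wine_toy <- data.frame(
  TARGET = c(3, 0, 4, 2),
  pH     = c(3.3, NA, 3.1, 3.2),
  STARS  = c(2, NA, NA, 1)
)

# Share of missing values per column, expressed as a percentage.
pct_missing <- colMeans(is.na(wine_toy)) * 100
print(pct_missing)
```

`is.na()` returns a logical matrix, so `colMeans()` gives the fraction of `TRUE` values per column.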
Over 25% of the wines don’t have a value for STARS,
meaning they have not been rated by experts. What is the relationship
between lack of a rating and number of sales?
There appears to be a positive relationship between STARS,
LabelAppeal, and TARGET. This makes sense
because the better a wine label appears, the more likely a customer will
buy the wine. And if an expert rates a wine highly, it is indicative
that other people will like it as well and decide to buy it.
There also appears to be a relationship between AcidIndex and TARGET, and it is interesting
that this does not appear to be the case for pH and
TARGET. Since pH is also a metric for acidity, we might
have expected a relationship to exist.
Note that TARGET contains many 0 values. This means that
many of the wines have 0 sales.
Upon further inspection, we determine that 21% of the wines have 0
sales. We will explore how much the frequency of zeroes in
TARGET affects our models later.
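As a rough marginal check that ignores the predictors, the observed share of zeros can be compared with what a plain Poisson distribution with the same mean would produce. The sketch below uses the TARGET mean from the summary output (3.029) and the 21% figure from the text; the variable names are illustrative.

```r
# Mean of TARGET taken from the summary output above.
target_mean <- 3.029

# Probability of a zero count under a Poisson distribution with that mean.
p_zero_poisson <- dpois(0, lambda = target_mean)

# Observed share of zeros reported in the text.
p_zero_observed <- 0.21

cat(sprintf("Poisson-implied zeros: %.1f%%, observed: %.0f%%\n",
            100 * p_zero_poisson, 100 * p_zero_observed))
```

A Poisson distribution with this mean puts only about 5% of its mass at zero, well below the observed 21%, which is one reason to look at the zeros more closely.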
AcidIndex, STARS, LabelAppeal, and the
target variable all take a small range of integer values, so it may be
easier to see their relationships if
the data are jittered. Also, how large is the difference between
pH and AcidIndex?
The jittered plots make it easier to see the relationships between LabelAppeal, STARS, and
TARGET. The AcidIndex plot reveals that most of the wines have
a lower total acidity, with index values between 5 and 10.
##
## Call:
## lm(formula = TARGET ~ ., data = select(wine.train, -"INDEX"))
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.0614 -0.5143 0.1240 0.7170 3.2419
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.563e+00 5.530e-01 8.251 < 2e-16 ***
## FixedAcidity 1.685e-03 2.319e-03 0.727 0.4675
## VolatileAcidity -9.466e-02 1.846e-02 -5.129 3.00e-07 ***
## CitricAcid -4.836e-03 1.675e-02 -0.289 0.7728
## ResidualSugar -2.513e-04 4.276e-04 -0.588 0.5567
## Chlorides -1.134e-01 4.546e-02 -2.494 0.0126 *
## FreeSulfurDioxide 2.264e-04 9.711e-05 2.332 0.0198 *
## TotalSulfurDioxide 7.810e-05 6.288e-05 1.242 0.2142
## Density -1.281e+00 5.435e-01 -2.357 0.0185 *
## pH -9.441e-03 2.121e-02 -0.445 0.6563
## Sulphates -1.727e-02 1.558e-02 -1.109 0.2676
## Alcohol 1.653e-02 3.887e-03 4.252 2.15e-05 ***
## LabelAppeal 6.442e-01 1.743e-02 36.947 < 2e-16 ***
## AcidIndex -1.649e-01 1.235e-02 -13.346 < 2e-16 ***
## STARS 7.278e-01 1.710e-02 42.571 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.153 on 6421 degrees of freedom
## (6359 observations deleted due to missingness)
## Multiple R-squared: 0.445, Adjusted R-squared: 0.4438
## F-statistic: 367.8 on 14 and 6421 DF, p-value: < 2.2e-16
44% of the variability observed in the number of sales is explained by the model.
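We refit the model using the highly significant variables from the previous model (VolatileAcidity, Alcohol, LabelAppeal, AcidIndex, and STARS). Sulphates is also retained, although it is not significant in either fit: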
##
## Call:
## lm(formula = TARGET ~ VolatileAcidity + Sulphates + Alcohol +
## LabelAppeal + AcidIndex + STARS, data = wine.train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.0351 -0.5123 0.1289 0.7224 3.1485
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.282082 0.100168 32.766 < 2e-16 ***
## VolatileAcidity -0.090141 0.016389 -5.500 3.91e-08 ***
## Sulphates -0.014904 0.013840 -1.077 0.282
## Alcohol 0.020198 0.003436 5.879 4.29e-09 ***
## LabelAppeal 0.653326 0.015456 42.269 < 2e-16 ***
## AcidIndex -0.167059 0.010937 -15.274 < 2e-16 ***
## STARS 0.719125 0.015136 47.511 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.152 on 8127 degrees of freedom
## (4661 observations deleted due to missingness)
## Multiple R-squared: 0.4472, Adjusted R-squared: 0.4468
## F-statistic: 1096 on 6 and 8127 DF, p-value: < 2.2e-16
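The adjusted \(R^{2}\) is hardly better (0.447 versus 0.444), as seen in the output below. Note that the two models were fit to different numbers of observations (6436 versus 8134) because rows with missing values in the model variables are dropped: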
## # A tibble: 1 x 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.447 0.447 1.15 1096. 0 6 -12687. 25391. 25447.
## # ... with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
Multicollinearity occurs when independent variables are correlated with one another. The variance inflation factor (VIF) is a metric that can be used to detect multicollinearity between variables in a model. A VIF above 5 is commonly treated as a sign of severe multicollinearity: it inflates the standard errors of the affected coefficients, so those variables appear less statistically significant. If there is a problem with multicollinearity, one solution is to carefully trim the model by removing some of the offending variables.
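To make the definition concrete, the VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the other predictors. The sketch below computes it by hand in base R on simulated data; `vif_manual` and the toy variables are hypothetical names for illustration and are not part of the original analysis.

```r
set.seed(1)
n  <- 500
x1 <- rnorm(n)
x2 <- 0.9 * x1 + rnorm(n, sd = 0.3)  # deliberately correlated with x1
x3 <- rnorm(n)                        # independent of the others
toy <- data.frame(x1, x2, x3)

# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
# predictor j on all the other predictors.
vif_manual <- function(data) {
  sapply(names(data), function(v) {
    others <- setdiff(names(data), v)
    r2 <- summary(lm(reformulate(others, response = v), data = data))$r.squared
    1 / (1 - r2)
  })
}

vifs <- vif_manual(toy)
print(round(vifs, 2))
```

In this toy example the correlated pair `x1` and `x2` gets a large VIF, while the independent `x3` stays close to 1, which is the pattern seen for all predictors in the wine models.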
## FixedAcidity VolatileAcidity CitricAcid ResidualSugar
## 1.027295 1.003524 1.006280 1.002597
## Chlorides FreeSulfurDioxide TotalSulfurDioxide Density
## 1.003084 1.004023 1.002931 1.004776
## pH Sulphates Alcohol LabelAppeal
## 1.004116 1.004483 1.009935 1.117057
## AcidIndex STARS
## 1.048563 1.134199
## VolatileAcidity Sulphates Alcohol LabelAppeal AcidIndex
## 1.001844 1.000869 1.006664 1.126009 1.012806
## STARS
## 1.138001
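In each model, the VIF scores are very close to 1, so multicollinearity
is not a concern. Still, TARGET is a non-negative count with many zeros,
and both models explain less than half of its variability, so
a multiple linear regression is not the
right fit for this data.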
##
## Call:
## glm(formula = TARGET ~ FixedAcidity + VolatileAcidity + CitricAcid +
## ResidualSugar + Chlorides + FreeSulfurDioxide + TotalSulfurDioxide +
## Density + pH + Sulphates + Alcohol + as.factor(LabelAppeal) +
## as.factor(AcidIndex) + as.factor(STARS), family = poisson,
## data = wine.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.2212 -0.2662 0.0461 0.3943 1.7274
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 7.306e-01 5.130e-01 1.424 0.15441
## FixedAcidity 5.803e-04 1.055e-03 0.550 0.58216
## VolatileAcidity -2.371e-02 8.369e-03 -2.833 0.00461 **
## CitricAcid -2.313e-03 7.581e-03 -0.305 0.76034
## ResidualSugar -7.068e-05 1.943e-04 -0.364 0.71598
## Chlorides -3.261e-02 2.056e-02 -1.587 0.11260
## FreeSulfurDioxide 5.617e-05 4.399e-05 1.277 0.20173
## TotalSulfurDioxide 1.985e-05 2.858e-05 0.695 0.48734
## Density -3.803e-01 2.464e-01 -1.543 0.12274
## pH -1.103e-03 9.614e-03 -0.115 0.90867
## Sulphates -5.343e-03 7.055e-03 -0.757 0.44883
## Alcohol 4.762e-03 1.774e-03 2.685 0.00726 **
## as.factor(LabelAppeal)-1 2.701e-01 5.337e-02 5.061 4.18e-07 ***
## as.factor(LabelAppeal)0 4.943e-01 5.205e-02 9.497 < 2e-16 ***
## as.factor(LabelAppeal)1 6.493e-01 5.277e-02 12.305 < 2e-16 ***
## as.factor(LabelAppeal)2 7.637e-01 5.840e-02 13.077 < 2e-16 ***
## as.factor(AcidIndex)5 1.186e-01 4.553e-01 0.260 0.79455
## as.factor(AcidIndex)6 1.902e-01 4.487e-01 0.424 0.67166
## as.factor(AcidIndex)7 1.536e-01 4.484e-01 0.343 0.73190
## as.factor(AcidIndex)8 1.286e-01 4.485e-01 0.287 0.77430
## as.factor(AcidIndex)9 7.794e-02 4.488e-01 0.174 0.86215
## as.factor(AcidIndex)10 -1.910e-02 4.502e-01 -0.042 0.96617
## as.factor(AcidIndex)11 -2.417e-01 4.540e-01 -0.532 0.59456
## as.factor(AcidIndex)12 -2.259e-01 4.605e-01 -0.490 0.62379
## as.factor(AcidIndex)13 -1.328e-01 4.634e-01 -0.287 0.77447
## as.factor(AcidIndex)14 -1.998e-01 4.813e-01 -0.415 0.67806
## as.factor(AcidIndex)15 2.322e-02 5.591e-01 0.042 0.96687
## as.factor(AcidIndex)16 -2.005e-01 6.341e-01 -0.316 0.75185
## as.factor(AcidIndex)17 7.303e-02 6.351e-01 0.115 0.90845
## as.factor(STARS)2 3.175e-01 1.744e-02 18.198 < 2e-16 ***
## as.factor(STARS)3 4.320e-01 1.899e-02 22.747 < 2e-16 ***
## as.factor(STARS)4 5.519e-01 2.692e-02 20.504 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 5844.1 on 6435 degrees of freedom
## Residual deviance: 3890.7 on 6404 degrees of freedom
## (6359 observations deleted due to missingness)
## AIC: 23087
##
## Number of Fisher Scoring iterations: 5
##
## Overdispersion test
##
## data: Poisson_Model1
## z = -46.626, p-value = 1
## alternative hypothesis: true alpha is greater than 0
## sample estimates:
## alpha
## -0.5779479
Since the p-value is 1, there is no evidence of over-dispersion, which is good. The negative alpha estimate (-0.578) instead points to under-dispersion.
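A quick way to gauge dispersion without a formal test is the Pearson dispersion statistic: the sum of squared Pearson residuals divided by the residual degrees of freedom. Values near 1 are consistent with a Poisson model, values above 1 suggest over-dispersion, and values below 1 suggest under-dispersion. The sketch below uses simulated under-dispersed counts, not the wine data, and does not reproduce the test shown above.

```r
set.seed(2)
n <- 2000
x <- runif(n, -1, 1)
mu <- exp(0.5 + 0.3 * x)

# Binomial counts have variance below their mean, so they serve as a
# simple under-dispersed example: with size = 4, Var = mu * (1 - mu / 4).
y <- rbinom(n, size = 4, prob = mu / 4)

fit <- glm(y ~ x, family = poisson)

# Pearson dispersion statistic: about 1 under a correct Poisson model.
dispersion <- sum(residuals(fit, type = "pearson")^2) / df.residual(fit)
print(dispersion)
```

This is the same quantity that a quasi-Poisson fit reports as its dispersion parameter.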
We fit a reduced model that drops FixedAcidity, CitricAcid, ResidualSugar, Density, and pH from Model 3:
##
## Call:
## glm(formula = TARGET ~ VolatileAcidity + Chlorides + FreeSulfurDioxide +
## TotalSulfurDioxide + Sulphates + Alcohol + as.factor(LabelAppeal) +
## as.factor(AcidIndex) + as.factor(STARS), family = poisson,
## data = wine.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.2167 -0.2669 0.0479 0.3933 2.0507
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.750e-01 4.512e-01 0.831 0.40593
## VolatileAcidity -2.133e-02 8.008e-03 -2.663 0.00775 **
## Chlorides -2.800e-02 1.973e-02 -1.420 0.15571
## FreeSulfurDioxide 7.109e-05 4.206e-05 1.690 0.09101 .
## TotalSulfurDioxide 2.420e-05 2.745e-05 0.881 0.37807
## Sulphates -4.931e-03 6.766e-03 -0.729 0.46614
## Alcohol 4.738e-03 1.703e-03 2.782 0.00541 **
## as.factor(LabelAppeal)-1 2.515e-01 5.140e-02 4.892 9.96e-07 ***
## as.factor(LabelAppeal)0 4.722e-01 5.016e-02 9.414 < 2e-16 ***
## as.factor(LabelAppeal)1 6.277e-01 5.086e-02 12.340 < 2e-16 ***
## as.factor(LabelAppeal)2 7.395e-01 5.605e-02 13.194 < 2e-16 ***
## as.factor(AcidIndex)5 1.177e-01 4.545e-01 0.259 0.79571
## as.factor(AcidIndex)6 1.790e-01 4.482e-01 0.399 0.68958
## as.factor(AcidIndex)7 1.493e-01 4.479e-01 0.333 0.73883
## as.factor(AcidIndex)8 1.213e-01 4.480e-01 0.271 0.78663
## as.factor(AcidIndex)9 6.778e-02 4.483e-01 0.151 0.87982
## as.factor(AcidIndex)10 -1.896e-02 4.496e-01 -0.042 0.96636
## as.factor(AcidIndex)11 -2.527e-01 4.533e-01 -0.557 0.57724
## as.factor(AcidIndex)12 -2.125e-01 4.590e-01 -0.463 0.64335
## as.factor(AcidIndex)13 -1.314e-01 4.622e-01 -0.284 0.77627
## as.factor(AcidIndex)14 -3.701e-01 4.807e-01 -0.770 0.44131
## as.factor(AcidIndex)15 1.800e-02 5.585e-01 0.032 0.97430
## as.factor(AcidIndex)16 -1.916e-01 6.333e-01 -0.303 0.76226
## as.factor(AcidIndex)17 3.635e-02 6.337e-01 0.057 0.95425
## as.factor(STARS)2 3.226e-01 1.679e-02 19.210 < 2e-16 ***
## as.factor(STARS)3 4.381e-01 1.826e-02 23.999 < 2e-16 ***
## as.factor(STARS)4 5.566e-01 2.565e-02 21.701 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 6339.3 on 6952 degrees of freedom
## Residual deviance: 4216.8 on 6926 degrees of freedom
## (5842 observations deleted due to missingness)
## AIC: 24948
##
## Number of Fisher Scoring iterations: 5
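The residual deviance is higher than before (4216.8 on 6926 degrees of freedom versus 3890.7 on 6404). Furthermore, the AIC increased from 23087 to 24948. Taken at face value, this suggests that Poisson Model 1 is a better fit than Poisson Model 2. However, the two models were fit to different numbers of observations (6436 versus 6953) because rows with missing values are dropped, so the comparison is only indicative.
Since the residual deviance is smaller than the residual degrees of freedom, the data appear to be under-dispersed.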
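AIC and deviance are only comparable between models fit to the same rows. Because `glm()` drops rows with missing values in the variables it uses, a model with fewer predictors can end up with more observations. The sketch below shows one way to make the comparison fair by restricting both fits to the rows that are complete for the larger model; the data are simulated and all names are illustrative.

```r
set.seed(3)
n <- 300
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- rpois(n, exp(0.4 + 0.5 * toy$x1))
toy$x2[sample(n, 60)] <- NA  # x2 is missing for some rows

# Naive fits: glm() drops rows with NA in the variables it uses,
# so the two models end up with different sample sizes.
full_naive    <- glm(y ~ x1 + x2, family = poisson, data = toy)
reduced_naive <- glm(y ~ x1,      family = poisson, data = toy)
print(c(full = nobs(full_naive), reduced = nobs(reduced_naive)))

# Comparable fits: restrict both models to rows complete for the full model.
common  <- toy[complete.cases(toy), ]
full    <- glm(y ~ x1 + x2, family = poisson, data = common)
reduced <- glm(y ~ x1,      family = poisson, data = common)
print(AIC(full, reduced))
```

With both models fit to `common`, a lower AIC can be read as a better trade-off between fit and complexity.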
##
## Overdispersion test
##
## data: Poisson_Model2
## z = -48.265, p-value = 1
## alternative hypothesis: true alpha is greater than 0
## sample estimates:
## alpha
## -0.5757886
Since the p-value is 1, there is again no evidence of over-dispersion, which is good. The negative alpha estimate (-0.576) instead points to under-dispersion.
##
## Call:
## glm.nb(formula = TARGET ~ FixedAcidity + VolatileAcidity + CitricAcid +
## ResidualSugar + Chlorides + FreeSulfurDioxide + TotalSulfurDioxide +
## Density + pH + Sulphates + Alcohol + as.factor(LabelAppeal) +
## as.factor(AcidIndex) + as.factor(STARS), data = wine.train,
## init.theta = 134433.0376, link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.2212 -0.2662 0.0461 0.3943 1.7274
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 7.306e-01 5.131e-01 1.424 0.15442
## FixedAcidity 5.803e-04 1.055e-03 0.550 0.58216
## VolatileAcidity -2.371e-02 8.370e-03 -2.833 0.00461 **
## CitricAcid -2.313e-03 7.581e-03 -0.305 0.76034
## ResidualSugar -7.068e-05 1.943e-04 -0.364 0.71599
## Chlorides -3.261e-02 2.056e-02 -1.587 0.11260
## FreeSulfurDioxide 5.617e-05 4.400e-05 1.277 0.20174
## TotalSulfurDioxide 1.985e-05 2.858e-05 0.695 0.48734
## Density -3.803e-01 2.464e-01 -1.543 0.12274
## pH -1.103e-03 9.614e-03 -0.115 0.90867
## Sulphates -5.343e-03 7.055e-03 -0.757 0.44884
## Alcohol 4.762e-03 1.774e-03 2.685 0.00726 **
## as.factor(LabelAppeal)-1 2.701e-01 5.337e-02 5.060 4.18e-07 ***
## as.factor(LabelAppeal)0 4.943e-01 5.205e-02 9.497 < 2e-16 ***
## as.factor(LabelAppeal)1 6.493e-01 5.277e-02 12.305 < 2e-16 ***
## as.factor(LabelAppeal)2 7.637e-01 5.840e-02 13.077 < 2e-16 ***
## as.factor(AcidIndex)5 1.186e-01 4.553e-01 0.260 0.79455
## as.factor(AcidIndex)6 1.902e-01 4.487e-01 0.424 0.67167
## as.factor(AcidIndex)7 1.536e-01 4.484e-01 0.343 0.73190
## as.factor(AcidIndex)8 1.286e-01 4.485e-01 0.287 0.77430
## as.factor(AcidIndex)9 7.794e-02 4.489e-01 0.174 0.86215
## as.factor(AcidIndex)10 -1.910e-02 4.502e-01 -0.042 0.96617
## as.factor(AcidIndex)11 -2.417e-01 4.540e-01 -0.532 0.59456
## as.factor(AcidIndex)12 -2.259e-01 4.605e-01 -0.490 0.62380
## as.factor(AcidIndex)13 -1.328e-01 4.634e-01 -0.287 0.77447
## as.factor(AcidIndex)14 -1.998e-01 4.813e-01 -0.415 0.67806
## as.factor(AcidIndex)15 2.322e-02 5.591e-01 0.042 0.96687
## as.factor(AcidIndex)16 -2.005e-01 6.341e-01 -0.316 0.75185
## as.factor(AcidIndex)17 7.303e-02 6.351e-01 0.115 0.90845
## as.factor(STARS)2 3.175e-01 1.744e-02 18.198 < 2e-16 ***
## as.factor(STARS)3 4.320e-01 1.899e-02 22.747 < 2e-16 ***
## as.factor(STARS)4 5.519e-01 2.692e-02 20.504 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(134433) family taken to be 1)
##
## Null deviance: 5843.9 on 6435 degrees of freedom
## Residual deviance: 3890.6 on 6404 degrees of freedom
## (6359 observations deleted due to missingness)
## AIC: 23089
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 134433
## Std. Err.: 217492
## Warning while fitting theta: iteration limit reached
##
## 2 x log-likelihood: -23023.41
We fit a reduced model that drops FixedAcidity, CitricAcid, ResidualSugar, Density, and pH from Model 5:
##
## Call:
## glm.nb(formula = TARGET ~ VolatileAcidity + Chlorides + FreeSulfurDioxide +
## TotalSulfurDioxide + Sulphates + Alcohol + as.factor(LabelAppeal) +
## as.factor(AcidIndex) + as.factor(STARS), data = wine.train,
## init.theta = 133561.251, link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.2167 -0.2669 0.0479 0.3933 2.0506
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.750e-01 4.512e-01 0.831 0.40593
## VolatileAcidity -2.133e-02 8.008e-03 -2.663 0.00775 **
## Chlorides -2.800e-02 1.973e-02 -1.420 0.15572
## FreeSulfurDioxide 7.109e-05 4.206e-05 1.690 0.09101 .
## TotalSulfurDioxide 2.420e-05 2.745e-05 0.881 0.37807
## Sulphates -4.931e-03 6.766e-03 -0.729 0.46614
## Alcohol 4.738e-03 1.703e-03 2.782 0.00541 **
## as.factor(LabelAppeal)-1 2.515e-01 5.140e-02 4.892 9.96e-07 ***
## as.factor(LabelAppeal)0 4.722e-01 5.016e-02 9.414 < 2e-16 ***
## as.factor(LabelAppeal)1 6.277e-01 5.086e-02 12.340 < 2e-16 ***
## as.factor(LabelAppeal)2 7.395e-01 5.605e-02 13.194 < 2e-16 ***
## as.factor(AcidIndex)5 1.177e-01 4.545e-01 0.259 0.79572
## as.factor(AcidIndex)6 1.790e-01 4.482e-01 0.399 0.68958
## as.factor(AcidIndex)7 1.493e-01 4.479e-01 0.333 0.73883
## as.factor(AcidIndex)8 1.213e-01 4.480e-01 0.271 0.78663
## as.factor(AcidIndex)9 6.778e-02 4.483e-01 0.151 0.87982
## as.factor(AcidIndex)10 -1.896e-02 4.496e-01 -0.042 0.96636
## as.factor(AcidIndex)11 -2.527e-01 4.533e-01 -0.557 0.57725
## as.factor(AcidIndex)12 -2.125e-01 4.590e-01 -0.463 0.64335
## as.factor(AcidIndex)13 -1.314e-01 4.622e-01 -0.284 0.77628
## as.factor(AcidIndex)14 -3.701e-01 4.807e-01 -0.770 0.44132
## as.factor(AcidIndex)15 1.799e-02 5.585e-01 0.032 0.97430
## as.factor(AcidIndex)16 -1.916e-01 6.333e-01 -0.303 0.76226
## as.factor(AcidIndex)17 3.635e-02 6.337e-01 0.057 0.95425
## as.factor(STARS)2 3.226e-01 1.679e-02 19.209 < 2e-16 ***
## as.factor(STARS)3 4.381e-01 1.826e-02 23.999 < 2e-16 ***
## as.factor(STARS)4 5.566e-01 2.565e-02 21.700 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(133561.3) family taken to be 1)
##
## Null deviance: 6339.2 on 6952 degrees of freedom
## Residual deviance: 4216.8 on 6926 degrees of freedom
## (5842 observations deleted due to missingness)
## AIC: 24950
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 133561
## Std. Err.: 206991
## Warning while fitting theta: iteration limit reached
##
## 2 x log-likelihood: -24894.47
Comparing AIC values (23089 versus 24950), Model 5 appears better than Model 6, although the two models were fit to different numbers of observations (6436 versus 6953), so the AIC values are not strictly comparable.
Since the data set indicates under-dispersion, it is a good idea to fit a Quasi-Poisson regression model and check whether we see any difference in the standard error estimates for the regression coefficients.
##
## Call:
## glm(formula = TARGET ~ FixedAcidity + VolatileAcidity + CitricAcid +
## ResidualSugar + Chlorides + FreeSulfurDioxide + TotalSulfurDioxide +
## Density + pH + Sulphates + Alcohol + as.factor(LabelAppeal) +
## as.factor(AcidIndex) + as.factor(STARS), family = quasipoisson,
## data = wine.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.2212 -0.2662 0.0461 0.3943 1.7274
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.306e-01 3.338e-01 2.189 0.0287 *
## FixedAcidity 5.803e-04 6.862e-04 0.846 0.3978
## VolatileAcidity -2.371e-02 5.446e-03 -4.354 1.36e-05 ***
## CitricAcid -2.313e-03 4.933e-03 -0.469 0.6392
## ResidualSugar -7.068e-05 1.264e-04 -0.559 0.5761
## Chlorides -3.261e-02 1.337e-02 -2.438 0.0148 *
## FreeSulfurDioxide 5.617e-05 2.863e-05 1.962 0.0498 *
## TotalSulfurDioxide 1.985e-05 1.860e-05 1.067 0.2858
## Density -3.803e-01 1.603e-01 -2.372 0.0177 *
## pH -1.103e-03 6.256e-03 -0.176 0.8601
## Sulphates -5.343e-03 4.590e-03 -1.164 0.2445
## Alcohol 4.762e-03 1.154e-03 4.126 3.73e-05 ***
## as.factor(LabelAppeal)-1 2.701e-01 3.473e-02 7.777 8.57e-15 ***
## as.factor(LabelAppeal)0 4.943e-01 3.387e-02 14.596 < 2e-16 ***
## as.factor(LabelAppeal)1 6.493e-01 3.433e-02 18.912 < 2e-16 ***
## as.factor(LabelAppeal)2 7.637e-01 3.800e-02 20.098 < 2e-16 ***
## as.factor(AcidIndex)5 1.186e-01 2.963e-01 0.400 0.6890
## as.factor(AcidIndex)6 1.902e-01 2.919e-01 0.651 0.5148
## as.factor(AcidIndex)7 1.536e-01 2.917e-01 0.527 0.5985
## as.factor(AcidIndex)8 1.286e-01 2.918e-01 0.441 0.6594
## as.factor(AcidIndex)9 7.794e-02 2.920e-01 0.267 0.7896
## as.factor(AcidIndex)10 -1.910e-02 2.929e-01 -0.065 0.9480
## as.factor(AcidIndex)11 -2.417e-01 2.954e-01 -0.818 0.4134
## as.factor(AcidIndex)12 -2.259e-01 2.996e-01 -0.754 0.4510
## as.factor(AcidIndex)13 -1.328e-01 3.015e-01 -0.440 0.6597
## as.factor(AcidIndex)14 -1.998e-01 3.132e-01 -0.638 0.5235
## as.factor(AcidIndex)15 2.322e-02 3.638e-01 0.064 0.9491
## as.factor(AcidIndex)16 -2.005e-01 4.126e-01 -0.486 0.6270
## as.factor(AcidIndex)17 7.303e-02 4.132e-01 0.177 0.8597
## as.factor(STARS)2 3.175e-01 1.135e-02 27.968 < 2e-16 ***
## as.factor(STARS)3 4.320e-01 1.236e-02 34.960 < 2e-16 ***
## as.factor(STARS)4 5.519e-01 1.751e-02 31.513 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasipoisson family taken to be 0.4233659)
##
## Null deviance: 5844.1 on 6435 degrees of freedom
## Residual deviance: 3890.7 on 6404 degrees of freedom
## (6359 observations deleted due to missingness)
## AIC: NA
##
## Number of Fisher Scoring iterations: 5
##
## Call:
## glm(formula = TARGET ~ VolatileAcidity + Chlorides + FreeSulfurDioxide +
## TotalSulfurDioxide + Sulphates + Alcohol + as.factor(LabelAppeal) +
## as.factor(AcidIndex) + as.factor(STARS), family = quasipoisson,
## data = wine.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.2167 -0.2669 0.0479 0.3933 2.0507
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.750e-01 2.942e-01 1.275 0.20247
## VolatileAcidity -2.133e-02 5.221e-03 -4.084 4.47e-05 ***
## Chlorides -2.800e-02 1.286e-02 -2.177 0.02949 *
## FreeSulfurDioxide 7.109e-05 2.743e-05 2.592 0.00956 **
## TotalSulfurDioxide 2.420e-05 1.790e-05 1.352 0.17645
## Sulphates -4.931e-03 4.411e-03 -1.118 0.26371
## Alcohol 4.738e-03 1.111e-03 4.267 2.01e-05 ***
## as.factor(LabelAppeal)-1 2.515e-01 3.351e-02 7.504 6.98e-14 ***
## as.factor(LabelAppeal)0 4.722e-01 3.271e-02 14.438 < 2e-16 ***
## as.factor(LabelAppeal)1 6.277e-01 3.316e-02 18.926 < 2e-16 ***
## as.factor(LabelAppeal)2 7.395e-01 3.654e-02 20.237 < 2e-16 ***
## as.factor(AcidIndex)5 1.177e-01 2.964e-01 0.397 0.69132
## as.factor(AcidIndex)6 1.790e-01 2.922e-01 0.613 0.54015
## as.factor(AcidIndex)7 1.493e-01 2.920e-01 0.511 0.60912
## as.factor(AcidIndex)8 1.213e-01 2.921e-01 0.415 0.67803
## as.factor(AcidIndex)9 6.778e-02 2.923e-01 0.232 0.81662
## as.factor(AcidIndex)10 -1.896e-02 2.931e-01 -0.065 0.94843
## as.factor(AcidIndex)11 -2.527e-01 2.955e-01 -0.855 0.39262
## as.factor(AcidIndex)12 -2.125e-01 2.993e-01 -0.710 0.47764
## as.factor(AcidIndex)13 -1.314e-01 3.014e-01 -0.436 0.66296
## as.factor(AcidIndex)14 -3.701e-01 3.134e-01 -1.181 0.23766
## as.factor(AcidIndex)15 1.800e-02 3.642e-01 0.049 0.96059
## as.factor(AcidIndex)16 -1.916e-01 4.129e-01 -0.464 0.64268
## as.factor(AcidIndex)17 3.635e-02 4.131e-01 0.088 0.92989
## as.factor(STARS)2 3.226e-01 1.095e-02 29.462 < 2e-16 ***
## as.factor(STARS)3 4.381e-01 1.190e-02 36.808 < 2e-16 ***
## as.factor(STARS)4 5.566e-01 1.672e-02 33.283 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasipoisson family taken to be 0.4251092)
##
## Null deviance: 6339.3 on 6952 degrees of freedom
## Residual deviance: 4216.8 on 6926 degrees of freedom
## (5842 observations deleted due to missingness)
## AIC: NA
##
## Number of Fisher Scoring iterations: 5
| | pois.coef | negbinom.coef | pois.quasi.coef | pois.stderr | negbinom.stderr | pois.quasi.stderr |
|---|---|---|---|---|---|---|
| (Intercept) | 0.3749910 | 0.3749910 | 0.3749910 | 0.4512069 | 0.4512152 | 0.2941887 |
| VolatileAcidity | -0.0213258 | -0.0213259 | -0.0213258 | 0.0080083 | 0.0080084 | 0.0052214 |
| Chlorides | -0.0280045 | -0.0280046 | -0.0280045 | 0.0197264 | 0.0197267 | 0.0128617 |
| FreeSulfurDioxide | 0.0000711 | 0.0000711 | 0.0000711 | 0.0000421 | 0.0000421 | 0.0000274 |
| TotalSulfurDioxide | 0.0000242 | 0.0000242 | 0.0000242 | 0.0000275 | 0.0000275 | 0.0000179 |
| Sulphates | -0.0049309 | -0.0049309 | -0.0049309 | 0.0067660 | 0.0067661 | 0.0044115 |
| Alcohol | 0.0047382 | 0.0047382 | 0.0047382 | 0.0017033 | 0.0017033 | 0.0011106 |
| as.factor(LabelAppeal)-1 | 0.2514805 | 0.2514805 | 0.2514805 | 0.0514025 | 0.0514029 | 0.0335146 |
| as.factor(LabelAppeal)0 | 0.4722327 | 0.4722327 | 0.4722327 | 0.0501642 | 0.0501646 | 0.0327072 |
| as.factor(LabelAppeal)1 | 0.6276530 | 0.6276530 | 0.6276530 | 0.0508641 | 0.0508645 | 0.0331636 |
From the above table we can see that the model coefficients and standard errors for the Poisson and Negative Binomial regression models are the same up to 4 decimal places. This is expected: the negative binomial model can only represent over-dispersion, so with under-dispersed data its estimated theta becomes very large (about 133,561 here, reached at the iteration limit) and the model effectively reduces to the Poisson regression model.
The model coefficients for the Poisson and Quasi-Poisson regression models are the same, but the estimates for the standard errors are different. This is expected since the data set has under-dispersion: the Quasi-Poisson standard errors are the Poisson ones multiplied by the square root of the estimated dispersion parameter (0.4251), which makes them about 35% smaller.
Standard error estimates from the Poisson regression model will therefore not be accurate. If we need to use these coefficients for inference, it is better to calculate confidence intervals from the Quasi-Poisson standard errors, since that model is better suited for data sets exhibiting under-dispersion or over-dispersion.
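The relationship between the two sets of standard errors can be checked directly: a Quasi-Poisson fit keeps the Poisson coefficients and scales each standard error by the square root of the estimated dispersion. In the table above, for example, 0.4512069 × sqrt(0.4251092) ≈ 0.2942 for the intercept. The sketch below verifies the same identity on simulated under-dispersed counts (not the wine data); all names are illustrative.

```r
set.seed(4)
n <- 1000
x <- runif(n, -1, 1)
# Binomial counts are under-dispersed relative to a Poisson with the same mean.
y <- rbinom(n, size = 4, prob = exp(0.5 + 0.3 * x) / 4)

pois  <- glm(y ~ x, family = poisson)
quasi <- glm(y ~ x, family = quasipoisson)

phi      <- summary(quasi)$dispersion
se_pois  <- coef(summary(pois))[, "Std. Error"]
se_quasi <- coef(summary(quasi))[, "Std. Error"]

# Same coefficients; standard errors differ by a factor of sqrt(phi).
print(cbind(se_pois, se_quasi, ratio = se_quasi / se_pois, sqrt_phi = sqrt(phi)))
```

Because the dispersion is below 1 here, the Quasi-Poisson standard errors are smaller than the Poisson ones, as in the wine models.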
Previously, we determined that 21% of the wines have 0 sales. Also,
over 25% of the wines have not been rated by an expert, indicated by
STARS. Is there a relationship? Of the predictors with
missing values, we can visualize the relationship between them and the
number of sales. Recall that the number of sales amongst all the wines
ranged from 0-8.
The bar at 0 in the STARS plot stands out as the
largest. It indicates that about 2000 wines without experts’ ratings had
no sales, far more than for any other predictor with missing values. There is a clear
relationship between the absence of an expert’s
rating and the number of sales: when there is no rating, a wine is much
more likely to have no sales.
Since there is a large number of 0 sales that is likely related to
the STARS predictor, we can build a zero-inflated negative
binomial model. A zero-inflated model assumes that a zero outcome can
arise from two different processes. For this model, the first process is
tied to the expert rating: it determines whether the wine is an excess
zero. If it is not, the count portion of the model is used instead, and
it can also produce zeros. Since the
first portion has only 2 outcomes, it is modeled with a binomial (logit)
model, while the counts follow a negative binomial model. In other
words, the overall model is a combination of two
models.
Here, the model will take the STARS predictor for the
binomial (zero-inflation) portion, and the remaining predictors in the count
portion.
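The two-process idea can be written as a mixture: with probability `pi_zero` an observation is an excess zero, and otherwise it is drawn from the negative binomial count distribution. The sketch below writes that probability mass function in base R; `dzinb` is a hypothetical helper, and the parameter values are illustrative rather than estimates from the wine data.

```r
# pi_zero: probability of an excess zero from the binary part.
# mu, theta: mean and size of the negative binomial count part.
dzinb <- function(y, pi_zero, mu, theta) {
  count_part <- dnbinom(y, size = theta, mu = mu)
  ifelse(y == 0,
         pi_zero + (1 - pi_zero) * count_part,  # zeros come from both processes
         (1 - pi_zero) * count_part)            # positive counts from the count part only
}

# Illustrative parameter values (not estimates from the wine data).
probs <- dzinb(0:200, pi_zero = 0.2, mu = 3, theta = 10)
print(round(probs[1:4], 3))
print(sum(probs))
```

The probability of a zero under the mixture is always at least as large as under the count distribution alone, which is how the model absorbs the surplus of zero sales.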
##
## Call:
## zeroinfl(formula = TARGET ~ FixedAcidity + VolatileAcidity + CitricAcid +
## ResidualSugar + Chlorides + FreeSulfurDioxide + TotalSulfurDioxide +
## Density + pH + Sulphates + Alcohol + as.factor(LabelAppeal) + as.factor(AcidIndex) |
## STARS, data = wine.train, dist = "negbin")
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## -2.43095 -0.25615 0.05577 0.35670 2.45205
##
## Count model coefficients (negbin with log link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 9.520e-01 5.154e-01 1.847 0.0647 .
## FixedAcidity 6.000e-04 1.075e-03 0.558 0.5768
## VolatileAcidity -1.691e-02 8.479e-03 -1.995 0.0461 *
## CitricAcid -1.945e-03 7.700e-03 -0.253 0.8005
## ResidualSugar -7.271e-05 1.962e-04 -0.371 0.7109
## Chlorides -2.722e-02 2.101e-02 -1.296 0.1951
## FreeSulfurDioxide 7.817e-06 4.425e-05 0.177 0.8598
## TotalSulfurDioxide -2.343e-06 2.827e-05 -0.083 0.9340
## Density -3.595e-01 2.516e-01 -1.429 0.1530
## pH 4.591e-03 9.780e-03 0.469 0.6388
## Sulphates -1.880e-03 7.167e-03 -0.262 0.7930
## Alcohol 8.381e-03 1.794e-03 4.673 2.97e-06 ***
## as.factor(LabelAppeal)-1 3.574e-01 5.428e-02 6.585 4.56e-11 ***
## as.factor(LabelAppeal)0 6.550e-01 5.271e-02 12.427 < 2e-16 ***
## as.factor(LabelAppeal)1 8.703e-01 5.314e-02 16.377 < 2e-16 ***
## as.factor(LabelAppeal)2 1.060e+00 5.859e-02 18.099 < 2e-16 ***
## as.factor(AcidIndex)5 -3.590e-02 4.549e-01 -0.079 0.9371
## as.factor(AcidIndex)6 5.884e-02 4.482e-01 0.131 0.8956
## as.factor(AcidIndex)7 1.884e-02 4.479e-01 0.042 0.9665
## as.factor(AcidIndex)8 -1.553e-03 4.480e-01 -0.003 0.9972
## as.factor(AcidIndex)9 -3.515e-02 4.484e-01 -0.078 0.9375
## as.factor(AcidIndex)10 -1.195e-01 4.499e-01 -0.266 0.7904
## as.factor(AcidIndex)11 -1.876e-01 4.550e-01 -0.412 0.6800
## as.factor(AcidIndex)12 -1.279e-01 4.633e-01 -0.276 0.7825
## as.factor(AcidIndex)13 -3.604e-02 4.668e-01 -0.077 0.9385
## as.factor(AcidIndex)14 -4.569e-02 4.891e-01 -0.093 0.9256
## as.factor(AcidIndex)15 -2.199e-02 5.747e-01 -0.038 0.9695
## as.factor(AcidIndex)16 2.464e-01 6.638e-01 0.371 0.7104
## as.factor(AcidIndex)17 -1.053e-01 6.345e-01 -0.166 0.8683
## Log(theta) 1.811e+01 1.852e+00 9.779 < 2e-16 ***
##
## Zero-inflation model coefficients (binomial with logit link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.3357 0.5346 4.369 1.25e-05 ***
## STARS -3.8692 0.5248 -7.373 1.66e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Theta = 73631876.5559
## Number of iterations in BFGS optimization: 40
## Log-likelihood: -1.134e+04 on 32 Df
The STARS predictor is statistically significant, as
well as VolatileAcidity, Alcohol, and
LabelAppeal. A simpler model with these predictors can be
built.
##
## Call:
## zeroinfl(formula = TARGET ~ VolatileAcidity + Alcohol + as.factor(LabelAppeal) |
## STARS, data = wine.train, dist = "negbin")
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## -2.44720 -0.29223 0.06532 0.35521 2.19805
##
## Count model coefficients (negbin with log link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.603290 0.047589 12.677 < 2e-16 ***
## VolatileAcidity -0.014676 0.007191 -2.041 0.0413 *
## Alcohol 0.009111 0.001496 6.092 1.12e-09 ***
## as.factor(LabelAppeal)-1 0.362691 0.046620 7.780 7.26e-15 ***
## as.factor(LabelAppeal)0 0.657882 0.045361 14.503 < 2e-16 ***
## as.factor(LabelAppeal)1 0.879500 0.045672 19.257 < 2e-16 ***
## as.factor(LabelAppeal)2 1.065331 0.049974 21.318 < 2e-16 ***
## Log(theta) 17.274647 NaN NaN NaN
##
## Zero-inflation model coefficients (binomial with logit link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.2194 0.4315 5.143 2.7e-07 ***
## STARS -3.7580 0.4227 -8.891 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Theta = 31789517.9144
## Number of iterations in BFGS optimization: 22
## Log-likelihood: -1.584e+04 on 10 Df
One takeaway from the model output is that the log odds of the number
of sales, TARGET, being an excess zero decrease by about 3.76 for
every one-unit increase in the expert rating (the STARS coefficient in
the zero-inflation component is -3.7580). In other words, the higher
the expert rating, the more likely it is that the wine had at least
one sale.
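To make the log-odds statement concrete, the two zero-inflation estimates from the output above can be converted into an odds ratio and into implied probabilities of an excess zero. This is a minimal sketch that hard-codes the reported estimates (2.2194 and -3.7580); the variable names are illustrative and not part of the original analysis.

```r
# Zero-inflation coefficients copied from the simpler model's output above
zi_intercept <- 2.2194
zi_stars     <- -3.7580

# Multiplicative change in the odds of an excess zero per additional star
odds_ratio <- exp(zi_stars)

# Implied probability of an excess zero at each observed rating (1 to 4)
stars <- 1:4
p_excess_zero <- plogis(zi_intercept + zi_stars * stars)

print(round(odds_ratio, 4))
print(round(data.frame(stars, p_excess_zero), 4))
```

Each additional star multiplies the odds of an excess zero by roughly 0.02, so the implied probability falls quickly as the rating rises.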
Does the evaluation set have characteristics similar to the training set?
## IN TARGET FixedAcidity VolatileAcidity
## 0.000000 100.000000 0.000000 0.000000
## CitricAcid ResidualSugar Chlorides FreeSulfurDioxide
## 0.000000 5.037481 4.137931 4.557721
## TotalSulfurDioxide Density pH Sulphates
## 4.707646 0.000000 3.118441 9.295352
## Alcohol LabelAppeal AcidIndex STARS
## 5.547226 0.000000 0.000000 25.217391
Similar to the training set, over 25% of the wines don’t have a value
for STARS. Now that we have revealed the relationship
between the absence of an expert’s rating and 0 sales, we decide to use
the simpler zero-inflated negative binomial model.
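One caveat when comparing the full and simpler fits: each model drops the rows that are missing any of its own predictors, so the two log-likelihoods are most likely computed on different numbers of observations and are not directly comparable. A hedged sketch of one way around this is to restrict both fits to the same complete rows before comparing them. The example uses simulated data and base R Poisson fits so that it runs on its own; the data frame `toy` and its values are assumptions, and the same filtering step would apply before calling `zeroinfl()`.

```r
set.seed(1)
n <- 500

# Simulated stand-in for the wine data (hypothetical values, not the real set)
toy <- data.frame(
  Alcohol   = rnorm(n, 10, 2),
  Sulphates = rnorm(n, 0.5, 0.3)
)
toy$TARGET <- rpois(n, exp(0.5 + 0.05 * toy$Alcohol))
toy$Sulphates[sample(n, 50)] <- NA  # inject missing values into one predictor

# Keep only rows that are complete for every variable in the larger model
common <- toy[complete.cases(toy[, c("TARGET", "Alcohol", "Sulphates")]), ]

full_fit   <- glm(TARGET ~ Alcohol + Sulphates, data = common, family = poisson)
simple_fit <- glm(TARGET ~ Alcohol,             data = common, family = poisson)

# Both fits now use the same rows, so the information criteria are comparable
print(c(n_full = nobs(full_fit), n_simple = nobs(simple_fit)))
print(AIC(full_fit, simple_fit))
```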
We notice that there are many NAs for TARGET. How many
coincide with NAs in STARS? The number of observations where
both STARS and TARGET are NA is 3,335.
The number of observations where STARS is NA and
TARGET is not NA is 0.
Every wine with an NA in STARS therefore also has an NA in
TARGET. As seen in the training set, the number of sales for
wines without an expert rating is overwhelmingly zero. We can add these
zeroes in place of the NAs in TARGET for the unrated wines. Finally, we
make predictions on the evaluation set.
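The check and the zero fill described above can be sketched as follows. The data frame `toy_eval` is a hypothetical miniature of the evaluation set with made-up values; it only illustrates the cross-tabulation of missingness and the assignment of zero sales to unrated wines.

```r
# Hypothetical miniature of the evaluation set (values are made up)
toy_eval <- data.frame(
  STARS  = c(NA, 2, 3, NA, 1, 4),
  TARGET = rep(NA_integer_, 6)
)

# Cross-tabulate missingness in STARS against missingness in TARGET
na_table <- table(stars_na  = is.na(toy_eval$STARS),
                  target_na = is.na(toy_eval$TARGET))
print(na_table)

# Assign zero sales to unrated wines; rated wines stay NA until predicted
toy_eval$TARGET[is.na(toy_eval$STARS)] <- 0L
print(toy_eval)
```

The remaining NA rows would then be filled from the fitted model, for example with `predict(ZInflatedModel2, newdata = ..., type = "response")` from pscl, assuming the predictors of those rows are available.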
Although the evaluation set is about 25% of the size of the training
set, the distribution of TARGET appears similar. We also
note that the range of TARGET is smaller for the evaluation
set, 0-6, than for the training set, 0-8. In addition, only 2 wines
were predicted to sell exactly once.
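A comparison like the one above can be made numerically by tabulating both sets of counts on a shared set of levels. This is a small sketch under stated assumptions: `train_target` and `eval_pred` are hypothetical vectors standing in for the training TARGET and the rounded evaluation predictions.

```r
# Hypothetical count vectors (not the real training data or predictions)
train_target <- c(0, 0, 3, 4, 4, 5, 3, 2, 6, 8, 3, 4)
eval_pred    <- c(0, 3, 4, 3, 2, 6, 4, 3)

# Use a common set of levels so the proportions line up row by row
levels_all <- 0:max(train_target, eval_pred)
share <- function(x) {
  as.numeric(prop.table(table(factor(x, levels = levels_all))))
}

comparison <- data.frame(
  sales = levels_all,
  train = share(train_target),
  eval  = share(eval_pred)
)
print(round(comparison, 3))

# Ranges of the two sets, one row each
print(rbind(train = range(train_target), eval = range(eval_pred)))
```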
Based on diagnostic plots and visualizations of relationships between variables, we were able to determine that there is a strong connection between the lack of an expert’s rating and whether or not the wine was sold. In order to maximize sales, we propose prioritizing having the wines rated.
knitr::opts_chunk$set(echo = FALSE, warning = FALSE, message = FALSE)
library(corrplot)
library(reshape2)
library(MASS)
library(tidyverse)
library(ggplot2)
library(ggfortify)
library(ggthemes)
library(knitr)
library(broom)
library(caret)
library(leaps)
library(magrittr)
library(betareg)
library(pscl)
library(gtsummary)
library(nnet)
library(arm)
library(AER)
library(kableExtra)
wine.train <- read.csv('https://raw.githubusercontent.com/djunga/DATA621HW5/main/wine-training-data.csv')
wine.eval <- read.csv('https://raw.githubusercontent.com/djunga/DATA621HW5/main/wine-evaluation-data.csv')
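The `ï..INDEX` column name that appears in the output is most likely a UTF-8 byte-order mark at the start of the CSV file being read as part of the first header. A hedged alternative to renaming the column afterwards is to pass `fileEncoding = "UTF-8-BOM"` to `read.csv()`. The sketch below demonstrates this on a temporary file written with a BOM; the file contents are made up for illustration.

```r
# Write a small CSV that starts with a UTF-8 byte-order mark (made-up contents)
tmp <- tempfile(fileext = ".csv")
con <- file(tmp, "wb")
writeBin(as.raw(c(0xEF, 0xBB, 0xBF)), con)               # the BOM bytes
writeBin(charToRaw("INDEX,TARGET\n1,3\n2,0\n"), con)     # header and two rows
close(con)

# Declaring the encoding strips the BOM, so the first header is read cleanly
with_bom_handling <- read.csv(tmp, fileEncoding = "UTF-8-BOM")
print(names(with_bom_handling))

unlink(tmp)
```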
summary(wine.train)
## ï..INDEX TARGET FixedAcidity VolatileAcidity
## Min. : 1 Min. :0.000 Min. :-18.100 Min. :-2.7900
## 1st Qu.: 4038 1st Qu.:2.000 1st Qu.: 5.200 1st Qu.: 0.1300
## Median : 8110 Median :3.000 Median : 6.900 Median : 0.2800
## Mean : 8070 Mean :3.029 Mean : 7.076 Mean : 0.3241
## 3rd Qu.:12106 3rd Qu.:4.000 3rd Qu.: 9.500 3rd Qu.: 0.6400
## Max. :16129 Max. :8.000 Max. : 34.400 Max. : 3.6800
##
## CitricAcid ResidualSugar Chlorides FreeSulfurDioxide
## Min. :-3.2400 Min. :-127.800 Min. :-1.1710 Min. :-555.00
## 1st Qu.: 0.0300 1st Qu.: -2.000 1st Qu.:-0.0310 1st Qu.: 0.00
## Median : 0.3100 Median : 3.900 Median : 0.0460 Median : 30.00
## Mean : 0.3084 Mean : 5.419 Mean : 0.0548 Mean : 30.85
## 3rd Qu.: 0.5800 3rd Qu.: 15.900 3rd Qu.: 0.1530 3rd Qu.: 70.00
## Max. : 3.8600 Max. : 141.150 Max. : 1.3510 Max. : 623.00
## NA's :616 NA's :638 NA's :647
## TotalSulfurDioxide Density pH Sulphates
## Min. :-823.0 Min. :0.8881 Min. :0.480 Min. :-3.1300
## 1st Qu.: 27.0 1st Qu.:0.9877 1st Qu.:2.960 1st Qu.: 0.2800
## Median : 123.0 Median :0.9945 Median :3.200 Median : 0.5000
## Mean : 120.7 Mean :0.9942 Mean :3.208 Mean : 0.5271
## 3rd Qu.: 208.0 3rd Qu.:1.0005 3rd Qu.:3.470 3rd Qu.: 0.8600
## Max. :1057.0 Max. :1.0992 Max. :6.130 Max. : 4.2400
## NA's :682 NA's :395 NA's :1210
## Alcohol LabelAppeal AcidIndex STARS
## Min. :-4.70 Min. :-2.000000 Min. : 4.000 Min. :1.000
## 1st Qu.: 9.00 1st Qu.:-1.000000 1st Qu.: 7.000 1st Qu.:1.000
## Median :10.40 Median : 0.000000 Median : 8.000 Median :2.000
## Mean :10.49 Mean :-0.009066 Mean : 7.773 Mean :2.042
## 3rd Qu.:12.40 3rd Qu.: 1.000000 3rd Qu.: 8.000 3rd Qu.:3.000
## Max. :26.50 Max. : 2.000000 Max. :17.000 Max. :4.000
## NA's :653 NA's :3359
str(wine.train)
## 'data.frame': 12795 obs. of 16 variables:
## $ ï..INDEX : int 1 2 4 5 6 7 8 11 12 13 ...
## $ TARGET : int 3 3 5 3 4 0 0 4 3 6 ...
## $ FixedAcidity : num 3.2 4.5 7.1 5.7 8 11.3 7.7 6.5 14.8 5.5 ...
## $ VolatileAcidity : num 1.16 0.16 2.64 0.385 0.33 0.32 0.29 -1.22 0.27 -0.22 ...
## $ CitricAcid : num -0.98 -0.81 -0.88 0.04 -1.26 0.59 -0.4 0.34 1.05 0.39 ...
## $ ResidualSugar : num 54.2 26.1 14.8 18.8 9.4 ...
## $ Chlorides : num -0.567 -0.425 0.037 -0.425 NA 0.556 0.06 0.04 -0.007 -0.277 ...
## $ FreeSulfurDioxide : num NA 15 214 22 -167 -37 287 523 -213 62 ...
## $ TotalSulfurDioxide: num 268 -327 142 115 108 15 156 551 NA 180 ...
## $ Density : num 0.993 1.028 0.995 0.996 0.995 ...
## $ pH : num 3.33 3.38 3.12 2.24 3.12 3.2 3.49 3.2 4.93 3.09 ...
## $ Sulphates : num -0.59 0.7 0.48 1.83 1.77 1.29 1.21 NA 0.26 0.75 ...
## $ Alcohol : num 9.9 NA 22 6.2 13.7 15.4 10.3 11.6 15 12.6 ...
## $ LabelAppeal : int 0 -1 -1 -1 0 0 0 1 0 0 ...
## $ AcidIndex : int 8 7 8 6 9 11 8 7 6 8 ...
## $ STARS : int 2 3 3 1 2 NA NA 3 NA 4 ...
(colSums(is.na(wine.train)) / nrow(wine.train)) * 100
## ï..INDEX TARGET FixedAcidity VolatileAcidity
## 0.000000 0.000000 0.000000 0.000000
## CitricAcid ResidualSugar Chlorides FreeSulfurDioxide
## 0.000000 4.814381 4.986323 5.056663
## TotalSulfurDioxide Density pH Sulphates
## 5.330207 0.000000 3.087143 9.456819
## Alcohol LabelAppeal AcidIndex STARS
## 5.103556 0.000000 0.000000 26.252442
missing_val <- data.frame(num_missing=colSums(is.na(wine.train)))
ggplot(wine.train, aes(STARS, TARGET)) + geom_jitter(width=0.5, height=0.5)
colnames(wine.train)[1] <- "INDEX"
corrplot::corrplot(cor(wine.train, use = "complete.obs"), tl.col="black", tl.cex=0.6, order='AOE')
mlt.train = wine.train
mlt.train = melt(mlt.train, id.vars = "INDEX")
ggplot(aes(value), data = mlt.train) + geom_histogram(stat = "bin", fill = "navyblue") + facet_wrap(~variable, scales = "free") + labs(title = "Distributions of Continuous Variables", x = "Variable", y = "Count")
mean(wine.train$TARGET == 0)  # share of wines with zero sales
## [1] 0.2136772
mlt.train <- melt(wine.train, id.vars = c("INDEX", "TARGET"))
ggplot(aes(value, TARGET), data = mlt.train) + geom_point() + facet_wrap(~variable, scales = "free") + labs(title = "Distributions of Continuous Variables", x = "Variable", y = "TARGET")
mlt.train <- melt(select(wine.train, "pH", "AcidIndex", "LabelAppeal", "STARS", "TARGET"), id.vars = c("TARGET"))
ggplot(aes(value, TARGET), data = mlt.train) + geom_jitter(width = 0.5, height = 0.5) + facet_wrap(~variable, scales = "free") + labs(title = "Distributions of Continuous Variables Subset, Jittered", x = "Variable", y = "TARGET")
mod1 <- lm(TARGET ~ ., data=select(wine.train, -"INDEX"))
summary(mod1)
##
## Call:
## lm(formula = TARGET ~ ., data = select(wine.train, -"INDEX"))
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.0614 -0.5143 0.1240 0.7170 3.2419
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.563e+00 5.530e-01 8.251 < 2e-16 ***
## FixedAcidity 1.685e-03 2.319e-03 0.727 0.4675
## VolatileAcidity -9.466e-02 1.846e-02 -5.129 3.00e-07 ***
## CitricAcid -4.836e-03 1.675e-02 -0.289 0.7728
## ResidualSugar -2.513e-04 4.276e-04 -0.588 0.5567
## Chlorides -1.134e-01 4.546e-02 -2.494 0.0126 *
## FreeSulfurDioxide 2.264e-04 9.711e-05 2.332 0.0198 *
## TotalSulfurDioxide 7.810e-05 6.288e-05 1.242 0.2142
## Density -1.281e+00 5.435e-01 -2.357 0.0185 *
## pH -9.441e-03 2.121e-02 -0.445 0.6563
## Sulphates -1.727e-02 1.558e-02 -1.109 0.2676
## Alcohol 1.653e-02 3.887e-03 4.252 2.15e-05 ***
## LabelAppeal 6.442e-01 1.743e-02 36.947 < 2e-16 ***
## AcidIndex -1.649e-01 1.235e-02 -13.346 < 2e-16 ***
## STARS 7.278e-01 1.710e-02 42.571 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.153 on 6421 degrees of freedom
## (6359 observations deleted due to missingness)
## Multiple R-squared: 0.445, Adjusted R-squared: 0.4438
## F-statistic: 367.8 on 14 and 6421 DF, p-value: < 2.2e-16
mod2 <- lm(TARGET ~ VolatileAcidity + Sulphates + Alcohol + LabelAppeal + AcidIndex + STARS, data=wine.train)
summary(mod2)
##
## Call:
## lm(formula = TARGET ~ VolatileAcidity + Sulphates + Alcohol +
## LabelAppeal + AcidIndex + STARS, data = wine.train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.0351 -0.5123 0.1289 0.7224 3.1485
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.282082 0.100168 32.766 < 2e-16 ***
## VolatileAcidity -0.090141 0.016389 -5.500 3.91e-08 ***
## Sulphates -0.014904 0.013840 -1.077 0.282
## Alcohol 0.020198 0.003436 5.879 4.29e-09 ***
## LabelAppeal 0.653326 0.015456 42.269 < 2e-16 ***
## AcidIndex -0.167059 0.010937 -15.274 < 2e-16 ***
## STARS 0.719125 0.015136 47.511 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.152 on 8127 degrees of freedom
## (4661 observations deleted due to missingness)
## Multiple R-squared: 0.4472, Adjusted R-squared: 0.4468
## F-statistic: 1096 on 6 and 8127 DF, p-value: < 2.2e-16
mod2 %>% glance()
## # A tibble: 1 x 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.447 0.447 1.15 1096. 0 6 -12687. 25391. 25447.
## # ... with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
car::vif(mod1)
## FixedAcidity VolatileAcidity CitricAcid ResidualSugar
## 1.027295 1.003524 1.006280 1.002597
## Chlorides FreeSulfurDioxide TotalSulfurDioxide Density
## 1.003084 1.004023 1.002931 1.004776
## pH Sulphates Alcohol LabelAppeal
## 1.004116 1.004483 1.009935 1.117057
## AcidIndex STARS
## 1.048563 1.134199
autoplot(mod1)
car::vif(mod2)
## VolatileAcidity Sulphates Alcohol LabelAppeal AcidIndex
## 1.001844 1.000869 1.006664 1.126009 1.012806
## STARS
## 1.138001
autoplot(mod2)
Poisson_Model1<- glm(TARGET ~ FixedAcidity + VolatileAcidity + CitricAcid + ResidualSugar +
Chlorides + FreeSulfurDioxide + TotalSulfurDioxide + Density +
pH + Sulphates + Alcohol +
as.factor(LabelAppeal) +
as.factor(AcidIndex) +
as.factor(STARS),
data=wine.train,
family=poisson
)
summary(Poisson_Model1)
##
## Call:
## glm(formula = TARGET ~ FixedAcidity + VolatileAcidity + CitricAcid +
## ResidualSugar + Chlorides + FreeSulfurDioxide + TotalSulfurDioxide +
## Density + pH + Sulphates + Alcohol + as.factor(LabelAppeal) +
## as.factor(AcidIndex) + as.factor(STARS), family = poisson,
## data = wine.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.2212 -0.2662 0.0461 0.3943 1.7274
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 7.306e-01 5.130e-01 1.424 0.15441
## FixedAcidity 5.803e-04 1.055e-03 0.550 0.58216
## VolatileAcidity -2.371e-02 8.369e-03 -2.833 0.00461 **
## CitricAcid -2.313e-03 7.581e-03 -0.305 0.76034
## ResidualSugar -7.068e-05 1.943e-04 -0.364 0.71598
## Chlorides -3.261e-02 2.056e-02 -1.587 0.11260
## FreeSulfurDioxide 5.617e-05 4.399e-05 1.277 0.20173
## TotalSulfurDioxide 1.985e-05 2.858e-05 0.695 0.48734
## Density -3.803e-01 2.464e-01 -1.543 0.12274
## pH -1.103e-03 9.614e-03 -0.115 0.90867
## Sulphates -5.343e-03 7.055e-03 -0.757 0.44883
## Alcohol 4.762e-03 1.774e-03 2.685 0.00726 **
## as.factor(LabelAppeal)-1 2.701e-01 5.337e-02 5.061 4.18e-07 ***
## as.factor(LabelAppeal)0 4.943e-01 5.205e-02 9.497 < 2e-16 ***
## as.factor(LabelAppeal)1 6.493e-01 5.277e-02 12.305 < 2e-16 ***
## as.factor(LabelAppeal)2 7.637e-01 5.840e-02 13.077 < 2e-16 ***
## as.factor(AcidIndex)5 1.186e-01 4.553e-01 0.260 0.79455
## as.factor(AcidIndex)6 1.902e-01 4.487e-01 0.424 0.67166
## as.factor(AcidIndex)7 1.536e-01 4.484e-01 0.343 0.73190
## as.factor(AcidIndex)8 1.286e-01 4.485e-01 0.287 0.77430
## as.factor(AcidIndex)9 7.794e-02 4.488e-01 0.174 0.86215
## as.factor(AcidIndex)10 -1.910e-02 4.502e-01 -0.042 0.96617
## as.factor(AcidIndex)11 -2.417e-01 4.540e-01 -0.532 0.59456
## as.factor(AcidIndex)12 -2.259e-01 4.605e-01 -0.490 0.62379
## as.factor(AcidIndex)13 -1.328e-01 4.634e-01 -0.287 0.77447
## as.factor(AcidIndex)14 -1.998e-01 4.813e-01 -0.415 0.67806
## as.factor(AcidIndex)15 2.322e-02 5.591e-01 0.042 0.96687
## as.factor(AcidIndex)16 -2.005e-01 6.341e-01 -0.316 0.75185
## as.factor(AcidIndex)17 7.303e-02 6.351e-01 0.115 0.90845
## as.factor(STARS)2 3.175e-01 1.744e-02 18.198 < 2e-16 ***
## as.factor(STARS)3 4.320e-01 1.899e-02 22.747 < 2e-16 ***
## as.factor(STARS)4 5.519e-01 2.692e-02 20.504 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 5844.1 on 6435 degrees of freedom
## Residual deviance: 3890.7 on 6404 degrees of freedom
## (6359 observations deleted due to missingness)
## AIC: 23087
##
## Number of Fisher Scoring iterations: 5
Poisson_Model2 <- glm(TARGET ~ VolatileAcidity + Chlorides + FreeSulfurDioxide + TotalSulfurDioxide + Sulphates +Alcohol +
as.factor(LabelAppeal) +
as.factor(AcidIndex) +
as.factor(STARS),
data=wine.train,
family=poisson
)
summary(Poisson_Model2)
##
## Call:
## glm(formula = TARGET ~ VolatileAcidity + Chlorides + FreeSulfurDioxide +
## TotalSulfurDioxide + Sulphates + Alcohol + as.factor(LabelAppeal) +
## as.factor(AcidIndex) + as.factor(STARS), family = poisson,
## data = wine.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.2167 -0.2669 0.0479 0.3933 2.0507
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.750e-01 4.512e-01 0.831 0.40593
## VolatileAcidity -2.133e-02 8.008e-03 -2.663 0.00775 **
## Chlorides -2.800e-02 1.973e-02 -1.420 0.15571
## FreeSulfurDioxide 7.109e-05 4.206e-05 1.690 0.09101 .
## TotalSulfurDioxide 2.420e-05 2.745e-05 0.881 0.37807
## Sulphates -4.931e-03 6.766e-03 -0.729 0.46614
## Alcohol 4.738e-03 1.703e-03 2.782 0.00541 **
## as.factor(LabelAppeal)-1 2.515e-01 5.140e-02 4.892 9.96e-07 ***
## as.factor(LabelAppeal)0 4.722e-01 5.016e-02 9.414 < 2e-16 ***
## as.factor(LabelAppeal)1 6.277e-01 5.086e-02 12.340 < 2e-16 ***
## as.factor(LabelAppeal)2 7.395e-01 5.605e-02 13.194 < 2e-16 ***
## as.factor(AcidIndex)5 1.177e-01 4.545e-01 0.259 0.79571
## as.factor(AcidIndex)6 1.790e-01 4.482e-01 0.399 0.68958
## as.factor(AcidIndex)7 1.493e-01 4.479e-01 0.333 0.73883
## as.factor(AcidIndex)8 1.213e-01 4.480e-01 0.271 0.78663
## as.factor(AcidIndex)9 6.778e-02 4.483e-01 0.151 0.87982
## as.factor(AcidIndex)10 -1.896e-02 4.496e-01 -0.042 0.96636
## as.factor(AcidIndex)11 -2.527e-01 4.533e-01 -0.557 0.57724
## as.factor(AcidIndex)12 -2.125e-01 4.590e-01 -0.463 0.64335
## as.factor(AcidIndex)13 -1.314e-01 4.622e-01 -0.284 0.77627
## as.factor(AcidIndex)14 -3.701e-01 4.807e-01 -0.770 0.44131
## as.factor(AcidIndex)15 1.800e-02 5.585e-01 0.032 0.97430
## as.factor(AcidIndex)16 -1.916e-01 6.333e-01 -0.303 0.76226
## as.factor(AcidIndex)17 3.635e-02 6.337e-01 0.057 0.95425
## as.factor(STARS)2 3.226e-01 1.679e-02 19.210 < 2e-16 ***
## as.factor(STARS)3 4.381e-01 1.826e-02 23.999 < 2e-16 ***
## as.factor(STARS)4 5.566e-01 2.565e-02 21.701 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 6339.3 on 6952 degrees of freedom
## Residual deviance: 4216.8 on 6926 degrees of freedom
## (5842 observations deleted due to missingness)
## AIC: 24948
##
## Number of Fisher Scoring iterations: 5
dispersiontest(Poisson_Model2, trafo = 1)
##
## Overdispersion test
##
## data: Poisson_Model2
## z = -48.265, p-value = 1
## alternative hypothesis: true alpha is greater than 0
## sample estimates:
## alpha
## -0.5757886
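The test output above reports a negative alpha estimate (about -0.576) with a p-value of 1 against the alternative that alpha is greater than 0, so there is no evidence of overdispersion; if anything, the counts look underdispersed, in line with the quasi-Poisson dispersion of about 0.42 reported in this listing. The same quantity can be checked by hand with the Pearson dispersion statistic. The sketch below uses simulated underdispersed counts rather than the wine data, so every value in it is an assumption.

```r
set.seed(42)
n <- 2000
x <- rnorm(n)

# Binomial counts have a variance below their mean, so they are underdispersed
# relative to a Poisson model (simulated data, not the wine set)
y <- rbinom(n, size = 8, prob = plogis(0.2 * x))

fit <- glm(y ~ x, family = poisson)

# Pearson dispersion statistic: values near 1 are consistent with Poisson,
# values well below 1 indicate underdispersion
dispersion <- sum(residuals(fit, type = "pearson")^2) / df.residual(fit)
print(dispersion)

# A quasi-Poisson fit reports the same quantity in its summary
qfit <- glm(y ~ x, family = quasipoisson)
quasi_dispersion <- summary(qfit)$dispersion
print(quasi_dispersion)
```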
Negative_Bin_Model1 <- glm.nb(TARGET ~ FixedAcidity + VolatileAcidity + CitricAcid + ResidualSugar +
Chlorides + FreeSulfurDioxide + TotalSulfurDioxide + Density +
pH + Sulphates + Alcohol +
as.factor(LabelAppeal) +
as.factor(AcidIndex) +
as.factor(STARS),
data=wine.train)
summary(Negative_Bin_Model1)
##
## Call:
## glm.nb(formula = TARGET ~ FixedAcidity + VolatileAcidity + CitricAcid +
## ResidualSugar + Chlorides + FreeSulfurDioxide + TotalSulfurDioxide +
## Density + pH + Sulphates + Alcohol + as.factor(LabelAppeal) +
## as.factor(AcidIndex) + as.factor(STARS), data = wine.train,
## init.theta = 134433.0376, link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.2212 -0.2662 0.0461 0.3943 1.7274
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 7.306e-01 5.131e-01 1.424 0.15442
## FixedAcidity 5.803e-04 1.055e-03 0.550 0.58216
## VolatileAcidity -2.371e-02 8.370e-03 -2.833 0.00461 **
## CitricAcid -2.313e-03 7.581e-03 -0.305 0.76034
## ResidualSugar -7.068e-05 1.943e-04 -0.364 0.71599
## Chlorides -3.261e-02 2.056e-02 -1.587 0.11260
## FreeSulfurDioxide 5.617e-05 4.400e-05 1.277 0.20174
## TotalSulfurDioxide 1.985e-05 2.858e-05 0.695 0.48734
## Density -3.803e-01 2.464e-01 -1.543 0.12274
## pH -1.103e-03 9.614e-03 -0.115 0.90867
## Sulphates -5.343e-03 7.055e-03 -0.757 0.44884
## Alcohol 4.762e-03 1.774e-03 2.685 0.00726 **
## as.factor(LabelAppeal)-1 2.701e-01 5.337e-02 5.060 4.18e-07 ***
## as.factor(LabelAppeal)0 4.943e-01 5.205e-02 9.497 < 2e-16 ***
## as.factor(LabelAppeal)1 6.493e-01 5.277e-02 12.305 < 2e-16 ***
## as.factor(LabelAppeal)2 7.637e-01 5.840e-02 13.077 < 2e-16 ***
## as.factor(AcidIndex)5 1.186e-01 4.553e-01 0.260 0.79455
## as.factor(AcidIndex)6 1.902e-01 4.487e-01 0.424 0.67167
## as.factor(AcidIndex)7 1.536e-01 4.484e-01 0.343 0.73190
## as.factor(AcidIndex)8 1.286e-01 4.485e-01 0.287 0.77430
## as.factor(AcidIndex)9 7.794e-02 4.489e-01 0.174 0.86215
## as.factor(AcidIndex)10 -1.910e-02 4.502e-01 -0.042 0.96617
## as.factor(AcidIndex)11 -2.417e-01 4.540e-01 -0.532 0.59456
## as.factor(AcidIndex)12 -2.259e-01 4.605e-01 -0.490 0.62380
## as.factor(AcidIndex)13 -1.328e-01 4.634e-01 -0.287 0.77447
## as.factor(AcidIndex)14 -1.998e-01 4.813e-01 -0.415 0.67806
## as.factor(AcidIndex)15 2.322e-02 5.591e-01 0.042 0.96687
## as.factor(AcidIndex)16 -2.005e-01 6.341e-01 -0.316 0.75185
## as.factor(AcidIndex)17 7.303e-02 6.351e-01 0.115 0.90845
## as.factor(STARS)2 3.175e-01 1.744e-02 18.198 < 2e-16 ***
## as.factor(STARS)3 4.320e-01 1.899e-02 22.747 < 2e-16 ***
## as.factor(STARS)4 5.519e-01 2.692e-02 20.504 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(134433) family taken to be 1)
##
## Null deviance: 5843.9 on 6435 degrees of freedom
## Residual deviance: 3890.6 on 6404 degrees of freedom
## (6359 observations deleted due to missingness)
## AIC: 23089
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 134433
## Std. Err.: 217492
## Warning while fitting theta: iteration limit reached
##
## 2 x log-likelihood: -23023.41
Negative_Bin_Model2 <- glm.nb(TARGET ~ VolatileAcidity + Chlorides + FreeSulfurDioxide + TotalSulfurDioxide + Sulphates + Alcohol +
as.factor(LabelAppeal) +
as.factor(AcidIndex) +
as.factor(STARS),
data=wine.train)
summary(Negative_Bin_Model2)
##
## Call:
## glm.nb(formula = TARGET ~ VolatileAcidity + Chlorides + FreeSulfurDioxide +
## TotalSulfurDioxide + Sulphates + Alcohol + as.factor(LabelAppeal) +
## as.factor(AcidIndex) + as.factor(STARS), data = wine.train,
## init.theta = 133561.251, link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.2167 -0.2669 0.0479 0.3933 2.0506
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.750e-01 4.512e-01 0.831 0.40593
## VolatileAcidity -2.133e-02 8.008e-03 -2.663 0.00775 **
## Chlorides -2.800e-02 1.973e-02 -1.420 0.15572
## FreeSulfurDioxide 7.109e-05 4.206e-05 1.690 0.09101 .
## TotalSulfurDioxide 2.420e-05 2.745e-05 0.881 0.37807
## Sulphates -4.931e-03 6.766e-03 -0.729 0.46614
## Alcohol 4.738e-03 1.703e-03 2.782 0.00541 **
## as.factor(LabelAppeal)-1 2.515e-01 5.140e-02 4.892 9.96e-07 ***
## as.factor(LabelAppeal)0 4.722e-01 5.016e-02 9.414 < 2e-16 ***
## as.factor(LabelAppeal)1 6.277e-01 5.086e-02 12.340 < 2e-16 ***
## as.factor(LabelAppeal)2 7.395e-01 5.605e-02 13.194 < 2e-16 ***
## as.factor(AcidIndex)5 1.177e-01 4.545e-01 0.259 0.79572
## as.factor(AcidIndex)6 1.790e-01 4.482e-01 0.399 0.68958
## as.factor(AcidIndex)7 1.493e-01 4.479e-01 0.333 0.73883
## as.factor(AcidIndex)8 1.213e-01 4.480e-01 0.271 0.78663
## as.factor(AcidIndex)9 6.778e-02 4.483e-01 0.151 0.87982
## as.factor(AcidIndex)10 -1.896e-02 4.496e-01 -0.042 0.96636
## as.factor(AcidIndex)11 -2.527e-01 4.533e-01 -0.557 0.57725
## as.factor(AcidIndex)12 -2.125e-01 4.590e-01 -0.463 0.64335
## as.factor(AcidIndex)13 -1.314e-01 4.622e-01 -0.284 0.77628
## as.factor(AcidIndex)14 -3.701e-01 4.807e-01 -0.770 0.44132
## as.factor(AcidIndex)15 1.799e-02 5.585e-01 0.032 0.97430
## as.factor(AcidIndex)16 -1.916e-01 6.333e-01 -0.303 0.76226
## as.factor(AcidIndex)17 3.635e-02 6.337e-01 0.057 0.95425
## as.factor(STARS)2 3.226e-01 1.679e-02 19.209 < 2e-16 ***
## as.factor(STARS)3 4.381e-01 1.826e-02 23.999 < 2e-16 ***
## as.factor(STARS)4 5.566e-01 2.565e-02 21.700 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(133561.3) family taken to be 1)
##
## Null deviance: 6339.2 on 6952 degrees of freedom
## Residual deviance: 4216.8 on 6926 degrees of freedom
## (5842 observations deleted due to missingness)
## AIC: 24950
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 133561
## Std. Err.: 206991
## Warning while fitting theta: iteration limit reached
##
## 2 x log-likelihood: -24894.47
Quasi_Poisson_Model1<- glm(TARGET ~ FixedAcidity + VolatileAcidity + CitricAcid + ResidualSugar +
Chlorides + FreeSulfurDioxide + TotalSulfurDioxide + Density +
pH + Sulphates + Alcohol +
as.factor(LabelAppeal) +
as.factor(AcidIndex) +
as.factor(STARS),
data=wine.train,
family=quasipoisson
)
summary(Quasi_Poisson_Model1)
##
## Call:
## glm(formula = TARGET ~ FixedAcidity + VolatileAcidity + CitricAcid +
## ResidualSugar + Chlorides + FreeSulfurDioxide + TotalSulfurDioxide +
## Density + pH + Sulphates + Alcohol + as.factor(LabelAppeal) +
## as.factor(AcidIndex) + as.factor(STARS), family = quasipoisson,
## data = wine.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.2212 -0.2662 0.0461 0.3943 1.7274
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.306e-01 3.338e-01 2.189 0.0287 *
## FixedAcidity 5.803e-04 6.862e-04 0.846 0.3978
## VolatileAcidity -2.371e-02 5.446e-03 -4.354 1.36e-05 ***
## CitricAcid -2.313e-03 4.933e-03 -0.469 0.6392
## ResidualSugar -7.068e-05 1.264e-04 -0.559 0.5761
## Chlorides -3.261e-02 1.337e-02 -2.438 0.0148 *
## FreeSulfurDioxide 5.617e-05 2.863e-05 1.962 0.0498 *
## TotalSulfurDioxide 1.985e-05 1.860e-05 1.067 0.2858
## Density -3.803e-01 1.603e-01 -2.372 0.0177 *
## pH -1.103e-03 6.256e-03 -0.176 0.8601
## Sulphates -5.343e-03 4.590e-03 -1.164 0.2445
## Alcohol 4.762e-03 1.154e-03 4.126 3.73e-05 ***
## as.factor(LabelAppeal)-1 2.701e-01 3.473e-02 7.777 8.57e-15 ***
## as.factor(LabelAppeal)0 4.943e-01 3.387e-02 14.596 < 2e-16 ***
## as.factor(LabelAppeal)1 6.493e-01 3.433e-02 18.912 < 2e-16 ***
## as.factor(LabelAppeal)2 7.637e-01 3.800e-02 20.098 < 2e-16 ***
## as.factor(AcidIndex)5 1.186e-01 2.963e-01 0.400 0.6890
## as.factor(AcidIndex)6 1.902e-01 2.919e-01 0.651 0.5148
## as.factor(AcidIndex)7 1.536e-01 2.917e-01 0.527 0.5985
## as.factor(AcidIndex)8 1.286e-01 2.918e-01 0.441 0.6594
## as.factor(AcidIndex)9 7.794e-02 2.920e-01 0.267 0.7896
## as.factor(AcidIndex)10 -1.910e-02 2.929e-01 -0.065 0.9480
## as.factor(AcidIndex)11 -2.417e-01 2.954e-01 -0.818 0.4134
## as.factor(AcidIndex)12 -2.259e-01 2.996e-01 -0.754 0.4510
## as.factor(AcidIndex)13 -1.328e-01 3.015e-01 -0.440 0.6597
## as.factor(AcidIndex)14 -1.998e-01 3.132e-01 -0.638 0.5235
## as.factor(AcidIndex)15 2.322e-02 3.638e-01 0.064 0.9491
## as.factor(AcidIndex)16 -2.005e-01 4.126e-01 -0.486 0.6270
## as.factor(AcidIndex)17 7.303e-02 4.132e-01 0.177 0.8597
## as.factor(STARS)2 3.175e-01 1.135e-02 27.968 < 2e-16 ***
## as.factor(STARS)3 4.320e-01 1.236e-02 34.960 < 2e-16 ***
## as.factor(STARS)4 5.519e-01 1.751e-02 31.513 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasipoisson family taken to be 0.4233659)
##
## Null deviance: 5844.1 on 6435 degrees of freedom
## Residual deviance: 3890.7 on 6404 degrees of freedom
## (6359 observations deleted due to missingness)
## AIC: NA
##
## Number of Fisher Scoring iterations: 5
Quasi_Poisson_Model2 <- glm(TARGET ~ VolatileAcidity + Chlorides + FreeSulfurDioxide + TotalSulfurDioxide + Sulphates +Alcohol +
as.factor(LabelAppeal) +
as.factor(AcidIndex) +
as.factor(STARS),
data=wine.train,
family=quasipoisson
)
summary(Quasi_Poisson_Model2)
##
## Call:
## glm(formula = TARGET ~ VolatileAcidity + Chlorides + FreeSulfurDioxide +
## TotalSulfurDioxide + Sulphates + Alcohol + as.factor(LabelAppeal) +
## as.factor(AcidIndex) + as.factor(STARS), family = quasipoisson,
## data = wine.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.2167 -0.2669 0.0479 0.3933 2.0507
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.750e-01 2.942e-01 1.275 0.20247
## VolatileAcidity -2.133e-02 5.221e-03 -4.084 4.47e-05 ***
## Chlorides -2.800e-02 1.286e-02 -2.177 0.02949 *
## FreeSulfurDioxide 7.109e-05 2.743e-05 2.592 0.00956 **
## TotalSulfurDioxide 2.420e-05 1.790e-05 1.352 0.17645
## Sulphates -4.931e-03 4.411e-03 -1.118 0.26371
## Alcohol 4.738e-03 1.111e-03 4.267 2.01e-05 ***
## as.factor(LabelAppeal)-1 2.515e-01 3.351e-02 7.504 6.98e-14 ***
## as.factor(LabelAppeal)0 4.722e-01 3.271e-02 14.438 < 2e-16 ***
## as.factor(LabelAppeal)1 6.277e-01 3.316e-02 18.926 < 2e-16 ***
## as.factor(LabelAppeal)2 7.395e-01 3.654e-02 20.237 < 2e-16 ***
## as.factor(AcidIndex)5 1.177e-01 2.964e-01 0.397 0.69132
## as.factor(AcidIndex)6 1.790e-01 2.922e-01 0.613 0.54015
## as.factor(AcidIndex)7 1.493e-01 2.920e-01 0.511 0.60912
## as.factor(AcidIndex)8 1.213e-01 2.921e-01 0.415 0.67803
## as.factor(AcidIndex)9 6.778e-02 2.923e-01 0.232 0.81662
## as.factor(AcidIndex)10 -1.896e-02 2.931e-01 -0.065 0.94843
## as.factor(AcidIndex)11 -2.527e-01 2.955e-01 -0.855 0.39262
## as.factor(AcidIndex)12 -2.125e-01 2.993e-01 -0.710 0.47764
## as.factor(AcidIndex)13 -1.314e-01 3.014e-01 -0.436 0.66296
## as.factor(AcidIndex)14 -3.701e-01 3.134e-01 -1.181 0.23766
## as.factor(AcidIndex)15 1.800e-02 3.642e-01 0.049 0.96059
## as.factor(AcidIndex)16 -1.916e-01 4.129e-01 -0.464 0.64268
## as.factor(AcidIndex)17 3.635e-02 4.131e-01 0.088 0.92989
## as.factor(STARS)2 3.226e-01 1.095e-02 29.462 < 2e-16 ***
## as.factor(STARS)3 4.381e-01 1.190e-02 36.808 < 2e-16 ***
## as.factor(STARS)4 5.566e-01 1.672e-02 33.283 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasipoisson family taken to be 0.4251092)
##
## Null deviance: 6339.3 on 6952 degrees of freedom
## Residual deviance: 4216.8 on 6926 degrees of freedom
## (5842 observations deleted due to missingness)
## AIC: NA
##
## Number of Fisher Scoring iterations: 5
pois.coef = coef(Poisson_Model2)
negbinom.coef = coef(Negative_Bin_Model2)
pois.stderr = se.coef(Poisson_Model2)
negbinom.stderr = summary(Negative_Bin_Model2)$coefficients[, 2]
pois.quasi.coef = coef(Quasi_Poisson_Model2)
pois.quasi.stderr = se.coef(Quasi_Poisson_Model2)
df.analysis = cbind(pois.coef, negbinom.coef, pois.quasi.coef,
pois.stderr, negbinom.stderr, pois.quasi.stderr)
head(df.analysis,10) %>% kable() %>% kable_styling(c("striped", "bordered"))
| | pois.coef | negbinom.coef | pois.quasi.coef | pois.stderr | negbinom.stderr | pois.quasi.stderr |
|---|---|---|---|---|---|---|
| (Intercept) | 0.3749910 | 0.3749910 | 0.3749910 | 0.4512069 | 0.4512152 | 0.2941887 |
| VolatileAcidity | -0.0213258 | -0.0213259 | -0.0213258 | 0.0080083 | 0.0080084 | 0.0052214 |
| Chlorides | -0.0280045 | -0.0280046 | -0.0280045 | 0.0197264 | 0.0197267 | 0.0128617 |
| FreeSulfurDioxide | 0.0000711 | 0.0000711 | 0.0000711 | 0.0000421 | 0.0000421 | 0.0000274 |
| TotalSulfurDioxide | 0.0000242 | 0.0000242 | 0.0000242 | 0.0000275 | 0.0000275 | 0.0000179 |
| Sulphates | -0.0049309 | -0.0049309 | -0.0049309 | 0.0067660 | 0.0067661 | 0.0044115 |
| Alcohol | 0.0047382 | 0.0047382 | 0.0047382 | 0.0017033 | 0.0017033 | 0.0011106 |
| as.factor(LabelAppeal)-1 | 0.2514805 | 0.2514805 | 0.2514805 | 0.0514025 | 0.0514029 | 0.0335146 |
| as.factor(LabelAppeal)0 | 0.4722327 | 0.4722327 | 0.4722327 | 0.0501642 | 0.0501646 | 0.0327072 |
| as.factor(LabelAppeal)1 | 0.6276530 | 0.6276530 | 0.6276530 | 0.0508641 | 0.0508645 | 0.0331636 |
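The table shows identical coefficients across the three fits and smaller standard errors for the quasi-Poisson model. That pattern follows from the quasi-Poisson standard errors being the Poisson ones scaled by the square root of the estimated dispersion. The sketch below checks this arithmetic using three rows copied from the table and the dispersion value (0.4251092) reported in the quasi-Poisson summary; the vector names are illustrative.

```r
# Values copied from the table and from the quasi-Poisson summary in this listing
pois_se    <- c(Intercept = 0.4512069, VolatileAcidity = 0.0080083, Alcohol = 0.0017033)
quasi_se   <- c(Intercept = 0.2941887, VolatileAcidity = 0.0052214, Alcohol = 0.0011106)
dispersion <- 0.4251092

# Quasi-Poisson standard errors are the Poisson ones times sqrt(dispersion)
scaled_se <- pois_se * sqrt(dispersion)
print(round(cbind(pois_se, scaled_se, quasi_se), 7))
```

Because the dispersion is below 1, the scaling shrinks the standard errors, which is why more predictors appear significant in the quasi-Poisson summaries.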
# Predictors that contain missing values
predictor_names <- c("ResidualSugar", "Chlorides", "FreeSulfurDioxide",
"TotalSulfurDioxide", "pH", "Sulphates", "Alcohol", "STARS")
# For each predictor, count how often each TARGET value occurs among the rows where that predictor is NA.
# Column roles: INDEX = predictor name, Variable = TARGET value, value = row count.
missing_val <- data.frame()
for (name in predictor_names) {
  missing_target_count <- wine.train %>% filter(is.na(.data[[name]])) %>% count(TARGET)
  new_missing <- data.frame(INDEX = rep(name, nrow(missing_target_count)),
                            Variable = missing_target_count$TARGET,
                            value = missing_target_count$n)
  missing_val <- rbind(missing_val, new_missing)
}
ggplot(data=missing_val) + geom_bar(mapping=aes(x=Variable, y=value), stat="identity") + facet_wrap(~INDEX, scales = "fixed") + labs(title = "Missing Predictors vs Sales", x = "Num sales", y = "Count")
ZInflatedModel <- zeroinfl(TARGET ~ FixedAcidity + VolatileAcidity + CitricAcid +
ResidualSugar + Chlorides + FreeSulfurDioxide + TotalSulfurDioxide +
Density + pH + Sulphates + Alcohol + as.factor(LabelAppeal) +
as.factor(AcidIndex) | STARS,
data = wine.train, dist = "negbin")
summary(ZInflatedModel)
##
## Call:
## zeroinfl(formula = TARGET ~ FixedAcidity + VolatileAcidity + CitricAcid +
## ResidualSugar + Chlorides + FreeSulfurDioxide + TotalSulfurDioxide +
## Density + pH + Sulphates + Alcohol + as.factor(LabelAppeal) + as.factor(AcidIndex) |
## STARS, data = wine.train, dist = "negbin")
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## -2.43095 -0.25615 0.05577 0.35670 2.45205
##
## Count model coefficients (negbin with log link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 9.520e-01 5.154e-01 1.847 0.0647 .
## FixedAcidity 6.000e-04 1.075e-03 0.558 0.5768
## VolatileAcidity -1.691e-02 8.479e-03 -1.995 0.0461 *
## CitricAcid -1.945e-03 7.700e-03 -0.253 0.8005
## ResidualSugar -7.271e-05 1.962e-04 -0.371 0.7109
## Chlorides -2.722e-02 2.101e-02 -1.296 0.1951
## FreeSulfurDioxide 7.817e-06 4.425e-05 0.177 0.8598
## TotalSulfurDioxide -2.343e-06 2.827e-05 -0.083 0.9340
## Density -3.595e-01 2.516e-01 -1.429 0.1530
## pH 4.591e-03 9.780e-03 0.469 0.6388
## Sulphates -1.880e-03 7.167e-03 -0.262 0.7930
## Alcohol 8.381e-03 1.794e-03 4.673 2.97e-06 ***
## as.factor(LabelAppeal)-1 3.574e-01 5.428e-02 6.585 4.56e-11 ***
## as.factor(LabelAppeal)0 6.550e-01 5.271e-02 12.427 < 2e-16 ***
## as.factor(LabelAppeal)1 8.703e-01 5.314e-02 16.377 < 2e-16 ***
## as.factor(LabelAppeal)2 1.060e+00 5.859e-02 18.099 < 2e-16 ***
## as.factor(AcidIndex)5 -3.590e-02 4.549e-01 -0.079 0.9371
## as.factor(AcidIndex)6 5.884e-02 4.482e-01 0.131 0.8956
## as.factor(AcidIndex)7 1.884e-02 4.479e-01 0.042 0.9665
## as.factor(AcidIndex)8 -1.553e-03 4.480e-01 -0.003 0.9972
## as.factor(AcidIndex)9 -3.515e-02 4.484e-01 -0.078 0.9375
## as.factor(AcidIndex)10 -1.195e-01 4.499e-01 -0.266 0.7904
## as.factor(AcidIndex)11 -1.876e-01 4.550e-01 -0.412 0.6800
## as.factor(AcidIndex)12 -1.279e-01 4.633e-01 -0.276 0.7825
## as.factor(AcidIndex)13 -3.604e-02 4.668e-01 -0.077 0.9385
## as.factor(AcidIndex)14 -4.569e-02 4.891e-01 -0.093 0.9256
## as.factor(AcidIndex)15 -2.199e-02 5.747e-01 -0.038 0.9695
## as.factor(AcidIndex)16 2.464e-01 6.638e-01 0.371 0.7104
## as.factor(AcidIndex)17 -1.053e-01 6.345e-01 -0.166 0.8683
## Log(theta) 1.811e+01 1.852e+00 9.779 < 2e-16 ***
##
## Zero-inflation model coefficients (binomial with logit link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.3357 0.5346 4.369 1.25e-05 ***
## STARS -3.8692 0.5248 -7.373 1.66e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Theta = 73631876.5559
## Number of iterations in BFGS optimization: 40
## Log-likelihood: -1.134e+04 on 32 Df
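The estimated Theta above is roughly 7.4e7. With a dispersion parameter that large, the negative binomial count component is numerically almost identical to a Poisson distribution, which suggests little overdispersion beyond the excess zeros. A minimal sketch in base R, assuming an illustrative mean of 3 sales (close to the TARGET mean of 3.029 and chosen only for the demonstration):

```r
# Compare negative binomial and Poisson probabilities at the fitted dispersion.
theta <- 73631876.5559   # Theta reported in the summary above
mu    <- 3               # illustrative mean (assumption), close to mean(TARGET)
k     <- 0:8             # observed range of TARGET

nb_probs   <- dnbinom(k, size = theta, mu = mu)
pois_probs <- dpois(k, lambda = mu)

# Largest absolute difference between the two probability mass functions
max_diff <- max(abs(nb_probs - pois_probs))
max_diff
```

If the difference is negligible, refitting with dist = "poisson" in zeroinfl() would be expected to give nearly the same count coefficients.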
ZInflatedModel2 <- zeroinfl(TARGET ~ VolatileAcidity +
Alcohol + as.factor(LabelAppeal) | STARS,
data = wine.train, dist = "negbin")
summary(ZInflatedModel2)
##
## Call:
## zeroinfl(formula = TARGET ~ VolatileAcidity + Alcohol + as.factor(LabelAppeal) |
## STARS, data = wine.train, dist = "negbin")
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## -2.44720 -0.29223 0.06532 0.35521 2.19805
##
## Count model coefficients (negbin with log link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.603290 0.047589 12.677 < 2e-16 ***
## VolatileAcidity -0.014676 0.007191 -2.041 0.0413 *
## Alcohol 0.009111 0.001496 6.092 1.12e-09 ***
## as.factor(LabelAppeal)-1 0.362691 0.046620 7.780 7.26e-15 ***
## as.factor(LabelAppeal)0 0.657882 0.045361 14.503 < 2e-16 ***
## as.factor(LabelAppeal)1 0.879500 0.045672 19.257 < 2e-16 ***
## as.factor(LabelAppeal)2 1.065331 0.049974 21.318 < 2e-16 ***
## Log(theta) 17.274647 NaN NaN NaN
##
## Zero-inflation model coefficients (binomial with logit link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.2194 0.4315 5.143 2.7e-07 ***
## STARS -3.7580 0.4227 -8.891 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Theta = 31789517.9144
## Number of iterations in BFGS optimization: 22
## Log-likelihood: -1.584e+04 on 10 Df
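The two log-likelihoods above (-1.134e+04 on 32 Df and -1.584e+04 on 10 Df) are not directly comparable. Under R's default na.action (na.omit), zeroinfl() drops every row with a missing value in any model variable, so the full model was fit on fewer rows than the reduced one. A minimal sketch of a like-for-like comparison, using a Poisson glm() on simulated toy data as a stand-in (the data frame `toy` and its columns are hypothetical and not part of this analysis):

```r
# Hypothetical toy data standing in for wine.train
set.seed(1)
toy <- data.frame(y = rpois(200, 3), x1 = rnorm(200), x2 = rnorm(200))
toy$x2[sample(200, 40)] <- NA   # 40 rows are missing the second predictor

full    <- glm(y ~ x1 + x2, family = poisson, data = toy)  # uses complete rows only
reduced <- glm(y ~ x1,      family = poisson, data = toy)  # uses all rows

# The two fits see different numbers of observations
c(full = nobs(full), reduced = nobs(reduced))

# Refit the reduced model on the rows the full model used, then compare
toy_cc     <- toy[complete.cases(toy), ]
reduced_cc <- update(reduced, data = toy_cc)
comparison <- AIC(full, reduced_cc)
comparison
```

The same pattern should carry over to the models here: restrict wine.train to rows complete in every predictor of ZInflatedModel, refit ZInflatedModel2 on that subset, and then compare the two fits.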
(colSums(is.na(wine.eval)) / nrow(wine.eval)) * 100
## IN TARGET FixedAcidity VolatileAcidity
## 0.000000 100.000000 0.000000 0.000000
## CitricAcid ResidualSugar Chlorides FreeSulfurDioxide
## 0.000000 5.037481 4.137931 4.557721
## TotalSulfurDioxide Density pH Sulphates
## 4.707646 0.000000 3.118441 9.295352
## Alcohol LabelAppeal AcidIndex STARS
## 5.547226 0.000000 0.000000 25.217391
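# The first column of the evaluation file is read in as "IN"; rename it to INDEX.
colnames(wine.eval)[1] <- "INDEX"
# Predict expected sales with the reduced model and round to whole numbers.
wine.eval$TARGET <- round(predict(ZInflatedModel2, wine.eval %>% select(-c("INDEX", "TARGET")), type = "response"))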
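# Every row with a missing STARS rating should also have a missing prediction.
# Use the element-wise `&` in filter(); `&&` only accepts single logical values.
# STARS is missing in 25.217391% of the 3335 evaluation rows, i.e. 841 rows.
wine.eval %>% filter(is.na(STARS) & is.na(TARGET)) %>% count()
## n
## 1 841
wine.eval %>% filter(is.na(STARS) & !is.na(TARGET)) %>% count()
## n
## 1 0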
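These checks depend on an element-wise comparison across rows. In R, `&` is vectorised, while `&&` expects single logical values (and raises an error for longer vectors in R 4.3 and later), so row filters need `&`. A small base R sketch with made-up vectors (hypothetical values, not taken from wine.eval):

```r
# Hypothetical stand-ins for the STARS and TARGET columns
stars  <- c(NA, 2, NA, 3, 1)
target <- c(NA, 4, NA, 5, NA)

# Element-wise: TRUE only where both values are missing in the same row
both_missing <- is.na(stars) & is.na(target)
sum(both_missing)

# Rows where STARS is missing but a prediction exists
stars_only_missing <- is.na(stars) & !is.na(target)
sum(stars_only_missing)
```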
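# predict() returns NA for rows missing a model predictor (STARS or Alcohol here); set those predictions to 0 sales.
wine.eval[is.na(wine.eval$TARGET), 'TARGET'] <- 0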
#write_csv(wine.eval,"HW5_predictions.csv")
# TARGET is an integer count, so use one bin per value.
ggplot(data = wine.eval) + geom_histogram(mapping = aes(x = TARGET), binwidth = 1)