| Variable Name | Definition | Theoretical Effect |
|---|---|---|
| INDEX | Identification Variable (do not use) | None |
| TARGET | Number of Cases Purchased | None |
| AcidIndex | Proprietary method of testing total acidity of wine by using a weighted average | |
| Alcohol | Alcohol Content | |
| Chlorides | Chloride content of wine | |
| CitricAcid | Citric Acid Content | |
| Density | Density of Wine | |
| FixedAcidity | Fixed Acidity of Wine | |
| FreeSulfurDioxide | Sulfur Dioxide content of wine | |
| LabelAppeal | Marketing Score indicating the appeal of label design for consumers. High numbers suggest customers like the label design. Negative numbers suggest customers don’t like the design. | Many consumers purchase based on the visual appeal of the wine label design. Higher numbers suggest better sales |
| ResidualSugar | Residual Sugar of wine | |
| STARS | Wine rating by a team of experts. 4 Stars = Excellent, 1 Star = Poor | A high number of stars suggests high sales |
| Sulphates | Sulfate content of wine | |
| TotalSulfurDioxide | Total Sulfur Dioxide of Wine | |
| VolatileAcidity | Volatile Acid content of wine | |
| pH | pH of wine | |
Our dataset consists of 15 variables and 12,795 observations, with the ResidualSugar, Chlorides, FreeSulfurDioxide, TotalSulfurDioxide, pH, Sulphates, Alcohol, and STARS variables containing missing values. As stated previously, TARGET is our response variable. Lastly, LabelAppeal, AcidIndex, and STARS are discrete variables and the rest are continuous, so we use the appropriate visualizations for each in the following figures.
## Observations: 12,795
## Variables: 15
## $ TARGET <int> 3, 3, 5, 3, 4, 0, 0, 4, 3, 6, 0, 4, 3, 7, 4...
## $ FixedAcidity <dbl> 3.2, 4.5, 7.1, 5.7, 8.0, 11.3, 7.7, 6.5, 14...
## $ VolatileAcidity <dbl> 1.160, 0.160, 2.640, 0.385, 0.330, 0.320, 0...
## $ CitricAcid <dbl> -0.98, -0.81, -0.88, 0.04, -1.26, 0.59, -0....
## $ ResidualSugar <dbl> 54.20, 26.10, 14.80, 18.80, 9.40, 2.20, 21....
## $ Chlorides <dbl> -0.567, -0.425, 0.037, -0.425, NA, 0.556, 0...
## $ FreeSulfurDioxide <dbl> NA, 15, 214, 22, -167, -37, 287, 523, -213,...
## $ TotalSulfurDioxide <dbl> 268, -327, 142, 115, 108, 15, 156, 551, NA,...
## $ Density <dbl> 0.99280, 1.02792, 0.99518, 0.99640, 0.99457...
## $ pH <dbl> 3.33, 3.38, 3.12, 2.24, 3.12, 3.20, 3.49, 3...
## $ Sulphates <dbl> -0.59, 0.70, 0.48, 1.83, 1.77, 1.29, 1.21, ...
## $ Alcohol <dbl> 9.9, NA, 22.0, 6.2, 13.7, 15.4, 10.3, 11.6,...
## $ LabelAppeal <int> 0, -1, -1, -1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 2...
## $ AcidIndex <int> 8, 7, 8, 6, 9, 11, 8, 7, 6, 8, 5, 10, 7, 8,...
## $ STARS <int> 2, 3, 3, 1, 2, NA, NA, 3, NA, 4, 1, 2, 2, 3...
## TARGET FixedAcidity VolatileAcidity CitricAcid
## Min. :0.000 Min. :-18.100 Min. :-2.7900 Min. :-3.2400
## 1st Qu.:2.000 1st Qu.: 5.200 1st Qu.: 0.1300 1st Qu.: 0.0300
## Median :3.000 Median : 6.900 Median : 0.2800 Median : 0.3100
## Mean :3.029 Mean : 7.076 Mean : 0.3241 Mean : 0.3084
## 3rd Qu.:4.000 3rd Qu.: 9.500 3rd Qu.: 0.6400 3rd Qu.: 0.5800
## Max. :8.000 Max. : 34.400 Max. : 3.6800 Max. : 3.8600
##
## ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide
## Min. :-127.800 Min. :-1.1710 Min. :-555.00 Min. :-823.0
## 1st Qu.: -2.000 1st Qu.:-0.0310 1st Qu.: 0.00 1st Qu.: 27.0
## Median : 3.900 Median : 0.0460 Median : 30.00 Median : 123.0
## Mean : 5.419 Mean : 0.0548 Mean : 30.85 Mean : 120.7
## 3rd Qu.: 15.900 3rd Qu.: 0.1530 3rd Qu.: 70.00 3rd Qu.: 208.0
## Max. : 141.150 Max. : 1.3510 Max. : 623.00 Max. :1057.0
## NA's :616 NA's :638 NA's :647 NA's :682
## Density pH Sulphates Alcohol
## Min. :0.8881 Min. :0.480 Min. :-3.1300 Min. :-4.70
## 1st Qu.:0.9877 1st Qu.:2.960 1st Qu.: 0.2800 1st Qu.: 9.00
## Median :0.9945 Median :3.200 Median : 0.5000 Median :10.40
## Mean :0.9942 Mean :3.208 Mean : 0.5271 Mean :10.49
## 3rd Qu.:1.0005 3rd Qu.:3.470 3rd Qu.: 0.8600 3rd Qu.:12.40
## Max. :1.0992 Max. :6.130 Max. : 4.2400 Max. :26.50
## NA's :395 NA's :1210 NA's :653
## LabelAppeal AcidIndex STARS
## Min. :-2.000000 Min. : 4.000 Min. :1.000
## 1st Qu.:-1.000000 1st Qu.: 7.000 1st Qu.:1.000
## Median : 0.000000 Median : 8.000 Median :2.000
## Mean :-0.009066 Mean : 7.773 Mean :2.042
## 3rd Qu.: 1.000000 3rd Qu.: 8.000 3rd Qu.:3.000
## Max. : 2.000000 Max. :17.000 Max. :4.000
## NA's :3359
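Before turning to the figures, a quick programmatic check of the missingness and of the discrete/continuous split is useful. The sketch below assumes the wine_train data frame loaded in the appendix; the discrete_vars and continuous_vars names are illustrative only.

```r
# count missing values per variable (should match the NA counts in the summary above)
sapply(wine_train, function(x) sum(is.na(x)))

# split variable names into the discrete and continuous groups used in the plots below
discrete_vars <- c("LabelAppeal", "AcidIndex", "STARS")
continuous_vars <- setdiff(names(wine_train), c("TARGET", discrete_vars))
```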
In the box plots below, we plot the variables in two panels. TotalSulfurDioxide, FreeSulfurDioxide, and ResidualSugar have much larger ranges than the other variables, so we separate them into their own panel to view their distributions. From both panels, we can see that many of the variables have numerous outliers, although almost all of them are centered around zero. We also see that several variables take negative values.
In the bar charts below, AcidIndex shows that the largest quantities of wine were sold at index values of 7 and 8. LabelAppeal shows that generically labeled wine sells the most overall, although a more appealing label is associated with more cases per order. Lastly, STARS shows that an excellent rating alone does not result in high wine orders, which could be due to the higher price tag of highly rated bottles.
In the histograms below, we see that all of the continuous variables appear approximately normally distributed. However, several variables contain negative values, which we may need to transform to positive values for our analysis.
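One simple option, sketched below under the assumption that these chemistry measurements should be physically non-negative, is to take absolute values before modeling. The wine_train object is the training data from the appendix; the nonneg_cols and wine_train_abs names are illustrative, and this transformation is not necessarily the one applied in the models that follow.

```r
library(dplyr)

# chemistry measurements that should not be negative (illustrative selection)
nonneg_cols <- c("FixedAcidity", "VolatileAcidity", "CitricAcid", "ResidualSugar",
                 "Chlorides", "FreeSulfurDioxide", "TotalSulfurDioxide", "Sulphates", "Alcohol")

# take absolute values as one possible positivity transformation
wine_train_abs <- wine_train %>%
  mutate(across(all_of(nonneg_cols), abs))
```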
| Variable | Correlation |
|---|---|
| TARGET | 1.0000000 |
| STARS | 0.5546857 |
| LabelAppeal | 0.4979465 |
| Alcohol | 0.0737771 |
| FreeSulfurDioxide | 0.0226398 |
| TotalSulfurDioxide | 0.0216021 |
| ResidualSugar | 0.0035196 |
| CitricAcid | 0.0023450 |
| pH | 0.0002199 |
| FixedAcidity | -0.0125381 |
| Sulphates | -0.0212204 |
| Chlorides | -0.0304301 |
| Density | -0.0475989 |
| VolatileAcidity | -0.0759979 |
| AcidIndex | -0.1676431 |
Using the aggr function from the VIM package, we see that several variables have missing values. According to the UCI Machine Learning Repository, which published this dataset, all wine contains some natural sulfites, so we will impute the missing values for the sulfite chemical properties. It is also rare to find wine with less than 1 gram/liter of residual sugar, so we will impute that as well. In fact, since the missing values appear to be missing at random, we will impute all of them using the mice package with the random forest method. mice performs multivariate imputation by chained equations, and using multiple imputations helps resolve the uncertainty introduced by the missingness. Our response variable, TARGET, will be removed as a predictor in the imputation model but will still be imputed.
##
## Variables sorted by number of missings:
## Variable Count
## STARS 3359
## Sulphates 1210
## TotalSulfurDioxide 682
## Alcohol 653
## FreeSulfurDioxide 647
## Chlorides 638
## ResidualSugar 616
## pH 395
## TARGET 0
## FixedAcidity 0
## VolatileAcidity 0
## CitricAcid 0
## Density 0
## LabelAppeal 0
## AcidIndex 0
##
## iter imp variable
## 1 1 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 1 2 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 1 3 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 1 4 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 1 5 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 2 1 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 2 2 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 2 3 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 2 4 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 2 5 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 3 1 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 3 2 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 3 3 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 3 4 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 3 5 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 4 1 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 4 2 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 4 3 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 4 4 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 4 5 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 5 1 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 5 2 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 5 3 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 5 4 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 5 5 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
##
## iter imp variable
## 1 1 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 1 2 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 1 3 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 1 4 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 1 5 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 2 1 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 2 2 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 2 3 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 2 4 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 2 5 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 3 1 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 3 2 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 3 3 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 3 4 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 3 5 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 4 1 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 4 2 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 4 3 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 4 4 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 4 5 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 5 1 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 5 2 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 5 3 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 5 4 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 5 5 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## [1] "Missing value after imputation: 0"
We will build two Poisson regression models, one on the dataset with missing values and one on the imputed dataset; two negative binomial regression models, a full model on the imputed data and one chosen by stepwise variable selection; and two multiple linear regression models, again full and stepwise on the imputed data, to see which model yields the best performance.
In our Poisson regression model below, the deviance residuals are fairly symmetric, which means the predicted values are close to the observed values. As expected from our correlation table, STARS, LabelAppeal, and AcidIndex are significant variables, and their standard errors are small. We will compare the AIC with the other models.
##
## Call:
## glm(formula = TARGET ~ ., family = poisson, data = wine_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.2158 -0.2734 0.0616 0.3732 1.6830
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.593e+00 2.506e-01 6.359 2.03e-10 ***
## FixedAcidity 3.293e-04 1.053e-03 0.313 0.75447
## VolatileAcidity -2.560e-02 8.353e-03 -3.065 0.00218 **
## CitricAcid -7.259e-04 7.575e-03 -0.096 0.92365
## ResidualSugar -6.141e-05 1.941e-04 -0.316 0.75165
## Chlorides -3.007e-02 2.056e-02 -1.463 0.14346
## FreeSulfurDioxide 6.734e-05 4.404e-05 1.529 0.12620
## TotalSulfurDioxide 2.081e-05 2.855e-05 0.729 0.46618
## Density -3.725e-01 2.462e-01 -1.513 0.13026
## pH -4.661e-03 9.598e-03 -0.486 0.62722
## Sulphates -5.164e-03 7.051e-03 -0.732 0.46398
## Alcohol 3.948e-03 1.771e-03 2.229 0.02579 *
## LabelAppeal 1.771e-01 7.954e-03 22.271 < 2e-16 ***
## AcidIndex -4.870e-02 5.903e-03 -8.251 < 2e-16 ***
## STARS 1.871e-01 7.487e-03 24.993 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 5844.1 on 6435 degrees of freedom
## Residual deviance: 4009.1 on 6421 degrees of freedom
## (6359 observations deleted due to missingness)
## AIC: 23172
##
## Number of Fisher Scoring iterations: 5
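Because the Poisson model uses a log link, its coefficients are easiest to read after exponentiation, as multiplicative effects on the expected number of cases. A small sketch using model1 as fit in the appendix:

```r
# rate ratios: e.g., each additional STAR multiplies expected cases by exp(0.187), about 1.21
round(exp(coef(model1)), 3)
```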
In our Poisson regression model below, the deviance residuals are fairly symmetric, which means the predicted values are close to the observed values. With the imputed data, more variables are significant than in the previous model. However, the AIC increases drastically, largely because this model is fit on all 12,795 observations rather than the 6,436 complete cases used by the first model, so the two AIC values are not directly comparable. The residual deviance is well below the null deviance, which suggests the predictors improve on an intercept-only model.
##
## Call:
## glm(formula = TARGET ~ ., family = poisson, data = wine_train_imputed)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.8449 -0.5193 0.2089 0.6305 2.5305
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.039e+00 1.953e-01 10.444 < 2e-16 ***
## FixedAcidity -2.934e-04 8.194e-04 -0.358 0.720269
## VolatileAcidity -5.090e-02 6.489e-03 -7.844 4.36e-15 ***
## CitricAcid 1.338e-02 5.891e-03 2.271 0.023124 *
## ResidualSugar 1.649e-04 1.513e-04 1.089 0.275961
## Chlorides -5.657e-02 1.613e-02 -3.507 0.000453 ***
## FreeSulfurDioxide 1.332e-04 3.457e-05 3.854 0.000116 ***
## TotalSulfurDioxide 1.106e-04 2.204e-05 5.019 5.19e-07 ***
## Density -4.090e-01 1.917e-01 -2.133 0.032893 *
## pH -2.336e-02 7.553e-03 -3.093 0.001979 **
## Sulphates -1.713e-02 5.536e-03 -3.094 0.001978 **
## Alcohol 4.966e-03 1.387e-03 3.581 0.000343 ***
## LabelAppeal 1.974e-01 6.024e-03 32.768 < 2e-16 ***
## AcidIndex -1.199e-01 4.470e-03 -26.816 < 2e-16 ***
## STARS 1.928e-01 5.816e-03 33.149 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 22861 on 12794 degrees of freedom
## Residual deviance: 18474 on 12780 degrees of freedom
## AIC: 50446
##
## Number of Fisher Scoring iterations: 5
In our negative binomial regression model below, the deviance residuals are fairly symmetric, which means the predicted values are close to the observed values. Poisson and negative binomial regression give essentially the same results when there is little or no overdispersion, and we can see that here by comparing the coefficients of model 3 and model 2: they are practically identical. The model also converged after only one Fisher scoring iteration.
##
## Call:
## glm.nb(formula = TARGET ~ ., data = wine_train_imputed, init.theta = 37200.03811,
## link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.8447 -0.5193 0.2089 0.6305 2.5304
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.039e+00 1.953e-01 10.443 < 2e-16 ***
## FixedAcidity -2.935e-04 8.195e-04 -0.358 0.720264
## VolatileAcidity -5.090e-02 6.490e-03 -7.844 4.37e-15 ***
## CitricAcid 1.338e-02 5.892e-03 2.271 0.023130 *
## ResidualSugar 1.649e-04 1.513e-04 1.089 0.275963
## Chlorides -5.657e-02 1.613e-02 -3.507 0.000453 ***
## FreeSulfurDioxide 1.332e-04 3.457e-05 3.854 0.000116 ***
## TotalSulfurDioxide 1.106e-04 2.204e-05 5.019 5.19e-07 ***
## Density -4.090e-01 1.917e-01 -2.133 0.032899 *
## pH -2.336e-02 7.553e-03 -3.093 0.001980 **
## Sulphates -1.713e-02 5.536e-03 -3.094 0.001978 **
## Alcohol 4.966e-03 1.387e-03 3.580 0.000343 ***
## LabelAppeal 1.974e-01 6.024e-03 32.767 < 2e-16 ***
## AcidIndex -1.199e-01 4.471e-03 -26.815 < 2e-16 ***
## STARS 1.928e-01 5.816e-03 33.148 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(37200.04) family taken to be 1)
##
## Null deviance: 22860 on 12794 degrees of freedom
## Residual deviance: 18473 on 12780 degrees of freedom
## AIC: 50448
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 37200
## Std. Err.: 60018
## Warning while fitting theta: iteration limit reached
##
## 2 x log-likelihood: -50416.39
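The extremely large theta estimate (about 37,200) indicates the negative binomial model found essentially no overdispersion beyond the Poisson assumption, which is why its coefficients collapse onto the Poisson fit. One way to check dispersion directly is the Pearson statistic; a brief sketch using model2 from the appendix, where values well above 1 would indicate overdispersion:

```r
# Pearson chi-square dispersion statistic for the imputed-data Poisson model
dispersion <- sum(residuals(model2, type = "pearson")^2) / df.residual(model2)
dispersion
```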
In our negative binomial regression model below, we use a forward and backward stepwise variable selection algorithm. This model is only slightly better, with a marginally lower AIC.
##
## Call:
## glm.nb(formula = TARGET ~ VolatileAcidity + CitricAcid + Chlorides +
## FreeSulfurDioxide + TotalSulfurDioxide + Density + pH + Sulphates +
## Alcohol + LabelAppeal + AcidIndex + STARS, data = wine_train_imputed,
## init.theta = 37205.32973, link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.8469 -0.5183 0.2103 0.6292 2.5379
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.039e+00 1.953e-01 10.441 < 2e-16 ***
## VolatileAcidity -5.101e-02 6.489e-03 -7.860 3.83e-15 ***
## CitricAcid 1.330e-02 5.891e-03 2.257 0.023979 *
## Chlorides -5.646e-02 1.613e-02 -3.501 0.000464 ***
## FreeSulfurDioxide 1.335e-04 3.456e-05 3.862 0.000113 ***
## TotalSulfurDioxide 1.112e-04 2.203e-05 5.045 4.54e-07 ***
## Density -4.081e-01 1.917e-01 -2.128 0.033300 *
## pH -2.322e-02 7.551e-03 -3.076 0.002101 **
## Sulphates -1.721e-02 5.535e-03 -3.108 0.001881 **
## Alcohol 4.942e-03 1.387e-03 3.563 0.000367 ***
## LabelAppeal 1.974e-01 6.024e-03 32.774 < 2e-16 ***
## AcidIndex -1.201e-01 4.419e-03 -27.186 < 2e-16 ***
## STARS 1.928e-01 5.816e-03 33.157 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(37205.33) family taken to be 1)
##
## Null deviance: 22860 on 12794 degrees of freedom
## Residual deviance: 18475 on 12782 degrees of freedom
## AIC: 50446
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 37205
## Std. Err.: 60034
## Warning while fitting theta: iteration limit reached
##
## 2 x log-likelihood: -50417.72
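The full and stepwise negative binomial fits can also be compared directly; a short sketch using the model3 and model4 objects from the appendix:

```r
# side-by-side AIC comparison and a likelihood-ratio test for the dropped terms
AIC(model3, model4)
anova(model3, model4)
```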
In our multiple linear regression model below, the R-squared is 0.28, which means the model explains about 28% of the variation in the data. As seen in the previous models, FixedAcidity and ResidualSugar have no meaningful impact in this model. So far, none of the models adequately explains the dataset.
##
## Call:
## lm(formula = TARGET ~ ., data = wine_train_imputed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.8654 -0.7149 0.3674 1.1120 4.9647
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.455e+00 5.541e-01 9.844 < 2e-16 ***
## FixedAcidity -5.024e-04 2.326e-03 -0.216 0.82901
## VolatileAcidity -1.550e-01 1.847e-02 -8.393 < 2e-16 ***
## CitricAcid 4.054e-02 1.681e-02 2.411 0.01592 *
## ResidualSugar 5.220e-04 4.308e-04 1.212 0.22568
## Chlorides -1.787e-01 4.574e-02 -3.908 9.36e-05 ***
## FreeSulfurDioxide 4.005e-04 9.824e-05 4.077 4.59e-05 ***
## TotalSulfurDioxide 3.232e-04 6.252e-05 5.170 2.37e-07 ***
## Density -1.192e+00 5.454e-01 -2.185 0.02887 *
## pH -6.102e-02 2.140e-02 -2.852 0.00436 **
## Sulphates -4.885e-02 1.574e-02 -3.103 0.00192 **
## Alcohol 1.739e-02 3.927e-03 4.429 9.54e-06 ***
## LabelAppeal 5.976e-01 1.699e-02 35.176 < 2e-16 ***
## AcidIndex -3.157e-01 1.126e-02 -28.039 < 2e-16 ***
## STARS 6.308e-01 1.724e-02 36.592 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.635 on 12780 degrees of freedom
## Multiple R-squared: 0.2808, Adjusted R-squared: 0.28
## F-statistic: 356.4 on 14 and 12780 DF, p-value: < 2.2e-16
In our last model, a multiple linear regression fit with the forward and backward stepwise variable selection algorithm, we see output similar to model 5: the R-squared is 0.28, which means the model explains about 28% of the variation in the data.
##
## Call:
## lm(formula = TARGET ~ VolatileAcidity + CitricAcid + Chlorides +
## FreeSulfurDioxide + TotalSulfurDioxide + Density + pH + Sulphates +
## Alcohol + LabelAppeal + AcidIndex + STARS, data = wine_train_imputed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.8687 -0.7184 0.3665 1.1119 4.9712
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.452e+00 5.541e-01 9.840 < 2e-16 ***
## VolatileAcidity -1.552e-01 1.847e-02 -8.400 < 2e-16 ***
## CitricAcid 4.041e-02 1.681e-02 2.404 0.01624 *
## Chlorides -1.786e-01 4.573e-02 -3.906 9.45e-05 ***
## FreeSulfurDioxide 4.021e-04 9.822e-05 4.094 4.26e-05 ***
## TotalSulfurDioxide 3.248e-04 6.250e-05 5.198 2.05e-07 ***
## Density -1.187e+00 5.454e-01 -2.176 0.02956 *
## pH -6.069e-02 2.140e-02 -2.836 0.00457 **
## Sulphates -4.905e-02 1.574e-02 -3.117 0.00183 **
## Alcohol 1.730e-02 3.926e-03 4.406 1.06e-05 ***
## LabelAppeal 5.976e-01 1.699e-02 35.177 < 2e-16 ***
## AcidIndex -3.162e-01 1.109e-02 -28.526 < 2e-16 ***
## STARS 6.310e-01 1.724e-02 36.608 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.635 on 12782 degrees of freedom
## Multiple R-squared: 0.2807, Adjusted R-squared: 0.28
## F-statistic: 415.7 on 12 and 12782 DF, p-value: < 2.2e-16
| | Model 1 | Model 2 | Model 3 | Model 4 |
|---|---|---|---|---|
| MSE | 6.9023 | 7.0484 | 7.0484 | 7.0486 |
| AIC | 23171.75 | 50446.27 | 50448.39 | 50445.72 |
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.5847 1.0685 1.2508 1.2705 1.4527 2.1245 1624
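Note that predict() on a Poisson glm returns values on the log (link) scale by default, which is why the summary above ranges from roughly 0.58 to 2.12 rather than spanning the observed case counts. A minimal sketch converting to the response scale (the pred_counts name is illustrative):

```r
# expected number of cases on the original scale
pred_counts <- predict(model1, wine_test, type = "response")
summary(pred_counts)
```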
# knitr settings
knitr::opts_chunk$set(warning = F,
message = F,
echo = F,
fig.align = "center")
# load libraries
library(kableExtra)
library(knitr)
library(dplyr)
library(ggplot2)
library(psych)
library(tidyr)
library(ggpubr)
library(corrplot)
library(RColorBrewer)
library(VIM)
library(mice)
library(MASS)
library(caret)
library(pROC)
library(Metrics)
vr <- c("INDEX", "TARGET", "AcidIndex", "Alcohol", "Chlorides", "CitricAcid", "Density", "FixedAcidity", "FreeSulfurDioxide", "LabelAppeal", "ResidualSugar", "STARS", "Sulphates", "TotalSulfurDioxide", "VolatileAcidity", "pH")
def <- c("Identification Variable (do not use)", "Number of Cases Purchased", "Proprietary method of testing total acidity of wine by using a weighted average", "Alcohol Content", "Chloride content of wine", "Citric Acid Content", "Density of Wine", "Fixed Acidity of Wine", "Sulfur Dioxide content of wine", "Marketing Score indicating the appeal of label design for consumers. High numbers suggest customers like the label design. Negative numbers suggest customers don't like the design.", "Residual Sugar of wine", "Wine rating by a team of experts. 4 Stars = Excellent, 1 Star = Poor", "Sulfate content of wine", "Total Sulfur Dioxide of Wine", "Volatile Acid content of wine", "pH of wine")
te <- c("None", "None", "", "", "", "", "", "", "", "Many consumers purchase based on the visual appeal of the wine label design. Higher numbers suggest better sales", "", "A high number of stars suggests high sales", "", "", "", "")
kable(cbind(vr, def, te), col.names = c("Variable Name", "Definition", "Theoretical Effect")) %>%
kable_styling()
# load data
url1 <- "https://raw.githubusercontent.com/saayedalam/Data/master/wine-training-data.csv"
url2 <- "https://raw.githubusercontent.com/saayedalam/Data/master/wine-evaluation-data.csv"
wine_train <- read.csv(url1)
wine_test <- read.csv(url2)
# remove index column as it is not needed
# (read.csv mangles the first column name to "ï..INDEX" because of a byte-order mark in the file)
wine_train <- wine_train %>%
dplyr::select(-"ï..INDEX")
wine_test <- wine_test %>%
dplyr::select(-"IN")
# summary statistics
wine_train %>%
glimpse() %>%
summary()
# boxplot
p1 <- wine_train %>%
dplyr::select(-c("TotalSulfurDioxide", "FreeSulfurDioxide", "ResidualSugar")) %>%
gather(na.rm = TRUE) %>%
ggplot(aes(factor(key), value)) +
geom_boxplot(outlier.colour = "#e281cf", outlier.shape = 1, color = "#5aa1ed") +
coord_flip() +
labs(title = "Boxplot of Chemical Properties of Wine", x = "Chemical Properties", y = "Values") +
theme_minimal()
p2 <- wine_train %>%
dplyr::select(c("TotalSulfurDioxide", "FreeSulfurDioxide", "ResidualSugar")) %>%
gather(na.rm = TRUE) %>%
ggplot(aes(factor(key), value)) +
geom_boxplot(outlier.colour = "#e281cf", outlier.shape = 1, color = "#5aa1ed") +
#labs(title = "Boxplot of Chemical Properties of Wine", x = "Chemical Properties", y = "Values") +
theme_minimal()
ggarrange(p1, p2)
# barchart
p3 <- wine_train %>%
dplyr::select(TARGET, STARS) %>%
mutate(STARS = as.factor(STARS),
TARGET = as.factor(TARGET)) %>%
ggplot(aes(STARS)) +
geom_bar(aes(fill = TARGET)) +
theme_minimal()
p4 <- wine_train %>%
dplyr::select(TARGET, LabelAppeal) %>%
mutate(LabelAppeal = as.factor(LabelAppeal),
TARGET = as.factor(TARGET)) %>%
ggplot(aes(LabelAppeal)) +
geom_bar(aes(fill = TARGET)) +
theme_minimal()
p5 <- wine_train %>%
dplyr::select(TARGET, AcidIndex) %>%
mutate(AcidIndex = as.factor(AcidIndex),
TARGET = as.factor(TARGET)) %>%
ggplot(aes(AcidIndex)) +
geom_bar(aes(fill = TARGET)) +
theme_minimal()
ggarrange(p5, ggarrange(p3, p4, ncol = 2, nrow = 1, legend = "none"), nrow = 2, common.legend = TRUE)
# histogram
wine_train %>%
dplyr::select(-c("AcidIndex", "STARS", "TARGET", "LabelAppeal")) %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~key, scale = "free", ncol = 3) +
geom_histogram(binwidth = function(x) 2 * IQR(x) / (length(x)^(1/3)), fill="#56B4E9") +
theme_minimal()
# top correlation
wine_train_corr <- wine_train %>%
drop_na() %>%
cor()
kable(sort(wine_train_corr[,1], decreasing = T), col.names = c("Correlation")) %>%
kable_styling(full_width = F)
# correlation plot
corrplot(wine_train_corr,
method = "number",
type = "lower",
col = brewer.pal(n = 9, name = "Blues"),
number.cex = .7, tl.cex = .7,
tl.col = "black", tl.srt = 45)
# missing value columns
aggr(wine_train,
sortVars=TRUE,
labels=names(wine_train),
cex.axis=.5,
bars = FALSE,
col = c("#efeaee", "#5aa1ed"),
combined = TRUE,
#border = NA,
ylab = "Missing Values")
# imputing train data
# (the first mice() call below is only used to extract the default method and predictor matrix,
#  which is why two sets of imputation logs appear above)
init <- mice(wine_train)
meth <- init$method
predM <- init$predictorMatrix
predM[, c("TARGET")] <- 0 # remove TARGET as a predictor; it will still be imputed
wine_train_impute <- mice(wine_train, method = 'rf', predictorMatrix=predM)
wine_train_imputed <- complete(wine_train_impute)
print(paste0("Missing value after imputation: ", sum(is.na(wine_train_imputed))))
# poisson model with the missing values
model1 <- glm(TARGET ~ ., family = poisson, wine_train)
summary(model1)
# poisson model with the imputed values
model2 <- glm(TARGET ~ ., family = poisson, wine_train_imputed)
summary(model2)
# negative binomial regression with the imputed values
model3 <- glm.nb(TARGET ~ ., wine_train_imputed)
summary(model3)
# negative binomial regression with stepwise variable selection
model4 <- stepAIC(model3, direction = "both", trace = FALSE)
summary(model4)
# multiple linear regression with the imputed values
model5 <- lm(TARGET ~ ., wine_train_imputed)
summary(model5)
# multiple linear regression with stepwise variable selection
model6 <- stepAIC(model5, direction = "both", trace = FALSE)
summary(model6)
# metrics
aic1 <- model1$aic
aic2 <- model2$aic
aic3 <- model3$aic
aic4 <- model4$aic
# note: predict() on glm/glm.nb objects returns link-scale (log) predictions by default,
# and model1's fitted values cover only the complete cases, so these MSEs are rough comparisons
mse1 <- mean((wine_train$TARGET - predict(model1))^2)
mse2 <- mean((wine_train_imputed$TARGET - predict(model2))^2)
mse3 <- mean((wine_train_imputed$TARGET - predict(model3))^2)
mse4 <- mean((wine_train_imputed$TARGET - predict(model4))^2)
MSE <- list(mse1, mse2, mse3, mse4)
AIC <- list(aic1, aic2, aic3, aic4)
kable(rbind(MSE, AIC), col.names = c("Model 1", "Model 2", "Model 3", "Model 4")) %>%
kable_styling(full_width = T)
# predict on the evaluation set (link-scale predictions; NAs where predictors are missing)
pred_eval <- predict(model1, wine_test)
summary(pred_eval)