Project 5
Wine Quality and Quantity Purchased:

a Count Regression Model

Our dataset consists of 15 variables describing different qualities of wines. A wine producer might use this data, along with the target variable (the number of cases purchased by restaurants), to determine what qualities consumers look for in wines and to plan accordingly.

Restaurants purchased from 0 to 8 cases of each wine. Many purchased 0. Among the nonzero counts, the distribution is centered around 4. Because of this multimodal structure, the count distribution for our target may be best modelled in 2 parts. We’ll explore that possibility later in our analysis.

The correlations among most of our variables are quite low. In our corrplot, only STARS, label appeal, alcohol and acid index have any visible correlation with our target. Acid index is negatively correlated, indicating that consumers don’t like acidic wines. Consumers appear to be moved more by advertising qualities than by other qualities. A flashy rating and a nice label do more to make restaurants buy than taste.

Looking at the individual distributions of our variables, most of them appear to be fairly normal. Acid index is right-skewed. When we log it, its distribution appears normal, so we keep acid index as a logged variable. STARS appears in the dataset with a lot of NA entries. We assume that an unrated wine is overlooked and is likely to remain overlooked, so rated wines would be more desirable. We therefore filled every NA with a single low placeholder value (the appendix code assigns -1) and stored the result in a new variable, augmentedSTARS. Below, we see that a simple linear model with STARS as the independent variable improves when the NA entries are filled in this way. Above, we saw that the target has a lot of 0s, so our final model may need to be more complex than a single simple distribution. STARS is the variable most visibly correlated with number purchased in our corrplot. The adjusted R² for the regular STARS variable is .3122. With the augmented variable, it becomes .4739.

## 
## Call:
## lm(formula = TARGET ~ STARS, data = wine_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.6459 -0.6459  0.3153  0.4319  3.3930 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.72362    0.03278   52.58   <2e-16 ***
## STARS        0.96112    0.01469   65.45   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.287 on 9434 degrees of freedom
##   (3359 observations deleted due to missingness)
## Multiple R-squared:  0.3123, Adjusted R-squared:  0.3122 
## F-statistic:  4283 on 1 and 9434 DF,  p-value: < 2.2e-16

## 
## Call:
## lm(formula = TARGET ~ augmentedSTARS, data = wine_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.6780 -1.1057  0.1795  0.8943  6.8943 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    1.963108   0.015849   123.9   <2e-16 ***
## augmentedSTARS 0.857423   0.007987   107.4   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.397 on 12793 degrees of freedom
## Multiple R-squared:  0.4739, Adjusted R-squared:  0.4739 
## F-statistic: 1.152e+04 on 1 and 12793 DF,  p-value: < 2.2e-16
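The comparison above can be reproduced in miniature on simulated data. The sketch below is only an illustration under stated assumptions: the data frame `toy`, its columns and the effect sizes are made up, and the placeholder value of -1 follows the appendix code rather than any general rule.

```r
# Illustrative sketch on simulated data (not the wine dataset).
set.seed(1)
n <- 2000
stars <- sample(1:4, n, replace = TRUE)      # hypothetical ratings
rated <- rbinom(n, 1, 0.7) == 1              # roughly 30% of wines are unrated
# Assumed data-generating process: unrated wines sell poorly
target <- ifelse(rated, rpois(n, 1 + stars), rpois(n, 0.5))
toy <- data.frame(target = target, stars = ifelse(rated, stars, NA))

# Model 1: lm() drops the rows where stars is NA
fit_drop <- lm(target ~ stars, data = toy)

# Model 2: fill NA with a placeholder so that every row is used
toy$aug_stars <- ifelse(is.na(toy$stars), -1, toy$stars)
fit_aug <- lm(target ~ aug_stars, data = toy)

c(rows_used_drop = nobs(fit_drop), rows_used_aug = nobs(fit_aug))
c(adj_r2_drop = summary(fit_drop)$adj.r.squared,
  adj_r2_aug  = summary(fit_aug)$adj.r.squared)
```

The two adjusted R² values are computed on different sets of rows, so they are not a strict like-for-like comparison. The same caveat applies to the two summaries above, which use 9436 and 12795 observations respectively.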

Boxplots of our variables against the target confirm our earlier look at the data. We do see, however, that some variables have interesting features that may help our model. The large number of outliers will hamper our ability to create a model with a high R². Chlorides appear to be elevated for less desirable wines. Density, sulphates and pH appear to be higher in more desirable wines.

Our single-variable augmentedSTARS model serves as a benchmark against which we compare other models. The root mean square error (RMSE) for this model is:

## [1] 1.39657
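For reference, the RMSE used throughout is the usual root mean square error. In notation introduced here (not taken from the original code), y_i are the actual case counts, ŷ_i the model's predictions and n the number of observations scored:

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}
```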

We create a negative binomial count model using 4 variables: augmentedSTARS, LabelAppeal, AcidIndex and Alcohol. Alcohol was not statistically significant within our model, and dropping it left the AIC unchanged, so we keep the model with 3 variables. We also plot a ROC curve and calculate an AUC for this model. With an AUC of .5105 and a high RMSE (2.597), this model does not do a great job of predicting our count.

## 
## Call:
## glm.nb(formula = TARGET ~ augmentedSTARS + LabelAppeal + AcidIndex, 
##     data = training_set, init.theta = 46220.14491, link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.0443  -0.7031   0.0255   0.5355   3.3292  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     1.942763   0.084754   22.92   <2e-16 ***
## augmentedSTARS  0.276564   0.004440   62.29   <2e-16 ***
## LabelAppeal     0.138255   0.006982   19.80   <2e-16 ***
## AcidIndex      -0.636685   0.041207  -15.45   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(46220.14) family taken to be 1)
## 
##     Null deviance: 17155  on 9596  degrees of freedom
## Residual deviance: 10639  on 9593  degrees of freedom
## AIC: 34606
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  46220 
##           Std. Err.:  50729 
## Warning while fitting theta: iteration limit reached 
## 
##  2 x log-likelihood:  -34595.7

## [1] "RMSE for negative binomial model:"

## [1] 2.596958

## Area under the curve: 0.5105
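One way to read the very large theta in the output above: in the parameterization used by `glm.nb`, the negative binomial variance is the Poisson variance plus an overdispersion term that shrinks as theta grows. With μ the fitted mean and θ the dispersion parameter:

```latex
\operatorname{Var}(Y) = \mu + \frac{\mu^{2}}{\theta},
\qquad
\lim_{\theta \to \infty} \operatorname{Var}(Y) = \mu
```

With theta estimated at about 46220 (and an iteration-limit warning), the second term is negligible, so this fit behaves almost exactly like a Poisson model.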

We also create a Poisson model for our count data. It has a similar RMSE (2.596937 versus 2.596958) and a nearly identical AIC (34603 versus 34606). With an AUC of .5046, it is also not capturing our data very fully.

## 
## Call:
## glm(formula = TARGET ~ augmentedSTARS + LabelAppeal + AcidIndex + 
##     Alcohol, family = "poisson", data = training_set)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.0455  -0.7065   0.0216   0.5327   3.3341  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     1.912004   0.087340  21.891   <2e-16 ***
## augmentedSTARS  0.276160   0.004448  62.089   <2e-16 ***
## LabelAppeal     0.138301   0.006981  19.811   <2e-16 ***
## AcidIndex      -0.633486   0.041255 -15.355   <2e-16 ***
## Alcohol         0.002362   0.001626   1.452    0.146    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 17156  on 9596  degrees of freedom
## Residual deviance: 10637  on 9592  degrees of freedom
## AIC: 34603
## 
## Number of Fisher Scoring iterations: 5

## [1] "RMSE for Poisson model"

## [1] 2.596937

## [1] "actual mean"

## [1] 3.03127

## [1] "prediction mean"

## [1] 0.9831358

## [1] "poisson model"

## Area under the curve: 0.5046

The residual plot for our Poisson model shows a potential source of trouble. There is a missing band around -1: our model seems to be incapable of predicting about 1 higher than the actual value. The predictions may instead be aiming toward lower numbers. In fact, the mean of the predictions is .983 while the mean of actual purchases is 3.031. Plotting the empirical distribution against the prediction distribution, our data appear too right-skewed for our model.
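One possible contributor to the gap between the two means, offered here as a hypothesis rather than a confirmed diagnosis: for a `glm` fit, `predict()` returns values on the link (log) scale unless `type = "response"` is given, and the appendix code calls `predict()` without that argument. The sketch below uses simulated data (the data frame `sim` and its coefficients are made up) to show how far apart the two scales can be.

```r
# Illustrative sketch on simulated data (not the wine dataset).
set.seed(42)
n <- 5000
sim <- data.frame(x = rnorm(n))
sim$y <- rpois(n, lambda = exp(1 + 0.3 * sim$x))   # assumed true model

fit <- glm(y ~ x, family = poisson, data = sim)

pred_link <- predict(fit, newdata = sim)                     # default: log scale
pred_resp <- predict(fit, newdata = sim, type = "response")  # expected counts

rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

c(mean_actual = mean(sim$y),
  mean_link   = mean(pred_link),
  mean_resp   = mean(pred_resp))
c(rmse_link = rmse(sim$y, pred_link),
  rmse_resp = rmse(sim$y, pred_resp))
```

On the response scale, the mean prediction of a Poisson model with an intercept matches the mean of the data it was fit to. On the link scale it sits near the log of a typical count. If the same thing is happening in our Poisson and negative binomial results, their RMSE values would not be directly comparable with those of the linear models.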

We now try to create a multiple regression model using a forward stepwise regression process. We begin with our augmentedSTARS model. The process chooses 11 variables as predictors. This model produces an R² of .5368. It is much better than the original STARS model (.3123) and an improvement on the augmentedSTARS model (.4739). It produces an RMSE of 1.308914. This is far better than the Poisson model. The AUC for this model is 0.5231.

## 
## Call:
## lm(formula = TARGET ~ augmentedSTARS, data = wine_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.6780 -1.1057  0.1795  0.8943  6.8943 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    1.963108   0.015849   123.9   <2e-16 ***
## augmentedSTARS 0.857423   0.007987   107.4   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.397 on 12793 degrees of freedom
## Multiple R-squared:  0.4739, Adjusted R-squared:  0.4739 
## F-statistic: 1.152e+04 on 1 and 12793 DF,  p-value: < 2.2e-16

## 
## Call:
## lm(formula = TARGET ~ ., data = wine_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.7007 -0.8493  0.0238  0.8456  6.2022 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         6.194e+00  4.631e-01  13.376  < 2e-16 ***
## FixedAcidity        1.067e-04  1.866e-03   0.057 0.954400    
## VolatileAcidity    -9.730e-02  1.484e-02  -6.558 5.67e-11 ***
## CitricAcid          1.728e-02  1.349e-02   1.280 0.200418    
## ResidualSugar       2.506e-04  3.525e-04   0.711 0.477079    
## Chlorides          -1.152e-01  3.741e-02  -3.079 0.002081 ** 
## FreeSulfurDioxide   2.846e-04  8.016e-05   3.550 0.000387 ***
## TotalSulfurDioxide  2.303e-04  5.151e-05   4.471 7.84e-06 ***
## Density            -8.035e-01  4.377e-01  -1.836 0.066452 .  
## pH                 -3.224e-02  1.738e-02  -1.855 0.063559 .  
## Sulphates          -3.165e-02  1.309e-02  -2.418 0.015632 *  
## Alcohol             1.228e-02  3.203e-03   3.833 0.000127 ***
## LabelAppeal         4.690e-01  1.342e-02  34.962  < 2e-16 ***
## AcidIndex          -1.629e+00  7.673e-02 -21.226  < 2e-16 ***
## augmentedSTARS      7.577e-01  7.877e-03  96.189  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.312 on 12780 degrees of freedom
## Multiple R-squared:  0.5369, Adjusted R-squared:  0.5364 
## F-statistic:  1058 on 14 and 12780 DF,  p-value: < 2.2e-16

## [1] "The forward stepwise regression chose the model below:"

## 
## Call:
## lm(formula = TARGET ~ augmentedSTARS + LabelAppeal + AcidIndex + 
##     VolatileAcidity + TotalSulfurDioxide + Alcohol + FreeSulfurDioxide + 
##     Chlorides + Sulphates + Density + pH, data = wine_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.7055 -0.8505  0.0230  0.8464  6.2003 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         6.193e+00  4.629e-01  13.380  < 2e-16 ***
## augmentedSTARS      7.579e-01  7.874e-03  96.249  < 2e-16 ***
## LabelAppeal         4.691e-01  1.342e-02  34.966  < 2e-16 ***
## AcidIndex          -1.621e+00  7.544e-02 -21.492  < 2e-16 ***
## VolatileAcidity    -9.769e-02  1.483e-02  -6.586 4.70e-11 ***
## TotalSulfurDioxide  2.315e-04  5.149e-05   4.497 6.94e-06 ***
## Alcohol             1.231e-02  3.202e-03   3.845 0.000121 ***
## FreeSulfurDioxide   2.864e-04  8.014e-05   3.573 0.000354 ***
## Chlorides          -1.158e-01  3.741e-02  -3.094 0.001977 ** 
## Sulphates          -3.194e-02  1.309e-02  -2.440 0.014684 *  
## Density            -8.112e-01  4.377e-01  -1.854 0.063832 .  
## pH                 -3.218e-02  1.738e-02  -1.852 0.064064 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.312 on 12783 degrees of freedom
## Multiple R-squared:  0.5368, Adjusted R-squared:  0.5364 
## F-statistic:  1347 on 11 and 12783 DF,  p-value: < 2.2e-16

## [1] 1.308914

## [1] "stepwise multi-regression model"

## Area under the curve: 0.5231

We return to a Poisson model. This time, we use the variables chosen by our forward stepwise regression. This model achieves an RMSE of 2.596436, essentially the same as before. It’s clear that the choice of variables is not likely to help our Poisson or negative binomial models. We need to find a way to account for the large set of zero values. We want a two-part model: one part separates the zero values from the others, and the other models the Poisson nature of the rest of the data. We turn to a zero-inflated model to incorporate this aspect of the data.

## [1] "RMSE for Poisson forward stepwise regression model"

## [1] 2.596436
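To make the problem concrete: a Poisson distribution with a mean near 3 puts very little probability on zero, so it cannot produce a large spike at zero and a center around 4 at the same time. The sketch below compares a plain Poisson with a zero-inflated Poisson that has the same mean. The helper `dzip` and the mixing weight `p_zero = 0.2` are illustrative assumptions, not estimates from our data.

```r
# Probability mass function of a zero-inflated Poisson (hypothetical helper):
# with probability p_zero the count is a structural zero,
# otherwise it is drawn from Poisson(lambda).
dzip <- function(k, lambda, p_zero) {
  (1 - p_zero) * dpois(k, lambda) + p_zero * (k == 0)
}

target_mean <- 3.03127                     # mean of actual purchases reported above
p_zero <- 0.2                              # assumed share of structural zeros
lambda_zip <- target_mean / (1 - p_zero)   # keeps the overall mean at target_mean

k <- 0:8
comparison <- data.frame(
  cases   = k,
  poisson = round(dpois(k, target_mean), 3),
  zip     = round(dzip(k, lambda_zip, p_zero), 3)
)
comparison

# Check: the ZIP mean equals (1 - p_zero) * lambda
zip_mean <- sum((0:200) * dzip(0:200, lambda_zip, p_zero))
zip_mean
```

With the same overall mean, the zero-inflated version places far more mass at zero and shifts the remaining mass toward higher counts, which is the shape described for our target.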

## 
## Call:
## zeroinfl(formula = TARGET ~ augmentedSTARS + LabelAppeal + AcidIndex + 
##     VolatileAcidity + TotalSulfurDioxide + Alcohol + FreeSulfurDioxide + 
##     Chlorides + Sulphates + Density + pH, data = training_set)
## 
## Pearson residuals:
##       Min        1Q    Median        3Q       Max 
## -2.023793 -0.403177 -0.002897  0.365911  5.488482 
## 
## Count model coefficients (poisson with log link):
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         1.737e+00  2.455e-01   7.076 1.48e-12 ***
## augmentedSTARS      8.437e-02  5.173e-03  16.308  < 2e-16 ***
## LabelAppeal         2.407e-01  7.294e-03  32.999  < 2e-16 ***
## AcidIndex          -1.682e-01  4.519e-02  -3.723 0.000197 ***
## VolatileAcidity    -1.083e-02  7.757e-03  -1.396 0.162761    
## TotalSulfurDioxide -6.613e-06  2.593e-05  -0.255 0.798710    
## Alcohol             7.073e-03  1.667e-03   4.243 2.20e-05 ***
## FreeSulfurDioxide   1.297e-05  4.070e-05   0.319 0.749976    
## Chlorides          -1.600e-02  1.954e-02  -0.819 0.412777    
## Sulphates           1.159e-03  6.921e-03   0.167 0.867004    
## Density            -3.448e-01  2.304e-01  -1.497 0.134454    
## pH                  5.561e-03  9.131e-03   0.609 0.542482    
## 
## Zero-inflation model coefficients (binomial with logit link):
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        -9.5458687  1.6180352  -5.900 3.64e-09 ***
## augmentedSTARS     -1.3753840  0.0360098 -38.195  < 2e-16 ***
## LabelAppeal         0.7249666  0.0507035  14.298  < 2e-16 ***
## AcidIndex           3.5708259  0.2535387  14.084  < 2e-16 ***
## VolatileAcidity     0.1933707  0.0514577   3.758 0.000171 ***
## TotalSulfurDioxide -0.0008810  0.0001753  -5.027 4.99e-07 ***
## Alcohol             0.0297627  0.0111317   2.674 0.007502 ** 
## FreeSulfurDioxide  -0.0006363  0.0002784  -2.286 0.022278 *  
## Chlorides           0.1020357  0.1274027   0.801 0.423195    
## Sulphates           0.1031398  0.0449920   2.292 0.021883 *  
## Density             0.2376958  1.5272531   0.156 0.876320    
## pH                  0.2356841  0.0608324   3.874 0.000107 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Number of iterations in BFGS optimization: 32 
## Log-likelihood: -1.541e+04 on 24 Df
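The two coefficient tables above correspond to the two parts of a zero-inflated Poisson model. As a sketch in notation introduced here (π_i, λ_i, x_i, β and γ are not names from the original code): a logit model gives the probability π_i that wine i is a structural zero, and a log-link Poisson model gives the mean λ_i of the count otherwise.

```latex
\begin{aligned}
P(Y_i = 0) &= \pi_i + (1 - \pi_i)\, e^{-\lambda_i}, \\
P(Y_i = k) &= (1 - \pi_i)\, \frac{\lambda_i^{k} e^{-\lambda_i}}{k!}, \qquad k = 1, 2, \dots \\
\log \lambda_i &= x_i^{\top}\beta, \qquad
\operatorname{logit}(\pi_i) = x_i^{\top}\gamma, \\
E[Y_i] &= (1 - \pi_i)\, \lambda_i
\end{aligned}
```

Under this reading, the negative augmentedSTARS coefficient in the zero-inflation table (-1.375) means that a higher rating lowers the estimated probability of a structural zero, while the positive AcidIndex coefficient (3.571) raises it.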

The zero-inflated model produces an RMSE of 1.279552 on our validation set, the lowest of all the models we’ve tested. The AUC for this model is .6909. This model does the best job of describing our data.

## [1] 1.279552

## [1] "zero-inflated poisson model"

## Area under the curve: 0.6909

## [1] "mean of data:"

## [1] 3.03127

## [1] "mean of prediction:"

## [1] 2.993857

Our residuals plot shows a more normal distribution. The stripe of missing residuals is now barely noticeable. Residuals are still positively skewed, but much less so. The means of the predicted values and of the actual values are now within .04 of each other. The plot comparing the empirical distribution to the prediction distribution now appears closer to a normal q-q plot (given discrete data).

## Vuong Non-Nested Hypothesis Test-Statistic: 
## (test-statistic is asymptotically distributed N(0,1) under the
##  null that the models are indistinguishible)
## -------------------------------------------------------------
##               Vuong z-statistic             H_A    p-value
## Raw                   -40.77643 model2 > model1 < 2.22e-16
## AIC-corrected         -40.51415 model2 > model1 < 2.22e-16
## BIC-corrected         -39.57397 model2 > model1 < 2.22e-16

To compare our Poisson models, we use a Vuong test. The Vuong test is a likelihood-ratio-based test for non-nested models. As the output above notes, its statistic is asymptotically standard normal under the null hypothesis that the two models are indistinguishable. It shows whether one model or the other is closer to the true model. (https://www.jstor.org/stable/1912557?seq=1#page_scan_tab_contents) In our case, a BIC-corrected z-statistic of -39.57397 shows that it is very unlikely that the regular Poisson model is better than the zero-inflated model. We choose the zero-inflated model.
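As a sketch of what the raw statistic measures (notation introduced here; the AIC- and BIC-corrected rows additionally penalize the difference in parameter counts): let m_i be the difference in pointwise log-likelihoods between model 1 and model 2 for observation i.

```latex
m_i = \log\frac{f_1(y_i \mid x_i)}{f_2(y_i \mid x_i)},
\qquad
z = \frac{\sqrt{n}\,\bar{m}}{s_m}
```

Here m̄ and s_m are the sample mean and standard deviation of the m_i. A large negative z, as in our output, favors model 2, which in the appendix call is the zero-inflated model.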

Our first few predictions for the test set are:

##   predict(zero_inf_model, wine_eval_data)
## 1                               1.8622171
## 2                               3.8813495
## 3                               2.6839066
## 4                               2.5606788
## 5                               0.7080849
## 6                               5.5746434

APPENDIX:

suppressWarnings(suppressMessages(library(e1071)))
suppressWarnings(suppressMessages(library(MASS)))
suppressWarnings(suppressMessages(library(car)))
suppressWarnings(suppressMessages(library(corrplot)))
suppressWarnings(suppressMessages(library(pROC)))
suppressWarnings(suppressMessages(library(caret)))
suppressWarnings(suppressMessages(library(tidyr)))
suppressWarnings(suppressMessages(library(ggplot2)))
suppressWarnings(suppressMessages(library(dplyr)))
suppressWarnings(suppressMessages(library(kableExtra)))
suppressWarnings(suppressMessages(library(gridExtra)))

wine_data <- as_data_frame(read.csv('https://raw.githubusercontent.com/WigodskyD/data-sets/master/wine-training-data.csv'), stringsAsFactors = FALSE)
wine_data <- wine_data[, -1]
hist(wine_data$TARGET, main = 'Number Purchased', xlab = 'Number Purchased', col = 'darkmagenta')
correl.matrix <- cor(wine_data, use = "complete.obs")
corrplot(correl.matrix, method = "color", type = "upper")

wine_data$augmentedSTARS <- wine_data$STARS
wine_data$augmentedSTARS[is.na(wine_data$augmentedSTARS)] <- -1
wine_data$ResidualSugar[is.na(wine_data$ResidualSugar)] <- mean(wine_data$ResidualSugar, na.rm = TRUE)
wine_data$Chlorides[is.na(wine_data$Chlorides)] <- mean(wine_data$Chlorides, na.rm = TRUE)
wine_data$FreeSulfurDioxide[is.na(wine_data$FreeSulfurDioxide)] <- mean(wine_data$FreeSulfurDioxide, na.rm = TRUE)
wine_data$TotalSulfurDioxide[is.na(wine_data$TotalSulfurDioxide)] <- mean(wine_data$TotalSulfurDioxide, na.rm = TRUE)
wine_data$Sulphates[is.na(wine_data$Sulphates)] <- mean(wine_data$Sulphates, na.rm = TRUE)
wine_data$Alcohol[is.na(wine_data$Alcohol)] <- mean(wine_data$Alcohol, na.rm = TRUE)
wine_data$pH[is.na(wine_data$pH)] <- mean(wine_data$pH, na.rm = TRUE)

par(mfrow = c(3, 3))
par(bg = 'azure')
hist(wine_data$ResidualSugar, main = "Residual Sugar", xlab = '', col = 'paleturquoise')
hist(wine_data$Chlorides, main = "chlorides", xlab = '', col = 'paleturquoise')
hist(wine_data$FreeSulfurDioxide, main = "free sulphur dioxide", xlab = '', col = 'paleturquoise')
hist(wine_data$Density, main = "density", xlab = '', col = 'paleturquoise')
hist(wine_data$pH, main = "pH", xlab = '', col = 'paleturquoise')
hist(wine_data$Sulphates, main = "sulphates", xlab = '', col = 'paleturquoise')
hist(wine_data$LabelAppeal, main = "label appeal", xlab = '', col = 'paleturquoise')
hist(wine_data$AcidIndex, main = "acid index", xlab = '', col = 'paleturquoise')
hist(wine_data$augmentedSTARS, main = "stars with 0 added", xlab = '', col = 'paleturquoise')

par(mfrow = c(1, 1))
layout(matrix(1), heights = 1, widths = 3)
wine_data$AcidIndex <- log(wine_data$AcidIndex)
hist(wine_data$AcidIndex, main = "acid index logged", xlab = '', col = 'paleturquoise')

STARtest <- lm(data = wine_data, TARGET ~ STARS)
AUG_STARtest <- lm(data = wine_data, TARGET ~ augmentedSTARS)
summary(STARtest)
summary(AUG_STARtest)

plota <- ggplot() +
  geom_boxplot(data = wine_data, y = wine_data$ResidualSugar,
               aes(y = wine_data$ResidualSugar, x = wine_data$TARGET, group = wine_data$TARGET)) +
  labs(x = 'target', y = 'residual sugar') +
  theme(panel.background = element_rect(fill = 'Tomato'))
plotb <- ggplot() +
  geom_boxplot(data = wine_data, y = wine_data$Chlorides,
               aes(y = wine_data$Chlorides, x = wine_data$TARGET, group = wine_data$TARGET)) +
  labs(x = 'target', y = 'chlorides') +
  theme(panel.background = element_rect(fill = 'Tomato'))
plotc <- ggplot() +
  geom_boxplot(data = wine_data, y = wine_data$FreeSulfurDioxide,
               aes(y = wine_data$FreeSulfurDioxide, x = wine_data$TARGET, group = wine_data$TARGET)) +
  labs(x = '', y = 'free sulfur dioxides') +
  theme(panel.background = element_rect(fill = 'Tomato'))
plotd <- ggplot() +
  geom_boxplot(data = wine_data, y = wine_data$TotalSulfurDioxide,
               aes(y = wine_data$TotalSulfurDioxide, x = wine_data$TARGET, group = wine_data$TARGET)) +
  labs(x = '', y = 'total sulfur dioxide') +
  theme(panel.background = element_rect(fill = 'Tomato'))
plote <- ggplot() +
  geom_boxplot(data = wine_data, y = wine_data$Density,
               aes(y = wine_data$Density, x = wine_data$TARGET, group = wine_data$TARGET)) +
  labs(x = 'target', y = 'density') +
  theme(panel.background = element_rect(fill = 'Tomato'))
plotf <- ggplot() +
  geom_boxplot(data = wine_data, y = wine_data$pH,
               aes(y = wine_data$pH, x = wine_data$TARGET, group = wine_data$TARGET)) +
  labs(x = 'target', y = 'pH') +
  theme(panel.background = element_rect(fill = 'Tomato'))
plotg <- ggplot() +
  geom_boxplot(data = wine_data, y = wine_data$Sulphates,
               aes(y = wine_data$Sulphates, x = wine_data$TARGET, group = wine_data$TARGET)) +
  labs(x = '', y = 'sulphates') +
  theme(panel.background = element_rect(fill = 'Tomato'))
ploth <- ggplot() +
  geom_boxplot(data = wine_data, y = wine_data$Alcohol,
               aes(y = wine_data$Alcohol, x = wine_data$TARGET, group = wine_data$TARGET)) +
  labs(x = '', y = 'alcohol') +
  theme(panel.background = element_rect(fill = 'Tomato'))
ploti <- ggplot() +
  geom_boxplot(data = wine_data, y = wine_data$LabelAppeal,
               aes(y = wine_data$LabelAppeal, x = wine_data$TARGET, group = wine_data$TARGET)) +
  labs(x = 'target', y = 'label appeal') +
  theme(panel.background = element_rect(fill = 'Tomato'))
plotj <- ggplot() +
  geom_boxplot(data = wine_data, y = wine_data$AcidIndex,
               aes(y = wine_data$AcidIndex, x = wine_data$TARGET, group = wine_data$TARGET)) +
  labs(x = 'target', y = 'acid index') +
  theme(panel.background = element_rect(fill = 'Tomato'))
plotk <- ggplot() +
  geom_boxplot(data = wine_data, y = wine_data$augmentedSTARS,
               aes(y = wine_data$augmentedSTARS, x = wine_data$TARGET, group = wine_data$TARGET)) +
  labs(x = '', y = 'stars with zeros') +
  theme(panel.background = element_rect(fill = 'Tomato'))

set.seed(102)
wine_data <- wine_data[-15]
testing_indices <- sample.int(length(wine_data$TARGET), size = .25 * length(wine_data$TARGET))
testing_set <- wine_data[testing_indices, ]
training_set <- wine_data[-testing_indices, ]

grid.arrange(plota, plotb, plotc, plotd, plote, plotf, plotg, ploth, ploti, plotj, plotk, nrow = 6)

predictions <- as.data.frame(predict(AUG_STARtest, testing_set))
predictions <- cbind(predictions, testing_set$TARGET)
colnames(predictions) <- c('predictions', 'true_value')
Metrics::rmse(predictions$predictions, predictions$true_value)

negative_bin_model <- glm.nb(data = training_set, TARGET ~ augmentedSTARS + LabelAppeal + AcidIndex)
summary(negative_bin_model)
# Note: predict() on a glm.nb fit returns the link (log) scale unless type = "response" is given
predictions <- as.data.frame(predict(negative_bin_model, testing_set))
predictions <- cbind(predictions, testing_set$TARGET)
colnames(predictions) <- c('predictions', 'true_value')
print("RMSE for negative binomial model:")
Metrics::rmse(predictions$predictions, predictions$true_value)
roc_function_object <- roc(predictions$true_value, predictions$predictions)
auc(roc_function_object)
plot(roc_function_object)

poisson_model <- glm(data = training_set, TARGET ~ augmentedSTARS + LabelAppeal + AcidIndex + Alcohol, family = 'poisson')
summary(poisson_model)
# Note: predict() on a glm fit returns the link (log) scale unless type = "response" is given
predictions <- as.data.frame(predict(poisson_model, testing_set))
predictions <- cbind(predictions, testing_set$TARGET)
colnames(predictions) <- c('predictions', 'true_value')
print('RMSE for Poisson model')
Metrics::rmse(predictions$predictions, predictions$true_value)
roc_function_object <- roc(predictions$true_value, predictions$predictions)
print('actual mean')
mean(predictions$true_value)
print('prediction mean')
mean(predictions$predictions)
plot(roc_function_object)
print('poisson model')
auc(roc_function_object)
plot(residuals(poisson_model))

qq1 <- quantile(predictions$true_value, probs = seq(0, 1, .001))
qq2 <- quantile(predictions$predictions, probs = seq(0, 1, .001))
plot(y = qq1, x = qq2, ylab = 'empirical distribution', xlab = 'prediction distribution')

aug_star_model <- lm(data = wine_data, TARGET ~ augmentedSTARS)
full_model <- lm(data = wine_data, TARGET ~ .)
summary(aug_star_model)
summary(full_model)
print('The forward stepwise regression chose the model below:')
step(aug_star_model, scope = list(lower = aug_star_model, upper = full_model), direction = "forward")

forward_step_model <- lm(formula = TARGET ~ augmentedSTARS + LabelAppeal + AcidIndex +
    VolatileAcidity + TotalSulfurDioxide + Alcohol + FreeSulfurDioxide +
    Chlorides + Sulphates + Density + pH, data = wine_data)
summary(forward_step_model)
predictions <- as.data.frame(predict(forward_step_model, testing_set))
predictions <- cbind(predictions, testing_set$TARGET)
colnames(predictions) <- c('predictions', 'true_value')
Metrics::rmse(predictions$predictions, predictions$true_value)
roc_function_object <- roc(predictions$true_value, predictions$predictions)
plot(roc_function_object)
print('stepwise multi-regression model')
auc(roc_function_object)

poisson_step_model <- glm(data = training_set, TARGET ~ augmentedSTARS + LabelAppeal + AcidIndex +
    VolatileAcidity + TotalSulfurDioxide + Alcohol + FreeSulfurDioxide +
    Chlorides + Sulphates + Density + pH, family = poisson(link = "log"))
predictions <- as.data.frame(predict(poisson_step_model, testing_set))
predictions <- cbind(predictions, testing_set$TARGET)
colnames(predictions) <- c('predictions', 'true_value')
print('RMSE for Poisson forward stepwise regression model')
Metrics::rmse(predictions$predictions, predictions$true_value)

suppressWarnings(suppressMessages(library(pscl)))
zero_inf_model <- zeroinfl(TARGET ~ augmentedSTARS + LabelAppeal + AcidIndex +
    VolatileAcidity + TotalSulfurDioxide + Alcohol + FreeSulfurDioxide +
    Chlorides + Sulphates + Density + pH, data = training_set)
summary(zero_inf_model)

predictions <- as.data.frame(predict(zero_inf_model, testing_set))
predictions <- cbind(predictions, testing_set$TARGET)
colnames(predictions) <- c('predictions', 'true_value')
Metrics::rmse(predictions$predictions, predictions$true_value)
roc_function_object <- roc(predictions$true_value, predictions$predictions)
plot(roc_function_object)
print('zero-inflated poisson model')
auc(roc_function_object)
print('mean of data:')
mean(predictions$true_value)
print('mean of prediction:')
mean(predictions$predictions)

plot(residuals(zero_inf_model))
qq1 <- quantile(predictions$true_value, probs = seq(0, 1, .005))
qq2 <- quantile(predictions$predictions, probs = seq(0, 1, .005))
plot(y = qq1, x = qq2, ylab = 'empirical distribution', xlab = 'prediction distribution')
vuong(poisson_step_model, zero_inf_model)

wine_eval_data <- as_data_frame(read.csv('C:/Users/dawig/Desktop/Data621/Homework_5/wine-evaluation-data.csv'))
wine_eval_data <- wine_eval_data[, -1]
wine_eval_data$augmentedSTARS <- wine_eval_data$STARS
wine_eval_data$augmentedSTARS[is.na(wine_eval_data$augmentedSTARS)] <- -1
wine_eval_data$ResidualSugar[is.na(wine_eval_data$ResidualSugar)] <- mean(wine_eval_data$ResidualSugar, na.rm = TRUE)
wine_eval_data$Chlorides[is.na(wine_eval_data$Chlorides)] <- mean(wine_eval_data$Chlorides, na.rm = TRUE)
wine_eval_data$FreeSulfurDioxide[is.na(wine_eval_data$FreeSulfurDioxide)] <- mean(wine_eval_data$FreeSulfurDioxide, na.rm = TRUE)
wine_eval_data$TotalSulfurDioxide[is.na(wine_eval_data$TotalSulfurDioxide)] <- mean(wine_eval_data$TotalSulfurDioxide, na.rm = TRUE)
wine_eval_data$Sulphates[is.na(wine_eval_data$Sulphates)] <- mean(wine_eval_data$Sulphates, na.rm = TRUE)
wine_eval_data$Alcohol[is.na(wine_eval_data$Alcohol)] <- mean(wine_eval_data$Alcohol, na.rm = TRUE)
wine_eval_data$pH[is.na(wine_eval_data$pH)] <- mean(wine_eval_data$pH, na.rm = TRUE)
wine_eval_data$AcidIndex <- log(wine_eval_data$AcidIndex)
wine_case_predictions <- as.data.frame(predict(zero_inf_model, wine_eval_data))
head(wine_case_predictions)
write.csv(wine_case_predictions, 'C:/Users/dawig/Desktop/Data621/Homework_5/WigodskyDanpredictions5.csv')

Data_621_project_5

Dan Wigodsky

November 29, 2018
