Restaurants purchased from 0 to 8 cases of each wine. Many purchased 0; among the rest, the distribution is centered around 4. Given this multimodal structure, the count distribution for our target may be best modelled in two parts. We'll explore that possibility later in our analysis.
Correlations among most of our variables are quite low. In our corrplot, only STARS, label appeal, alcohol and acid index have any visible correlation with our target. Acid index is negatively correlated, indicating that consumers don't like acidic wines. Consumers appear to be more moved by marketing qualities than by other qualities: a flashy rating and a nice label seem to do more to make restaurants buy than taste does.
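A minimal sketch of how predictors could be ranked by their correlation with the target. It uses the built-in `mtcars` data as a stand-in (an assumption, since the wine data is not loaded here), with `mpg` playing the role of TARGET.

```r
# Stand-in data: mtcars ships with R; mpg plays the role of TARGET here.
correl_matrix <- cor(mtcars, use = "complete.obs")

# Pull the column of correlations with the target and drop the target itself.
target_cor <- correl_matrix[, "mpg"]
target_cor <- target_cor[names(target_cor) != "mpg"]

# Rank predictors by absolute correlation, strongest first.
ranked <- target_cor[order(abs(target_cor), decreasing = TRUE)]
print(round(ranked, 3))
```

Sorting by absolute value keeps strongly negative predictors, such as acid index in the wine data, near the top of the list.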
Looking at the individual distributions of our variables, most of them appear to be fairly normal. Acid index is right-skewed; when we log it, its distribution appears normal, so we keep acid index as a logged variable. STARS appears in the dataset with a lot of NA entries. Assuming that an unrated wine is overlooked and likely to remain overlooked, we expect rated wines to be more desirable, so we replace every NA with a low placeholder value (-1 in the code at the end of this report) and store the result as augmentedSTARS. STARS is the variable most apparently correlated with number purchased in our corrplot. Below, we see that a simple linear model with STARS as the independent variable improves when the NAs are replaced this way: the adjusted R² is .3122 for the regular STARS variable and .4739 for augmentedSTARS. Above, we saw that the target contains a lot of 0s, so our model may need to be more complex than a single distribution.
##
## Call:
## lm(formula = TARGET ~ STARS, data = wine_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.6459 -0.6459 0.3153 0.4319 3.3930
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.72362 0.03278 52.58 <2e-16 ***
## STARS 0.96112 0.01469 65.45 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.287 on 9434 degrees of freedom
## (3359 observations deleted due to missingness)
## Multiple R-squared: 0.3123, Adjusted R-squared: 0.3122
## F-statistic: 4283 on 1 and 9434 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = TARGET ~ augmentedSTARS, data = wine_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.6780 -1.1057 0.1795 0.8943 6.8943
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.963108 0.015849 123.9 <2e-16 ***
## augmentedSTARS 0.857423 0.007987 107.4 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.397 on 12793 degrees of freedom
## Multiple R-squared: 0.4739, Adjusted R-squared: 0.4739
## F-statistic: 1.152e+04 on 1 and 12793 DF, p-value: < 2.2e-16
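The effect of the imputation on the row count can be sketched on toy data. The values below are hypothetical; only the pattern (copy STARS, replace each NA with -1) follows the code at the end of this report.

```r
# Toy data (hypothetical values): STARS is missing for some wines.
toy <- data.frame(
  TARGET = c(0, 0, 3, 4, 5, 0, 2, 6, 4, 0),
  STARS  = c(NA, NA, 2, 3, 4, NA, 1, 4, 3, NA)
)

# Copy STARS and replace each NA with the placeholder value -1.
toy$augmentedSTARS <- toy$STARS
toy$augmentedSTARS[is.na(toy$augmentedSTARS)] <- -1

# lm() drops rows with NA by default, so the two fits use different row counts.
fit_raw <- lm(TARGET ~ STARS, data = toy)
fit_aug <- lm(TARGET ~ augmentedSTARS, data = toy)

cat("rows used, raw STARS:      ", nobs(fit_raw), "\n")
cat("rows used, augmented STARS:", nobs(fit_aug), "\n")
```

This is the same mechanism behind the "3359 observations deleted due to missingness" line in the first model summary: the unaugmented model never sees the unrated wines.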
Looking at boxplots of our variables confirms our earlier look at the data. We do see, however, that some variables have interesting features that may help our model. The large number of outliers will hamper our ability to create a model with a high R². Chlorides appear to be elevated for less desirable wines. Density, sulphates and pH appear to be higher in more desirable wines.
Our single-variable augmentedSTARS model serves as a benchmark against which to compare other models. The root mean square error (RMSE) for this model is:
## [1] 1.39657
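For reference, RMSE is the square root of the mean squared difference between actual and predicted values. A minimal base-R sketch follows; the helper name `rmse` and the sample values are illustrative, and the report itself calls `Metrics::rmse`.

```r
# Root mean square error: square root of the mean squared difference.
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Hypothetical values, for illustration only.
actual    <- c(3, 0, 4, 5)
predicted <- c(2, 1, 4, 3)
rmse(actual, predicted)
```

Because errors are squared before averaging, a few large misses raise the RMSE more than many small ones.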
We create a negative binomial count model using four variables: augmentedSTARS, LabelAppeal, AcidIndex and Alcohol. Alcohol was not statistically significant in the model, and dropping it left the AIC essentially unchanged, so we keep the three-variable model. We also plot a ROC curve and calculate an AUC for this model. With an AUC of .5105 and a high RMSE of 2.596958, this model does not do a great job of predicting our count.
##
## Call:
## glm.nb(formula = TARGET ~ augmentedSTARS + LabelAppeal + AcidIndex,
## data = training_set, init.theta = 46220.14491, link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.0443 -0.7031 0.0255 0.5355 3.3292
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.942763 0.084754 22.92 <2e-16 ***
## augmentedSTARS 0.276564 0.004440 62.29 <2e-16 ***
## LabelAppeal 0.138255 0.006982 19.80 <2e-16 ***
## AcidIndex -0.636685 0.041207 -15.45 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(46220.14) family taken to be 1)
##
## Null deviance: 17155 on 9596 degrees of freedom
## Residual deviance: 10639 on 9593 degrees of freedom
## AIC: 34606
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 46220
## Std. Err.: 50729
## Warning while fitting theta: iteration limit reached
##
## 2 x log-likelihood: -34595.7
## [1] "RMSE for negative binomial model:"
## [1] 2.596958
## Area under the curve: 0.5105
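The very large theta in the output above (about 46,220, with a warning that the iteration limit was reached) suggests the negative binomial fit is behaving almost exactly like a Poisson fit. A small sketch of why, using the mean/theta variance formula of the negative binomial; the function name `nb_variance` is illustrative.

```r
# Negative binomial variance under the mean/theta parameterisation:
# Var(Y) = mu + mu^2 / theta. As theta grows, this approaches the Poisson
# variance, which equals mu.
nb_variance <- function(mu, theta) mu + mu^2 / theta

mu        <- 3.03    # roughly the mean of TARGET reported in this document
theta_hat <- 46220   # theta from the fitted model output above

nb_variance(mu, theta_hat)   # barely larger than mu itself
nb_variance(mu, 2)           # a small theta, by contrast, inflates the variance
```

With theta this large, the extra variance term is negligible, so near-identical coefficients and AIC values for the two model families are what one would expect.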
We also create a Poisson model for our count data. Its RMSE (2.596937) and AIC (34603, versus 34606) are nearly identical to those of the negative binomial model. With an AUC of .5046, it also fails to capture our data well.
##
## Call:
## glm(formula = TARGET ~ augmentedSTARS + LabelAppeal + AcidIndex +
## Alcohol, family = "poisson", data = training_set)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.0455 -0.7065 0.0216 0.5327 3.3341
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.912004 0.087340 21.891 <2e-16 ***
## augmentedSTARS 0.276160 0.004448 62.089 <2e-16 ***
## LabelAppeal 0.138301 0.006981 19.811 <2e-16 ***
## AcidIndex -0.633486 0.041255 -15.355 <2e-16 ***
## Alcohol 0.002362 0.001626 1.452 0.146
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 17156 on 9596 degrees of freedom
## Residual deviance: 10637 on 9592 degrees of freedom
## AIC: 34603
##
## Number of Fisher Scoring iterations: 5
## [1] "RMSE for Poisson model"
## [1] 2.596937
## [1] "actual mean"
## [1] 3.03127
## [1] "prediction mean"
## [1] 0.9831358
## [1] "poisson model"
## Area under the curve: 0.5046
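The output above shows a prediction mean of 0.98 against an actual mean of 3.03. One possible explanation, offered as an assumption rather than a confirmed diagnosis, is the scale of the predictions: `predict()` on a `glm` object returns values on the link (log) scale unless `type = "response"` is requested. The sketch below uses simulated data, not the wine data, to show the difference.

```r
set.seed(1)

# Simulated count data (hypothetical; not the wine data).
n <- 500
x <- rnorm(n)
y <- rpois(n, lambda = exp(1 + 0.3 * x))
sim <- data.frame(x = x, y = y)

fit <- glm(y ~ x, family = poisson(link = "log"), data = sim)

# predict.glm() defaults to type = "link": the log of the expected count.
pred_link <- predict(fit, newdata = sim)

# type = "response" applies the inverse link and returns expected counts.
pred_resp <- predict(fit, newdata = sim, type = "response")

rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

cat("mean of y:            ", mean(sim$y), "\n")
cat("mean, link scale:     ", mean(pred_link), "\n")
cat("mean, response scale: ", mean(pred_resp), "\n")
cat("RMSE, link scale:     ", rmse(sim$y, pred_link), "\n")
cat("RMSE, response scale: ", rmse(sim$y, pred_resp), "\n")
```

If the same applies to the wine models, RMSE values computed from link-scale predictions would not be directly comparable with those of the linear and zero-inflated models, whose predictions are on the count scale.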
We now build a multiple regression model through forward stepwise selection, beginning with our augmentedSTARS model. The process chooses 11 variables as predictors. This model produces an R² of .5368, much better than the original STARS model (.3123) and an improvement on the augmentedSTARS model (.4739). It produces an RMSE of 1.308914, far lower than the Poisson model's 2.596937. The AUC for this model is 0.5231.
##
## Call:
## lm(formula = TARGET ~ augmentedSTARS, data = wine_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.6780 -1.1057 0.1795 0.8943 6.8943
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.963108 0.015849 123.9 <2e-16 ***
## augmentedSTARS 0.857423 0.007987 107.4 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.397 on 12793 degrees of freedom
## Multiple R-squared: 0.4739, Adjusted R-squared: 0.4739
## F-statistic: 1.152e+04 on 1 and 12793 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = TARGET ~ ., data = wine_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.7007 -0.8493 0.0238 0.8456 6.2022
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.194e+00 4.631e-01 13.376 < 2e-16 ***
## FixedAcidity 1.067e-04 1.866e-03 0.057 0.954400
## VolatileAcidity -9.730e-02 1.484e-02 -6.558 5.67e-11 ***
## CitricAcid 1.728e-02 1.349e-02 1.280 0.200418
## ResidualSugar 2.506e-04 3.525e-04 0.711 0.477079
## Chlorides -1.152e-01 3.741e-02 -3.079 0.002081 **
## FreeSulfurDioxide 2.846e-04 8.016e-05 3.550 0.000387 ***
## TotalSulfurDioxide 2.303e-04 5.151e-05 4.471 7.84e-06 ***
## Density -8.035e-01 4.377e-01 -1.836 0.066452 .
## pH -3.224e-02 1.738e-02 -1.855 0.063559 .
## Sulphates -3.165e-02 1.309e-02 -2.418 0.015632 *
## Alcohol 1.228e-02 3.203e-03 3.833 0.000127 ***
## LabelAppeal 4.690e-01 1.342e-02 34.962 < 2e-16 ***
## AcidIndex -1.629e+00 7.673e-02 -21.226 < 2e-16 ***
## augmentedSTARS 7.577e-01 7.877e-03 96.189 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.312 on 12780 degrees of freedom
## Multiple R-squared: 0.5369, Adjusted R-squared: 0.5364
## F-statistic: 1058 on 14 and 12780 DF, p-value: < 2.2e-16
## [1] "The forward stepwise regression chose the model below:"
##
## Call:
## lm(formula = TARGET ~ augmentedSTARS + LabelAppeal + AcidIndex +
## VolatileAcidity + TotalSulfurDioxide + Alcohol + FreeSulfurDioxide +
## Chlorides + Sulphates + Density + pH, data = wine_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.7055 -0.8505 0.0230 0.8464 6.2003
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.193e+00 4.629e-01 13.380 < 2e-16 ***
## augmentedSTARS 7.579e-01 7.874e-03 96.249 < 2e-16 ***
## LabelAppeal 4.691e-01 1.342e-02 34.966 < 2e-16 ***
## AcidIndex -1.621e+00 7.544e-02 -21.492 < 2e-16 ***
## VolatileAcidity -9.769e-02 1.483e-02 -6.586 4.70e-11 ***
## TotalSulfurDioxide 2.315e-04 5.149e-05 4.497 6.94e-06 ***
## Alcohol 1.231e-02 3.202e-03 3.845 0.000121 ***
## FreeSulfurDioxide 2.864e-04 8.014e-05 3.573 0.000354 ***
## Chlorides -1.158e-01 3.741e-02 -3.094 0.001977 **
## Sulphates -3.194e-02 1.309e-02 -2.440 0.014684 *
## Density -8.112e-01 4.377e-01 -1.854 0.063832 .
## pH -3.218e-02 1.738e-02 -1.852 0.064064 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.312 on 12783 degrees of freedom
## Multiple R-squared: 0.5368, Adjusted R-squared: 0.5364
## F-statistic: 1347 on 11 and 12783 DF, p-value: < 2.2e-16
## [1] 1.308914
## [1] "stepwise multi-regression model"
## Area under the curve: 0.5231
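The forward stepwise procedure can be sketched with base R's `step()`. The example uses the built-in `mtcars` data as a stand-in (an assumption), with `mpg` in the role of TARGET and `wt` in the role of augmentedSTARS.

```r
# Stand-in data: mtcars ships with R; mpg plays the role of TARGET.
base_model <- lm(mpg ~ wt, data = mtcars)   # starting point (one predictor)
full_model <- lm(mpg ~ ., data = mtcars)    # upper bound (all predictors)

# Forward selection: add one term at a time while the AIC keeps improving.
chosen <- step(base_model,
               scope = list(lower = formula(base_model),
                            upper = formula(full_model)),
               direction = "forward",
               trace = 0)

formula(chosen)
summary(chosen)$adj.r.squared
```

Forward selection stops as soon as no remaining candidate lowers the AIC, which is why a chosen model can end up with fewer terms than the full model (11 of 14 predictors in the wine data).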
We return to a Poisson model, this time using the variables chosen by our forward stepwise regression. This model achieves an RMSE of 2.596436, essentially the same as before, so the choice of variables is not likely to help our Poisson or negative binomial models. We need a way to account for the large set of zero values. We want a two-part model that separates the zero values from the others and then models the Poisson nature of the rest of the data. We turn to a zero-inflated model to incorporate this aspect of the data.
## [1] "RMSE for Poisson forward stepwise regression model"
## [1] 2.596436
##
## Call:
## zeroinfl(formula = TARGET ~ augmentedSTARS + LabelAppeal + AcidIndex +
## VolatileAcidity + TotalSulfurDioxide + Alcohol + FreeSulfurDioxide +
## Chlorides + Sulphates + Density + pH, data = training_set)
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## -2.023793 -0.403177 -0.002897 0.365911 5.488482
##
## Count model coefficients (poisson with log link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.737e+00 2.455e-01 7.076 1.48e-12 ***
## augmentedSTARS 8.437e-02 5.173e-03 16.308 < 2e-16 ***
## LabelAppeal 2.407e-01 7.294e-03 32.999 < 2e-16 ***
## AcidIndex -1.682e-01 4.519e-02 -3.723 0.000197 ***
## VolatileAcidity -1.083e-02 7.757e-03 -1.396 0.162761
## TotalSulfurDioxide -6.613e-06 2.593e-05 -0.255 0.798710
## Alcohol 7.073e-03 1.667e-03 4.243 2.20e-05 ***
## FreeSulfurDioxide 1.297e-05 4.070e-05 0.319 0.749976
## Chlorides -1.600e-02 1.954e-02 -0.819 0.412777
## Sulphates 1.159e-03 6.921e-03 0.167 0.867004
## Density -3.448e-01 2.304e-01 -1.497 0.134454
## pH 5.561e-03 9.131e-03 0.609 0.542482
##
## Zero-inflation model coefficients (binomial with logit link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.5458687 1.6180352 -5.900 3.64e-09 ***
## augmentedSTARS -1.3753840 0.0360098 -38.195 < 2e-16 ***
## LabelAppeal 0.7249666 0.0507035 14.298 < 2e-16 ***
## AcidIndex 3.5708259 0.2535387 14.084 < 2e-16 ***
## VolatileAcidity 0.1933707 0.0514577 3.758 0.000171 ***
## TotalSulfurDioxide -0.0008810 0.0001753 -5.027 4.99e-07 ***
## Alcohol 0.0297627 0.0111317 2.674 0.007502 **
## FreeSulfurDioxide -0.0006363 0.0002784 -2.286 0.022278 *
## Chlorides 0.1020357 0.1274027 0.801 0.423195
## Sulphates 0.1031398 0.0449920 2.292 0.021883 *
## Density 0.2376958 1.5272531 0.156 0.876320
## pH 0.2356841 0.0608324 3.874 0.000107 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Number of iterations in BFGS optimization: 32
## Log-likelihood: -1.541e+04 on 24 Df
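The two coefficient tables above correspond to the two parts of a zero-inflated Poisson mixture: the zero-inflation part (logit link) gives the probability of a structural zero, and the count part (log link) gives the Poisson mean for the remaining cases. A minimal sketch of that mixture, with hypothetical parameter values and an illustrative function name `dzip`:

```r
# Zero-inflated Poisson: with probability pi_zero the count is a structural
# zero; otherwise it is drawn from a Poisson(lambda) distribution.
dzip <- function(y, lambda, pi_zero) {
  ifelse(y == 0,
         pi_zero + (1 - pi_zero) * dpois(0, lambda),
         (1 - pi_zero) * dpois(y, lambda))
}

# Hypothetical parameter values, for illustration only.
lambda  <- 4
pi_zero <- 0.2

probs <- dzip(0:30, lambda, pi_zero)
expected_count <- (1 - pi_zero) * lambda   # mean of the mixture

round(probs[1:9], 3)   # probabilities of 0 to 8 cases
expected_count
```

Read this way, the negative augmentedSTARS coefficient in the zero-inflation table means that a higher rating lowers the estimated probability of a structural zero.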
The zero-inflated model produces an RMSE of 1.279552 on our validation set, the lowest of all the models we've tested. The AUC for this model is .6909, the highest of the models tested. This model does the best job of describing our data.
## [1] 1.279552
## [1] "zero-inflated poisson model"
## Area under the curve: 0.6909
## [1] "mean of data:"
## [1] 3.03127
## [1] "mean of prediction:"
## [1] 2.993857
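An ROC curve is defined for a two-class outcome, whereas TARGET takes values from 0 to 8. One way to obtain a well-defined AUC, assuming the question of interest is whether a wine sells at all, is to binarize the target first. The sketch below computes a rank-based AUC in base R on hypothetical values; the helper `auc_binary` is illustrative and is not the pROC routine used in the report.

```r
# Rank-based AUC (equivalent to the Mann-Whitney U statistic).
auc_binary <- function(is_positive, score) {
  r <- rank(score)                      # average ranks handle ties
  n_pos <- sum(is_positive)
  n_neg <- sum(!is_positive)
  (sum(r[is_positive]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Hypothetical counts and predictions, for illustration only.
true_value  <- c(0, 0, 3, 4, 0, 5, 2, 0)
predictions <- c(0.8, 1.5, 2.9, 3.6, 2.0, 4.1, 1.2, 0.4)

# Binarize: did the restaurant buy at least one case?
bought_any <- true_value > 0
auc_binary(bought_any, predictions)
```

The result is the share of (buyer, non-buyer) pairs in which the buyer received the higher predicted count.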
Our residuals plot shows a more normal distribution. The stripe of missing residuals is now barely noticeable. Residuals are still positively skewed, but much tamer. The means of the predicted values and of the actual values are now within .04 of each other. The plot comparing the empirical distribution to the fitted distribution now appears closer to a normal q-q plot (given discrete data).
## Vuong Non-Nested Hypothesis Test-Statistic:
## (test-statistic is asymptotically distributed N(0,1) under the
## null that the models are indistinguishible)
## -------------------------------------------------------------
## Vuong z-statistic H_A p-value
## Raw -40.77643 model2 > model1 < 2.22e-16
## AIC-corrected -40.51415 model2 > model1 < 2.22e-16
## BIC-corrected -39.57397 model2 > model1 < 2.22e-16
To compare our Poisson models, we use a Vuong test. The Vuong test is a likelihood-ratio-based test for non-nested models; as the output above notes, its statistic is asymptotically standard normal under the null hypothesis that the two models are indistinguishable. It shows which of the two models is closer to the true model. (https://www.jstor.org/stable/1912557?seq=1#page_scan_tab_contents) In our case, the BIC-corrected z-statistic of -39.57397 (p < 2.22e-16) favors model 2, the zero-inflated model, so it is very unlikely that the regular Poisson model is better than the zero-inflated model. We chose the zero-inflated model.
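A sketch of how the raw (uncorrected) Vuong statistic can be computed by hand from per-observation log-likelihoods. It uses simulated data and two ordinary Poisson models, which is an assumption made to keep the example in base R; it is not the pscl implementation used in the report.

```r
set.seed(2)

# Simulated data (hypothetical): two non-nested Poisson models for one outcome.
n  <- 400
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- rpois(n, lambda = exp(0.5 + 0.6 * x1))
sim <- data.frame(y = y, x1 = x1, x2 = x2)

model1 <- glm(y ~ x2, family = poisson, data = sim)   # uninformative predictor
model2 <- glm(y ~ x1, family = poisson, data = sim)   # informative predictor

# Per-observation log-likelihood under each fitted model.
ll1 <- dpois(sim$y, lambda = fitted(model1), log = TRUE)
ll2 <- dpois(sim$y, lambda = fitted(model2), log = TRUE)

# Raw Vuong statistic: standardized mean of the log-likelihood differences.
m <- ll1 - ll2
vuong_z <- sqrt(n) * mean(m) / sd(m)
vuong_z   # strongly negative values favour model2
```

The sign convention matches the output above: a large negative z-statistic means the second model fits better than the first.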
Our first few predictions for the test set are:
## predict(zero_inf_model, wine_eval_data)
## 1 1.8622171
## 2 3.8813495
## 3 2.6839066
## 4 2.5606788
## 5 0.7080849
## 6 5.5746434
# Load packages quietly
suppressWarnings(suppressMessages(library(e1071)))
suppressWarnings(suppressMessages(library(MASS)))
suppressWarnings(suppressMessages(library(car)))
suppressWarnings(suppressMessages(library(corrplot)))
suppressWarnings(suppressMessages(library(pROC)))
suppressWarnings(suppressMessages(library(caret)))
suppressWarnings(suppressMessages(library(tidyr)))
suppressWarnings(suppressMessages(library(ggplot2)))
suppressWarnings(suppressMessages(library(dplyr)))
suppressWarnings(suppressMessages(library(kableExtra)))
suppressWarnings(suppressMessages(library(gridExtra)))
# Load the training data and drop the index column
wine_data<-as_data_frame(read.csv('https://raw.githubusercontent.com/WigodskyD/data-sets/master/wine-training-data.csv',stringsAsFactors=FALSE))
wine_data<-wine_data[,-1]
# Distribution of the target and correlation plot
hist(wine_data$TARGET,main='Number Purchased',xlab='Number Purchased',col='darkmagenta')
correl.matrix<-cor(wine_data, use= "complete.obs")
corrplot(correl.matrix,method= "color" , type= "upper")
# Copy STARS and replace missing ratings with -1
wine_data$augmentedSTARS<-wine_data$STARS
wine_data$augmentedSTARS[is.na(wine_data$augmentedSTARS)]<- -1
# Mean-impute the remaining variables with missing values
wine_data$ResidualSugar[is.na(wine_data$ResidualSugar)]<-mean(wine_data$ResidualSugar,na.rm=TRUE)
wine_data$Chlorides[is.na(wine_data$Chlorides)]<-mean(wine_data$Chlorides,na.rm=TRUE)
wine_data$FreeSulfurDioxide[is.na(wine_data$FreeSulfurDioxide)]<-mean(wine_data$FreeSulfurDioxide,na.rm=TRUE)
wine_data$TotalSulfurDioxide[is.na(wine_data$TotalSulfurDioxide)]<-mean(wine_data$TotalSulfurDioxide,na.rm=TRUE)
wine_data$Sulphates[is.na(wine_data$Sulphates)]<-mean(wine_data$Sulphates,na.rm=TRUE)
wine_data$Alcohol[is.na(wine_data$Alcohol)]<-mean(wine_data$Alcohol,na.rm=TRUE)
wine_data$pH[is.na(wine_data$pH)]<-mean(wine_data$pH,na.rm=TRUE)
# Histograms of individual variables on a 3 x 3 grid
par(mfrow=c(3,3))
par(bg = 'azure')
hist(wine_data$ResidualSugar,main="Residual Sugar",xlab='',col='paleturquoise')
hist(wine_data$Chlorides, main="chlorides",xlab='',col='paleturquoise')
hist(wine_data$FreeSulfurDioxide,main="free sulphur dioxide",xlab='',col='paleturquoise')
hist(wine_data$Density,main="density",xlab='',col='paleturquoise')
hist(wine_data$pH,main="pH",xlab='',col='paleturquoise')
hist(wine_data$Sulphates,main="sulphates",xlab='',col='paleturquoise')
hist(wine_data$LabelAppeal,main="label appeal",xlab='',col='paleturquoise')
hist(wine_data$AcidIndex,main="acid index",xlab='',col='paleturquoise')
hist(wine_data$augmentedSTARS,main="stars with 0 added",xlab='',col='paleturquoise')
# Log-transform acid index and plot the result
par(mfrow=c(1,1))
layout(matrix(1),heights=1,widths=3)
wine_data$AcidIndex<-log(wine_data$AcidIndex)
hist(wine_data$AcidIndex,main="acid index logged",xlab='',col='paleturquoise')
# Single-variable linear models: raw STARS versus augmented STARS
STARtest<-lm(data=wine_data,TARGET~STARS)
AUG_STARtest<-lm(data=wine_data,TARGET~augmentedSTARS)
summary(STARtest)
summary(AUG_STARtest)
# Boxplots of each predictor grouped by the number of cases purchased
plota<-ggplot()+geom_boxplot(data=wine_data, y=wine_data$ResidualSugar,
    aes(y=wine_data$ResidualSugar ,x=wine_data$TARGET,group=wine_data$TARGET))+
  labs(x='target',y='residual sugar')+ theme(panel.background = element_rect(fill = 'Tomato'))
plotb<-ggplot()+geom_boxplot(data=wine_data, y=wine_data$Chlorides,
    aes(y=wine_data$Chlorides ,x=wine_data$TARGET,group=wine_data$TARGET))+
  labs(x='target',y='chlorides')+ theme(panel.background = element_rect(fill = 'Tomato'))
plotc<-ggplot()+geom_boxplot(data=wine_data, y=wine_data$FreeSulfurDioxide,
    aes(y=wine_data$FreeSulfurDioxide ,x=wine_data$TARGET,group=wine_data$TARGET))+
  labs(x='',y='free sulfur dioxides')+ theme(panel.background = element_rect(fill = 'Tomato'))
plotd<-ggplot()+geom_boxplot(data=wine_data, y=wine_data$TotalSulfurDioxide,
    aes(y=wine_data$TotalSulfurDioxide ,x=wine_data$TARGET,group=wine_data$TARGET))+
  labs(x='',y='total sulfur dioxide')+ theme(panel.background = element_rect(fill = 'Tomato'))
plote<-ggplot()+geom_boxplot(data=wine_data, y=wine_data$Density,
    aes(y=wine_data$Density ,x=wine_data$TARGET,group=wine_data$TARGET))+
  labs(x='target',y='density')+ theme(panel.background = element_rect(fill = 'Tomato'))
plotf<-ggplot()+geom_boxplot(data=wine_data, y=wine_data$pH,
    aes(y=wine_data$pH ,x=wine_data$TARGET,group=wine_data$TARGET))+
  labs(x='target',y='pH')+ theme(panel.background = element_rect(fill = 'Tomato'))
plotg<-ggplot()+geom_boxplot(data=wine_data, y=wine_data$Sulphates,
    aes(y=wine_data$Sulphates ,x=wine_data$TARGET,group=wine_data$TARGET))+
  labs(x='',y='sulphates')+ theme(panel.background = element_rect(fill = 'Tomato'))
ploth<-ggplot()+geom_boxplot(data=wine_data, y=wine_data$Alcohol,
    aes(y=wine_data$Alcohol ,x=wine_data$TARGET,group=wine_data$TARGET))+
  labs(x='',y='alcohol')+ theme(panel.background = element_rect(fill = 'Tomato'))
ploti<-ggplot()+geom_boxplot(data=wine_data, y=wine_data$LabelAppeal,
    aes(y=wine_data$LabelAppeal ,x=wine_data$TARGET,group=wine_data$TARGET))+
  labs(x='target',y='label appeal')+ theme(panel.background = element_rect(fill = 'Tomato'))
plotj<-ggplot()+geom_boxplot(data=wine_data, y=wine_data$AcidIndex,
    aes(y=wine_data$AcidIndex ,x=wine_data$TARGET,group=wine_data$TARGET))+
  labs(x='target',y='acid index')+ theme(panel.background = element_rect(fill = 'Tomato'))
plotk<-ggplot()+geom_boxplot(data=wine_data, y=wine_data$augmentedSTARS,
    aes(y=wine_data$augmentedSTARS ,x=wine_data$TARGET,group=wine_data$TARGET))+
  labs(x='',y='stars with zeros')+ theme(panel.background = element_rect(fill = 'Tomato'))
# Drop column 15, then hold out 25% of the rows for testing
set.seed(102)
wine_data<-wine_data[-15]
testing_indices<-sample.int(length(wine_data$TARGET),size=.25*length(wine_data$TARGET))
testing_set<-wine_data[testing_indices,]
training_set<-wine_data[-testing_indices,]
# Arrange the boxplots on one page
grid.arrange(plota,plotb,plotc,plotd,plote,plotf,plotg,ploth,ploti,plotj,plotk,nrow = 6)
# Benchmark RMSE for the augmentedSTARS linear model
predictions<-as.data.frame(predict(AUG_STARtest,testing_set))
predictions<-cbind(predictions,testing_set$TARGET)
colnames(predictions)<-c('predictions','true_value')
Metrics::rmse(predictions$predictions,predictions$true_value)
# Negative binomial model
negative_bin_model<-glm.nb(data=training_set,TARGET~augmentedSTARS+LabelAppeal+AcidIndex)
summary(negative_bin_model)
predictions<-as.data.frame(predict(negative_bin_model,testing_set))
predictions<-cbind(predictions,testing_set$TARGET)
colnames(predictions)<-c('predictions','true_value')
print("RMSE for negative binomial model:")
Metrics::rmse(predictions$predictions,predictions$true_value)
roc_function_object<-roc(predictions$true_value, predictions$predictions)
auc(roc_function_object)
plot(roc_function_object)
# Poisson model
poisson_model<-glm(data=training_set,TARGET~augmentedSTARS+LabelAppeal+AcidIndex+Alcohol,family='poisson')
summary(poisson_model)
predictions<-as.data.frame(predict(poisson_model,testing_set))
predictions<-cbind(predictions,testing_set$TARGET)
colnames(predictions)<-c('predictions','true_value')
print('RMSE for Poisson model')
Metrics::rmse(predictions$predictions,predictions$true_value)
roc_function_object<-roc(predictions$true_value,predictions$predictions)
print('actual mean')
mean(predictions$true_value)
print('prediction mean')
mean(predictions$predictions)
plot(roc_function_object)
print('poisson model')
auc(roc_function_object)
plot(residuals(poisson_model))
# Quantile-quantile comparison of actual and predicted values
qq1<-quantile(predictions$true_value,probs=seq(0,1,.001))
qq2<-quantile(predictions$predictions,probs=seq(0,1,.001))
plot(y= qq1,x= qq2,ylab='empirical distribution',xlab='prediction distribution')
# Forward stepwise selection from the augmentedSTARS model up to the full model
aug_star_model<-lm(data = wine_data,TARGET~augmentedSTARS)
full_model<-lm(data = wine_data,TARGET~.)
summary(aug_star_model)
summary(full_model)
print('The forward stepwise regression chose the model below:')
step(aug_star_model,scope=list(lower=aug_star_model,upper=full_model),direction="forward")
# Refit the model chosen by forward stepwise selection and evaluate it
forward_step_model<-lm(formula = TARGET ~ augmentedSTARS + LabelAppeal + AcidIndex +
    VolatileAcidity + TotalSulfurDioxide + Alcohol + FreeSulfurDioxide +
    Chlorides + Sulphates + Density + pH, data = wine_data)
summary(forward_step_model)
predictions<-as.data.frame(predict(forward_step_model,testing_set))
predictions<-cbind(predictions,testing_set$TARGET)
colnames(predictions)<-c('predictions','true_value')
Metrics::rmse(predictions$predictions,predictions$true_value)
roc_function_object<-roc(predictions$true_value,predictions$predictions)
plot(roc_function_object)
print('stepwise multi-regression model')
auc(roc_function_object)
# Poisson model using the variables chosen by forward stepwise selection
poisson_step_model<-glm(data=training_set,TARGET~augmentedSTARS + LabelAppeal + AcidIndex +
    VolatileAcidity + TotalSulfurDioxide + Alcohol + FreeSulfurDioxide +
    Chlorides + Sulphates + Density + pH,family=poisson(link = "log"))
predictions<-as.data.frame(predict(poisson_step_model,testing_set))
predictions<-cbind(predictions,testing_set$TARGET)
colnames(predictions)<-c('predictions','true_value')
print('RMSE for Poisson forward stepwise regression model')
Metrics::rmse(predictions$predictions,predictions$true_value)
# Zero-inflated Poisson model
suppressWarnings(suppressMessages(library(pscl)))
zero_inf_model <- zeroinfl(TARGET ~ augmentedSTARS + LabelAppeal + AcidIndex +
    VolatileAcidity + TotalSulfurDioxide + Alcohol + FreeSulfurDioxide +
    Chlorides + Sulphates + Density + pH, data = training_set)
summary(zero_inf_model)
# Evaluate the zero-inflated model on the held-out set
predictions<-as.data.frame(predict(zero_inf_model,testing_set))
predictions<-cbind(predictions,testing_set$TARGET)
colnames(predictions)<-c('predictions','true_value')
Metrics::rmse(predictions$predictions,predictions$true_value)
roc_function_object<-roc(predictions$true_value,predictions$predictions)
plot(roc_function_object)
print('zero-inflated poisson model')
auc(roc_function_object)
print('mean of data:')
mean(predictions$true_value)
print('mean of prediction:')
mean(predictions$predictions)
# Residuals, quantile-quantile comparison and Vuong test
plot(residuals(zero_inf_model))
qq1<-quantile(predictions$true_value,probs=seq(0,1,.005))
qq2<-quantile(predictions$predictions,probs=seq(0,1,.005))
plot(y= qq1,x= qq2,ylab='empirical distribution',xlab='prediction distribution')
vuong(poisson_step_model, zero_inf_model)
# Prepare the evaluation data the same way as the training data
wine_eval_data<-as_data_frame(read.csv('C:/Users/dawig/Desktop/Data621/Homework_5/wine-evaluation-data.csv'))
wine_eval_data<-wine_eval_data[,-1]
wine_eval_data$augmentedSTARS<-wine_eval_data$STARS
wine_eval_data$augmentedSTARS[is.na(wine_eval_data$augmentedSTARS)]<- -1
wine_eval_data$ResidualSugar[is.na(wine_eval_data$ResidualSugar)]<-mean(wine_eval_data$ResidualSugar,na.rm=TRUE)
wine_eval_data$Chlorides[is.na(wine_eval_data$Chlorides)]<-mean(wine_eval_data$Chlorides,na.rm=TRUE)
wine_eval_data$FreeSulfurDioxide[is.na(wine_eval_data$FreeSulfurDioxide)]<-mean(wine_eval_data$FreeSulfurDioxide,na.rm=TRUE)
wine_eval_data$TotalSulfurDioxide[is.na(wine_eval_data$TotalSulfurDioxide)]<-mean(wine_eval_data$TotalSulfurDioxide,na.rm=TRUE)
wine_eval_data$Sulphates[is.na(wine_eval_data$Sulphates)]<-mean(wine_eval_data$Sulphates,na.rm=TRUE)
wine_eval_data$Alcohol[is.na(wine_eval_data$Alcohol)]<-mean(wine_eval_data$Alcohol,na.rm=TRUE)
wine_eval_data$pH[is.na(wine_eval_data$pH)]<-mean(wine_eval_data$pH,na.rm=TRUE)
wine_eval_data$AcidIndex<-log(wine_eval_data$AcidIndex)
# Predict with the zero-inflated model and save the results
wine_case_predictions<-as.data.frame(predict(zero_inf_model,wine_eval_data))
head(wine_case_predictions)
write.csv(wine_case_predictions,'C:/Users/dawig/Desktop/Data621/Homework_5/WigodskyDanpredictions5.csv')