In this homework assignment, you will explore, analyze and model a data set containing information on approximately 12,000 commercially available wines. The variables are mostly related to the chemical properties of the wine being sold. The response variable is the number of sample cases of wine that were purchased by wine distribution companies after sampling a wine. These cases would be used to provide tasting samples to restaurants and wine stores around the United States. The more sample cases purchased, the more likely is a wine to be sold at a high end restaurant. A large wine manufacturer is studying the data in order to predict the number of wine cases ordered based upon the wine characteristics. If the wine manufacturer can predict the number of cases, then that manufacturer will be able to adjust their wine offering to maximize sales.
Your objective is to build a count regression model to predict the number of cases of wine that will be sold given certain properties of the wine. HINT: Sometimes, the fact that a variable is missing is actually predictive of the target. You can only use the variables given to you (or variables that you derive from the variables provided). Below is a short description of the variables of interest in the data set:
A write-up submitted in PDF format. Your write-up should have four sections. Each one is described below. You may assume you are addressing me as a fellow data scientist, so do not need to shy away from technical details.
Assigned predictions (number of cases of wine sold) for the evaluation data set.
Include your R statistical programming code in an Appendix.
I will start by loading the data from the repository and loading many of the R packages that I will be using.
loc_train = "https://raw.githubusercontent.com/chrisestevez/DataAnalytics/master/Data/HW5621/wine-training-data.csv"
loc_test = "https://raw.githubusercontent.com/chrisestevez/DataAnalytics/master/Data/HW5621/wine-evaluation-data.csv"
train_df = read.csv(loc_train, stringsAsFactors = FALSE)
test_df = read.csv(loc_test, stringsAsFactors = FALSE)
#library("psych")
library("caret")
library("MASS")
library("kableExtra")
library('DataExplorer')
library('mice')
library("Hmisc")
library("dplyr")
library("tidyr")
library("magrittr")
library("pscl")
Exploring the data reveals many Nas that will either need to be imputed with zero or replaced.
str(train_df)
## 'data.frame': 12795 obs. of 16 variables:
## $ ï..INDEX : int 1 2 4 5 6 7 8 11 12 13 ...
## $ TARGET : int 3 3 5 3 4 0 0 4 3 6 ...
## $ FixedAcidity : num 3.2 4.5 7.1 5.7 8 11.3 7.7 6.5 14.8 5.5 ...
## $ VolatileAcidity : num 1.16 0.16 2.64 0.385 0.33 0.32 0.29 -1.22 0.27 -0.22 ...
## $ CitricAcid : num -0.98 -0.81 -0.88 0.04 -1.26 0.59 -0.4 0.34 1.05 0.39 ...
## $ ResidualSugar : num 54.2 26.1 14.8 18.8 9.4 ...
## $ Chlorides : num -0.567 -0.425 0.037 -0.425 NA 0.556 0.06 0.04 -0.007 -0.277 ...
## $ FreeSulfurDioxide : num NA 15 214 22 -167 -37 287 523 -213 62 ...
## $ TotalSulfurDioxide: num 268 -327 142 115 108 15 156 551 NA 180 ...
## $ Density : num 0.993 1.028 0.995 0.996 0.995 ...
## $ pH : num 3.33 3.38 3.12 2.24 3.12 3.2 3.49 3.2 4.93 3.09 ...
## $ Sulphates : num -0.59 0.7 0.48 1.83 1.77 1.29 1.21 NA 0.26 0.75 ...
## $ Alcohol : num 9.9 NA 22 6.2 13.7 15.4 10.3 11.6 15 12.6 ...
## $ LabelAppeal : int 0 -1 -1 -1 0 0 0 1 0 0 ...
## $ AcidIndex : int 8 7 8 6 9 11 8 7 6 8 ...
## $ STARS : int 2 3 3 1 2 NA NA 3 NA 4 ...
summary(train_df)
## ï..INDEX TARGET FixedAcidity VolatileAcidity
## Min. : 1 Min. :0.000 Min. :-18.100 Min. :-2.7900
## 1st Qu.: 4038 1st Qu.:2.000 1st Qu.: 5.200 1st Qu.: 0.1300
## Median : 8110 Median :3.000 Median : 6.900 Median : 0.2800
## Mean : 8070 Mean :3.029 Mean : 7.076 Mean : 0.3241
## 3rd Qu.:12106 3rd Qu.:4.000 3rd Qu.: 9.500 3rd Qu.: 0.6400
## Max. :16129 Max. :8.000 Max. : 34.400 Max. : 3.6800
##
## CitricAcid ResidualSugar Chlorides FreeSulfurDioxide
## Min. :-3.2400 Min. :-127.800 Min. :-1.1710 Min. :-555.00
## 1st Qu.: 0.0300 1st Qu.: -2.000 1st Qu.:-0.0310 1st Qu.: 0.00
## Median : 0.3100 Median : 3.900 Median : 0.0460 Median : 30.00
## Mean : 0.3084 Mean : 5.419 Mean : 0.0548 Mean : 30.85
## 3rd Qu.: 0.5800 3rd Qu.: 15.900 3rd Qu.: 0.1530 3rd Qu.: 70.00
## Max. : 3.8600 Max. : 141.150 Max. : 1.3510 Max. : 623.00
## NA's :616 NA's :638 NA's :647
## TotalSulfurDioxide Density pH Sulphates
## Min. :-823.0 Min. :0.8881 Min. :0.480 Min. :-3.1300
## 1st Qu.: 27.0 1st Qu.:0.9877 1st Qu.:2.960 1st Qu.: 0.2800
## Median : 123.0 Median :0.9945 Median :3.200 Median : 0.5000
## Mean : 120.7 Mean :0.9942 Mean :3.208 Mean : 0.5271
## 3rd Qu.: 208.0 3rd Qu.:1.0005 3rd Qu.:3.470 3rd Qu.: 0.8600
## Max. :1057.0 Max. :1.0992 Max. :6.130 Max. : 4.2400
## NA's :682 NA's :395 NA's :1210
## Alcohol LabelAppeal AcidIndex STARS
## Min. :-4.70 Min. :-2.000000 Min. : 4.000 Min. :1.000
## 1st Qu.: 9.00 1st Qu.:-1.000000 1st Qu.: 7.000 1st Qu.:1.000
## Median :10.40 Median : 0.000000 Median : 8.000 Median :2.000
## Mean :10.49 Mean :-0.009066 Mean : 7.773 Mean :2.042
## 3rd Qu.:12.40 3rd Qu.: 1.000000 3rd Qu.: 8.000 3rd Qu.:3.000
## Max. :26.50 Max. : 2.000000 Max. :17.000 Max. :4.000
## NA's :653 NA's :3359
The correlation matrix reveals some positive correlation between STARS and TARGET variables.
library("corrplot")
cor_mx = cor(dplyr::select_if(train_df, is.numeric) ,use="pairwise.complete.obs", method = "pearson")
corrplot(cor_mx, method = "color",
type = "upper", order = "original", number.cex = .7,
addCoef.col = "black", # Add coefficient of correlation
tl.col = "black", tl.srt = 90, # Text label color and rotation
# hide correlation coefficient on the principal diagonal
diag = TRUE)
A histogram of the variable reveals some skewness in certain variables. The variables that should be transformed are AcidIndex and STARS.
hist.data.frame(train_df,n.unique=1, mtitl = "Wine")
To prepare the data for modeling, I will first start by replacing Na star ratings with zeros. I considered the fact that these wines might not have been rated.
The next step in the data transformation process is to convert the star rating into a dummy variable. I will use the general rule of leaving one out K-1. In this case, the value left out are wines with no ratings or zero.
Variable that will be removed from the data set are Index and STARS since these have become dummy variables.
The third step in data preparation is replacing remaining Nas. The replacement will be achieved using the mice package. After the process is finished, a plot indicating the distribution will be plotted for the test and training set.
PlotMissing(train_df)
train_df$STARS[is.na(train_df$STARS)] = 0
test_df$STARS[is.na(test_df$STARS)] = 0
train_df$one_Star =ifelse(train_df$STARS==1,1,0)
train_df$two_Star=ifelse(train_df$STARS==2,1,0)
train_df$three_Star=ifelse(train_df$STARS==3,1,0)
train_df$four_Star=ifelse(train_df$STARS==4,1,0)
test_df$one_Star =ifelse(test_df$STARS==1,1,0)
test_df$two_Star=ifelse(test_df$STARS==2,1,0)
test_df$three_Star=ifelse(test_df$STARS==3,1,0)
test_df$four_Star=ifelse(test_df$STARS==4,1,0)
#Removed first column
train_df = train_df %>% select(-1,-STARS)
test_df =test_df %>% select(-1,-STARS)
mice_imputes = mice(train_df, m = 2, maxit = 2, print = FALSE,seed = 143)
densityplot(mice_imputes)
imputed_train = mice::complete(mice_imputes)
mice_imputes_test = mice(test_df, m = 2, maxit = 2, print = FALSE,seed = 143)
#density for final test data set
densityplot(mice_imputes_test)
imp_Eval_test=mice::complete(mice_imputes_test)
#data cleaning
set.seed(143)
sample = sample.int(n = nrow(imputed_train), size = floor(.70*nrow(imputed_train)), replace = F)
train_F = imputed_train[sample, ]
test_F_train = imputed_train[-sample,]
rm(mice_imputes, mice_imputes_test,imputed_train,test_df,train_df,cor_mx,loc_test,loc_train,sample)
Using the training data set, build at least two different poisson regression models, at least two different negative binomial regression models, and at least two multiple linear regression models, using different variables (or the same variables with different transformations). Sometimes poisson and negative binomial regression models give the same results. If that is the case, comment on that. Consider changing the input variables if that occurs so that you get different models. Although not covered in class, you may also want to consider building zero-inflated poisson and negative binomial regression models. You may select the variables manually, use an approach such as Forward or Stepwise, use a different approach such as trees, or use a combination of techniques. Describe the techniques you used. If you manually selected a variable for inclusion into the model or exclusion into the model, indicate why this was done.
The models built were the recommended ones in the instructions. The first model being the complete model and model 2 is the statistically significant model. These models are Poisson, Negative Binomial, Linear regression, and a zero-inflated model.
set.seed(143)
Poiss_M1 = glm(TARGET ~ . , data=train_F, family="poisson")
summary(Poiss_M1)
##
## Call:
## glm(formula = TARGET ~ ., family = "poisson", data = train_F)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.2604 -0.7066 -0.0060 0.4464 3.7918
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 9.899e-01 2.334e-01 4.242 2.21e-05 ***
## FixedAcidity -2.500e-04 9.889e-04 -0.253 0.80038
## VolatileAcidity -3.046e-02 7.819e-03 -3.895 9.81e-05 ***
## CitricAcid 7.666e-03 6.993e-03 1.096 0.27298
## ResidualSugar 5.783e-05 1.813e-04 0.319 0.74973
## Chlorides -3.980e-02 1.941e-02 -2.051 0.04030 *
## FreeSulfurDioxide 1.123e-04 4.128e-05 2.720 0.00653 **
## TotalSulfurDioxide 6.319e-05 2.639e-05 2.394 0.01665 *
## Density -1.772e-01 2.285e-01 -0.775 0.43808
## pH -9.669e-03 9.056e-03 -1.068 0.28567
## Sulphates -1.570e-02 6.577e-03 -2.387 0.01697 *
## Alcohol 3.641e-03 1.635e-03 2.227 0.02595 *
## LabelAppeal 1.566e-01 7.391e-03 21.192 < 2e-16 ***
## AcidIndex -7.992e-02 5.458e-03 -14.643 < 2e-16 ***
## one_Star 7.823e-01 2.353e-02 33.252 < 2e-16 ***
## two_Star 1.110e+00 2.189e-02 50.711 < 2e-16 ***
## three_Star 1.232e+00 2.309e-02 53.351 < 2e-16 ***
## four_Star 1.349e+00 2.919e-02 46.195 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 16163.6 on 8955 degrees of freedom
## Residual deviance: 9630.1 on 8938 degrees of freedom
## AIC: 31901
##
## Number of Fisher Scoring iterations: 6
plot(Poiss_M1)
Poiss_M1
##
## Call: glm(formula = TARGET ~ ., family = "poisson", data = train_F)
##
## Coefficients:
## (Intercept) FixedAcidity VolatileAcidity
## 9.899e-01 -2.500e-04 -3.046e-02
## CitricAcid ResidualSugar Chlorides
## 7.666e-03 5.783e-05 -3.980e-02
## FreeSulfurDioxide TotalSulfurDioxide Density
## 1.123e-04 6.319e-05 -1.772e-01
## pH Sulphates Alcohol
## -9.669e-03 -1.570e-02 3.641e-03
## LabelAppeal AcidIndex one_Star
## 1.566e-01 -7.992e-02 7.823e-01
## two_Star three_Star four_Star
## 1.110e+00 1.232e+00 1.349e+00
##
## Degrees of Freedom: 8955 Total (i.e. Null); 8938 Residual
## Null Deviance: 16160
## Residual Deviance: 9630 AIC: 31900
Poiss_M2_df= train_F %>% select(-FixedAcidity,-ResidualSugar,-Density,-pH,-CitricAcid)
set.seed(143)
Poiss_M2 = glm(TARGET ~ . ,data=Poiss_M2_df, family="poisson")
summary(Poiss_M2)
##
## Call:
## glm(formula = TARGET ~ ., family = "poisson", data = Poiss_M2_df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.2837 -0.7019 -0.0060 0.4473 3.7883
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 7.790e-01 5.091e-02 15.303 < 2e-16 ***
## VolatileAcidity -3.084e-02 7.816e-03 -3.946 7.93e-05 ***
## Chlorides -4.005e-02 1.939e-02 -2.065 0.03889 *
## FreeSulfurDioxide 1.131e-04 4.127e-05 2.740 0.00614 **
## TotalSulfurDioxide 6.267e-05 2.637e-05 2.377 0.01746 *
## Sulphates -1.577e-02 6.573e-03 -2.398 0.01647 *
## Alcohol 3.708e-03 1.634e-03 2.269 0.02328 *
## LabelAppeal 1.567e-01 7.390e-03 21.197 < 2e-16 ***
## AcidIndex -7.946e-02 5.377e-03 -14.777 < 2e-16 ***
## one_Star 7.829e-01 2.352e-02 33.285 < 2e-16 ***
## two_Star 1.111e+00 2.188e-02 50.780 < 2e-16 ***
## three_Star 1.233e+00 2.309e-02 53.403 < 2e-16 ***
## four_Star 1.350e+00 2.919e-02 46.240 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 16163.6 on 8955 degrees of freedom
## Residual deviance: 9633.2 on 8943 degrees of freedom
## AIC: 31894
##
## Number of Fisher Scoring iterations: 6
plot(Poiss_M2)
Poiss_M2
##
## Call: glm(formula = TARGET ~ ., family = "poisson", data = Poiss_M2_df)
##
## Coefficients:
## (Intercept) VolatileAcidity Chlorides
## 7.790e-01 -3.084e-02 -4.005e-02
## FreeSulfurDioxide TotalSulfurDioxide Sulphates
## 1.131e-04 6.267e-05 -1.577e-02
## Alcohol LabelAppeal AcidIndex
## 3.708e-03 1.567e-01 -7.946e-02
## one_Star two_Star three_Star
## 7.829e-01 1.111e+00 1.233e+00
## four_Star
## 1.350e+00
##
## Degrees of Freedom: 8955 Total (i.e. Null); 8943 Residual
## Null Deviance: 16160
## Residual Deviance: 9633 AIC: 31890
rm(Poiss_M2_df)
set.seed(143)
NB_M1 = glm.nb(TARGET ~ . , data=train_F)
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
summary(NB_M1)
##
## Call:
## glm.nb(formula = TARGET ~ ., data = train_F, init.theta = 39358.98214,
## link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.2602 -0.7066 -0.0060 0.4464 3.7917
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 9.900e-01 2.334e-01 4.242 2.21e-05 ***
## FixedAcidity -2.501e-04 9.890e-04 -0.253 0.80037
## VolatileAcidity -3.046e-02 7.819e-03 -3.895 9.81e-05 ***
## CitricAcid 7.666e-03 6.993e-03 1.096 0.27298
## ResidualSugar 5.784e-05 1.813e-04 0.319 0.74970
## Chlorides -3.980e-02 1.941e-02 -2.051 0.04030 *
## FreeSulfurDioxide 1.123e-04 4.129e-05 2.720 0.00653 **
## TotalSulfurDioxide 6.320e-05 2.639e-05 2.394 0.01665 *
## Density -1.772e-01 2.285e-01 -0.775 0.43810
## pH -9.670e-03 9.057e-03 -1.068 0.28564
## Sulphates -1.570e-02 6.577e-03 -2.387 0.01697 *
## Alcohol 3.641e-03 1.635e-03 2.227 0.02596 *
## LabelAppeal 1.566e-01 7.391e-03 21.191 < 2e-16 ***
## AcidIndex -7.992e-02 5.458e-03 -14.643 < 2e-16 ***
## one_Star 7.823e-01 2.353e-02 33.251 < 2e-16 ***
## two_Star 1.110e+00 2.189e-02 50.710 < 2e-16 ***
## three_Star 1.232e+00 2.309e-02 53.349 < 2e-16 ***
## four_Star 1.349e+00 2.920e-02 46.193 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(39358.98) family taken to be 1)
##
## Null deviance: 16162.7 on 8955 degrees of freedom
## Residual deviance: 9629.7 on 8938 degrees of freedom
## AIC: 31903
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 39359
## Std. Err.: 39281
## Warning while fitting theta: iteration limit reached
##
## 2 x log-likelihood: -31865.16
plot(NB_M1)
NB_M1
##
## Call: glm.nb(formula = TARGET ~ ., data = train_F, init.theta = 39358.98214,
## link = log)
##
## Coefficients:
## (Intercept) FixedAcidity VolatileAcidity
## 9.900e-01 -2.501e-04 -3.046e-02
## CitricAcid ResidualSugar Chlorides
## 7.666e-03 5.784e-05 -3.980e-02
## FreeSulfurDioxide TotalSulfurDioxide Density
## 1.123e-04 6.320e-05 -1.772e-01
## pH Sulphates Alcohol
## -9.670e-03 -1.570e-02 3.641e-03
## LabelAppeal AcidIndex one_Star
## 1.566e-01 -7.992e-02 7.823e-01
## two_Star three_Star four_Star
## 1.110e+00 1.232e+00 1.349e+00
##
## Degrees of Freedom: 8955 Total (i.e. Null); 8938 Residual
## Null Deviance: 16160
## Residual Deviance: 9630 AIC: 31900
NB_M2_df = train_F %>% select(-FixedAcidity,-ResidualSugar,-pH,-Density,-CitricAcid)
set.seed(143)
NB_M2 = glm.nb(TARGET ~ . , data=NB_M2_df )
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
summary(NB_M2)
##
## Call:
## glm.nb(formula = TARGET ~ ., data = NB_M2_df, init.theta = 39335.26786,
## link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.2835 -0.7019 -0.0060 0.4472 3.7881
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 7.790e-01 5.091e-02 15.303 < 2e-16 ***
## VolatileAcidity -3.085e-02 7.816e-03 -3.946 7.94e-05 ***
## Chlorides -4.005e-02 1.939e-02 -2.065 0.03889 *
## FreeSulfurDioxide 1.131e-04 4.127e-05 2.740 0.00614 **
## TotalSulfurDioxide 6.267e-05 2.637e-05 2.377 0.01746 *
## Sulphates -1.577e-02 6.574e-03 -2.398 0.01647 *
## Alcohol 3.708e-03 1.634e-03 2.269 0.02329 *
## LabelAppeal 1.567e-01 7.391e-03 21.196 < 2e-16 ***
## AcidIndex -7.947e-02 5.378e-03 -14.777 < 2e-16 ***
## one_Star 7.829e-01 2.352e-02 33.284 < 2e-16 ***
## two_Star 1.111e+00 2.188e-02 50.778 < 2e-16 ***
## three_Star 1.233e+00 2.309e-02 53.401 < 2e-16 ***
## four_Star 1.350e+00 2.919e-02 46.238 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(39335.27) family taken to be 1)
##
## Null deviance: 16162.7 on 8955 degrees of freedom
## Residual deviance: 9632.8 on 8943 degrees of freedom
## AIC: 31896
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 39335
## Std. Err.: 39241
## Warning while fitting theta: iteration limit reached
##
## 2 x log-likelihood: -31868.27
plot(NB_M2)
NB_M2
##
## Call: glm.nb(formula = TARGET ~ ., data = NB_M2_df, init.theta = 39335.26786,
## link = log)
##
## Coefficients:
## (Intercept) VolatileAcidity Chlorides
## 7.790e-01 -3.085e-02 -4.005e-02
## FreeSulfurDioxide TotalSulfurDioxide Sulphates
## 1.131e-04 6.267e-05 -1.577e-02
## Alcohol LabelAppeal AcidIndex
## 3.708e-03 1.567e-01 -7.947e-02
## one_Star two_Star three_Star
## 7.829e-01 1.111e+00 1.233e+00
## four_Star
## 1.350e+00
##
## Degrees of Freedom: 8955 Total (i.e. Null); 8943 Residual
## Null Deviance: 16160
## Residual Deviance: 9633 AIC: 31900
rm(NB_M2_df)
library("caret")
control = trainControl(method = "repeatedcv",
number = 10,
repeats = 6)
set.seed(143)
Linear_M1 = train(TARGET ~ . , data = train_F, method='lm',
trControl = control, tuneLength = 10)
summary(Linear_M1)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.8107 -0.8679 0.0239 0.8297 6.1792
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.248e+00 5.269e-01 6.165 7.33e-10 ***
## FixedAcidity -1.543e-04 2.237e-03 -0.069 0.945019
## VolatileAcidity -9.767e-02 1.772e-02 -5.514 3.61e-08 ***
## CitricAcid 2.748e-02 1.596e-02 1.722 0.085093 .
## ResidualSugar 2.244e-04 4.092e-04 0.548 0.583536
## Chlorides -1.292e-01 4.388e-02 -2.944 0.003243 **
## FreeSulfurDioxide 3.260e-04 9.350e-05 3.487 0.000492 ***
## TotalSulfurDioxide 1.783e-04 5.939e-05 3.002 0.002690 **
## Density -4.116e-01 5.192e-01 -0.793 0.427981
## pH -2.598e-02 2.040e-02 -1.273 0.202975
## Sulphates -4.155e-02 1.492e-02 -2.785 0.005358 **
## Alcohol 1.236e-02 3.694e-03 3.347 0.000820 ***
## LabelAppeal 4.566e-01 1.637e-02 27.883 < 2e-16 ***
## AcidIndex -1.981e-01 1.080e-02 -18.337 < 2e-16 ***
## one_Star 1.372e+00 3.941e-02 34.829 < 2e-16 ***
## two_Star 2.420e+00 3.811e-02 63.496 < 2e-16 ***
## three_Star 2.996e+00 4.440e-02 67.493 < 2e-16 ***
## four_Star 3.688e+00 7.088e-02 52.034 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.307 on 8938 degrees of freedom
## Multiple R-squared: 0.5419, Adjusted R-squared: 0.5411
## F-statistic: 622.1 on 17 and 8938 DF, p-value: < 2.2e-16
Linear_M1
## Linear Regression
##
## 8956 samples
## 17 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 6 times)
## Summary of sample sizes: 8059, 8061, 8061, 8060, 8060, 8061, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 1.307882 0.5405671 1.025453
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
Linear_M2_df = train_F %>% select(-ResidualSugar,-FixedAcidity,-Density,-pH,-CitricAcid)
set.seed(143)
Linear_M2 = train(TARGET ~ . , data = Linear_M2_df, method='lm',
trControl = control, tuneLength = 10)
summary(Linear_M2)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.8453 -0.8703 0.0258 0.8256 6.1704
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.748e+00 1.000e-01 27.482 < 2e-16 ***
## VolatileAcidity -9.868e-02 1.771e-02 -5.572 2.59e-08 ***
## Chlorides -1.299e-01 4.386e-02 -2.962 0.003064 **
## FreeSulfurDioxide 3.290e-04 9.347e-05 3.520 0.000434 ***
## TotalSulfurDioxide 1.787e-04 5.936e-05 3.010 0.002617 **
## Sulphates -4.170e-02 1.491e-02 -2.797 0.005170 **
## Alcohol 1.250e-02 3.692e-03 3.387 0.000711 ***
## LabelAppeal 4.564e-01 1.637e-02 27.873 < 2e-16 ***
## AcidIndex -1.964e-01 1.059e-02 -18.547 < 2e-16 ***
## one_Star 1.373e+00 3.939e-02 34.868 < 2e-16 ***
## two_Star 2.422e+00 3.809e-02 63.601 < 2e-16 ***
## three_Star 2.999e+00 4.438e-02 67.569 < 2e-16 ***
## four_Star 3.691e+00 7.087e-02 52.082 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.307 on 8943 degrees of freedom
## Multiple R-squared: 0.5417, Adjusted R-squared: 0.541
## F-statistic: 880.7 on 12 and 8943 DF, p-value: < 2.2e-16
Linear_M2
## Linear Regression
##
## 8956 samples
## 12 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 6 times)
## Summary of sample sizes: 8059, 8061, 8061, 8060, 8060, 8061, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 1.307476 0.5408498 1.024746
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
rm(Linear_M2_df,control)
require("pscl")
set.seed(143)
ZeroInf_M1 = zeroinfl(TARGET ~ . , data = train_F, dist = "negbin")
summary(ZeroInf_M1)
##
## Call:
## zeroinfl(formula = TARGET ~ ., data = train_F, dist = "negbin")
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## -2.3195 -0.4238 -0.0087 0.3782 4.8048
##
## Count model coefficients (negbin with log link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.457e+00 2.400e-01 6.069 1.28e-09 ***
## FixedAcidity 3.633e-04 1.015e-03 0.358 0.72030
## VolatileAcidity -1.235e-02 8.061e-03 -1.532 0.12561
## CitricAcid 8.602e-04 7.120e-03 0.121 0.90384
## ResidualSugar -1.359e-04 1.854e-04 -0.733 0.46376
## Chlorides -2.463e-02 1.985e-02 -1.241 0.21476
## FreeSulfurDioxide 2.682e-05 4.163e-05 0.644 0.51932
## TotalSulfurDioxide -6.251e-06 2.630e-05 -0.238 0.81215
## Density -2.745e-01 2.350e-01 -1.168 0.24281
## pH 8.074e-03 9.305e-03 0.868 0.38558
## Sulphates -4.347e-03 6.774e-03 -0.642 0.52103
## Alcohol 6.438e-03 1.664e-03 3.868 0.00011 ***
## LabelAppeal 2.311e-01 7.613e-03 30.351 < 2e-16 ***
## AcidIndex -1.889e-02 5.838e-03 -3.236 0.00121 **
## one_Star 5.263e-02 2.543e-02 2.070 0.03845 *
## two_Star 1.872e-01 2.367e-02 7.908 2.62e-15 ***
## three_Star 2.840e-01 2.486e-02 11.421 < 2e-16 ***
## four_Star 3.752e-01 3.077e-02 12.192 < 2e-16 ***
## Log(theta) 1.745e+01 2.220e+00 7.864 3.72e-15 ***
##
## Zero-inflation model coefficients (binomial with logit link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.365e+00 1.573e+00 -2.139 0.032453 *
## FixedAcidity 3.057e-03 6.593e-03 0.464 0.642936
## VolatileAcidity 2.008e-01 5.243e-02 3.829 0.000129 ***
## CitricAcid -4.644e-02 4.717e-02 -0.985 0.324857
## ResidualSugar -1.757e-03 1.189e-03 -1.478 0.139494
## Chlorides 8.358e-02 1.289e-01 0.649 0.516631
## FreeSulfurDioxide -8.514e-04 2.774e-04 -3.069 0.002145 **
## TotalSulfurDioxide -6.003e-04 1.743e-04 -3.443 0.000574 ***
## Density -4.642e-01 1.552e+00 -0.299 0.764876
## pH 1.814e-01 5.936e-02 3.056 0.002246 **
## Sulphates 1.065e-01 4.466e-02 2.384 0.017104 *
## Alcohol 2.168e-02 1.080e-02 2.006 0.044805 *
## LabelAppeal 7.377e-01 5.104e-02 14.453 < 2e-16 ***
## AcidIndex 4.342e-01 3.137e-02 13.840 < 2e-16 ***
## one_Star -2.115e+00 9.160e-02 -23.089 < 2e-16 ***
## two_Star -5.717e+00 3.775e-01 -15.147 < 2e-16 ***
## three_Star -2.030e+01 3.984e+02 -0.051 0.959357
## four_Star -2.041e+01 7.475e+02 -0.027 0.978219
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Theta = 38043630.1953
## Number of iterations in BFGS optimization: 71
## Log-likelihood: -1.421e+04 on 37 Df
ZeroInf_M1
##
## Call:
## zeroinfl(formula = TARGET ~ ., data = train_F, dist = "negbin")
##
## Count model coefficients (negbin with log link):
## (Intercept) FixedAcidity VolatileAcidity
## 1.457e+00 3.633e-04 -1.235e-02
## CitricAcid ResidualSugar Chlorides
## 8.602e-04 -1.359e-04 -2.463e-02
## FreeSulfurDioxide TotalSulfurDioxide Density
## 2.682e-05 -6.251e-06 -2.745e-01
## pH Sulphates Alcohol
## 8.074e-03 -4.347e-03 6.438e-03
## LabelAppeal AcidIndex one_Star
## 2.311e-01 -1.889e-02 5.263e-02
## two_Star three_Star four_Star
## 1.872e-01 2.840e-01 3.752e-01
## Theta = 38043630.1953
##
## Zero-inflation model coefficients (binomial with logit link):
## (Intercept) FixedAcidity VolatileAcidity
## -3.365e+00 3.057e-03 2.008e-01
## CitricAcid ResidualSugar Chlorides
## -4.644e-02 -1.757e-03 8.358e-02
## FreeSulfurDioxide TotalSulfurDioxide Density
## -8.514e-04 -6.003e-04 -4.642e-01
## pH Sulphates Alcohol
## 1.814e-01 1.065e-01 2.168e-02
## LabelAppeal AcidIndex one_Star
## 7.377e-01 4.342e-01 -2.115e+00
## two_Star three_Star four_Star
## -5.717e+00 -2.030e+01 -2.041e+01
#https://stats.stackexchange.com/questions/45262/zero-inflated-count-models-in-r-what-is-the-real-advantage
require("pscl")
ZeroInf_df = train_F %>% select(-CitricAcid,-TotalSulfurDioxide,-FixedAcidity,-FreeSulfurDioxide,-Sulphates,-ResidualSugar,-pH,-Density,-Chlorides,-VolatileAcidity)
set.seed(143)
ZeroInf_M2 = zeroinfl(TARGET ~ . , data = ZeroInf_df, dist = "negbin")
summary(ZeroInf_M2)
##
## Call:
## zeroinfl(formula = TARGET ~ ., data = ZeroInf_df, dist = "negbin")
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## -2.338945 -0.426800 -0.003664 0.386177 4.945980
##
## Count model coefficients (negbin with log link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.212980 0.052371 23.161 < 2e-16 ***
## Alcohol 0.006411 0.001664 3.852 0.000117 ***
## LabelAppeal 0.231359 0.007609 30.407 < 2e-16 ***
## AcidIndex -0.019958 0.005753 -3.469 0.000522 ***
## one_Star 0.052655 0.025425 2.071 0.038360 *
## two_Star 0.186832 0.023646 7.901 2.76e-15 ***
## three_Star 0.284962 0.024838 11.473 < 2e-16 ***
## four_Star 0.376006 0.030742 12.231 < 2e-16 ***
## Log(theta) 10.750099 2.138335 5.027 4.97e-07 ***
##
## Zero-inflation model coefficients (binomial with logit link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.20780 0.26947 -11.904 <2e-16 ***
## Alcohol 0.02149 0.01065 2.018 0.0436 *
## LabelAppeal 0.73595 0.05033 14.622 <2e-16 ***
## AcidIndex 0.43719 0.03036 14.402 <2e-16 ***
## one_Star -2.12462 0.09080 -23.399 <2e-16 ***
## two_Star -5.82723 0.40293 -14.462 <2e-16 ***
## three_Star -20.31891 403.43223 -0.050 0.9598
## four_Star -20.42411 753.06099 -0.027 0.9784
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Theta = 46634.6327
## Number of iterations in BFGS optimization: 38
## Log-likelihood: -1.425e+04 on 17 Df
#https://stats.stackexchange.com/questions/45262/zero-inflated-count-models-in-r-what-is-the-real-advantage
rm(ZeroInf_df)
The model with the lowest RMSE is the ZeroInf_M1 model. This model contains all the variables. I find this behavior strange considering that full model with insignificant statistically variable tends to perform erratically. I will proceed to predict the test set.
test_F_train$Poiss_M1= predict(Poiss_M1,test_F_train,type = "response")
test_F_train$Poiss_M2= predict(Poiss_M2,test_F_train,type = "response")
test_F_train$NB_M1= predict(NB_M1,test_F_train,type = "response")
test_F_train$NB_M2= predict(NB_M2,test_F_train,type = "response")
test_F_train$Linear_M1= predict(Linear_M1,test_F_train)
test_F_train$Linear_M2= predict(Linear_M2,test_F_train)
test_F_train$ZeroInf_M1= predict(ZeroInf_M1,test_F_train,type = "response")
test_F_train$ZeroInf_M2= predict(ZeroInf_M2,test_F_train,type = "response")
Final_test_df = data.frame(
Models =c("Poiss_M1","Poiss_M2","NB_M1","NB_M2","Linear_M1","Linear_M2","ZeroInf_M1","ZeroInf_M2"),
RMSE=c( sqrt(mean((test_F_train$TARGET - test_F_train$Poiss_M1)^2)),
sqrt(mean((test_F_train$TARGET - test_F_train$Poiss_M2)^2)),
sqrt(mean((test_F_train$TARGET - test_F_train$NB_M1)^2)),
sqrt(mean((test_F_train$TARGET - test_F_train$NB_M2)^2)),
sqrt(mean((test_F_train$TARGET - test_F_train$Linear_M1)^2)),
sqrt(mean((test_F_train$TARGET - test_F_train$Linear_M2)^2)),
sqrt(mean((test_F_train$TARGET - test_F_train$ZeroInf_M1)^2)),
sqrt(mean((test_F_train$TARGET - test_F_train$ZeroInf_M2)^2))
))
Final_test_df
imp_Eval_test$TARGET= predict(ZeroInf_M1,imp_Eval_test,type = "response")
head(imp_Eval_test$TARGET,5)
## [1] 1.982232 3.865380 2.309302 2.382840 0.743786
write.csv(imp_Eval_test,"cestevez_HW5_Predictions.csv",row.names = F)