Overview

In this homework assignment, you will explore, analyze and model a data set containing information on approximately 12,000 commercially available wines. The variables are mostly related to the chemical properties of the wine being sold. The response variable is the number of sample cases of wine that were purchased by wine distribution companies after sampling a wine. These cases would be used to provide tasting samples to restaurants and wine stores around the United States. The more sample cases purchased, the more likely is a wine to be sold at a high end restaurant. A large wine manufacturer is studying the data in order to predict the number of wine cases ordered based upon the wine characteristics. If the wine manufacturer can predict the number of cases, then that manufacturer will be able to adjust their wine offering to maximize sales.

Your objective is to build a count regression model to predict the number of cases of wine that will be sold given certain properties of the wine. HINT: Sometimes, the fact that a variable is missing is actually predictive of the target. You can only use the variables given to you (or variables that you derive from the variables provided). Below is a short description of the variables of interest in the data set:

Deliverables:

A write-up submitted in PDF format. Your write-up should have four sections. Each one is described below. You may assume you are addressing me as a fellow data scientist, so do not need to shy away from technical details.
Assigned predictions (number of cases of wine sold) for the evaluation data set.
Include your R statistical programming code in an Appendix.

I will start by loading the data from the repository and loading many of the R packages that I will be using.

loc_train = "https://raw.githubusercontent.com/chrisestevez/DataAnalytics/master/Data/HW5621/wine-training-data.csv"

loc_test = "https://raw.githubusercontent.com/chrisestevez/DataAnalytics/master/Data/HW5621/wine-evaluation-data.csv"

train_df = read.csv(loc_train, stringsAsFactors = FALSE)
test_df = read.csv(loc_test, stringsAsFactors = FALSE)


#library("psych")
library("caret")
library("MASS")
library("kableExtra")
library('DataExplorer')
library('mice')
library("Hmisc")
library("dplyr")
library("tidyr")
library("magrittr")
library("pscl")

DATA EXPLORATION

Exploring the data reveals many Nas that will either need to be imputed with zero or replaced.

str(train_df)

## 'data.frame':    12795 obs. of  16 variables:
##  $ ï..INDEX          : int  1 2 4 5 6 7 8 11 12 13 ...
##  $ TARGET            : int  3 3 5 3 4 0 0 4 3 6 ...
##  $ FixedAcidity      : num  3.2 4.5 7.1 5.7 8 11.3 7.7 6.5 14.8 5.5 ...
##  $ VolatileAcidity   : num  1.16 0.16 2.64 0.385 0.33 0.32 0.29 -1.22 0.27 -0.22 ...
##  $ CitricAcid        : num  -0.98 -0.81 -0.88 0.04 -1.26 0.59 -0.4 0.34 1.05 0.39 ...
##  $ ResidualSugar     : num  54.2 26.1 14.8 18.8 9.4 ...
##  $ Chlorides         : num  -0.567 -0.425 0.037 -0.425 NA 0.556 0.06 0.04 -0.007 -0.277 ...
##  $ FreeSulfurDioxide : num  NA 15 214 22 -167 -37 287 523 -213 62 ...
##  $ TotalSulfurDioxide: num  268 -327 142 115 108 15 156 551 NA 180 ...
##  $ Density           : num  0.993 1.028 0.995 0.996 0.995 ...
##  $ pH                : num  3.33 3.38 3.12 2.24 3.12 3.2 3.49 3.2 4.93 3.09 ...
##  $ Sulphates         : num  -0.59 0.7 0.48 1.83 1.77 1.29 1.21 NA 0.26 0.75 ...
##  $ Alcohol           : num  9.9 NA 22 6.2 13.7 15.4 10.3 11.6 15 12.6 ...
##  $ LabelAppeal       : int  0 -1 -1 -1 0 0 0 1 0 0 ...
##  $ AcidIndex         : int  8 7 8 6 9 11 8 7 6 8 ...
##  $ STARS             : int  2 3 3 1 2 NA NA 3 NA 4 ...

summary(train_df)

##     ï..INDEX         TARGET       FixedAcidity     VolatileAcidity  
##  Min.   :    1   Min.   :0.000   Min.   :-18.100   Min.   :-2.7900  
##  1st Qu.: 4038   1st Qu.:2.000   1st Qu.:  5.200   1st Qu.: 0.1300  
##  Median : 8110   Median :3.000   Median :  6.900   Median : 0.2800  
##  Mean   : 8070   Mean   :3.029   Mean   :  7.076   Mean   : 0.3241  
##  3rd Qu.:12106   3rd Qu.:4.000   3rd Qu.:  9.500   3rd Qu.: 0.6400  
##  Max.   :16129   Max.   :8.000   Max.   : 34.400   Max.   : 3.6800  
##                                                                     
##    CitricAcid      ResidualSugar        Chlorides       FreeSulfurDioxide
##  Min.   :-3.2400   Min.   :-127.800   Min.   :-1.1710   Min.   :-555.00  
##  1st Qu.: 0.0300   1st Qu.:  -2.000   1st Qu.:-0.0310   1st Qu.:   0.00  
##  Median : 0.3100   Median :   3.900   Median : 0.0460   Median :  30.00  
##  Mean   : 0.3084   Mean   :   5.419   Mean   : 0.0548   Mean   :  30.85  
##  3rd Qu.: 0.5800   3rd Qu.:  15.900   3rd Qu.: 0.1530   3rd Qu.:  70.00  
##  Max.   : 3.8600   Max.   : 141.150   Max.   : 1.3510   Max.   : 623.00  
##                    NA's   :616        NA's   :638       NA's   :647      
##  TotalSulfurDioxide    Density             pH          Sulphates      
##  Min.   :-823.0     Min.   :0.8881   Min.   :0.480   Min.   :-3.1300  
##  1st Qu.:  27.0     1st Qu.:0.9877   1st Qu.:2.960   1st Qu.: 0.2800  
##  Median : 123.0     Median :0.9945   Median :3.200   Median : 0.5000  
##  Mean   : 120.7     Mean   :0.9942   Mean   :3.208   Mean   : 0.5271  
##  3rd Qu.: 208.0     3rd Qu.:1.0005   3rd Qu.:3.470   3rd Qu.: 0.8600  
##  Max.   :1057.0     Max.   :1.0992   Max.   :6.130   Max.   : 4.2400  
##  NA's   :682                         NA's   :395     NA's   :1210     
##     Alcohol       LabelAppeal          AcidIndex          STARS      
##  Min.   :-4.70   Min.   :-2.000000   Min.   : 4.000   Min.   :1.000  
##  1st Qu.: 9.00   1st Qu.:-1.000000   1st Qu.: 7.000   1st Qu.:1.000  
##  Median :10.40   Median : 0.000000   Median : 8.000   Median :2.000  
##  Mean   :10.49   Mean   :-0.009066   Mean   : 7.773   Mean   :2.042  
##  3rd Qu.:12.40   3rd Qu.: 1.000000   3rd Qu.: 8.000   3rd Qu.:3.000  
##  Max.   :26.50   Max.   : 2.000000   Max.   :17.000   Max.   :4.000  
##  NA's   :653                                          NA's   :3359

The correlation matrix reveals some positive correlation between STARS and TARGET variables.

library("corrplot")
cor_mx = cor(dplyr::select_if(train_df, is.numeric) ,use="pairwise.complete.obs", method = "pearson")
corrplot(cor_mx, method = "color", 
         type = "upper", order = "original", number.cex = .7,
         addCoef.col = "black", # Add coefficient of correlation
         tl.col = "black", tl.srt = 90, # Text label color and rotation
                  # hide correlation coefficient on the principal diagonal
         diag = TRUE)

A histogram of the variable reveals some skewness in certain variables. The variables that should be transformed are AcidIndex and STARS.

hist.data.frame(train_df,n.unique=1, mtitl = "Wine")

DATA PREPARATION

To prepare the data for modeling, I will first start by replacing Na star ratings with zeros. I considered the fact that these wines might not have been rated.

The next step in the data transformation process is to convert the star rating into a dummy variable. I will use the general rule of leaving one out K-1. In this case, the value left out are wines with no ratings or zero.

Variable that will be removed from the data set are Index and STARS since these have become dummy variables.

The third step in data preparation is replacing remaining Nas. The replacement will be achieved using the mice package. After the process is finished, a plot indicating the distribution will be plotted for the test and training set.

PlotMissing(train_df)

train_df$STARS[is.na(train_df$STARS)] = 0
test_df$STARS[is.na(test_df$STARS)] = 0

train_df$one_Star =ifelse(train_df$STARS==1,1,0)
train_df$two_Star=ifelse(train_df$STARS==2,1,0)
train_df$three_Star=ifelse(train_df$STARS==3,1,0)
train_df$four_Star=ifelse(train_df$STARS==4,1,0)

test_df$one_Star =ifelse(test_df$STARS==1,1,0)
test_df$two_Star=ifelse(test_df$STARS==2,1,0)
test_df$three_Star=ifelse(test_df$STARS==3,1,0)
test_df$four_Star=ifelse(test_df$STARS==4,1,0)


#Removed first column
train_df = train_df %>%  select(-1,-STARS)
test_df =test_df %>%  select(-1,-STARS)


mice_imputes = mice(train_df, m = 2, maxit = 2, print = FALSE,seed = 143)
densityplot(mice_imputes)

imputed_train = mice::complete(mice_imputes)

mice_imputes_test = mice(test_df, m = 2, maxit = 2, print = FALSE,seed = 143)
#density for final test data set
densityplot(mice_imputes_test)

imp_Eval_test=mice::complete(mice_imputes_test)

#data cleaning
set.seed(143) 
sample = sample.int(n = nrow(imputed_train), size = floor(.70*nrow(imputed_train)), replace = F)
train_F = imputed_train[sample, ]
test_F_train  = imputed_train[-sample,] 

rm(mice_imputes, mice_imputes_test,imputed_train,test_df,train_df,cor_mx,loc_test,loc_train,sample)

BUILD MODELS

Using the training data set, build at least two different poisson regression models, at least two different negative binomial regression models, and at least two multiple linear regression models, using different variables (or the same variables with different transformations). Sometimes poisson and negative binomial regression models give the same results. If that is the case, comment on that. Consider changing the input variables if that occurs so that you get different models. Although not covered in class, you may also want to consider building zero-inflated poisson and negative binomial regression models. You may select the variables manually, use an approach such as Forward or Stepwise, use a different approach such as trees, or use a combination of techniques. Describe the techniques you used. If you manually selected a variable for inclusion into the model or exclusion into the model, indicate why this was done.

The models built were the recommended ones in the instructions. The first model being the complete model and model 2 is the statistically significant model. These models are Poisson, Negative Binomial, Linear regression, and a zero-inflated model.

Poisson

Model 1

set.seed(143)
Poiss_M1 = glm(TARGET ~ . , data=train_F, family="poisson")

summary(Poiss_M1)

## 
## Call:
## glm(formula = TARGET ~ ., family = "poisson", data = train_F)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2604  -0.7066  -0.0060   0.4464   3.7918  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         9.899e-01  2.334e-01   4.242 2.21e-05 ***
## FixedAcidity       -2.500e-04  9.889e-04  -0.253  0.80038    
## VolatileAcidity    -3.046e-02  7.819e-03  -3.895 9.81e-05 ***
## CitricAcid          7.666e-03  6.993e-03   1.096  0.27298    
## ResidualSugar       5.783e-05  1.813e-04   0.319  0.74973    
## Chlorides          -3.980e-02  1.941e-02  -2.051  0.04030 *  
## FreeSulfurDioxide   1.123e-04  4.128e-05   2.720  0.00653 ** 
## TotalSulfurDioxide  6.319e-05  2.639e-05   2.394  0.01665 *  
## Density            -1.772e-01  2.285e-01  -0.775  0.43808    
## pH                 -9.669e-03  9.056e-03  -1.068  0.28567    
## Sulphates          -1.570e-02  6.577e-03  -2.387  0.01697 *  
## Alcohol             3.641e-03  1.635e-03   2.227  0.02595 *  
## LabelAppeal         1.566e-01  7.391e-03  21.192  < 2e-16 ***
## AcidIndex          -7.992e-02  5.458e-03 -14.643  < 2e-16 ***
## one_Star            7.823e-01  2.353e-02  33.252  < 2e-16 ***
## two_Star            1.110e+00  2.189e-02  50.711  < 2e-16 ***
## three_Star          1.232e+00  2.309e-02  53.351  < 2e-16 ***
## four_Star           1.349e+00  2.919e-02  46.195  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 16163.6  on 8955  degrees of freedom
## Residual deviance:  9630.1  on 8938  degrees of freedom
## AIC: 31901
## 
## Number of Fisher Scoring iterations: 6

plot(Poiss_M1)

Poiss_M1

## 
## Call:  glm(formula = TARGET ~ ., family = "poisson", data = train_F)
## 
## Coefficients:
##        (Intercept)        FixedAcidity     VolatileAcidity  
##          9.899e-01          -2.500e-04          -3.046e-02  
##         CitricAcid       ResidualSugar           Chlorides  
##          7.666e-03           5.783e-05          -3.980e-02  
##  FreeSulfurDioxide  TotalSulfurDioxide             Density  
##          1.123e-04           6.319e-05          -1.772e-01  
##                 pH           Sulphates             Alcohol  
##         -9.669e-03          -1.570e-02           3.641e-03  
##        LabelAppeal           AcidIndex            one_Star  
##          1.566e-01          -7.992e-02           7.823e-01  
##           two_Star          three_Star           four_Star  
##          1.110e+00           1.232e+00           1.349e+00  
## 
## Degrees of Freedom: 8955 Total (i.e. Null);  8938 Residual
## Null Deviance:       16160 
## Residual Deviance: 9630  AIC: 31900

Model 2

Poiss_M2_df= train_F %>% select(-FixedAcidity,-ResidualSugar,-Density,-pH,-CitricAcid) 
set.seed(143)
Poiss_M2 = glm(TARGET ~ . ,data=Poiss_M2_df, family="poisson")

summary(Poiss_M2)

## 
## Call:
## glm(formula = TARGET ~ ., family = "poisson", data = Poiss_M2_df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2837  -0.7019  -0.0060   0.4473   3.7883  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         7.790e-01  5.091e-02  15.303  < 2e-16 ***
## VolatileAcidity    -3.084e-02  7.816e-03  -3.946 7.93e-05 ***
## Chlorides          -4.005e-02  1.939e-02  -2.065  0.03889 *  
## FreeSulfurDioxide   1.131e-04  4.127e-05   2.740  0.00614 ** 
## TotalSulfurDioxide  6.267e-05  2.637e-05   2.377  0.01746 *  
## Sulphates          -1.577e-02  6.573e-03  -2.398  0.01647 *  
## Alcohol             3.708e-03  1.634e-03   2.269  0.02328 *  
## LabelAppeal         1.567e-01  7.390e-03  21.197  < 2e-16 ***
## AcidIndex          -7.946e-02  5.377e-03 -14.777  < 2e-16 ***
## one_Star            7.829e-01  2.352e-02  33.285  < 2e-16 ***
## two_Star            1.111e+00  2.188e-02  50.780  < 2e-16 ***
## three_Star          1.233e+00  2.309e-02  53.403  < 2e-16 ***
## four_Star           1.350e+00  2.919e-02  46.240  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 16163.6  on 8955  degrees of freedom
## Residual deviance:  9633.2  on 8943  degrees of freedom
## AIC: 31894
## 
## Number of Fisher Scoring iterations: 6

plot(Poiss_M2)

Poiss_M2

## 
## Call:  glm(formula = TARGET ~ ., family = "poisson", data = Poiss_M2_df)
## 
## Coefficients:
##        (Intercept)     VolatileAcidity           Chlorides  
##          7.790e-01          -3.084e-02          -4.005e-02  
##  FreeSulfurDioxide  TotalSulfurDioxide           Sulphates  
##          1.131e-04           6.267e-05          -1.577e-02  
##            Alcohol         LabelAppeal           AcidIndex  
##          3.708e-03           1.567e-01          -7.946e-02  
##           one_Star            two_Star          three_Star  
##          7.829e-01           1.111e+00           1.233e+00  
##          four_Star  
##          1.350e+00  
## 
## Degrees of Freedom: 8955 Total (i.e. Null);  8943 Residual
## Null Deviance:       16160 
## Residual Deviance: 9633  AIC: 31890

rm(Poiss_M2_df)

Negative Binomial

Source

Model 1

set.seed(143)
NB_M1 = glm.nb(TARGET ~ . , data=train_F)

## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached

## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached

summary(NB_M1)

## 
## Call:
## glm.nb(formula = TARGET ~ ., data = train_F, init.theta = 39358.98214, 
##     link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2602  -0.7066  -0.0060   0.4464   3.7917  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         9.900e-01  2.334e-01   4.242 2.21e-05 ***
## FixedAcidity       -2.501e-04  9.890e-04  -0.253  0.80037    
## VolatileAcidity    -3.046e-02  7.819e-03  -3.895 9.81e-05 ***
## CitricAcid          7.666e-03  6.993e-03   1.096  0.27298    
## ResidualSugar       5.784e-05  1.813e-04   0.319  0.74970    
## Chlorides          -3.980e-02  1.941e-02  -2.051  0.04030 *  
## FreeSulfurDioxide   1.123e-04  4.129e-05   2.720  0.00653 ** 
## TotalSulfurDioxide  6.320e-05  2.639e-05   2.394  0.01665 *  
## Density            -1.772e-01  2.285e-01  -0.775  0.43810    
## pH                 -9.670e-03  9.057e-03  -1.068  0.28564    
## Sulphates          -1.570e-02  6.577e-03  -2.387  0.01697 *  
## Alcohol             3.641e-03  1.635e-03   2.227  0.02596 *  
## LabelAppeal         1.566e-01  7.391e-03  21.191  < 2e-16 ***
## AcidIndex          -7.992e-02  5.458e-03 -14.643  < 2e-16 ***
## one_Star            7.823e-01  2.353e-02  33.251  < 2e-16 ***
## two_Star            1.110e+00  2.189e-02  50.710  < 2e-16 ***
## three_Star          1.232e+00  2.309e-02  53.349  < 2e-16 ***
## four_Star           1.349e+00  2.920e-02  46.193  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(39358.98) family taken to be 1)
## 
##     Null deviance: 16162.7  on 8955  degrees of freedom
## Residual deviance:  9629.7  on 8938  degrees of freedom
## AIC: 31903
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  39359 
##           Std. Err.:  39281 
## Warning while fitting theta: iteration limit reached 
## 
##  2 x log-likelihood:  -31865.16

plot(NB_M1)

NB_M1

## 
## Call:  glm.nb(formula = TARGET ~ ., data = train_F, init.theta = 39358.98214, 
##     link = log)
## 
## Coefficients:
##        (Intercept)        FixedAcidity     VolatileAcidity  
##          9.900e-01          -2.501e-04          -3.046e-02  
##         CitricAcid       ResidualSugar           Chlorides  
##          7.666e-03           5.784e-05          -3.980e-02  
##  FreeSulfurDioxide  TotalSulfurDioxide             Density  
##          1.123e-04           6.320e-05          -1.772e-01  
##                 pH           Sulphates             Alcohol  
##         -9.670e-03          -1.570e-02           3.641e-03  
##        LabelAppeal           AcidIndex            one_Star  
##          1.566e-01          -7.992e-02           7.823e-01  
##           two_Star          three_Star           four_Star  
##          1.110e+00           1.232e+00           1.349e+00  
## 
## Degrees of Freedom: 8955 Total (i.e. Null);  8938 Residual
## Null Deviance:       16160 
## Residual Deviance: 9630  AIC: 31900

Model 2

NB_M2_df = train_F %>% select(-FixedAcidity,-ResidualSugar,-pH,-Density,-CitricAcid)

set.seed(143)
NB_M2 = glm.nb(TARGET ~ . , data=NB_M2_df )

## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached

## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached

summary(NB_M2)

## 
## Call:
## glm.nb(formula = TARGET ~ ., data = NB_M2_df, init.theta = 39335.26786, 
##     link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2835  -0.7019  -0.0060   0.4472   3.7881  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         7.790e-01  5.091e-02  15.303  < 2e-16 ***
## VolatileAcidity    -3.085e-02  7.816e-03  -3.946 7.94e-05 ***
## Chlorides          -4.005e-02  1.939e-02  -2.065  0.03889 *  
## FreeSulfurDioxide   1.131e-04  4.127e-05   2.740  0.00614 ** 
## TotalSulfurDioxide  6.267e-05  2.637e-05   2.377  0.01746 *  
## Sulphates          -1.577e-02  6.574e-03  -2.398  0.01647 *  
## Alcohol             3.708e-03  1.634e-03   2.269  0.02329 *  
## LabelAppeal         1.567e-01  7.391e-03  21.196  < 2e-16 ***
## AcidIndex          -7.947e-02  5.378e-03 -14.777  < 2e-16 ***
## one_Star            7.829e-01  2.352e-02  33.284  < 2e-16 ***
## two_Star            1.111e+00  2.188e-02  50.778  < 2e-16 ***
## three_Star          1.233e+00  2.309e-02  53.401  < 2e-16 ***
## four_Star           1.350e+00  2.919e-02  46.238  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(39335.27) family taken to be 1)
## 
##     Null deviance: 16162.7  on 8955  degrees of freedom
## Residual deviance:  9632.8  on 8943  degrees of freedom
## AIC: 31896
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  39335 
##           Std. Err.:  39241 
## Warning while fitting theta: iteration limit reached 
## 
##  2 x log-likelihood:  -31868.27

plot(NB_M2)

NB_M2

## 
## Call:  glm.nb(formula = TARGET ~ ., data = NB_M2_df, init.theta = 39335.26786, 
##     link = log)
## 
## Coefficients:
##        (Intercept)     VolatileAcidity           Chlorides  
##          7.790e-01          -3.085e-02          -4.005e-02  
##  FreeSulfurDioxide  TotalSulfurDioxide           Sulphates  
##          1.131e-04           6.267e-05          -1.577e-02  
##            Alcohol         LabelAppeal           AcidIndex  
##          3.708e-03           1.567e-01          -7.947e-02  
##           one_Star            two_Star          three_Star  
##          7.829e-01           1.111e+00           1.233e+00  
##          four_Star  
##          1.350e+00  
## 
## Degrees of Freedom: 8955 Total (i.e. Null);  8943 Residual
## Null Deviance:       16160 
## Residual Deviance: 9633  AIC: 31900

rm(NB_M2_df)

multiple linear regression

Model 1

library("caret")
control = trainControl(method = "repeatedcv", 
                        number = 10, 
                        repeats = 6)
set.seed(143)
Linear_M1 = train(TARGET ~ . , data = train_F, method='lm',
                  trControl = control, tuneLength = 10)

summary(Linear_M1)

## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.8107 -0.8679  0.0239  0.8297  6.1792 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         3.248e+00  5.269e-01   6.165 7.33e-10 ***
## FixedAcidity       -1.543e-04  2.237e-03  -0.069 0.945019    
## VolatileAcidity    -9.767e-02  1.772e-02  -5.514 3.61e-08 ***
## CitricAcid          2.748e-02  1.596e-02   1.722 0.085093 .  
## ResidualSugar       2.244e-04  4.092e-04   0.548 0.583536    
## Chlorides          -1.292e-01  4.388e-02  -2.944 0.003243 ** 
## FreeSulfurDioxide   3.260e-04  9.350e-05   3.487 0.000492 ***
## TotalSulfurDioxide  1.783e-04  5.939e-05   3.002 0.002690 ** 
## Density            -4.116e-01  5.192e-01  -0.793 0.427981    
## pH                 -2.598e-02  2.040e-02  -1.273 0.202975    
## Sulphates          -4.155e-02  1.492e-02  -2.785 0.005358 ** 
## Alcohol             1.236e-02  3.694e-03   3.347 0.000820 ***
## LabelAppeal         4.566e-01  1.637e-02  27.883  < 2e-16 ***
## AcidIndex          -1.981e-01  1.080e-02 -18.337  < 2e-16 ***
## one_Star            1.372e+00  3.941e-02  34.829  < 2e-16 ***
## two_Star            2.420e+00  3.811e-02  63.496  < 2e-16 ***
## three_Star          2.996e+00  4.440e-02  67.493  < 2e-16 ***
## four_Star           3.688e+00  7.088e-02  52.034  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.307 on 8938 degrees of freedom
## Multiple R-squared:  0.5419, Adjusted R-squared:  0.5411 
## F-statistic: 622.1 on 17 and 8938 DF,  p-value: < 2.2e-16

Linear_M1

## Linear Regression 
## 
## 8956 samples
##   17 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 6 times) 
## Summary of sample sizes: 8059, 8061, 8061, 8060, 8060, 8061, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   1.307882  0.5405671  1.025453
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

Model 2

Linear_M2_df = train_F %>% select(-ResidualSugar,-FixedAcidity,-Density,-pH,-CitricAcid)

set.seed(143)
Linear_M2 = train(TARGET ~ . , data = Linear_M2_df, method='lm',
                  trControl = control, tuneLength = 10)

summary(Linear_M2)

## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.8453 -0.8703  0.0258  0.8256  6.1704 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         2.748e+00  1.000e-01  27.482  < 2e-16 ***
## VolatileAcidity    -9.868e-02  1.771e-02  -5.572 2.59e-08 ***
## Chlorides          -1.299e-01  4.386e-02  -2.962 0.003064 ** 
## FreeSulfurDioxide   3.290e-04  9.347e-05   3.520 0.000434 ***
## TotalSulfurDioxide  1.787e-04  5.936e-05   3.010 0.002617 ** 
## Sulphates          -4.170e-02  1.491e-02  -2.797 0.005170 ** 
## Alcohol             1.250e-02  3.692e-03   3.387 0.000711 ***
## LabelAppeal         4.564e-01  1.637e-02  27.873  < 2e-16 ***
## AcidIndex          -1.964e-01  1.059e-02 -18.547  < 2e-16 ***
## one_Star            1.373e+00  3.939e-02  34.868  < 2e-16 ***
## two_Star            2.422e+00  3.809e-02  63.601  < 2e-16 ***
## three_Star          2.999e+00  4.438e-02  67.569  < 2e-16 ***
## four_Star           3.691e+00  7.087e-02  52.082  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.307 on 8943 degrees of freedom
## Multiple R-squared:  0.5417, Adjusted R-squared:  0.541 
## F-statistic: 880.7 on 12 and 8943 DF,  p-value: < 2.2e-16

Linear_M2

## Linear Regression 
## 
## 8956 samples
##   12 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 6 times) 
## Summary of sample sizes: 8059, 8061, 8061, 8060, 8060, 8061, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   1.307476  0.5408498  1.024746
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

rm(Linear_M2_df,control)

zero-inflated

Model 1

require("pscl")

set.seed(143)
ZeroInf_M1 = zeroinfl(TARGET ~ . , data = train_F, dist = "negbin")

summary(ZeroInf_M1)

## 
## Call:
## zeroinfl(formula = TARGET ~ ., data = train_F, dist = "negbin")
## 
## Pearson residuals:
##     Min      1Q  Median      3Q     Max 
## -2.3195 -0.4238 -0.0087  0.3782  4.8048 
## 
## Count model coefficients (negbin with log link):
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         1.457e+00  2.400e-01   6.069 1.28e-09 ***
## FixedAcidity        3.633e-04  1.015e-03   0.358  0.72030    
## VolatileAcidity    -1.235e-02  8.061e-03  -1.532  0.12561    
## CitricAcid          8.602e-04  7.120e-03   0.121  0.90384    
## ResidualSugar      -1.359e-04  1.854e-04  -0.733  0.46376    
## Chlorides          -2.463e-02  1.985e-02  -1.241  0.21476    
## FreeSulfurDioxide   2.682e-05  4.163e-05   0.644  0.51932    
## TotalSulfurDioxide -6.251e-06  2.630e-05  -0.238  0.81215    
## Density            -2.745e-01  2.350e-01  -1.168  0.24281    
## pH                  8.074e-03  9.305e-03   0.868  0.38558    
## Sulphates          -4.347e-03  6.774e-03  -0.642  0.52103    
## Alcohol             6.438e-03  1.664e-03   3.868  0.00011 ***
## LabelAppeal         2.311e-01  7.613e-03  30.351  < 2e-16 ***
## AcidIndex          -1.889e-02  5.838e-03  -3.236  0.00121 ** 
## one_Star            5.263e-02  2.543e-02   2.070  0.03845 *  
## two_Star            1.872e-01  2.367e-02   7.908 2.62e-15 ***
## three_Star          2.840e-01  2.486e-02  11.421  < 2e-16 ***
## four_Star           3.752e-01  3.077e-02  12.192  < 2e-16 ***
## Log(theta)          1.745e+01  2.220e+00   7.864 3.72e-15 ***
## 
## Zero-inflation model coefficients (binomial with logit link):
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        -3.365e+00  1.573e+00  -2.139 0.032453 *  
## FixedAcidity        3.057e-03  6.593e-03   0.464 0.642936    
## VolatileAcidity     2.008e-01  5.243e-02   3.829 0.000129 ***
## CitricAcid         -4.644e-02  4.717e-02  -0.985 0.324857    
## ResidualSugar      -1.757e-03  1.189e-03  -1.478 0.139494    
## Chlorides           8.358e-02  1.289e-01   0.649 0.516631    
## FreeSulfurDioxide  -8.514e-04  2.774e-04  -3.069 0.002145 ** 
## TotalSulfurDioxide -6.003e-04  1.743e-04  -3.443 0.000574 ***
## Density            -4.642e-01  1.552e+00  -0.299 0.764876    
## pH                  1.814e-01  5.936e-02   3.056 0.002246 ** 
## Sulphates           1.065e-01  4.466e-02   2.384 0.017104 *  
## Alcohol             2.168e-02  1.080e-02   2.006 0.044805 *  
## LabelAppeal         7.377e-01  5.104e-02  14.453  < 2e-16 ***
## AcidIndex           4.342e-01  3.137e-02  13.840  < 2e-16 ***
## one_Star           -2.115e+00  9.160e-02 -23.089  < 2e-16 ***
## two_Star           -5.717e+00  3.775e-01 -15.147  < 2e-16 ***
## three_Star         -2.030e+01  3.984e+02  -0.051 0.959357    
## four_Star          -2.041e+01  7.475e+02  -0.027 0.978219    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Theta = 38043630.1953 
## Number of iterations in BFGS optimization: 71 
## Log-likelihood: -1.421e+04 on 37 Df

ZeroInf_M1

## 
## Call:
## zeroinfl(formula = TARGET ~ ., data = train_F, dist = "negbin")
## 
## Count model coefficients (negbin with log link):
##        (Intercept)        FixedAcidity     VolatileAcidity  
##          1.457e+00           3.633e-04          -1.235e-02  
##         CitricAcid       ResidualSugar           Chlorides  
##          8.602e-04          -1.359e-04          -2.463e-02  
##  FreeSulfurDioxide  TotalSulfurDioxide             Density  
##          2.682e-05          -6.251e-06          -2.745e-01  
##                 pH           Sulphates             Alcohol  
##          8.074e-03          -4.347e-03           6.438e-03  
##        LabelAppeal           AcidIndex            one_Star  
##          2.311e-01          -1.889e-02           5.263e-02  
##           two_Star          three_Star           four_Star  
##          1.872e-01           2.840e-01           3.752e-01  
## Theta = 38043630.1953 
## 
## Zero-inflation model coefficients (binomial with logit link):
##        (Intercept)        FixedAcidity     VolatileAcidity  
##         -3.365e+00           3.057e-03           2.008e-01  
##         CitricAcid       ResidualSugar           Chlorides  
##         -4.644e-02          -1.757e-03           8.358e-02  
##  FreeSulfurDioxide  TotalSulfurDioxide             Density  
##         -8.514e-04          -6.003e-04          -4.642e-01  
##                 pH           Sulphates             Alcohol  
##          1.814e-01           1.065e-01           2.168e-02  
##        LabelAppeal           AcidIndex            one_Star  
##          7.377e-01           4.342e-01          -2.115e+00  
##           two_Star          three_Star           four_Star  
##         -5.717e+00          -2.030e+01          -2.041e+01

#https://stats.stackexchange.com/questions/45262/zero-inflated-count-models-in-r-what-is-the-real-advantage

Model 2

require("pscl")

ZeroInf_df = train_F %>% select(-CitricAcid,-TotalSulfurDioxide,-FixedAcidity,-FreeSulfurDioxide,-Sulphates,-ResidualSugar,-pH,-Density,-Chlorides,-VolatileAcidity)

set.seed(143)
ZeroInf_M2 = zeroinfl(TARGET ~ . , data = ZeroInf_df, dist = "negbin")

summary(ZeroInf_M2)

## 
## Call:
## zeroinfl(formula = TARGET ~ ., data = ZeroInf_df, dist = "negbin")
## 
## Pearson residuals:
##       Min        1Q    Median        3Q       Max 
## -2.338945 -0.426800 -0.003664  0.386177  4.945980 
## 
## Count model coefficients (negbin with log link):
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.212980   0.052371  23.161  < 2e-16 ***
## Alcohol      0.006411   0.001664   3.852 0.000117 ***
## LabelAppeal  0.231359   0.007609  30.407  < 2e-16 ***
## AcidIndex   -0.019958   0.005753  -3.469 0.000522 ***
## one_Star     0.052655   0.025425   2.071 0.038360 *  
## two_Star     0.186832   0.023646   7.901 2.76e-15 ***
## three_Star   0.284962   0.024838  11.473  < 2e-16 ***
## four_Star    0.376006   0.030742  12.231  < 2e-16 ***
## Log(theta)  10.750099   2.138335   5.027 4.97e-07 ***
## 
## Zero-inflation model coefficients (binomial with logit link):
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -3.20780    0.26947 -11.904   <2e-16 ***
## Alcohol       0.02149    0.01065   2.018   0.0436 *  
## LabelAppeal   0.73595    0.05033  14.622   <2e-16 ***
## AcidIndex     0.43719    0.03036  14.402   <2e-16 ***
## one_Star     -2.12462    0.09080 -23.399   <2e-16 ***
## two_Star     -5.82723    0.40293 -14.462   <2e-16 ***
## three_Star  -20.31891  403.43223  -0.050   0.9598    
## four_Star   -20.42411  753.06099  -0.027   0.9784    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Theta = 46634.6327 
## Number of iterations in BFGS optimization: 38 
## Log-likelihood: -1.425e+04 on 17 Df

#https://stats.stackexchange.com/questions/45262/zero-inflated-count-models-in-r-what-is-the-real-advantage

rm(ZeroInf_df)

SELECT MODELS

The model with the lowest RMSE is the ZeroInf_M1 model. This model contains all the variables. I find this behavior strange considering that full model with insignificant statistically variable tends to perform erratically. I will proceed to predict the test set.

test_F_train$Poiss_M1= predict(Poiss_M1,test_F_train,type = "response")
test_F_train$Poiss_M2= predict(Poiss_M2,test_F_train,type = "response")
test_F_train$NB_M1= predict(NB_M1,test_F_train,type = "response")
test_F_train$NB_M2= predict(NB_M2,test_F_train,type = "response")
test_F_train$Linear_M1= predict(Linear_M1,test_F_train)
test_F_train$Linear_M2= predict(Linear_M2,test_F_train)
test_F_train$ZeroInf_M1= predict(ZeroInf_M1,test_F_train,type = "response")
test_F_train$ZeroInf_M2= predict(ZeroInf_M2,test_F_train,type = "response")

Final_test_df = data.frame(
  Models =c("Poiss_M1","Poiss_M2","NB_M1","NB_M2","Linear_M1","Linear_M2","ZeroInf_M1","ZeroInf_M2"),
RMSE=c( sqrt(mean((test_F_train$TARGET - test_F_train$Poiss_M1)^2)),
 sqrt(mean((test_F_train$TARGET - test_F_train$Poiss_M2)^2)),
 sqrt(mean((test_F_train$TARGET - test_F_train$NB_M1)^2)),
 sqrt(mean((test_F_train$TARGET - test_F_train$NB_M2)^2)),
 sqrt(mean((test_F_train$TARGET - test_F_train$Linear_M1)^2)),
 sqrt(mean((test_F_train$TARGET - test_F_train$Linear_M2)^2)),
sqrt(mean((test_F_train$TARGET - test_F_train$ZeroInf_M1)^2)),
 sqrt(mean((test_F_train$TARGET - test_F_train$ZeroInf_M2)^2))

))



Final_test_df

imp_Eval_test$TARGET= predict(ZeroInf_M1,imp_Eval_test,type = "response")


head(imp_Eval_test$TARGET,5)

## [1] 1.982232 3.865380 2.309302 2.382840 0.743786

write.csv(imp_Eval_test,"cestevez_HW5_Predictions.csv",row.names = F)

Data 621

HW 5

Christopher Estevez

Overview

Deliverables:

DATA EXPLORATION

DATA PREPARATION

BUILD MODELS

Poisson

Model 1

Model 2

Negative Binomial

Model 1

Model 2

multiple linear regression

Model 1

Model 2

zero-inflated

Model 1

Model 2

SELECT MODELS