Data 621 - HW5

Loading of Libraries

# Packages used by the code in this report
library(tidyr)         # gather(), pivot_longer(), replace_na()
library(dplyr)         # %>%, select(), rename(), arrange(), desc()
library(ggplot2)       # histograms, boxplots, variable importance plot
library(caret)         # featurePlot(), preProcess(), createDataPartition(), varImp()
library(corrplot)      # corrplot()
library(RColorBrewer)  # brewer.pal()
library(MASS)          # glm.nb(); masks dplyr::select, hence the explicit dplyr::select calls

Data Dictionary

Loading of files

setwd("/Users/dpong/Data 621/HW5/")

# Load Wine dataset
wine_train <- read.csv('https://raw.githubusercontent.com/metis-macys-66898/Data_621/main/HW5/wine-training-data.csv', fileEncoding="UTF-8-BOM")
wine_eval <- read.csv('https://raw.githubusercontent.com/metis-macys-66898/Data_621/main/HW5/wine-evaluation-data.csv')

The INDEX variable is only a row identifier and has no bearing on the target variable (the number of cases of wine purchased), so we drop it from the dataset altogether.
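
The drop itself is not shown in the code above. A minimal sketch of how it could be done, using a small made-up data frame in place of wine_train and assuming the column is named INDEX:

```r
# Toy stand-in for wine_train; the values are made up for illustration.
toy_wine <- data.frame(
  INDEX  = 1:4,
  TARGET = c(3, 0, 4, 2),
  STARS  = c(2, NA, 3, 1)
)

# Drop the identifier column by name; setdiff() leaves the data
# unchanged if INDEX is not present.
toy_wine <- toy_wine[, setdiff(names(toy_wine), "INDEX"), drop = FALSE]

print(names(toy_wine))
```

Selecting by name rather than by position keeps the step safe to re-run after the column is already gone.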

Data Exploration

Summary

summary(wine_train)
##      TARGET       FixedAcidity     VolatileAcidity     CitricAcid     
##  Min.   :0.000   Min.   :-18.100   Min.   :-2.7900   Min.   :-3.2400  
##  1st Qu.:2.000   1st Qu.:  5.200   1st Qu.: 0.1300   1st Qu.: 0.0300  
##  Median :3.000   Median :  6.900   Median : 0.2800   Median : 0.3100  
##  Mean   :3.029   Mean   :  7.076   Mean   : 0.3241   Mean   : 0.3084  
##  3rd Qu.:4.000   3rd Qu.:  9.500   3rd Qu.: 0.6400   3rd Qu.: 0.5800  
##  Max.   :8.000   Max.   : 34.400   Max.   : 3.6800   Max.   : 3.8600  
##                                                                       
##  ResidualSugar        Chlorides       FreeSulfurDioxide TotalSulfurDioxide
##  Min.   :-127.800   Min.   :-1.1710   Min.   :-555.00   Min.   :-823.0    
##  1st Qu.:  -2.000   1st Qu.:-0.0310   1st Qu.:   0.00   1st Qu.:  27.0    
##  Median :   3.900   Median : 0.0460   Median :  30.00   Median : 123.0    
##  Mean   :   5.419   Mean   : 0.0548   Mean   :  30.85   Mean   : 120.7    
##  3rd Qu.:  15.900   3rd Qu.: 0.1530   3rd Qu.:  70.00   3rd Qu.: 208.0    
##  Max.   : 141.150   Max.   : 1.3510   Max.   : 623.00   Max.   :1057.0    
##  NA's   :616        NA's   :638       NA's   :647       NA's   :682       
##     Density             pH          Sulphates          Alcohol     
##  Min.   :0.8881   Min.   :0.480   Min.   :-3.1300   Min.   :-4.70  
##  1st Qu.:0.9877   1st Qu.:2.960   1st Qu.: 0.2800   1st Qu.: 9.00  
##  Median :0.9945   Median :3.200   Median : 0.5000   Median :10.40  
##  Mean   :0.9942   Mean   :3.208   Mean   : 0.5271   Mean   :10.49  
##  3rd Qu.:1.0005   3rd Qu.:3.470   3rd Qu.: 0.8600   3rd Qu.:12.40  
##  Max.   :1.0992   Max.   :6.130   Max.   : 4.2400   Max.   :26.50  
##                   NA's   :395     NA's   :1210      NA's   :653    
##   LabelAppeal          AcidIndex          STARS      
##  Min.   :-2.000000   Min.   : 4.000   Min.   :1.000  
##  1st Qu.:-1.000000   1st Qu.: 7.000   1st Qu.:1.000  
##  Median : 0.000000   Median : 8.000   Median :2.000  
##  Mean   :-0.009066   Mean   : 7.773   Mean   :2.042  
##  3rd Qu.: 1.000000   3rd Qu.: 8.000   3rd Qu.:3.000  
##  Max.   : 2.000000   Max.   :17.000   Max.   :4.000  
##                                       NA's   :3359

A total of 8 features have missing (NA) values. The target variable ranges from 0 to 8, which makes sense because it is the number of cases purchased. Several features that measure the quantity of a chemical in the wine take negative values, which is not physically meaningful. This may be because these variables were transformed beforehand, so we decided to leave them as is.
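
To put numbers on both observations, the missing and negative values can be counted per column. A minimal base-R sketch on a made-up data frame (toy_chem and its values are illustrative, not taken from the wine data):

```r
# Toy stand-in for a few chemistry columns; values are made up.
toy_chem <- data.frame(
  ResidualSugar = c(-2.0, 3.9, NA, 15.9),
  Chlorides     = c(-0.031, 0.046, 0.153, NA),
  pH            = c(2.96, 3.20, 3.47, 3.10)
)

# Number of missing values per column.
na_count <- colSums(is.na(toy_chem))

# Number of strictly negative values per column, ignoring NAs.
neg_count <- colSums(toy_chem < 0, na.rm = TRUE)

print(data.frame(na_count, neg_count))
```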

Histograms

# Create gather_df for ggplot()
gather_df <- wine_train %>% gather(key = 'variable', value = 'value')
# Histogram plots of each variable
ggplot(gather_df) + 
  geom_histogram(aes(x=value, y = ..density..), bins=30) + 
  geom_density(aes(x=value), color='blue') +
  facet_wrap(. ~ variable, scales='free', ncol=4)
## Warning: Removed 8200 rows containing non-finite values (stat_bin).
## Warning: Removed 8200 rows containing non-finite values (stat_density).

Most of the variables are approximately normally distributed, except for STARS and AcidIndex, which are both right-skewed.

Boxplots

Next, we draw boxplots to visualize the spread of each variable.

# Create gather_df for the input of ggplot
gather_df <- wine_train %>% gather(key = 'variable', value = 'value')
# Boxplots for each variable
ggplot(gather_df, aes(variable, value)) + 
  geom_boxplot() +  
  facet_wrap(. ~variable, scales='free', ncol=4)
## Warning: Removed 8200 rows containing non-finite values (stat_boxplot).

df_pivot_wide <- wine_train %>% 
  dplyr::select(STARS, LabelAppeal, AcidIndex, TARGET ) %>%
  pivot_longer(cols = -TARGET, names_to="variable", values_to="value") %>%
  arrange(variable, value)
df_pivot_wide %>% 
  ggplot(mapping = aes(x = factor(value), y = TARGET)) +
    geom_boxplot() + 
    facet_wrap(.~variable, scales="free") +
    theme_minimal()

Commentaries:

There aren’t many outliers for AcidIndex. There are a lot of zeros for AcidIndex values 12, 16, and 17, and there is no clear pattern in relation to TARGET. LabelAppeal is positively correlated with TARGET: the higher the LabelAppeal, the higher the TARGET. STARS shows an even clearer positive correlation with TARGET. Wines with STARS = NA are spread across the whole range of TARGET. To satisfy the requirements of the models, I impute these NA values with 0. The overall trend among the existing values stays the same: a higher STARS rating goes with a higher TARGET, that is, more cases of wine sold.

Scatter Plots

featurePlot(wine_train[,2:ncol(wine_train)], wine_train[,1], pch = 18)

What I am looking for is irregular gaps in the values of a given variable. I do not see any irregular distribution against the TARGET variable, which is shown on the y-axis.

Missing values & Imputations

With that said, we do need to check the missing values.

missing <- colSums(wine_train %>% sapply(is.na))
missing_pct <- round(missing / nrow(wine_train) * 100, 2)

stack(sort(missing_pct, decreasing = TRUE))
##    values                ind
## 1   26.25              STARS
## 2    9.46          Sulphates
## 3    5.33 TotalSulfurDioxide
## 4    5.10            Alcohol
## 5    5.06  FreeSulfurDioxide
## 6    4.99          Chlorides
## 7    4.81      ResidualSugar
## 8    3.09                 pH
## 9    0.00             TARGET
## 10   0.00       FixedAcidity
## 11   0.00    VolatileAcidity
## 12   0.00         CitricAcid
## 13   0.00            Density
## 14   0.00        LabelAppeal
## 15   0.00          AcidIndex

As you can see, there are 7 additional variables that need to be imputed in addition to the STARS variable.

Data Preparations

Strategies:

  1. impute STARS to 0

  2. Use knnImpute to impute the remaining 7 columns, together with a BoxCox transformation (which caret applies to strictly positive predictors only)

training_x <- wine_train %>% dplyr::select(-TARGET)
training_y <- wine_train$TARGET

eval_x <- wine_eval %>% dplyr::select(-TARGET)
eval_y <- wine_eval$TARGET

create_na_dummy <- function(vector) {
  as.integer(vector %>% is.na())
}

impute_missing <- function(data) {
  # Replace missing STARS with 0 
  data$STARS <- data$STARS %>%
    tidyr::replace_na(0)
  return(data)
}
# Replace missing STARS with 0 in both the training and evaluation predictors
training_x <- impute_missing(training_x)
eval_x <- impute_missing(eval_x)
imputation <- caret::preProcess(training_x, method = c("knnImpute", 'BoxCox'))
# summary(imputation)
training_x_imp <- predict(imputation, training_x)
eval_x_imp <- predict(imputation, eval_x)
clean_df <- cbind(training_y, training_x_imp) %>% 
  as.data.frame() %>%
  rename(TARGET = training_y)
clean_eval_df <- cbind(eval_y, eval_x_imp) %>% 
  as.data.frame() %>%
  rename(TARGET = eval_y)
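
One side effect worth noting: caret's knnImpute centers and scales the predictors, which is why integer-coded columns such as LabelAppeal and STARS appear with non-integer factor levels in the model summaries later on. A small base-R sketch of that effect, on made-up ratings rather than the wine data:

```r
# Made-up integer ratings standing in for a column such as LabelAppeal.
rating <- c(-2, -1, -1, 0, 0, 0, 1, 1, 2)

# Center and scale by hand: (x - mean) / sd, the usual standardization
# applied to each numeric predictor.
rating_std <- (rating - mean(rating)) / sd(rating)

# The five original levels survive, but their labels are no longer integers.
print(levels(as.factor(rating)))
print(levels(as.factor(round(rating_std, 4))))
```

The number of levels is unchanged, so the fitted factor coefficients keep their meaning; only the labels become harder to read.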

Feature-Target Correlations

stack(sort(cor(clean_df[,1], clean_df[,2:ncol(clean_df)])[,], decreasing=TRUE))
##          values                ind
## 1   0.685381473              STARS
## 2   0.356500469        LabelAppeal
## 3   0.062030498            Alcohol
## 4   0.051730323 TotalSulfurDioxide
## 5   0.043996542  FreeSulfurDioxide
## 6   0.016187709      ResidualSugar
## 7   0.008684633         CitricAcid
## 8  -0.009081197                 pH
## 9  -0.035589560            Density
## 10 -0.039072231          Chlorides
## 11 -0.039917146          Sulphates
## 12 -0.049010939       FixedAcidity
## 13 -0.088793212    VolatileAcidity
## 14 -0.221991949          AcidIndex

Only STARS (about 0.69) comes close to being highly correlated with the TARGET variable; LabelAppeal (about 0.36) is a distant second. Note that this is after the imputation.

Multi-collinearity

A simple way to check for multi-collinearity is to look at the pairwise correlation coefficients among the predictors.

correlation = cor(clean_df, use = 'pairwise.complete.obs')
corrplot(correlation, 'ellipse', type = 'lower',  order = 'hclust', col=brewer.pal(n=6, name="RdYlBu"))

The correlation coefficients among predictors are quite low, so multi-collinearity does not appear to be a concern for the regression models that follow.
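
Pairwise correlations can miss collinearity that involves three or more predictors. One complementary check, not part of the original analysis, is the variance inflation factor, VIF = 1 / (1 - R^2), where R^2 comes from regressing each predictor on all the others. A minimal base-R sketch on the built-in mtcars data; the helper name vif_manual is hypothetical:

```r
# Hypothetical helper: VIF for every column of a numeric predictor data frame.
vif_manual <- function(predictors) {
  sapply(names(predictors), function(col) {
    # Regress this predictor on all the other predictors.
    fit <- lm(reformulate(setdiff(names(predictors), col), response = col),
              data = predictors)
    1 / (1 - summary(fit)$r.squared)
  })
}

# Built-in example data, used only to keep the sketch self-contained.
vifs <- vif_manual(mtcars[, c("disp", "hp", "wt", "drat")])
print(round(vifs, 2))
```

A VIF close to 1 indicates little shared variance with the other predictors; the cut-off for "too high" is a judgment call.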

As the final step of data preparation, we partition the data into a training set (80%) and a test set (20%).

y_mat <- as.matrix(clean_df$TARGET)
# Create a train_vect 
train_vect <- createDataPartition(y_mat, p=0.8, list=FALSE)
# Build train sets 
trainX <- clean_df[train_vect,] %>% dplyr::select(-TARGET)
trainY <- clean_df[train_vect,] %>% dplyr::select(TARGET)
# Output test sets
testX <- clean_df[-train_vect,] %>% dplyr::select(-TARGET)
testY <- clean_df[-train_vect,] %>% dplyr::select(TARGET)
# Build a DF for both train and test
train_df <- as.data.frame(trainX)
train_df$TARGET <- trainY$TARGET
print(paste('Size of Training data frame: ', dim(train_df)[1]))
## [1] "Size of Training data frame:  10238"
test_df <- as.data.frame(testX)
test_df$TARGET <- testY$TARGET
print(paste('Size of Testing data frame: ', dim(test_df)[1]))
## [1] "Size of Testing data frame:  2557"
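
Because the partition is random, the split (and every metric computed from it) changes from run to run unless a seed is set; no seed is shown in the code above. A minimal base-R sketch of a reproducible 80/20 split on made-up data. Note that this is a plain random split, not the stratified sampling that createDataPartition performs, and the seed value is arbitrary:

```r
# Fix the random number generator so the split is reproducible.
set.seed(621)  # arbitrary seed chosen for illustration

# Made-up data frame standing in for clean_df.
toy_df <- data.frame(TARGET = rpois(100, lambda = 3), x = rnorm(100))

# Sample 80% of the row indices for training; the rest form the test set.
train_idx <- sample(seq_len(nrow(toy_df)), size = floor(0.8 * nrow(toy_df)))
toy_train <- toy_df[train_idx, ]
toy_test  <- toy_df[-train_idx, ]

print(c(train = nrow(toy_train), test = nrow(toy_test)))
```
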
model_perf_metrics <- function(model, trainX, trainY, testX, testY) {
  # Evaluate the model on the training set (testX and testY are accepted but not used)
  predY <- predict(model, newdata=trainX)
  model_results <- data.frame(obs = trainY, pred=predY)
  colnames(model_results) = c('obs', 'pred')
  
  # defaultSummary includes RMSE, Rsquared, and MAE by default
  model_eval <- defaultSummary(model_results)
  
  # Add AIC score to the model_eval results
  model_eval[4] <- AIC(model)
  names(model_eval)[4] <- 'AIC'
 
  # Add BIC score to the model_eval results
  model_eval[5] <- BIC(model)
  names(model_eval)[5] <- 'BIC'
   
  return(model_eval)
}
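
Two details of model_perf_metrics are worth keeping in mind when reading the metrics it reports: it predicts on trainX rather than testX, and predict() on a glm returns values on the link (log) scale unless type = "response" is given. A hedged sketch of a held-out, count-scale variant on simulated data; the name perf_on_test is hypothetical, and RMSE and MAE are computed by hand instead of through caret:

```r
# Hypothetical variant: score a fitted glm on held-out data, on the count scale.
perf_on_test <- function(model, test_df, target = "TARGET") {
  # type = "response" returns expected counts rather than log-counts.
  pred <- predict(model, newdata = test_df, type = "response")
  obs  <- test_df[[target]]
  c(RMSE = sqrt(mean((obs - pred)^2)),
    MAE  = mean(abs(obs - pred)))
}

# Simulated Poisson data, used only to make the sketch runnable.
set.seed(1)
sim <- data.frame(x = rnorm(500))
sim$TARGET <- rpois(500, lambda = exp(0.5 + 0.4 * sim$x))
fit <- glm(TARGET ~ x, data = sim[1:400, ], family = poisson)

res <- perf_on_test(fit, sim[401:500, ])
print(res)
```

Metrics computed this way are in units of cases sold, so they can be compared directly with the spread of TARGET.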

Model Building

variableImportancePlot <- function(model=NULL, chart_title='Variable Importance Plot') {
  # Make sure a model was passed; return nothing otherwise
  if (is.null(model)) {
    return(invisible(NULL))
  }
  
  # Use caret and ggplot2 to draw a variable importance plot
  caret::varImp(model) %>% as.data.frame() %>% 
    ggplot(aes(x = reorder(rownames(.), desc(Overall)), y = Overall)) +
    geom_col(aes(fill = Overall)) +
    theme(panel.background = element_blank(),
          panel.grid = element_blank(),
          axis.text.x = element_text(angle = 90)) +
    scale_fill_gradient() +
    labs(title = chart_title,
         x = "Parameter",
         y = "Relative Importance")
}

Poisson Model 1 (full model w/ 14 predictors)

pois1 <- glm(TARGET ~ FixedAcidity + VolatileAcidity + CitricAcid + ResidualSugar + 
              Chlorides + FreeSulfurDioxide + TotalSulfurDioxide + Density +
              pH + Sulphates + Alcohol + 
              as.factor(LabelAppeal) +
              as.factor(AcidIndex) +
              as.factor(STARS),
              data=train_df, 
              family=poisson
            )
summary(pois1)
## 
## Call:
## glm(formula = TARGET ~ FixedAcidity + VolatileAcidity + CitricAcid + 
##     ResidualSugar + Chlorides + FreeSulfurDioxide + TotalSulfurDioxide + 
##     Density + pH + Sulphates + Alcohol + as.factor(LabelAppeal) + 
##     as.factor(AcidIndex) + as.factor(STARS), family = poisson, 
##     data = train_df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2539  -0.6371  -0.0063   0.4424   3.6752  
## 
## Coefficients:
##                                            Estimate Std. Error z value Pr(>|z|)
## (Intercept)                              -11.687853 172.654518  -0.068 0.946028
## FixedAcidity                               0.005250   0.005789   0.907 0.364535
## VolatileAcidity                           -0.021188   0.005709  -3.711 0.000206
## CitricAcid                                 0.005663   0.005682   0.997 0.318868
## ResidualSugar                             -0.003967   0.005788  -0.685 0.493128
## Chlorides                                 -0.008474   0.005818  -1.457 0.145247
## FreeSulfurDioxide                          0.012187   0.005784   2.107 0.035124
## TotalSulfurDioxide                         0.018498   0.005839   3.168 0.001534
## Density                                   -0.009423   0.005702  -1.652 0.098434
## pH                                        -0.009472   0.005754  -1.646 0.099700
## Sulphates                                 -0.012102   0.005971  -2.027 0.042676
## Alcohol                                    0.020384   0.005885   3.464 0.000532
## as.factor(LabelAppeal)-1.11204793733397    0.245210   0.042964   5.707 1.15e-08
## as.factor(LabelAppeal)0.0101741115806247   0.444586   0.041960  10.595  < 2e-16
## as.factor(LabelAppeal)1.13239616049522     0.578377   0.042654  13.560  < 2e-16
## as.factor(LabelAppeal)2.25461820940981     0.732594   0.047787  15.330  < 2e-16
## as.factor(AcidIndex)-3.59682937695875     11.538890 172.654526   0.067 0.946715
## as.factor(AcidIndex)-1.79176983045029     11.590080 172.654514   0.067 0.946479
## as.factor(AcidIndex)-0.545318540973785    11.556816 172.654514   0.067 0.946633
## as.factor(AcidIndex)0.362910765511677     11.519185 172.654514   0.067 0.946806
## as.factor(AcidIndex)1.05172974217783      11.411107 172.654515   0.066 0.947304
## as.factor(AcidIndex)1.59059728918163      11.231692 172.654517   0.065 0.948132
## as.factor(AcidIndex)2.02271372429848      10.909721 172.654525   0.063 0.949617
## as.factor(AcidIndex)2.37629509167962      10.949795 172.654538   0.063 0.949432
## as.factor(AcidIndex)2.67051656830802      11.106215 172.654545   0.064 0.948710
## as.factor(AcidIndex)2.9188445277671       10.988729 172.654580   0.064 0.949252
## as.factor(AcidIndex)3.13100139587667      11.454743 172.654695   0.066 0.947103
## as.factor(AcidIndex)3.31417429494859      11.017790 172.655094   0.064 0.949118
## as.factor(AcidIndex)3.47378568897179      10.878778 172.655096   0.063 0.949760
## as.factor(STARS)-0.42623524866846          0.746259   0.021899  34.078  < 2e-16
## as.factor(STARS)0.416552574962037          1.065055   0.020459  52.059  < 2e-16
## as.factor(STARS)1.25934039859254           1.179472   0.021518  54.812  < 2e-16
## as.factor(STARS)2.10212822222303           1.302543   0.027050  48.154  < 2e-16
##                                             
## (Intercept)                                 
## FixedAcidity                                
## VolatileAcidity                          ***
## CitricAcid                                  
## ResidualSugar                               
## Chlorides                                   
## FreeSulfurDioxide                        *  
## TotalSulfurDioxide                       ** 
## Density                                  .  
## pH                                       .  
## Sulphates                                *  
## Alcohol                                  ***
## as.factor(LabelAppeal)-1.11204793733397  ***
## as.factor(LabelAppeal)0.0101741115806247 ***
## as.factor(LabelAppeal)1.13239616049522   ***
## as.factor(LabelAppeal)2.25461820940981   ***
## as.factor(AcidIndex)-3.59682937695875       
## as.factor(AcidIndex)-1.79176983045029       
## as.factor(AcidIndex)-0.545318540973785      
## as.factor(AcidIndex)0.362910765511677       
## as.factor(AcidIndex)1.05172974217783        
## as.factor(AcidIndex)1.59059728918163        
## as.factor(AcidIndex)2.02271372429848        
## as.factor(AcidIndex)2.37629509167962        
## as.factor(AcidIndex)2.67051656830802        
## as.factor(AcidIndex)2.9188445277671         
## as.factor(AcidIndex)3.13100139587667        
## as.factor(AcidIndex)3.31417429494859        
## as.factor(AcidIndex)3.47378568897179        
## as.factor(STARS)-0.42623524866846        ***
## as.factor(STARS)0.416552574962037        ***
## as.factor(STARS)1.25934039859254         ***
## as.factor(STARS)2.10212822222303         ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 18262  on 10237  degrees of freedom
## Residual deviance: 10784  on 10205  degrees of freedom
## AIC: 36436
## 
## Number of Fisher Scoring iterations: 9
# Evaluation and VarImp 

(pois1_eval <- model_perf_metrics(pois1, trainX, trainY, testX, testY))
##         RMSE     Rsquared          MAE          aic          bic 
##     2.593117     0.505956     2.228388 36435.917144 36674.634576
poi1VarImp <- variableImportancePlot(pois1, "Poisson Model 1 Variable Importance")

Poisson Model 2

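This model keeps a reduced set of predictors: those that are statistically significant in Model 1, plus Chlorides and AcidIndex, which did not reach significance there. Note that it is fit on clean_df rather than train_df, so its AIC and deviance are not directly comparable with those of Model 1.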

Predictors include:

  • VolatileAcidity

  • Chlorides

  • FreeSulfurDioxide

  • TotalSulfurDioxide

  • Sulphates

  • Alcohol

  • LabelAppeal

  • AcidIndex

  • STARS
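
Dropping predictors by p-value can be cross-checked with a likelihood-ratio test between the full and the reduced model, provided both are fit to the same rows. A minimal sketch on simulated data, not on the wine data; the variable names x1, x2 and y are made up:

```r
# Simulated counts where x1 matters and x2 is pure noise.
set.seed(2)
sim <- data.frame(x1 = rnorm(300), x2 = rnorm(300))
sim$y <- rpois(300, lambda = exp(1 + 0.5 * sim$x1))

full    <- glm(y ~ x1 + x2, data = sim, family = poisson)
reduced <- glm(y ~ x1,      data = sim, family = poisson)

# Likelihood-ratio (deviance) test of the reduced model against the full one.
lrt <- anova(reduced, full, test = "Chisq")
print(lrt)

# AIC is comparable here because both models use the same data.
print(AIC(reduced, full))
```

A large p-value in the test suggests the dropped terms add little; the comparison is only valid when both fits use identical observations.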

pois2 <- glm(TARGET ~ VolatileAcidity + Chlorides + FreeSulfurDioxide + TotalSulfurDioxide + 
              Sulphates + Alcohol + 
              as.factor(LabelAppeal) + 
              as.factor(AcidIndex) + 
              as.factor(STARS),
              data=clean_df, 
              family=poisson
             )
summary(pois2)
## 
## Call:
## glm(formula = TARGET ~ VolatileAcidity + Chlorides + FreeSulfurDioxide + 
##     TotalSulfurDioxide + Sulphates + Alcohol + as.factor(LabelAppeal) + 
##     as.factor(AcidIndex) + as.factor(STARS), family = poisson, 
##     data = clean_df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2357  -0.6510  -0.0058   0.4411   3.6662  
## 
## Coefficients:
##                                           Estimate Std. Error z value Pr(>|z|)
## (Intercept)                               0.019048   0.318723   0.060  0.95234
## VolatileAcidity                          -0.023413   0.005122  -4.571 4.85e-06
## Chlorides                                -0.012368   0.005217  -2.371  0.01776
## FreeSulfurDioxide                         0.013273   0.005181   2.562  0.01041
## TotalSulfurDioxide                        0.016843   0.005244   3.212  0.00132
## Sulphates                                -0.010623   0.005312  -2.000  0.04554
## Alcohol                                   0.016006   0.005232   3.059  0.00222
## as.factor(LabelAppeal)-1.11204793733397   0.239676   0.037999   6.307 2.84e-10
## as.factor(LabelAppeal)0.0101741115806247  0.429898   0.037065  11.599  < 2e-16
## as.factor(LabelAppeal)1.13239616049522    0.563397   0.037712  14.940  < 2e-16
## as.factor(LabelAppeal)2.25461820940981    0.698247   0.042448  16.450  < 2e-16
## as.factor(AcidIndex)-3.59682937695875    -0.158946   0.322391  -0.493  0.62200
## as.factor(AcidIndex)-1.79176983045029    -0.114374   0.316934  -0.361  0.71819
## as.factor(AcidIndex)-0.545318540973785   -0.148098   0.316650  -0.468  0.64000
## as.factor(AcidIndex)0.362910765511677    -0.178792   0.316681  -0.565  0.57236
## as.factor(AcidIndex)1.05172974217783     -0.288387   0.316984  -0.910  0.36294
## as.factor(AcidIndex)1.59059728918163     -0.445135   0.318061  -1.400  0.16165
## as.factor(AcidIndex)2.02271372429848     -0.806669   0.321621  -2.508  0.01214
## as.factor(AcidIndex)2.37629509167962     -0.820516   0.327286  -2.507  0.01217
## as.factor(AcidIndex)2.67051656830802     -0.658949   0.330177  -1.996  0.04596
## as.factor(AcidIndex)2.9188445277671      -0.759674   0.342770  -2.216  0.02667
## as.factor(AcidIndex)3.13100139587667     -0.315477   0.403499  -0.782  0.43430
## as.factor(AcidIndex)3.31417429494859     -0.969335   0.548020  -1.769  0.07693
## as.factor(AcidIndex)3.47378568897179     -1.203509   0.548079  -2.196  0.02810
## as.factor(STARS)-0.42623524866846         0.755288   0.019570  38.594  < 2e-16
## as.factor(STARS)0.416552574962037         1.074130   0.018265  58.809  < 2e-16
## as.factor(STARS)1.25934039859254          1.191989   0.019240  61.953  < 2e-16
## as.factor(STARS)2.10212822222303          1.312474   0.024338  53.926  < 2e-16
##                                             
## (Intercept)                                 
## VolatileAcidity                          ***
## Chlorides                                *  
## FreeSulfurDioxide                        *  
## TotalSulfurDioxide                       ** 
## Sulphates                                *  
## Alcohol                                  ** 
## as.factor(LabelAppeal)-1.11204793733397  ***
## as.factor(LabelAppeal)0.0101741115806247 ***
## as.factor(LabelAppeal)1.13239616049522   ***
## as.factor(LabelAppeal)2.25461820940981   ***
## as.factor(AcidIndex)-3.59682937695875       
## as.factor(AcidIndex)-1.79176983045029       
## as.factor(AcidIndex)-0.545318540973785      
## as.factor(AcidIndex)0.362910765511677       
## as.factor(AcidIndex)1.05172974217783        
## as.factor(AcidIndex)1.59059728918163        
## as.factor(AcidIndex)2.02271372429848     *  
## as.factor(AcidIndex)2.37629509167962     *  
## as.factor(AcidIndex)2.67051656830802     *  
## as.factor(AcidIndex)2.9188445277671      *  
## as.factor(AcidIndex)3.13100139587667        
## as.factor(AcidIndex)3.31417429494859     .  
## as.factor(AcidIndex)3.47378568897179     *  
## as.factor(STARS)-0.42623524866846        ***
## as.factor(STARS)0.416552574962037        ***
## as.factor(STARS)1.25934039859254         ***
## as.factor(STARS)2.10212822222303         ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 22861  on 12794  degrees of freedom
## Residual deviance: 13534  on 12767  degrees of freedom
## AIC: 45532
## 
## Number of Fisher Scoring iterations: 6
# Evaluate Model 2 (model_perf_metrics predicts on trainX)
(pois2_eval <- model_perf_metrics(pois2, trainX, trainY, testX, testY))
##         RMSE     Rsquared          MAE          aic          bic 
## 2.591346e+00 5.210688e-01 2.226889e+00 4.553219e+04 4.574098e+04
poi2VarImp <- variableImportancePlot(pois2, "Poisson Model 2 Variable Importance")

Negative Binomial Model 1 (full model w/ 14 predictors)

nb1 <- glm.nb(TARGET ~ FixedAcidity + VolatileAcidity + CitricAcid + ResidualSugar + 
                Chlorides + FreeSulfurDioxide + TotalSulfurDioxide + Density +
                pH + Sulphates + Alcohol + 
                as.factor(LabelAppeal) +
                as.factor(AcidIndex) +
                as.factor(STARS),
              data=clean_df)
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached

## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
summary(nb1)
## 
## Call:
## glm.nb(formula = TARGET ~ FixedAcidity + VolatileAcidity + CitricAcid + 
##     ResidualSugar + Chlorides + FreeSulfurDioxide + TotalSulfurDioxide + 
##     Density + pH + Sulphates + Alcohol + as.factor(LabelAppeal) + 
##     as.factor(AcidIndex) + as.factor(STARS), data = clean_df, 
##     init.theta = 40957.00204, link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2219  -0.6496  -0.0055   0.4446   3.6790  
## 
## Coefficients:
##                                            Estimate Std. Error z value Pr(>|z|)
## (Intercept)                               0.0275198  0.3190721   0.086  0.93127
## FixedAcidity                              0.0010176  0.0051829   0.196  0.84435
## VolatileAcidity                          -0.0231944  0.0051231  -4.527 5.97e-06
## CitricAcid                                0.0039724  0.0050838   0.781  0.43458
## ResidualSugar                             0.0005767  0.0052060   0.111  0.91179
## Chlorides                                -0.0122552  0.0052206  -2.347  0.01890
## FreeSulfurDioxide                         0.0132531  0.0051832   2.557  0.01056
## TotalSulfurDioxide                        0.0170166  0.0052484   3.242  0.00119
## Density                                  -0.0074549  0.0050938  -1.464  0.14332
## pH                                       -0.0066655  0.0051875  -1.285  0.19882
## Sulphates                                -0.0106282  0.0053149  -2.000  0.04554
## Alcohol                                   0.0157805  0.0052355   3.014  0.00258
## as.factor(LabelAppeal)-1.11204793733397   0.2398534  0.0380017   6.312 2.76e-10
## as.factor(LabelAppeal)0.0101741115806247  0.4300463  0.0370666  11.602  < 2e-16
## as.factor(LabelAppeal)1.13239616049522    0.5633460  0.0377142  14.937  < 2e-16
## as.factor(LabelAppeal)2.25461820940981    0.6992610  0.0424548  16.471  < 2e-16
## as.factor(AcidIndex)-3.59682937695875    -0.1651538  0.3226593  -0.512  0.60875
## as.factor(AcidIndex)-1.79176983045029    -0.1214774  0.3172467  -0.383  0.70179
## as.factor(AcidIndex)-0.545318540973785   -0.1558113  0.3169954  -0.492  0.62305
## as.factor(AcidIndex)0.362910765511677    -0.1871144  0.3170413  -0.590  0.55506
## as.factor(AcidIndex)1.05172974217783     -0.2972766  0.3173791  -0.937  0.34893
## as.factor(AcidIndex)1.59059728918163     -0.4542592  0.3184811  -1.426  0.15377
## as.factor(AcidIndex)2.02271372429848     -0.8158352  0.3220684  -2.533  0.01131
## as.factor(AcidIndex)2.37629509167962     -0.8303961  0.3277299  -2.534  0.01128
## as.factor(AcidIndex)2.67051656830802     -0.6688133  0.3306330  -2.023  0.04309
## as.factor(AcidIndex)2.9188445277671      -0.7687131  0.3432641  -2.239  0.02513
## as.factor(AcidIndex)3.13100139587667     -0.3297889  0.4038365  -0.817  0.41413
## as.factor(AcidIndex)3.31417429494859     -0.9814037  0.5484760  -1.789  0.07356
## as.factor(AcidIndex)3.47378568897179     -1.2022430  0.5486104  -2.191  0.02842
## as.factor(STARS)-0.42623524866846         0.7548590  0.0195728  38.567  < 2e-16
## as.factor(STARS)0.416552574962037         1.0732229  0.0182738  58.730  < 2e-16
## as.factor(STARS)1.25934039859254          1.1910222  0.0192473  61.880  < 2e-16
## as.factor(STARS)2.10212822222303          1.3117031  0.0243440  53.882  < 2e-16
##                                             
## (Intercept)                                 
## FixedAcidity                                
## VolatileAcidity                          ***
## CitricAcid                                  
## ResidualSugar                               
## Chlorides                                *  
## FreeSulfurDioxide                        *  
## TotalSulfurDioxide                       ** 
## Density                                     
## pH                                          
## Sulphates                                *  
## Alcohol                                  ** 
## as.factor(LabelAppeal)-1.11204793733397  ***
## as.factor(LabelAppeal)0.0101741115806247 ***
## as.factor(LabelAppeal)1.13239616049522   ***
## as.factor(LabelAppeal)2.25461820940981   ***
## as.factor(AcidIndex)-3.59682937695875       
## as.factor(AcidIndex)-1.79176983045029       
## as.factor(AcidIndex)-0.545318540973785      
## as.factor(AcidIndex)0.362910765511677       
## as.factor(AcidIndex)1.05172974217783        
## as.factor(AcidIndex)1.59059728918163        
## as.factor(AcidIndex)2.02271372429848     *  
## as.factor(AcidIndex)2.37629509167962     *  
## as.factor(AcidIndex)2.67051656830802     *  
## as.factor(AcidIndex)2.9188445277671      *  
## as.factor(AcidIndex)3.13100139587667        
## as.factor(AcidIndex)3.31417429494859     .  
## as.factor(AcidIndex)3.47378568897179     *  
## as.factor(STARS)-0.42623524866846        ***
## as.factor(STARS)0.416552574962037        ***
## as.factor(STARS)1.25934039859254         ***
## as.factor(STARS)2.10212822222303         ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(40957) family taken to be 1)
## 
##     Null deviance: 22860  on 12794  degrees of freedom
## Residual deviance: 13529  on 12762  degrees of freedom
## AIC: 45540
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  40957 
##           Std. Err.:  34344 
## Warning while fitting theta: iteration limit reached 
## 
##  2 x log-likelihood:  -45472.13
(nb1_eval <- model_perf_metrics(nb1, trainX, trainY, testX, testY))
##         RMSE     Rsquared          MAE          aic          bic 
## 2.591223e+00 5.215761e-01 2.226684e+00 4.554013e+04 4.579366e+04
nb1VarImp <- variableImportancePlot(nb1, "Negative Binomial 1 Variable Importance")

Negative Binomial Model 2

We keep only the predictors that are statistically significant in Negative Binomial Model 1.

Predictors include:

  • VolatileAcidity

  • Chlorides

  • FreeSulfurDioxide

  • TotalSulfurDioxide

  • Sulphates

  • Alcohol

  • LabelAppeal

  • AcidIndex

  • STARS

nb2 <- glm.nb(TARGET ~ VolatileAcidity + Chlorides + FreeSulfurDioxide + TotalSulfurDioxide + 
              Sulphates + Alcohol + 
              as.factor(LabelAppeal) + 
              as.factor(AcidIndex) + 
              as.factor(STARS),
              data=clean_df)
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached

## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
summary(nb2)
## 
## Call:
## glm.nb(formula = TARGET ~ VolatileAcidity + Chlorides + FreeSulfurDioxide + 
##     TotalSulfurDioxide + Sulphates + Alcohol + as.factor(LabelAppeal) + 
##     as.factor(AcidIndex) + as.factor(STARS), data = clean_df, 
##     init.theta = 40946.13456, link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2356  -0.6510  -0.0058   0.4411   3.6661  
## 
## Coefficients:
##                                           Estimate Std. Error z value Pr(>|z|)
## (Intercept)                               0.019070   0.318742   0.060  0.95229
## VolatileAcidity                          -0.023413   0.005122  -4.571 4.85e-06
## Chlorides                                -0.012368   0.005217  -2.371  0.01776
## FreeSulfurDioxide                         0.013274   0.005181   2.562  0.01041
## TotalSulfurDioxide                        0.016844   0.005245   3.212  0.00132
## Sulphates                                -0.010623   0.005313  -2.000  0.04554
## Alcohol                                   0.016005   0.005233   3.059  0.00222
## as.factor(LabelAppeal)-1.11204793733397   0.239676   0.038000   6.307 2.84e-10
## as.factor(LabelAppeal)0.0101741115806247  0.429897   0.037066  11.598  < 2e-16
## as.factor(LabelAppeal)1.13239616049522    0.563394   0.037712  14.939  < 2e-16
## as.factor(LabelAppeal)2.25461820940981    0.698243   0.042449  16.449  < 2e-16
## as.factor(AcidIndex)-3.59682937695875    -0.158966   0.322411  -0.493  0.62198
## as.factor(AcidIndex)-1.79176983045029    -0.114392   0.316953  -0.361  0.71817
## as.factor(AcidIndex)-0.545318540973785   -0.148117   0.316669  -0.468  0.63998
## as.factor(AcidIndex)0.362910765511677    -0.178811   0.316700  -0.565  0.57234
## as.factor(AcidIndex)1.05172974217783     -0.288410   0.317003  -0.910  0.36293
## as.factor(AcidIndex)1.59059728918163     -0.445160   0.318080  -1.400  0.16166
## as.factor(AcidIndex)2.02271372429848     -0.806699   0.321640  -2.508  0.01214
## as.factor(AcidIndex)2.37629509167962     -0.820547   0.327305  -2.507  0.01218
## as.factor(AcidIndex)2.67051656830802     -0.658978   0.330196  -1.996  0.04596
## as.factor(AcidIndex)2.9188445277671      -0.759701   0.342789  -2.216  0.02668
## as.factor(AcidIndex)3.13100139587667     -0.315502   0.403519  -0.782  0.43429
## as.factor(AcidIndex)3.31417429494859     -0.969370   0.548038  -1.769  0.07693
## as.factor(AcidIndex)3.47378568897179     -1.203547   0.548095  -2.196  0.02810
## as.factor(STARS)-0.42623524866846         0.755287   0.019570  38.593  < 2e-16
## as.factor(STARS)0.416552574962037         1.074130   0.018265  58.808  < 2e-16
## as.factor(STARS)1.25934039859254          1.191989   0.019241  61.951  < 2e-16
## as.factor(STARS)2.10212822222303          1.312476   0.024340  53.924  < 2e-16
##                                             
## (Intercept)                                 
## VolatileAcidity                          ***
## Chlorides                                *  
## FreeSulfurDioxide                        *  
## TotalSulfurDioxide                       ** 
## Sulphates                                *  
## Alcohol                                  ** 
## as.factor(LabelAppeal)-1.11204793733397  ***
## as.factor(LabelAppeal)0.0101741115806247 ***
## as.factor(LabelAppeal)1.13239616049522   ***
## as.factor(LabelAppeal)2.25461820940981   ***
## as.factor(AcidIndex)-3.59682937695875       
## as.factor(AcidIndex)-1.79176983045029       
## as.factor(AcidIndex)-0.545318540973785      
## as.factor(AcidIndex)0.362910765511677       
## as.factor(AcidIndex)1.05172974217783        
## as.factor(AcidIndex)1.59059728918163        
## as.factor(AcidIndex)2.02271372429848     *  
## as.factor(AcidIndex)2.37629509167962     *  
## as.factor(AcidIndex)2.67051656830802     *  
## as.factor(AcidIndex)2.9188445277671      *  
## as.factor(AcidIndex)3.13100139587667        
## as.factor(AcidIndex)3.31417429494859     .  
## as.factor(AcidIndex)3.47378568897179     *  
## as.factor(STARS)-0.42623524866846        ***
## as.factor(STARS)0.416552574962037        ***
## as.factor(STARS)1.25934039859254         ***
## as.factor(STARS)2.10212822222303         ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(40946.13) family taken to be 1)
## 
##     Null deviance: 22860  on 12794  degrees of freedom
## Residual deviance: 13534  on 12767  degrees of freedom
## AIC: 45535
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  40946 
##           Std. Err.:  34332 
## Warning while fitting theta: iteration limit reached 
## 
##  2 x log-likelihood:  -45476.61
(nb2_eval <- model_perf_metrics(nb2, trainX, trainY, testX, testY))
##         RMSE     Rsquared          MAE          aic          bic 
## 2.591346e+00 5.210685e-01 2.226888e+00 4.553461e+04 4.575086e+04
nb2VarImp <- variableImportancePlot(nb2, "Negative Binomial 2 Variable Importance")
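
The fit above reports a Theta of about 40946 together with an iteration-limit warning. For a negative binomial with mean mu and size theta, the variance is mu + mu^2/theta, so a theta this large leaves almost no room for extra dispersion and the distribution is numerically very close to a Poisson. A minimal base R sketch of that limit follows; the mean of 3 is an illustrative assumption (roughly the TARGET mean in the summary), not a value taken from the fitted model.

```r
# Illustrative values: mu is assumed, theta is the estimate reported by summary(nb2)
mu    <- 3
theta <- 40946
x     <- 0:20

# Probability mass functions on the same support
p_pois <- dpois(x, lambda = mu)
p_nb   <- dnbinom(x, size = theta, mu = mu)

# Largest pointwise gap between the two distributions
max_gap <- max(abs(p_pois - p_nb))

# Extra variance the negative binomial adds on top of the Poisson variance mu
extra_var <- mu^2 / theta

print(max_gap)
print(extra_var)
```

When the gap is this small, the negative binomial and Poisson fits are expected to give nearly the same coefficients and predictions, which is consistent with the nearly identical metrics reported for the two families.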

Linear Model 1 (full model with 14 predictors)

lm1 <- lm(TARGET ~ FixedAcidity + VolatileAcidity + CitricAcid + ResidualSugar + 
                Chlorides + FreeSulfurDioxide + TotalSulfurDioxide + Density +
                pH + Sulphates + Alcohol + 
                as.factor(LabelAppeal) +
                as.factor(AcidIndex) +
                as.factor(STARS),
              data=clean_df)
summary(lm1)
## 
## Call:
## lm(formula = TARGET ~ FixedAcidity + VolatileAcidity + CitricAcid + 
##     ResidualSugar + Chlorides + FreeSulfurDioxide + TotalSulfurDioxide + 
##     Density + pH + Sulphates + Alcohol + as.factor(LabelAppeal) + 
##     as.factor(AcidIndex) + as.factor(STARS), data = clean_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.9635 -0.8591  0.0325  0.8384  6.0750 
## 
## Coefficients:
##                                           Estimate Std. Error t value Pr(>|t|)
## (Intercept)                               0.995095   0.755491   1.317  0.18781
## FixedAcidity                              0.004930   0.011720   0.421  0.67401
## VolatileAcidity                          -0.073667   0.011565  -6.370 1.96e-10
## CitricAcid                                0.014455   0.011558   1.251  0.21109
## ResidualSugar                             0.004027   0.011782   0.342  0.73251
## Chlorides                                -0.038485   0.011776  -3.268  0.00109
## FreeSulfurDioxide                         0.039825   0.011786   3.379  0.00073
## TotalSulfurDioxide                        0.049447   0.011816   4.185 2.87e-05
## Density                                  -0.022475   0.011545  -1.947  0.05160
## pH                                       -0.018619   0.011704  -1.591  0.11167
## Sulphates                                -0.028248   0.012015  -2.351  0.01873
## Alcohol                                   0.050812   0.011830   4.295 1.76e-05
## as.factor(LabelAppeal)-1.11204793733397   0.367639   0.062729   5.861 4.72e-09
## as.factor(LabelAppeal)0.0101741115806247  0.835185   0.061168  13.654  < 2e-16
## as.factor(LabelAppeal)1.13239616049522    1.302062   0.063917  20.371  < 2e-16
## as.factor(LabelAppeal)2.25461820940981    1.889951   0.084169  22.454  < 2e-16
## as.factor(AcidIndex)-3.59682937695875    -0.334854   0.767938  -0.436  0.66281
## as.factor(AcidIndex)-1.79176983045029    -0.220221   0.754101  -0.292  0.77027
## as.factor(AcidIndex)-0.545318540973785   -0.322349   0.753472  -0.428  0.66879
## as.factor(AcidIndex)0.362910765511677    -0.429363   0.753539  -0.570  0.56883
## as.factor(AcidIndex)1.05172974217783     -0.732560   0.754117  -0.971  0.33136
## as.factor(AcidIndex)1.59059728918163     -1.041297   0.755385  -1.378  0.16807
## as.factor(AcidIndex)2.02271372429848     -1.513113   0.757752  -1.997  0.04586
## as.factor(AcidIndex)2.37629509167962     -1.533481   0.762156  -2.012  0.04424
## as.factor(AcidIndex)2.67051656830802     -1.552014   0.769606  -2.017  0.04375
## as.factor(AcidIndex)2.9188445277671      -1.400154   0.777299  -1.801  0.07168
## as.factor(AcidIndex)3.13100139587667     -0.692206   0.883131  -0.784  0.43317
## as.factor(AcidIndex)3.31417429494859     -1.772148   0.952843  -1.860  0.06293
## as.factor(AcidIndex)3.47378568897179     -1.920432   0.900840  -2.132  0.03304
## as.factor(STARS)-0.42623524866846         1.346560   0.032920  40.904  < 2e-16
## as.factor(STARS)0.416552574962037         2.381720   0.032021  74.381  < 2e-16
## as.factor(STARS)1.25934039859254          2.942287   0.037079  79.352  < 2e-16
## as.factor(STARS)2.10212822222303          3.629958   0.059150  61.368  < 2e-16
##                                             
## (Intercept)                                 
## FixedAcidity                                
## VolatileAcidity                          ***
## CitricAcid                                  
## ResidualSugar                               
## Chlorides                                ** 
## FreeSulfurDioxide                        ***
## TotalSulfurDioxide                       ***
## Density                                  .  
## pH                                          
## Sulphates                                *  
## Alcohol                                  ***
## as.factor(LabelAppeal)-1.11204793733397  ***
## as.factor(LabelAppeal)0.0101741115806247 ***
## as.factor(LabelAppeal)1.13239616049522   ***
## as.factor(LabelAppeal)2.25461820940981   ***
## as.factor(AcidIndex)-3.59682937695875       
## as.factor(AcidIndex)-1.79176983045029       
## as.factor(AcidIndex)-0.545318540973785      
## as.factor(AcidIndex)0.362910765511677       
## as.factor(AcidIndex)1.05172974217783        
## as.factor(AcidIndex)1.59059728918163        
## as.factor(AcidIndex)2.02271372429848     *  
## as.factor(AcidIndex)2.37629509167962     *  
## as.factor(AcidIndex)2.67051656830802     *  
## as.factor(AcidIndex)2.9188445277671      .  
## as.factor(AcidIndex)3.13100139587667        
## as.factor(AcidIndex)3.31417429494859     .  
## as.factor(AcidIndex)3.47378568897179     *  
## as.factor(STARS)-0.42623524866846        ***
## as.factor(STARS)0.416552574962037        ***
## as.factor(STARS)1.25934039859254         ***
## as.factor(STARS)2.10212822222303         ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.302 on 12762 degrees of freedom
## Multiple R-squared:  0.5441, Adjusted R-squared:  0.5429 
## F-statistic: 475.9 on 32 and 12762 DF,  p-value: < 2.2e-16
(lm1_eval <- model_perf_metrics(lm1, trainX, trainY, testX, testY))
##         RMSE     Rsquared          MAE          aic          bic 
## 1.297132e+00 5.467544e-01 1.014874e+00 4.310630e+04 4.335983e+04
lm1VarImp <- variableImportancePlot(lm1, "Linear Model 1 Variable Importance")

Linear Model 2

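For this linear model, we used stepAIC (from the MASS package) to run stepwise variable selection in both directions, starting from the full model lm1. The selected model drops FixedAcidity, CitricAcid, and ResidualSugar.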

lm2 <- stepAIC(lm1, direction = "both",
               scope = list(upper = lm1, lower = ~ 1),
               scale = 0, trace = FALSE)
summary(lm2)
## 
## Call:
## lm(formula = TARGET ~ VolatileAcidity + Chlorides + FreeSulfurDioxide + 
##     TotalSulfurDioxide + Density + pH + Sulphates + Alcohol + 
##     as.factor(LabelAppeal) + as.factor(AcidIndex) + as.factor(STARS), 
##     data = clean_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.9616 -0.8590  0.0352  0.8399  6.0675 
## 
## Coefficients:
##                                          Estimate Std. Error t value Pr(>|t|)
## (Intercept)                               0.99019    0.75488   1.312 0.189639
## VolatileAcidity                          -0.07393    0.01156  -6.394 1.67e-10
## Chlorides                                -0.03864    0.01177  -3.282 0.001034
## FreeSulfurDioxide                         0.04009    0.01178   3.403 0.000669
## TotalSulfurDioxide                        0.04960    0.01181   4.200 2.69e-05
## Density                                  -0.02270    0.01154  -1.967 0.049224
## pH                                       -0.01862    0.01170  -1.591 0.111585
## Sulphates                                -0.02837    0.01201  -2.362 0.018191
## Alcohol                                   0.05100    0.01183   4.313 1.62e-05
## as.factor(LabelAppeal)-1.11204793733397   0.36722    0.06272   5.854 4.90e-09
## as.factor(LabelAppeal)0.0101741115806247  0.83483    0.06116  13.649  < 2e-16
## as.factor(LabelAppeal)1.13239616049522    1.30161    0.06391  20.367  < 2e-16
## as.factor(LabelAppeal)2.25461820940981    1.88998    0.08416  22.456  < 2e-16
## as.factor(AcidIndex)-3.59682937695875    -0.33511    0.76744  -0.437 0.662366
## as.factor(AcidIndex)-1.79176983045029    -0.21824    0.75353  -0.290 0.772104
## as.factor(AcidIndex)-0.545318540973785   -0.31837    0.75287  -0.423 0.672396
## as.factor(AcidIndex)0.362910765511677    -0.42442    0.75293  -0.564 0.572976
## as.factor(AcidIndex)1.05172974217783     -0.72561    0.75343  -0.963 0.335525
## as.factor(AcidIndex)1.59059728918163     -1.03390    0.75463  -1.370 0.170687
## as.factor(AcidIndex)2.02271372429848     -1.50407    0.75695  -1.987 0.046941
## as.factor(AcidIndex)2.37629509167962     -1.52217    0.76131  -1.999 0.045584
## as.factor(AcidIndex)2.67051656830802     -1.54018    0.76867  -2.004 0.045124
## as.factor(AcidIndex)2.9188445277671      -1.38512    0.77635  -1.784 0.074426
## as.factor(AcidIndex)3.13100139587667     -0.67937    0.88243  -0.770 0.441383
## as.factor(AcidIndex)3.31417429494859     -1.75343    0.95178  -1.842 0.065461
## as.factor(AcidIndex)3.47378568897179     -1.89498    0.89978  -2.106 0.035220
## as.factor(STARS)-0.42623524866846         1.34682    0.03291  40.918  < 2e-16
## as.factor(STARS)0.416552574962037         2.38256    0.03201  74.442  < 2e-16
## as.factor(STARS)1.25934039859254          2.94276    0.03707  79.374  < 2e-16
## as.factor(STARS)2.10212822222303          3.63105    0.05914  61.397  < 2e-16
##                                             
## (Intercept)                                 
## VolatileAcidity                          ***
## Chlorides                                ** 
## FreeSulfurDioxide                        ***
## TotalSulfurDioxide                       ***
## Density                                  *  
## pH                                          
## Sulphates                                *  
## Alcohol                                  ***
## as.factor(LabelAppeal)-1.11204793733397  ***
## as.factor(LabelAppeal)0.0101741115806247 ***
## as.factor(LabelAppeal)1.13239616049522   ***
## as.factor(LabelAppeal)2.25461820940981   ***
## as.factor(AcidIndex)-3.59682937695875       
## as.factor(AcidIndex)-1.79176983045029       
## as.factor(AcidIndex)-0.545318540973785      
## as.factor(AcidIndex)0.362910765511677       
## as.factor(AcidIndex)1.05172974217783        
## as.factor(AcidIndex)1.59059728918163        
## as.factor(AcidIndex)2.02271372429848     *  
## as.factor(AcidIndex)2.37629509167962     *  
## as.factor(AcidIndex)2.67051656830802     *  
## as.factor(AcidIndex)2.9188445277671      .  
## as.factor(AcidIndex)3.13100139587667        
## as.factor(AcidIndex)3.31417429494859     .  
## as.factor(AcidIndex)3.47378568897179     *  
## as.factor(STARS)-0.42623524866846        ***
## as.factor(STARS)0.416552574962037        ***
## as.factor(STARS)1.25934039859254         ***
## as.factor(STARS)2.10212822222303         ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.302 on 12765 degrees of freedom
## Multiple R-squared:  0.544,  Adjusted R-squared:  0.543 
## F-statistic: 525.1 on 29 and 12765 DF,  p-value: < 2.2e-16
(lm2_eval <- model_perf_metrics(lm2, trainX, trainY, testX, testY))
##         RMSE     Rsquared          MAE          aic          bic 
## 1.297291e+00 5.466431e-01 1.014944e+00 4.310216e+04 4.333332e+04
lm2VarImp <- variableImportancePlot(lm2, "Linear Model 2 Variable Importance")

Zero-inflated Poisson

ziPois <- pscl::zeroinfl(TARGET ~ FixedAcidity + VolatileAcidity + CitricAcid + ResidualSugar + 
                          Chlorides + FreeSulfurDioxide + TotalSulfurDioxide + Density +
                          pH + Sulphates + Alcohol + 
                          as.factor(LabelAppeal) +
                          as.factor(AcidIndex) |  STARS,
                          data=clean_df, 
                          dist = "poisson", 
                          model = TRUE
                        )
summary(ziPois)
## 
## Call:
## pscl::zeroinfl(formula = TARGET ~ FixedAcidity + VolatileAcidity + CitricAcid + 
##     ResidualSugar + Chlorides + FreeSulfurDioxide + TotalSulfurDioxide + 
##     Density + pH + Sulphates + Alcohol + as.factor(LabelAppeal) + as.factor(AcidIndex) | 
##     STARS, data = clean_df, dist = "poisson", model = TRUE)
## 
## Pearson residuals:
##      Min       1Q   Median       3Q      Max 
## -2.33788 -0.43775  0.06112  0.41658  2.78301 
## 
## Count model coefficients (poisson with log link):
##                                            Estimate Std. Error z value Pr(>|z|)
## (Intercept)                               0.5761849  0.3224912   1.787   0.0740
## FixedAcidity                              0.0031044  0.0052893   0.587   0.5573
## VolatileAcidity                          -0.0132550  0.0052469  -2.526   0.0115
## CitricAcid                                0.0007563  0.0051733   0.146   0.8838
## ResidualSugar                            -0.0004410  0.0053152  -0.083   0.9339
## Chlorides                                -0.0079476  0.0053495  -1.486   0.1374
## FreeSulfurDioxide                         0.0023806  0.0052703   0.452   0.6515
## TotalSulfurDioxide                       -0.0019097  0.0053409  -0.358   0.7207
## Density                                  -0.0077893  0.0052460  -1.485   0.1376
## pH                                        0.0026663  0.0053087   0.502   0.6155
## Sulphates                                -0.0022225  0.0054476  -0.408   0.6833
## Alcohol                                   0.0336745  0.0052965   6.358 2.05e-10
## as.factor(LabelAppeal)-1.11204793733397   0.4215144  0.0389940  10.810  < 2e-16
## as.factor(LabelAppeal)0.0101741115806247  0.7405092  0.0378355  19.572  < 2e-16
## as.factor(LabelAppeal)1.13239616049522    0.9780913  0.0382070  25.600  < 2e-16
## as.factor(LabelAppeal)2.25461820940981    1.1707270  0.0427426  27.390  < 2e-16
## as.factor(AcidIndex)-3.59682937695875     0.0163092  0.3260926   0.050   0.9601
## as.factor(AcidIndex)-1.79176983045029     0.0634500  0.3206707   0.198   0.8431
## as.factor(AcidIndex)-0.545318540973785    0.0364686  0.3204104   0.114   0.9094
## as.factor(AcidIndex)0.362910765511677     0.0141717  0.3204609   0.044   0.9647
## as.factor(AcidIndex)1.05172974217783     -0.0326591  0.3208062  -0.102   0.9189
## as.factor(AcidIndex)1.59059728918163     -0.1262213  0.3220907  -0.392   0.6951
## as.factor(AcidIndex)2.02271372429848     -0.2064558  0.3276252  -0.630   0.5286
## as.factor(AcidIndex)2.37629509167962     -0.1459902  0.3357785  -0.435   0.6637
## as.factor(AcidIndex)2.67051656830802      0.0032496  0.3384153   0.010   0.9923
## as.factor(AcidIndex)2.9188445277671      -0.0843479  0.3613793  -0.233   0.8154
## as.factor(AcidIndex)3.13100139587667      0.0405826  0.4187399   0.097   0.9228
## as.factor(AcidIndex)3.31417429494859      0.2528954  0.6182782   0.409   0.6825
## as.factor(AcidIndex)3.47378568897179     -0.1756369  0.6094718  -0.288   0.7732
##                                             
## (Intercept)                              .  
## FixedAcidity                                
## VolatileAcidity                          *  
## CitricAcid                                  
## ResidualSugar                               
## Chlorides                                   
## FreeSulfurDioxide                           
## TotalSulfurDioxide                          
## Density                                     
## pH                                          
## Sulphates                                   
## Alcohol                                  ***
## as.factor(LabelAppeal)-1.11204793733397  ***
## as.factor(LabelAppeal)0.0101741115806247 ***
## as.factor(LabelAppeal)1.13239616049522   ***
## as.factor(LabelAppeal)2.25461820940981   ***
## as.factor(AcidIndex)-3.59682937695875       
## as.factor(AcidIndex)-1.79176983045029       
## as.factor(AcidIndex)-0.545318540973785      
## as.factor(AcidIndex)0.362910765511677       
## as.factor(AcidIndex)1.05172974217783        
## as.factor(AcidIndex)1.59059728918163        
## as.factor(AcidIndex)2.02271372429848        
## as.factor(AcidIndex)2.37629509167962        
## as.factor(AcidIndex)2.67051656830802        
## as.factor(AcidIndex)2.9188445277671         
## as.factor(AcidIndex)3.13100139587667        
## as.factor(AcidIndex)3.31417429494859        
## as.factor(AcidIndex)3.47378568897179        
## 
## Zero-inflation model coefficients (binomial with logit link):
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.91313    0.06701  -43.47   <2e-16 ***
## STARS       -2.62665    0.06167  -42.59   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Number of iterations in BFGS optimization: 36 
## Log-likelihood: -2.098e+04 on 31 Df
(ziPois_eval <- model_perf_metrics(ziPois, trainX, trainY, testX, testY))
##         RMSE     Rsquared          MAE          aic          bic 
## 1.364038e+00 5.006595e-01 1.045664e+00 4.202855e+04 4.225971e+04

Zero-inflated Negative Binomial

ziNB <- pscl::zeroinfl(TARGET ~ FixedAcidity + VolatileAcidity + CitricAcid + ResidualSugar + 
                          Chlorides + FreeSulfurDioxide + TotalSulfurDioxide + Density +
                          pH + Sulphates + Alcohol + 
                          as.factor(LabelAppeal) +
                          as.factor(AcidIndex)  |  STARS,
                          data=clean_df, 
                          dist = "negbin", 
                          model = TRUE
                        )
summary(ziNB)
## 
## Call:
## pscl::zeroinfl(formula = TARGET ~ FixedAcidity + VolatileAcidity + CitricAcid + 
##     ResidualSugar + Chlorides + FreeSulfurDioxide + TotalSulfurDioxide + 
##     Density + pH + Sulphates + Alcohol + as.factor(LabelAppeal) + as.factor(AcidIndex) | 
##     STARS, data = clean_df, dist = "negbin", model = TRUE)
## 
## Pearson residuals:
##      Min       1Q   Median       3Q      Max 
## -2.33788 -0.43775  0.06112  0.41658  2.78301 
## 
## Count model coefficients (negbin with log link):
##                                            Estimate Std. Error z value Pr(>|z|)
## (Intercept)                               0.5763137  0.3224698   1.787   0.0739
## FixedAcidity                              0.0031043  0.0052893   0.587   0.5573
## VolatileAcidity                          -0.0132554  0.0052469  -2.526   0.0115
## CitricAcid                                0.0007564  0.0051733   0.146   0.8838
## ResidualSugar                            -0.0004406  0.0053152  -0.083   0.9339
## Chlorides                                -0.0079477  0.0053495  -1.486   0.1374
## FreeSulfurDioxide                         0.0023806  0.0052703   0.452   0.6515
## TotalSulfurDioxide                       -0.0019095  0.0053409  -0.358   0.7207
## Density                                  -0.0077897  0.0052460  -1.485   0.1376
## pH                                        0.0026663  0.0053087   0.502   0.6155
## Sulphates                                -0.0022225  0.0054476  -0.408   0.6833
## Alcohol                                   0.0336745  0.0052965   6.358 2.05e-10
## as.factor(LabelAppeal)-1.11204793733397   0.4215149  0.0389940  10.810  < 2e-16
## as.factor(LabelAppeal)0.0101741115806247  0.7405098  0.0378355  19.572  < 2e-16
## as.factor(LabelAppeal)1.13239616049522    0.9780918  0.0382070  25.600  < 2e-16
## as.factor(LabelAppeal)2.25461820940981    1.1707275  0.0427426  27.390  < 2e-16
## as.factor(AcidIndex)-3.59682937695875     0.0161787  0.3260715   0.050   0.9604
## as.factor(AcidIndex)-1.79176983045029     0.0633208  0.3206492   0.197   0.8435
## as.factor(AcidIndex)-0.545318540973785    0.0363393  0.3203889   0.113   0.9097
## as.factor(AcidIndex)0.362910765511677     0.0140424  0.3204394   0.044   0.9650
## as.factor(AcidIndex)1.05172974217783     -0.0327884  0.3207847  -0.102   0.9186
## as.factor(AcidIndex)1.59059728918163     -0.1263506  0.3220693  -0.392   0.6948
## as.factor(AcidIndex)2.02271372429848     -0.2065840  0.3276042  -0.631   0.5283
## as.factor(AcidIndex)2.37629509167962     -0.1461198  0.3357580  -0.435   0.6634
## as.factor(AcidIndex)2.67051656830802      0.0031210  0.3383949   0.009   0.9926
## as.factor(AcidIndex)2.9188445277671      -0.0844816  0.3613605  -0.234   0.8151
## as.factor(AcidIndex)3.13100139587667      0.0404510  0.4187236   0.097   0.9230
## as.factor(AcidIndex)3.31417429494859      0.2526745  0.6183067   0.409   0.6828
## as.factor(AcidIndex)3.47378568897179     -0.1757520  0.6094565  -0.288   0.7731
## Log(theta)                               17.8553302 13.4640303   1.326   0.1848
##                                             
## (Intercept)                              .  
## FixedAcidity                                
## VolatileAcidity                          *  
## CitricAcid                                  
## ResidualSugar                               
## Chlorides                                   
## FreeSulfurDioxide                           
## TotalSulfurDioxide                          
## Density                                     
## pH                                          
## Sulphates                                   
## Alcohol                                  ***
## as.factor(LabelAppeal)-1.11204793733397  ***
## as.factor(LabelAppeal)0.0101741115806247 ***
## as.factor(LabelAppeal)1.13239616049522   ***
## as.factor(LabelAppeal)2.25461820940981   ***
## as.factor(AcidIndex)-3.59682937695875       
## as.factor(AcidIndex)-1.79176983045029       
## as.factor(AcidIndex)-0.545318540973785      
## as.factor(AcidIndex)0.362910765511677       
## as.factor(AcidIndex)1.05172974217783        
## as.factor(AcidIndex)1.59059728918163        
## as.factor(AcidIndex)2.02271372429848        
## as.factor(AcidIndex)2.37629509167962        
## as.factor(AcidIndex)2.67051656830802        
## as.factor(AcidIndex)2.9188445277671         
## as.factor(AcidIndex)3.13100139587667        
## as.factor(AcidIndex)3.31417429494859        
## as.factor(AcidIndex)3.47378568897179        
## Log(theta)                                  
## 
## Zero-inflation model coefficients (binomial with logit link):
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.91313    0.06701  -43.47   <2e-16 ***
## STARS       -2.62665    0.06167  -42.59   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Theta = 56816092.9572 
## Number of iterations in BFGS optimization: 37 
## Log-likelihood: -2.098e+04 on 32 Df
(ziNB_eval <- model_perf_metrics(ziNB, trainX, trainY, testX, testY))
##         RMSE     Rsquared          MAE          aic          bic 
## 1.364038e+00 5.006596e-01 1.045664e+00 4.203055e+04 4.226917e+04
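
The two zero-inflated fits report the same log-likelihood (-2.098e+04) and differ only in the number of estimated parameters (31 versus 32 Df). A short sketch of the standard definitions AIC = 2k - 2 logLik and BIC = k log(n) - 2 logLik shows how the reported aic and bic values relate to each other. The sample size n = 12795 is inferred from the degrees of freedom in the linear model summary (an assumption), and the log-likelihood is backed out of the reported AIC rather than read from the fitted object.

```r
# Standard information criteria as functions of log-likelihood and parameter count
aic <- function(loglik, k) 2 * k - 2 * loglik
bic <- function(loglik, k, n) k * log(n) - 2 * loglik

n      <- 12795   # assumed: 12762 residual df + 33 coefficients in lm1
k_zip  <- 31      # Df reported for the zero-inflated Poisson
k_zinb <- 32      # Df reported for the zero-inflated negative binomial

# Back out the log-likelihood from the reported ziPois AIC (42028.55)
loglik <- (2 * k_zip - 42028.55) / 2

# With the same log-likelihood, one extra parameter raises AIC by exactly 2
aic_zinb <- aic(loglik, k_zinb)
bic_zip  <- bic(loglik, k_zip, n)
bic_zinb <- bic(loglik, k_zinb, n)

print(round(c(aic_zinb = aic_zinb, bic_zip = bic_zip, bic_zinb = bic_zinb), 2))
```

Under these assumptions the extra Log(theta) parameter costs 2 AIC points and log(n) BIC points without any gain in log-likelihood, so the zero-inflated negative binomial is not preferred over the zero-inflated Poisson here.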

Model Selection

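Either linear model (full or reduced) appears to be the model of choice: both have the lowest RMSE (about 1.30) and MAE (about 1.01) and the highest R-squared (about 0.547) among the eight models compared.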

models_summary <- rbind(pois1_eval, pois2_eval, nb1_eval, nb2_eval, lm1_eval, lm2_eval, ziPois_eval, ziNB_eval)
kable(models_summary) %>% 
  kable_paper(full_width = F) %>%
  column_spec(1, bold = T, border_right = T) %>%
  row_spec(5:6, bold = T, color = "white", background = "purple")
             RMSE      Rsquared   MAE       aic       bic
pois1_eval   2.593116  0.5059560  2.228388  36435.92  36674.63
pois2_eval   2.591346  0.5210688  2.226889  45532.19  45740.98
nb1_eval     2.591223  0.5215761  2.226684  45540.13  45793.66
nb2_eval     2.591346  0.5210685  2.226888  45534.61  45750.86
lm1_eval     1.297132  0.5467544  1.014874  43106.30  43359.83
lm2_eval     1.297291  0.5466431  1.014944  43102.16  43333.32
ziPois_eval  1.364038  0.5006595  1.045664  42028.55  42259.71
ziNB_eval    1.364038  0.5006596  1.045664  42030.55  42269.17
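
To make the comparison explicit, the sketch below rebuilds the summary table from the values printed in this section and looks up the best row per metric. The data frame is typed in by hand for illustration; in the actual analysis the same lookup could be applied to the models_summary object directly.

```r
# Values copied from the model summary table in this section
models_summary <- data.frame(
  RMSE     = c(2.593116, 2.591346, 2.591223, 2.591346, 1.297132, 1.297291, 1.364038, 1.364038),
  Rsquared = c(0.5059560, 0.5210688, 0.5215761, 0.5210685, 0.5467544, 0.5466431, 0.5006595, 0.5006596),
  MAE      = c(2.228388, 2.226889, 2.226684, 2.226888, 1.014874, 1.014944, 1.045664, 1.045664),
  aic      = c(36435.92, 45532.19, 45540.13, 45534.61, 43106.30, 43102.16, 42028.55, 42030.55),
  bic      = c(36674.63, 45740.98, 45793.66, 45750.86, 43359.83, 43333.32, 42259.71, 42269.17),
  row.names = c("pois1_eval", "pois2_eval", "nb1_eval", "nb2_eval",
                "lm1_eval", "lm2_eval", "ziPois_eval", "ziNB_eval")
)

# Metrics where a smaller value is better
lower_is_better <- c("RMSE", "MAE", "aic", "bic")
best_low <- sapply(models_summary[lower_is_better],
                   function(col) rownames(models_summary)[which.min(col)])

# R-squared is the only metric where larger is better
best_r2 <- rownames(models_summary)[which.max(models_summary$Rsquared)]

print(best_low)
print(best_r2)
```

The error metrics (RMSE, MAE, R-squared) all point to lm1_eval, with lm2_eval nearly tied, while aic and bic are lowest for pois1_eval. The choice therefore depends on which criterion is weighted most; the selection above prioritizes predictive error.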

Variable Importance

In terms of variable importance, the four most important features in every model are consistently the dummy-coded levels of the STARS factor.

grid.arrange(poi1VarImp, poi2VarImp, nb1VarImp, nb2VarImp, lm1VarImp, lm2VarImp, ncol = 2)
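
The variableImportancePlot helper is not shown in this section, so the sketch below is only one plausible way to rank terms: by the absolute value of each coefficient's t statistic, which is a common importance measure for linear models. It uses the built-in mtcars data as a stand-in (an assumption for illustration), with cyl treated as a factor in the same way STARS is above.

```r
# Stand-in model on built-in data; cyl plays the role of a factor predictor
fit <- lm(mpg ~ wt + hp + as.factor(cyl), data = mtcars)

# Coefficient table: Estimate, Std. Error, t value, Pr(>|t|)
coefs <- summary(fit)$coefficients

# Rank non-intercept terms by absolute t statistic, largest first
imp <- sort(abs(coefs[-1, "t value"]), decreasing = TRUE)

print(imp)
```

Applied to the lm1 summary above, this ranking would put the four STARS levels first, since their t values (roughly 41 to 79) are larger than those of any other term.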