OVERVIEW

In this homework assignment, you will explore, analyze and model a data set containing information on approximately 12,000 commercially available wines. The variables are mostly related to the chemical properties of the wine being sold. The response variable is the number of sample cases of wine that were purchased by wine distribution companies after sampling a wine. These cases would be used to provide tasting samples to restaurants and wine stores around the United States. The more sample cases purchased, the more likely is a wine to be sold at a high end restaurant. A large wine manufacturer is studying the data in order to predict the number of wine cases ordered based upon the wine characteristics. If the wine manufacturer can predict the number of cases, then that manufacturer will be able to adjust their wine offering to maximize sales.

Your objective is to build linear & count regression models to predict the number of cases of wine that will be sold given certain properties of the wine.

DATA EXPLORATION

# import data
train_data <- 
  read.csv("https://raw.githubusercontent.com/kylegilde/D621-Data-Mining/master/HW5%20Wine%20Predication/wine-training-data.csv") %>% 
  mutate(STARS = ifelse(is.na(STARS), 0, STARS)) %>% 
  dplyr::select(-1)
  

eval_data <- 
  read.csv("https://raw.githubusercontent.com/kylegilde/D621-Data-Mining/master/HW5%20Wine%20Predication/wine-evaluation-data.csv") %>% 
  mutate(STARS = ifelse(is.na(STARS), 0, STARS)) %>% 
  dplyr::select(-1)

Examine the cases & variables

After removing the INDEX column, the data set contains 15 numerical variables and 12,795 observations. Given that the NAs in the STARS variable are meaningful, we have changed those instances to zero to represent a very poor rating.

str(train_data)

## 'data.frame':    12795 obs. of  15 variables:
##  $ TARGET            : int  3 3 5 3 4 0 0 4 3 6 ...
##  $ FixedAcidity      : num  3.2 4.5 7.1 5.7 8 11.3 7.7 6.5 14.8 5.5 ...
##  $ VolatileAcidity   : num  1.16 0.16 2.64 0.385 0.33 0.32 0.29 -1.22 0.27 -0.22 ...
##  $ CitricAcid        : num  -0.98 -0.81 -0.88 0.04 -1.26 0.59 -0.4 0.34 1.05 0.39 ...
##  $ ResidualSugar     : num  54.2 26.1 14.8 18.8 9.4 ...
##  $ Chlorides         : num  -0.567 -0.425 0.037 -0.425 NA 0.556 0.06 0.04 -0.007 -0.277 ...
##  $ FreeSulfurDioxide : num  NA 15 214 22 -167 -37 287 523 -213 62 ...
##  $ TotalSulfurDioxide: num  268 -327 142 115 108 15 156 551 NA 180 ...
##  $ Density           : num  0.993 1.028 0.995 0.996 0.995 ...
##  $ pH                : num  3.33 3.38 3.12 2.24 3.12 3.2 3.49 3.2 4.93 3.09 ...
##  $ Sulphates         : num  -0.59 0.7 0.48 1.83 1.77 1.29 1.21 NA 0.26 0.75 ...
##  $ Alcohol           : num  9.9 NA 22 6.2 13.7 15.4 10.3 11.6 15 12.6 ...
##  $ LabelAppeal       : int  0 -1 -1 -1 0 0 0 1 0 0 ...
##  $ AcidIndex         : int  8 7 8 6 9 11 8 7 6 8 ...
##  $ STARS             : num  2 3 3 1 2 0 0 3 0 4 ...

Data Dictionary

From the variable descriptions, we would expect that higher LabelAppeal and STARS values correspond with greater numbers of cases purchased. Simply by the variable names, we suspect that several of the predictor variables will be correlated with each other including:

AcidIndex, CitricAcid, FixedAcidity & VolatileAcidity
FreeSulfurDioxide & TotalSulfurDioxide
FreeSulfurDioxide, Sulphates & TotalSulfurDioxide

Statistical Summary

In the statistical summary table below, we notice that we may have missing values in about half the variables. Along with AcidIndex, LabelAppeal & STARS, the response variable TARGET is discrete, which makes this data set a good candidate for count regression. Up to 9 of the variables have unexpectedly negative values. It seems unlikely that these are valid measurements of the variable, and we will address them in with our variable transformations. Only one variable AcidIndex has more kurtosis than a normal distribution.

summary_metrics <- function(df){
  ###Creates summary metrics table
  metrics_only <- df[, sapply(df, is.numeric)]
   
  df_metrics <- psych::describe(metrics_only, quant = c(.25,.75))
  df_metrics$unique_values = rapply(metrics_only, function(x) length(unique(x)))
  df_metrics <- 
    dplyr::select(df_metrics, n, unique_values, min, Q.1st = Q0.25, median, mean, Q.3rd = Q0.75, 
    max, range, sd, skew, kurtosis
  )
  return(df_metrics)
}

metrics_df <- summary_metrics(train_data)

#datatable(round(metrics_df, 2), options = list(searching = F, paging = F))

kable(metrics_df, digits = 1, format.args = list(big.mark = ',', scientific = F, drop0trailing = T))

	n	unique_values	min	Q.1st	median	mean	Q.3rd	max	range	sd	skew	kurtosis
TARGET	12,795	9	0	2	3	3	4	8	8	1.9	-0.3	-0.9
FixedAcidity	12,795	470	-18.1	5.2	6.9	7.1	9.5	34.4	52.5	6.3	0	1.7
VolatileAcidity	12,795	815	-2.8	0.1	0.3	0.3	0.6	3.7	6.5	0.8	0	1.8
CitricAcid	12,795	602	-3.2	0	0.3	0.3	0.6	3.9	7.1	0.9	-0.1	1.8
ResidualSugar	12,179	2,078	-127.8	-2	3.9	5.4	15.9	141.2	268.9	33.7	-0.1	1.9
Chlorides	12,157	1,664	-1.2	0	0	0.1	0.2	1.4	2.5	0.3	0	1.8
FreeSulfurDioxide	12,148	1,000	-555	0	30	30.8	70	623	1,178	148.7	0	1.8
TotalSulfurDioxide	12,113	1,371	-823	27	123	120.7	208	1,057	1,880	231.9	0	1.7
Density	12,795	5,933	0.9	1	1	1	1	1.1	0.2	0	0	1.9
pH	12,400	498	0.5	3	3.2	3.2	3.5	6.1	5.7	0.7	0	1.6
Sulphates	11,585	631	-3.1	0.3	0.5	0.5	0.9	4.2	7.4	0.9	0	1.8
Alcohol	12,142	402	-4.7	9	10.4	10.5	12.4	26.5	31.2	3.7	0	1.5
LabelAppeal	12,795	5	-2	-1	0	0	1	2	4	0.9	0	-0.3
AcidIndex	12,795	14	4	7	8	7.8	8	17	13	1.3	1.6	5.2
STARS	12,795	5	0	0	1	1.5	2	4	4	1.2	0.3	-0.9

Visualizations

Discrete Variable Frequencies

In the barplots below, we notice that more than 20% of the TARGET values are zero, which indicates that the data may be a good candidate for a zero-inflated Poisson regression model.

###Discrete variables Frequencies
discrete_vars_freq <- 
  train_data %>% 
  dplyr::select(rownames(metrics_df)[metrics_df$unique_values < 15]) %>% 
  gather("var", "value") %>% 
  group_by(var) %>% 
  count(var, value) %>%
  mutate(prop = prop.table(n))

ggplot(data = discrete_vars_freq, 
       aes(x = value, #reorder(value, prop)
       y = prop)) + 
  geom_bar(stat = "identity", fill = "darkgreen") + 
  facet_wrap(~var, scales = "free") +
  coord_flip() + 
  ggthemes::theme_fivethirtyeight()

# https://stackoverflow.com/questions/34860535/how-to-use-dplyr-to-generate-a-frequency-table?utm_medium=organic&utm_source=google_rich_qa&utm_campaign=google_rich_qa

Discrete Variable Boxplots

The box plots contain the TARGET distributions for each of the discrete variable values. As we expected, higher values of LabelAppeal and STARS are associated with more wine being purchased. Additionally, smaller AcidIndex values appear to be associated with more wine purchases.

####Side-by-Side Boxplots
boxplot_data <- 
  train_data %>% 
  dplyr::select(rownames(metrics_df)[metrics_df$unique_values < 15]) %>% 
  reshape2::melt(id.vars = "TARGET")


### Side-by-Side Boxplots
ggplot(data = boxplot_data, aes(x = factor(value), y = TARGET)) +
  geom_boxplot() +
  facet_wrap( ~ variable, scales = "free") +
  coord_flip() +
  ggthemes::theme_fivethirtyeight()

#Reference: https://stackoverflow.com/questions/14604439/plot-multiple-boxplot-in-one-graph?utm_medium=organic&utm_source=google_rich_qa&utm_campaign=google_rich_qa

Histograms of Continuous Variables

The distributions of the continuous predictor variables all share a strong resemblance. They are platykurtic, meaning they have smaller tails and greater peaks than the normal distribution.

Correlations

Very few of the predictors are correlated with the response variable. As we would expect from the previous boxplots, STARS and LabelAppeal have moderate positive correlations with TARGET, and AcidIndex has a slight negative correlation. Counter to our expectations, very few of the predictor variables are correlated with each other, suggesting that our models will not have much multicollinearity. STARS is only slightly correlated with LabelAppeal and AcidIndex.

In the table below, we see the correlations between the response variable and several transformations and interactions. We created log-, square- and square-root-transformed versions of both the predictors and the response and created all the possible interactions between them. Then we calculated the correlation coefficients between the different response transformations and the transformed predictors and their interactions. None of these combinations are more correlated than the original variables themselves.

### CORRELATIONS WITH RESPONSE
pred_vars <- dplyr::select(train_data, -TARGET)

#squared variables
squared_vars <-
  apply(pred_vars, 2, function(x) x^2) %>%
  as.data.frame()
colnames(squared_vars) <- paste0(names(squared_vars), "2")

#squart root variables
sqrt_vars <-
  apply(pred_vars, 2, function(x) x^2) %>%
  as.data.frame()
colnames(sqrt_vars) <- paste0(names(sqrt_vars), "_sqrt")

#log variables
log_vars <-
  apply(pred_vars, 2, function(x) log(x + .01)) %>%
  as.data.frame()
colnames(log_vars) <- paste0(names(log_vars), "_log")


individual_vars <- cbind(squared_vars, sqrt_vars, log_vars, pred_vars)

#interaction variables
all_interactions <- data.frame(t(apply(individual_vars, 1, combn, 2, prod)))
colnames(all_interactions) <- combn(names(individual_vars), 2, paste, collapse=":")


all_predictors <- cbind(pred_vars, all_interactions)

# response variable transformations
dep_vars <- 
  train_data %>% 
  transmute(
    TARGET = TARGET,
    TARGET2 = TARGET^2,
    TARGET_sqrt = sqrt(TARGET),
    TARGET_log = log(TARGET + .01)
  )

# create correlation df
all_corr <-  
  cor(dep_vars, all_predictors, use = "pairwise.complete.obs") %>% 
  correlation_df() %>%
  filter((Var2 %like% "%TARGET%" | Var1  %like%  "%TARGET%")
         & !(Var2 %like% "%TARGET%" & Var1  %like%  "%TARGET%"))

##https://stackoverflow.com/questions/2080774/generating-interaction-variables-in-r-dataframes?utm_medium=organic&utm_source=google_rich_qa&utm_campaign=google_rich_qa

rownames(all_corr) <- 1:nrow(all_corr)

kable(head(filter(all_corr, Var1 == "TARGET"), 20), digits = 3, row.names = T, caption = "Top Corrrelations with the Response Variable")

Top Corrrelations with the Response Variable
	Var1	Var2	Correlation	Rsquared
1	TARGET	STARS	0.685	0.470
2	TARGET	Density:STARS	0.684	0.468
3	TARGET	Density2:STARS	0.682	0.465
4	TARGET	Density_sqrt:STARS	0.682	0.465
5	TARGET	AcidIndex_log:STARS	0.677	0.458
6	TARGET	Alcohol_log:STARS	0.665	0.442
7	TARGET	TotalSulfurDioxide_log:STARS	0.663	0.439
8	TARGET	AcidIndex:STARS	0.657	0.431
9	TARGET	pH_log:STARS	0.652	0.425
10	TARGET	pH:STARS	0.650	0.423
11	TARGET	FreeSulfurDioxide_log:STARS	0.636	0.404
12	TARGET	AcidIndex_log:STARS_log	0.634	0.402
13	TARGET	STARS_log:AcidIndex	0.633	0.401
14	TARGET	STARS_log:Density	0.629	0.396
15	TARGET	Density2:STARS_log	0.629	0.395
16	TARGET	Density_sqrt:STARS_log	0.629	0.395
17	TARGET	Alcohol_log:STARS_log	0.620	0.384
18	TARGET	pH_log:STARS_log	0.616	0.380
19	TARGET	STARS_log:pH	0.616	0.379
20	TARGET	Alcohol:STARS	0.614	0.377

Missing Values

In the plots and table below, we can see that 7 variables are missing values and that only 68% of the observations are complete. Since most of the predictors are not correlated with each other, it may be difficult to accurately impute the missing values.

## Missing Values
options(scipen = 999)
missing_plot <- VIM::aggr(train_data,  
                      numbers = T, 
                      sortVars = T,
                      col = c("lightgreen", "darkred", "orange"),
                      labels=str_sub(names(train_data), 1, 8), 
                      ylab=c("Missing Value Counts"
                             , "Pattern"))

summary(missing_plot)

missing_plot$missings %>% 
  mutate(
    pct_missing = Count / nrow(train_data)
    ) %>% 
  arrange(-pct_missing) %>% 
  filter(pct_missing > 0) %>% 
  kable(digits = 3, row.names = T, caption = "Variables Missing Values")

Variables Missing Values
	Variable	Count	pct_missing
1	Sulphates	1210	0.095
2	TotalSulfurDioxide	682	0.053
3	Alcohol	653	0.051
4	FreeSulfurDioxide	647	0.051
5	Chlorides	638	0.050
6	ResidualSugar	616	0.048
7	pH	395	0.031

#options(scipen=0, digits=7)

DATA PREPARATION

Variable Transformations

While our “Top Corrrelations with the Response Variable” table did not indicate that we should transform our variables, our summary statistics did show that the following 9 varibles have impossibly negative values, and the table below shows that 8 variables contain more than 10% negative values. This many invalid variable values certainly raises concerns about the study’s quality of the data collection and measurement.

vars_neg_values <- 
  dplyr::select(train_data, 
              intersect(rownames(metrics_df)[metrics_df$unique_values > 15],
              rownames(metrics_df)[metrics_df$min < 0])
              )

neg_proportions <- t(apply(vars_neg_values, 2, function(x) prop.table(table(x < 0))))

data.frame(
  Var = rownames(neg_proportions),
  is_negative = neg_proportions[, 2]
) %>% arrange(-is_negative) %>% 
  kable(digits = 2)

Var	is_negative
Chlorides	0.26
ResidualSugar	0.26
FreeSulfurDioxide	0.25
CitricAcid	0.23
VolatileAcidity	0.22
TotalSulfurDioxide	0.21
Sulphates	0.20
FixedAcidity	0.13
Alcohol	0.01

In the side-by-side boxplots below, we see that if we were to take the absolute value of the negative numbers, the distributions of the transformed negative values are mostly similar to the positive value distributions. The 2 variables with dissimilar distributions FixedAcidity and Alcohol have the fewest negative values. Consequently, we will take the absolute values of these variables.

vars_neg_values_melted <- 
  vars_neg_values %>% 
  reshape::melt() %>% 
  na.omit() %>% 
  mutate(is_negative = as.factor(value < 0),  #relevel(, "1")
         abs_value = abs(value))

ggplot(data = vars_neg_values_melted, aes(x = variable, y = abs_value)) + 
  geom_boxplot(aes(fill = is_negative)) + 
  facet_wrap( ~ variable, scales = "free")

train_transformed <- 
  train_data %>% 
  mutate(
    FixedAcidity = abs(FixedAcidity), 
    VolatileAcidity = abs(VolatileAcidity), 
    CitricAcid = abs(CitricAcid),
    ResidualSugar = abs(ResidualSugar),
    Chlorides = abs(Chlorides),
    FreeSulfurDioxide = abs(FreeSulfurDioxide),
    TotalSulfurDioxide = abs(TotalSulfurDioxide),
    Sulphates = abs(Sulphates),
    Alcohol = abs(Alcohol))


eval_transformed <- 
  eval_data %>% 
  mutate(
    FixedAcidity = abs(FixedAcidity), 
    VolatileAcidity = abs(VolatileAcidity), 
    CitricAcid = abs(CitricAcid),
    ResidualSugar = abs(ResidualSugar),
    Chlorides = abs(Chlorides),
    FreeSulfurDioxide = abs(FreeSulfurDioxide),
    TotalSulfurDioxide = abs(TotalSulfurDioxide),
    Sulphates = abs(Sulphates),
    Alcohol = abs(Alcohol))

Imputing the Missing Values

Let’s use the missForest package to do nonparametric missing-value imputation using Random Forest on both the training and evaluation sets.

When we use the out-of-the-bag error in order to calculate the normalized root-mean-squares error, we see NRMSE values closer to zero than not, which indicates that we have well-fitted imputations. However, the NRMSE values are higher than previous imputations using this package, which may be the result of there being not much correlation between the predictor variables.

Variable	Count	MSE	NRMSE
ResidualSugar	616	648.15	0.18
FreeSulfurDioxide	647	12165.65	0.18
Chlorides	638	0.06	0.18
Sulphates	1210	0.45	0.16
TotalSulfurDioxide	682	27500.94	0.16
Alcohol	653	13.14	0.14
pH	395	0.48	0.12

BUILD MODELS

Using the training data set, build at least two different poisson regression models, at least two different negative binomial regression models, and at least two multiple linear regression models, using different variables

Linear Models

Backward Elimination

For our first model, let’s use all variables in our imputed data set with a backward elimination process that removes the predictor with the highest p-value until all of the remaining p-values are statistically significant at a .05 level. As we suspected, the STARS variable, followed by LabelAppeal, has the most practical significance in the model. With the other variables held constant, for every 1 point increase in the expert wine rating, we would expect an increase of nearly 1 wine case (.98) to be purchased.

train_imputed <- imputed_train$ximp

backward_elimination <- function(lmod){
  #performs backward elimination model selection 
  #removes variables until all remaining ones are stat-sig
  removed_vars <- c()
  removed_pvalues <- c()
  
  #handles category dummy variables
  cat_levels <- unlist(lmod$xlevels)
  cat_vars <- str_sub(names(cat_levels), 1, nchar(names(cat_levels)) - 1)
  cat_var_df <- data.frame(cat_vars, 
                           dummy_vars = str_c(cat_vars, cat_levels),
                           stringsAsFactors = F)
  # checks for p-values > .05 execpt for the intercept
  while (max(summary(lmod)$coefficients[2:length(summary(lmod)$coefficients[, 4]), 4]) > .05){  
    # find insignificant pvalue
    pvalues <- summary(lmod)$coefficients[2:length(summary(lmod)$coefficients[, 4]), 4]
    max_pvalue <- max(pvalues)
    remove <- names(which.max(pvalues))
    #if categorical dummy variable, remove the variable
    dummy_var <- dplyr::filter(cat_var_df, dummy_vars == remove)
    remove <- ifelse(nrow(dummy_var) > 0, dummy_var[, 1], remove)
    #record the removed variables
    removed_vars <- c(removed_vars, remove)
    removed_pvalues <- c(removed_pvalues, max_pvalue)   
    # update model
    lmod <- update(lmod, as.formula(paste0(".~.-`", remove, "`"))) 
  }
  
  print(kable(data.frame(removed_vars, removed_pvalues), digits = 3))
  return(lmod)
}


all_vars_lmod <- lm(TARGET ~ ., x = T, y = T, data = train_imputed)
                    
bkwd_elim_lmod <- backward_elimination(all_vars_lmod)

## 
## 
## removed_vars         removed_pvalues
## ------------------  ----------------
## FixedAcidity                   0.997
## ResidualSugar                  0.849
## FreeSulfurDioxide              0.080
## CitricAcid                     0.073
## Density                        0.056

options(scipen = 9)
summary(bkwd_elim_lmod)

## 
## Call:
## lm(formula = TARGET ~ VolatileAcidity + Chlorides + TotalSulfurDioxide + 
##     pH + Sulphates + Alcohol + LabelAppeal + AcidIndex + STARS, 
##     data = train_imputed, x = T, y = T)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5164 -0.9623  0.0611  0.9084  5.9849 
## 
## Coefficients:
##                       Estimate  Std. Error t value     Pr(>|t|)    
## (Intercept)         3.25688909  0.10707585  30.417      < 2e-16 ***
## VolatileAcidity    -0.11520183  0.02115523  -5.446 0.0000000526 ***
## Chlorides          -0.10236737  0.05135908  -1.993     0.046264 *  
## TotalSulfurDioxide  0.00027447  0.00007398   3.710     0.000208 ***
## pH                 -0.03491883  0.01755995  -1.989     0.046772 *  
## Sulphates          -0.04533164  0.01878802  -2.413     0.015845 *  
## Alcohol             0.01216317  0.00332349   3.660     0.000253 ***
## LabelAppeal         0.43224023  0.01368762  31.579      < 2e-16 ***
## AcidIndex          -0.21058093  0.00905249 -23.262      < 2e-16 ***
## STARS               0.98017371  0.01045697  93.734      < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.326 on 12785 degrees of freedom
## Multiple R-squared:  0.5263, Adjusted R-squared:  0.5259 
## F-statistic:  1578 on 9 and 12785 DF,  p-value: < 2.2e-16

In the model’s diagnostic plots, while the Residuals-vs-Fitted plot displays what appears to be constant variance given the discrete response variable, the fitted line in the standardized-residual plot reveals some nonconstant variance. The Normal Q-Q plot shows that the standardized residuals are close to normality. The Leverage plot shows that we do not have any influential points.

par(mfrow=c(2,2))
plot(bkwd_elim_lmod)

#http://data.library.virginia.edu/diagnostic-plots/
#http://analyticspro.org/2016/03/07/r-tutorial-how-to-use-diagnostic-plots-for-regression-models/

With a p-value near zero, this 9-variable model is statistically significant when compared to the null hypothesis. The moderate p-value for the Durbin-Watson test of independence indicates that we fail to reject the null hypothesis of no autocorrelation. However, the small p-values for the noncontant variance test from the car package and the Anderson-Darling test indicate that we would reject the null hypotheses of homoscedasticity and normality. None of the variables are exhibiting collinearity with a variance inflation factor greater than 4 (VIF_gt_4).

PRESS <- function(linear.model) {
  #source:  https://gist.github.com/tomhopper/8c204d978c4a0cbcb8c0#file-press-r
  #' calculate the predictive residuals
  pr <- residuals(linear.model)/(1 - lm.influence(linear.model)$hat)
  #' calculate the PRESS
  PRESS <- sum(pr^2)
  return(PRESS)
}
pred_r_squared <- function(linear.model) {
  #source: https://gist.github.com/tomhopper/8c204d978c4a0cbcb8c0#file-pred_r_squared-r
  #' Use anova() to get the sum of squares for the linear model
  lm.anova <- anova(linear.model)
  #' Calculate the total sum of squares
  tss <- sum(lm.anova$'Sum Sq')
  # Calculate the predictive R^2
  pred.r.squared <- 1 - PRESS(linear.model)/(tss)
  return(pred.r.squared)
}

lm_evaluation <- function(lmod) {
    ### Summarizes the model's key statistics in one row
    ### https://gist.github.com/stephenturner/722049#file-pvalue-from-lm-object-r-L5
    lm_summary <- summary(lmod)
    f <- as.numeric(lm_summary$fstatistic)

    df_summary <- 
      data.frame(
        model_name = deparse(substitute(lmod)), 
        n_vars = ncol(lmod$model) - 1,
        numdf = f[2],
        fstat = f[1],
        p.value = formatC(pf(f[1], f[2], f[3], lower.tail = F), format = "e", digits = 2),
        adj.r.squared = lm_summary$adj.r.squared,
        pre.r.squared = pred_r_squared(lmod),
        CV_RMSE = lmvar::cv.lm(lmod, k = 100)$MSE_sqrt$mean
        )
    return(df_summary)
}

lm_diagnotics <- function(lmod){
  diag_df <- data.frame(
    DW.test = car::durbinWatsonTest(lmod)$p,
    NCV.test = formatC(car::ncvTest(lmod)$p, format = "e", digits = 2),
    AD.test = formatC(nortest::ad.test(lmod$residuals)$p.value, format = "e", digits = 2),
    VIF_gt_4 = sum(car::vif(lmod) > 4)
  )
  return(diag_df)
}

#evaluate performance & diagnostics
kable(lm_results <- lm_evaluation(bkwd_elim_lmod), digits = 3, caption = "Model Summary Statistics")

Model Summary Statistics
model_name	n_vars	numdf	fstat	p.value	adj.r.squared	pre.r.squared	CV_RMSE
bkwd_elim_lmod	9	9	1578.156	0.00e+00	0.526	0.526	1.325

kable(lm_results_diagnostics <- lm_diagnotics(bkwd_elim_lmod), digits = 3, caption = "Model Diagnostic Statistics")

Model Diagnostic Statistics
DW.test	NCV.test	AD.test	VIF_gt_4
0.77	1.93e-132	2.07e-21	0

BIC Selection

Now let’s use the original variables of the imputed data set with a BIC selection. Due to its high predictor penalty, this process produced a more parsimonious model with 6 statistically significant variables. It removed 3 variables that were included in the backward elimination model.

n <- nrow(all_vars_lmod$model)

BIC_lmod <- step(all_vars_lmod, k = log(n), trace = 0)

removed_variables <- function(larger_mod, smaller_mod){
  #compares variables of 2 models
  #returns the variables not in the small model
  removed <- names(coef(larger_mod))[!names(coef(larger_mod)) %in%
names(coef(smaller_mod))]
    print(paste("removed variable(s):", length(removed)))
    print(removed)
}

removed_variables(bkwd_elim_lmod, BIC_lmod)

## [1] "removed variable(s): 3"
## [1] "Chlorides" "pH"        "Sulphates"

summary(BIC_lmod)

## 
## Call:
## lm(formula = TARGET ~ VolatileAcidity + TotalSulfurDioxide + 
##     Alcohol + LabelAppeal + AcidIndex + STARS, data = train_imputed, 
##     x = T, y = T)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5919 -0.9626  0.0635  0.9111  6.0067 
## 
## Coefficients:
##                      Estimate Std. Error t value     Pr(>|t|)    
## (Intercept)         3.0822844  0.0865818  35.600      < 2e-16 ***
## VolatileAcidity    -0.1165419  0.0211602  -5.508 0.0000000371 ***
## TotalSulfurDioxide  0.0002765  0.0000740   3.737     0.000187 ***
## Alcohol             0.0122037  0.0033248   3.671     0.000243 ***
## LabelAppeal         0.4319875  0.0136923  31.550      < 2e-16 ***
## AcidIndex          -0.2106406  0.0090302 -23.326      < 2e-16 ***
## STARS               0.9813136  0.0104567  93.846      < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.327 on 12788 degrees of freedom
## Multiple R-squared:  0.5258, Adjusted R-squared:  0.5255 
## F-statistic:  2363 on 6 and 12788 DF,  p-value: < 2.2e-16

The BIC model’s diagnostic plots closely resemble the plots from the backward elimination mode. While the Residuals-vs-Fitted plot displays what appears to be constant variance given the discrete response variable, the standardized-residual plot reveals some nonconstant variance. The Normal Q-Q plot shows that the standardized residuals are close to normality. The leverage plot shows that we do not have any influential points.

par(mfrow=c(2,2))
plot(BIC_lmod)

With only 6 variables, this more parsimonious model’s predicted R-squared is nearly as good as the first 9-variable model. The model passes Durbin-Watson test of independence, but fails the nonconstant variance test and the Anderson-Darling test of normality.

#evaluate performance & diagnostics
model_eval <- lm_evaluation(BIC_lmod)
model_diag <- lm_diagnotics(BIC_lmod) 
kable(lm_results <- rbind(lm_results, model_eval), digits = 3, caption = "Model Summary Statistics")

Model Summary Statistics
model_name	n_vars	numdf	fstat	p.value	adj.r.squared	pre.r.squared	CV_RMSE
bkwd_elim_lmod	9	9	1578.156	0.00e+00	0.526	0.526	1.325
BIC_lmod	6	6	2362.791	0.00e+00	0.526	0.525	1.325

kable(lm_results_diagnostics <- rbind(lm_results_diagnostics, model_diag), digits = 3, caption = "Model Diagnostic Statistics")

Model Diagnostic Statistics
DW.test	NCV.test	AD.test	VIF_gt_4
0.770	1.93e-132	2.07e-21	0
0.828	5.97e-134	1.30e-21	0

Poission Regression

Regular Poisson Model with BIC Selection

Since the response variable is a count, this data set is a good candidate for generalized linear regression with the Poisson distribution. Let’s use Poisson regression with the BIC variable selection process. This process removed 9 variables creating the most parsimonious model so far with 5 statistically significant features.

## Poission Regression
### Regular Poisson Model with BIC Selection
pois_mod <- glm(TARGET ~ ., family = "poisson", data = train_imputed)

BIC_pois_mod <- step(pois_mod, k = log(n), trace = 0)

removed_variables(pois_mod, BIC_pois_mod)

## [1] "removed variable(s): 9"
## [1] "FixedAcidity"      "CitricAcid"        "ResidualSugar"    
## [4] "Chlorides"         "FreeSulfurDioxide" "Density"          
## [7] "pH"                "Sulphates"         "Alcohol"

summary(BIC_pois_mod)

## 
## Call:
## glm(formula = TARGET ~ VolatileAcidity + TotalSulfurDioxide + 
##     LabelAppeal + AcidIndex + STARS, family = "poisson", data = train_imputed)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.0072  -0.7180   0.0617   0.5745   3.2583  
## 
## Coefficients:
##                      Estimate Std. Error z value   Pr(>|z|)    
## (Intercept)         1.2249334  0.0377551  32.444    < 2e-16 ***
## VolatileAcidity    -0.0418041  0.0093917  -4.451 0.00000854 ***
## TotalSulfurDioxide  0.0001019  0.0000319   3.194     0.0014 ** 
## LabelAppeal         0.1329514  0.0060616  21.933    < 2e-16 ***
## AcidIndex          -0.0881767  0.0044690 -19.731    < 2e-16 ***
## STARS               0.3132407  0.0045127  69.413    < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 22861  on 12794  degrees of freedom
## Residual deviance: 14773  on 12789  degrees of freedom
## AIC: 46727
## 
## Number of Fisher Scoring iterations: 5

Let’s exponentiate the model’s coefficients in order to make them interpretable in terms of wine cases. With the other variables held constant, for every 1 point increase in the expert wine rating STARS, we would expect on average an increase of 1.37 wine cases to be purchased.

exp(cbind(coef(BIC_pois_mod), confint(BIC_pois_mod)))

##                                  2.5 %    97.5 %
## (Intercept)        3.4039393 3.1613621 3.6656303
## VolatileAcidity    0.9590577 0.9415158 0.9768239
## TotalSulfurDioxide 1.0001019 1.0000392 1.0001642
## LabelAppeal        1.1421945 1.1287048 1.1558451
## AcidIndex          0.9155991 0.9075962 0.9236356
## STARS              1.3678507 1.3558108 1.3800078

#https://stats.stackexchange.com/questions/11096/how-to-interpret-coefficients-in-a-poisson-regression?utm_medium=organic&utm_source=google_rich_qa&utm_campaign=google_rich_qa

With a p-value near zero, this 5-variable model is statistically significant when compared to the null hypothesis, but it only explains 35% of the deviance (ELMR p. 87). The p-value for the goodness-of-fit chi-squared test is near zero, which indicates that the model’s deviance is not small enough for a good fit. At .85, the model has underdispersion, which indicates that the data exhibit less variation than the Poisson distribution expects. None of the variables are exhibiting collinearity with a variance inflation factor greater than 4 (VIF_gt_4).

glm_performance <- function(model) {
  ### Summarizes the model's key statistics
  df_summary <- data.frame(

    model_name = deparse(substitute(model)),
    n_vars = length(coef(model)) - 1,

    # https://stats.stackexchange.com/questions/141177/test-glm-model-using-null-and-model-deviances?utm_medium=organic&utm_source=google_rich_qa&utm_campaign=google_rich_qa
    pvalue = formatC(with(model, pchisq(null.deviance - deviance, df.null - df.residual, lower = F)), format = "e", digits = 2),
    
    # page 87 of ELMR
    deviance_explained = with(model, 1 - deviance/null.deviance),
    
    # https://stats.idre.ucla.edu/r/dae/poisson-regression/
    GoFtest = formatC(with(model, pchisq(deviance, df.residual, lower.tail=FALSE)), format = "e", digits = 2),
    
    # page 88 of ELMR
    dispersion_parameter = sum(residuals(model,type="pearson")^2)/model$df.res,
    VIF_gt_4 = sum(car::vif(model) > 4),
    
    # https://www.rdocumentation.org/packages/boot/versions/1.3-20/topics/cv.glm
    CV_RMSE = sqrt(boot::cv.glm(model$model, model, K = 100)$delta[1])
  )
  return(df_summary)
}
#https://stackoverflow.com/questions/38272150/compute-cross-validation-for-glms-with-negative-binomial-response

#https://stackoverflow.com/questions/20744224/k-fold-cross-validation-using-cv-lm?utm_medium=organic&utm_source=google_rich_qa&utm_campaign=google_rich_qa



glmod <- glm_performance(BIC_pois_mod)
kable(all_glmods <- glmod, digits = 3)

model_name	n_vars	pvalue	deviance_explained	GoFtest	dispersion_parameter	VIF_gt_4	CV_RMSE
BIC_pois_mod	5	0.00e+00	0.354	1.51e-32	0.854	0	1.406

Quasi-Poisson Model with BIC Selection

Since the previous Poisson model was underdispersed, let’s try applying the quasi-Poisson generalized linear model to those 5 variables.

### Quasi-Poisson Model with BIC Selection
quasi_pois_mod <- glm(BIC_pois_mod$formula, family = "quasipoisson", data = train_imputed)
options(scipen = 9)
summary(quasi_pois_mod)

## 
## Call:
## glm(formula = BIC_pois_mod$formula, family = "quasipoisson", 
##     data = train_imputed)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.0072  -0.7180   0.0617   0.5745   3.2583  
## 
## Coefficients:
##                       Estimate  Std. Error t value   Pr(>|t|)    
## (Intercept)         1.22493338  0.03488467  35.114    < 2e-16 ***
## VolatileAcidity    -0.04180408  0.00867771  -4.817 0.00000147 ***
## TotalSulfurDioxide  0.00010188  0.00002947   3.457   0.000548 ***
## LabelAppeal         0.13295139  0.00560073  23.738    < 2e-16 ***
## AcidIndex          -0.08817668  0.00412921 -21.354    < 2e-16 ***
## STARS               0.31324066  0.00416959  75.125    < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasipoisson family taken to be 0.853724)
## 
##     Null deviance: 22861  on 12794  degrees of freedom
## Residual deviance: 14773  on 12789  degrees of freedom
## AIC: NA
## 
## Number of Fisher Scoring iterations: 5

The use of the underdispersed quasi-Poisson model has no effect on the model’s p-value, deviance explained or goodness-of-fit chi-squared test p-value.

glmod <- glm_performance(quasi_pois_mod)
kable(all_glmods <- rbind(all_glmods, glmod), digits = 3)

model_name	n_vars	pvalue	deviance_explained	GoFtest	dispersion_parameter	VIF_gt_4	CV_RMSE
BIC_pois_mod	5	0.00e+00	0.354	1.51e-32	0.854	0	1.406
quasi_pois_mod	5	0.00e+00	0.354	1.51e-32	0.854	0	1.406

In fact, the Poisson and quasi-Poisson model coefficients are exactly the same.

#the Poisson and quasi-Poisson model coefficients are exactly the same.

options(knitr.kable.NA = '')
#https://stats.stackexchange.com/questions/11096/how-to-interpret-coefficients-in-a-poisson-regression?utm_medium=organic&utm_source=google_rich_qa&utm_campaign=google_rich_qa
compare_coefficients <- data.frame(var = rev(names(coef(all_vars_lmod))))
pois_models <- c("quasi_pois_mod", "BIC_pois_mod")

for (i in 1:length(pois_models)){
  model <- get(pois_models[i])
  model_name <- pois_models[i]
  is_lm_obj <- rep(class(model)[1] == "lm", length(coef(model)))
  df <- data.frame(var = as.character(names(coef(model))))
  df[, model_name] <- ifelse(is_lm_obj, coef(model), exp(coef(model)))
  compare_coefficients <- left_join(compare_coefficients, df)
}
ind <- apply(compare_coefficients[ , 2:3], 1, function(x) all(is.na(x)))
#https://stackoverflow.com/questions/6471689/remove-rows-in-r-matrix-where-all-data-is-na?utm_medium=organic&utm_source=google_rich_qa&utm_campaign=google_rich_qa

compare_coefficients$is_equal <- with(compare_coefficients, quasi_pois_mod == BIC_pois_mod)
kable(compare_coefficients[!ind, ], digits = 5)

	var	quasi_pois_mod	BIC_pois_mod	is_equal
1	STARS	1.36785	1.36785	TRUE
2	AcidIndex	0.91560	0.91560	TRUE
3	LabelAppeal	1.14219	1.14219	TRUE
8	TotalSulfurDioxide	1.00010	1.00010	TRUE
13	VolatileAcidity	0.95906	0.95906	TRUE
15	(Intercept)	3.40394	3.40394	TRUE

# zi_pois_all_vars <- pscl::zeroinfl(TARGET ~ ., data = train_imputed)
# 
# if (!exists("zi_pois_pois_mod")){
#   zi_pois_pois_mod <- step(zi_pois_all_vars, k = log(n), trace = 0)
# }
# summary(zi_pois_pois_mod)
# glm_performance(zi_pois_pois_mod)
# data.frame(coef(zi_pois_pois_mod))
# exp(coef(zi_pois_pois_mod))

Negative Binomial Regression

BIC selection with Dispersion Parameter of 1

Now let’s try negative binomial regression, which can arise out of a generalized Poisson regression. We will start with a disperson parameter of 1, which corresponds to the geometric distribution and use imputed original variables with a BIC selection process. The result of this process is a 7-feature model with the addition of the Sulphates and pH variables.

## Negative Binomial Regression
### BIC selection with Dispersion Parameter = 1
nb1_mod_all_vars <- glm(TARGET ~ ., family = negative.binomial(1), data = train_imputed)


BIC_nb_k1_mod <- step(nb1_mod_all_vars, k = log(n), trace = 0)
summary(BIC_nb_k1_mod)

## 
## Call:
## glm(formula = TARGET ~ VolatileAcidity + TotalSulfurDioxide + 
##     pH + Sulphates + LabelAppeal + AcidIndex + STARS, family = negative.binomial(1), 
##     data = train_imputed)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.82923  -0.39345   0.00025   0.29918   1.75915  
## 
## Coefficients:
##                      Estimate Std. Error t value    Pr(>|t|)    
## (Intercept)         1.4709768  0.0511707  28.746     < 2e-16 ***
## VolatileAcidity    -0.0553022  0.0105887  -5.223 0.000000179 ***
## TotalSulfurDioxide  0.0001542  0.0000366   4.214 0.000025248 ***
## pH                 -0.0272945  0.0087236  -3.129     0.00176 ** 
## Sulphates          -0.0298069  0.0093827  -3.177     0.00149 ** 
## LabelAppeal         0.1185411  0.0068369  17.338     < 2e-16 ***
## AcidIndex          -0.1180306  0.0047711 -24.739     < 2e-16 ***
## STARS               0.3666216  0.0051649  70.984     < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(1) family taken to be 0.3104753)
## 
##     Null deviance: 9042.5  on 12794  degrees of freedom
## Residual deviance: 6769.6  on 12787  degrees of freedom
## AIC: 55517
## 
## Number of Fisher Scoring iterations: 6

With a p-value near zero, this 7-variable model is statistically significant when compared to the null hypothesis, but it only explains 25% of the deviance. The p-value for the goodness-of-fit chi-squared test appears to be near 1, which indicates a good fit. None of the variables are exhibiting collinearity with a variance inflation factor greater than 4 (VIF_gt_4).

glmod <- glm_performance(BIC_nb_k1_mod)
kable(all_glmods <- rbind(all_glmods, glmod), digits = 3)

model_name	n_vars	pvalue	deviance_explained	GoFtest	dispersion_parameter	CV_RMSE
BIC_pois_mod	5	0.00e+00	0.354	1.51e-32	0.854	1.406
quasi_pois_mod	5	0.00e+00	0.354	1.51e-32	0.854	1.406
BIC_nb_k1_mod	7	0.00e+00	0.251	1.00e+00	0.310	1.485

BIC selection with Varying Dispersion Parameter

Finally, let’s run a BIC selection process on a negative binomial model where the dispersion parameter is allowed to vary. It will be estimated using the maximum likelihood. The result of this process is the 5-variable model below.

## BIC selection with Floating Dispersion Parameter
nb_mod_all_vars <- MASS::glm.nb(TARGET ~ ., data = train_imputed)

BIC_nb_mod <- step(nb_mod_all_vars, k = log(n), trace = 0)
summary(BIC_nb_mod)

## 
## Call:
## MASS::glm.nb(formula = TARGET ~ VolatileAcidity + TotalSulfurDioxide + 
##     LabelAppeal + AcidIndex + STARS, data = train_imputed, init.theta = 48864.4877, 
##     link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.0071  -0.7179   0.0617   0.5744   3.2582  
## 
## Coefficients:
##                      Estimate Std. Error z value   Pr(>|z|)    
## (Intercept)         1.2249406  0.0377565  32.443    < 2e-16 ***
## VolatileAcidity    -0.0418050  0.0093921  -4.451 0.00000854 ***
## TotalSulfurDioxide  0.0001019  0.0000319   3.194     0.0014 ** 
## LabelAppeal         0.1329508  0.0060618  21.933    < 2e-16 ***
## AcidIndex          -0.0881786  0.0044691 -19.731    < 2e-16 ***
## STARS               0.3132447  0.0045129  69.412    < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(48864.49) family taken to be 1)
## 
##     Null deviance: 22860  on 12794  degrees of freedom
## Residual deviance: 14773  on 12789  degrees of freedom
## AIC: 46729
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  48864 
##           Std. Err.:  50635 
## Warning while fitting theta: iteration limit reached 
## 
##  2 x log-likelihood:  -46715.5

With a p-value near zero, this 5-variable model is statistically significant when compared to the null hypothesis, but it only explains 35% of the deviance. The p-value for the goodness-of-fit chi-squared test is near zero, which indicates that the model’s deviance is not small enough for a good fit. None of the variables are exhibiting collinearity with a variance inflation factor greater than 4 (VIF_gt_4).

glmod <- glm_performance(BIC_nb_mod)
kable(all_glmods <- rbind(all_glmods, glmod), digits = 3)

model_name	n_vars	pvalue	deviance_explained	GoFtest	dispersion_parameter	CV_RMSE
BIC_pois_mod	5	0.00e+00	0.354	1.51e-32	0.854	1.406
quasi_pois_mod	5	0.00e+00	0.354	1.51e-32	0.854	1.406
BIC_nb_k1_mod	7	0.00e+00	0.251	1.00e+00	0.310	1.485
BIC_nb_mod	5	0.00e+00	0.354	1.56e-32	0.854	1.406

SELECT MODELS

Coefficient Comparison

Let’s take a look at the coefficients from our models. In the table below, we see that the intercepts are all approximately 3 to 4 wines cases. The linear coefficients largely align, and the generalized linear coefficients largely align. Among the linear models, there are small differences, but not practical differences. Among the generalized linear models, the Poisson (BIC_pois_mod), quasi-Poisson (quasi_pois_mod) & negative binomial with the varying dispersion parameter (BIC_nb_mod) are nearly identical. This exhibits how closely the Poisson and negative binomial distributions can approximate each other. Only the negative binomial model with a dispersion parameter of 1 has some small differences. Interestingly, while the variables VolatileAcidity & AcidIndex had negative effects on the response variable for the linear models, they had positive ones in the generalized linear models.

options(knitr.kable.NA = '')
#https://stats.stackexchange.com/questions/11096/how-to-interpret-coefficients-in-a-poisson-regression?utm_medium=organic&utm_source=google_rich_qa&utm_campaign=google_rich_qa
compare_coefficients <- data.frame(var = rev(names(coef(all_vars_lmod))))
all_models <- c(as.character(lm_results$model_name), as.character(all_glmods$model_name))

for (i in 1:length(all_models)){
  model <- get(all_models[i])
  model_name <- all_models[i]
  is_lm_obj <- rep(class(model)[1] == "lm", length(coef(model)))
  df <- data.frame(var = as.character(names(coef(model))))
  df[, model_name] <- ifelse(is_lm_obj, coef(model), exp(coef(model)))
  compare_coefficients <- left_join(compare_coefficients, df)
}
ind <- apply(compare_coefficients[ , 2:7], 1, function(x) all(is.na(x)))
#https://stackoverflow.com/questions/6471689/remove-rows-in-r-matrix-where-all-data-is-na?utm_medium=organic&utm_source=google_rich_qa&utm_campaign=google_rich_qa


compare_coefficients_df <- 
  compare_coefficients[!ind, ] %>% 
  arrange(-bkwd_elim_lmod)

kable(compare_coefficients_df, digits = 6)

var	bkwd_elim_lmod	BIC_lmod	BIC_pois_mod	quasi_pois_mod	BIC_nb_k1_mod	BIC_nb_mod
(Intercept)	3.256889	3.082284	3.403939	3.403939	4.353485	3.403964
STARS	0.980174	0.981314	1.367851	1.367851	1.442852	1.367856
LabelAppeal	0.432240	0.431988	1.142194	1.142194	1.125853	1.142194
Alcohol	0.012163	0.012204
TotalSulfurDioxide	0.000274	0.000277	1.000102	1.000102	1.000154	1.000102
pH	-0.034919				0.973075
Sulphates	-0.045332				0.970633
Chlorides	-0.102367
VolatileAcidity	-0.115202	-0.116542	0.959058	0.959058	0.946199	0.959057
AcidIndex	-0.210581	-0.210641	0.915599	0.915599	0.888669	0.915597

Best Model

In the next 2 tables, while we see that the 2 linear models actually have the smallest cross-validated root mean square error, these models also have a small number of negative fitted values, which demonstrates the limited application of linear models to count response variables. Among the remaining generalized linear models, while the negative binomial model with a dispersion parameter of 1 (BIC_nb_k1_mod) has the largest cross-validated root mean square error, the difference between 1.406 and 1.486 may not be practically significant. Additionally, it is the only model that had a good fit under the chi-squared distribution. Consequently, it is the best model.

negative_fits <- data.frame(
  bkwd_elim_lmod = sum(fitted(bkwd_elim_lmod) < 0),
  BIC_lmod = sum(fitted(BIC_lmod) < 0)
)
kable(negative_fits, caption = "The number of negative fitted values in the linear models")

The number of negative fitted values in the linear models
bkwd_elim_lmod	BIC_lmod
23	20

final_summary <- rbind(lm_results[, c("model_name", "n_vars", "CV_RMSE")],
                       all_glmods[, c("model_name", "n_vars", "CV_RMSE")]) 

kable(final_summary, digits = 3, caption = "Final Model Comparison")

Final Model Comparison
model_name	n_vars	CV_RMSE
bkwd_elim_lmod	9	1.325
BIC_lmod	6	1.325
BIC_pois_mod	5	1.406
quasi_pois_mod	5	1.406
BIC_nb_k1_mod	7	1.485
BIC_nb_mod	5	1.406

Evaluation Data Set Predictions

We used the negative binomial model with a dispersion parameter of 1 and made predictions on the evaluation data set. The following is statistical summary of the predicted responses.

summary(predict(BIC_nb_k1_mod, newdata = imputed_eval$ximp, type = "response"))

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.497   1.792   2.671   3.114   3.937  11.833

DATA 621 Business Analytics & Data Mining

Homework 5: Linear, Poission & Negative Binomial Regression

Kyle Gilde

May 17, 2018