In this homework assignment, you will explore, analyze and model a data set containing information on approximately 12,000 commercially available wines. The variables are mostly related to the chemical properties of the wine being sold. The response variable is the number of sample cases of wine that were purchased by wine distribution companies after sampling a wine. These cases would be used to provide tasting samples to restaurants and wine stores around the United States. The more sample cases purchased, the more likely is a wine to be sold at a high end restaurant. A large wine manufacturer is studying the data in order to predict the number of wine cases ordered based upon the wine characteristics. If the wine manufacturer can predict the number of cases, then that manufacturer will be able to adjust their wine offering to maximize sales.
Your objective is to build linear & count regression models to predict the number of cases of wine that will be sold given certain properties of the wine.
# import data
train_data <-
read.csv("https://raw.githubusercontent.com/kylegilde/D621-Data-Mining/master/HW5%20Wine%20Predication/wine-training-data.csv") %>%
mutate(STARS = ifelse(is.na(STARS), 0, STARS)) %>%
dplyr::select(-1)
eval_data <-
read.csv("https://raw.githubusercontent.com/kylegilde/D621-Data-Mining/master/HW5%20Wine%20Predication/wine-evaluation-data.csv") %>%
mutate(STARS = ifelse(is.na(STARS), 0, STARS)) %>%
dplyr::select(-1)
After removing the INDEX column, the data set contains 15 numerical variables and 12,795 observations. Given that the NAs in the STARS variable are meaningful, we have changed those instances to zero to represent a very poor rating.
str(train_data)
## 'data.frame': 12795 obs. of 15 variables:
## $ TARGET : int 3 3 5 3 4 0 0 4 3 6 ...
## $ FixedAcidity : num 3.2 4.5 7.1 5.7 8 11.3 7.7 6.5 14.8 5.5 ...
## $ VolatileAcidity : num 1.16 0.16 2.64 0.385 0.33 0.32 0.29 -1.22 0.27 -0.22 ...
## $ CitricAcid : num -0.98 -0.81 -0.88 0.04 -1.26 0.59 -0.4 0.34 1.05 0.39 ...
## $ ResidualSugar : num 54.2 26.1 14.8 18.8 9.4 ...
## $ Chlorides : num -0.567 -0.425 0.037 -0.425 NA 0.556 0.06 0.04 -0.007 -0.277 ...
## $ FreeSulfurDioxide : num NA 15 214 22 -167 -37 287 523 -213 62 ...
## $ TotalSulfurDioxide: num 268 -327 142 115 108 15 156 551 NA 180 ...
## $ Density : num 0.993 1.028 0.995 0.996 0.995 ...
## $ pH : num 3.33 3.38 3.12 2.24 3.12 3.2 3.49 3.2 4.93 3.09 ...
## $ Sulphates : num -0.59 0.7 0.48 1.83 1.77 1.29 1.21 NA 0.26 0.75 ...
## $ Alcohol : num 9.9 NA 22 6.2 13.7 15.4 10.3 11.6 15 12.6 ...
## $ LabelAppeal : int 0 -1 -1 -1 0 0 0 1 0 0 ...
## $ AcidIndex : int 8 7 8 6 9 11 8 7 6 8 ...
## $ STARS : num 2 3 3 1 2 0 0 3 0 4 ...
From the variable descriptions, we would expect that higher LabelAppeal and STARS values correspond with greater numbers of cases purchased. Simply by the variable names, we suspect that several of the predictor variables will be correlated with each other including:
AcidIndex, CitricAcid, FixedAcidity & VolatileAcidity
FreeSulfurDioxide & TotalSulfurDioxide
FreeSulfurDioxide, Sulphates & TotalSulfurDioxide
In the statistical summary table below, we notice that we may have missing values in about half the variables. Along with AcidIndex, LabelAppeal & STARS, the response variable TARGET is discrete, which makes this data set a good candidate for count regression. Up to 9 of the variables have unexpectedly negative values. It seems unlikely that these are valid measurements of the variable, and we will address them in with our variable transformations. Only one variable AcidIndex has more kurtosis than a normal distribution.
summary_metrics <- function(df){
###Creates summary metrics table
metrics_only <- df[, sapply(df, is.numeric)]
df_metrics <- psych::describe(metrics_only, quant = c(.25,.75))
df_metrics$unique_values = rapply(metrics_only, function(x) length(unique(x)))
df_metrics <-
dplyr::select(df_metrics, n, unique_values, min, Q.1st = Q0.25, median, mean, Q.3rd = Q0.75,
max, range, sd, skew, kurtosis
)
return(df_metrics)
}
metrics_df <- summary_metrics(train_data)
#datatable(round(metrics_df, 2), options = list(searching = F, paging = F))
kable(metrics_df, digits = 1, format.args = list(big.mark = ',', scientific = F, drop0trailing = T))
| n | unique_values | min | Q.1st | median | mean | Q.3rd | max | range | sd | skew | kurtosis | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| TARGET | 12,795 | 9 | 0 | 2 | 3 | 3 | 4 | 8 | 8 | 1.9 | -0.3 | -0.9 |
| FixedAcidity | 12,795 | 470 | -18.1 | 5.2 | 6.9 | 7.1 | 9.5 | 34.4 | 52.5 | 6.3 | 0 | 1.7 |
| VolatileAcidity | 12,795 | 815 | -2.8 | 0.1 | 0.3 | 0.3 | 0.6 | 3.7 | 6.5 | 0.8 | 0 | 1.8 |
| CitricAcid | 12,795 | 602 | -3.2 | 0 | 0.3 | 0.3 | 0.6 | 3.9 | 7.1 | 0.9 | -0.1 | 1.8 |
| ResidualSugar | 12,179 | 2,078 | -127.8 | -2 | 3.9 | 5.4 | 15.9 | 141.2 | 268.9 | 33.7 | -0.1 | 1.9 |
| Chlorides | 12,157 | 1,664 | -1.2 | 0 | 0 | 0.1 | 0.2 | 1.4 | 2.5 | 0.3 | 0 | 1.8 |
| FreeSulfurDioxide | 12,148 | 1,000 | -555 | 0 | 30 | 30.8 | 70 | 623 | 1,178 | 148.7 | 0 | 1.8 |
| TotalSulfurDioxide | 12,113 | 1,371 | -823 | 27 | 123 | 120.7 | 208 | 1,057 | 1,880 | 231.9 | 0 | 1.7 |
| Density | 12,795 | 5,933 | 0.9 | 1 | 1 | 1 | 1 | 1.1 | 0.2 | 0 | 0 | 1.9 |
| pH | 12,400 | 498 | 0.5 | 3 | 3.2 | 3.2 | 3.5 | 6.1 | 5.7 | 0.7 | 0 | 1.6 |
| Sulphates | 11,585 | 631 | -3.1 | 0.3 | 0.5 | 0.5 | 0.9 | 4.2 | 7.4 | 0.9 | 0 | 1.8 |
| Alcohol | 12,142 | 402 | -4.7 | 9 | 10.4 | 10.5 | 12.4 | 26.5 | 31.2 | 3.7 | 0 | 1.5 |
| LabelAppeal | 12,795 | 5 | -2 | -1 | 0 | 0 | 1 | 2 | 4 | 0.9 | 0 | -0.3 |
| AcidIndex | 12,795 | 14 | 4 | 7 | 8 | 7.8 | 8 | 17 | 13 | 1.3 | 1.6 | 5.2 |
| STARS | 12,795 | 5 | 0 | 0 | 1 | 1.5 | 2 | 4 | 4 | 1.2 | 0.3 | -0.9 |
In the barplots below, we notice that more than 20% of the TARGET values are zero, which indicates that the data may be a good candidate for a zero-inflated Poisson regression model.
###Discrete variables Frequencies
discrete_vars_freq <-
train_data %>%
dplyr::select(rownames(metrics_df)[metrics_df$unique_values < 15]) %>%
gather("var", "value") %>%
group_by(var) %>%
count(var, value) %>%
mutate(prop = prop.table(n))
ggplot(data = discrete_vars_freq,
aes(x = value, #reorder(value, prop)
y = prop)) +
geom_bar(stat = "identity", fill = "darkgreen") +
facet_wrap(~var, scales = "free") +
coord_flip() +
ggthemes::theme_fivethirtyeight()
# https://stackoverflow.com/questions/34860535/how-to-use-dplyr-to-generate-a-frequency-table?utm_medium=organic&utm_source=google_rich_qa&utm_campaign=google_rich_qa
The box plots contain the TARGET distributions for each of the discrete variable values. As we expected, higher values of LabelAppeal and STARS are associated with more wine being purchased. Additionally, smaller AcidIndex values appear to be associated with more wine purchases.
####Side-by-Side Boxplots
boxplot_data <-
train_data %>%
dplyr::select(rownames(metrics_df)[metrics_df$unique_values < 15]) %>%
reshape2::melt(id.vars = "TARGET")
### Side-by-Side Boxplots
ggplot(data = boxplot_data, aes(x = factor(value), y = TARGET)) +
geom_boxplot() +
facet_wrap( ~ variable, scales = "free") +
coord_flip() +
ggthemes::theme_fivethirtyeight()
#Reference: https://stackoverflow.com/questions/14604439/plot-multiple-boxplot-in-one-graph?utm_medium=organic&utm_source=google_rich_qa&utm_campaign=google_rich_qa
Very few of the predictors are correlated with the response variable. As we would expect from the previous boxplots, STARS and LabelAppeal have moderate positive correlations with TARGET, and AcidIndex has a slight negative correlation. Counter to our expectations, very few of the predictor variables are correlated with each other, suggesting that our models will not have much multicollinearity. STARS is only slightly correlated with LabelAppeal and AcidIndex.
In the table below, we see the correlations between the response variable and several transformations and interactions. We created log-, square- and square-root-transformed versions of both the predictors and the response and created all the possible interactions between them. Then we calculated the correlation coefficients between the different response transformations and the transformed predictors and their interactions. None of these combinations are more correlated than the original variables themselves.
### CORRELATIONS WITH RESPONSE
pred_vars <- dplyr::select(train_data, -TARGET)
#squared variables
squared_vars <-
apply(pred_vars, 2, function(x) x^2) %>%
as.data.frame()
colnames(squared_vars) <- paste0(names(squared_vars), "2")
#squart root variables
sqrt_vars <-
apply(pred_vars, 2, function(x) x^2) %>%
as.data.frame()
colnames(sqrt_vars) <- paste0(names(sqrt_vars), "_sqrt")
#log variables
log_vars <-
apply(pred_vars, 2, function(x) log(x + .01)) %>%
as.data.frame()
colnames(log_vars) <- paste0(names(log_vars), "_log")
individual_vars <- cbind(squared_vars, sqrt_vars, log_vars, pred_vars)
#interaction variables
all_interactions <- data.frame(t(apply(individual_vars, 1, combn, 2, prod)))
colnames(all_interactions) <- combn(names(individual_vars), 2, paste, collapse=":")
all_predictors <- cbind(pred_vars, all_interactions)
# response variable transformations
dep_vars <-
train_data %>%
transmute(
TARGET = TARGET,
TARGET2 = TARGET^2,
TARGET_sqrt = sqrt(TARGET),
TARGET_log = log(TARGET + .01)
)
# create correlation df
all_corr <-
cor(dep_vars, all_predictors, use = "pairwise.complete.obs") %>%
correlation_df() %>%
filter((Var2 %like% "%TARGET%" | Var1 %like% "%TARGET%")
& !(Var2 %like% "%TARGET%" & Var1 %like% "%TARGET%"))
##https://stackoverflow.com/questions/2080774/generating-interaction-variables-in-r-dataframes?utm_medium=organic&utm_source=google_rich_qa&utm_campaign=google_rich_qa
rownames(all_corr) <- 1:nrow(all_corr)
kable(head(filter(all_corr, Var1 == "TARGET"), 20), digits = 3, row.names = T, caption = "Top Corrrelations with the Response Variable")
| Var1 | Var2 | Correlation | Rsquared | |
|---|---|---|---|---|
| 1 | TARGET | STARS | 0.685 | 0.470 |
| 2 | TARGET | Density:STARS | 0.684 | 0.468 |
| 3 | TARGET | Density2:STARS | 0.682 | 0.465 |
| 4 | TARGET | Density_sqrt:STARS | 0.682 | 0.465 |
| 5 | TARGET | AcidIndex_log:STARS | 0.677 | 0.458 |
| 6 | TARGET | Alcohol_log:STARS | 0.665 | 0.442 |
| 7 | TARGET | TotalSulfurDioxide_log:STARS | 0.663 | 0.439 |
| 8 | TARGET | AcidIndex:STARS | 0.657 | 0.431 |
| 9 | TARGET | pH_log:STARS | 0.652 | 0.425 |
| 10 | TARGET | pH:STARS | 0.650 | 0.423 |
| 11 | TARGET | FreeSulfurDioxide_log:STARS | 0.636 | 0.404 |
| 12 | TARGET | AcidIndex_log:STARS_log | 0.634 | 0.402 |
| 13 | TARGET | STARS_log:AcidIndex | 0.633 | 0.401 |
| 14 | TARGET | STARS_log:Density | 0.629 | 0.396 |
| 15 | TARGET | Density2:STARS_log | 0.629 | 0.395 |
| 16 | TARGET | Density_sqrt:STARS_log | 0.629 | 0.395 |
| 17 | TARGET | Alcohol_log:STARS_log | 0.620 | 0.384 |
| 18 | TARGET | pH_log:STARS_log | 0.616 | 0.380 |
| 19 | TARGET | STARS_log:pH | 0.616 | 0.379 |
| 20 | TARGET | Alcohol:STARS | 0.614 | 0.377 |
In the plots and table below, we can see that 7 variables are missing values and that only 68% of the observations are complete. Since most of the predictors are not correlated with each other, it may be difficult to accurately impute the missing values.
## Missing Values
options(scipen = 999)
missing_plot <- VIM::aggr(train_data,
numbers = T,
sortVars = T,
col = c("lightgreen", "darkred", "orange"),
labels=str_sub(names(train_data), 1, 8),
ylab=c("Missing Value Counts"
, "Pattern"))
summary(missing_plot)
missing_plot$missings %>%
mutate(
pct_missing = Count / nrow(train_data)
) %>%
arrange(-pct_missing) %>%
filter(pct_missing > 0) %>%
kable(digits = 3, row.names = T, caption = "Variables Missing Values")
| Variable | Count | pct_missing | |
|---|---|---|---|
| 1 | Sulphates | 1210 | 0.095 |
| 2 | TotalSulfurDioxide | 682 | 0.053 |
| 3 | Alcohol | 653 | 0.051 |
| 4 | FreeSulfurDioxide | 647 | 0.051 |
| 5 | Chlorides | 638 | 0.050 |
| 6 | ResidualSugar | 616 | 0.048 |
| 7 | pH | 395 | 0.031 |
#options(scipen=0, digits=7)
While our “Top Corrrelations with the Response Variable” table did not indicate that we should transform our variables, our summary statistics did show that the following 9 varibles have impossibly negative values, and the table below shows that 8 variables contain more than 10% negative values. This many invalid variable values certainly raises concerns about the study’s quality of the data collection and measurement.
vars_neg_values <-
dplyr::select(train_data,
intersect(rownames(metrics_df)[metrics_df$unique_values > 15],
rownames(metrics_df)[metrics_df$min < 0])
)
neg_proportions <- t(apply(vars_neg_values, 2, function(x) prop.table(table(x < 0))))
data.frame(
Var = rownames(neg_proportions),
is_negative = neg_proportions[, 2]
) %>% arrange(-is_negative) %>%
kable(digits = 2)
| Var | is_negative |
|---|---|
| Chlorides | 0.26 |
| ResidualSugar | 0.26 |
| FreeSulfurDioxide | 0.25 |
| CitricAcid | 0.23 |
| VolatileAcidity | 0.22 |
| TotalSulfurDioxide | 0.21 |
| Sulphates | 0.20 |
| FixedAcidity | 0.13 |
| Alcohol | 0.01 |
In the side-by-side boxplots below, we see that if we were to take the absolute value of the negative numbers, the distributions of the transformed negative values are mostly similar to the positive value distributions. The 2 variables with dissimilar distributions FixedAcidity and Alcohol have the fewest negative values. Consequently, we will take the absolute values of these variables.
vars_neg_values_melted <-
vars_neg_values %>%
reshape::melt() %>%
na.omit() %>%
mutate(is_negative = as.factor(value < 0), #relevel(, "1")
abs_value = abs(value))
ggplot(data = vars_neg_values_melted, aes(x = variable, y = abs_value)) +
geom_boxplot(aes(fill = is_negative)) +
facet_wrap( ~ variable, scales = "free")
train_transformed <-
train_data %>%
mutate(
FixedAcidity = abs(FixedAcidity),
VolatileAcidity = abs(VolatileAcidity),
CitricAcid = abs(CitricAcid),
ResidualSugar = abs(ResidualSugar),
Chlorides = abs(Chlorides),
FreeSulfurDioxide = abs(FreeSulfurDioxide),
TotalSulfurDioxide = abs(TotalSulfurDioxide),
Sulphates = abs(Sulphates),
Alcohol = abs(Alcohol))
eval_transformed <-
eval_data %>%
mutate(
FixedAcidity = abs(FixedAcidity),
VolatileAcidity = abs(VolatileAcidity),
CitricAcid = abs(CitricAcid),
ResidualSugar = abs(ResidualSugar),
Chlorides = abs(Chlorides),
FreeSulfurDioxide = abs(FreeSulfurDioxide),
TotalSulfurDioxide = abs(TotalSulfurDioxide),
Sulphates = abs(Sulphates),
Alcohol = abs(Alcohol))
Let’s use the missForest package to do nonparametric missing-value imputation using Random Forest on both the training and evaluation sets.
When we use the out-of-the-bag error in order to calculate the normalized root-mean-squares error, we see NRMSE values closer to zero than not, which indicates that we have well-fitted imputations. However, the NRMSE values are higher than previous imputations using this package, which may be the result of there being not much correlation between the predictor variables.
| Variable | Count | MSE | NRMSE |
|---|---|---|---|
| ResidualSugar | 616 | 648.15 | 0.18 |
| FreeSulfurDioxide | 647 | 12165.65 | 0.18 |
| Chlorides | 638 | 0.06 | 0.18 |
| Sulphates | 1210 | 0.45 | 0.16 |
| TotalSulfurDioxide | 682 | 27500.94 | 0.16 |
| Alcohol | 653 | 13.14 | 0.14 |
| pH | 395 | 0.48 | 0.12 |
Using the training data set, build at least two different poisson regression models, at least two different negative binomial regression models, and at least two multiple linear regression models, using different variables
For our first model, let’s use all variables in our imputed data set with a backward elimination process that removes the predictor with the highest p-value until all of the remaining p-values are statistically significant at a .05 level. As we suspected, the STARS variable, followed by LabelAppeal, has the most practical significance in the model. With the other variables held constant, for every 1 point increase in the expert wine rating, we would expect an increase of nearly 1 wine case (.98) to be purchased.
train_imputed <- imputed_train$ximp
backward_elimination <- function(lmod){
#performs backward elimination model selection
#removes variables until all remaining ones are stat-sig
removed_vars <- c()
removed_pvalues <- c()
#handles category dummy variables
cat_levels <- unlist(lmod$xlevels)
cat_vars <- str_sub(names(cat_levels), 1, nchar(names(cat_levels)) - 1)
cat_var_df <- data.frame(cat_vars,
dummy_vars = str_c(cat_vars, cat_levels),
stringsAsFactors = F)
# checks for p-values > .05 execpt for the intercept
while (max(summary(lmod)$coefficients[2:length(summary(lmod)$coefficients[, 4]), 4]) > .05){
# find insignificant pvalue
pvalues <- summary(lmod)$coefficients[2:length(summary(lmod)$coefficients[, 4]), 4]
max_pvalue <- max(pvalues)
remove <- names(which.max(pvalues))
#if categorical dummy variable, remove the variable
dummy_var <- dplyr::filter(cat_var_df, dummy_vars == remove)
remove <- ifelse(nrow(dummy_var) > 0, dummy_var[, 1], remove)
#record the removed variables
removed_vars <- c(removed_vars, remove)
removed_pvalues <- c(removed_pvalues, max_pvalue)
# update model
lmod <- update(lmod, as.formula(paste0(".~.-`", remove, "`")))
}
print(kable(data.frame(removed_vars, removed_pvalues), digits = 3))
return(lmod)
}
all_vars_lmod <- lm(TARGET ~ ., x = T, y = T, data = train_imputed)
bkwd_elim_lmod <- backward_elimination(all_vars_lmod)
##
##
## removed_vars removed_pvalues
## ------------------ ----------------
## FixedAcidity 0.997
## ResidualSugar 0.849
## FreeSulfurDioxide 0.080
## CitricAcid 0.073
## Density 0.056
options(scipen = 9)
summary(bkwd_elim_lmod)
##
## Call:
## lm(formula = TARGET ~ VolatileAcidity + Chlorides + TotalSulfurDioxide +
## pH + Sulphates + Alcohol + LabelAppeal + AcidIndex + STARS,
## data = train_imputed, x = T, y = T)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.5164 -0.9623 0.0611 0.9084 5.9849
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.25688909 0.10707585 30.417 < 2e-16 ***
## VolatileAcidity -0.11520183 0.02115523 -5.446 0.0000000526 ***
## Chlorides -0.10236737 0.05135908 -1.993 0.046264 *
## TotalSulfurDioxide 0.00027447 0.00007398 3.710 0.000208 ***
## pH -0.03491883 0.01755995 -1.989 0.046772 *
## Sulphates -0.04533164 0.01878802 -2.413 0.015845 *
## Alcohol 0.01216317 0.00332349 3.660 0.000253 ***
## LabelAppeal 0.43224023 0.01368762 31.579 < 2e-16 ***
## AcidIndex -0.21058093 0.00905249 -23.262 < 2e-16 ***
## STARS 0.98017371 0.01045697 93.734 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.326 on 12785 degrees of freedom
## Multiple R-squared: 0.5263, Adjusted R-squared: 0.5259
## F-statistic: 1578 on 9 and 12785 DF, p-value: < 2.2e-16
In the model’s diagnostic plots, while the Residuals-vs-Fitted plot displays what appears to be constant variance given the discrete response variable, the fitted line in the standardized-residual plot reveals some nonconstant variance. The Normal Q-Q plot shows that the standardized residuals are close to normality. The Leverage plot shows that we do not have any influential points.
par(mfrow=c(2,2))
plot(bkwd_elim_lmod)
#http://data.library.virginia.edu/diagnostic-plots/
#http://analyticspro.org/2016/03/07/r-tutorial-how-to-use-diagnostic-plots-for-regression-models/
With a p-value near zero, this 9-variable model is statistically significant when compared to the null hypothesis. The moderate p-value for the Durbin-Watson test of independence indicates that we fail to reject the null hypothesis of no autocorrelation. However, the small p-values for the noncontant variance test from the car package and the Anderson-Darling test indicate that we would reject the null hypotheses of homoscedasticity and normality. None of the variables are exhibiting collinearity with a variance inflation factor greater than 4 (VIF_gt_4).
PRESS <- function(linear.model) {
#source: https://gist.github.com/tomhopper/8c204d978c4a0cbcb8c0#file-press-r
#' calculate the predictive residuals
pr <- residuals(linear.model)/(1 - lm.influence(linear.model)$hat)
#' calculate the PRESS
PRESS <- sum(pr^2)
return(PRESS)
}
pred_r_squared <- function(linear.model) {
#source: https://gist.github.com/tomhopper/8c204d978c4a0cbcb8c0#file-pred_r_squared-r
#' Use anova() to get the sum of squares for the linear model
lm.anova <- anova(linear.model)
#' Calculate the total sum of squares
tss <- sum(lm.anova$'Sum Sq')
# Calculate the predictive R^2
pred.r.squared <- 1 - PRESS(linear.model)/(tss)
return(pred.r.squared)
}
lm_evaluation <- function(lmod) {
### Summarizes the model's key statistics in one row
### https://gist.github.com/stephenturner/722049#file-pvalue-from-lm-object-r-L5
lm_summary <- summary(lmod)
f <- as.numeric(lm_summary$fstatistic)
df_summary <-
data.frame(
model_name = deparse(substitute(lmod)),
n_vars = ncol(lmod$model) - 1,
numdf = f[2],
fstat = f[1],
p.value = formatC(pf(f[1], f[2], f[3], lower.tail = F), format = "e", digits = 2),
adj.r.squared = lm_summary$adj.r.squared,
pre.r.squared = pred_r_squared(lmod),
CV_RMSE = lmvar::cv.lm(lmod, k = 100)$MSE_sqrt$mean
)
return(df_summary)
}
lm_diagnotics <- function(lmod){
diag_df <- data.frame(
DW.test = car::durbinWatsonTest(lmod)$p,
NCV.test = formatC(car::ncvTest(lmod)$p, format = "e", digits = 2),
AD.test = formatC(nortest::ad.test(lmod$residuals)$p.value, format = "e", digits = 2),
VIF_gt_4 = sum(car::vif(lmod) > 4)
)
return(diag_df)
}
#evaluate performance & diagnostics
kable(lm_results <- lm_evaluation(bkwd_elim_lmod), digits = 3, caption = "Model Summary Statistics")
| model_name | n_vars | numdf | fstat | p.value | adj.r.squared | pre.r.squared | CV_RMSE |
|---|---|---|---|---|---|---|---|
| bkwd_elim_lmod | 9 | 9 | 1578.156 | 0.00e+00 | 0.526 | 0.526 | 1.325 |
kable(lm_results_diagnostics <- lm_diagnotics(bkwd_elim_lmod), digits = 3, caption = "Model Diagnostic Statistics")
| DW.test | NCV.test | AD.test | VIF_gt_4 |
|---|---|---|---|
| 0.77 | 1.93e-132 | 2.07e-21 | 0 |
Now let’s use the original variables of the imputed data set with a BIC selection. Due to its high predictor penalty, this process produced a more parsimonious model with 6 statistically significant variables. It removed 3 variables that were included in the backward elimination model.
n <- nrow(all_vars_lmod$model)
BIC_lmod <- step(all_vars_lmod, k = log(n), trace = 0)
removed_variables <- function(larger_mod, smaller_mod){
#compares variables of 2 models
#returns the variables not in the small model
removed <- names(coef(larger_mod))[!names(coef(larger_mod)) %in%
names(coef(smaller_mod))]
print(paste("removed variable(s):", length(removed)))
print(removed)
}
removed_variables(bkwd_elim_lmod, BIC_lmod)
## [1] "removed variable(s): 3"
## [1] "Chlorides" "pH" "Sulphates"
summary(BIC_lmod)
##
## Call:
## lm(formula = TARGET ~ VolatileAcidity + TotalSulfurDioxide +
## Alcohol + LabelAppeal + AcidIndex + STARS, data = train_imputed,
## x = T, y = T)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.5919 -0.9626 0.0635 0.9111 6.0067
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.0822844 0.0865818 35.600 < 2e-16 ***
## VolatileAcidity -0.1165419 0.0211602 -5.508 0.0000000371 ***
## TotalSulfurDioxide 0.0002765 0.0000740 3.737 0.000187 ***
## Alcohol 0.0122037 0.0033248 3.671 0.000243 ***
## LabelAppeal 0.4319875 0.0136923 31.550 < 2e-16 ***
## AcidIndex -0.2106406 0.0090302 -23.326 < 2e-16 ***
## STARS 0.9813136 0.0104567 93.846 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.327 on 12788 degrees of freedom
## Multiple R-squared: 0.5258, Adjusted R-squared: 0.5255
## F-statistic: 2363 on 6 and 12788 DF, p-value: < 2.2e-16
The BIC model’s diagnostic plots closely resemble the plots from the backward elimination mode. While the Residuals-vs-Fitted plot displays what appears to be constant variance given the discrete response variable, the standardized-residual plot reveals some nonconstant variance. The Normal Q-Q plot shows that the standardized residuals are close to normality. The leverage plot shows that we do not have any influential points.
par(mfrow=c(2,2))
plot(BIC_lmod)
With only 6 variables, this more parsimonious model’s predicted R-squared is nearly as good as the first 9-variable model. The model passes Durbin-Watson test of independence, but fails the nonconstant variance test and the Anderson-Darling test of normality.
#evaluate performance & diagnostics
model_eval <- lm_evaluation(BIC_lmod)
model_diag <- lm_diagnotics(BIC_lmod)
kable(lm_results <- rbind(lm_results, model_eval), digits = 3, caption = "Model Summary Statistics")
| model_name | n_vars | numdf | fstat | p.value | adj.r.squared | pre.r.squared | CV_RMSE |
|---|---|---|---|---|---|---|---|
| bkwd_elim_lmod | 9 | 9 | 1578.156 | 0.00e+00 | 0.526 | 0.526 | 1.325 |
| BIC_lmod | 6 | 6 | 2362.791 | 0.00e+00 | 0.526 | 0.525 | 1.325 |
kable(lm_results_diagnostics <- rbind(lm_results_diagnostics, model_diag), digits = 3, caption = "Model Diagnostic Statistics")
| DW.test | NCV.test | AD.test | VIF_gt_4 |
|---|---|---|---|
| 0.770 | 1.93e-132 | 2.07e-21 | 0 |
| 0.828 | 5.97e-134 | 1.30e-21 | 0 |
Since the response variable is a count, this data set is a good candidate for generalized linear regression with the Poisson distribution. Let’s use Poisson regression with the BIC variable selection process. This process removed 9 variables creating the most parsimonious model so far with 5 statistically significant features.
## Poission Regression
### Regular Poisson Model with BIC Selection
pois_mod <- glm(TARGET ~ ., family = "poisson", data = train_imputed)
BIC_pois_mod <- step(pois_mod, k = log(n), trace = 0)
removed_variables(pois_mod, BIC_pois_mod)
## [1] "removed variable(s): 9"
## [1] "FixedAcidity" "CitricAcid" "ResidualSugar"
## [4] "Chlorides" "FreeSulfurDioxide" "Density"
## [7] "pH" "Sulphates" "Alcohol"
summary(BIC_pois_mod)
##
## Call:
## glm(formula = TARGET ~ VolatileAcidity + TotalSulfurDioxide +
## LabelAppeal + AcidIndex + STARS, family = "poisson", data = train_imputed)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.0072 -0.7180 0.0617 0.5745 3.2583
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.2249334 0.0377551 32.444 < 2e-16 ***
## VolatileAcidity -0.0418041 0.0093917 -4.451 0.00000854 ***
## TotalSulfurDioxide 0.0001019 0.0000319 3.194 0.0014 **
## LabelAppeal 0.1329514 0.0060616 21.933 < 2e-16 ***
## AcidIndex -0.0881767 0.0044690 -19.731 < 2e-16 ***
## STARS 0.3132407 0.0045127 69.413 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 22861 on 12794 degrees of freedom
## Residual deviance: 14773 on 12789 degrees of freedom
## AIC: 46727
##
## Number of Fisher Scoring iterations: 5
Let’s exponentiate the model’s coefficients in order to make them interpretable in terms of wine cases. With the other variables held constant, for every 1 point increase in the expert wine rating STARS, we would expect on average an increase of 1.37 wine cases to be purchased.
exp(cbind(coef(BIC_pois_mod), confint(BIC_pois_mod)))
## 2.5 % 97.5 %
## (Intercept) 3.4039393 3.1613621 3.6656303
## VolatileAcidity 0.9590577 0.9415158 0.9768239
## TotalSulfurDioxide 1.0001019 1.0000392 1.0001642
## LabelAppeal 1.1421945 1.1287048 1.1558451
## AcidIndex 0.9155991 0.9075962 0.9236356
## STARS 1.3678507 1.3558108 1.3800078
#https://stats.stackexchange.com/questions/11096/how-to-interpret-coefficients-in-a-poisson-regression?utm_medium=organic&utm_source=google_rich_qa&utm_campaign=google_rich_qa
With a p-value near zero, this 5-variable model is statistically significant when compared to the null hypothesis, but it only explains 35% of the deviance (ELMR p. 87). The p-value for the goodness-of-fit chi-squared test is near zero, which indicates that the model’s deviance is not small enough for a good fit. At .85, the model has underdispersion, which indicates that the data exhibit less variation than the Poisson distribution expects. None of the variables are exhibiting collinearity with a variance inflation factor greater than 4 (VIF_gt_4).
glm_performance <- function(model) {
### Summarizes the model's key statistics
df_summary <- data.frame(
model_name = deparse(substitute(model)),
n_vars = length(coef(model)) - 1,
# https://stats.stackexchange.com/questions/141177/test-glm-model-using-null-and-model-deviances?utm_medium=organic&utm_source=google_rich_qa&utm_campaign=google_rich_qa
pvalue = formatC(with(model, pchisq(null.deviance - deviance, df.null - df.residual, lower = F)), format = "e", digits = 2),
# page 87 of ELMR
deviance_explained = with(model, 1 - deviance/null.deviance),
# https://stats.idre.ucla.edu/r/dae/poisson-regression/
GoFtest = formatC(with(model, pchisq(deviance, df.residual, lower.tail=FALSE)), format = "e", digits = 2),
# page 88 of ELMR
dispersion_parameter = sum(residuals(model,type="pearson")^2)/model$df.res,
VIF_gt_4 = sum(car::vif(model) > 4),
# https://www.rdocumentation.org/packages/boot/versions/1.3-20/topics/cv.glm
CV_RMSE = sqrt(boot::cv.glm(model$model, model, K = 100)$delta[1])
)
return(df_summary)
}
#https://stackoverflow.com/questions/38272150/compute-cross-validation-for-glms-with-negative-binomial-response
#https://stackoverflow.com/questions/20744224/k-fold-cross-validation-using-cv-lm?utm_medium=organic&utm_source=google_rich_qa&utm_campaign=google_rich_qa
glmod <- glm_performance(BIC_pois_mod)
kable(all_glmods <- glmod, digits = 3)
| model_name | n_vars | pvalue | deviance_explained | GoFtest | dispersion_parameter | VIF_gt_4 | CV_RMSE |
|---|---|---|---|---|---|---|---|
| BIC_pois_mod | 5 | 0.00e+00 | 0.354 | 1.51e-32 | 0.854 | 0 | 1.406 |
Since the previous Poisson model was underdispersed, let’s try applying the quasi-Poisson generalized linear model to those 5 variables.
### Quasi-Poisson Model with BIC Selection
quasi_pois_mod <- glm(BIC_pois_mod$formula, family = "quasipoisson", data = train_imputed)
options(scipen = 9)
summary(quasi_pois_mod)
##
## Call:
## glm(formula = BIC_pois_mod$formula, family = "quasipoisson",
## data = train_imputed)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.0072 -0.7180 0.0617 0.5745 3.2583
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.22493338 0.03488467 35.114 < 2e-16 ***
## VolatileAcidity -0.04180408 0.00867771 -4.817 0.00000147 ***
## TotalSulfurDioxide 0.00010188 0.00002947 3.457 0.000548 ***
## LabelAppeal 0.13295139 0.00560073 23.738 < 2e-16 ***
## AcidIndex -0.08817668 0.00412921 -21.354 < 2e-16 ***
## STARS 0.31324066 0.00416959 75.125 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasipoisson family taken to be 0.853724)
##
## Null deviance: 22861 on 12794 degrees of freedom
## Residual deviance: 14773 on 12789 degrees of freedom
## AIC: NA
##
## Number of Fisher Scoring iterations: 5
The use of the underdispersed quasi-Poisson model has no effect on the model’s p-value, deviance explained or goodness-of-fit chi-squared test p-value.
glmod <- glm_performance(quasi_pois_mod)
kable(all_glmods <- rbind(all_glmods, glmod), digits = 3)
| model_name | n_vars | pvalue | deviance_explained | GoFtest | dispersion_parameter | VIF_gt_4 | CV_RMSE |
|---|---|---|---|---|---|---|---|
| BIC_pois_mod | 5 | 0.00e+00 | 0.354 | 1.51e-32 | 0.854 | 0 | 1.406 |
| quasi_pois_mod | 5 | 0.00e+00 | 0.354 | 1.51e-32 | 0.854 | 0 | 1.406 |
In fact, the Poisson and quasi-Poisson model coefficients are exactly the same.
#the Poisson and quasi-Poisson model coefficients are exactly the same.
options(knitr.kable.NA = '')
#https://stats.stackexchange.com/questions/11096/how-to-interpret-coefficients-in-a-poisson-regression?utm_medium=organic&utm_source=google_rich_qa&utm_campaign=google_rich_qa
compare_coefficients <- data.frame(var = rev(names(coef(all_vars_lmod))))
pois_models <- c("quasi_pois_mod", "BIC_pois_mod")
for (i in 1:length(pois_models)){
model <- get(pois_models[i])
model_name <- pois_models[i]
is_lm_obj <- rep(class(model)[1] == "lm", length(coef(model)))
df <- data.frame(var = as.character(names(coef(model))))
df[, model_name] <- ifelse(is_lm_obj, coef(model), exp(coef(model)))
compare_coefficients <- left_join(compare_coefficients, df)
}
ind <- apply(compare_coefficients[ , 2:3], 1, function(x) all(is.na(x)))
#https://stackoverflow.com/questions/6471689/remove-rows-in-r-matrix-where-all-data-is-na?utm_medium=organic&utm_source=google_rich_qa&utm_campaign=google_rich_qa
compare_coefficients$is_equal <- with(compare_coefficients, quasi_pois_mod == BIC_pois_mod)
kable(compare_coefficients[!ind, ], digits = 5)
| var | quasi_pois_mod | BIC_pois_mod | is_equal | |
|---|---|---|---|---|
| 1 | STARS | 1.36785 | 1.36785 | TRUE |
| 2 | AcidIndex | 0.91560 | 0.91560 | TRUE |
| 3 | LabelAppeal | 1.14219 | 1.14219 | TRUE |
| 8 | TotalSulfurDioxide | 1.00010 | 1.00010 | TRUE |
| 13 | VolatileAcidity | 0.95906 | 0.95906 | TRUE |
| 15 | (Intercept) | 3.40394 | 3.40394 | TRUE |
# zi_pois_all_vars <- pscl::zeroinfl(TARGET ~ ., data = train_imputed)
#
# if (!exists("zi_pois_pois_mod")){
# zi_pois_pois_mod <- step(zi_pois_all_vars, k = log(n), trace = 0)
# }
# summary(zi_pois_pois_mod)
# glm_performance(zi_pois_pois_mod)
# data.frame(coef(zi_pois_pois_mod))
# exp(coef(zi_pois_pois_mod))
Now let’s try negative binomial regression, which can arise out of a generalized Poisson regression. We will start with a disperson parameter of 1, which corresponds to the geometric distribution and use imputed original variables with a BIC selection process. The result of this process is a 7-feature model with the addition of the Sulphates and pH variables.
## Negative Binomial Regression
### BIC selection with Dispersion Parameter = 1
nb1_mod_all_vars <- glm(TARGET ~ ., family = negative.binomial(1), data = train_imputed)
BIC_nb_k1_mod <- step(nb1_mod_all_vars, k = log(n), trace = 0)
summary(BIC_nb_k1_mod)
##
## Call:
## glm(formula = TARGET ~ VolatileAcidity + TotalSulfurDioxide +
## pH + Sulphates + LabelAppeal + AcidIndex + STARS, family = negative.binomial(1),
## data = train_imputed)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.82923 -0.39345 0.00025 0.29918 1.75915
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.4709768 0.0511707 28.746 < 2e-16 ***
## VolatileAcidity -0.0553022 0.0105887 -5.223 0.000000179 ***
## TotalSulfurDioxide 0.0001542 0.0000366 4.214 0.000025248 ***
## pH -0.0272945 0.0087236 -3.129 0.00176 **
## Sulphates -0.0298069 0.0093827 -3.177 0.00149 **
## LabelAppeal 0.1185411 0.0068369 17.338 < 2e-16 ***
## AcidIndex -0.1180306 0.0047711 -24.739 < 2e-16 ***
## STARS 0.3666216 0.0051649 70.984 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(1) family taken to be 0.3104753)
##
## Null deviance: 9042.5 on 12794 degrees of freedom
## Residual deviance: 6769.6 on 12787 degrees of freedom
## AIC: 55517
##
## Number of Fisher Scoring iterations: 6
With a p-value near zero, this 7-variable model is statistically significant when compared to the null hypothesis, but it only explains 25% of the deviance. The p-value for the goodness-of-fit chi-squared test appears to be near 1, which indicates a good fit. None of the variables are exhibiting collinearity with a variance inflation factor greater than 4 (VIF_gt_4).
glmod <- glm_performance(BIC_nb_k1_mod)
kable(all_glmods <- rbind(all_glmods, glmod), digits = 3)
| model_name | n_vars | pvalue | deviance_explained | GoFtest | dispersion_parameter | VIF_gt_4 | CV_RMSE |
|---|---|---|---|---|---|---|---|
| BIC_pois_mod | 5 | 0.00e+00 | 0.354 | 1.51e-32 | 0.854 | 0 | 1.406 |
| quasi_pois_mod | 5 | 0.00e+00 | 0.354 | 1.51e-32 | 0.854 | 0 | 1.406 |
| BIC_nb_k1_mod | 7 | 0.00e+00 | 0.251 | 1.00e+00 | 0.310 | 0 | 1.485 |
Finally, let’s run a BIC selection process on a negative binomial model where the dispersion parameter is allowed to vary. It will be estimated using the maximum likelihood. The result of this process is the 5-variable model below.
## BIC selection with Floating Dispersion Parameter
nb_mod_all_vars <- MASS::glm.nb(TARGET ~ ., data = train_imputed)
BIC_nb_mod <- step(nb_mod_all_vars, k = log(n), trace = 0)
summary(BIC_nb_mod)
##
## Call:
## MASS::glm.nb(formula = TARGET ~ VolatileAcidity + TotalSulfurDioxide +
## LabelAppeal + AcidIndex + STARS, data = train_imputed, init.theta = 48864.4877,
## link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.0071 -0.7179 0.0617 0.5744 3.2582
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.2249406 0.0377565 32.443 < 2e-16 ***
## VolatileAcidity -0.0418050 0.0093921 -4.451 0.00000854 ***
## TotalSulfurDioxide 0.0001019 0.0000319 3.194 0.0014 **
## LabelAppeal 0.1329508 0.0060618 21.933 < 2e-16 ***
## AcidIndex -0.0881786 0.0044691 -19.731 < 2e-16 ***
## STARS 0.3132447 0.0045129 69.412 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(48864.49) family taken to be 1)
##
## Null deviance: 22860 on 12794 degrees of freedom
## Residual deviance: 14773 on 12789 degrees of freedom
## AIC: 46729
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 48864
## Std. Err.: 50635
## Warning while fitting theta: iteration limit reached
##
## 2 x log-likelihood: -46715.5
With a p-value near zero, this 5-variable model is statistically significant when compared to the null hypothesis, but it only explains 35% of the deviance. The p-value for the goodness-of-fit chi-squared test is near zero, which indicates that the model’s deviance is not small enough for a good fit. None of the variables are exhibiting collinearity with a variance inflation factor greater than 4 (VIF_gt_4).
glmod <- glm_performance(BIC_nb_mod)
kable(all_glmods <- rbind(all_glmods, glmod), digits = 3)
| model_name | n_vars | pvalue | deviance_explained | GoFtest | dispersion_parameter | VIF_gt_4 | CV_RMSE |
|---|---|---|---|---|---|---|---|
| BIC_pois_mod | 5 | 0.00e+00 | 0.354 | 1.51e-32 | 0.854 | 0 | 1.406 |
| quasi_pois_mod | 5 | 0.00e+00 | 0.354 | 1.51e-32 | 0.854 | 0 | 1.406 |
| BIC_nb_k1_mod | 7 | 0.00e+00 | 0.251 | 1.00e+00 | 0.310 | 0 | 1.485 |
| BIC_nb_mod | 5 | 0.00e+00 | 0.354 | 1.56e-32 | 0.854 | 0 | 1.406 |
Let’s take a look at the coefficients from our models. In the table below, we see that the intercepts are all approximately 3 to 4 wines cases. The linear coefficients largely align, and the generalized linear coefficients largely align. Among the linear models, there are small differences, but not practical differences. Among the generalized linear models, the Poisson (BIC_pois_mod), quasi-Poisson (quasi_pois_mod) & negative binomial with the varying dispersion parameter (BIC_nb_mod) are nearly identical. This exhibits how closely the Poisson and negative binomial distributions can approximate each other. Only the negative binomial model with a dispersion parameter of 1 has some small differences. Interestingly, while the variables VolatileAcidity & AcidIndex had negative effects on the response variable for the linear models, they had positive ones in the generalized linear models.
options(knitr.kable.NA = '')
#https://stats.stackexchange.com/questions/11096/how-to-interpret-coefficients-in-a-poisson-regression?utm_medium=organic&utm_source=google_rich_qa&utm_campaign=google_rich_qa
compare_coefficients <- data.frame(var = rev(names(coef(all_vars_lmod))))
all_models <- c(as.character(lm_results$model_name), as.character(all_glmods$model_name))
for (i in 1:length(all_models)){
model <- get(all_models[i])
model_name <- all_models[i]
is_lm_obj <- rep(class(model)[1] == "lm", length(coef(model)))
df <- data.frame(var = as.character(names(coef(model))))
df[, model_name] <- ifelse(is_lm_obj, coef(model), exp(coef(model)))
compare_coefficients <- left_join(compare_coefficients, df)
}
ind <- apply(compare_coefficients[ , 2:7], 1, function(x) all(is.na(x)))
#https://stackoverflow.com/questions/6471689/remove-rows-in-r-matrix-where-all-data-is-na?utm_medium=organic&utm_source=google_rich_qa&utm_campaign=google_rich_qa
compare_coefficients_df <-
compare_coefficients[!ind, ] %>%
arrange(-bkwd_elim_lmod)
kable(compare_coefficients_df, digits = 6)
| var | bkwd_elim_lmod | BIC_lmod | BIC_pois_mod | quasi_pois_mod | BIC_nb_k1_mod | BIC_nb_mod |
|---|---|---|---|---|---|---|
| (Intercept) | 3.256889 | 3.082284 | 3.403939 | 3.403939 | 4.353485 | 3.403964 |
| STARS | 0.980174 | 0.981314 | 1.367851 | 1.367851 | 1.442852 | 1.367856 |
| LabelAppeal | 0.432240 | 0.431988 | 1.142194 | 1.142194 | 1.125853 | 1.142194 |
| Alcohol | 0.012163 | 0.012204 | ||||
| TotalSulfurDioxide | 0.000274 | 0.000277 | 1.000102 | 1.000102 | 1.000154 | 1.000102 |
| pH | -0.034919 | 0.973075 | ||||
| Sulphates | -0.045332 | 0.970633 | ||||
| Chlorides | -0.102367 | |||||
| VolatileAcidity | -0.115202 | -0.116542 | 0.959058 | 0.959058 | 0.946199 | 0.959057 |
| AcidIndex | -0.210581 | -0.210641 | 0.915599 | 0.915599 | 0.888669 | 0.915597 |
In the next 2 tables, while we see that the 2 linear models actually have the smallest cross-validated root mean square error, these models also have a small number of negative fitted values, which demonstrates the limited application of linear models to count response variables. Among the remaining generalized linear models, while the negative binomial model with a dispersion parameter of 1 (BIC_nb_k1_mod) has the largest cross-validated root mean square error, the difference between 1.406 and 1.486 may not be practically significant. Additionally, it is the only model that had a good fit under the chi-squared distribution. Consequently, it is the best model.
negative_fits <- data.frame(
bkwd_elim_lmod = sum(fitted(bkwd_elim_lmod) < 0),
BIC_lmod = sum(fitted(BIC_lmod) < 0)
)
kable(negative_fits, caption = "The number of negative fitted values in the linear models")
| bkwd_elim_lmod | BIC_lmod |
|---|---|
| 23 | 20 |
final_summary <- rbind(lm_results[, c("model_name", "n_vars", "CV_RMSE")],
all_glmods[, c("model_name", "n_vars", "CV_RMSE")])
kable(final_summary, digits = 3, caption = "Final Model Comparison")
| model_name | n_vars | CV_RMSE |
|---|---|---|
| bkwd_elim_lmod | 9 | 1.325 |
| BIC_lmod | 6 | 1.325 |
| BIC_pois_mod | 5 | 1.406 |
| quasi_pois_mod | 5 | 1.406 |
| BIC_nb_k1_mod | 7 | 1.485 |
| BIC_nb_mod | 5 | 1.406 |
We used the negative binomial model with a dispersion parameter of 1 and made predictions on the evaluation data set. The following is statistical summary of the predicted responses.
summary(predict(BIC_nb_k1_mod, newdata = imputed_eval$ximp, type = "response"))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.497 1.792 2.671 3.114 3.937 11.833