In this homework assignment, you will explore, analyze and model a data set containing information on approximately 12,000 commercially available wines. The variables are mostly related to the chemical properties of the wine being sold. The response variable is the number of sample cases of wine that were purchased by wine distribution companies after sampling a wine. These cases would be used to provide tasting samples to restaurants and wine stores around the United States. The more sample cases purchased, the more likely is a wine to be sold at a high end restaurant. A large wine manufacturer is studying the data in order to predict the number of wine cases ordered based upon the wine characteristics. If the wine manufacturer can predict the number of cases, then that manufacturer will be able to adjust their wine offering to maximize sales.
Your objective is to build a count regression model to predict the number of cases of wine that will be sold given certain properties of the wine.
HINT: Sometimes, the fact that a variable is missing is actually predictive of the target.
You can only use the variables given to you (or variables that you derive from the variables provided).
Below is a short description of the variables of interest in the data set:
## [1] 12795 16
The training dataframe has 12795 rows and 16 columns with 1 target variable and 15 features.
## Classes 'data.table' and 'data.frame': 12795 obs. of 16 variables:
## $ INDEX : int 1 2 4 5 6 7 8 11 12 13 ...
## $ TARGET : int 3 3 5 3 4 0 0 4 3 6 ...
## $ FixedAcidity : num 3.2 4.5 7.1 5.7 8 11.3 7.7 6.5 14.8 5.5 ...
## $ VolatileAcidity : num 1.16 0.16 2.64 0.385 0.33 0.32 0.29 -1.22 0.27 -0.22 ...
## $ CitricAcid : num -0.98 -0.81 -0.88 0.04 -1.26 0.59 -0.4 0.34 1.05 0.39 ...
## $ ResidualSugar : num 54.2 26.1 14.8 18.8 9.4 ...
## $ Chlorides : num -0.567 -0.425 0.037 -0.425 NA 0.556 0.06 0.04 -0.007 -0.277 ...
## $ FreeSulfurDioxide : num NA 15 214 22 -167 -37 287 523 -213 62 ...
## $ TotalSulfurDioxide: num 268 -327 142 115 108 15 156 551 NA 180 ...
## $ Density : num 0.993 1.028 0.995 0.996 0.995 ...
## $ pH : num 3.33 3.38 3.12 2.24 3.12 3.2 3.49 3.2 4.93 3.09 ...
## $ Sulphates : num -0.59 0.7 0.48 1.83 1.77 1.29 1.21 NA 0.26 0.75 ...
## $ Alcohol : num 9.9 NA 22 6.2 13.7 15.4 10.3 11.6 15 12.6 ...
## $ LabelAppeal : int 0 -1 -1 -1 0 0 0 1 0 0 ...
## $ AcidIndex : int 8 7 8 6 9 11 8 7 6 8 ...
## $ STARS : int 2 3 3 1 2 NA NA 3 NA 4 ...
## - attr(*, ".internal.selfref")=<externalptr>
show_summary <- function(df) {
cat(rep("+", 50), "\n")
cat(paste("DIMENSIONS : (", nrow(df), ", ", ncol(df), ")\n", sep = ""), "\n")
cat(rep("+", 50), "\n")
cat("COLUMNS:\n", "\n")
col_names <- names(df)
cat(paste(col_names, ", "))
cat(rep("+", 50), "\n")
cat("DATA INFO:\n", "\n")
cat(sapply(df, class), "\n")
cat(rep("+", 50), "\n")
cat("MISSING VALUES:\n", "\n")
missing_values <- colSums(is.na(df))
cat(paste(col_names, ": ", missing_values, "\n"))
}
show_summary(train_df)## + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
## DIMENSIONS : (12795, 16)
##
## + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
## COLUMNS:
##
## INDEX , TARGET , FixedAcidity , VolatileAcidity , CitricAcid , ResidualSugar , Chlorides , FreeSulfurDioxide , TotalSulfurDioxide , Density , pH , Sulphates , Alcohol , LabelAppeal , AcidIndex , STARS , + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
## DATA INFO:
##
## integer integer numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric numeric integer integer integer
## + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
## MISSING VALUES:
##
## INDEX : 0
## TARGET : 0
## FixedAcidity : 0
## VolatileAcidity : 0
## CitricAcid : 0
## ResidualSugar : 616
## Chlorides : 638
## FreeSulfurDioxide : 647
## TotalSulfurDioxide : 682
## Density : 0
## pH : 395
## Sulphates : 1210
## Alcohol : 653
## LabelAppeal : 0
## AcidIndex : 0
## STARS : 3359
There the different chemical components of wine and its STAR rating on 1-5 scale where 5 stars is the highest quality of wine.
par(mfrow = c(4, 4), mar = c(3, 3, 1, 1))
for (col_name in names(train_df)[-1]) {
if (is.numeric(train_df[[col_name]])) {
hist(train_df[[col_name]], main = paste(col_name), xlab = "Value")
}
}
par(mfrow = c(1, 1))par(mfrow = c(4, 4), mar = c(3, 3, 1, 1))
for (col_name in names(train_df)[-1]) {
if (is.numeric(train_df[[col_name]])) {
boxplot(train_df[[col_name]], main = paste(col_name), horizontal = TRUE,
ylab = "Value")
}
}
par(mfrow = c(1, 1))All of the features distribution is center around the mean which is what we want. Further supported with boxplots that the distributions look normal.
Firstly, we want to remove the INDEX column from both the training and testing dataframe.
We then split the training data in to train-test split since the evaluation data is unlabeled thus we can not use for test the models’ performance. In addition, we chose to use MICE imputation method to fill out the missing values in the data.
# Impute missing data using mice
set.seed(123123)
imputed_data <- mice(train_df, m = 5, method = "pmm")##
## iter imp variable
## 1 1 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 1 2 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 1 3 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 1 4 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 1 5 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 2 1 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 2 2 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 2 3 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 2 4 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 2 5 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 3 1 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 3 2 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 3 3 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 3 4 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 3 5 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 4 1 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 4 2 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 4 3 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 4 4 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 4 5 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 5 1 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 5 2 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 5 3 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 5 4 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 5 5 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
complete_train_df <- complete(imputed_data, 1)
test_df <- test_df[, 2:15]
test_imputed <- mice(test_df, m = 5, method = "pmm")##
## iter imp variable
## 1 1 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 1 2 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 1 3 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 1 4 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 1 5 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 2 1 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 2 2 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 2 3 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 2 4 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 2 5 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 3 1 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 3 2 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 3 3 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 3 4 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 3 5 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 4 1 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 4 2 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 4 3 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 4 4 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 4 5 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 5 1 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 5 2 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 5 3 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 5 4 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
## 5 5 ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide pH Sulphates Alcohol STARS
correlation_matrix <- cor(complete_train_df[, 1:14])
corrplot(correlation_matrix, method = "color", type = "upper", addCoef.col = "black",
number.cex = 0.5)The features are have little to no correlation with the target and with other predictor variables.
set.seed(123123)
train_indices <- createDataPartition(complete_train_df$TARGET, p = 0.8, list = FALSE)
train_model_df <- complete_train_df[train_indices, ]
test_model_df <- complete_train_df[-train_indices, ]
x_train <- train_model_df[, 2:15] |>
as.matrix()
y_train <- train_model_df$TARGET |>
as.matrix() |>
as.numeric()performance_metrics <- function(model, true_values, predicted_values) {
deviance <- deviance(model)
aic <- AIC(model)
bic <- BIC(model)
mse_val <- mean((true_values - predicted_values)^2)
rmse_val <- sqrt(mse_val)
mae_val <- mean(abs(predicted_values - true_values))
metrics_df <- data.frame(Deviance = deviance, AIC = aic, BIC = bic, MSE = mse_val,
RMSE = rmse_val, MAE = mae_val)
return(metrics_df)
}##
## Call:
## glm(formula = TARGET ~ ., family = poisson, data = train_model_df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.198e+00 2.288e-01 9.603 < 2e-16 ***
## FixedAcidity -1.222e-03 9.214e-04 -1.326 0.184803
## VolatileAcidity -4.087e-02 7.321e-03 -5.582 2.37e-08 ***
## CitricAcid 8.012e-03 6.564e-03 1.221 0.222197
## ResidualSugar 6.622e-05 1.684e-04 0.393 0.694146
## Chlorides -5.183e-02 1.788e-02 -2.898 0.003751 **
## FreeSulfurDioxide 1.356e-04 3.821e-05 3.550 0.000386 ***
## TotalSulfurDioxide 9.716e-05 2.480e-05 3.917 8.96e-05 ***
## Density -3.637e-01 2.143e-01 -1.697 0.089646 .
## pH -1.551e-02 8.406e-03 -1.845 0.065039 .
## Sulphates -1.257e-02 6.021e-03 -2.088 0.036835 *
## Alcohol 3.353e-03 1.539e-03 2.179 0.029358 *
## LabelAppeal 1.492e-01 6.780e-03 22.005 < 2e-16 ***
## AcidIndex -6.954e-01 3.974e-02 -17.499 < 2e-16 ***
## STARS 3.394e-01 6.275e-03 54.087 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 18374 on 10237 degrees of freedom
## Residual deviance: 12937 on 10223 degrees of freedom
## AIC: 38498
##
## Number of Fisher Scoring iterations: 5
pm_1_predictions <- predict(pm_1, newdata = test_model_df)
performance_metrics(pm_1, test_model_df$TARGET, pm_1_predictions)## Deviance AIC BIC MSE RMSE MAE
## 1 12936.87 38497.9 38606.4 6.81127 2.609841 2.304567
lasso_model_poisson <- cv.glmnet(x_train, y_train, family = "poisson", alpha = 1,
nfolds = 5)
best_lambda <- lasso_model_poisson$lambda.min
selected_features_poisson <- which(coef(lasso_model_poisson, s = best_lambda) !=
0)
selected_features_lasso <- which(coef(lasso_model_poisson, s = best_lambda) != 0)
print(paste("Selected features by LASSO:", selected_features_lasso))## [1] "Selected features by LASSO: 1" "Selected features by LASSO: 2"
## [3] "Selected features by LASSO: 3" "Selected features by LASSO: 4"
## [5] "Selected features by LASSO: 5" "Selected features by LASSO: 6"
## [7] "Selected features by LASSO: 7" "Selected features by LASSO: 8"
## [9] "Selected features by LASSO: 9" "Selected features by LASSO: 10"
## [11] "Selected features by LASSO: 11" "Selected features by LASSO: 12"
## [13] "Selected features by LASSO: 13" "Selected features by LASSO: 14"
## [15] "Selected features by LASSO: 15"
Since lasso returns that all predictors should be in the model that would the same as model 1. Therefore, for this model we will manually keep the features that are the most significant based on model 1.
pm_lasso <- glm(TARGET ~ VolatileAcidity + FreeSulfurDioxide + TotalSulfurDioxide +
LabelAppeal + AcidIndex + STARS, data = train_model_df, family = poisson)
summary(pm_lasso)##
## Call:
## glm(formula = TARGET ~ VolatileAcidity + FreeSulfurDioxide +
## TotalSulfurDioxide + LabelAppeal + AcidIndex + STARS, family = poisson,
## data = train_model_df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.825e+00 8.189e-02 22.285 < 2e-16 ***
## VolatileAcidity -4.127e-02 7.319e-03 -5.639 1.71e-08 ***
## FreeSulfurDioxide 1.333e-04 3.820e-05 3.490 0.000482 ***
## TotalSulfurDioxide 9.567e-05 2.477e-05 3.863 0.000112 ***
## LabelAppeal 1.487e-01 6.777e-03 21.938 < 2e-16 ***
## AcidIndex -7.062e-01 3.903e-02 -18.095 < 2e-16 ***
## STARS 3.414e-01 6.256e-03 54.579 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 18374 on 10237 degrees of freedom
## Residual deviance: 12965 on 10231 degrees of freedom
## AIC: 38510
##
## Number of Fisher Scoring iterations: 5
m2_predictions <- predict(pm_lasso, test_model_df)
performance_metrics(pm_lasso, test_model_df$TARGET, m2_predictions)## Deviance AIC BIC MSE RMSE MAE
## 1 12964.97 38510 38560.63 6.81459 2.610477 2.305362
require(randomForest)
rf_model <- randomForest(x_train, y_train, ntree = 100, importance = TRUE)
print(rf_model)##
## Call:
## randomForest(x = x_train, y = y_train, ntree = 100, importance = TRUE)
## Type of random forest: regression
## Number of trees: 100
## No. of variables tried at each split: 4
##
## Mean of squared residuals: 1.810821
## % Var explained: 51.41
feature_importance <- importance(rf_model)
sorted_features <- sort(feature_importance[, 1], decreasing = TRUE)
top_features <- names(sorted_features)[1:5]
print(top_features)## [1] "STARS" "LabelAppeal" "AcidIndex"
## [4] "VolatileAcidity" "TotalSulfurDioxide"
pm_3 <- glm(TARGET ~ STARS + LabelAppeal + AcidIndex + VolatileAcidity + Chlorides,
data = train_model_df, family = poisson)
summary(pm_3)##
## Call:
## glm(formula = TARGET ~ STARS + LabelAppeal + AcidIndex + VolatileAcidity +
## Chlorides, family = poisson, data = train_model_df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.855998 0.081576 22.752 < 2e-16 ***
## STARS 0.341355 0.006252 54.599 < 2e-16 ***
## LabelAppeal 0.148702 0.006780 21.932 < 2e-16 ***
## AcidIndex -0.711879 0.038961 -18.271 < 2e-16 ***
## VolatileAcidity -0.042063 0.007314 -5.751 8.86e-09 ***
## Chlorides -0.052673 0.017861 -2.949 0.00319 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 18374 on 10237 degrees of freedom
## Residual deviance: 12984 on 10232 degrees of freedom
## AIC: 38527
##
## Number of Fisher Scoring iterations: 5
pm3_predictions <- predict(pm_3, newdata = test_model_df)
performance_metrics(pm_3, test_model_df$TARGET, pm3_predictions)## Deviance AIC BIC MSE RMSE MAE
## 1 12983.61 38526.64 38570.04 6.813023 2.610177 2.30607
##
## Call:
## glm.nb(formula = TARGET ~ ., data = train_df, init.theta = 140198.2884,
## link = log)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.593e+00 2.506e-01 6.359 2.03e-10 ***
## FixedAcidity 3.293e-04 1.053e-03 0.313 0.75446
## VolatileAcidity -2.560e-02 8.353e-03 -3.065 0.00218 **
## CitricAcid -7.259e-04 7.575e-03 -0.096 0.92365
## ResidualSugar -6.141e-05 1.941e-04 -0.316 0.75166
## Chlorides -3.007e-02 2.056e-02 -1.463 0.14347
## FreeSulfurDioxide 6.734e-05 4.404e-05 1.529 0.12621
## TotalSulfurDioxide 2.081e-05 2.855e-05 0.729 0.46618
## Density -3.725e-01 2.462e-01 -1.513 0.13026
## pH -4.661e-03 9.598e-03 -0.486 0.62722
## Sulphates -5.164e-03 7.052e-03 -0.732 0.46398
## Alcohol 3.948e-03 1.771e-03 2.229 0.02579 *
## LabelAppeal 1.771e-01 7.954e-03 22.271 < 2e-16 ***
## AcidIndex -4.870e-02 5.903e-03 -8.251 < 2e-16 ***
## STARS 1.871e-01 7.487e-03 24.992 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(140198.3) family taken to be 1)
##
## Null deviance: 5843.9 on 6435 degrees of freedom
## Residual deviance: 4009.1 on 6421 degrees of freedom
## (6359 observations deleted due to missingness)
## AIC: 23174
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 140198
## Std. Err.: 234984
## Warning while fitting theta: iteration limit reached
##
## 2 x log-likelihood: -23141.85
nb_1_predictions <- predict(nb_1, test_model_df)
performance_metrics(nb_1, test_model_df$TARGET, nb_1_predictions)## Deviance AIC BIC MSE RMSE MAE
## 1 4009.079 23173.85 23282.17 5.474343 2.339731 2.111443
full_model <- glm.nb(TARGET ~ ., data = train_model_df)
nb_2 <- stepAIC(full_model, direction = "both")## Start: AIC=38498.07
## TARGET ~ FixedAcidity + VolatileAcidity + CitricAcid + ResidualSugar +
## Chlorides + FreeSulfurDioxide + TotalSulfurDioxide + Density +
## pH + Sulphates + Alcohol + LabelAppeal + AcidIndex + STARS
##
## Df AIC
## - ResidualSugar 1 38496
## - CitricAcid 1 38498
## - FixedAcidity 1 38498
## <none> 38498
## - Density 1 38499
## - pH 1 38499
## - Sulphates 1 38500
## - Alcohol 1 38501
## - Chlorides 1 38504
## - FreeSulfurDioxide 1 38509
## - TotalSulfurDioxide 1 38511
## - VolatileAcidity 1 38527
## - AcidIndex 1 38808
## - LabelAppeal 1 38980
## - STARS 1 41355
##
## Step: AIC=38496.23
## TARGET ~ FixedAcidity + VolatileAcidity + CitricAcid + Chlorides +
## FreeSulfurDioxide + TotalSulfurDioxide + Density + pH + Sulphates +
## Alcohol + LabelAppeal + AcidIndex + STARS
##
## Df AIC
## - CitricAcid 1 38496
## - FixedAcidity 1 38496
## <none> 38496
## - Density 1 38497
## - pH 1 38498
## + ResidualSugar 1 38498
## - Sulphates 1 38499
## - Alcohol 1 38499
## - Chlorides 1 38503
## - FreeSulfurDioxide 1 38507
## - TotalSulfurDioxide 1 38510
## - VolatileAcidity 1 38525
## - AcidIndex 1 38807
## - LabelAppeal 1 38978
## - STARS 1 41355
##
## Step: AIC=38495.7
## TARGET ~ FixedAcidity + VolatileAcidity + Chlorides + FreeSulfurDioxide +
## TotalSulfurDioxide + Density + pH + Sulphates + Alcohol +
## LabelAppeal + AcidIndex + STARS
##
## Df AIC
## - FixedAcidity 1 38495
## <none> 38496
## + CitricAcid 1 38496
## - Density 1 38497
## - pH 1 38497
## + ResidualSugar 1 38498
## - Sulphates 1 38498
## - Alcohol 1 38498
## - Chlorides 1 38502
## - FreeSulfurDioxide 1 38506
## - TotalSulfurDioxide 1 38509
## - VolatileAcidity 1 38525
## - AcidIndex 1 38805
## - LabelAppeal 1 38978
## - STARS 1 41357
##
## Step: AIC=38495.47
## TARGET ~ VolatileAcidity + Chlorides + FreeSulfurDioxide + TotalSulfurDioxide +
## Density + pH + Sulphates + Alcohol + LabelAppeal + AcidIndex +
## STARS
##
## Df AIC
## <none> 38495
## + FixedAcidity 1 38496
## + CitricAcid 1 38496
## - Density 1 38496
## - pH 1 38497
## + ResidualSugar 1 38497
## - Sulphates 1 38498
## - Alcohol 1 38498
## - Chlorides 1 38502
## - FreeSulfurDioxide 1 38506
## - TotalSulfurDioxide 1 38509
## - VolatileAcidity 1 38525
## - AcidIndex 1 38820
## - LabelAppeal 1 38979
## - STARS 1 41356
##
## Call:
## glm.nb(formula = TARGET ~ VolatileAcidity + Chlorides + FreeSulfurDioxide +
## TotalSulfurDioxide + Density + pH + Sulphates + Alcohol +
## LabelAppeal + AcidIndex + STARS, data = train_model_df, init.theta = 48525.14821,
## link = log)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.207e+00 2.287e-01 9.650 < 2e-16 ***
## VolatileAcidity -4.107e-02 7.318e-03 -5.612 2.00e-08 ***
## Chlorides -5.186e-02 1.789e-02 -2.899 0.003742 **
## FreeSulfurDioxide 1.357e-04 3.821e-05 3.552 0.000383 ***
## TotalSulfurDioxide 9.797e-05 2.479e-05 3.952 7.75e-05 ***
## Density -3.686e-01 2.143e-01 -1.720 0.085438 .
## pH -1.549e-02 8.405e-03 -1.843 0.065356 .
## Sulphates -1.285e-02 6.020e-03 -2.135 0.032780 *
## Alcohol 3.369e-03 1.539e-03 2.190 0.028556 *
## LabelAppeal 1.493e-01 6.780e-03 22.025 < 2e-16 ***
## AcidIndex -7.005e-01 3.920e-02 -17.871 < 2e-16 ***
## STARS 3.395e-01 6.273e-03 54.120 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(48525.15) family taken to be 1)
##
## Null deviance: 18373 on 10237 degrees of freedom
## Residual deviance: 12940 on 10226 degrees of freedom
## AIC: 38497
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 48525
## Std. Err.: 63351
## Warning while fitting theta: iteration limit reached
##
## 2 x log-likelihood: -38471.47
nb_2_predictions <- predict(nb_2, test_model_df)
performance_metrics(nb_2, test_model_df$TARGET, nb_2_predictions)## Deviance AIC BIC MSE RMSE MAE
## 1 12939.8 38497.47 38591.51 6.809707 2.609541 2.304512
### NB Model 3 Tree Based Feature Selection
nb_3 <- glm.nb(TARGET ~ STARS + LabelAppeal + AcidIndex + VolatileAcidity + Chlorides,
data = train_model_df)
summary(nb_3)##
## Call:
## glm.nb(formula = TARGET ~ STARS + LabelAppeal + AcidIndex + VolatileAcidity +
## Chlorides, data = train_model_df, init.theta = 48454.10433,
## link = log)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.856021 0.081579 22.751 < 2e-16 ***
## STARS 0.341359 0.006252 54.598 < 2e-16 ***
## LabelAppeal 0.148701 0.006780 21.931 < 2e-16 ***
## AcidIndex -0.711894 0.038963 -18.271 < 2e-16 ***
## VolatileAcidity -0.042064 0.007314 -5.751 8.86e-09 ***
## Chlorides -0.052675 0.017862 -2.949 0.00319 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(48454.1) family taken to be 1)
##
## Null deviance: 18373 on 10237 degrees of freedom
## Residual deviance: 12983 on 10232 degrees of freedom
## AIC: 38529
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 48454
## Std. Err.: 63478
## Warning while fitting theta: iteration limit reached
##
## 2 x log-likelihood: -38514.81
nb_3_predictions <- predict(nb_3, test_model_df)
performance_metrics(nb_3, test_model_df$TARGET, nb_3_predictions)## Deviance AIC BIC MSE RMSE MAE
## 1 12983.14 38528.81 38579.44 6.813019 2.610176 2.306069
model_names <- c("Poisson Model 1", "Poisson Model 2", "Poisson Model 3", "NB Model 1",
"NB Model 2", "NB Model 3")
pm1_metrics <- performance_metrics(pm_1, test_model_df$TARGET, pm_1_predictions)
pm2_metrics <- performance_metrics(pm_lasso, test_model_df$TARGET, m2_predictions)
pm3_metrics <- performance_metrics(pm_3, test_model_df$TARGET, pm3_predictions)
nb1_metrics <- performance_metrics(nb_1, test_model_df$TARGET, nb_1_predictions)
nb2_metrics <- performance_metrics(nb_2, test_model_df$TARGET, nb_2_predictions)
nb3_metrics <- performance_metrics(nb_3, test_model_df$TARGET, nb_3_predictions)
all_models_metrics <- rbind(pm1_metrics, pm2_metrics, pm3_metrics, nb1_metrics, nb2_metrics,
nb3_metrics)
final_table <- cbind(Model = model_names, all_models_metrics)
data.table(final_table)## Model Deviance AIC BIC MSE RMSE MAE
## 1: Poisson Model 1 12936.868 38497.90 38606.40 6.811270 2.609841 2.304567
## 2: Poisson Model 2 12964.969 38510.00 38560.63 6.814590 2.610477 2.305362
## 3: Poisson Model 3 12983.607 38526.64 38570.04 6.813023 2.610177 2.306070
## 4: NB Model 1 4009.079 23173.85 23282.17 5.474343 2.339731 2.111443
## 5: NB Model 2 12939.800 38497.47 38591.51 6.809707 2.609541 2.304512
## 6: NB Model 3 12983.140 38528.81 38579.44 6.813019 2.610176 2.306069
All models performed almost identical for which we did not expect. We will Negative Binomial Model 1 which had the better performance metrics to predict the evaluation dataframe.
predictions_df <- predict(nb_1, newdata = complete_test_df, type = "response")
predicted_test_df <- cbind(TARGET = round(predictions_df), test_df)
data.table(predicted_test_df)## TARGET FixedAcidity VolatileAcidity CitricAcid ResidualSugar Chlorides
## 1: 3 5.4 -0.860 0.27 -10.7 0.092
## 2: 4 12.4 0.385 -0.76 -19.7 1.169
## 3: 3 7.2 1.750 0.17 -33.0 0.065
## 4: 2 6.2 0.100 1.80 1.0 -0.179
## 5: 3 11.4 0.210 0.28 1.2 0.038
## ---
## 3331: 3 7.8 1.180 0.24 NA 0.034
## 3332: 6 -0.4 -0.980 0.34 1.3 0.226
## 3333: 4 6.6 0.410 0.22 NA 0.035
## 3334: 3 6.1 -0.220 1.15 1.1 0.041
## 3335: 4 7.1 0.210 0.31 14.6 0.021
## FreeSulfurDioxide TotalSulfurDioxide Density pH Sulphates Alcohol
## 1: 23 398 0.98527 5.02 0.64 12.30
## 2: -37 68 0.99048 3.37 1.09 16.00
## 3: 9 76 1.04641 4.61 0.68 8.55
## 4: 104 89 0.98877 3.20 2.11 12.30
## 5: 70 53 1.02899 2.54 -0.07 4.80
## ---
## 3331: 29 -120 0.99031 3.10 0.40 6.10
## 3332: 136 0 0.99176 3.07 -1.22 14.70
## 3333: 23 117 1.00962 2.84 0.39 8.44
## 3334: 32 92 1.01541 3.26 NA 17.20
## 3335: 281 142 0.99215 3.17 -0.37 9.70
## LabelAppeal AcidIndex STARS
## 1: -1 6 NA
## 2: 0 6 2
## 3: 0 8 1
## 4: -1 8 1
## 5: 0 10 NA
## ---
## 3331: 0 7 2
## 3332: 1 9 4
## 3333: 0 7 2
## 3334: 0 7 1
## 3335: 0 8 3