Data 621 - HW5
Loading of Libraries
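The library-loading chunk itself is not shown in this output. Judging only by the functions called later in the report (gather, pivot_longer, ggplot, preProcess, corrplot, brewer.pal, glm.nb, stepAIC), the following is a plausible sketch of what it needs; the exact package list is an assumption, not the author's original chunk.

```r
# Assumed package list, inferred from the functions used later in the report
pkgs <- c("dplyr", "tidyr", "ggplot2", "caret", "corrplot", "RColorBrewer", "MASS")

# Check which packages are installed before attaching them
available <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!available]
if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
}

# Attach the installed ones; MASS is attached last, so MASS::select masks
# dplyr::select, which is why the report writes dplyr::select() explicitly
invisible(lapply(setdiff(pkgs, missing_pkgs), library, character.only = TRUE))
```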
Data Dictionary
Loading of files
setwd("/Users/dpong/Data 621/HW5/")
# Load Wine dataset
wine_train <- read.csv('https://raw.githubusercontent.com/metis-macys-66898/Data_621/main/HW5/wine-training-data.csv', fileEncoding="UTF-8-BOM")
wine_eval <- read.csv('https://raw.githubusercontent.com/metis-macys-66898/Data_621/main/HW5/wine-evaluation-data.csv')
The INDEX variable is not useful for this modeling exercise and has no impact on the target variable (the number of cases of wine), so we drop it from the dataset altogether.
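The code that drops INDEX is not shown in this output. A minimal sketch of one way to do it, using a made-up toy data frame (toy_wine is hypothetical) and base R only:

```r
# Toy stand-in for wine_train; the values are made up for illustration
toy_wine <- data.frame(
  INDEX  = 1:4,
  TARGET = c(3, 0, 4, 2),
  STARS  = c(2, NA, 3, 1)
)

# Keep every column except INDEX
toy_wine <- toy_wine[, setdiff(names(toy_wine), "INDEX"), drop = FALSE]

names(toy_wine)
```

With dplyr loaded, toy_wine %>% dplyr::select(-INDEX) should do the same thing.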
Data Exploration
Summary
summary(wine_train)
## TARGET FixedAcidity VolatileAcidity CitricAcid
## Min. :0.000 Min. :-18.100 Min. :-2.7900 Min. :-3.2400
## 1st Qu.:2.000 1st Qu.: 5.200 1st Qu.: 0.1300 1st Qu.: 0.0300
## Median :3.000 Median : 6.900 Median : 0.2800 Median : 0.3100
## Mean :3.029 Mean : 7.076 Mean : 0.3241 Mean : 0.3084
## 3rd Qu.:4.000 3rd Qu.: 9.500 3rd Qu.: 0.6400 3rd Qu.: 0.5800
## Max. :8.000 Max. : 34.400 Max. : 3.6800 Max. : 3.8600
##
## ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide
## Min. :-127.800 Min. :-1.1710 Min. :-555.00 Min. :-823.0
## 1st Qu.: -2.000 1st Qu.:-0.0310 1st Qu.: 0.00 1st Qu.: 27.0
## Median : 3.900 Median : 0.0460 Median : 30.00 Median : 123.0
## Mean : 5.419 Mean : 0.0548 Mean : 30.85 Mean : 120.7
## 3rd Qu.: 15.900 3rd Qu.: 0.1530 3rd Qu.: 70.00 3rd Qu.: 208.0
## Max. : 141.150 Max. : 1.3510 Max. : 623.00 Max. :1057.0
## NA's :616 NA's :638 NA's :647 NA's :682
## Density pH Sulphates Alcohol
## Min. :0.8881 Min. :0.480 Min. :-3.1300 Min. :-4.70
## 1st Qu.:0.9877 1st Qu.:2.960 1st Qu.: 0.2800 1st Qu.: 9.00
## Median :0.9945 Median :3.200 Median : 0.5000 Median :10.40
## Mean :0.9942 Mean :3.208 Mean : 0.5271 Mean :10.49
## 3rd Qu.:1.0005 3rd Qu.:3.470 3rd Qu.: 0.8600 3rd Qu.:12.40
## Max. :1.0992 Max. :6.130 Max. : 4.2400 Max. :26.50
## NA's :395 NA's :1210 NA's :653
## LabelAppeal AcidIndex STARS
## Min. :-2.000000 Min. : 4.000 Min. :1.000
## 1st Qu.:-1.000000 1st Qu.: 7.000 1st Qu.:1.000
## Median : 0.000000 Median : 8.000 Median :2.000
## Mean :-0.009066 Mean : 7.773 Mean :2.042
## 3rd Qu.: 1.000000 3rd Qu.: 8.000 3rd Qu.:3.000
## Max. : 2.000000 Max. :17.000 Max. :4.000
## NA's :3359
A total of 8 features have missing (NA) values. Our target variable ranges between 0 and 8, which makes sense because it is the number of cases purchased. Although it may not seem to make sense, some of the features measuring the quantity of a chemical in the wine have negative values. This may be because these variables were transformed beforehand. We decided to leave them as is.
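To put numbers on the negative values and the NAs, a quick per-column count helps. The sketch below uses a small made-up data frame (toy_wine is hypothetical); the same two calls should work on wine_train.

```r
# Made-up values that mimic the pattern in the summary above
toy_wine <- data.frame(
  FixedAcidity  = c(5.2, -1.3, 6.9, 9.5),
  ResidualSugar = c(-2.0, 3.9, NA, 15.9),
  pH            = c(2.96, 3.20, 3.47, NA)
)

# Number of negative entries per column (NAs are ignored)
neg_counts <- colSums(toy_wine < 0, na.rm = TRUE)

# Number of missing entries per column
na_counts <- colSums(is.na(toy_wine))

rbind(negative = neg_counts, missing = na_counts)
```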
Histograms
# Create gather_df for ggplot()
gather_df <- wine_train %>% gather(key = 'variable', value = 'value')
# Histogram plots of each variable
ggplot(gather_df) +
geom_histogram(aes(x=value, y = ..density..), bins=30) +
geom_density(aes(x=value), color='blue') +
facet_wrap(. ~ variable, scales='free', ncol=4)
## Warning: Removed 8200 rows containing non-finite values (stat_bin).
## Warning: Removed 8200 rows containing non-finite values (stat_density).
Most of the variables are approximately normally distributed, except for STARS and AcidIndex, which are both right-skewed.
Boxplots
Next, we draw boxplots to visualize the spread of each variable.
# Create gather_df for the input of ggplot
gather_df <- wine_train %>% gather(key = 'variable', value = 'value')
# Boxplots for each variable
ggplot(gather_df, aes(variable, value)) +
geom_boxplot() +
facet_wrap(. ~variable, scales='free', ncol=4)
## Warning: Removed 8200 rows containing non-finite values (stat_boxplot).
df_pivot_wide <- wine_train %>%
dplyr::select(STARS, LabelAppeal, AcidIndex, TARGET ) %>%
pivot_longer(cols = -TARGET, names_to="variable", values_to="value") %>%
arrange(variable, value)
df_pivot_wide %>%
ggplot(mapping = aes(x = factor(value), y = TARGET)) +
geom_boxplot() +
facet_wrap(.~variable, scales="free") +
theme_minimal()
Commentaries:
There aren’t too many outliers for AcidIndex. There are a lot of zeros for AcidIndex values 12, 16, and 17, and there is no clear pattern in relation to TARGET. LabelAppeal shows a positive correlation with TARGET: the higher the LabelAppeal, the higher the TARGET. STARS shows an obvious positive correlation with TARGET as well. Wines with STARS = NA are spread across the full range of TARGET. To satisfy the requirements of the model, I impute these NAs with 0. The overall trend among the existing values stays the same: a higher STARS value goes with a higher TARGET, which is the number of cases of wine sold.
Scatter Plots
featurePlot(wine_train[,2:ncol(wine_train)], wine_train[,1], pch = 18)
What I am looking for is irregular gaps in the values of a given variable. I do not see any irregular distribution against the TARGET variable, which is shown on the y-axis.
Missing values & Imputations
With that said, we do need to check the missing values.
missing <- colSums(wine_train %>% sapply(is.na))
missing_pct <- round(missing / nrow(wine_train) * 100, 2)
stack(sort(missing_pct, decreasing = TRUE))
## values ind
## 1 26.25 STARS
## 2 9.46 Sulphates
## 3 5.33 TotalSulfurDioxide
## 4 5.10 Alcohol
## 5 5.06 FreeSulfurDioxide
## 6 4.99 Chlorides
## 7 4.81 ResidualSugar
## 8 3.09 pH
## 9 0.00 TARGET
## 10 0.00 FixedAcidity
## 11 0.00 VolatileAcidity
## 12 0.00 CitricAcid
## 13 0.00 Density
## 14 0.00 LabelAppeal
## 15 0.00 AcidIndex
As you can see, there are 7 additional variables that need to be imputed in addition to the STARS variable.
Data Preparations
Strategies:
Impute missing STARS with 0
Use knnImpute to impute the remaining 7 columns, combined with a BoxCox transformation
training_x <- wine_train %>% dplyr::select(-TARGET)
training_y <- wine_train$TARGET
eval_x <- wine_eval %>% dplyr::select(-TARGET)
eval_y <- wine_eval$TARGET
create_na_dummy <- function(vector) {
as.integer(vector %>% is.na())
}
impute_missing <- function(data) {
# Replace missing STARS with 0
data$STARS <- data$STARS %>%
tidyr::replace_na(0)
return(data)
}
# Replace missing STARS with 0 in both the training and evaluation sets
training_x <- impute_missing(training_x)
eval_x <- impute_missing(eval_x)
imputation <- caret::preProcess(training_x, method = c("knnImpute", 'BoxCox'))
# summary(imputation)
training_x_imp <- predict(imputation, training_x)
eval_x_imp <- predict(imputation, eval_x)
clean_df <- cbind(training_y, training_x_imp) %>%
as.data.frame() %>%
rename(TARGET = training_y)
clean_eval_df <- cbind(eval_y, eval_x_imp) %>%
as.data.frame() %>%
rename(TARGET = eval_y)
Feature-Target Correlations
stack(sort(cor(clean_df[,1], clean_df[,2:ncol(clean_df)])[,], decreasing=TRUE))
## values ind
## 1 0.685381473 STARS
## 2 0.356500469 LabelAppeal
## 3 0.062030498 Alcohol
## 4 0.051730323 TotalSulfurDioxide
## 5 0.043996542 FreeSulfurDioxide
## 6 0.016187709 ResidualSugar
## 7 0.008684633 CitricAcid
## 8 -0.009081197 pH
## 9 -0.035589560 Density
## 10 -0.039072231 Chlorides
## 11 -0.039917146 Sulphates
## 12 -0.049010939 FixedAcidity
## 13 -0.088793212 VolatileAcidity
## 14 -0.221991949 AcidIndex
Only STARS (about 0.69) comes close to being highly correlated with the TARGET variable. Note that this is after the imputation.
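One side effect of the preprocessing is worth spelling out. When knnImpute is requested, caret's preProcess also centers and scales the predictors, so discrete variables such as LabelAppeal and STARS no longer carry their original integer values. This is why the model summaries later in this report show factor levels like as.factor(LabelAppeal)-1.11204793733397. AcidIndex, being strictly positive, is presumably also Box-Cox transformed, which would explain its unevenly spaced levels. A minimal base R sketch of the effect, with made-up values:

```r
# Made-up LabelAppeal-like values on the original -2..2 scale
label_appeal <- c(-2, -1, -1, 0, 0, 0, 0, 1, 1, 2)

# Center and scale to mean 0 and standard deviation 1
label_appeal_z <- as.numeric(scale(label_appeal))

# Same number of distinct levels, but the labels are now z-scores
level_map <- data.frame(
  original = sort(unique(label_appeal)),
  scaled   = round(sort(unique(label_appeal_z)), 4)
)
level_map
```

The ordering and the number of levels are preserved, so treating the scaled variable as a factor still works; only the labels become harder to read.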
Multi-collinearity
A simple way to check for multi-collinearity is to look at the correlation coefficients among the variables, or predictors.
correlation = cor(clean_df, use = 'pairwise.complete.obs')
corrplot(correlation, 'ellipse', type = 'lower', order = 'hclust', col=brewer.pal(n=6, name="RdYlBu"))
The correlation coefficients among predictors are quite low. With that, we have checked the multi-collinearity assumption for the regression models.
As a final data preparation step, I create a data partition that separates a training set (80%) from a test set (20%).
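Pairwise correlations can miss collinearity that involves several predictors at once. A common complement is the variance inflation factor (VIF). The sketch below is not part of the original analysis: it computes VIFs by hand in base R on simulated predictors (x1, x2, x3 are made up), where x3 is deliberately built from x1.

```r
set.seed(621)  # arbitrary seed, for reproducibility of the toy data only
n <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- 0.9 * x1 + rnorm(n, sd = 0.3)  # deliberately collinear with x1
predictors <- data.frame(x1, x2, x3)

# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing
# predictor j on all the other predictors
vif <- sapply(names(predictors), function(v) {
  others <- setdiff(names(predictors), v)
  fit <- lm(reformulate(others, response = v), data = predictors)
  1 / (1 - summary(fit)$r.squared)
})

round(vif, 2)
```

Values near 1 indicate little collinearity; a frequently quoted rule of thumb treats values above 5 or 10 as a warning sign.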
y_mat <- as.matrix(clean_df$TARGET)
# Create a train_vect
train_vect <- createDataPartition(y_mat, p=0.8, list=FALSE)
# Build train sets
trainX <- clean_df[train_vect,] %>% dplyr::select(-TARGET)
trainY <- clean_df[train_vect,] %>% dplyr::select(TARGET)
# Output test sets
testX <- clean_df[-train_vect,] %>% dplyr::select(-TARGET)
testY <- clean_df[-train_vect,] %>% dplyr::select(TARGET)
# Build a DF for both train and test
train_df <- as.data.frame(trainX)
train_df$TARGET <- trainY$TARGET
print(paste('Size of Training data frame: ', dim(train_df)[1]))
## [1] "Size of Training data frame: 10238"
test_df <- as.data.frame(testX)
test_df$TARGET <- testY$TARGET
print(paste('Size of Testing data frame: ', dim(test_df)[1]))
## [1] "Size of Testing data frame: 2557"
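The partition above is random, and no seed is set in the code shown, so the exact rows in each set may differ between runs. A minimal sketch of a reproducible 80/20 split in base R follows. It uses simple random sampling on made-up data rather than the stratified sampling that createDataPartition performs, and the seed value is arbitrary.

```r
set.seed(621)  # hypothetical seed; any fixed integer works

# Made-up data standing in for clean_df
n <- 100
toy_df <- data.frame(TARGET = rpois(n, 3), x = rnorm(n))

# Draw 80% of the row indices for training; the rest form the test set
train_idx <- sample(seq_len(n), size = floor(0.8 * n))
toy_train <- toy_df[train_idx, ]
toy_test  <- toy_df[-train_idx, ]

c(train = nrow(toy_train), test = nrow(toy_test))
```

Calling set.seed() right before createDataPartition() should make the caret split repeatable in the same way.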
model_perf_metrics <- function(model, trainX, trainY, testX, testY) {
# Predict on the training predictors (testX and testY are accepted but not used here)
# Note: for glm fits, predict() defaults to type = "link", i.e. the log scale
predY <- predict(model, newdata=trainX)
model_results <- data.frame(obs = trainY, pred=predY)
colnames(model_results) = c('obs', 'pred')
# defaultSummary includes RMSE, Rsquared, and MAE by default
model_eval <- defaultSummary(model_results)
# Add AIC score to the model_eval results
model_eval[4] <- AIC(model)
names(model_eval)[4] <- 'AIC'
# Add BIC score to the model_eval results
model_eval[5] <- BIC(model)
names(model_eval)[5] <- 'BIC'
return(model_eval)
}
Model Building
variableImportancePlot <- function(model=NULL, chart_title='Variable Importance Plot') {
# Make sure a model was passed; exit early if not
if (is.null(model)) {
return(invisible(NULL))
}
# use caret and ggplot2 to print a variable importance plot
caret::varImp(model) %>% as.data.frame() %>%
ggplot(aes(x = reorder(rownames(.), desc(Overall)), y = Overall)) +
geom_col(aes(fill = Overall)) +
theme(panel.background = element_blank(),
panel.grid = element_blank(),
axis.text.x = element_text(angle = 90)) +
scale_fill_gradient() +
labs(title = chart_title,
x = "Parameter",
y = "Relative Importance")
}
Poisson Model 1 (full model w/ 14 predictors)
pois1 <- glm(TARGET ~ FixedAcidity + VolatileAcidity + CitricAcid + ResidualSugar +
Chlorides + FreeSulfurDioxide + TotalSulfurDioxide + Density +
pH + Sulphates + Alcohol +
as.factor(LabelAppeal) +
as.factor(AcidIndex) +
as.factor(STARS),
data=train_df,
family=poisson
)
summary(pois1)
##
## Call:
## glm(formula = TARGET ~ FixedAcidity + VolatileAcidity + CitricAcid +
## ResidualSugar + Chlorides + FreeSulfurDioxide + TotalSulfurDioxide +
## Density + pH + Sulphates + Alcohol + as.factor(LabelAppeal) +
## as.factor(AcidIndex) + as.factor(STARS), family = poisson,
## data = train_df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.2539 -0.6371 -0.0063 0.4424 3.6752
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -11.687853 172.654518 -0.068 0.946028
## FixedAcidity 0.005250 0.005789 0.907 0.364535
## VolatileAcidity -0.021188 0.005709 -3.711 0.000206
## CitricAcid 0.005663 0.005682 0.997 0.318868
## ResidualSugar -0.003967 0.005788 -0.685 0.493128
## Chlorides -0.008474 0.005818 -1.457 0.145247
## FreeSulfurDioxide 0.012187 0.005784 2.107 0.035124
## TotalSulfurDioxide 0.018498 0.005839 3.168 0.001534
## Density -0.009423 0.005702 -1.652 0.098434
## pH -0.009472 0.005754 -1.646 0.099700
## Sulphates -0.012102 0.005971 -2.027 0.042676
## Alcohol 0.020384 0.005885 3.464 0.000532
## as.factor(LabelAppeal)-1.11204793733397 0.245210 0.042964 5.707 1.15e-08
## as.factor(LabelAppeal)0.0101741115806247 0.444586 0.041960 10.595 < 2e-16
## as.factor(LabelAppeal)1.13239616049522 0.578377 0.042654 13.560 < 2e-16
## as.factor(LabelAppeal)2.25461820940981 0.732594 0.047787 15.330 < 2e-16
## as.factor(AcidIndex)-3.59682937695875 11.538890 172.654526 0.067 0.946715
## as.factor(AcidIndex)-1.79176983045029 11.590080 172.654514 0.067 0.946479
## as.factor(AcidIndex)-0.545318540973785 11.556816 172.654514 0.067 0.946633
## as.factor(AcidIndex)0.362910765511677 11.519185 172.654514 0.067 0.946806
## as.factor(AcidIndex)1.05172974217783 11.411107 172.654515 0.066 0.947304
## as.factor(AcidIndex)1.59059728918163 11.231692 172.654517 0.065 0.948132
## as.factor(AcidIndex)2.02271372429848 10.909721 172.654525 0.063 0.949617
## as.factor(AcidIndex)2.37629509167962 10.949795 172.654538 0.063 0.949432
## as.factor(AcidIndex)2.67051656830802 11.106215 172.654545 0.064 0.948710
## as.factor(AcidIndex)2.9188445277671 10.988729 172.654580 0.064 0.949252
## as.factor(AcidIndex)3.13100139587667 11.454743 172.654695 0.066 0.947103
## as.factor(AcidIndex)3.31417429494859 11.017790 172.655094 0.064 0.949118
## as.factor(AcidIndex)3.47378568897179 10.878778 172.655096 0.063 0.949760
## as.factor(STARS)-0.42623524866846 0.746259 0.021899 34.078 < 2e-16
## as.factor(STARS)0.416552574962037 1.065055 0.020459 52.059 < 2e-16
## as.factor(STARS)1.25934039859254 1.179472 0.021518 54.812 < 2e-16
## as.factor(STARS)2.10212822222303 1.302543 0.027050 48.154 < 2e-16
##
## (Intercept)
## FixedAcidity
## VolatileAcidity ***
## CitricAcid
## ResidualSugar
## Chlorides
## FreeSulfurDioxide *
## TotalSulfurDioxide **
## Density .
## pH .
## Sulphates *
## Alcohol ***
## as.factor(LabelAppeal)-1.11204793733397 ***
## as.factor(LabelAppeal)0.0101741115806247 ***
## as.factor(LabelAppeal)1.13239616049522 ***
## as.factor(LabelAppeal)2.25461820940981 ***
## as.factor(AcidIndex)-3.59682937695875
## as.factor(AcidIndex)-1.79176983045029
## as.factor(AcidIndex)-0.545318540973785
## as.factor(AcidIndex)0.362910765511677
## as.factor(AcidIndex)1.05172974217783
## as.factor(AcidIndex)1.59059728918163
## as.factor(AcidIndex)2.02271372429848
## as.factor(AcidIndex)2.37629509167962
## as.factor(AcidIndex)2.67051656830802
## as.factor(AcidIndex)2.9188445277671
## as.factor(AcidIndex)3.13100139587667
## as.factor(AcidIndex)3.31417429494859
## as.factor(AcidIndex)3.47378568897179
## as.factor(STARS)-0.42623524866846 ***
## as.factor(STARS)0.416552574962037 ***
## as.factor(STARS)1.25934039859254 ***
## as.factor(STARS)2.10212822222303 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 18262 on 10237 degrees of freedom
## Residual deviance: 10784 on 10205 degrees of freedom
## AIC: 36436
##
## Number of Fisher Scoring iterations: 9
# Evaluation and VarImp
(pois1_eval <- model_perf_metrics(pois1, trainX, trainY, testX, testY))
## RMSE Rsquared MAE aic bic
## 2.593117 0.505956 2.228388 36435.917144 36674.634576
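A note on reading the RMSE and MAE above. For a glm fit, predict() returns values on the link (log) scale unless type = "response" is given, and model_perf_metrics calls predict() with the default while scoring on the training predictors. That may explain why the count models report an RMSE near 2.59 while the linear models further down report about 1.30. The sketch below is an illustration on simulated data (toy, toy_train, toy_test and the rmse helper are made up), not a rerun of the analysis. It scores a Poisson fit on held-out rows on both scales.

```r
set.seed(621)  # arbitrary seed for the simulated data
n <- 1000
x <- rnorm(n)
toy <- data.frame(x = x, TARGET = rpois(n, lambda = exp(1 + 0.5 * x)))

# 80/20 split by simple random sampling
train_idx <- sample(seq_len(n), size = 0.8 * n)
toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

fit <- glm(TARGET ~ x, data = toy_train, family = poisson)

# Hypothetical helper: root mean squared error
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

# Default predict() gives log-scale values; type = "response" gives counts
pred_link <- predict(fit, newdata = toy_test)
pred_resp <- predict(fit, newdata = toy_test, type = "response")

rmse_link <- rmse(toy_test$TARGET, pred_link)
rmse_resp <- rmse(toy_test$TARGET, pred_resp)
c(link = rmse_link, response = rmse_resp)
```

Only the response-scale figure is comparable with the RMSE of a linear model fit to the same target.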
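poi1VarImp <- variableImportancePlot(pois1, "Poisson Model 1 Variable Importance")
Poisson Model 2
For this model we kept the predictors that were statistically significant in Model 1, along with Chlorides and AcidIndex, which were not significant there. Note that this model is fit on clean_df (all rows) rather than train_df, so its AIC is not directly comparable with that of Model 1.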
Predictors include:
VolatileAcidity
Chlorides
FreeSulfurDioxide
TotalSulfurDioxide
Sulphates
Alcohol
LabelAppeal
AcidIndex
STARS
pois2 <- glm(TARGET ~ VolatileAcidity + Chlorides + FreeSulfurDioxide + TotalSulfurDioxide + Sulphates + Alcohol +
as.factor(LabelAppeal) +
as.factor(AcidIndex) +
as.factor(STARS),
data=clean_df,
family=poisson
)
summary(pois2)
##
## Call:
## glm(formula = TARGET ~ VolatileAcidity + Chlorides + FreeSulfurDioxide +
## TotalSulfurDioxide + Sulphates + Alcohol + as.factor(LabelAppeal) +
## as.factor(AcidIndex) + as.factor(STARS), family = poisson,
## data = clean_df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.2357 -0.6510 -0.0058 0.4411 3.6662
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.019048 0.318723 0.060 0.95234
## VolatileAcidity -0.023413 0.005122 -4.571 4.85e-06
## Chlorides -0.012368 0.005217 -2.371 0.01776
## FreeSulfurDioxide 0.013273 0.005181 2.562 0.01041
## TotalSulfurDioxide 0.016843 0.005244 3.212 0.00132
## Sulphates -0.010623 0.005312 -2.000 0.04554
## Alcohol 0.016006 0.005232 3.059 0.00222
## as.factor(LabelAppeal)-1.11204793733397 0.239676 0.037999 6.307 2.84e-10
## as.factor(LabelAppeal)0.0101741115806247 0.429898 0.037065 11.599 < 2e-16
## as.factor(LabelAppeal)1.13239616049522 0.563397 0.037712 14.940 < 2e-16
## as.factor(LabelAppeal)2.25461820940981 0.698247 0.042448 16.450 < 2e-16
## as.factor(AcidIndex)-3.59682937695875 -0.158946 0.322391 -0.493 0.62200
## as.factor(AcidIndex)-1.79176983045029 -0.114374 0.316934 -0.361 0.71819
## as.factor(AcidIndex)-0.545318540973785 -0.148098 0.316650 -0.468 0.64000
## as.factor(AcidIndex)0.362910765511677 -0.178792 0.316681 -0.565 0.57236
## as.factor(AcidIndex)1.05172974217783 -0.288387 0.316984 -0.910 0.36294
## as.factor(AcidIndex)1.59059728918163 -0.445135 0.318061 -1.400 0.16165
## as.factor(AcidIndex)2.02271372429848 -0.806669 0.321621 -2.508 0.01214
## as.factor(AcidIndex)2.37629509167962 -0.820516 0.327286 -2.507 0.01217
## as.factor(AcidIndex)2.67051656830802 -0.658949 0.330177 -1.996 0.04596
## as.factor(AcidIndex)2.9188445277671 -0.759674 0.342770 -2.216 0.02667
## as.factor(AcidIndex)3.13100139587667 -0.315477 0.403499 -0.782 0.43430
## as.factor(AcidIndex)3.31417429494859 -0.969335 0.548020 -1.769 0.07693
## as.factor(AcidIndex)3.47378568897179 -1.203509 0.548079 -2.196 0.02810
## as.factor(STARS)-0.42623524866846 0.755288 0.019570 38.594 < 2e-16
## as.factor(STARS)0.416552574962037 1.074130 0.018265 58.809 < 2e-16
## as.factor(STARS)1.25934039859254 1.191989 0.019240 61.953 < 2e-16
## as.factor(STARS)2.10212822222303 1.312474 0.024338 53.926 < 2e-16
##
## (Intercept)
## VolatileAcidity ***
## Chlorides *
## FreeSulfurDioxide *
## TotalSulfurDioxide **
## Sulphates *
## Alcohol **
## as.factor(LabelAppeal)-1.11204793733397 ***
## as.factor(LabelAppeal)0.0101741115806247 ***
## as.factor(LabelAppeal)1.13239616049522 ***
## as.factor(LabelAppeal)2.25461820940981 ***
## as.factor(AcidIndex)-3.59682937695875
## as.factor(AcidIndex)-1.79176983045029
## as.factor(AcidIndex)-0.545318540973785
## as.factor(AcidIndex)0.362910765511677
## as.factor(AcidIndex)1.05172974217783
## as.factor(AcidIndex)1.59059728918163
## as.factor(AcidIndex)2.02271372429848 *
## as.factor(AcidIndex)2.37629509167962 *
## as.factor(AcidIndex)2.67051656830802 *
## as.factor(AcidIndex)2.9188445277671 *
## as.factor(AcidIndex)3.13100139587667
## as.factor(AcidIndex)3.31417429494859 .
## as.factor(AcidIndex)3.47378568897179 *
## as.factor(STARS)-0.42623524866846 ***
## as.factor(STARS)0.416552574962037 ***
## as.factor(STARS)1.25934039859254 ***
## as.factor(STARS)2.10212822222303 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 22861 on 12794 degrees of freedom
## Residual deviance: 13534 on 12767 degrees of freedom
## AIC: 45532
##
## Number of Fisher Scoring iterations: 6
# Evaluate Model 2 with testing data set
(pois2_eval <- model_perf_metrics(pois2, trainX, trainY, testX, testY))
## RMSE Rsquared MAE aic bic
## 2.591346e+00 5.210688e-01 2.226889e+00 4.553219e+04 4.574098e+04
poi2VarImp <- variableImportancePlot(pois2, "Poisson Model 2 Variable Importance")
Negative Binomial Model 1 (full model w/ 14 predictors)
nb1 <- glm.nb(TARGET ~ FixedAcidity + VolatileAcidity + CitricAcid + ResidualSugar +
Chlorides + FreeSulfurDioxide + TotalSulfurDioxide + Density +
pH + Sulphates + Alcohol +
as.factor(LabelAppeal) +
as.factor(AcidIndex) +
as.factor(STARS),
data=clean_df)
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
summary(nb1)
##
## Call:
## glm.nb(formula = TARGET ~ FixedAcidity + VolatileAcidity + CitricAcid +
## ResidualSugar + Chlorides + FreeSulfurDioxide + TotalSulfurDioxide +
## Density + pH + Sulphates + Alcohol + as.factor(LabelAppeal) +
## as.factor(AcidIndex) + as.factor(STARS), data = clean_df,
## init.theta = 40957.00204, link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.2219 -0.6496 -0.0055 0.4446 3.6790
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.0275198 0.3190721 0.086 0.93127
## FixedAcidity 0.0010176 0.0051829 0.196 0.84435
## VolatileAcidity -0.0231944 0.0051231 -4.527 5.97e-06
## CitricAcid 0.0039724 0.0050838 0.781 0.43458
## ResidualSugar 0.0005767 0.0052060 0.111 0.91179
## Chlorides -0.0122552 0.0052206 -2.347 0.01890
## FreeSulfurDioxide 0.0132531 0.0051832 2.557 0.01056
## TotalSulfurDioxide 0.0170166 0.0052484 3.242 0.00119
## Density -0.0074549 0.0050938 -1.464 0.14332
## pH -0.0066655 0.0051875 -1.285 0.19882
## Sulphates -0.0106282 0.0053149 -2.000 0.04554
## Alcohol 0.0157805 0.0052355 3.014 0.00258
## as.factor(LabelAppeal)-1.11204793733397 0.2398534 0.0380017 6.312 2.76e-10
## as.factor(LabelAppeal)0.0101741115806247 0.4300463 0.0370666 11.602 < 2e-16
## as.factor(LabelAppeal)1.13239616049522 0.5633460 0.0377142 14.937 < 2e-16
## as.factor(LabelAppeal)2.25461820940981 0.6992610 0.0424548 16.471 < 2e-16
## as.factor(AcidIndex)-3.59682937695875 -0.1651538 0.3226593 -0.512 0.60875
## as.factor(AcidIndex)-1.79176983045029 -0.1214774 0.3172467 -0.383 0.70179
## as.factor(AcidIndex)-0.545318540973785 -0.1558113 0.3169954 -0.492 0.62305
## as.factor(AcidIndex)0.362910765511677 -0.1871144 0.3170413 -0.590 0.55506
## as.factor(AcidIndex)1.05172974217783 -0.2972766 0.3173791 -0.937 0.34893
## as.factor(AcidIndex)1.59059728918163 -0.4542592 0.3184811 -1.426 0.15377
## as.factor(AcidIndex)2.02271372429848 -0.8158352 0.3220684 -2.533 0.01131
## as.factor(AcidIndex)2.37629509167962 -0.8303961 0.3277299 -2.534 0.01128
## as.factor(AcidIndex)2.67051656830802 -0.6688133 0.3306330 -2.023 0.04309
## as.factor(AcidIndex)2.9188445277671 -0.7687131 0.3432641 -2.239 0.02513
## as.factor(AcidIndex)3.13100139587667 -0.3297889 0.4038365 -0.817 0.41413
## as.factor(AcidIndex)3.31417429494859 -0.9814037 0.5484760 -1.789 0.07356
## as.factor(AcidIndex)3.47378568897179 -1.2022430 0.5486104 -2.191 0.02842
## as.factor(STARS)-0.42623524866846 0.7548590 0.0195728 38.567 < 2e-16
## as.factor(STARS)0.416552574962037 1.0732229 0.0182738 58.730 < 2e-16
## as.factor(STARS)1.25934039859254 1.1910222 0.0192473 61.880 < 2e-16
## as.factor(STARS)2.10212822222303 1.3117031 0.0243440 53.882 < 2e-16
##
## (Intercept)
## FixedAcidity
## VolatileAcidity ***
## CitricAcid
## ResidualSugar
## Chlorides *
## FreeSulfurDioxide *
## TotalSulfurDioxide **
## Density
## pH
## Sulphates *
## Alcohol **
## as.factor(LabelAppeal)-1.11204793733397 ***
## as.factor(LabelAppeal)0.0101741115806247 ***
## as.factor(LabelAppeal)1.13239616049522 ***
## as.factor(LabelAppeal)2.25461820940981 ***
## as.factor(AcidIndex)-3.59682937695875
## as.factor(AcidIndex)-1.79176983045029
## as.factor(AcidIndex)-0.545318540973785
## as.factor(AcidIndex)0.362910765511677
## as.factor(AcidIndex)1.05172974217783
## as.factor(AcidIndex)1.59059728918163
## as.factor(AcidIndex)2.02271372429848 *
## as.factor(AcidIndex)2.37629509167962 *
## as.factor(AcidIndex)2.67051656830802 *
## as.factor(AcidIndex)2.9188445277671 *
## as.factor(AcidIndex)3.13100139587667
## as.factor(AcidIndex)3.31417429494859 .
## as.factor(AcidIndex)3.47378568897179 *
## as.factor(STARS)-0.42623524866846 ***
## as.factor(STARS)0.416552574962037 ***
## as.factor(STARS)1.25934039859254 ***
## as.factor(STARS)2.10212822222303 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(40957) family taken to be 1)
##
## Null deviance: 22860 on 12794 degrees of freedom
## Residual deviance: 13529 on 12762 degrees of freedom
## AIC: 45540
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 40957
## Std. Err.: 34344
## Warning while fitting theta: iteration limit reached
##
## 2 x log-likelihood: -45472.13
(nb1_eval <- model_perf_metrics(nb1, trainX, trainY, testX, testY))
## RMSE Rsquared MAE aic bic
## 2.591223e+00 5.215761e-01 2.226684e+00 4.554013e+04 4.579366e+04
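The negative binomial fit is almost identical to the Poisson fit, and the very large theta (about 40957, with the iteration limit reached) points the same way: with theta that large, the negative binomial variance mu + mu^2/theta is practically the Poisson variance mu. For reference, the Poisson Model 2 summary reports a residual deviance of 13534 on 12767 degrees of freedom, a ratio of roughly 1.06, which suggests little overdispersion. A minimal sketch of such a dispersion check, on simulated Poisson data rather than the wine data (x, y and fit are made up):

```r
set.seed(621)  # arbitrary seed for the simulated data
n <- 2000
x <- rnorm(n)
y <- rpois(n, lambda = exp(1 + 0.3 * x))

fit <- glm(y ~ x, family = poisson)

# Pearson dispersion statistic: values near 1 are consistent with Poisson,
# values well above 1 suggest overdispersion
pearson_dispersion <- sum(residuals(fit, type = "pearson")^2) / df.residual(fit)

# Cruder check: residual deviance over residual degrees of freedom
deviance_ratio <- deviance(fit) / df.residual(fit)

c(pearson = pearson_dispersion, deviance = deviance_ratio)
```

When both ratios sit close to 1, a negative binomial model has little extra variance to absorb, which would be consistent with the theta warnings above.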
nb1VarImp <- variableImportancePlot(nb1, "Negative Binomial 1 Variable Importance")
Negative Binomial Model 2
We kept only the predictors that were statistically significant in Model 1.
Predictors include:
VolatileAcidity
Chlorides
FreeSulfurDioxide
TotalSulfurDioxide
Sulphates
Alcohol
LabelAppeal
AcidIndex
STARS
nb2 <- glm.nb(TARGET ~ VolatileAcidity + Chlorides + FreeSulfurDioxide + TotalSulfurDioxide + Sulphates +
Alcohol +
as.factor(LabelAppeal) +
as.factor(AcidIndex) +
as.factor(STARS),
data=clean_df)
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
summary(nb2)
##
## Call:
## glm.nb(formula = TARGET ~ VolatileAcidity + Chlorides + FreeSulfurDioxide +
## TotalSulfurDioxide + Sulphates + Alcohol + as.factor(LabelAppeal) +
## as.factor(AcidIndex) + as.factor(STARS), data = clean_df,
## init.theta = 40946.13456, link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.2356 -0.6510 -0.0058 0.4411 3.6661
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.019070 0.318742 0.060 0.95229
## VolatileAcidity -0.023413 0.005122 -4.571 4.85e-06
## Chlorides -0.012368 0.005217 -2.371 0.01776
## FreeSulfurDioxide 0.013274 0.005181 2.562 0.01041
## TotalSulfurDioxide 0.016844 0.005245 3.212 0.00132
## Sulphates -0.010623 0.005313 -2.000 0.04554
## Alcohol 0.016005 0.005233 3.059 0.00222
## as.factor(LabelAppeal)-1.11204793733397 0.239676 0.038000 6.307 2.84e-10
## as.factor(LabelAppeal)0.0101741115806247 0.429897 0.037066 11.598 < 2e-16
## as.factor(LabelAppeal)1.13239616049522 0.563394 0.037712 14.939 < 2e-16
## as.factor(LabelAppeal)2.25461820940981 0.698243 0.042449 16.449 < 2e-16
## as.factor(AcidIndex)-3.59682937695875 -0.158966 0.322411 -0.493 0.62198
## as.factor(AcidIndex)-1.79176983045029 -0.114392 0.316953 -0.361 0.71817
## as.factor(AcidIndex)-0.545318540973785 -0.148117 0.316669 -0.468 0.63998
## as.factor(AcidIndex)0.362910765511677 -0.178811 0.316700 -0.565 0.57234
## as.factor(AcidIndex)1.05172974217783 -0.288410 0.317003 -0.910 0.36293
## as.factor(AcidIndex)1.59059728918163 -0.445160 0.318080 -1.400 0.16166
## as.factor(AcidIndex)2.02271372429848 -0.806699 0.321640 -2.508 0.01214
## as.factor(AcidIndex)2.37629509167962 -0.820547 0.327305 -2.507 0.01218
## as.factor(AcidIndex)2.67051656830802 -0.658978 0.330196 -1.996 0.04596
## as.factor(AcidIndex)2.9188445277671 -0.759701 0.342789 -2.216 0.02668
## as.factor(AcidIndex)3.13100139587667 -0.315502 0.403519 -0.782 0.43429
## as.factor(AcidIndex)3.31417429494859 -0.969370 0.548038 -1.769 0.07693
## as.factor(AcidIndex)3.47378568897179 -1.203547 0.548095 -2.196 0.02810
## as.factor(STARS)-0.42623524866846 0.755287 0.019570 38.593 < 2e-16
## as.factor(STARS)0.416552574962037 1.074130 0.018265 58.808 < 2e-16
## as.factor(STARS)1.25934039859254 1.191989 0.019241 61.951 < 2e-16
## as.factor(STARS)2.10212822222303 1.312476 0.024340 53.924 < 2e-16
##
## (Intercept)
## VolatileAcidity ***
## Chlorides *
## FreeSulfurDioxide *
## TotalSulfurDioxide **
## Sulphates *
## Alcohol **
## as.factor(LabelAppeal)-1.11204793733397 ***
## as.factor(LabelAppeal)0.0101741115806247 ***
## as.factor(LabelAppeal)1.13239616049522 ***
## as.factor(LabelAppeal)2.25461820940981 ***
## as.factor(AcidIndex)-3.59682937695875
## as.factor(AcidIndex)-1.79176983045029
## as.factor(AcidIndex)-0.545318540973785
## as.factor(AcidIndex)0.362910765511677
## as.factor(AcidIndex)1.05172974217783
## as.factor(AcidIndex)1.59059728918163
## as.factor(AcidIndex)2.02271372429848 *
## as.factor(AcidIndex)2.37629509167962 *
## as.factor(AcidIndex)2.67051656830802 *
## as.factor(AcidIndex)2.9188445277671 *
## as.factor(AcidIndex)3.13100139587667
## as.factor(AcidIndex)3.31417429494859 .
## as.factor(AcidIndex)3.47378568897179 *
## as.factor(STARS)-0.42623524866846 ***
## as.factor(STARS)0.416552574962037 ***
## as.factor(STARS)1.25934039859254 ***
## as.factor(STARS)2.10212822222303 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(40946.13) family taken to be 1)
##
## Null deviance: 22860 on 12794 degrees of freedom
## Residual deviance: 13534 on 12767 degrees of freedom
## AIC: 45535
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 40946
## Std. Err.: 34332
## Warning while fitting theta: iteration limit reached
##
## 2 x log-likelihood: -45476.61
(nb2_eval <- model_perf_metrics(nb2, trainX, trainY, testX, testY))
## RMSE Rsquared MAE aic bic
## 2.591346e+00 5.210685e-01 2.226888e+00 4.553461e+04 4.575086e+04
nb2VarImp <- variableImportancePlot(nb2, "Negative Binomial 2 Variable Importance")
Linear Model 1 (full model w/ 14 predictors)
lm1 <- lm(TARGET ~ FixedAcidity + VolatileAcidity + CitricAcid + ResidualSugar +
Chlorides + FreeSulfurDioxide + TotalSulfurDioxide + Density +
pH + Sulphates + Alcohol +
as.factor(LabelAppeal) +
as.factor(AcidIndex) +
as.factor(STARS),
data=clean_df)
summary(lm1)
##
## Call:
## lm(formula = TARGET ~ FixedAcidity + VolatileAcidity + CitricAcid +
## ResidualSugar + Chlorides + FreeSulfurDioxide + TotalSulfurDioxide +
## Density + pH + Sulphates + Alcohol + as.factor(LabelAppeal) +
## as.factor(AcidIndex) + as.factor(STARS), data = clean_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.9635 -0.8591 0.0325 0.8384 6.0750
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.995095 0.755491 1.317 0.18781
## FixedAcidity 0.004930 0.011720 0.421 0.67401
## VolatileAcidity -0.073667 0.011565 -6.370 1.96e-10
## CitricAcid 0.014455 0.011558 1.251 0.21109
## ResidualSugar 0.004027 0.011782 0.342 0.73251
## Chlorides -0.038485 0.011776 -3.268 0.00109
## FreeSulfurDioxide 0.039825 0.011786 3.379 0.00073
## TotalSulfurDioxide 0.049447 0.011816 4.185 2.87e-05
## Density -0.022475 0.011545 -1.947 0.05160
## pH -0.018619 0.011704 -1.591 0.11167
## Sulphates -0.028248 0.012015 -2.351 0.01873
## Alcohol 0.050812 0.011830 4.295 1.76e-05
## as.factor(LabelAppeal)-1.11204793733397 0.367639 0.062729 5.861 4.72e-09
## as.factor(LabelAppeal)0.0101741115806247 0.835185 0.061168 13.654 < 2e-16
## as.factor(LabelAppeal)1.13239616049522 1.302062 0.063917 20.371 < 2e-16
## as.factor(LabelAppeal)2.25461820940981 1.889951 0.084169 22.454 < 2e-16
## as.factor(AcidIndex)-3.59682937695875 -0.334854 0.767938 -0.436 0.66281
## as.factor(AcidIndex)-1.79176983045029 -0.220221 0.754101 -0.292 0.77027
## as.factor(AcidIndex)-0.545318540973785 -0.322349 0.753472 -0.428 0.66879
## as.factor(AcidIndex)0.362910765511677 -0.429363 0.753539 -0.570 0.56883
## as.factor(AcidIndex)1.05172974217783 -0.732560 0.754117 -0.971 0.33136
## as.factor(AcidIndex)1.59059728918163 -1.041297 0.755385 -1.378 0.16807
## as.factor(AcidIndex)2.02271372429848 -1.513113 0.757752 -1.997 0.04586
## as.factor(AcidIndex)2.37629509167962 -1.533481 0.762156 -2.012 0.04424
## as.factor(AcidIndex)2.67051656830802 -1.552014 0.769606 -2.017 0.04375
## as.factor(AcidIndex)2.9188445277671 -1.400154 0.777299 -1.801 0.07168
## as.factor(AcidIndex)3.13100139587667 -0.692206 0.883131 -0.784 0.43317
## as.factor(AcidIndex)3.31417429494859 -1.772148 0.952843 -1.860 0.06293
## as.factor(AcidIndex)3.47378568897179 -1.920432 0.900840 -2.132 0.03304
## as.factor(STARS)-0.42623524866846 1.346560 0.032920 40.904 < 2e-16
## as.factor(STARS)0.416552574962037 2.381720 0.032021 74.381 < 2e-16
## as.factor(STARS)1.25934039859254 2.942287 0.037079 79.352 < 2e-16
## as.factor(STARS)2.10212822222303 3.629958 0.059150 61.368 < 2e-16
##
## (Intercept)
## FixedAcidity
## VolatileAcidity ***
## CitricAcid
## ResidualSugar
## Chlorides **
## FreeSulfurDioxide ***
## TotalSulfurDioxide ***
## Density .
## pH
## Sulphates *
## Alcohol ***
## as.factor(LabelAppeal)-1.11204793733397 ***
## as.factor(LabelAppeal)0.0101741115806247 ***
## as.factor(LabelAppeal)1.13239616049522 ***
## as.factor(LabelAppeal)2.25461820940981 ***
## as.factor(AcidIndex)-3.59682937695875
## as.factor(AcidIndex)-1.79176983045029
## as.factor(AcidIndex)-0.545318540973785
## as.factor(AcidIndex)0.362910765511677
## as.factor(AcidIndex)1.05172974217783
## as.factor(AcidIndex)1.59059728918163
## as.factor(AcidIndex)2.02271372429848 *
## as.factor(AcidIndex)2.37629509167962 *
## as.factor(AcidIndex)2.67051656830802 *
## as.factor(AcidIndex)2.9188445277671 .
## as.factor(AcidIndex)3.13100139587667
## as.factor(AcidIndex)3.31417429494859 .
## as.factor(AcidIndex)3.47378568897179 *
## as.factor(STARS)-0.42623524866846 ***
## as.factor(STARS)0.416552574962037 ***
## as.factor(STARS)1.25934039859254 ***
## as.factor(STARS)2.10212822222303 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.302 on 12762 degrees of freedom
## Multiple R-squared: 0.5441, Adjusted R-squared: 0.5429
## F-statistic: 475.9 on 32 and 12762 DF, p-value: < 2.2e-16
(lm1_eval <- model_perf_metrics(lm1, trainX, trainY, testX, testY))
## RMSE Rsquared MAE aic bic
## 1.297132e+00 5.467544e-01 1.014874e+00 4.310630e+04 4.335983e+04
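The helper `model_perf_metrics` is defined elsewhere in this report, so its internals are not visible here. As a rough sketch only, the first three columns could be computed as follows, assuming RMSE, Rsquared and MAE follow their usual definitions (with Rsquared taken as the squared correlation between observed and predicted values). The function name `basic_metrics` and the built-in `mtcars` data are illustrative assumptions, not the report's own code.

```r
# Hypothetical stand-in for the first three columns of model_perf_metrics().
basic_metrics <- function(obs, pred) {
  c(RMSE     = sqrt(mean((obs - pred)^2)),  # root mean squared error
    Rsquared = cor(obs, pred)^2,            # squared correlation
    MAE      = mean(abs(obs - pred)))       # mean absolute error
}

# Illustration on a built-in dataset (not the wine data).
fit <- lm(mpg ~ wt + hp, data = mtcars)
m <- basic_metrics(mtcars$mpg, predict(fit))
print(m)
```

For an ordinary least-squares fit with an intercept, the in-sample squared correlation equals the Multiple R-squared reported by `summary()`, which is a quick way to sanity-check such a helper.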
lm1VarImp <- variableImportancePlot(lm1, "Linear Model 1 Variable Importance")
Linear Model 2
For this linear model, we use stepAIC to perform stepwise variable selection in both directions, starting from the full model lm1 and keeping the candidate with the lowest AIC.
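# Stepwise selection by AIC; the search is bounded by lm1 (upper) and the intercept-only model (lower)
lm2 <- stepAIC(lm1, direction = "both",
               scope = list(upper = lm1, lower = ~ 1),
               scale = 0, trace = FALSE)
summary(lm2)
##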
## Call:
## lm(formula = TARGET ~ VolatileAcidity + Chlorides + FreeSulfurDioxide +
## TotalSulfurDioxide + Density + pH + Sulphates + Alcohol +
## as.factor(LabelAppeal) + as.factor(AcidIndex) + as.factor(STARS),
## data = clean_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.9616 -0.8590 0.0352 0.8399 6.0675
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.99019 0.75488 1.312 0.189639
## VolatileAcidity -0.07393 0.01156 -6.394 1.67e-10
## Chlorides -0.03864 0.01177 -3.282 0.001034
## FreeSulfurDioxide 0.04009 0.01178 3.403 0.000669
## TotalSulfurDioxide 0.04960 0.01181 4.200 2.69e-05
## Density -0.02270 0.01154 -1.967 0.049224
## pH -0.01862 0.01170 -1.591 0.111585
## Sulphates -0.02837 0.01201 -2.362 0.018191
## Alcohol 0.05100 0.01183 4.313 1.62e-05
## as.factor(LabelAppeal)-1.11204793733397 0.36722 0.06272 5.854 4.90e-09
## as.factor(LabelAppeal)0.0101741115806247 0.83483 0.06116 13.649 < 2e-16
## as.factor(LabelAppeal)1.13239616049522 1.30161 0.06391 20.367 < 2e-16
## as.factor(LabelAppeal)2.25461820940981 1.88998 0.08416 22.456 < 2e-16
## as.factor(AcidIndex)-3.59682937695875 -0.33511 0.76744 -0.437 0.662366
## as.factor(AcidIndex)-1.79176983045029 -0.21824 0.75353 -0.290 0.772104
## as.factor(AcidIndex)-0.545318540973785 -0.31837 0.75287 -0.423 0.672396
## as.factor(AcidIndex)0.362910765511677 -0.42442 0.75293 -0.564 0.572976
## as.factor(AcidIndex)1.05172974217783 -0.72561 0.75343 -0.963 0.335525
## as.factor(AcidIndex)1.59059728918163 -1.03390 0.75463 -1.370 0.170687
## as.factor(AcidIndex)2.02271372429848 -1.50407 0.75695 -1.987 0.046941
## as.factor(AcidIndex)2.37629509167962 -1.52217 0.76131 -1.999 0.045584
## as.factor(AcidIndex)2.67051656830802 -1.54018 0.76867 -2.004 0.045124
## as.factor(AcidIndex)2.9188445277671 -1.38512 0.77635 -1.784 0.074426
## as.factor(AcidIndex)3.13100139587667 -0.67937 0.88243 -0.770 0.441383
## as.factor(AcidIndex)3.31417429494859 -1.75343 0.95178 -1.842 0.065461
## as.factor(AcidIndex)3.47378568897179 -1.89498 0.89978 -2.106 0.035220
## as.factor(STARS)-0.42623524866846 1.34682 0.03291 40.918 < 2e-16
## as.factor(STARS)0.416552574962037 2.38256 0.03201 74.442 < 2e-16
## as.factor(STARS)1.25934039859254 2.94276 0.03707 79.374 < 2e-16
## as.factor(STARS)2.10212822222303 3.63105 0.05914 61.397 < 2e-16
##
## (Intercept)
## VolatileAcidity ***
## Chlorides **
## FreeSulfurDioxide ***
## TotalSulfurDioxide ***
## Density *
## pH
## Sulphates *
## Alcohol ***
## as.factor(LabelAppeal)-1.11204793733397 ***
## as.factor(LabelAppeal)0.0101741115806247 ***
## as.factor(LabelAppeal)1.13239616049522 ***
## as.factor(LabelAppeal)2.25461820940981 ***
## as.factor(AcidIndex)-3.59682937695875
## as.factor(AcidIndex)-1.79176983045029
## as.factor(AcidIndex)-0.545318540973785
## as.factor(AcidIndex)0.362910765511677
## as.factor(AcidIndex)1.05172974217783
## as.factor(AcidIndex)1.59059728918163
## as.factor(AcidIndex)2.02271372429848 *
## as.factor(AcidIndex)2.37629509167962 *
## as.factor(AcidIndex)2.67051656830802 *
## as.factor(AcidIndex)2.9188445277671 .
## as.factor(AcidIndex)3.13100139587667
## as.factor(AcidIndex)3.31417429494859 .
## as.factor(AcidIndex)3.47378568897179 *
## as.factor(STARS)-0.42623524866846 ***
## as.factor(STARS)0.416552574962037 ***
## as.factor(STARS)1.25934039859254 ***
## as.factor(STARS)2.10212822222303 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.302 on 12765 degrees of freedom
## Multiple R-squared: 0.544, Adjusted R-squared: 0.543
## F-statistic: 525.1 on 29 and 12765 DF, p-value: < 2.2e-16
(lm2_eval <- model_perf_metrics(lm2, trainX, trainY, testX, testY))
## RMSE Rsquared MAE aic bic
## 1.297291e+00 5.466431e-01 1.014944e+00 4.310216e+04 4.333332e+04
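In the output above, the stepwise search dropped FixedAcidity, CitricAcid and ResidualSugar from lm1, which lowered the AIC slightly (43102.16 versus 43106.30) while leaving RMSE almost unchanged. The same call pattern can be tried on a small built-in dataset. The sketch below uses `mtcars` and an arbitrary set of predictors purely for illustration, and passes the scope as formulas; it is not the wine model.

```r
library(MASS)  # provides stepAIC()

# Full model on a built-in dataset (illustration only, not the wine data).
full <- lm(mpg ~ wt + hp + qsec + drat + disp, data = mtcars)

# Search in both directions, never going beyond the full model
# or below the intercept-only model.
reduced <- stepAIC(full, direction = "both",
                   scope = list(upper = formula(full), lower = ~ 1),
                   trace = FALSE)

# Compare the two fits; the selected model should not have a higher AIC.
print(c(full = AIC(full), reduced = AIC(reduced)))
print(formula(reduced))
```

Because every candidate is fitted to the same rows, ranking models by `AIC()` gives the same ordering as the criterion used inside the search.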
lm2VarImp <- variableImportancePlot(lm2, "Linear Model 2 Variable Importance")
Zero-inflated Poisson
# Count model terms go to the left of "|"; the zero-inflation (logit) model uses STARS only
ziPois <- pscl::zeroinfl(TARGET ~ FixedAcidity + VolatileAcidity + CitricAcid + ResidualSugar +
                           Chlorides + FreeSulfurDioxide + TotalSulfurDioxide + Density +
                           pH + Sulphates + Alcohol +
                           as.factor(LabelAppeal) +
                           as.factor(AcidIndex) | STARS,
                         data = clean_df,
                         dist = "poisson",
                         model = TRUE
)
summary(ziPois)
##
## Call:
## pscl::zeroinfl(formula = TARGET ~ FixedAcidity + VolatileAcidity + CitricAcid +
## ResidualSugar + Chlorides + FreeSulfurDioxide + TotalSulfurDioxide +
## Density + pH + Sulphates + Alcohol + as.factor(LabelAppeal) + as.factor(AcidIndex) |
## STARS, data = clean_df, dist = "poisson", model = TRUE)
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## -2.33788 -0.43775 0.06112 0.41658 2.78301
##
## Count model coefficients (poisson with log link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.5761849 0.3224912 1.787 0.0740
## FixedAcidity 0.0031044 0.0052893 0.587 0.5573
## VolatileAcidity -0.0132550 0.0052469 -2.526 0.0115
## CitricAcid 0.0007563 0.0051733 0.146 0.8838
## ResidualSugar -0.0004410 0.0053152 -0.083 0.9339
## Chlorides -0.0079476 0.0053495 -1.486 0.1374
## FreeSulfurDioxide 0.0023806 0.0052703 0.452 0.6515
## TotalSulfurDioxide -0.0019097 0.0053409 -0.358 0.7207
## Density -0.0077893 0.0052460 -1.485 0.1376
## pH 0.0026663 0.0053087 0.502 0.6155
## Sulphates -0.0022225 0.0054476 -0.408 0.6833
## Alcohol 0.0336745 0.0052965 6.358 2.05e-10
## as.factor(LabelAppeal)-1.11204793733397 0.4215144 0.0389940 10.810 < 2e-16
## as.factor(LabelAppeal)0.0101741115806247 0.7405092 0.0378355 19.572 < 2e-16
## as.factor(LabelAppeal)1.13239616049522 0.9780913 0.0382070 25.600 < 2e-16
## as.factor(LabelAppeal)2.25461820940981 1.1707270 0.0427426 27.390 < 2e-16
## as.factor(AcidIndex)-3.59682937695875 0.0163092 0.3260926 0.050 0.9601
## as.factor(AcidIndex)-1.79176983045029 0.0634500 0.3206707 0.198 0.8431
## as.factor(AcidIndex)-0.545318540973785 0.0364686 0.3204104 0.114 0.9094
## as.factor(AcidIndex)0.362910765511677 0.0141717 0.3204609 0.044 0.9647
## as.factor(AcidIndex)1.05172974217783 -0.0326591 0.3208062 -0.102 0.9189
## as.factor(AcidIndex)1.59059728918163 -0.1262213 0.3220907 -0.392 0.6951
## as.factor(AcidIndex)2.02271372429848 -0.2064558 0.3276252 -0.630 0.5286
## as.factor(AcidIndex)2.37629509167962 -0.1459902 0.3357785 -0.435 0.6637
## as.factor(AcidIndex)2.67051656830802 0.0032496 0.3384153 0.010 0.9923
## as.factor(AcidIndex)2.9188445277671 -0.0843479 0.3613793 -0.233 0.8154
## as.factor(AcidIndex)3.13100139587667 0.0405826 0.4187399 0.097 0.9228
## as.factor(AcidIndex)3.31417429494859 0.2528954 0.6182782 0.409 0.6825
## as.factor(AcidIndex)3.47378568897179 -0.1756369 0.6094718 -0.288 0.7732
##
## (Intercept) .
## FixedAcidity
## VolatileAcidity *
## CitricAcid
## ResidualSugar
## Chlorides
## FreeSulfurDioxide
## TotalSulfurDioxide
## Density
## pH
## Sulphates
## Alcohol ***
## as.factor(LabelAppeal)-1.11204793733397 ***
## as.factor(LabelAppeal)0.0101741115806247 ***
## as.factor(LabelAppeal)1.13239616049522 ***
## as.factor(LabelAppeal)2.25461820940981 ***
## as.factor(AcidIndex)-3.59682937695875
## as.factor(AcidIndex)-1.79176983045029
## as.factor(AcidIndex)-0.545318540973785
## as.factor(AcidIndex)0.362910765511677
## as.factor(AcidIndex)1.05172974217783
## as.factor(AcidIndex)1.59059728918163
## as.factor(AcidIndex)2.02271372429848
## as.factor(AcidIndex)2.37629509167962
## as.factor(AcidIndex)2.67051656830802
## as.factor(AcidIndex)2.9188445277671
## as.factor(AcidIndex)3.13100139587667
## as.factor(AcidIndex)3.31417429494859
## as.factor(AcidIndex)3.47378568897179
##
## Zero-inflation model coefficients (binomial with logit link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.91313 0.06701 -43.47 <2e-16 ***
## STARS -2.62665 0.06167 -42.59 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Number of iterations in BFGS optimization: 36
## Log-likelihood: -2.098e+04 on 31 Df
(ziPois_eval <- model_perf_metrics(ziPois, trainX, trainY, testX, testY))
## RMSE Rsquared MAE aic bic
## 1.364038e+00 5.006595e-01 1.045664e+00 4.202855e+04 4.225971e+04
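The two parts of the zero-inflated fit are read on different scales: the zero-inflation coefficients are on the logit scale, and the count coefficients are on the log scale. The sketch below converts a few of the printed coefficients. It assumes that STARS enters the zero-inflation part on the same standardized scale that appears in the `as.factor(STARS)` labels of the linear models, and it infers the lowest STARS value from the equal spacing of the printed levels; both points are assumptions rather than facts shown in the output.

```r
# Coefficients copied from the summary(ziPois) output above.
zi_intercept <- -2.91313
zi_stars     <- -2.62665
b_alcohol    <- 0.0336745
b_label_top  <- 1.1707270   # as.factor(LabelAppeal)2.25461820940981

# Standardized STARS values taken from the factor labels in the linear models.
# The lowest value is not printed in any summary; it is inferred here from the
# equal spacing of the printed levels (an assumption).
stars_shown <- c(-0.42623524866846, 0.416552574962037,
                 1.25934039859254, 2.10212822222303)
stars_all <- c(stars_shown[1] - diff(stars_shown)[1], stars_shown)

# Zero-inflation part: logit link, so plogis() gives the excess-zero probability.
p_zero <- plogis(zi_intercept + zi_stars * stars_all)

# Count part: log link, so exp(coefficient) is a multiplicative effect.
rate_ratio <- exp(c(Alcohol = b_alcohol, LabelAppealTop = b_label_top))

print(round(p_zero, 4))
print(round(rate_ratio, 3))
```

Under these assumptions, the negative STARS coefficient means the estimated probability of an excess zero falls steadily as the rating rises.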
Zero-inflated Negative Binomial
# Same specification as ziPois, with a negative binomial count distribution
ziNB <- pscl::zeroinfl(TARGET ~ FixedAcidity + VolatileAcidity + CitricAcid + ResidualSugar +
                         Chlorides + FreeSulfurDioxide + TotalSulfurDioxide + Density +
                         pH + Sulphates + Alcohol +
                         as.factor(LabelAppeal) +
                         as.factor(AcidIndex) | STARS,
                       data = clean_df,
                       dist = "negbin",
                       model = TRUE
)
summary(ziNB)
##
## Call:
## pscl::zeroinfl(formula = TARGET ~ FixedAcidity + VolatileAcidity + CitricAcid +
## ResidualSugar + Chlorides + FreeSulfurDioxide + TotalSulfurDioxide +
## Density + pH + Sulphates + Alcohol + as.factor(LabelAppeal) + as.factor(AcidIndex) |
## STARS, data = clean_df, dist = "negbin", model = TRUE)
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## -2.33788 -0.43775 0.06112 0.41658 2.78301
##
## Count model coefficients (negbin with log link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.5763137 0.3224698 1.787 0.0739
## FixedAcidity 0.0031043 0.0052893 0.587 0.5573
## VolatileAcidity -0.0132554 0.0052469 -2.526 0.0115
## CitricAcid 0.0007564 0.0051733 0.146 0.8838
## ResidualSugar -0.0004406 0.0053152 -0.083 0.9339
## Chlorides -0.0079477 0.0053495 -1.486 0.1374
## FreeSulfurDioxide 0.0023806 0.0052703 0.452 0.6515
## TotalSulfurDioxide -0.0019095 0.0053409 -0.358 0.7207
## Density -0.0077897 0.0052460 -1.485 0.1376
## pH 0.0026663 0.0053087 0.502 0.6155
## Sulphates -0.0022225 0.0054476 -0.408 0.6833
## Alcohol 0.0336745 0.0052965 6.358 2.05e-10
## as.factor(LabelAppeal)-1.11204793733397 0.4215149 0.0389940 10.810 < 2e-16
## as.factor(LabelAppeal)0.0101741115806247 0.7405098 0.0378355 19.572 < 2e-16
## as.factor(LabelAppeal)1.13239616049522 0.9780918 0.0382070 25.600 < 2e-16
## as.factor(LabelAppeal)2.25461820940981 1.1707275 0.0427426 27.390 < 2e-16
## as.factor(AcidIndex)-3.59682937695875 0.0161787 0.3260715 0.050 0.9604
## as.factor(AcidIndex)-1.79176983045029 0.0633208 0.3206492 0.197 0.8435
## as.factor(AcidIndex)-0.545318540973785 0.0363393 0.3203889 0.113 0.9097
## as.factor(AcidIndex)0.362910765511677 0.0140424 0.3204394 0.044 0.9650
## as.factor(AcidIndex)1.05172974217783 -0.0327884 0.3207847 -0.102 0.9186
## as.factor(AcidIndex)1.59059728918163 -0.1263506 0.3220693 -0.392 0.6948
## as.factor(AcidIndex)2.02271372429848 -0.2065840 0.3276042 -0.631 0.5283
## as.factor(AcidIndex)2.37629509167962 -0.1461198 0.3357580 -0.435 0.6634
## as.factor(AcidIndex)2.67051656830802 0.0031210 0.3383949 0.009 0.9926
## as.factor(AcidIndex)2.9188445277671 -0.0844816 0.3613605 -0.234 0.8151
## as.factor(AcidIndex)3.13100139587667 0.0404510 0.4187236 0.097 0.9230
## as.factor(AcidIndex)3.31417429494859 0.2526745 0.6183067 0.409 0.6828
## as.factor(AcidIndex)3.47378568897179 -0.1757520 0.6094565 -0.288 0.7731
## Log(theta) 17.8553302 13.4640303 1.326 0.1848
##
## (Intercept) .
## FixedAcidity
## VolatileAcidity *
## CitricAcid
## ResidualSugar
## Chlorides
## FreeSulfurDioxide
## TotalSulfurDioxide
## Density
## pH
## Sulphates
## Alcohol ***
## as.factor(LabelAppeal)-1.11204793733397 ***
## as.factor(LabelAppeal)0.0101741115806247 ***
## as.factor(LabelAppeal)1.13239616049522 ***
## as.factor(LabelAppeal)2.25461820940981 ***
## as.factor(AcidIndex)-3.59682937695875
## as.factor(AcidIndex)-1.79176983045029
## as.factor(AcidIndex)-0.545318540973785
## as.factor(AcidIndex)0.362910765511677
## as.factor(AcidIndex)1.05172974217783
## as.factor(AcidIndex)1.59059728918163
## as.factor(AcidIndex)2.02271372429848
## as.factor(AcidIndex)2.37629509167962
## as.factor(AcidIndex)2.67051656830802
## as.factor(AcidIndex)2.9188445277671
## as.factor(AcidIndex)3.13100139587667
## as.factor(AcidIndex)3.31417429494859
## as.factor(AcidIndex)3.47378568897179
## Log(theta)
##
## Zero-inflation model coefficients (binomial with logit link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.91313 0.06701 -43.47 <2e-16 ***
## STARS -2.62665 0.06167 -42.59 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Theta = 56816092.9572
## Number of iterations in BFGS optimization: 37
## Log-likelihood: -2.098e+04 on 32 Df
(ziNB_eval <- model_perf_metrics(ziNB, trainX, trainY, testX, testY))
## RMSE Rsquared MAE aic bic
## 1.364038e+00 5.006596e-01 1.045664e+00 4.203055e+04 4.226917e+04
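The negative binomial fit reports a very large Theta, and its coefficients, log-likelihood and error metrics match the zero-inflated Poisson fit almost exactly. A short check of why this happens: the negative binomial variance is mu + mu^2 / theta, so as theta grows the distribution approaches the Poisson. The mean `mu = 3` below is an illustrative value close to the mean of TARGET, not a fitted quantity.

```r
# Values copied from the summary(ziNB) output above.
log_theta <- 17.8553302
theta <- exp(log_theta)

# Negative binomial variance is mu + mu^2 / theta, so a huge theta
# leaves essentially no extra dispersion beyond the Poisson variance mu.
mu <- 3
extra_var <- mu^2 / theta

# The two probability mass functions are then almost indistinguishable.
k <- 0:8
max_diff <- max(abs(dnbinom(k, size = theta, mu = mu) - dpois(k, lambda = mu)))

# AIC gap between the two zero-inflated fits, from the reported metrics.
aic_gap <- 42030.55 - 42028.55

print(c(theta = theta, extra_var = extra_var, max_diff = max_diff, aic_gap = aic_gap))
```

The AIC gap of 2 is what one extra parameter costs when the log-likelihood does not improve, which suggests the additional dispersion parameter adds nothing for these data.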
Model Selection
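Both linear models (full and reduced) have the lowest RMSE and MAE and the highest R-squared of the eight candidates, so either one is a reasonable final choice. Their rows are highlighted in the table below.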
models_summary <- rbind(pois1_eval, pois2_eval, nb1_eval, nb2_eval, lm1_eval, lm2_eval, ziPois_eval, ziNB_eval)
kable(models_summary) %>%
  kable_paper(full_width = FALSE) %>%
  column_spec(1, bold = TRUE, border_right = TRUE) %>%
  row_spec(5:6, bold = TRUE, color = "white", background = "purple")

| Model | RMSE | Rsquared | MAE | aic | bic |
|---|---|---|---|---|---|
| pois1_eval | 2.593116 | 0.5059560 | 2.228388 | 36435.92 | 36674.63 |
| pois2_eval | 2.591346 | 0.5210688 | 2.226889 | 45532.19 | 45740.98 |
| nb1_eval | 2.591223 | 0.5215761 | 2.226684 | 45540.13 | 45793.66 |
| nb2_eval | 2.591346 | 0.5210685 | 2.226888 | 45534.61 | 45750.86 |
| lm1_eval | 1.297132 | 0.5467544 | 1.014874 | 43106.30 | 43359.83 |
| lm2_eval | 1.297291 | 0.5466431 | 1.014944 | 43102.16 | 43333.32 |
| ziPois_eval | 1.364038 | 0.5006595 | 1.045664 | 42028.55 | 42259.71 |
| ziNB_eval | 1.364038 | 0.5006596 | 1.045664 | 42030.55 | 42269.17 |
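The comparison in the table can also be done programmatically. The sketch below rebuilds the table from the printed values and picks the best row per metric, treating lower as better for RMSE, MAE, aic and bic and higher as better for Rsquared. The object names `models` and `best` are illustrative.

```r
# Metrics copied from the summary table above.
models <- data.frame(
  RMSE     = c(2.593116, 2.591346, 2.591223, 2.591346,
               1.297132, 1.297291, 1.364038, 1.364038),
  Rsquared = c(0.5059560, 0.5210688, 0.5215761, 0.5210685,
               0.5467544, 0.5466431, 0.5006595, 0.5006596),
  MAE      = c(2.228388, 2.226889, 2.226684, 2.226888,
               1.014874, 1.014944, 1.045664, 1.045664),
  aic      = c(36435.92, 45532.19, 45540.13, 45534.61,
               43106.30, 43102.16, 42028.55, 42030.55),
  bic      = c(36674.63, 45740.98, 45793.66, 45750.86,
               43359.83, 43333.32, 42259.71, 42269.17),
  row.names = c("pois1_eval", "pois2_eval", "nb1_eval", "nb2_eval",
                "lm1_eval", "lm2_eval", "ziPois_eval", "ziNB_eval")
)

# Best model per metric: smallest error or criterion, largest R-squared.
best <- c(
  RMSE     = rownames(models)[which.min(models$RMSE)],
  Rsquared = rownames(models)[which.max(models$Rsquared)],
  MAE      = rownames(models)[which.min(models$MAE)],
  aic      = rownames(models)[which.min(models$aic)],
  bic      = rownames(models)[which.min(models$bic)]
)
print(best)
```

The criteria do not all point the same way: the error metrics favour the linear models, while the smallest aic and bic values in the table belong to pois1_eval, so the choice depends on which criterion is given priority.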
Variable Importance
In the variable importance plots, the four most important features in every plotted model are the indicator levels of the STARS factor.
grid.arrange(poi1VarImp, poi2VarImp, nb1VarImp, nb2VarImp, lm1VarImp, lm2VarImp, ncol = 2)