DATA621 - HW5
Overview
In this homework assignment, you will explore, analyze and model a data set containing information on approximately 12,000 commercially available wines.
The variables are mostly related to the chemical properties of the wine being sold. The response variable is the number of sample cases of wine that were purchased by wine distribution companies after sampling a wine.
These cases would be used to provide tasting samples to restaurants and wine stores around the United States. The more sample cases purchased, the more likely a wine is to be sold at a high-end restaurant. A large wine manufacturer is studying the data in order to predict the number of wine cases ordered based upon the wine characteristics. If the wine manufacturer can predict the number of cases, then that manufacturer will be able to adjust its wine offering to maximize sales.
Your objective is to build a count regression model to predict the number of cases of wine that will be sold given certain properties of the wine. HINT: Sometimes, the fact that a variable is missing is actually predictive of the target. You can only use the variables given to you (or variables that you derive from the variables provided).
Load Libraries
library(tidyverse)
library(tidymodels)
library(ggplot2)
library(vip)
library(MASS) # note: MASS masks dplyr::select, hence the dplyr:: prefixes below
library(ModelMetrics)
library(pscl)
library(ggpubr)
library(kableExtra)
library(corrplot)
rm(list = ls())
Load data
We will load the data from the cloud server.
df_training <- read_csv('http://www.my-cunymsds.com/data621/wine-training-data.csv')
df_validation <- read_csv('http://www.my-cunymsds.com/data621/wine-evaluation-data.csv')
Exploratory Data Analysis (EDA)
Basic Checks
Let’s take a look at the general structure of the dataset.
df_training %>%
glimpse() %>%
summary()
## Rows: 12,795
## Columns: 16
## $ INDEX <dbl> 1, 2, 4, 5, 6, 7, 8, 11, 12, 13, 14, 15, 16, 17, 19…
## $ TARGET <dbl> 3, 3, 5, 3, 4, 0, 0, 4, 3, 6, 0, 4, 3, 7, 4, 0, 0, …
## $ FixedAcidity <dbl> 3.2, 4.5, 7.1, 5.7, 8.0, 11.3, 7.7, 6.5, 14.8, 5.5,…
## $ VolatileAcidity <dbl> 1.160, 0.160, 2.640, 0.385, 0.330, 0.320, 0.290, -1…
## $ CitricAcid <dbl> -0.98, -0.81, -0.88, 0.04, -1.26, 0.59, -0.40, 0.34…
## $ ResidualSugar <dbl> 54.20, 26.10, 14.80, 18.80, 9.40, 2.20, 21.50, 1.40…
## $ Chlorides <dbl> -0.567, -0.425, 0.037, -0.425, NA, 0.556, 0.060, 0.…
## $ FreeSulfurDioxide <dbl> NA, 15, 214, 22, -167, -37, 287, 523, -213, 62, 551…
## $ TotalSulfurDioxide <dbl> 268, -327, 142, 115, 108, 15, 156, 551, NA, 180, 65…
## $ Density <dbl> 0.99280, 1.02792, 0.99518, 0.99640, 0.99457, 0.9994…
## $ pH <dbl> 3.33, 3.38, 3.12, 2.24, 3.12, 3.20, 3.49, 3.20, 4.9…
## $ Sulphates <dbl> -0.59, 0.70, 0.48, 1.83, 1.77, 1.29, 1.21, NA, 0.26…
## $ Alcohol <dbl> 9.9, NA, 22.0, 6.2, 13.7, 15.4, 10.3, 11.6, 15.0, 1…
## $ LabelAppeal <dbl> 0, -1, -1, -1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 2, 0, 0, …
## $ AcidIndex <dbl> 8, 7, 8, 6, 9, 11, 8, 7, 6, 8, 5, 10, 7, 8, 9, 8, 9…
## $ STARS <dbl> 2, 3, 3, 1, 2, NA, NA, 3, NA, 4, 1, 2, 2, 3, NA, NA…
## INDEX TARGET FixedAcidity VolatileAcidity
## Min. : 1 Min. :0.000 Min. :-18.100 Min. :-2.7900
## 1st Qu.: 4038 1st Qu.:2.000 1st Qu.: 5.200 1st Qu.: 0.1300
## Median : 8110 Median :3.000 Median : 6.900 Median : 0.2800
## Mean : 8070 Mean :3.029 Mean : 7.076 Mean : 0.3241
## 3rd Qu.:12106 3rd Qu.:4.000 3rd Qu.: 9.500 3rd Qu.: 0.6400
## Max. :16129 Max. :8.000 Max. : 34.400 Max. : 3.6800
##
## CitricAcid ResidualSugar Chlorides FreeSulfurDioxide
## Min. :-3.2400 Min. :-127.800 Min. :-1.1710 Min. :-555.00
## 1st Qu.: 0.0300 1st Qu.: -2.000 1st Qu.:-0.0310 1st Qu.: 0.00
## Median : 0.3100 Median : 3.900 Median : 0.0460 Median : 30.00
## Mean : 0.3084 Mean : 5.419 Mean : 0.0548 Mean : 30.85
## 3rd Qu.: 0.5800 3rd Qu.: 15.900 3rd Qu.: 0.1530 3rd Qu.: 70.00
## Max. : 3.8600 Max. : 141.150 Max. : 1.3510 Max. : 623.00
## NA's :616 NA's :638 NA's :647
## TotalSulfurDioxide Density pH Sulphates
## Min. :-823.0 Min. :0.8881 Min. :0.480 Min. :-3.1300
## 1st Qu.: 27.0 1st Qu.:0.9877 1st Qu.:2.960 1st Qu.: 0.2800
## Median : 123.0 Median :0.9945 Median :3.200 Median : 0.5000
## Mean : 120.7 Mean :0.9942 Mean :3.208 Mean : 0.5271
## 3rd Qu.: 208.0 3rd Qu.:1.0005 3rd Qu.:3.470 3rd Qu.: 0.8600
## Max. :1057.0 Max. :1.0992 Max. :6.130 Max. : 4.2400
## NA's :682 NA's :395 NA's :1210
## Alcohol LabelAppeal AcidIndex STARS
## Min. :-4.70 Min. :-2.000000 Min. : 4.000 Min. :1.000
## 1st Qu.: 9.00 1st Qu.:-1.000000 1st Qu.: 7.000 1st Qu.:1.000
## Median :10.40 Median : 0.000000 Median : 8.000 Median :2.000
## Mean :10.49 Mean :-0.009066 Mean : 7.773 Mean :2.042
## 3rd Qu.:12.40 3rd Qu.: 1.000000 3rd Qu.: 8.000 3rd Qu.:3.000
## Max. :26.50 Max. : 2.000000 Max. :17.000 Max. :4.000
## NA's :653 NA's :3359
We can see all features are numeric, but we may need to convert some to factors. Also notice in the glimpse output that the data contain NA’s.
Let’s remove INDEX, since it is not needed for modeling.
df_training <- df_training %>%
dplyr::select(-INDEX)
df_validation <- df_validation %>%
dplyr::select(-IN) # the evaluation file names its index column IN
Let’s check for NA’s
colSums(is.na(df_training))
## TARGET FixedAcidity VolatileAcidity CitricAcid
## 0 0 0 0
## ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide
## 616 638 647 682
## Density pH Sulphates Alcohol
## 0 395 1210 653
## LabelAppeal AcidIndex STARS
## 0 0 3359
colSums(is.na(df_validation))
## TARGET FixedAcidity VolatileAcidity CitricAcid
## 3335 0 0 0
## ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide
## 168 138 152 157
## Density pH Sulphates Alcohol
## 0 104 310 185
## LabelAppeal AcidIndex STARS
## 0 0 841
We see that some features (ResidualSugar, Chlorides, FreeSulfurDioxide, pH and others) have NA’s; we will impute values as needed. TARGET is missing throughout the validation set, as expected, since that is what we must predict.
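Per the assignment hint, the fact that a value is missing can itself be predictive. A minimal sketch of how a missingness flag could be added (illustrative only; later we apply this idea by recoding missing STARS as 0):
# Hedged sketch: 0/1 indicator for whether STARS is missing
df_training %>%
mutate(STARS_missing = as.integer(is.na(STARS))) %>%
count(STARS_missing)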
Some visualizations
# boxplot
p1 <- df_training %>%
gather(na.rm = TRUE) %>%
ggplot(aes(factor(key), value)) +
geom_boxplot(outlier.colour = "#595dab", outlier.shape = 1, color = "#96eb28") +
scale_y_log10() + # note: non-positive values are dropped on a log scale
coord_flip() +
labs(title = "Boxplot of Chemical Properties of Wine", x = "Chemical Properties", y = "Values") +
theme_minimal()
ggarrange(p1)
The boxplot shows that some features (TotalSulfurDioxide, pH, FixedAcidity and Alcohol) present a large number of outliers.
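To put a rough number on "a large number of outliers," here is a quick, hedged IQR-rule count per numeric column (a heuristic, not a formal outlier test):
# Count values beyond 1.5 * IQR from the quartiles, per numeric column
df_training %>%
keep(is.numeric) %>%
summarise(across(everything(), ~ {
q <- quantile(.x, c(0.25, 0.75), na.rm = TRUE)
sum(.x < q[1] - 1.5 * (q[2] - q[1]) | .x > q[2] + 1.5 * (q[2] - q[1]), na.rm = TRUE)
}))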
Let’s take a look at the distribution densities of our variables.
df_training %>%
keep(is.numeric) %>% # Keep only numeric columns
gather() %>% # Convert to key-value pairs
ggplot(aes(value)) + # Plot the values
facet_wrap(~ key, scales = "free") + # In separate panels
geom_density(color="#96eb28") + # as density
theme_minimal()
The distributions look fairly normal, with the exceptions we noted in the boxplot.
Let’s take a look at correlations.
M <- cor(df_training[, sapply(df_training, is.numeric)], use = "complete.obs", method = "pearson")
M %>%
corrplot(type = "upper",
addCoef.col = TRUE,
method = "shade",
diag = TRUE, number.cex = 0.5, tl.cex = 0.7)
Aside from LabelAppeal, AcidIndex and STARS, almost no variable shows a meaningful correlation with TARGET.
Let’s take a look at these three features to get some insights.
p1 <- df_training %>%
dplyr::select(TARGET, STARS) %>%
mutate(STARS = as.factor(STARS),
TARGET = as.factor(TARGET)) %>%
ggplot(aes(STARS)) +
geom_bar(aes(fill = TARGET)) +
scale_fill_brewer(palette = "Spectral") +
theme_minimal()
p2 <- df_training %>%
dplyr::select(TARGET, LabelAppeal) %>%
mutate(LabelAppeal = as.factor(LabelAppeal),
TARGET = as.factor(TARGET)) %>%
ggplot(aes(LabelAppeal)) +
geom_bar(aes(fill = TARGET)) +
scale_fill_brewer(palette = "Spectral") +
theme_minimal()
p3 <- df_training %>%
dplyr::select(TARGET, AcidIndex) %>%
mutate(AcidIndex = as.factor(AcidIndex),
TARGET = as.factor(TARGET)) %>%
ggplot(aes(AcidIndex)) +
geom_bar(aes(fill = TARGET)) +
scale_fill_brewer(palette = "Spectral") +
theme_minimal()
ggarrange(p1, p2, p3, ncol = 3, nrow = 1, common.legend = TRUE)
A few important things to notice:
1- Many wines with 0 orders have STARS = NA (see the quick check below).
2- The star-rating distribution looks roughly normal, peaking at 3 stars for the larger order counts (4, 5, 6).
3- A LabelAppeal around zero seems to be driving many of the 0-order wines.
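To back up point 1, a quick hedged check of the zero-order rate by STARS missingness (run before the STARS recoding done later):
# Share of zero-order wines among those with and without a STARS rating
df_training %>%
mutate(stars_missing = is.na(STARS)) %>%
group_by(stars_missing) %>%
summarise(zero_order_rate = mean(TARGET == 0), n = n())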
Let’s look specifically at the TARGET variable.
# plot
p <- df_training %>%
ggplot( aes(x=TARGET)) +
geom_histogram(fill = "#96eb28", binwidth = 1, boundary = -0.5) +
#scale_x_continuous(trans='log10') +
ggtitle("Distribution of Target Amounts") +
theme_minimal()
p
There are many ZERO orders in the dataset. This will need special care, as many count models do not handle zero-inflated data well.
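As a quick sanity check on zero inflation, we can compare the observed share of zeros with the zero probability implied by a Poisson distribution with the same mean; a large gap indicates zero inflation:
# Observed proportion of zeros vs. Poisson-implied P(Y = 0) at the sample mean
mean(df_training$TARGET == 0)
dpois(0, lambda = mean(df_training$TARGET))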
Data engineering and transformations
Let’s replace all STARS = NA with 0, because we saw that many wines with a missing STARS rating had no orders; the missingness itself is informative.
df_training <- df_training %>%
mutate(STARS = replace(STARS, is.na(STARS) , 0))
df_validation <- df_validation %>%
mutate(STARS = replace(STARS, is.na(STARS), 0))
We will create a new feature, LabelAppeal2, which is the square of LabelAppeal. This is because we saw a higher number of 0 orders in the center of the distribution, where LabelAppeal = 0, suggesting a non-linear relationship.
df_training$LabelAppeal2 <- df_training$LabelAppeal^2
df_validation$LabelAppeal2 <- df_validation$LabelAppeal^2
Let’s convert some features from numeric to factor. We do this with STARS and LabelAppeal, since orders did not seem to increase linearly with these features.
df_training$LabelAppeal <- as.factor(df_training$LabelAppeal)
df_training$STARS <- as.factor(df_training$STARS)
df_training$TARGET <-as.integer(df_training$TARGET)
df_validation$LabelAppeal <- as.factor(df_validation$LabelAppeal)
df_validation$STARS <- as.factor(df_validation$STARS)
We will use tidymodels recipes for imputation and normalization. (A step_dummy() step is left commented out below, since lm() and glm() expand factor variables into dummies automatically.)
data_recipe <-
recipe(formula = TARGET ~ . , data = df_training) %>%
#step_rm(TARGET2) %>%
#update_role(TARGET2, new_role = "id variable") %>% #ignores this var
step_impute_mean(all_numeric_predictors()) %>% #impute if necessary
step_impute_mode(all_factor_predictors()) %>% #impute
#step_other(all_factor_predictors(), threshold = 0.2) %>%
#step_dummy(all_factor_predictors()) %>%
step_normalize(all_numeric_predictors())
Let’s now prep and bake the recipe to get datasets with all our feature engineering and transformations applied.
df_training_mod <- data_recipe %>%
prep() %>%
bake(new_data = NULL)
df_validation_mod <- data_recipe %>%
prep() %>%
bake(new_data = df_validation)
Modelling
First, let’s split our training data further into training and test sets.
# Create a split object TEST and TRAINING
set.seed(888)
my_split <- initial_split(df_training_mod, prop = 0.8)
# Build training data set
my_training <- my_split %>% training()
# Build testing data set
my_test <- my_split %>% testing()
Models 1a, 1b - Multiple Regression
Let’s define all our models.
model1a <- lm(TARGET ~ ., data = my_training)
#tidy(model1a)
#summary(model1a)
model1b <- lm(TARGET ~ STARS + LabelAppeal, data = my_training)
#tidy(model1b)
#summary(model1b)
Model 2a, 2b - Poisson Regression
model2a <- glm(TARGET ~ ., data = my_training, family = "poisson")
#tidy(model2a)
#summary(model2a)
model2b <- glm(TARGET ~ STARS + LabelAppeal + LabelAppeal2, data = my_training, family = "poisson")
#tidy(model2b)
#summary(model2b)
Models 3a, 3b - Negative Binomial
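The negative binomial relaxes the Poisson assumption that the variance equals the mean. A crude, hedged check of overdispersion on the raw counts (a formal test would condition on the covariates):
# Under a Poisson model the variance should be close to the mean
mean(my_training$TARGET)
var(my_training$TARGET)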
set.seed(123)
model3a <- glm.nb(TARGET ~ ., data = my_training)
#summary(model3a)
set.seed(123)
model3b <- glm.nb(TARGET ~ STARS + LabelAppeal, data = my_training)
#summary(model3b)
Model 4 - Hurdle Regression (Negative Binomial)
A hurdle model combines a binary component (zero vs. non-zero orders) with a truncated count component for the positive counts; here the count component is negative binomial.
# Hurdle model: count part uses STARS + LabelAppeal; the part after | models the zeros with STARS
set.seed(123)
modelhp <- hurdle(TARGET ~ STARS + LabelAppeal | STARS, data = my_training, dist = "negbin")
#summary(modelhp)
Model 5 - Zero-Inflated Poisson Regression
A zero-inflated model mixes a Poisson count process with a logistic process that generates the excess zeros; the formula part after | specifies the zero-inflation predictors.
set.seed(123)
modelzp <- zeroinfl(TARGET ~ . | STARS, data = my_training, dist = "poisson")
#summary(modelzp)
Analysis of results
Let’s predict using each of our eight models and then evaluate against our held-out test set.
We will then calculate the root mean squared error (RMSE) of our predictions. RMSE is the metric we will use to select our best model; for regression tasks, RMSE and related metrics (MAE, MAPE, MSE) are all common, with RMSE probably the most widely used.
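For reference, RMSE = sqrt(mean((actual - predicted)^2)). A small helper like the one below (rmse_for is our own illustrative name) could replace the repeated blocks that follow; we keep the explicit calls for transparency:
# Hypothetical helper: RMSE of a model's rounded predictions on the test set
rmse_for <- function(model) {
pred <- predict(model, newdata = my_test, type = "response")
ModelMetrics::rmse(my_test$TARGET, round(pred))
}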
pred <- predict(model1a, newdata = my_test, type = "response")
rmsemodel1a <- ModelMetrics::rmse(my_test$TARGET, round(pred))
#rmsemodel1a
pred <- predict(model1b, newdata = my_test, type = "response")
rmsemodel1b <- ModelMetrics::rmse(my_test$TARGET, round(pred))
#rmsemodel1b
pred <- predict(model2a, newdata = my_test, type = "response")
rmsemodel2a <- ModelMetrics::rmse(my_test$TARGET, round(pred))
#rmsemodel2a
pred <- predict(model2b, newdata = my_test, type = "response")
rmsemodel2b <- ModelMetrics::rmse(my_test$TARGET, round(pred))
#rmsemodel2b
pred <- predict(model3a, newdata = my_test, type = "response")
rmsemodel3a <- ModelMetrics::rmse(my_test$TARGET, round(pred))
#rmsemodel3a
pred <- predict(model3b, newdata = my_test, type = "response")
rmsemodel3b <- ModelMetrics::rmse(my_test$TARGET, round(pred))
#rmsemodel3b
predhp <- predict(modelhp, newdata = my_test, type = "response")
rmsemodelhp <- ModelMetrics::rmse(my_test$TARGET, round(predhp))
#rmsemodelhp
predzp <- predict(modelzp, newdata = my_test, type = "response")
rmsemodelzp <- ModelMetrics::rmse(my_test$TARGET, round(predzp))
#rmsemodelzp
Let’s take a look now at all our results. Lowest RMSE wins!
RMSE <- list(rmsemodel1a,rmsemodel1b,rmsemodel2a,rmsemodel2b,
rmsemodel3a,rmsemodel3b,rmsemodelhp,rmsemodelzp)
RMSE <- lapply(RMSE,round,5)
newcols <- c("MR1", "MR2", "Poi1", "Poi2", "NB1", "NB2", "HP", "ZP")
kable(rbind(RMSE), col.names = newcols, digits = 4) %>%
kable_styling(full_width = T)
|      | MR1     | MR2     | Poi1    | Poi2    | NB1     | NB2     | HP      | ZP     |
|------|---------|---------|---------|---------|---------|---------|---------|--------|
| RMSE | 1.35523 | 1.37612 | 1.35984 | 1.35437 | 1.35984 | 1.35437 | 1.35624 | 1.3419 |
And the winner is the ZERO-INFLATED POISSON model. This makes sense, since it explicitly accounts for the many zeroes in our dataset.
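As one more hedged check that the zero-inflated model earns its win, we can compare the observed number of zeros in the test set with the number the model expects (pscl's predict() with type = "prob" returns a matrix of count probabilities whose first column is P(Y = 0)):
# Observed zeros vs. zeros expected by the zero-inflated model
sum(my_test$TARGET == 0)
sum(predict(modelzp, newdata = my_test, type = "prob")[, 1])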
Use best model to predict on validation set
pred <- predict(modelzp, newdata = df_validation_mod, type = "response")
pred <- round(pred)
head(pred, 50)
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
## 1 4 3 2 1 6 4 2 1 1 3 2 4 1 1 3 2 1 4 6 2 1 2 2 4 6
## 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
## 3 6 5 2 4 2 4 3 1 4 3 3 1 2 2 1 1 1 3 4 3 5 4 3
Plot our predictions.
# plot
p <- pred %>% as_tibble() %>%
ggplot( aes(x=value)) +
geom_histogram(fill = "#96eb28", binwidth = 1, boundary = -0.5) +
#scale_x_continuous(trans='log10') +
ggtitle("Distribution of Predictions") +
theme_minimal()
p
Save the data in .CSV
write.csv(pred, "predictions_ZP.csv")
Our predictions of cases to sell are saved in a single file called predictions_ZP.csv.