DATA621 - HW5

Overview

In this homework assignment, you will explore, analyze and model a data set containing information on approximately 12,000 commercially available wines.

The variables are mostly related to the chemical properties of the wine being sold. The response variable is the number of sample cases of wine that were purchased by wine distribution companies after sampling a wine.

These cases would be used to provide tasting samples to restaurants and wine stores around the United States. The more sample cases purchased, the more likely is a wine to be sold at a high end restaurant. A large wine manufacturer is studying the data in order to predict the number of wine cases ordered based upon the wine characteristics. If the wine manufacturer can predict the number of cases, then that manufacturer will be able to adjust their wine offering to maximize sales.

Your objective is to build a count regression model to predict the number of cases of wine that will be sold given certain properties of the wine. HINT: Sometimes, the fact that a variable is missing is actually predictive of the target. You can only use the variables given to you (or variables that you derive from the variables provided).

Load Libraries

library(tidyverse)
library(tidymodels)
library(ggplot2)
library(vip)
library(MASS)
library(ModelMetrics)
library(pscl)
library(ggpubr)
library(kableExtra)
library(corrplot)
rm(list=ls())

Load data

We will load the data from the Cloud Server

df_training <- read_csv('http://www.my-cunymsds.com/data621/wine-training-data.csv')

df_validation <- read_csv('http://www.my-cunymsds.com/data621/wine-evaluation-data.csv')

Exploratory Data Analysis (EDA)

Basic Checks

Let’s take a look at the general structure of the dataset

df_training %>%
  glimpse() %>%
  summary()
## Rows: 12,795
## Columns: 16
## $ INDEX              <dbl> 1, 2, 4, 5, 6, 7, 8, 11, 12, 13, 14, 15, 16, 17, 19…
## $ TARGET             <dbl> 3, 3, 5, 3, 4, 0, 0, 4, 3, 6, 0, 4, 3, 7, 4, 0, 0, …
## $ FixedAcidity       <dbl> 3.2, 4.5, 7.1, 5.7, 8.0, 11.3, 7.7, 6.5, 14.8, 5.5,…
## $ VolatileAcidity    <dbl> 1.160, 0.160, 2.640, 0.385, 0.330, 0.320, 0.290, -1…
## $ CitricAcid         <dbl> -0.98, -0.81, -0.88, 0.04, -1.26, 0.59, -0.40, 0.34…
## $ ResidualSugar      <dbl> 54.20, 26.10, 14.80, 18.80, 9.40, 2.20, 21.50, 1.40…
## $ Chlorides          <dbl> -0.567, -0.425, 0.037, -0.425, NA, 0.556, 0.060, 0.…
## $ FreeSulfurDioxide  <dbl> NA, 15, 214, 22, -167, -37, 287, 523, -213, 62, 551…
## $ TotalSulfurDioxide <dbl> 268, -327, 142, 115, 108, 15, 156, 551, NA, 180, 65…
## $ Density            <dbl> 0.99280, 1.02792, 0.99518, 0.99640, 0.99457, 0.9994…
## $ pH                 <dbl> 3.33, 3.38, 3.12, 2.24, 3.12, 3.20, 3.49, 3.20, 4.9…
## $ Sulphates          <dbl> -0.59, 0.70, 0.48, 1.83, 1.77, 1.29, 1.21, NA, 0.26…
## $ Alcohol            <dbl> 9.9, NA, 22.0, 6.2, 13.7, 15.4, 10.3, 11.6, 15.0, 1…
## $ LabelAppeal        <dbl> 0, -1, -1, -1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 2, 0, 0, …
## $ AcidIndex          <dbl> 8, 7, 8, 6, 9, 11, 8, 7, 6, 8, 5, 10, 7, 8, 9, 8, 9…
## $ STARS              <dbl> 2, 3, 3, 1, 2, NA, NA, 3, NA, 4, 1, 2, 2, 3, NA, NA…
##      INDEX           TARGET       FixedAcidity     VolatileAcidity  
##  Min.   :    1   Min.   :0.000   Min.   :-18.100   Min.   :-2.7900  
##  1st Qu.: 4038   1st Qu.:2.000   1st Qu.:  5.200   1st Qu.: 0.1300  
##  Median : 8110   Median :3.000   Median :  6.900   Median : 0.2800  
##  Mean   : 8070   Mean   :3.029   Mean   :  7.076   Mean   : 0.3241  
##  3rd Qu.:12106   3rd Qu.:4.000   3rd Qu.:  9.500   3rd Qu.: 0.6400  
##  Max.   :16129   Max.   :8.000   Max.   : 34.400   Max.   : 3.6800  
##                                                                     
##    CitricAcid      ResidualSugar        Chlorides       FreeSulfurDioxide
##  Min.   :-3.2400   Min.   :-127.800   Min.   :-1.1710   Min.   :-555.00  
##  1st Qu.: 0.0300   1st Qu.:  -2.000   1st Qu.:-0.0310   1st Qu.:   0.00  
##  Median : 0.3100   Median :   3.900   Median : 0.0460   Median :  30.00  
##  Mean   : 0.3084   Mean   :   5.419   Mean   : 0.0548   Mean   :  30.85  
##  3rd Qu.: 0.5800   3rd Qu.:  15.900   3rd Qu.: 0.1530   3rd Qu.:  70.00  
##  Max.   : 3.8600   Max.   : 141.150   Max.   : 1.3510   Max.   : 623.00  
##                    NA's   :616        NA's   :638       NA's   :647      
##  TotalSulfurDioxide    Density             pH          Sulphates      
##  Min.   :-823.0     Min.   :0.8881   Min.   :0.480   Min.   :-3.1300  
##  1st Qu.:  27.0     1st Qu.:0.9877   1st Qu.:2.960   1st Qu.: 0.2800  
##  Median : 123.0     Median :0.9945   Median :3.200   Median : 0.5000  
##  Mean   : 120.7     Mean   :0.9942   Mean   :3.208   Mean   : 0.5271  
##  3rd Qu.: 208.0     3rd Qu.:1.0005   3rd Qu.:3.470   3rd Qu.: 0.8600  
##  Max.   :1057.0     Max.   :1.0992   Max.   :6.130   Max.   : 4.2400  
##  NA's   :682                         NA's   :395     NA's   :1210     
##     Alcohol       LabelAppeal          AcidIndex          STARS      
##  Min.   :-4.70   Min.   :-2.000000   Min.   : 4.000   Min.   :1.000  
##  1st Qu.: 9.00   1st Qu.:-1.000000   1st Qu.: 7.000   1st Qu.:1.000  
##  Median :10.40   Median : 0.000000   Median : 8.000   Median :2.000  
##  Mean   :10.49   Mean   :-0.009066   Mean   : 7.773   Mean   :2.042  
##  3rd Qu.:12.40   3rd Qu.: 1.000000   3rd Qu.: 8.000   3rd Qu.:3.000  
##  Max.   :26.50   Max.   : 2.000000   Max.   :17.000   Max.   :4.000  
##  NA's   :653                                          NA's   :3359

We can see all features are numerical, but we may need to change to FACTOR in some instances. Also notice in the glimpse that data has NA’s.

Lets remove INDEX. Not needed.

df_training <- df_training %>%
  dplyr::select(-INDEX)

df_validation <- df_validation %>%
  dplyr::select(-IN)

Let’s check for NA’s

colSums(is.na(df_training))
##             TARGET       FixedAcidity    VolatileAcidity         CitricAcid 
##                  0                  0                  0                  0 
##      ResidualSugar          Chlorides  FreeSulfurDioxide TotalSulfurDioxide 
##                616                638                647                682 
##            Density                 pH          Sulphates            Alcohol 
##                  0                395               1210                653 
##        LabelAppeal          AcidIndex              STARS 
##                  0                  0               3359
colSums(is.na(df_validation))
##             TARGET       FixedAcidity    VolatileAcidity         CitricAcid 
##               3335                  0                  0                  0 
##      ResidualSugar          Chlorides  FreeSulfurDioxide TotalSulfurDioxide 
##                168                138                152                157 
##            Density                 pH          Sulphates            Alcohol 
##                  0                104                310                185 
##        LabelAppeal          AcidIndex              STARS 
##                  0                  0                841

We see that some features (ResidualSugar, Chlorides, FreSulfurDioxide, pH and others) have NA’s. We will impute values as needed.

Some visualizations

# boxplot
p1 <- df_training %>% 
  gather(na.rm = TRUE) %>% 
  ggplot(aes(factor(key), value)) +
  geom_boxplot(outlier.colour = "#595dab", outlier.shape = 1,  color = "#96eb28") +
  scale_y_log10() +
  coord_flip() +
  labs(title = "Boxplot of Chemical Properties of Wine", x = "Chemical Properties", y = "Values") +
  theme_minimal()


ggarrange(p1)

The Boxplot show us that some features (TotalSulfurDioxide, pH, FixedAcidity and Alcohol) present a large number of outliers.

Let’s take a look at the distribution densities of our variables.

df_training %>%
  keep(is.numeric) %>%                     # Keep only numeric columns
  gather() %>%                             # Convert to key-value pairs
  ggplot(aes(value)) +                     # Plot the values
    facet_wrap(~ key, scales = "free") +   # In separate panels
    geom_density(color="#96eb28") +                         # as density
    theme_minimal()

Data looks fairly normal with the exceptipon we noted on the boxplot

Let’s take a look a correlations.

library(corrplot)


M<- cor(df_training[,sapply(df_training,is.numeric)],use="complete.obs",method="pearson")

M %>% 
  corrplot(type = "upper",
         addCoef.col = TRUE,
         method = "shade",
         diag = TRUE, number.cex = 0.5, tl.cex = 0.7)

What we noticed is that besides (LabelAppeal, AcidIndex and STARS) almost all other variables have no correlation to the TARGET

Let’s tale a look at these 3 features to get some insigts.

p1 <- df_training %>% 
  dplyr::select(TARGET, STARS) %>% 
  mutate(STARS = as.factor(STARS),
         TARGET = as.factor(TARGET)) %>% 
  ggplot(aes(STARS)) +
  geom_bar(aes(fill = TARGET)) +
  scale_fill_brewer(palette = "Spectral") +
  theme_minimal()

p2 <- df_training %>%
  dplyr::select(TARGET, LabelAppeal) %>% 
  mutate(STARS = as.factor(LabelAppeal),
         TARGET = as.factor(TARGET)) %>% 
  ggplot(aes(LabelAppeal)) +
  geom_bar(aes(fill = TARGET)) +
  scale_fill_brewer(palette = "Spectral") +
  theme_minimal()

p3 <- df_training %>% 
  dplyr::select(TARGET, AcidIndex) %>% 
  mutate(STARS = as.factor(AcidIndex),
         TARGET = as.factor(TARGET)) %>% 
  ggplot(aes(AcidIndex)) +
  geom_bar(aes(fill = TARGET)) +
  scale_fill_brewer(palette = "Spectral") +
  theme_minimal()

ggarrange(p1, p2, p3, ncol = 3, nrow = 1,common.legend = TRUE)

A few important things to notice:

1- Many wines with 0 orders are NA

2- Star Rating seems very “normal” distributed meaning it peaks at 3-stars for the larger order (4,5,6)

3- Label appeal around zero seems to be driving many 0 orders.

Let’s look specifically at the TARGET variable.

# plot
p <- df_training %>%
  ggplot( aes(x=TARGET)) +
  geom_histogram(fill="#96eb28",binwidth = 1, origin = -0.5) +
  #scale_x_continuous(trans='log10') +
    ggtitle("Distribution of Target Amounts") +
  theme_minimal()

p

Many ZERO orders in the dataset. This will need special care as many models don’t work well with zero inflated datasets.

Data engineering and transformations

Let’s replace all STARS=NA with 0. This is because we saw that many wines with NA in the STAR features had no orders.

df_training <- df_training %>% 
    mutate(STARS = replace(STARS, is.na(STARS) , 0))

df_validation <- df_validation %>% 
    mutate(STARS = replace(STARS, is.na(STARS) , 0))

We will create a new feature “LabelAppeal2” which is the the squared of LabelAppeal. This is becase we saw a higher number of 0 orders in the center of distribution when LabelAppeal=0.

df_training$LabelAppeal2 <- df_training$LabelAppeal^2
df_validation$LabelAppeal2 <- df_validation$LabelAppeal^2

Let’s convert some features from numerical to FACTOR. We will do it with STARS since orders didn’t seemed to be linearly increasing with this feature.

df_training$LabelAppeal <- as.factor(df_training$LabelAppeal)
df_training$STARS <- as.factor(df_training$STARS)
df_training$TARGET <-as.integer(df_training$TARGET)

df_validation$LabelAppeal <- as.factor(df_validation$LabelAppeal)
df_validation$STARS <- as.factor(df_validation$STARS)

We will use TIDYMODELS RECIPES to do some imputation and normalization Also we will convert all our factor variable into Dummy Variables to do our regressions.

data_recipe <-
  recipe(formula = TARGET ~ . , data = df_training) %>%
  #step_rm(TARGET2) %>%
  #update_role(TARGET2, new_role = "id variable") %>%  #ignores this var
  step_impute_mean(all_numeric_predictors()) %>% #impute if necessary
  step_impute_mode(all_factor_predictors()) %>% #impute
  #step_other(all_factor_predictors(), threshold = 0.2) %>%
  #step_dummy(all_factor_predictors()) %>%
  step_normalize(all_numeric_predictors())

Let’s now process our recipe to get new dataset with all our feature engineering and transformations.

df_training_mod <- data_recipe %>%
  prep() %>%
  bake(new_data = NULL)
df_validation_mod <- data_recipe %>%
  prep() %>%
  bake(new_data = df_validation)

Modelling

First let’t split our training further in to training and validation.

# Create a split object TEST and TRAINING
set.seed(888)
my_split <- initial_split(df_training_mod, prop = 0.8)

# Build training data set
my_training <- my_split %>% training()

# Build testing data set
my_test <- my_split %>% testing()

Models 1a,1b - multiple regression

Let’s define all our models.

model1a <- lm(TARGET ~ ., data = my_training)
#tidy(model1a)
#summary(model1a)
model1b <- lm(TARGET ~ STARS+LabelAppeal, data = my_training)
#tidy(model1b)
#summary(model1b)

Model 2a,2b - Poisson Regression

model2a<-glm(TARGET ~., data=my_training, family ="poisson")
#tidy(model2a)
#summary(model2a)
model2b<-glm(TARGET ~ STARS+LabelAppeal+LabelAppeal2, data=my_training, family ="poisson")
#tidy(model2b)
#summary(model2b)

Models 3a, 3b - Negative Binomial

set.seed(123)
model3a<-glm.nb(TARGET ~ ., data=my_training)
#summary(model3a)
set.seed(123)
model3b<-glm.nb(TARGET ~ STARS+LabelAppeal, data=my_training)
#summary(model3b)

Model 4 - Hurdle Poisson Regression

#Hurdle Poisson
set.seed(123)
modelhp<-hurdle(TARGET ~ STARS+LabelAppeal | STARS, data=my_training , dist = "negbin")
#summary(modelhp)

Model 5 - Zero Inflated Poisson Regression

set.seed(123)
modelzp<-zeroinfl(TARGET ~. | STARS, data=my_training,dist = "poisson")
#summary(modelzp)

Analysis of results

Let’s predict using each of our 8 models and then test against our validation set.

We will the calculate the ROOT MEAN SQUARED ERROR of our predictions. RMSE will be the metric we wil use to select our best model. In case of regression RMSE or similar metrics like (MAE, MAPE, MSE) are common ones. RMSE seems to be the most commonly used.

pred <- predict.glm(model1a, newdata = my_test, type = "response")
rmsemodel1a <- ModelMetrics::rmse(my_test$TARGET,round(pred))
#rmsemodel1a
pred <- predict.glm(model1b, newdata = my_test, type = "response")
rmsemodel1b <- ModelMetrics::rmse(my_test$TARGET,round(pred))
#rmsemodel1b
pred <- predict.glm(model2a, newdata = my_test, type = "response")
rmsemodel2a <- ModelMetrics::rmse(my_test$TARGET,round(pred))
#rmsemodel2a
pred <- predict.glm(model2b, newdata = my_test, type = "response")
rmsemodel2b <- ModelMetrics::rmse(my_test$TARGET,round(pred))
#rmsemodel2b
pred <- predict.glm(model3a, newdata = my_test, type = "response")
rmsemodel3a <- ModelMetrics::rmse(my_test$TARGET,round(pred))
#rmsemodel3a
pred <- predict.glm(model3b, newdata = my_test, type = "response")
rmsemodel3b <- ModelMetrics::rmse(my_test$TARGET,round(pred))
#rmsemodel3b
predhp<- predict(modelhp,newdata=my_test, type = "response")
rmsemodelhp<-ModelMetrics::rmse(my_test$TARGET,round(predhp))
#rmsemodelhp
predzp<- predict(modelzp,newdata=my_test, type = "response")
pred2 <- round(predzp)
rmsemodelzp<-ModelMetrics::rmse(my_test$TARGET,round(predzp))
#rmsemodelzp

Let’s take a look now at all our results. Lowest RMSE wins!

RMSE <- list(rmsemodel1a,rmsemodel1b,rmsemodel2a,rmsemodel2b,
             rmsemodel3a,rmsemodel3b,rmsemodelhp,rmsemodelzp)

RMSE <- lapply(RMSE,round,5)

newcols <- c("MR1", "MR2", "Poi1", "Poi2","NB1", "NB2","NP","ZP")
kable(rbind(RMSE), col.names = newcols, digits=4)  %>% 
  kable_styling(full_width = T)
MR1 MR2 Poi1 Poi2 NB1 NB2 NP ZP
RMSE 1.35523 1.37612 1.35984 1.35437 1.35984 1.35437 1.35624 1.3419

And the winner was ZERO INFLATED POISSON. Makes sense since this model considers that we had many zeroes in our dataset.

Use best model to predict on validation set

pred <- predict(modelzp, newdata = df_validation_mod, type = "response")
pred <- round(pred)
head(pred,50)
##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 
##  1  4  3  2  1  6  4  2  1  1  3  2  4  1  1  3  2  1  4  6  2  1  2  2  4  6 
## 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 
##  3  6  5  2  4  2  4  3  1  4  3  3  1  2  2  1  1  1  3  4  3  5  4  3

Plot our predictions.

# plot
p <- pred %>% as_tibble() %>%
  ggplot( aes(x=value)) +
  geom_histogram(fill="#96eb28",binwidth = 1, origin = -0.5) +
  #scale_x_continuous(trans='log10') +
    ggtitle("Distribution of Predictions") +
  theme_minimal()

p

Save the data in .CSV

write.csv(pred,"predictions_ZP.csv")

Our predictions of cases to sell are in a single file called predictions.csv

Done. Thank you!