Load all packages
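
The package-loading chunk is not echoed in this output. Based on the functions used later in the report, the set of packages is presumably something like the following (this list is an inference, not the exact chunk we ran):

# Packages inferred from the functions used below (assumption)
library(tidyverse)   # dplyr, ggplot2, tidyr, readr
library(mosaic)      # favstats() summary tables
library(glmnet)      # LASSO fits and cross-validation
library(broom)       # glance() and tidy() on model objects
library(plotly)      # ggplotly() for interactive plots
library(caret)       # createFolds() for manual cross-validation
library(MLmetrics)   # RMSLE()
library(knitr)       # kable() tables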

Load data and perform data cleaning

Data Merging

There are multiple metadata files from the Kaggle competition that we need to merge to build the training dataset. Moreover, since we obtained external data, we need to add that too. Given the large size of the original training set (125 million observations), we randomly sampled 1 million observations to make our work more feasible.

The cleaned and merged training set is called ‘Train’; it has 30 variables. The test set is called ‘test_updated’. A sketch of the general merging approach is included below.
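
The merging code itself is not reproduced here; what follows is a minimal sketch of the general approach, with assumed file names and join keys (the actual keys and cleaning steps differ in detail):

# Sketch only: file names, join keys, and cleaning details are assumptions
train_raw <- read_csv("data/train.csv")                     # ~125 million rows
items     <- read_csv("data/items.csv")
stores    <- read_csv("data/stores.csv")
oil       <- read_csv("data/oil.csv")                       # daily WTI oil price (dcoilwtico)
econ      <- read_csv("data/world_bank_indicators.csv")     # external monthly indicators (assumed name)

set.seed(495)                                               # arbitrary seed, for reproducibility
Train <- train_raw %>%
  sample_n(1e6) %>%                                         # down-sample 125M rows to 1M
  left_join(items,  by = "item_nbr") %>%
  left_join(stores, by = "store_nbr") %>%
  left_join(oil,    by = "date") %>%
  mutate(month = lubridate::month(date),
         year  = lubridate::year(date)) %>%
  left_join(econ,   by = c("year", "month"))                # monthly indicators join at the month level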

A couple of things to bear in mind:

1) The economic indicators vary monthly, whereas the Kaggle data varies daily.
2) There are lots of missing values in both the training and test sets; sometimes entire variables are missing from the test set.
3) There are roughly 3.75 years' worth of data in the training set and only a couple of days' worth in the test set.

Research Question: Brick-and-mortar stores have to stock physical products on-location, so preparing for demand is key. How can we model and predict unit sales (in order to determine what quantity to stock)? What economic and product-related factors are relevant?

EDA visualizations and tables

Univariate Displays:

Let’s look at the distribution of the Response Variable, Unit Sales. We’re removing the extreme outliers (as they make it harder to see the distribution).

 min  Q1  median  Q3   max      mean        sd      n  missing
 -92   2       4   9  4021  8.543422  18.92657  1e+06        0
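
The summary above appears to be mosaic::favstats() output; a sketch of how the summary and the trimmed histogram could be produced (the 99th-percentile cutoff for "extreme outliers" is an illustrative assumption):

# Numerical summary of the response
favstats(~ unit_sales, data = Train)

# Histogram with extreme outliers removed for display (cutoff is illustrative)
Train %>%
  filter(unit_sales <= quantile(unit_sales, 0.99)) %>%
  ggplot(aes(x = unit_sales)) +
  geom_histogram(binwidth = 1) +
  labs(x = "Unit sales", y = "Count",
       title = "Distribution of unit sales (extreme outliers removed)")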

Let’s look at the distribution of Unit Sales over time.

Evidently, unit sales fluctuate over time, with evidence of seasonality and a slight negative trend overall. A time series decomposition of mean daily unit sales would help separate the seasonal pattern from the overall trend; a sketch of such a decomposition follows.
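
We did not include a decomposition in this report, but a minimal sketch using stl() on daily mean unit sales might look like this (treating a week as one seasonal cycle is an assumption):

# Aggregate to daily mean unit sales
daily_sales <- Train %>%
  group_by(date) %>%
  summarise(mean_unit_sales = mean(unit_sales, na.rm = TRUE)) %>%
  arrange(date)

# Decompose into seasonal, trend, and remainder components
sales_ts <- ts(daily_sales$mean_unit_sales, frequency = 7)   # weekly seasonality assumed
plot(stl(sales_ts, s.window = "periodic"))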

Fluctuation of Oil Prices over time:


We include this here because, surprisingly, oil price wasn't particularly useful in any of our models. That could be because we didn't account for the effects of time. Perhaps encoding this variable as a categorical one (e.g., high vs. low prices) would be more fruitful; a sketch of that idea follows.
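
A sketch of the categorical encoding we have in mind, with illustrative cutpoints (the breaks are assumptions, not values we tuned):

# Bin the daily WTI oil price (dcoilwtico) into rough categories
Train <- Train %>%
  mutate(oil_level = cut(dcoilwtico,
                         breaks = c(-Inf, 40, 60, Inf),
                         labels = c("low", "medium", "high")))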

Multivariate Displays:

Some of the predictors turned out to have very strong relationships with the Response Variable. Others (like Oil Price) had surprisingly weaker relationships with the response variable than we initially expected.

As this plot shows, certain groups of products and commodities sell more, on average, than others.

Clearly, the number of Transactions has a strong relationship with the number of units sold.

Finally, we think that the store number (which accounts for region) has a strong relationship with the number of units sold; sales tend to be higher in more populated cities, for example.

In this exploratory analysis, we wanted to visualize the distributions of the covariates available to us to see whether our understanding of their relationships with unit sales would hold. The ones displayed above showed particularly promising signs of being useful predictors and, as we'll see later, they turned out to be fairly important.

Models and validation

We built a number of models with varying degrees of success. Here, for simplicity's sake, we include only the linear model and the LASSO model (i.e., the baseline and the final model):

Linear Model

Our baseline model was a ‘full’ linear model which contained a number of predictors.

# Convert the categorical predictors to factors in the training and test sets
ds <- Train %>% 
  mutate(cluster = as.factor(cluster), family = as.factor(family), city = as.factor(city),
         perishable = as.factor(perishable), onpromotion = as.factor(onpromotion),
         type = as.factor(type), store_nbr = as.factor(store_nbr), item_nbr = as.factor(item_nbr))
ts1 <- test_updated %>% 
  mutate(cluster = as.factor(cluster), family = as.factor(family), city = as.factor(city),
         perishable = as.factor(perishable), onpromotion = as.factor(onpromotion),
         type = as.factor(type), store_nbr = as.factor(store_nbr), item_nbr = as.factor(item_nbr))

# 'Full' linear model: oil price, World Bank economic indicators, and product/store covariates
model <- lm(unit_sales ~ dcoilwtico + `CPI Price, seas. adj.,,,` +
              `Exports Merchandise, Customs, current US$, millions, seas. adj.` +
              `Industrial Production, constant US$, seas. adj.,,` +
              `J.P. Morgan Emerging Markets Bond Spread (EMBI+),,,,` +
              `Real Effective Exchange Rate,,,,` + `Unemployment rate,Percent,,,` +
              cluster + family + city + perishable + month + onpromotion + type,
            data = ds)

# Predictions on the test set
y_hat <- predict(model, newdata = ts1)

model %>% 
 broom::glance()
 r.squared  adj.r.squared     sigma  statistic  p.value   df     logLik      AIC      BIC   deviance  df.residual
 0.0648014      0.0646159  16.87901   349.2397        0  112   -2375315  4750857  4752126  159389499       559456

Using just this model, our Kaggle score was around 1.4.

LASSO:

Ultimately, we went with our LASSO model, as it had the best score on Kaggle.

projds <- Train
ts3 <- test_updated

# Log-transform the response; drop_na() removes rows with missing values
# (log of negative sales yields NaN, which also gets dropped)
projds <- projds %>% 
  mutate(log_unit_sales = log(unit_sales)) %>% 
  drop_na()

# The test set has no response, so add a placeholder column so model.matrix() works
ts3 <- ts3 %>% 
  mutate(log_unit_sales = 1) 

model_formula <- as.formula("log_unit_sales ~ date + store_nbr + item_nbr + onpromotion + city + state + type + cluster + family + class + perishable")

# Build the design matrices, dropping the intercept column
predictor_matrix_train <- model.matrix(model_formula, data = projds)[, -1]
predictor_matrix_test <- model.matrix(model_formula, data = ts3)[, -1] 

# Fit the LASSO path and use cross-validation to choose lambda
LASSO_fit <- glmnet(x = predictor_matrix_train, y = projds$log_unit_sales, alpha = 1)
LASSO_CV <- cv.glmnet(x = predictor_matrix_train, y = projds$log_unit_sales, alpha = 1)
lambda_star <- LASSO_CV$lambda.min      # lambda minimizing CV error
lambda_star_1SE <- LASSO_CV$lambda.1se  # largest lambda within 1 SE of the minimum

# Predict on the test set at lambda.1se and back-transform from the log scale
unit_sales <- predict(LASSO_fit, newx = predictor_matrix_test, s = lambda_star_1SE) %>%
  as.vector() %>% 
  exp()

final_submission <- sample_submission
final_submission$unit_sales <- unit_sales

Plotting the LASSO fit:

#Function below is required to generate desired plots for LASSO:

get_LASSO_coefficients <- function(LASSO_fit){
  coeff_values <- LASSO_fit %>% 
    broom::tidy() %>% 
    as_tibble() %>% 
    select(-c(step, dev.ratio)) %>% 
    tidyr::complete(lambda, nesting(term), fill = list(estimate = 0)) %>% 
    arrange(desc(lambda)) %>% 
    select(term, estimate, lambda)
  return(coeff_values)
}

# Run the code below to visualize the LASSO coefficient paths on the full metadata:

LASSO_coefficients_Meta <- get_LASSO_coefficients(LASSO_fit) %>% 
  filter(term != "(Intercept)")

plot_LASSO_coefficients <- LASSO_coefficients_Meta %>% 
  ggplot(aes(x=lambda, y=estimate, col=term)) +
  geom_line() +
  scale_x_log10() +
  labs(x="lambda (log10-scale)", y="beta-hat coefficient estimate",
       title="LASSO on complete Metadata") +
  geom_vline(xintercept = lambda_star, col="red", alpha=0.4, linetype="dashed") +
  geom_vline(xintercept = lambda_star_1SE, col="blue", alpha=0.4, linetype="dashed")
  
plot_LASSO_coefficients %>% 
ggplotly()

Caveat: we were unsure about the exact cross-validation scheme to use because we didn't know the weighting mechanism of Kaggle's scoring process for this competition. The CV code appears below but is not run, because of prohibitive knitting times.

## Our training score (RMSLE on the training set)
training_unit_sales <- predict(LASSO_fit, newx = predictor_matrix_train, s = lambda_star_1SE) 
training_score <- RMSLE(exp(training_unit_sales), projds$unit_sales)
training_score
## [1] 0.7971787
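
RMSLE() here is presumably the root mean squared logarithmic error from the MLmetrics package; for reference, a minimal hand-rolled equivalent (valid when predictions and actual values are both greater than -1):

# Reference implementation of RMSLE (not the one we actually called)
rmsle <- function(pred, actual) {
  sqrt(mean((log1p(pred) - log1p(actual))^2))
}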
## Finding the coefficients at a given lambda (here lambda.1se)
Coefficient_Values <-get_LASSO_coefficients(LASSO_fit) %>%
 filter(lambda == lambda_star_1SE) %>%
 filter(estimate != 0) %>%
 knitr::kable(digits=5)

# Manual 10-fold cross-validation over a small lambda grid (not run: prohibitive knitting times)
# ds8 <- projds
# folded <- createFolds(ds8$log_unit_sales, k = 10, list = TRUE, returnTrain = FALSE)
# lambdas_to_test <- c(0.0001, 0.001, 0.01, 0.1)  # based on our tests and intuition, very small lambdas are useful in this context
# results <- data.frame(matrix(0, nrow = 10, ncol = 4))
#
# for (i in 1:4) {
#   lambda <- lambdas_to_test[i]
#   for (j in 1:10) {
#     this_train <- ds8[-folded[[j]], ]
#     this_test  <- ds8[folded[[j]], ]
#     x_train <- model.matrix(model_formula, data = this_train)[, -1]
#     x_test  <- model.matrix(model_formula, data = this_test)[, -1]
#     fold_fit <- glmnet(x = x_train, y = this_train$log_unit_sales, alpha = 1)
#     predictions <- exp(as.vector(predict(fold_fit, newx = x_test, s = lambda)))
#     results[j, i] <- sqrt(mean((predictions - this_test$unit_sales)^2))  # RMSE for fold j at this lambda
#   }
# }

Create submission

write_csv(final_submission, "data/Final_Submission.csv")

Citations and references

The data we worked with comes from the Corporación Favorita Grocery Sales Forecasting competition on Kaggle: https://www.kaggle.com/c/favorita-grocery-sales-forecasting. The additional economic indicator data we used was obtained from the World Bank.

Supplementary Material - Additional Visualizations

Average Unit Sales per Month:

Unit Sales based on State:

We look at the distribution of mean sales by decile (with the top decile removed) to see the variation in spread.