1 Executive Summary

1.1 Abstract

This report discuss suitable methods to forecast the demand for the following products -

  1. Red Bull Energy Drink
  2. Hershey’s Chocolate Bar
  3. Heinz Tomato Ketchup

The methods have been based upon historical sales data and analysis of external factors which have an influence on the sales of the products.

1.2 Problem Statement

  • Sales forecasts provide an input to business planning
  • Efficiency in supply chain can cut costs and increase profits
  • Company requires store-level forecasts for a horizon of 15 days to increase the efficiency of their operations

1.3 Project Goals

  • Understand the historical demand for the products in the last 2 years
  • Analyze the daily sales data and understand influence of external factors
  • Build predictive ML model to obtain forecasts for a 15 - day horizon
  • Create a web-based UI to produce forecasts

1.4 Methods

  • Analyzed past 2 year’s sales data for the products (Jan 2015 - Dec 2016)
  • Performed data analysis to identify outliers, impute missing values and attribute selection
  • Built predictive models using sales data and influential external factors
  • Mean Absolute Percentage Error (MAPE) used to evaluate the performance in comparison to baseline models
  • Analysis performed using R, an open source statistical language

1.5 Outcome

  • Forecast Accuracy for each product -
Product Model Accuracy (in %)
o.k. Energy-drink 25CL 91.02
Snickers 50G 67.73
Murren Ch-Poulet Tartarsauce 205G 71.21
  • The selected models will provide reliable forecasts for the next 4 months. Re-training will be required thereafter with the latest sales data.

2 Data Handling

2.1 Importing Data

Following data files were imported for the analysis:

  • Sales_Final.csv - historical sales data including promotion details (Jan 01, 2015 - Dec 07, 2016)
  • Holiday.csv - holiday data (Jan 2014 - Dec 2017)
  • Lotto_Final.csv - Swiss Lotto, Joker Lotto and Euro Millions Jackpot data (Jan 2015 - Dec 2017)
  • Weather.csv - weather data (Jan 01, 2015 - Dec 07, 2016)

The data set was created by cleaning each file independently and merging the files on the date attribute.

The merged data set was subsetted for each product containing data for the period Jan 01, 2015 - Dec 07, 2016.

2.2 Cleaning Data

The following are some of the major pre-processing steps performed to the data set:

  • Product SKey 615783 was missing sales data for 21 days. Is_Promotion was imputed with N and missing sales values were imputed with the median of the month & year.
  • Holiday data set had all attributes amalgamated in one column. Text processing was performed to split the attributes into multiple columns.
  • A back-fill was performed on the missing Jackpot amounts for days when no result was declared, i.e., missing amounts were filled with the latest value.
  • Missing values in the weather data were imputed with the median of the month.
  • Outliers present in Product SKey 615783 were imputed with median of the sales quantity of the product.
  • New attributes were created by splitting Date_BKey variable into date, month, year and weekday.
  • Data types of various attributes were converted before using them in the model.

2.3 Exploring data through Visualizations

The distribution of the sales values in the data were as below.

  1. Red Bull Energy Drink (102443)
Figure 1: Distribution for Sales Quantity of Red Bull Energy Drink

Figure 1: Distribution for Sales Quantity of Red Bull Energy Drink

The distribution is slightly deviated from normal with a skewness towards the right and has characteristics of weekly seasonality are present.

  1. Hershey’s Chocolate Bar (615783)
Figure 2: Distribution for Sales Quantity of Hershey's Chocolate Bar

Figure 2: Distribution for Sales Quantity of Hershey’s Chocolate Bar

The distribution is deviated from normal with a skewness towards the right and has a long tail. No characteristics of seasonality are present.

  1. Heinz Tomato Ketchup (758078)
Figure 3: Distribution for Sales Quantity of Heinz Tomato Ketchup

Figure 3: Distribution for Sales Quantity of Heinz Tomato Ketchup

The distribution is normal without any skewness and has characteristics of weekly seasonality are present.

Outlier values were identified for Product SKey 615783.

The distribution of the outlier values in the data were as below.

Figure 4: Outlier Handling for Sales Quantity of Hershey's Chocolate Bar (615783)

Figure 4: Outlier Handling for Sales Quantity of Hershey’s Chocolate Bar (615783)


3 Data Analysis

3.1 Creation of Test Data

To predict the Sales Quantity for the next 15 days, the models needed data for the Independent/Input variables for at least 1.5 years.

The following steps were used to generate the Test Data :

  1. Data was extracted from the processed Data set for the past 2 years: 2015 and 2016.

  2. Highly correlated numerical input variables were handled to avoid multi-collinearity.

  3. Feature selection was performed using Random Forest model and only the most influential input variables were retained for each product.

Figure 5: Variable Importance Plots

Figure 5: Variable Importance Plots

  1. Log was applied to Sales Quantity for Product SKey 102443 to stabilize the series on variance.

Below are snapshots of the created test data for the three products:

  1. Red Bull Energy Drink (102443)
Date_Bkey Sales_Quantity Is_Promotion Date DayofWeek Month Year Is_Holiday MeanTempC Swiss_Lotto_Amount Joker_Amount
2015-01-01 296 N 1 Thu Jan 2015 Y -6 9600000 70000
2015-01-02 380 N 2 Fri Jan 2015 Y -2 9600000 70000
2015-01-03 460 N 3 Sat Jan 2015 N 4 9600000 70000
2015-01-04 357 N 4 Sun Jan 2015 N 1 10300000 80000
2015-01-05 578 N 5 Mon Jan 2015 N -2 10300000 80000
  1. Hershey’s Chocolate Bar (615783)
Date_Bkey Sales_Quantity Date DayofWeek Month Year MeanTempC Swiss_Lotto_Amount Joker_Amount Euro_Millions_Amount
2015-01-01 8 1 Thu Jan 2015 -6 9600000 70000 1.8e+07
2015-01-02 11 2 Fri Jan 2015 -2 9600000 70000 1.8e+07
2015-01-03 8 3 Sat Jan 2015 4 9600000 70000 1.8e+07
2015-01-04 7 4 Sun Jan 2015 1 10300000 80000 2.6e+07
2015-01-05 13 5 Mon Jan 2015 -2 10300000 80000 2.6e+07
  1. Heinz Tomato Ketchup (758078)
Date_Bkey Sales_Quantity Date DayofWeek Month Year MeanTempC MeanSeaLPressurehPa Swiss_Lotto_Amount Joker_Amount
2015-01-01 15 1 Thu Jan 2015 -6 1037 9600000 70000
2015-01-02 10 2 Fri Jan 2015 -2 1036 9600000 70000
2015-01-03 9 3 Sat Jan 2015 4 1028 9600000 70000
2015-01-04 6 4 Sun Jan 2015 1 1032 10300000 80000
2015-01-05 11 5 Mon Jan 2015 -2 1034 10300000 80000

3.2 Training and Validation sets

  • For Time series models, Training set consisted of approximately 18 months (2015, 2016 - First 6 months) and Validation consisted of the next 15 days from the end of training set.

  • For non-Time series models (Random Forest, XGBoost, etc), Test data was randomly distributed to Training set and Validation set in the ratio 70:30 for Product SKey 102443 and 80:20 for Product SKey 615783 & 758078 with seed set as 1234.

3.3 Predictive Modelling

The following variables were considered for model building:

  1. Red Bull Energy Drink (102443)
    • Training set
      • X = Is Promotion, Date, DayofWeek, Month, Year, Is Holiday, MeanTempC, SwissLotto Amount, Joker Amount
      • Y = Sales Quantity
    • Validation set
      • X = Is Promotion, Date, DayofWeek, Month, Year, Is Holiday, MeanTempC, SwissLotto Amount, Joker Amount
      • Y = PredSales Quantity
  2. Hershey’s Chocolate Bar (615783)
    • Training set
      • X = Date, DayofWeek, Month, Year, MeanTempC, SwissLotto Amount, Joker Amount, EuroMillions Amount
      • Y = Sales Quantity
    • Validation set
      • X = Date, DayofWeek, Month, Year, MeanTempC, SwissLotto Amount, Joker Amount, EuroMillions Amount
      • Y = PredSales Quantity
  3. Heinz Tomato Ketchup (758078)
    • Training set
      • X = Date, DayofWeek, Month, Year, MeanTempC, MeanSeaLPressurehPa, SwissLotto Amount, Joker Amount
      • Y = Sales Quantity
    • Validation set
      • X = Date, DayofWeek, Month, Year, MeanTempC, MeanSeaLPressurehPa, SwissLotto Amount, Joker Amount
      • Y = PredSales Quantity

Baseline Forecasting Models were applied to the variable Sales Quantity obtain standard results for each product,

  • Meanf Model
  • Naive Model
  • Random Walk with drift

Time Series Forecasting Models were applied to the variable Sales Quantity to obtain optimal results for each product,

  • Holt’s Linear Model (on Seasonally Adjusted component)
  • Non - seasonal Auto ARIMA Model (on Seasonally Adjusted component)
  • Non - seasonal ARIMA Model (on Seasonally Adjusted component) : Manual Search Tuning Method
  • Holt - Winter’s Model
  • STL + ARIMA Model
  • STL + Exponential Smoothing (ETS) Model
  • Seasonal Auto ARIMA Model
  • Seasonal ARIMA Model : Manual Search Tuning Method

Predictive Machine Learning Models were applied to selected input variables obtain optimal results for each product,

  • Random Forest Model (Basic Model)
  • Random Forest Model : Manual Search Tuning Method
  • Random Forest + GLMBoosting Model
  • XGBoost Model (Basic Model)
  • XGBoost Model : Manual Search Tuning Method
  • XGBoost Model : Grid Search Tuning Method

The following models were selected for predicting sales of each product:

  1. Red Bull Energy Drink (102443)

Random Forest + GLMBoosting Model with the optimal values for Number of trees = 800, Min size of terminal nodes = 20 and Number of variables tried at each split = 6 is selected as the final model.

The output of the final Random Forest with GLMBoosting Model -

## 
## Call:
##  randomForest(formula = log(Sales_Quantity) ~ ., data = train.set,      ntree = 800, mtry = 6, replace = TRUE, nodesize = 20, keep.inbag = TRUE) 
##                Type of random forest: regression
##                      Number of trees: 800
## No. of variables tried at each split: 6
## 
##           Mean of squared residuals: 0.01819294
##                     % Var explained: 66.52
## 
##   Generalized Linear Models Fitted via Gradient Boosting
## 
## Call:
## glmboost.formula(formula = train.set.Sales_Quantity ~ ., data = df)
## 
## 
##   Squared Error (Regression) 
## 
## Loss function: (y - f)^2 
##  
## 
## Number of boosting iterations: mstop = 100 
## Step size:  0.1 
## Offset:  557.6012 
## 
## Coefficients: 
## (Intercept)     pred.rf 
## -639.058094    1.157336 
## attr(,"offset")
## [1] 557.6012
  1. Hershey’s Chocolate Bar (615783)

XGBoost Model: Grid Search Tuning Method with the optimal values for Learning rate = 0.1, Min loss reduction for partition = 5 and Min nodes in a leaf = 20 is selected as the final model.

The output of the final XGBoost Model -

## ##### xgb.Booster
## raw: 41.4 Kb 
## call:
##   xgb.train(params = params, data = dtrain, nrounds = 1000, watchlist = watchlist, 
##     verbose = 0, early_stopping_rounds = 10, maximize = F)
## params (as set within xgb.train):
##   objective = "reg:linear", booster = "gbtree", eta = "0.1", gamma = "5", lambda = "0", max_depth = "6", min_child_weight = "20", subsample = "1", colsample_bytree = "1", silent = "1"
## xgb.attributes:
##   best_iteration, best_msg, best_ntreelimit, best_score, niter
## callbacks:
##   cb.early.stop(stopping_rounds = early_stopping_rounds, maximize = maximize, 
##     verbose = verbose)
## niter: 32
## best_iteration: 22
## best_ntreelimit: 22
## best_score: 36.93789

  1. Heinz Tomato Ketchup (758078)

STL + Exponential Smoothing (ETS) Model is selected as the final model.

The output of the STL + ETS Model -

## ETS(A,N,N) 
## 
## Call:
##  ets(y = x, model = etsmodel, allow.multiplicative.trend = allow.multiplicative.trend) 
## 
##   Smoothing parameters:
##     alpha = 0.0425 
## 
##   Initial states:
##     l = 9.637 
## 
##   sigma:  2.7076
## 
##      AIC     AICc      BIC 
## 4702.519 4702.562 4715.525

4 Results

4.1 Performance on Validation set

  1. Red Bull Energy Drink (102443)

The Performance of various models built for Product SKey 102443 -

Figure 6: Model Performance for Red Bull Energy Drink (102443)

Figure 6: Model Performance for Red Bull Energy Drink (102443)

Forecast Accuracy for validation set on predictions obtained using Random Forest + GLMBoosting Model:

## [1] "91.02 %"

The following time series chart plots Actual vs Predicted Sales for the test data set:

Figure 7: Actual and Predicted Sales for Red Bull Energy Drink (102443)

  1. Hershey’s Chocolate Bar (615783)

The Performance of various models built for Product SKey 615783 -

Figure 8: Model Performance for Snickers (615783)

Figure 8: Model Performance for Snickers (615783)

Forecast Accuracy for validation set on predictions obtained using XGBoost Model:

## [1] "67.73 %"

The following time series chart plots Actual vs Predicted Sales for the test data set:

Figure 9: Actual and Predicted Sales for Snickers (615783)

  1. Heinz Tomato Ketchup (758078)

The Performance of various models built for Product SKey 758078 -

Figure 10: Model Performance for Heinz Tomato Ketchup (758078)

Figure 10: Model Performance for Heinz Tomato Ketchup (758078)

Forecast Accuracy for validation set on predictions obtained using STL + Exponential Smoothing (ETS) Model:

## [1] "71.21 %"

The following time series chart plots Actual vs Predicted Sales for the test data set:

Figure 11: Actual and Predicted Sales for Heinz Tomato Ketchup (758078)

4.2 Sales Forecasts

The following are the sales forecasts for each product for the next 15 days (Dec 08,2016 - Dec 22,2016) obtained from the selected models:

  1. Red Bull Energy Drink (102443)
Figure 12: Sales Forecasts for Red Bull Energy Drink (102443)

Figure 12: Sales Forecasts for Red Bull Energy Drink (102443)

  1. Hershey’s Chocolate Bar (615783)
Figure 13: Sales Forecasts for Snickers (615783)

Figure 13: Sales Forecasts for Snickers (615783)

  1. Heinz Tomato Ketchup (758078)
Figure 14: Sales Forecasts for Heinz Tomato Ketchup (758078)

Figure 14: Sales Forecasts for Heinz Tomato Ketchup (758078)


5 Conclusion

  1. The Machine Learning Models resulted in the following Forecast Accuracy for each product in comparison to Baseline Models:
Product Baseline Accuracy (in %) Model Accuracy (in %) Increase in Accuracy (in %)
o.k. Energy-drink 25CL 85.02 91.02 7.06
Snickers 50G 36.58 67.73 85.18
Murren Ch-Poulet Tartarsauce 205G 58.24 71.21 22.26
  1. These models will provide reliable forecasts for the next 4 months. It will be required to re-train the models thereafter.
  2. Accuracy can be further improved by collecting additional data, Feature Selection, and Model tuning.
  3. Forecasts from the Models will increase company’s operational efficiency and result in greater returns.
  4. Data Science could provide a key input in defining the demand for various products sold across multiple business locations worldwide.