This report discuss suitable methods to forecast the demand for the following products -
The methods have been based upon historical sales data and analysis of external factors which have an influence on the sales of the products.
R, an open source statistical language| Product | Model Accuracy (in %) |
|---|---|
| o.k. Energy-drink 25CL | 91.02 |
| Snickers 50G | 67.73 |
| Murren Ch-Poulet Tartarsauce 205G | 71.21 |
Following data files were imported for the analysis:
The data set was created by cleaning each file independently and merging the files on the date attribute.
The merged data set was subsetted for each product containing data for the period Jan 01, 2015 - Dec 07, 2016.
The following are some of the major pre-processing steps performed to the data set:
Is_Promotion was imputed with N and missing sales values were imputed with the median of the month & year.Date_BKey variable into date, month, year and weekday.
The distribution of the sales values in the data were as below.
Figure 1: Distribution for Sales Quantity of Red Bull Energy Drink
The distribution is slightly deviated from normal with a skewness towards the right and has characteristics of weekly seasonality are present.
Figure 2: Distribution for Sales Quantity of Hershey’s Chocolate Bar
The distribution is deviated from normal with a skewness towards the right and has a long tail. No characteristics of seasonality are present.
Figure 3: Distribution for Sales Quantity of Heinz Tomato Ketchup
The distribution is normal without any skewness and has characteristics of weekly seasonality are present.
Outlier values were identified for Product SKey 615783.
The distribution of the outlier values in the data were as below.
Figure 4: Outlier Handling for Sales Quantity of Hershey’s Chocolate Bar (615783)
To predict the Sales Quantity for the next 15 days, the models needed data for the Independent/Input variables for at least 1.5 years.
The following steps were used to generate the Test Data :
Data was extracted from the processed Data set for the past 2 years: 2015 and 2016.
Highly correlated numerical input variables were handled to avoid multi-collinearity.
Feature selection was performed using Random Forest model and only the most influential input variables were retained for each product.
Figure 5: Variable Importance Plots
Below are snapshots of the created test data for the three products:
| Date_Bkey | Sales_Quantity | Is_Promotion | Date | DayofWeek | Month | Year | Is_Holiday | MeanTempC | Swiss_Lotto_Amount | Joker_Amount |
|---|---|---|---|---|---|---|---|---|---|---|
| 2015-01-01 | 296 | N | 1 | Thu | Jan | 2015 | Y | -6 | 9600000 | 70000 |
| 2015-01-02 | 380 | N | 2 | Fri | Jan | 2015 | Y | -2 | 9600000 | 70000 |
| 2015-01-03 | 460 | N | 3 | Sat | Jan | 2015 | N | 4 | 9600000 | 70000 |
| 2015-01-04 | 357 | N | 4 | Sun | Jan | 2015 | N | 1 | 10300000 | 80000 |
| 2015-01-05 | 578 | N | 5 | Mon | Jan | 2015 | N | -2 | 10300000 | 80000 |
| Date_Bkey | Sales_Quantity | Date | DayofWeek | Month | Year | MeanTempC | Swiss_Lotto_Amount | Joker_Amount | Euro_Millions_Amount |
|---|---|---|---|---|---|---|---|---|---|
| 2015-01-01 | 8 | 1 | Thu | Jan | 2015 | -6 | 9600000 | 70000 | 1.8e+07 |
| 2015-01-02 | 11 | 2 | Fri | Jan | 2015 | -2 | 9600000 | 70000 | 1.8e+07 |
| 2015-01-03 | 8 | 3 | Sat | Jan | 2015 | 4 | 9600000 | 70000 | 1.8e+07 |
| 2015-01-04 | 7 | 4 | Sun | Jan | 2015 | 1 | 10300000 | 80000 | 2.6e+07 |
| 2015-01-05 | 13 | 5 | Mon | Jan | 2015 | -2 | 10300000 | 80000 | 2.6e+07 |
| Date_Bkey | Sales_Quantity | Date | DayofWeek | Month | Year | MeanTempC | MeanSeaLPressurehPa | Swiss_Lotto_Amount | Joker_Amount |
|---|---|---|---|---|---|---|---|---|---|
| 2015-01-01 | 15 | 1 | Thu | Jan | 2015 | -6 | 1037 | 9600000 | 70000 |
| 2015-01-02 | 10 | 2 | Fri | Jan | 2015 | -2 | 1036 | 9600000 | 70000 |
| 2015-01-03 | 9 | 3 | Sat | Jan | 2015 | 4 | 1028 | 9600000 | 70000 |
| 2015-01-04 | 6 | 4 | Sun | Jan | 2015 | 1 | 1032 | 10300000 | 80000 |
| 2015-01-05 | 11 | 5 | Mon | Jan | 2015 | -2 | 1034 | 10300000 | 80000 |
For Time series models, Training set consisted of approximately 18 months (2015, 2016 - First 6 months) and Validation consisted of the next 15 days from the end of training set.
For non-Time series models (Random Forest, XGBoost, etc), Test data was randomly distributed to Training set and Validation set in the ratio 70:30 for Product SKey 102443 and 80:20 for Product SKey 615783 & 758078 with seed set as 1234.
The following variables were considered for model building:
Baseline Forecasting Models were applied to the variable Sales Quantity obtain standard results for each product,
Time Series Forecasting Models were applied to the variable Sales Quantity to obtain optimal results for each product,
Predictive Machine Learning Models were applied to selected input variables obtain optimal results for each product,
The following models were selected for predicting sales of each product:
Random Forest + GLMBoosting Model with the optimal values for Number of trees = 800, Min size of terminal nodes = 20 and Number of variables tried at each split = 6 is selected as the final model.
The output of the final Random Forest with GLMBoosting Model -
##
## Call:
## randomForest(formula = log(Sales_Quantity) ~ ., data = train.set, ntree = 800, mtry = 6, replace = TRUE, nodesize = 20, keep.inbag = TRUE)
## Type of random forest: regression
## Number of trees: 800
## No. of variables tried at each split: 6
##
## Mean of squared residuals: 0.01819294
## % Var explained: 66.52
##
## Generalized Linear Models Fitted via Gradient Boosting
##
## Call:
## glmboost.formula(formula = train.set.Sales_Quantity ~ ., data = df)
##
##
## Squared Error (Regression)
##
## Loss function: (y - f)^2
##
##
## Number of boosting iterations: mstop = 100
## Step size: 0.1
## Offset: 557.6012
##
## Coefficients:
## (Intercept) pred.rf
## -639.058094 1.157336
## attr(,"offset")
## [1] 557.6012
XGBoost Model: Grid Search Tuning Method with the optimal values for Learning rate = 0.1, Min loss reduction for partition = 5 and Min nodes in a leaf = 20 is selected as the final model.
The output of the final XGBoost Model -
## ##### xgb.Booster
## raw: 41.4 Kb
## call:
## xgb.train(params = params, data = dtrain, nrounds = 1000, watchlist = watchlist,
## verbose = 0, early_stopping_rounds = 10, maximize = F)
## params (as set within xgb.train):
## objective = "reg:linear", booster = "gbtree", eta = "0.1", gamma = "5", lambda = "0", max_depth = "6", min_child_weight = "20", subsample = "1", colsample_bytree = "1", silent = "1"
## xgb.attributes:
## best_iteration, best_msg, best_ntreelimit, best_score, niter
## callbacks:
## cb.early.stop(stopping_rounds = early_stopping_rounds, maximize = maximize,
## verbose = verbose)
## niter: 32
## best_iteration: 22
## best_ntreelimit: 22
## best_score: 36.93789
STL + Exponential Smoothing (ETS) Model is selected as the final model.
The output of the STL + ETS Model -
## ETS(A,N,N)
##
## Call:
## ets(y = x, model = etsmodel, allow.multiplicative.trend = allow.multiplicative.trend)
##
## Smoothing parameters:
## alpha = 0.0425
##
## Initial states:
## l = 9.637
##
## sigma: 2.7076
##
## AIC AICc BIC
## 4702.519 4702.562 4715.525
The Performance of various models built for Product SKey 102443 -
Figure 6: Model Performance for Red Bull Energy Drink (102443)
Forecast Accuracy for validation set on predictions obtained using Random Forest + GLMBoosting Model:
## [1] "91.02 %"
The following time series chart plots Actual vs Predicted Sales for the test data set:
Figure 7: Actual and Predicted Sales for Red Bull Energy Drink (102443)
The Performance of various models built for Product SKey 615783 -
Figure 8: Model Performance for Snickers (615783)
Forecast Accuracy for validation set on predictions obtained using XGBoost Model:
## [1] "67.73 %"
The following time series chart plots Actual vs Predicted Sales for the test data set:
Figure 9: Actual and Predicted Sales for Snickers (615783)
The Performance of various models built for Product SKey 758078 -
Figure 10: Model Performance for Heinz Tomato Ketchup (758078)
Forecast Accuracy for validation set on predictions obtained using STL + Exponential Smoothing (ETS) Model:
## [1] "71.21 %"
The following time series chart plots Actual vs Predicted Sales for the test data set:
Figure 11: Actual and Predicted Sales for Heinz Tomato Ketchup (758078)
The following are the sales forecasts for each product for the next 15 days (Dec 08,2016 - Dec 22,2016) obtained from the selected models:
Figure 12: Sales Forecasts for Red Bull Energy Drink (102443)
Figure 13: Sales Forecasts for Snickers (615783)
Figure 14: Sales Forecasts for Heinz Tomato Ketchup (758078)
| Product | Baseline Accuracy (in %) | Model Accuracy (in %) | Increase in Accuracy (in %) |
|---|---|---|---|
| o.k. Energy-drink 25CL | 85.02 | 91.02 | 7.06 |
| Snickers 50G | 36.58 | 67.73 | 85.18 |
| Murren Ch-Poulet Tartarsauce 205G | 58.24 | 71.21 | 22.26 |