Demand Planning: 142 Elite SKU Analysis

Author

Tony Wang

Executive Summary: Demand Forecasting for High-Velocity Inventory

Project Overview

This project addresses a critical operational challenge: optimising inventory levels for a portfolio of 142 elite SKUs within a Ecommerce retail ecosystem in Brazil. By shifting from reactive stocking to a proactive machine-learning-driven approach, we aim to reduce carrying costs while ensuring high-demand products remain available.

Methodology & Technical Rigor

The analysis utilized a “Global Model” strategy, pooling data from all 142 SKUs to create a robust dataset of over 3,400 training observations. This approach allowed the models to learn shared demand patterns across different products, enhancing predictive stability.

To identify the most accurate forecasting engine, seven distinct machine learning architectures were benchmarked:

Baseline Models: Linear Regression and Stochastic Gradient Descent (SGD).
Non-Linear/Clustering Models: K-Nearest Neighbors (KNN) and Support Vector Regression (SVR).
Ensemble Methods: AdaBoost, XGBoost, and Random Forest.

Key Finding: The SVR Advantage

While initial testing with Linear Regression provided a strong baseline, Support Vector Regression (SVR) was selected as the final production model. SVR demonstrated superior performance in handling the non-linear demand spikes and seasonal fluctuations inherent in retail sales, outperforming ensemble methods in out-of-sample stability.

Performance Insights

Model Bias Analysis: The final model was evaluated not just on accuracy, but on systemic bias. Identifying a consistent bias (calculated at ~24% during development) allowed for a more nuanced interpretation of stock requirements, ensuring we don’t systematically under-order for top-performing items.
Forecasting Horizon: The model successfully generated a 4-week forward-looking forecast, providing actionable data for supply chain procurement.

Strategic Outcomes: 4-Week Stocking Outlook

The final forecast (visualized in the aggregate bar chart) indicates a stabilized demand of approximately 300 to 325 units per week across the elite SKU portfolio.

Business Recommendations:

Procurement Alignment: Use the 4-week forecast to negotiate bulk lead times with suppliers, specifically targeting the Week 1 peak of ~325 units.
Safety Stock Buffering: Incorporate the identified 24% model bias into safety stock calculations to mitigate the risk of stockouts during high-volatility periods.
Inventory Rotation: Focus warehouse labor on the 142 elite SKUs identified by the model, as they represent the highest volume and lowest risk of “zombie” stock.

This data-driven framework transforms raw sales history into a strategic asset, providing a scalable solution for Melbourne’s competitive retail market.

Preparetion

Define function

Show the code

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Define your functions here so they are ALWAYS available
def model_bias(y_predicted, y_actual):
    bias = (np.mean(y_predicted) / np.mean(y_actual) - 1) * 100
    return bias

def calculate_accuracy(y_pred, y_true):
    accuracy = (1 - np.sqrt(np.mean((y_pred - y_true)**2)) / np.mean(y_true)) * 100
    return accuracy
  
def data_preprocessing(df, num_lags, train_test_split):
    # This assumes df is a Series or a single-column DataFrame of your target variable
    data = df.values
    X, y = [], []
    
    # Create the lagged features
    for i in range(len(data) - num_lags):
        X.append(data[i:(i + num_lags)])
        y.append(data[i + num_lags])
    
    X, y = np.array(X), np.array(y)
    
    # Calculate the split point
    split = int(len(X) * train_test_split)
    
    x_train, x_test = X[:split], X[split:]
    y_train, y_test = y[:split], y[split:]
    
    return x_train, y_train, x_test, y_test

Show the code

# Importing the required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from master_function import data_preprocessing, mass_import
from master_function import plot_train_test_values, calculate_accuracy, model_bias
from sklearn.metrics import mean_squared_error

Load data

Show the code

# Fetch Orders Data
file_path = "olist_orders_dataset.csv"

# Load the data
order = pd.read_csv(file_path)

# Display the first few rows to verify the import
print(order.head())

# Check basic info about columns and data types
print(order.info())

file_path = "olist_order_items_dataset.csv"

# Fetch items the dataset
item = pd.read_csv(file_path)

# Display the first few rows to verify the import
print(item.head())

# Check basic info about columns and data types
print(item.info())

# Combine the two dataframe into one
df = pd.merge(order, item, on='order_id', how='left').dropna()

# Check na
print(df.isnull().sum())

# Display the first few rows to verify the import
print(df.head())

# Check basic info about columns and data types
print(df.info())

                           order_id                       customer_id  \
0  e481f51cbdc54678b7cc49136f2d6af7  9ef432eb6251297304e76186b10a928d   
1  53cdb2fc8bc7dce0b6741e2150273451  b0830fb4747a6c6d20dea0b8c802d7ef   
2  47770eb9100c2d0c44946d9cf07ec65d  41ce2a54c0b03bf3443c3d931a367089   
3  949d5b44dbf5de918fe9c16f97b45f8a  f88197465ea7920adcdbec7375364d82   
4  ad21c59c0840e6cb83a9ceb5573f8159  8ab97904e6daea8866dbdbc4fb7aad2c   

  order_status order_purchase_timestamp    order_approved_at  \
0    delivered      2017-10-02 10:56:33  2017-10-02 11:07:15   
1    delivered      2018-07-24 20:41:37  2018-07-26 03:24:27   
2    delivered      2018-08-08 08:38:49  2018-08-08 08:55:23   
3    delivered      2017-11-18 19:28:06  2017-11-18 19:45:59   
4    delivered      2018-02-13 21:18:39  2018-02-13 22:20:29   

  order_delivered_carrier_date order_delivered_customer_date  \
0          2017-10-04 19:55:00           2017-10-10 21:25:13   
1          2018-07-26 14:31:00           2018-08-07 15:27:45   
2          2018-08-08 13:50:00           2018-08-17 18:06:29   
3          2017-11-22 13:39:59           2017-12-02 00:28:42   
4          2018-02-14 19:46:34           2018-02-16 18:17:02   

  order_estimated_delivery_date  
0           2017-10-18 00:00:00  
1           2018-08-13 00:00:00  
2           2018-09-04 00:00:00  
3           2017-12-15 00:00:00  
4           2018-02-26 00:00:00  
<class 'pandas.DataFrame'>
RangeIndex: 99441 entries, 0 to 99440
Data columns (total 8 columns):
 #   Column                         Non-Null Count  Dtype
---  ------                         --------------  -----
 0   order_id                       99441 non-null  str  
 1   customer_id                    99441 non-null  str  
 2   order_status                   99441 non-null  str  
 3   order_purchase_timestamp       99441 non-null  str  
 4   order_approved_at              99281 non-null  str  
 5   order_delivered_carrier_date   97658 non-null  str  
 6   order_delivered_customer_date  96476 non-null  str  
 7   order_estimated_delivery_date  99441 non-null  str  
dtypes: str(8)
memory usage: 6.1 MB
None
                           order_id  order_item_id  \
0  00010242fe8c5a6d1ba2dd792cb16214              1   
1  00018f77f2f0320c557190d7a144bdd3              1   
2  000229ec398224ef6ca0657da4fc703e              1   
3  00024acbcdf0a6daa1e931b038114c75              1   
4  00042b26cf59d7ce69dfabb4e55b4fd9              1   

                         product_id                         seller_id  \
0  4244733e06e7ecb4970a6e2683c13e61  48436dade18ac8b2bce089ec2a041202   
1  e5f2d52b802189ee658865ca93d83a8f  dd7ddc04e1b6c2c614352b383efe2d36   
2  c777355d18b72b67abbeef9df44fd0fd  5b51032eddd242adc84c38acab88f23d   
3  7634da152a4610f1595efa32f14722fc  9d7a1d34a5052409006425275ba1c2b4   
4  ac6c3623068f30de03045865e4e10089  df560393f3a51e74553ab94004ba5c87   

   shipping_limit_date   price  freight_value  
0  2017-09-19 09:45:35   58.90          13.29  
1  2017-05-03 11:05:13  239.90          19.93  
2  2018-01-18 14:48:30  199.00          17.87  
3  2018-08-15 10:10:18   12.99          12.79  
4  2017-02-13 13:57:51  199.90          18.14  
<class 'pandas.DataFrame'>
RangeIndex: 112650 entries, 0 to 112649
Data columns (total 7 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   order_id             112650 non-null  str    
 1   order_item_id        112650 non-null  int64  
 2   product_id           112650 non-null  str    
 3   seller_id            112650 non-null  str    
 4   shipping_limit_date  112650 non-null  str    
 5   price                112650 non-null  float64
 6   freight_value        112650 non-null  float64
dtypes: float64(2), int64(1), str(4)
memory usage: 6.0 MB
None
order_id                         0
customer_id                      0
order_status                     0
order_purchase_timestamp         0
order_approved_at                0
order_delivered_carrier_date     0
order_delivered_customer_date    0
order_estimated_delivery_date    0
order_item_id                    0
product_id                       0
seller_id                        0
shipping_limit_date              0
price                            0
freight_value                    0
dtype: int64
                           order_id                       customer_id  \
0  e481f51cbdc54678b7cc49136f2d6af7  9ef432eb6251297304e76186b10a928d   
1  53cdb2fc8bc7dce0b6741e2150273451  b0830fb4747a6c6d20dea0b8c802d7ef   
2  47770eb9100c2d0c44946d9cf07ec65d  41ce2a54c0b03bf3443c3d931a367089   
3  949d5b44dbf5de918fe9c16f97b45f8a  f88197465ea7920adcdbec7375364d82   
4  ad21c59c0840e6cb83a9ceb5573f8159  8ab97904e6daea8866dbdbc4fb7aad2c   

  order_status order_purchase_timestamp    order_approved_at  \
0    delivered      2017-10-02 10:56:33  2017-10-02 11:07:15   
1    delivered      2018-07-24 20:41:37  2018-07-26 03:24:27   
2    delivered      2018-08-08 08:38:49  2018-08-08 08:55:23   
3    delivered      2017-11-18 19:28:06  2017-11-18 19:45:59   
4    delivered      2018-02-13 21:18:39  2018-02-13 22:20:29   

  order_delivered_carrier_date order_delivered_customer_date  \
0          2017-10-04 19:55:00           2017-10-10 21:25:13   
1          2018-07-26 14:31:00           2018-08-07 15:27:45   
2          2018-08-08 13:50:00           2018-08-17 18:06:29   
3          2017-11-22 13:39:59           2017-12-02 00:28:42   
4          2018-02-14 19:46:34           2018-02-16 18:17:02   

  order_estimated_delivery_date  order_item_id  \
0           2017-10-18 00:00:00            1.0   
1           2018-08-13 00:00:00            1.0   
2           2018-09-04 00:00:00            1.0   
3           2017-12-15 00:00:00            1.0   
4           2018-02-26 00:00:00            1.0   

                         product_id                         seller_id  \
0  87285b34884572647811a353c7ac498a  3504c0cb71d7fa48d967e0e4c94d59d9   
1  595fac2a385ac33a80bd5114aec74eb8  289cdb325fb7e7f891c38608bf9e0962   
2  aa4383b373c6aca5d8797843e5594415  4869f7a5dfa277a7dca6462dcf3b52b2   
3  d0b61bfb1de832b15ba9d266ca96e5b0  66922902710d126a0e7d26b0e3805106   
4  65266b2da20d04dbe00c5c2d3bb7859e  2c9e548be18521d1c43cde1c582c6de8   

   shipping_limit_date   price  freight_value  
0  2017-10-06 11:07:15   29.99           8.72  
1  2018-07-30 03:24:27  118.70          22.76  
2  2018-08-13 08:55:23  159.90          19.22  
3  2017-11-23 19:45:59   45.00          27.20  
4  2018-02-19 20:31:37   19.90           8.72  
<class 'pandas.DataFrame'>
Index: 110180 entries, 0 to 113424
Data columns (total 14 columns):
 #   Column                         Non-Null Count   Dtype  
---  ------                         --------------   -----  
 0   order_id                       110180 non-null  str    
 1   customer_id                    110180 non-null  str    
 2   order_status                   110180 non-null  str    
 3   order_purchase_timestamp       110180 non-null  str    
 4   order_approved_at              110180 non-null  str    
 5   order_delivered_carrier_date   110180 non-null  str    
 6   order_delivered_customer_date  110180 non-null  str    
 7   order_estimated_delivery_date  110180 non-null  str    
 8   order_item_id                  110180 non-null  float64
 9   product_id                     110180 non-null  str    
 10  seller_id                      110180 non-null  str    
 11  shipping_limit_date            110180 non-null  str    
 12  price                          110180 non-null  float64
 13  freight_value                  110180 non-null  float64
dtypes: float64(3), str(11)
memory usage: 12.6 MB
None

Aggregate weekly sales by product ID

Show the code

# Transform time to datetime
df['order_purchase_timestamp'] = pd.to_datetime(df['order_purchase_timestamp'])

wkly_sales = df.groupby(['product_id',
                         pd.Grouper(key = 'order_purchase_timestamp', freq = 'W-MON')
                         ]).agg(
                             volume = ('product_id', 'count'),
                             value = ('price', 'sum')
                         ).reset_index()

wkly_sales.rename(columns={'order_purchase_timestamp': 'week'}, inplace=True)
wkly_sales = wkly_sales.sort_values(['week', 'product_id'])

print(f"Total products tracked:{wkly_sales['product_id'].nunique()}")
print(f"Total weekly records:{len(wkly_sales)}")

Total products tracked:32210
Total weekly records:77674

The average records per product is only 2.4. Let us take a close look at how the sales are distributed

Distribution of Product Order Volumes

Show the code

# Calculate total volume per product and plot distribution
product_volumes = wkly_sales.groupby('product_id')['volume'].sum()
plt.figure(figsize=(10, 6))
plt.hist(product_volumes, bins=100, color='royalblue', edgecolor='black', log=True)
plt.title('The "Long Tail" of Product Sales', fontsize=15)
plt.xlabel('Total Units Sold (Lifetime)', fontsize=12)
plt.ylabel('Number of Unique Products (Log Scale)', fontsize=12)
plt.grid(axis='y', linestyle='--', alpha=0.5)
plt.axvline(10, color='red', linestyle='--', label='Threshold: 10 Sales')
plt.axvline(50, color='orange', linestyle='--', label='Threshold: 50 Sales')
plt.legend()
plt.show()

product_volumes.describe()
print(f"Products with > 1 unit sold:  {(product_volumes > 1).sum():,}")
print(f"Products with > 10 units sold: {(product_volumes > 10).sum():,}")
print(f"Products with > 50 units sold: {(product_volumes > 50).sum():,}")

Products with > 1 unit sold:  14,495
Products with > 10 units sold: 1,637
Products with > 50 units sold: 168

The longtail perfectly describes the sales distribution, with the majority of products only selling less than 50 in two years.

Show the code

# How many weeks would a product generate sales
product_counts = wkly_sales['product_id'].value_counts()
plt.figure(figsize=(10, 6))
plt.hist(product_counts, bins=range(1, product_counts.max() + 2), color='skyblue', edgecolor='black')
plt.yscale('log')
plt.title('Distribution of Product Sales History', fontsize=14)
plt.xlabel('Number of Weeks with Sales', fontsize=12)
plt.ylabel('Number of Unique Products (Log Scale)', fontsize=12)
plt.grid(axis='y', alpha=0.3)
plt.show()

product_counts.describe()
print(f"Products sold > 1 time:  {(product_counts > 1).sum():,}")
print(f"Products sold > 10 times: {(product_counts > 10).sum():,}")
print(f"Products sold > 50 times: {(product_counts > 50).sum():,}")

Products sold > 1 time:  12,621
Products sold > 10 times: 987
Products sold > 50 times: 15

The longtail perfectly describes the sales distribution, with the majority of products only selling less than 50 in two yearsVery few products generate sales in 50 weeks or more, 75% only manage to generate sales in 2 weeks

Find out products with continuous sales big volume for forecasting

Show the code

analysis = wkly_sales.groupby('product_id').agg(
    weeks_count=('week', 'count'),
    total_volume=('volume', 'sum')
)

# Filter for the 'Perfect' products
perfect_candidates = analysis[(analysis['weeks_count'] > 20) & (analysis['total_volume'] > 50)]

print(f"Number of perfect products to forecast: {len(perfect_candidates)}")
print(perfect_candidates.sort_values('total_volume', ascending=False).head())

Number of perfect products to forecast: 142
                                  weeks_count  total_volume
product_id                                                 
aca2eb7d00ea1a7b8ebd4e68314663af           40           520
422879e10f46682990de24d770e7f83d           59           484
99a4788cb24856965c36a24e339b6058           74           477
389d119b48cf3043d311335e499d9c6b           60           390
368c6c730842d78016ad823897a372db           57           388

From the massive ‘Long Tail with 32,000 items, only 142 show consistent, high-volume demand patterns. To prioritise these 142 for automated machine-learning forecasting to optimise our warehouse space, and apply simple ’just-in-time’ ordering or safety-stock buffers for the remaining items

Prepare products list for forecasting

Show the code

# Filter products list for forecasting
target_ids = perfect_candidates.index.tolist()
forecast_df = wkly_sales[wkly_sales['product_id'].isin(target_ids)].copy()
forecast_df = forecast_df.sort_values(by=['product_id', 'week'])

print(f"Forecasting Dataframe Created.")
print(f"Total Rows: {len(forecast_df)}")
print(f"Unique Products: {forecast_df['product_id'].nunique()}")
forecast_df.tail()

Forecasting Dataframe Created.
Total Rows: 4964
Unique Products: 142

	product_id	week	volume	value
76545	fc1d8637c0268af3db482c14b7ef8e75	2018-07-02	1	149.9
76546	fc1d8637c0268af3db482c14b7ef8e75	2018-07-09	1	149.9
76547	fc1d8637c0268af3db482c14b7ef8e75	2018-07-23	1	149.9
76548	fc1d8637c0268af3db482c14b7ef8e75	2018-08-06	1	149.9
76549	fc1d8637c0268af3db482c14b7ef8e75	2018-08-13	2	299.8

Prepare training & testing data

Show the code

# Creating the training and test sets
all_x_train = []
all_y_train = []
all_x_test = []
all_y_test = []

num_lags = 4 # Looking at the last month to predict next week
train_split = 0.8

# Loop through our 142 elite products
for pid in target_ids:
    # Filter for the specific product
    product_data = forecast_df[forecast_df['product_id'] == pid]['volume']
    
    # Check if we have enough data points for the lags + at least one target
    if len(product_data) > num_lags + 1:
        # Run your existing preprocessing function
        x_tr, y_tr, x_te, y_te = data_preprocessing(product_data, num_lags, train_split)
        
        # Collect the results
        all_x_train.append(x_tr)
        all_y_train.append(y_tr)
        all_x_test.append(x_te)
        all_y_test.append(y_te)

# Concatenate all products into one large training/testing pool
x_train = np.concatenate(all_x_train)
y_train = np.concatenate(all_y_train)
x_test = np.concatenate(all_x_test)
y_test = np.concatenate(all_y_test)

print(f"Final Global Training Set: {x_train.shape}")
print(f"Final Global Testing Set: {x_test.shape}")

Final Global Training Set: (3459, 4)
Final Global Testing Set: (937, 4)

The training dataset has 3,459 records, and the testing dataset has 937 records

Forecasting Sales

Linear Regression Model

Show the code

# Load the packages again
from master_function import plot_train_test_values, calculate_accuracy, model_bias

# Fitting the model
model = LinearRegression()
model.fit(x_train, y_train)

# Predicting in-sample
y_predicted_train = np.reshape(model.predict(x_train), (-1, 1))

# Predicting out-of-sample
y_predicted = np.reshape(model.predict(x_test), (-1, 1))

# plotting
plot_train_test_values(100, 50, y_train, y_test, y_predicted)

# Performance evaluation
print('---')
print('Accuracy Train = ', round(calculate_accuracy(y_predicted_train, y_train), 2), '%')
print('Accuracy Test = ', round(calculate_accuracy(y_predicted, y_test), 2), '%')
print('RMSE Train = ', round(np.sqrt(mean_squared_error(y_predicted_train, y_train)), 10))
print('RMSE Test = ', round(np.sqrt(mean_squared_error(y_predicted, y_test)), 10))
print('Correlation In-Sample Predicted/Train = ', round(np.corrcoef(np.reshape(y_predicted_train, (-1)), y_train)[0][1], 3))
print('Correlation Out-of-Sample Predicted/Test = ', round(np.corrcoef(np.reshape(y_predicted, (-1)), np.reshape(y_test, (-1)))[0][1], 3))
print('Model Bias = ', round(model_bias(y_predicted, y_test), 2), '%')
print('---')

# To get true bias by looking ath the difference between actual and forecast total
Actual_Total = np.sum(y_test)
Forecast_Total = np.sum(y_predicted)

# A simple bias check: are we over or under forecasting in total?
Total_Bias = (Forecast_Total - Actual_Total) / Actual_Total
print(f"Volume Bias: {Total_Bias:.2%}") 

# The volume bias shows that 24% more are forecasted

---
Accuracy Train =  100.0 %
Accuracy Test =  100.0 %
RMSE Train =  3.5183687845
RMSE Test =  2.656813072
Correlation In-Sample Predicted/Train =  0.543
Correlation Out-of-Sample Predicted/Test =  0.281
Model Bias =  24.06 %
---
Volume Bias: 24.06%

The model is weak as out-of-sample accuracy is low. The Model Bias 24.06% is very high, mainly caused by sparse weeks (weeks with 0 sales)

SVR Model

Show the code

from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from master_function import data_preprocessing, mass_import
from master_function import plot_train_test_values, calculate_accuracy, model_bias
from sklearn.metrics import mean_squared_error

# Fitting the model
model = make_pipeline(StandardScaler(), SVR(kernel = 'rbf', C = 1, gamma = 0.04, epsilon = 0.01))
model.fit(x_train, y_train)

# Predicting in-sample
y_predicted_train = np.reshape(model.predict(x_train), (-1, 1))

# Predicting out-of-sample
y_predicted = np.reshape(model.predict(x_test), (-1, 1))

# plotting
plot_train_test_values(100, 50, y_train, y_test, y_predicted)

# Performance evaluation
print('---')
print('Accuracy Train = ', round(calculate_accuracy(y_predicted_train, y_train), 2), '%')
print('Accuracy Test = ', round(calculate_accuracy(y_predicted, y_test), 2), '%')
print('RMSE Train = ', round(np.sqrt(mean_squared_error(y_predicted_train, y_train)), 10))
print('RMSE Test = ', round(np.sqrt(mean_squared_error(y_predicted, y_test)), 10))
print('Correlation In-Sample Predicted/Train = ', round(np.corrcoef(np.reshape(y_predicted_train, (-1)), y_train)[0][1], 3))
print('Correlation Out-of-Sample Predicted/Test = ', round(np.corrcoef(np.reshape(y_predicted, (-1)), np.reshape(y_test, (-1)))[0][1], 3))
print('Model Bias = ', round(model_bias(y_predicted, y_test), 2), '%')
print('---')

# To get true bias by looking ath the difference between actual and forecast total
Actual_Total = np.sum(y_test)
Forecast_Total = np.sum(y_predicted)

# A simple bias check: are we over or under forecasting in total?
Total_Bias = (Forecast_Total - Actual_Total) / Actual_Total
print(f"Volume Bias: {Total_Bias:.2%}")

---
Accuracy Train =  100.0 %
Accuracy Test =  100.0 %
RMSE Train =  3.607642365
RMSE Test =  2.4711929894
Correlation In-Sample Predicted/Train =  0.543
Correlation Out-of-Sample Predicted/Test =  0.295
Model Bias =  0.17 %
---
Volume Bias: 0.17%

The model provides a much better result as volume bias is only 0.17%

SGD Model

Show the code

from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from master_function import data_preprocessing, mass_import
from master_function import plot_train_test_values, calculate_accuracy, model_bias
from sklearn.metrics import mean_squared_error

model = make_pipeline(StandardScaler(), SGDRegressor(max_iter = 50, tol = 1e-3))
model.fit(x_train, y_train)

# Predicting in-sample
y_predicted_train = np.reshape(model.predict(x_train), (-1, 1))

# Predicting out-of-sample
y_predicted = np.reshape(model.predict(x_test), (-1, 1))

# plotting
plot_train_test_values(100, 50, y_train, y_test, y_predicted)

# Performance evaluation
print('---')
print('Accuracy Train = ', round(calculate_accuracy(y_predicted_train, y_train), 2), '%')
print('Accuracy Test = ', round(calculate_accuracy(y_predicted, y_test), 2), '%')
print('RMSE Train = ', round(np.sqrt(mean_squared_error(y_predicted_train, y_train)), 10))
print('RMSE Test = ', round(np.sqrt(mean_squared_error(y_predicted, y_test)), 10))
print('Correlation In-Sample Predicted/Train = ', round(np.corrcoef(np.reshape(y_predicted_train, (-1)), y_train)[0][1], 3))
print('Correlation Out-of-Sample Predicted/Test = ', round(np.corrcoef(np.reshape(y_predicted, (-1)), np.reshape(y_test, (-1)))[0][1], 3))
print('Model Bias = ', round(model_bias(y_predicted, y_test), 2), '%')
print('---')

# To get true bias by looking ath the difference between actual and forecast total
Actual_Total = np.sum(y_test)
Forecast_Total = np.sum(y_predicted)

# A simple bias check: are we over or under forecasting in total?
Total_Bias = (Forecast_Total - Actual_Total) / Actual_Total
print(f"Volume Bias: {Total_Bias:.2%}")

---
Accuracy Train =  100.0 %
Accuracy Test =  100.0 %
RMSE Train =  3.521697194
RMSE Test =  2.6939524615
Correlation In-Sample Predicted/Train =  0.543
Correlation Out-of-Sample Predicted/Test =  0.277
Model Bias =  23.2 %
---
Volume Bias: 23.20%

The model is also weak as the volume bias goes to 26.28%

KNN Model

Show the code

from sklearn.neighbors import KNeighborsRegressor
from master_function import data_preprocessing, mass_import
from master_function import plot_train_test_values, calculate_accuracy, model_bias
from sklearn.metrics import mean_squared_error

train_test_split = 0.80

# Fitting the model
model = KNeighborsRegressor(n_neighbors = 10)
model.fit(x_train, y_train)

# Predicting in-sample
y_predicted_train = np.reshape(model.predict(x_train), (-1, 1))

# Predicting out-of-sample
y_predicted = np.reshape(model.predict(x_test), (-1, 1))

# plotting
plot_train_test_values(100, 50, y_train, y_test, y_predicted)

# Performance evaluation
print('---')
print('Accuracy Train = ', round(calculate_accuracy(y_predicted_train, y_train), 2), '%')
print('Accuracy Test = ', round(calculate_accuracy(y_predicted, y_test), 2), '%')
print('RMSE Train = ', round(np.sqrt(mean_squared_error(y_predicted_train, y_train)), 10))
print('RMSE Test = ', round(np.sqrt(mean_squared_error(y_predicted, y_test)), 10))
print('Correlation In-Sample Predicted/Train = ', round(np.corrcoef(np.reshape(y_predicted_train, (-1)), y_train)[0][1], 3))
print('Correlation Out-of-Sample Predicted/Test = ', round(np.corrcoef(np.reshape(y_predicted, (-1)), np.reshape(y_test, (-1)))[0][1], 3))
print('Model Bias = ', round(model_bias(y_predicted, y_test), 2), '%')
print('---')

# To get true bias by looking ath the difference between actual and forecast total
Actual_Total = np.sum(y_test)
Forecast_Total = np.sum(y_predicted)

# A simple bias check: are we over or under forecasting in total?
Total_Bias = (Forecast_Total - Actual_Total) / Actual_Total
print(f"Volume Bias: {Total_Bias:.2%}")

---
Accuracy Train =  100.0 %
Accuracy Test =  100.0 %
RMSE Train =  3.3259864105
RMSE Test =  2.7385757658
Correlation In-Sample Predicted/Train =  0.61
Correlation Out-of-Sample Predicted/Test =  0.219
Model Bias =  20.0 %
---
Volume Bias: 20.00%

The model is also weak as the volume bias goes to 20.00%

Decision Tree

Show the code

from sklearn.tree import DecisionTreeRegressor
from master_function import data_preprocessing, mass_import
from master_function import plot_train_test_values, calculate_accuracy, model_bias
from sklearn.metrics import mean_squared_error

# Fitting the model
model = DecisionTreeRegressor(random_state = 123)
model.fit(x_train, y_train)

# Predicting in-sample
y_predicted_train = np.reshape(model.predict(x_train), (-1, 1))

# Predicting out-of-sample
y_predicted = np.reshape(model.predict(x_test), (-1, 1))

# plotting
plot_train_test_values(100, 50, y_train, y_test, y_predicted)

# Performance evaluation
print('---')
print('Accuracy Train = ', round(calculate_accuracy(y_predicted_train, y_train), 2), '%')
print('Accuracy Test = ', round(calculate_accuracy(y_predicted, y_test), 2), '%')
print('RMSE Train = ', round(np.sqrt(mean_squared_error(y_predicted_train, y_train)), 10))
print('RMSE Test = ', round(np.sqrt(mean_squared_error(y_predicted, y_test)), 10))
print('Correlation In-Sample Predicted/Train = ', round(np.corrcoef(np.reshape(y_predicted_train, (-1)), y_train)[0][1], 3))
print('Correlation Out-of-Sample Predicted/Test = ', round(np.corrcoef(np.reshape(y_predicted, (-1)), np.reshape(y_test, (-1)))[0][1], 3))
# Change this line in your Quarto code chunk:
print('Model Bias = ', round(model_bias(y_predicted, y_test), 2), '%')
print('---')

# To get true bias by looking ath the difference between actual and forecast total
Actual_Total = np.sum(y_test)
Forecast_Total = np.sum(y_predicted)

# A simple bias check: are we over or under forecasting in total?
Total_Bias = (Forecast_Total - Actual_Total) / Actual_Total
print(f"Volume Bias: {Total_Bias:.2%}")

---
Accuracy Train =  100.0 %
Accuracy Test =  100.0 %
RMSE Train =  1.446186092
RMSE Test =  3.6667096523
Correlation In-Sample Predicted/Train =  0.939
Correlation Out-of-Sample Predicted/Test =  0.221
Model Bias =  30.13 %
---
Volume Bias: 30.13%

The model is even worse as the volume bias goes to 30.13%

Random Forecast

Show the code

from sklearn.ensemble import RandomForestRegressor
from master_function import data_preprocessing, mass_import
from master_function import plot_train_test_values, calculate_accuracy, model_bias
from sklearn.metrics import mean_squared_error

# Fitting the model
model = RandomForestRegressor(max_depth = 20, random_state = 123)
model.fit(x_train, y_train)

# Predicting in-sample
y_predicted_train = np.reshape(model.predict(x_train), (-1, 1))

# Predicting out-of-sample
y_predicted = np.reshape(model.predict(x_test), (-1, 1))

# plotting
plot_train_test_values(100, 50, y_train, y_test, y_predicted)

# Performance evaluation
print('---')
print('Accuracy Train = ', round(calculate_accuracy(y_predicted_train, y_train), 2), '%')
print('Accuracy Test = ', round(calculate_accuracy(y_predicted, y_test), 2), '%')
print('RMSE Train = ', round(np.sqrt(mean_squared_error(y_predicted_train, y_train)), 10))
print('RMSE Test = ', round(np.sqrt(mean_squared_error(y_predicted, y_test)), 10))
print('Correlation In-Sample Predicted/Train = ', round(np.corrcoef(np.reshape(y_predicted_train, (-1)), y_train)[0][1], 3))
print('Correlation Out-of-Sample Predicted/Test = ', round(np.corrcoef(np.reshape(y_predicted, (-1)), np.reshape(y_test, (-1)))[0][1], 3))
print('Model Bias = ', round(model_bias(y_predicted, y_test), 2), '%')
print('---')

# To get true bias by looking ath the difference between actual and forecast total
Actual_Total = np.sum(y_test)
Forecast_Total = np.sum(y_predicted)

# A simple bias check: are we over or under forecasting in total?
Total_Bias = (Forecast_Total - Actual_Total) / Actual_Total
print(f"Volume Bias: {Total_Bias:.2%}")

---
Accuracy Train =  100.0 %
Accuracy Test =  100.0 %
RMSE Train =  1.9323801583
RMSE Test =  2.9055511957
Correlation In-Sample Predicted/Train =  0.899
Correlation Out-of-Sample Predicted/Test =  0.212
Model Bias =  26.07 %
---
Volume Bias: 26.07%

The model is not ideal as the volume bias goes to 26.07%

AdaBoost

Show the code

from sklearn.ensemble import AdaBoostRegressor
from master_function import data_preprocessing, mass_import
from master_function import plot_train_test_values, calculate_accuracy, model_bias
from sklearn.metrics import mean_squared_error

# Fitting the model
model = AdaBoostRegressor(random_state = 123)
model.fit(x_train, y_train)

# Predicting in-sample
y_predicted_train = np.reshape(model.predict(x_train), (-1, 1))

# Predicting out-of-sample
y_predicted = np.reshape(model.predict(x_test), (-1, 1))

# plotting
plot_train_test_values(100, 50, y_train, y_test, y_predicted)

# Performance evaluation
print('---')
print('Accuracy Train = ', round(calculate_accuracy(y_predicted_train, y_train), 2), '%')
print('Accuracy Test = ', round(calculate_accuracy(y_predicted, y_test), 2), '%')
print('RMSE Train = ', round(np.sqrt(mean_squared_error(y_predicted_train, y_train)), 10))
print('RMSE Test = ', round(np.sqrt(mean_squared_error(y_predicted, y_test)), 10))
print('Correlation In-Sample Predicted/Train = ', round(np.corrcoef(np.reshape(y_predicted_train, (-1)), y_train)[0][1], 3))
print('Correlation Out-of-Sample Predicted/Test = ', round(np.corrcoef(np.reshape(y_predicted, (-1)), np.reshape(y_test, (-1)))[0][1], 3))
print('Model Bias = ', round(model_bias(y_predicted, y_test), 2), '%')
print('---')

# To get true bias by looking ath the difference between actual and forecast total
Actual_Total = np.sum(y_test)
Forecast_Total = np.sum(y_predicted)

# A simple bias check: are we over or under forecasting in total?
Total_Bias = (Forecast_Total - Actual_Total) / Actual_Total
print(f"Volume Bias: {Total_Bias:.2%}")

---
Accuracy Train =  100.0 %
Accuracy Test =  100.0 %
RMSE Train =  3.5157072995
RMSE Test =  2.8517893904
Correlation In-Sample Predicted/Train =  0.58
Correlation Out-of-Sample Predicted/Test =  0.239
Model Bias =  41.23 %
---
Volume Bias: 41.23%

The model gives the highest volume bias of 30.13%

XGBoost

Show the code

from xgboost import XGBRegressor
from master_function import data_preprocessing, mass_import
from master_function import plot_train_test_values, calculate_accuracy, model_bias
from sklearn.metrics import mean_squared_error

# Fitting the model
model = XGBRegressor(random_state = 123, n_estimators = 16, max_depth = 12)
model.fit(x_train, y_train)

# Predicting in-sample
y_predicted_train = np.reshape(model.predict(x_train), (-1, 1))

# Predicting out-of-sample
y_predicted = np.reshape(model.predict(x_test), (-1, 1))

# plotting
plot_train_test_values(100, 50, y_train, y_test, y_predicted)

# Performance evaluation
print('---')
print('Accuracy Train = ', round(calculate_accuracy(y_predicted_train, y_train), 2), '%')
print('Accuracy Test = ', round(calculate_accuracy(y_predicted, y_test), 2), '%')
print('RMSE Train = ', round(np.sqrt(mean_squared_error(y_predicted_train, y_train)), 10))
print('RMSE Test = ', round(np.sqrt(mean_squared_error(y_predicted, y_test)), 10))
print('Correlation In-Sample Predicted/Train = ', round(np.corrcoef(np.reshape(y_predicted_train, (-1)), y_train)[0][1], 3))
print('Correlation Out-of-Sample Predicted/Test = ', round(np.corrcoef(np.reshape(y_predicted, (-1)), np.reshape(y_test, (-1)))[0][1], 3))
print('Model Bias = ', round(model_bias(y_predicted, y_test), 2), '%')
print('---')

# To get true bias by looking ath the difference between actual and forecast total
Actual_Total = np.sum(y_test)
Forecast_Total = np.sum(y_predicted)

# A simple bias check: are we over or under forecasting in total?
Total_Bias = (Forecast_Total - Actual_Total) / Actual_Total
print(f"Volume Bias: {Total_Bias:.2%}")

---
Accuracy Train =  100.0 %
Accuracy Test =  100.0 %
RMSE Train =  1.545121866
RMSE Test =  3.1321877787
Correlation In-Sample Predicted/Train =  0.932
Correlation Out-of-Sample Predicted/Test =  0.173
Model Bias =  23.36 %
---
Volume Bias: 23.36%

The model gives the highest volume bias of 41.23%

It turns out that SVR model stands out as the most accurate model. Let us run it again and use it to forecast future sales

SVR Model again

Show the code

# Fitting the model
model = make_pipeline(StandardScaler(), SVR(kernel = 'rbf', C = 1, gamma = 0.04, epsilon = 0.01))
model.fit(x_train, y_train)

# Predicting in-sample
y_predicted_train = np.reshape(model.predict(x_train), (-1, 1))

# Predicting out-of-sample
y_predicted = np.reshape(model.predict(x_test), (-1, 1))

# plotting
plot_train_test_values(100, 50, y_train, y_test, y_predicted)

# Performance evaluation
print('---')
print('Accuracy Train = ', round(calculate_accuracy(y_predicted_train, y_train), 2), '%')
print('Accuracy Test = ', round(calculate_accuracy(y_predicted, y_test), 2), '%')
print('RMSE Train = ', round(np.sqrt(mean_squared_error(y_predicted_train, y_train)), 10))
print('RMSE Test = ', round(np.sqrt(mean_squared_error(y_predicted, y_test)), 10))
print('Correlation In-Sample Predicted/Train = ', round(np.corrcoef(np.reshape(y_predicted_train, (-1)), y_train)[0][1], 3))
print('Correlation Out-of-Sample Predicted/Test = ', round(np.corrcoef(np.reshape(y_predicted, (-1)), np.reshape(y_test, (-1)))[0][1], 3))
print('Model Bias = ', round(model_bias(y_predicted, y_test), 2), '%')
print('---')

# To get true bias by looking ath the difference between actual and forecast total
Actual_Total = np.sum(y_test)
Forecast_Total = np.sum(y_predicted)

# A simple bias check: are we over or under forecasting in total?
Total_Bias = (Forecast_Total - Actual_Total) / Actual_Total
print(f"Volume Bias: {Total_Bias:.2%}")

---
Accuracy Train =  100.0 %
Accuracy Test =  100.0 %
RMSE Train =  3.607642365
RMSE Test =  2.4711929894
Correlation In-Sample Predicted/Train =  0.543
Correlation Out-of-Sample Predicted/Test =  0.295
Model Bias =  0.17 %
---
Volume Bias: 0.17%

Unpack the SKU

Show the code

# To get true bias by looking ath the difference between actual and forecast total
Actual_Total = np.sum(y_test)
Forecast_Total = np.sum(y_predicted)

# A simple bias check: are we over or under forecasting in total?
Total_Bias = (Forecast_Total - Actual_Total) / Actual_Total
print(f"Volume Bias: {Total_Bias:.2%}") 

# Unpack the skus list
results_list = []
current_idx = 0

for pid in target_ids:
    product_data = forecast_df[forecast_df['product_id'] == pid]['volume']

    if len(product_data) > num_lags + 1:
        # Get the number of test samples for this product
        n_test = len(all_y_test[target_ids.index(pid)])

        # Pull predictions and actuals, then FLATTEN them to 1D
        prod_preds = y_predicted[current_idx : current_idx + n_test].flatten()
        prod_actuals = y_test[current_idx : current_idx + n_test].flatten()

        # Store in a temporary dataframe
        temp_df = pd.DataFrame({
            'product_id': pid,
            'actual': prod_actuals,
            'forecast': prod_preds
        })
        results_list.append(temp_df)

        current_idx += n_test

sku_results = pd.concat(results_list)
# 1. Clip negative values to 0 (Regression can sometimes predict -0.5)
sku_results['forecast'] = sku_results['forecast'].clip(lower=0)

# 2. Round to the nearest whole number and convert to integer
sku_results['forecast'] = sku_results['forecast'].round(0).astype(int)

print(sku_results.head())

Volume Bias: 0.17%
                         product_id  actual  forecast
0  0152f69b6cf919bcdaf117aa8c43e5a2       2         2
1  0152f69b6cf919bcdaf117aa8c43e5a2       2         2
2  0152f69b6cf919bcdaf117aa8c43e5a2       4         2
3  0152f69b6cf919bcdaf117aa8c43e5a2       3         2
4  0152f69b6cf919bcdaf117aa8c43e5a2       1         2

Forecast into the future

Show the code

all_future_forecasts = []
num_steps = 4  # Forecast 4 weeks out

for pid in target_ids:
    # 1. Get the most recent history for THIS product
    # We need the last 'num_lags' weeks to start the engine
    recent_history = forecast_df[forecast_df['product_id'] == pid]['volume'].values[-num_lags:].tolist()
    
    current_window = recent_history.copy()
    pid_predictions = []

    # 2. Recursive Loop for this SKU
    for step in range(num_steps):
        window_array = np.array(current_window).reshape(1, -1)
        
        # Predict
        pred = model.predict(window_array)[0]
        # Clean: No negatives, round to integer
        pred_clean = int(max(0, round(float(pred))))
        
        pid_predictions.append(pred_clean)
        
        # Slide the window: remove oldest, add new prediction
        current_window.pop(0)
        current_window.append(pred_clean)

    # 3. Create a mini-dataframe for this product's future
    temp_df = pd.DataFrame({
        'product_id': [pid] * num_steps,
        'future_week_idx': [1, 2, 3, 4],
        'forecast_volume': pid_predictions
    })
    all_future_forecasts.append(temp_df)

# 4. Combine all 142 products into one Master Forecast Table
master_future_df = pd.concat(all_future_forecasts).reset_index(drop=True)

print(f"Generated 4-week forecast for {master_future_df['product_id'].nunique()} SKUs.")
print(master_future_df.head(8))

# Aggregate the 142 products into a total "Warehouse Load"
total_demand_trend = master_future_df.groupby('future_week_idx')['forecast_volume'].sum()

plt.figure(figsize=(10, 5))
plt.bar(total_demand_trend.index, total_demand_trend.values, color='teal', alpha=0.7)
plt.title('Total Forecasted Units Across All 142 SKUs', fontsize=14)
plt.xlabel('Weeks Into Future', fontsize=12)
plt.ylabel('Total Units to Stock', fontsize=12)
plt.xticks([1, 2, 3, 4])
plt.grid(axis='y', linestyle='--', alpha=0.5)
plt.show()

Generated 4-week forecast for 142 SKUs.
                         product_id  future_week_idx  forecast_volume
0  0152f69b6cf919bcdaf117aa8c43e5a2                1                2
1  0152f69b6cf919bcdaf117aa8c43e5a2                2                2
2  0152f69b6cf919bcdaf117aa8c43e5a2                3                2
3  0152f69b6cf919bcdaf117aa8c43e5a2                4                2
4  06edb72f1e0c64b14c5b79353f7abea3                1                2
5  06edb72f1e0c64b14c5b79353f7abea3                2                2
6  06edb72f1e0c64b14c5b79353f7abea3                3                2
7  06edb72f1e0c64b14c5b79353f7abea3                4                2