2025-06-16

Inventory Forecasting

The synthetic dataset used for this project was found on kaggle.com. It consists of 76000 rows and 16 columns:

  • Date: Date of the record.
  • Store ID: Unique identifier for the store.
  • Product ID: Unique identifier for the product.
  • Category: Product category.
  • Region: Geographical region of the store.
  • Inventory Level: Units available in stock.
  • Units Sold: Units sold on that day.
  • Units Ordered: Units ordered for restocking.
  • Price: Product price.
  • Discount: Discount applied, if any.
  • Weather Condition: Weather on the day of the record.
  • Promotion: 1 if there was a promotion, 0 otherwise.
  • Competitor Pricing: Price of a similar product from a competitor.
  • Seasonality: Season (e.g., Winter, Spring).
  • Epidemic: 1 if an epidemic occurred, 0 otherwise.
  • Demand: Daily estimated demand for the product.

Analyzing the Data

After careful examination, I noticed that the data consists of 20 different products for each of the 5 different stores. Each store and product has their own ID starting from ‘S001’ to ‘S005’ for store IDs, and starting from ‘P0001’ to ‘P0020’ for product IDs. Although two different stores may have products with the same product ID, they seem to be two distinct products. I say this because in some cases, two products with the same ID belonging to different stores are categorized differently. For example, the product with ID ‘P0001’ from store with ID ‘S001’ is categorized as ‘Electronics’ while the product with the same ID but associated with store ‘S002’ is categorized as ‘Groceries’.

Additionally, I noticed that each store has a different lead time. Lead time the amount of time it takes for a restocking order to be fulfilled after the order has been made. Some products may have different lead times depending on the supplier. This creates a challenge in the supply chain and inventory aspect of business. One difficult question to answer is: How much product should I buy for restock given ‘x’ amount of lead time for that product? The answer depends on many factors such as the weather forecast, time of year, whether a promotion is ongoing, etc.

Goal for this Project

In this project, I attempt to create an algorithm that predicts how much demand there will be for product for its given lead time. That is, if a product has a lead time of 3 days, the algorithm will attempt to create a 3-day forecast of that product’s demand. By doing this, it will prevent the business from being sold out of its products and losing potential customers. Additionally, it will help maintain a smooth inventory count ensuring overstocking is also avoided.

Using Penalty to Measure the Algorithm’s Effectiveness

The dataset already consists of a column named ‘Demand’ and will be used as the amount of items that item would have been purchased by customers for that day. For each day, if the initial inventory minus the demand for that day is below zero, it signifies that the business did not fulfill its customers demands and in return, lost some revenue. This loss will be recorded in a new column called ‘Penalty’ and will be used to measure how good the dataset was at inventory control.

After measuring the dataset’s penalty, I will then implement my own algorithm to hopefully minimize this penalty. This will be done by creating a demand forecast for each product’s lead time and making simulated restock orders.

Data Clean Up

# Read CSV
df <- read_csv("/Users/jonisalazar/School/Summer 2025/DAT301/Assignments/Week5/Inventory Forecasting/sales_data.csv")

# Convert Date to Date type and sort
df <- df %>%
  mutate(Date = ymd(Date)) %>%
  arrange(Date)

# Add Day of Year column
df <- df %>%
  mutate(Day = yday(Date))

# Calculate Penalty column
df <- df %>%
  mutate(Penalty = if_else(`Inventory Level` < Demand,
                           abs(`Inventory Level` - Demand),
                           0))

Computing the Penalties for each Product

# Group by Store ID and Product ID and sum penalties
penalty_sums_df <- df %>%
  group_by(`Store ID`, `Product ID`) %>%
  summarise(`Penalty Sum 1` = sum(Penalty), .groups = 'drop')

# Preview first 4 rows
head(penalty_sums_df, 4)
## # A tibble: 4 × 3
##   `Store ID` `Product ID` `Penalty Sum 1`
##   <chr>      <chr>                  <dbl>
## 1 S001       P0001                   1446
## 2 S001       P0002                  11177
## 3 S001       P0003                   1978
## 4 S001       P0004                   4786

Computing the Total Penalties from the Original Data

# Compute the sum for the Penalty column
original_penalty_sum <- sum(penalty_sums_df$`Penalty Sum 1`)

The sum for all products in this dataset is 4.75948^{5}. The goal is to reduce this number using a forecasting algorithm that depends on historical data and the lead time for that product. The following code will generate a demand forecast for each product and attempt to reduce the penalty score.

Generating a Demand Forecast for Each Product

## # A tibble: 5 × 4
##   `Store ID` `Product ID` `Penalty Sum 1` `Penalty Sum 2`
##   <chr>      <chr>                  <dbl>           <dbl>
## 1 S001       P0001                   1446             385
## 2 S001       P0002                  11177             688
## 3 S001       P0003                   1978             500
## 4 S001       P0004                   4786             925
## 5 S001       P0005                   2395             615

Comparing the Results

# New penalty with my algorithm
my_penalty_sum <- sum(penalty_sums_df$`Penalty Sum 2`)

# Determine which is better
if (original_penalty_sum < my_penalty_sum) {
  cat("My algorithm did not reduce penalties.\n")
} else {
  cat("My algorithm reduced penalties by", original_penalty_sum - my_penalty_sum, "\n")
}
## My algorithm reduced penalties by 402410

Plotting the Comparison

Trade-off: Reduced Penalty vs. Inventory Stability

While my algorithm did indeed reduce the penalty overall, it is at the cost of having an increased standard deviation for inventory count. This means the inventory count for a product may vary more than it did in the original but not by a significant amount. In my opinion, I think reduced penalties and increased earnings outweighs the slightly increased inventory instability.

To show this, the inventory stability for product ‘P0001’ from store ‘S001’ will be plotted onto a graph.

Inventory Stability Without Forecasting

## Penalty Sum: 485
## Penalty Standard Deviation: 101.9558

Inventory Stability With Forecasting

## Penalty Sum: 24
## Penalty Standard Deviation: 122.9925

Conclusion

The algorithm achieved its goal in reducing penalties which could then increase the company’s revenue. Although it did make the inventory count slightly less stable, I think the increase in revenue far outweighs this drawback.

To improve the algorithm, a Random Forest Regressor could be used to better predict what the demand would be for the upcoming days (lead time). The features it would use would be: “Inventory Level”, “Price”, “Discount”, “Weather Condition”, “Promotion”, “Competitor Pricing”, “Seasonality”, “Month”, “Day of Week”. The target variable would be: “Demand”. By using this model, it could make better predictions on how much inventory to order for restocking and possibly reduce penalties even more.

Thank You!