Rossmann

Abstract

Demand forecasting is essential to supply chain optimisation in today’s retail environment. This paper compares the performance of advanced machine learning (ML) algorithms and traditional statistical models in predicting daily sales across German Rossmann outlets. To assess model performance and understand demand drivers, the study uses a dataset covering 1,115 stores from 2013 to 2015 that includes both time-based and contextual information, such as promotions, holidays, store types, and customers. Three models are evaluated: XGBoost, Random Forest, and a baseline Linear Regression model. Random Forest produced the lowest MAPE (7.54%) and RMSE (741.38). The results show that, after intensive data preparation, feature engineering, and validation, advanced machine learning models, especially ensemble approaches, perform noticeably better than traditional methods because they capture non-linear relationships and interactions among external variables. Beyond model accuracy, the study investigates how improved demand forecasting directly enhances inventory management by lowering overstocking and understocking. Retailers looking for data-driven tactics to boost customer satisfaction and operational effectiveness might benefit from these insights.

Introduction

Today’s retailers operate in increasingly dynamic contexts where demand patterns are shaped not only by internal business operations but also by external factors, including consumer behaviour, competitor activity, and economic conditions. In this regard, precise demand forecasting has become a critical tool for improving customer service, cutting down on supply chain inefficiencies, and optimising inventories. Despite their past effectiveness, standard forecasting techniques find it difficult to adjust to this complexity. These models frequently miss seasonality, demand spikes brought on by promotions, and regional store-level variance because they rely only on past sales trends and linear assumptions. As a result of this constraint, interest in more adaptable and data-intensive methods driven by machine learning (ML) has increased.

The Rossmann Store Sales dataset is an excellent testing ground for this research. Between 2013 and 2015, it records comprehensive store-level sales, customer traffic, and operational context for more than 1,100 German stores (Florian & Will, 2015). The breadth of this information makes it possible to meaningfully examine not only time-based trends but also the effects of competition, store type, promotional tactics, and holidays. Using this data, the current study compares the performance of three models: XGBoost, a high-performing gradient boosting technique; Random Forest, a reliable ensemble method; and Linear Regression, which serves as a baseline.

The main goals are to determine which modelling technique produces the most accurate projections and to analyse the practical consequences of these forecasts for supply chain effectiveness. This paper thus tackles a number of important research questions:
(1) Which modelling techniques, among traditional and advanced machine learning models, produce the most accurate demand forecasts for supply chain optimization?
(2) What are the main drawbacks of conventional forecasting models, and how may these be addressed by advanced models?
(3) What effects do external variables (such as promotions, seasonality, and economic indicators) have on the accuracy of demand forecasting, and how do seasonal trends affect inventory needs?
(4) What effects does enhanced demand forecasting have on supply chain effectiveness and inventory optimization?
(5) How does the proposed methodology reduce overstocking and understocking in the retail supply chain?

In addition to adding to the expanding literature on machine learning applications in supply chain management, this study gives practical advice for retail professionals. The work closes the gap between data science innovation and operational decision-making by showcasing the superiority of ensemble machine learning models in real-world forecasting tasks.

Literature Review

Demand forecasting is a crucial element of supply chain management, as it determines supply-side readiness and influences operations, logistics, and planning. Poor forecasting accuracy leads to inefficiencies such as overstocking or stockouts, negatively affecting supply chain performance (Udbhav et al., 2021). Accurate forecasts, in turn, enable businesses to optimize inventory, allocate resources, and enhance operational efficiency in volatile and uncertain markets (Udbhav et al., 2021).

Traditional forecasting models, such as linear models, rely heavily on historical sales data and may fail to account for seasonality, market fluctuations, and external factors (Udbhav et al., 2021). According to Feizabadi (2020), machine learning significantly enhances forecasting accuracy by mitigating demand distortion and variance amplification in supply chains, particularly the bullwhip effect. By integrating hybrid models such as ARIMAX and Neural Networks, ML-based forecasting captures complex, non-linear relationships in historical and macroeconomic data that traditional time-series models often miss. These methods lead to statistically significant improvements in inventory performance, operational efficiency, and financial outcomes, including higher return on assets and profitability (Feizabadi, 2020). Unlike traditional models prone to either excessive variance or bias, ML techniques balance these weaknesses, providing more stable and reliable predictions. Particularly effective in homogeneous product industries like steel manufacturing, ML-based forecasting enables precise SKU-level demand predictions by refining aggregate-level forecasts. ARIMAX excels in capturing demand peaks, while Neural Networks offer smoother, more accurate overall forecasts. Feizabadi’s study underscores the value of a hybrid approach that blends multiple ML techniques to manage uncertainties in models, parameters, and data, leading to more robust demand predictions. Additionally, by reducing reliance on manual judgmental adjustments, ML-based forecasting enhances data-driven decision-making and improves supply chain planning, ultimately optimizing both operational and financial performance (Feizabadi, 2020).

Traditional Demand Forecasting Approaches

Traditional demand forecasting approaches rely on classical statistical models to predict future demand based on historical patterns. These methods remain foundational in supply chain management but face limitations in addressing complex market dynamics. Below is an analysis of key classical statistical techniques and their constraints:

Moving Averages

This method calculates demand forecasts by averaging historical data over a fixed window (e.g., 3-month or 6-month periods). It smooths out short-term fluctuations and is simple to implement. However, it assumes demand patterns are stable and fails to account for trends or seasonality (Jamal, Latifa, & Abdeslam, 2018). For example, a basic moving average would inaccurately forecast demand for a product with strong seasonal peaks, such as Christmas decorations.
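
As a minimal sketch of the windowed average described above, the following Python snippet forecasts the next period as the mean of the most recent observations. It is an illustration only: the study's own analysis used R, and the function name and sales figures are hypothetical.

```python
def moving_average_forecast(history, window):
    """Forecast the next period as the mean of the last `window` observations."""
    if window <= 0 or window > len(history):
        raise ValueError("window must be between 1 and len(history)")
    recent = history[-window:]
    return sum(recent) / window


# Hypothetical monthly unit sales for a seasonal item, January to November.
monthly_sales = [100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 110]

# A 3-month window only sees September to November when forecasting December.
december_forecast = moving_average_forecast(monthly_sales, window=3)
print(round(december_forecast, 2))
```

The December forecast stays close to the recent average of about 103 units, so a real seasonal peak well above that level would be missed, which is the weakness noted above.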

Exponential Smoothing

Exponential smoothing assigns exponentially decreasing weights to older data, prioritizing recent observations. Variants like Holt’s method incorporate trends, while Holt-Winters adds seasonality. These models perform well for short-term forecasts with consistent seasonal patterns (Chi & Turgay, 2024). However, they struggle with abrupt changes, such as sudden demand spikes caused by viral trends, and require manual adjustment for evolving seasonality.
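
For reference, simple exponential smoothing can be written in its standard textbook form, where \(y_t\) is observed demand, \(\hat{y}_{t+1}\) is the one-step-ahead forecast, and \(\alpha\) is the smoothing parameter. The notation is introduced here for illustration and is not taken from the cited sources:

\[ \hat{y}_{t+1} = \alpha y_t + (1 - \alpha)\,\hat{y}_t, \qquad 0 < \alpha \le 1 \]

Unrolling the recursion shows that the observation \(j\) periods back receives weight \(\alpha(1-\alpha)^j\), which is the exponential decay described above:

\[ \hat{y}_{t+1} = \alpha \sum_{j=0}^{t-1} (1 - \alpha)^j\, y_{t-j} + (1 - \alpha)^t\, \hat{y}_1 \]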

ARIMA (Autoregressive Integrated Moving Average)

ARIMA models combine autoregression, differencing, and moving averages to handle non-stationary data. Seasonal ARIMA (SARIMA) extends this to periodic patterns. While ARIMA is flexible for time-series analysis, it demands large datasets for estimating its parameters (p, d, q) and assumes linear relationships between variables (Jamal, Latifa, & Abdeslam, 2018). For instance, SARIMA might fail to predict demand disrupted by external shocks like economic crises, as it cannot integrate qualitative factors such as consumer sentiment (Raj, 2023).
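
In standard backshift notation, where \(B y_t = y_{t-1}\), an ARIMA(p, d, q) model for a series \(y_t\) with white-noise errors \(\varepsilon_t\) can be written as follows. The symbols are introduced here for illustration:

\[ \left(1 - \sum_{i=1}^{p} \phi_i B^i\right)(1 - B)^d\, y_t = c + \left(1 + \sum_{j=1}^{q} \theta_j B^j\right)\varepsilon_t \]

Here \(p\) is the autoregressive order, \(d\) the number of differences applied, and \(q\) the moving-average order. The model is linear in past values and past errors, which is the linearity assumption noted above.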

Limitations of Traditional Methods

Challenge	Description	Example
Complex Patterns	Linear assumptions fail to capture nonlinear interactions (e.g., product cannibalization) (Raj, 2023).	A new product launch disrupts demand for existing items, causing forecast errors.
Seasonality	Non-seasonal models (e.g., simple ARIMA) ignore periodic trends, leading to inaccuracies (Jamal, Latifa, & Abdeslam, 2018).	Failing to anticipate holiday sales spikes without seasonal adjustments.
External Factors	Economic shifts, competitor actions, or social trends are not incorporated.	A viral TikTok trend suddenly boosts demand, overwhelming static models (Raj, 2023).
Data Requirements	ARIMA and exponential smoothing need extensive historical data, limiting applicability for new products (Chi & Turgay, 2024).	New tech products with no sales history cannot rely on these methods.
Scalability	Manual parameter tuning becomes impractical for large datasets or dynamic markets (Raj, 2023).	Retailers with 10,000+ SKUs struggle to update models in real time (Raj, 2023).

While traditional methods provide a baseline for demand planning, their reliance on historical data and rigid assumptions makes them less adaptable to modern supply chain complexities. Hybrid approaches combining statistical models with machine learning (e.g., Random Forests for nonlinear patterns) or qualitative inputs (e.g., Delphi method) are increasingly adopted to address these gaps. Organizations must also integrate real-time data streams and external indicators (e.g., economic indices) to enhance accuracy (Raj, 2023).

Machine Learning Methods in Demand Forecasting

Machine learning techniques have revolutionized demand forecasting by addressing limitations of traditional methods through enhanced pattern recognition and adaptability. Below is a synthesis of key approaches, their applications, and empirical evidence from recent studies:

Tree-Based Models

Random Forest and XGBoost excel at capturing nonlinear relationships between features (e.g., price fluctuations, lagged demand signals). Their ensemble structure reduces overfitting while providing interpretable feature importance scores. For example, XGBoost’s gradient-boosting framework optimizes decision trees iteratively, achieving 10–20% higher accuracy than traditional ARIMA in retail sales forecasting (Lena, Moritz, & Markus, 2024). However, these models struggle with long-term predictions due to their reliance on short-term feature interactions (Tang, 2024).

Deep Learning Models

LSTMs (Long Short-Term Memory Networks) process sequential data by retaining long-term dependencies, making them well suited to demand time series with seasonality. For instance, a hybrid Transformer-LSTM-CNN model improved gasoline consumption predictions by 15% compared to standalone models, leveraging attention mechanisms for global pattern detection (Mahmoud & Mohammad, 2024).

Transformers enhance accuracy in volatile markets by weighing historical data points dynamically. Their self-attention mechanism, defined as:

\[ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V \]

enables context-aware forecasting, particularly useful for multi-variable datasets (e.g., GDP, fuel prices) (Mahmoud & Mohammad, 2024). Here \(Q\), \(K\), and \(V\) denote the query, key, and value matrices, and \(d_k\) is the dimension of the keys. Transformers are strong at modeling complex temporal dependencies and scale to high-dimensional data (e.g., 10,000+ SKUs) (Kaoutar, Oucheikh, Othmane, & Charif, 2024). However, they require extensive training data (more than four times that of traditional methods) and are prone to overfitting without regularization (Kaoutar, Oucheikh, Othmane, & Charif, 2024).
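
The attention formula above can be sketched in a few lines of plain Python. This is an illustration only and is not part of this study's models: in a real Transformer, Q, K, and V are learned projections of the input, whereas here the raw toy input is used for all three, and the function names and data are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(value - peak) for value in row]
    total = sum(exps)
    return [value / total for value in exps]


def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for matrices given as lists of rows."""
    d_k = len(K[0])
    output = []
    for query in Q:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k)
            for key in K
        ]
        weights = softmax(scores)
        # Each output row is a weighted average of the value rows.
        output.append([
            sum(w * value_row[col] for w, value_row in zip(weights, V))
            for col in range(len(V[0]))
        ])
    return output


# Hypothetical toy input: three time steps with two features each.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
context = scaled_dot_product_attention(X, X, X)
```

Because each row of weights sums to one, every output row is a convex combination of the value rows, which is how the mechanism weighs historical data points dynamically.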

Hybrid and Ensemble Approaches

Combining methods mitigates individual weaknesses:

STACK (Stacked Generalization): Uses meta-learners to blend predictions from base models (e.g., Random Forest + XGBoost), reducing error rates by 12–18% in steel industry demand forecasts (S. M. Taslim et al., 2022).

Transformer-LSTM-CNN: Integrates attention, temporal memory, and local pattern detection, achieving state-of-the-art accuracy in energy forecasting (Mahmoud & Mohammad, 2024).

Human-in-the-Loop Systems: Pairing ML predictions with expert adjustments improves forecast reliability by 8–10%, especially for new product launches (Phipps, 2023).

Comparative Analysis

Model Type	Accuracy Gain	Data Requirements	Use Case
Tree-Based (XGBoost)	10–20% (Lena, Moritz, & Markus, 2024)	Moderate	Medium-term retail forecasts
LSTM	15–25% (Mahmoud & Mohammad, 2024)	High	Seasonal energy demand
Hybrid (STACK)	12–18% (S. M. Taslim et al., 2022)	High	Industrial manufacturing
Human-ML Hybrid	8–10% (Phipps, 2023)	Variable	New product launches

ML-driven forecasting depends on good data quality and therefore requires rigorous preprocessing (e.g., outlier detection, normalization) to avoid biased predictions (Mahmoud & Mohammad, 2024). Model explainability should also be considered: tree-based models outperform deep learning in transparency, which is critical for stakeholder buy-in (Tang, 2024).

Empirical studies show ML-driven forecasting reduces inventory costs by 1–2% per 1% accuracy improvement, directly boosting revenue (Demand Caster, 2023). However, successful implementation hinges on aligning model choice with business constraints, such as data availability and operational scalability (Lena, Moritz, & Markus, 2024).

External Factors Influencing Demand Forecasting

Demand forecasting is a critical process for businesses, and its accuracy is significantly influenced by various external factors. Understanding these factors is crucial for developing robust forecasting models and making informed business decisions.

Impact of Promotions, Holidays, and Seasonality

Promotions play a pivotal role in shaping consumer behavior and, consequently, demand patterns. They create immediate spikes in sales and offer valuable insights into marketing strategy effectiveness (Meshram, 2024). For instance, a store running an electronics promotion may experience a substantial increase in purchases. This promotional data can be integrated into forecasting models to predict future demand more accurately, allowing businesses to adjust inventory levels and production schedules accordingly (Meshram, 2024).

Seasonality and holidays also have a profound impact on demand forecasting. Many businesses experience seasonal fluctuations in product demand, such as winter clothing sales peaking in colder months or back-to-school rushes in the fall (Latham, 2024). Holiday seasons, particularly the winter holidays, can dramatically alter consumer spending patterns. Studies show that customers have increased holiday spending by about 8% year over year, with expectations of spending more than $1,700 per person during the holiday season (Latham, 2024). Recognizing these patterns allows businesses to stock inventory appropriately and avoid stockouts or overstocking.

Role of Economic Indicators and Competitor Activity

Economic indicators play a crucial role in shaping consumer purchasing behaviors and, by extension, demand forecasts. Factors such as GDP growth, inflation rates, interest rates, and consumer confidence significantly impact overall consumer demand (Sales Force, 2024). During favorable economic conditions, demand for products and services typically flourishes, while economic downturns can lead to decreased demand. Businesses must continuously monitor these economic indicators to adjust their forecasts and strategies accordingly.

Competitor activity is another critical external factor influencing demand forecasting. The introduction of competing products, changes in pricing strategies, or shifts in marketing campaigns can significantly impact consumer choices and alter demand patterns (Hyoduk & Tunay, 2010). For example, a competitor’s aggressive pricing strategy might lead to a decrease in demand for a company’s products. Therefore, businesses need to closely monitor their competitors’ actions and incorporate this information into their demand forecasting models to mitigate any negative impacts on their product demand.

Effect of Weather and Unforeseen External Events

Weather conditions can have a substantial impact on demand for certain products and services. For instance, unexpected heat waves might increase demand for air conditioners or cold beverages, while cold winters or prolonged periods of rain could boost sales of winter jackets and umbrellas. Incorporating weather forecasts into demand prediction models can help businesses prepare for these fluctuations more effectively (Flora, 2019).

Unforeseen external events, such as natural disasters, pandemics, or geopolitical crises, can dramatically disrupt supply chains and alter demand patterns (Priyanka, Mamatha, Naveen, Nithin, & Prarthana, 2023). The COVID-19 pandemic is a prime example of how an unexpected global event can radically shift consumer behavior and demand across various industries. These events underscore the importance of building flexibility and resilience into demand forecasting models to quickly adapt to sudden changes in the market landscape.

External factors significantly influence demand forecasting accuracy and effectiveness. By carefully considering the impact of promotions, seasonality, economic indicators, competitor activities, weather, and unforeseen events, businesses can develop more robust and accurate demand forecasting models. This, in turn, enables better inventory management, production planning, and overall business strategy, ultimately leading to improved customer satisfaction and business performance.

Supply Chain Optimization Through Improved Forecasting

Recent research has demonstrated the significant impact of improved forecasting on supply chain optimization, particularly in inventory management and the reduction of overstocking and stockouts. These findings highlight the growing importance of predictive analytics in modern supply chain management.

A comprehensive study published in the World Journal of Advanced Research and Reviews in 2024 explored the implementation of AI-driven demand forecasting to enhance inventory management and customer satisfaction. The research found that advanced AI algorithms and machine learning models, when applied to historical sales data and external factors, generated more precise demand forecasts. This led to significant improvements in inventory optimization and cost reduction. Notably, the neural network model outperformed other models, achieving the lowest Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE) (Olamide, Praveen, Yewande, Segun, & Oladapo, 2024).

Impact of Predictive Analytics on Reducing Overstocking and Stockouts

Recent research has demonstrated the powerful impact of predictive analytics on inventory management. A study conducted in 2024 utilized Minitab’s Predictive Analytics Module to analyze factors leading to stockouts of Bluetooth headphones. The research revealed that longer lead time was the most significant predictor of stockouts. By implementing changes based on this insight, such as ordering sooner, companies could drastically reduce the likelihood of stockouts (Franz, 2025).

Another study published on SSRN in November 2024 presented a data-driven approach to inventory optimization. The research combined demand forecasting using machine learning with advanced inventory management models. By employing a random forest model for demand forecasting and integrating it with the Economic Order Quantity (EOQ) model, the study demonstrated how businesses could significantly reduce excess inventory and stockouts, improving overall supply chain efficiency (Kalpesh et al., 2025).
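
To illustrate how a demand forecast can feed an inventory model, the sketch below applies the textbook EOQ formula, the square root of 2DS/H, with forecast annual demand as D. It is a hedged illustration in Python rather than the cited study's implementation, whose details are not described here; the function name and all input values are hypothetical.

```python
import math


def economic_order_quantity(annual_demand, order_cost, holding_cost):
    """Classic EOQ: sqrt(2 * D * S / H).

    annual_demand (D): units expected per year, e.g. from a forecasting model
    order_cost (S):    fixed cost of placing one order
    holding_cost (H):  cost of holding one unit in stock for a year
    """
    if annual_demand < 0 or order_cost < 0 or holding_cost <= 0:
        raise ValueError("demand and order cost must be >= 0, holding cost > 0")
    return math.sqrt(2 * annual_demand * order_cost / holding_cost)


# Hypothetical inputs; the demand figure stands in for a model forecast.
forecast_annual_demand = 12000
cost_per_order = 50.0
holding_cost_per_unit = 3.0

eoq = economic_order_quantity(
    forecast_annual_demand, cost_per_order, holding_cost_per_unit
)
print(round(eoq))
```

Because the order quantity grows with the square root of demand, an over-forecast inflates orders and holding costs, while an under-forecast leads to more frequent replenishment and a higher risk of stockouts.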

Companies such as Deloitte use machine learning to analyze time-series data and predict trends in supply chain demand. Deloitte uses tools such as Clairvoyance, which evaluates multiple algorithms, allows user input for optimal model selection, and integrates future assumptions through a Scenario Manager. It benchmarks predictions against advanced statistical methods like ARIMAX and accommodates multiple data sources, including historical client cash flows and macroeconomic trends. This comprehensive approach enables more accurate forecasting in complex, interconnected systems, helping businesses manage supply chain demand effectively (Deloitte, 2025). Other Deloitte research claims that demand planning and forecasting in the supply chain improves forecast reliability by 10-20%, reduces inventory costs by up to 20%, and increases supply chain efficiency by 15% (Deloitte, 2024).

According to Konstantin’s article on LinkedIn, predictive analytics transforms inventory management by analyzing historical data and trends, offering real-time insights and enabling proactive decision-making. A 2023 Deloitte survey showed that 62% of retailers use external data to improve forecast accuracy. McKinsey reported that companies using predictive analytics saw a 20-50% reduction in inventory holding costs in 2022. Businesses implementing these solutions experienced up to a 35% reduction in stockouts, according to Deloitte. A Harvard Business Review study noted a 10-15% cash flow improvement, while Gartner found up to 30% reduced warehousing and logistics costs. The global predictive analytics market is expected to reach $22.1 billion by 2026, with a CAGR of 21.7% from 2019 (Babenko, 2024).

These studies and research findings consistently demonstrate the significant benefits of improved forecasting and predictive analytics in supply chain optimization. From reducing stockouts and overstocking to enhancing overall operational efficiency, the implementation of advanced forecasting techniques has proven to be a game-changer in modern supply chain management. As businesses continue to face increasingly complex and dynamic market conditions, the adoption of these data-driven approaches will likely become even more crucial for maintaining competitive advantage and ensuring customer satisfaction.

Challenges and Gaps in Current Research

Current research in demand forecasting models for supply chain optimization has made significant strides, but several challenges and gaps remain. These limitations hinder the development of more accurate and robust forecasting systems, impacting businesses’ ability to optimize their supply chains effectively.

Limitations of Current Forecasting Models

One of the primary limitations of current forecasting models is their struggle to handle complex, non-linear relationships in demand patterns. Traditional statistical models, such as ARIMA and exponential smoothing, often fall short when dealing with the intricate dynamics of modern markets. While machine learning models like Random Forests and XGBoost have shown promise in capturing non-linear patterns, they still face challenges in long-term predictions and extrapolating beyond the range of training data (Raj, 2023).

Furthermore, regardless of the chosen methods and the aims companies strive to achieve with demand forecasting, building predictive models is a tedious endeavor that involves several industry-specific and process-related challenges. Managing demand across global supply chains with multiple suppliers, varied regulations, and diverse market conditions can increase the complexity of forecasting and require a more nuanced approach. Rapid changes in market conditions, influenced by factors such as economic shifts, geopolitical events, or unexpected disruptions, can be a major roadblock to accurate planning and predictions. Identifying and accurately predicting seasonal patterns and trends, especially in industries with distinct peak seasons, requires sophisticated forecasting models to avoid underestimating or overestimating demand. Changes in consumer preferences, shopping channels, or buying habits can significantly impact demand forecasting accuracy.

Adapting to these changes requires continuous monitoring and adjustment of forecasting models. Inaccurate or incomplete historical data can compromise the effectiveness of forecasting models, and ensuring data accuracy and addressing data gaps can be challenging, especially in industries with rapidly changing product portfolios. Collaboration across departments is a further challenge: effective communication among departments such as sales, marketing, and operations is crucial, because siloed information can lead to misalignment and inaccurate demand forecasts. To address these challenges, businesses are turning to advanced forecasting technologies, such as machine learning or AI (Dmytro & Daria, 2024).

Challenges in Integrating External Factors

Integrating external factors into machine learning models presents a significant challenge in demand forecasting. While it’s widely recognized that factors such as economic indicators, weather patterns, and competitor activities significantly influence demand, incorporating these diverse data sources into forecasting models remains complex.

One major hurdle is the difficulty in quantifying and standardizing qualitative information, such as brand perception and customer sentiment. These factors can heavily impact demand but are challenging to represent numerically in a way that machine learning models can effectively utilize (Raj, 2023).

Additionally, the real-time integration of external data poses technical challenges. Many businesses still rely on historical data and periodic reporting, which may not accurately reflect current market conditions. This delay in data acquisition and analysis can result in forecasts that lag behind rapidly changing market dynamics (Raj, 2023).

Data Quality and Preprocessing Challenges

Data quality and preprocessing remain significant hurdles in demand forecasting. The foundation of any forecasting model is data, yet many organizations struggle with incomplete, inaccurate, or outdated datasets (Fakhitah & Wan, 2029). For instance, historical data might be missing for new products or emerging markets, complicating the forecasting process.

Organizations have access to massive amounts of data that influence their business decisions. Data collected from various sources are often dirty, which reduces the accuracy of prediction results. Data cleansing improves data quality and helps ensure that the data are ready for analysis. However, the volume of data collected by organizations increases every year, making most existing cleansing methods unsuitable for big data. Although the data need to be analyzed quickly, cleansing them to an adequate quality is complex and time-consuming (Fakhitah & Wan, 2029).

Preprocessing techniques such as normalization, feature engineering, and outlier detection are crucial for enhancing model performance. However, applying these techniques effectively requires domain expertise and can be time-consuming. Moreover, the choice of preprocessing methods can significantly impact model outcomes, adding another layer of complexity to the forecasting process (Restack, 2025).

Scalability is another critical challenge, particularly as datasets grow larger. Efficient algorithms and data structures are necessary to handle large volumes of data without significant performance degradation (Raghavendra, 2024). This is especially relevant for businesses dealing with extensive product lines or operating in multiple markets.

While demand forecasting models have advanced considerably, significant challenges remain in developing truly robust and adaptable systems. Addressing these limitations requires interdisciplinary approaches that combine domain expertise with cutting-edge machine learning techniques. Future research should focus on developing models that can handle complex, non-linear relationships, integrate diverse external factors in real-time, and effectively process large-scale, high-quality datasets. By overcoming these challenges, businesses can achieve more accurate demand forecasts, leading to optimized supply chains and improved operational efficiency.

This study addresses several recognised gaps in the demand forecasting literature. First, it moves towards more comprehensive multivariate temporal-spatial modelling by integrating temporal variables, store metadata, promotion schedules, and competitive pressures into machine learning models. Although Transformer-based architectures are not included here, the framework lays the groundwork for adding them in subsequent work.

Second, whereas many studies concentrate only on next-day forecasts, this thesis assesses performance over a longer horizon of 6 to 8 weeks, which aligns with actual promotional planning cycles in retail. Third, the study reports the evaluation measures MAPE and RMSE; although uncertainty quantification was not the primary focus, the approach could be extended to quantile-based forecasts in subsequent iterations.
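
For clarity, the two evaluation measures named above can be sketched using their standard definitions. The snippet is in Python for illustration (the study's own pipeline used R), the function names and sample values are hypothetical, and skipping days with zero actual sales in MAPE is an assumption made here to avoid division by zero, not necessarily how this study handled such days.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error, in the same units as the target (sales)."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mape(actual, predicted):
    """Mean absolute percentage error in percent, ignoring zero actuals."""
    if len(actual) != len(predicted):
        raise ValueError("inputs must be of equal length")
    # Assumption: rows with zero actual sales are excluded from the average.
    pairs = [(a, p) for a, p in zip(actual, predicted) if a != 0]
    if not pairs:
        raise ValueError("no non-zero actual values to evaluate")
    return 100 * sum(abs((a - p) / a) for a, p in pairs) / len(pairs)


# Hypothetical daily sales and forecasts for two days.
actual_sales = [100.0, 200.0]
forecast_sales = [110.0, 190.0]
print(rmse(actual_sales, forecast_sales), mape(actual_sales, forecast_sales))
```

RMSE penalises large misses more heavily because errors are squared, while MAPE expresses error relative to each day's sales, which makes it easier to compare stores of different sizes.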

Lastly, the paper discusses implications for inventory management in addition to accuracy measurements. Retailers are able to make proactive decisions that increase supply chain efficiency by using the models to predict times of high and low demand. This work serves as a useful link between academic research and industry application because traditional forecasting studies frequently lack this direct practical relevance.

Reducing overstocking and understocking is one of the useful advantages of the modelling approach used in this thesis. The algorithms produce more precise predictions of future sales by utilising both past trends and contextual elements like competition and promotions. Better matching of inventory levels with anticipated demand is made possible by this. In addition to reducing the risk of running out of stock and losing sales, accurate demand forecasting guarantees that retailers only stock enough to satisfy consumer demands, eliminating expensive excess inventory that ties up money and storage. To put it another way, supply chain planners are given insights by this demand-aware forecasting method that enable leaner, more effective operations without sacrificing service quality.

Data Overview

The dataset utilized for this study was sourced from the Rossmann Store Sales Kaggle competition. It includes store-level and transactional data for 1,115 stores in Germany between January 2013 and July 2015 (Florian & Will, 2015). The sales data and the store metadata are the two primary parts of the dataset. The former records daily sales and customer volume per store, while the latter offers contextual store-level characteristics such as store type, assortment level, and promotional campaign specifics.

The dataset consists of three main files:

train.csv: Contains historical sales data, including information on store operations, short-term promotions, number of customers, status of the store, and holidays.
test.csv: Contains similar information as the training set but excludes the “Sales” column, which needs to be forecasted.
store.csv: Provides additional details about the stores, including store type, assortment, competition information, and long-term promotion history.

Key Variables

Below is an overview of key variables and their significance:

Train & Test Dataset Variables:

Store: Unique identifier for each store.
DayOfWeek: Day of the week (1 = Monday, …, 7 = Sunday).
Date: The date corresponding to the sales data.
Sales (Train Only): Daily revenue for the store.
Customers (Train Only): Number of customers visiting the store.
Open: Whether the store was open (1 = open, 0 = closed).
Promo: Indicates if a store was running a promotion on a given day.
StateHoliday: Indicates whether the day was a holiday (a = public holiday, b = Easter holiday, c = Christmas, 0 = None).
SchoolHoliday: Indicates if public schools were closed on that date.

Store Dataset Variables:

StoreType: Identifies four different store models (a, b, c, d). The dataset source is not explicit about the meaning of these models, but according to ResearchGate, a denotes a grid arrangement, b a free-form arrangement, c a racetrack arrangement, and d a circulation spine (Kien, Minh, Brett, Ibrahim, & Clinton, 2022).
Assortment: Defines the level of product variety (a = basic, b = extra, c = extended) (Ankur, Manghat, & Saurabh, 2015).
CompetitionDistance: Distance to the nearest competitor store (meters).
CompetitionOpenSinceMonth/Year: Approximate opening date of the nearest competitor.
Promo2: Indicates if the store is participating in a continuous promotional campaign (0 = No, 1 = Yes).
Promo2SinceWeek/Year: Calendar week and year when the store joined Promo2.
PromoInterval: Months in which Promo2 promotions occur.

To support the analytical requirements of this study, a suite of R libraries was loaded, each serving a specific function within the data science workflow. tidyverse, dplyr, and lubridate were used for data wrangling and date parsing. ggplot2 for plotting, scales for axis formatting, and patchwork for combining plots supported visualisation and reporting. Initial statistical profiling and data exploration relied on skimr, DataExplorer, and moments to compute descriptive statistics, skewness, and kurtosis. anomalize was used to find anomalies in temporal data, while forecast offered traditional modelling tools for time series analysis. Outlier detection was handled with isotree for isolation forests and Rlof for local outlier factor algorithms, and keras was added for possible deep learning extensions. Finally, zoo managed time-indexed data structures, particularly for imputation and rolling computations. Together, these libraries supported a modular and reproducible pipeline covering data preparation, exploratory analysis, and model creation.

Handling Missing Values

Upon preliminary examination of the three datasets, the store and test datasets were found to contain multiple incomplete fields, while the main sales dataset (train) had no missing values. These fields are naturally sparse because of business realities and are generally connected to competitor information or promotion programs (e.g., not all stores participate in Promo2, not all have nearby competitors, and such characteristics are not always recorded consistently).

Fixing the missing values

Since there were no concurrent state holidays, missing values in the test set’s Open column were presumed to mean the store was open. Given the competition’s setting, this is a reasonable assumption that keeps otherwise usable rows intact. Due to the existence of extreme outliers (such as competition distances exceeding 70,000 meters), the median rather than the mean was used to impute missing values in CompetitionDistance. Median imputation is robust to skewed distributions and preserves central tendency. For stores without competition opening dates, the competition was presumed to have begun in January of the earliest year seen in the dataset. This ensures that no missing values propagate into later feature engineering and effectively down-weights the competition effect for such stores. For stores without a running Promo2 campaign, zero was used as a placeholder to indicate that there was no active promotion. This distinction is crucial when developing time-lag features like Promo2Since, which counts the weeks that a promotion has been in effect.
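The imputation rules above can be sketched in base R as follows. This is a minimal illustration on invented toy data, not the code used in the study. In particular, reading "the earliest year seen in the dataset" as the earliest non-missing CompetitionOpenSinceYear is an assumption.

```r
# Toy stand-ins for store.csv and test.csv (all values invented for illustration)
store <- data.frame(
  Store = 1:5,
  CompetitionDistance = c(500, 1200, NA, 75000, 300),
  CompetitionOpenSinceYear = c(2010, NA, 2013, NA, 2008),
  CompetitionOpenSinceMonth = c(9, NA, 4, NA, 11),
  Promo2 = c(0, 1, 0, 1, 0),
  Promo2SinceWeek = c(NA, 13, NA, 40, NA),
  Promo2SinceYear = c(NA, 2011, NA, 2014, NA)
)
test <- data.frame(Store = c(1, 2, 3), Open = c(1, NA, 0))

# Missing Open flags are treated as "store was open"
test$Open[is.na(test$Open)] <- 1

# Median imputation: barely affected by the 75,000 m outlier
median_dist <- median(store$CompetitionDistance, na.rm = TRUE)
store$CompetitionDistance[is.na(store$CompetitionDistance)] <- median_dist

# Missing competitor opening dates: January of the earliest observed year
# (assumption: "earliest year" means the earliest non-missing opening year)
earliest_year <- min(store$CompetitionOpenSinceYear, na.rm = TRUE)
no_date <- is.na(store$CompetitionOpenSinceYear)
store$CompetitionOpenSinceYear[no_date] <- earliest_year
store$CompetitionOpenSinceMonth[no_date] <- 1

# Stores that do not take part in Promo2 get zero placeholders
no_promo2 <- store$Promo2 == 0
store$Promo2SinceWeek[no_promo2] <- 0
store$Promo2SinceYear[no_promo2] <- 0
```

With the mean instead of the median, the single 75,000 m store would pull the imputed distance far above what is typical for the other stores, which is the reason given above for preferring the median.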

Exploratory Data Analysis (EDA)

Before moving on to formal modelling, the exploratory data analysis (EDA) phase is a crucial first step in comprehending the distribution, structure, and temporal patterns present in the dataset (IBM, 2025). The dataset used in this study includes three distinct tables: train, test, and store. The test set spans a future period of roughly six weeks and has the same elements as the train set, with the exception of the target variable, sales. The train set comprises daily historical sales data for several Rossmann stores from 2013 to 2015. Store type, assortment level, promotional intervals, and competitiveness metrics are among the static details about each store that are provided by the store dataset.

There are more than a million rows in the train dataset, each representing a store-day combination. Each record is linked to a particular date and store and includes the day of the week, promotional status, holiday information, customer count (available only in the training set), and actual daily sales (the prediction target). Notably, the distributions of sales and customers are skewed to the right, as is typical of retail data because of periodic spikes during holidays or promotions (Orvath, 2023).

To start the EDA, the Date column was converted to the correct Date format using lubridate. This made time-based grouping and filtering more efficient. After that, the data was examined statistically and visually in a variety of ways. Basic summary figures showed significant fluctuation in daily sales among stores and dates. Additionally, early distribution plots showed that some stores were closed on specific days (shown by Open = 0), which naturally resulted in zero sales. As a result, these closed-store records were removed in later modelling stages.

The EDA’s visualisation of sales trends over time was a crucial component. Plotting average sales by month and year showed variations from year to year as well as seasonality. For example, December typically saw higher sales volumes across all stores, which is consistent with what is expected of customers throughout the holiday season. On the other hand, January saw a decline in sales, most likely as a result of lower post-holiday spending. The final model’s incorporation of date-derived characteristics including month, week, and year is supported by these patterns.

The dataset also contains a number of categorical and binary variables, including StateHoliday, SchoolHoliday, and Promo. These factors have a significant influence on sales behaviour, according to analysis. For example, compared to days without promotions, average sales during promotions were much greater. In a similar vein, sales spiked on some public holidays (StateHoliday = ‘a’), indicating event-driven purchasing.

The EDA’s analysis of store heterogeneity was among its most enlightening features. In addition to location, Rossmann stores vary in terms of type and assortment approach (StoreType and Assortment). Plotting revenues by StoreType made it clear that certain store types routinely generated more revenue than others. This implies that store-specific modelling techniques, or at the very least, the use of store-level features as predictors, are required.

Moreover, customer counts were available in the training data but not in the test data. Initial correlation analysis between Sales and Customers showed a strong positive relationship, especially on open, non-holiday, and non-promotion days. However, because this variable is absent in the test set, it could not be directly used in model training for future forecasting.

As part of EDA, distribution tests and outlier detection were also carried out. Large stores were frequently linked to high-sales outliers during busy holiday or promotional times. Since they accurately depict business phenomena and are crucial for precise demand forecasting, these outliers were kept.

In conclusion, the EDA step provided a solid grasp of the underlying data structure and temporal dynamics. It confirmed that important variables like promotion indicators, date-derived properties, and store-specific aspects were relevant. Decisions for feature engineering and model design in the next sections were directly influenced by the knowledge acquired during this phase.

Sales Distribution and Trends

Some days have zero sales because the store was closed, and there are also records of zero sales while the store was open. Records for closed stores were excluded, on the assumption that such entries do not represent consumer behaviour or sales performance. Including them would have added noise to the training data and distorted the model’s view of real demand trends. Removing them ensured that the model was trained only on open-store days, which better represent actual demand and should improve the precision of the forecasting models.

However, according to the data source, there are cases where open stores reported zero sales because of shop renovations. These records were retained: renovations are a normal part of store operations that affect sales, so they should be reflected when forecasting sales.
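The filtering rule described above can be sketched in base R. The data frame below is invented toy data. The point is only that the filter is applied to the Open flag, not to Sales, so open-store days with zero sales survive.

```r
# Toy rows: one closed day, one open day with zero sales, two normal days
train <- data.frame(
  Store = c(1, 1, 2, 2),
  Open  = c(1, 0, 1, 1),
  Sales = c(5263, 0, 0, 6064)
)

# Drop closed-store rows; keep every open day, including zero-sales ones
train_open <- train[train$Open == 1, ]
print(train_open)
```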

Sales Summary Statistics

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      0    4859    6369    6956    8360   41551

Mean Sales: 6955.514

Median Sales: 6369

Sales Variance: 9636149

Sales Skewness: 1.593919

These sales summary statistics tell us important characteristics about the distribution of sales in the dataset.

The mean is higher than the median. This suggests that the distribution is right-skewed: most observations have lower sales, while a few very high values pull the mean upward.

Variance measures the spread of sales values meaning how much sales fluctuate across stores and days. A high variance means sales are not stable and vary significantly between different stores or time periods.

A skewness of 1.59 indicates that the distribution is right skewed. This means most sales are on the lower end, with some stores having very high sales (outliers). The presence of high-selling stores could be due to better locations, stronger customer demand, or more effective promotions.
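For reference, the statistics above can be reproduced with a few lines of base R. The sketch below uses an invented toy vector, not the Rossmann data, and computes skewness as the third central moment divided by the 1.5 power of the second central moment, which is the moment-based definition assumed here.

```r
# Toy daily sales with one extreme value (invented for illustration)
sales <- c(0, 4200, 4859, 5600, 6369, 7100, 8360, 12000, 41551)

sales_mean   <- mean(sales)
sales_median <- median(sales)
sales_var    <- var(sales)                # sample variance

# Moment-based skewness: m3 / m2^(3/2), using central moments
m2 <- mean((sales - sales_mean)^2)
m3 <- mean((sales - sales_mean)^3)
sales_skew <- m3 / m2^1.5

# A mean above the median and a positive skewness both indicate a right tail
c(mean = sales_mean, median = sales_median, skewness = sales_skew)
```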

Since sales vary widely across stores, simple linear models may fit poorly, and machine learning models that capture non-linear relationships are likely to be needed. The right skew also motivates a variance-stabilising transformation such as log1p(Sales), which is used later in feature engineering. Sequence models such as transformers, which can look at an entire year of sales history rather than only the last few days, are another option for finding seasonal patterns and making long-term predictions, but they are not among the three models compared in this study.

Sales Distribution


The sales distribution histogram shows a substantially right-skewed pattern, with a long tail stretching to the right and the majority of daily sales grouped towards the lower end. This skewness implies that although most store days have reasonable sales, there are sporadic surges in income that are noticeably higher, usually around holidays or promotional seasons.

Average Sales Trend

Sales Time Trend for Top 3 Stores

Due to the large quantity of stores, it was not possible to properly visualize the actual sales trend of all the years without the noise, which is why conditional sales trend plotting was a better option.

The weekly sales trend graph confirms the presence of temporal variation in customer behavior. Sales are not uniform across the calendar but show clear seasonal patterns, with noticeable peaks towards the end of each year, especially in November and December. This could be attributed to increased consumer expenditure during holiday seasons and end-of-year promotions.

Additionally, the graph showing sales time trends for the top 3 performing stores demonstrates that high-performing stores also experience volatility over time. Despite being top performers, their sales figures fluctuate due to external factors such as holidays, promotions, or local events. However, similar to the overall trend, even these stores consistently experience sales surges at year-end, reinforcing the seasonal nature of retail demand.

Daily Sales Distribution

Sales fluctuate throughout the week, with Sunday showing high sales, possibly due to the closure of competitor stores on that day. However, sales on Sunday also vary significantly across different stores, indicating differences in customer traffic or store-specific factors. Monday records the highest overall sales, which could be driven by increased shopping activity after the weekend.

Monthly Sales Distribution

The median sales seem stable across all the months. Sales variation increases in November and December, which may reflect holiday shopping spikes. The outliers above the whiskers show that some stores had exceptionally high sales in certain months. April, May, and June have the highest sales outliers, which might be explained by the transition from the cold to the warm season, when people shop for summer supplies.

Yearly Sales Distribution

Sales are quite similar across all years. The only difference is that 2015 has more significant outliers than the other years, and Rossmann recorded its highest sales point in that year.

Store Performance Analysis

The distribution of store types

Store type A dominates the dataset, while other types such as B and C are comparatively few.

The distribution of store assortments

Assortments A and C dominate the dataset, while Assortment B is rare.

Relationship between Store Type and Store Assortment

   
      a   b   c
  a 381   0 221
  b   7   9   1
  c  77   0  71
  d 128   0 220

From this table:

Store Type “a” is the most common. It has 381 stores with Assortment “a” and 221 stores with Assortment “c”. There are no stores with Assortment “b” in this type.

Store Type “b” is rare. It has only 17 stores in total (7 with Assortment “a”, 9 with “b”, and 1 with “c”). This is significantly lower than other store types.

Store Type “c” is moderately represented. It has 77 stores with Assortment “a” and 71 with “c”, but none with “b”. This indicates it mostly operates with Assortments “a” and “c”.

Store Type “d” leans towards Assortment “c”. It has 128 stores with Assortment “a” and 220 with “c”. Similar to Type “a”, it lacks Assortment “b”.

Assortment “b” is the least common overall. Only 9 stores of Type “b” have it. Other store types do not have Assortment “b” at all.
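A cross-tabulation of this kind can be produced with base R's table(). The sketch below uses a small invented store table, so the counts differ from those reported above.

```r
# Toy store metadata (invented values)
store <- data.frame(
  StoreType  = c("a", "a", "a", "b", "b", "c", "d", "d"),
  Assortment = c("a", "c", "a", "b", "a", "c", "a", "c")
)

# Rows are store types, columns are assortment levels
type_by_assortment <- table(StoreType = store$StoreType,
                            Assortment = store$Assortment)
print(type_by_assortment)

# Row totals give the number of stores per store type
rowSums(type_by_assortment)
```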

This uneven distribution of store types and assortments raises questions. Why is Store Type “b” so rare? Since the data source does not explain this, one plausible reading is that Store Type “b” is a test format, a discontinued type, a format serving a niche market, or part of a specific business strategy.

Weekly Sales Trend by Store Type


Weekly Sales Trend by Assortment Type

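# Packages needed for this chunk
library(dplyr)
library(lubridate)
library(ggplot2)

train_filtered_store %>%
  # Collapse daily rows to the start of their week, per assortment type
  group_by(Date = floor_date(Date, "week"), Assortment) %>%
  summarise(Avg_Sales = mean(Sales, na.rm = TRUE), .groups = "drop") %>%
  ggplot(aes(x = Date, y = Avg_Sales, color = Assortment)) +
  # linewidth replaces size for lines (size is deprecated since ggplot2 3.4.0)
  geom_line(linewidth = 0.8) +
  facet_wrap(~ Assortment) +
  labs(title = "Weekly Sales Trend by Assortment Type",
       x = "Date", y = "Average Sales") +
  theme_minimal() +
  scale_color_brewer(palette = "Dark2") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))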

Store Type and Assortment Impact on Sales

Store Type Sales Analysis:
- Store Type B (free-form layout) has the highest sales, and the graph shows its sales growing over time. Type A (grid layout), Type C (racetrack layout), and Type D (circulation spine layout) follow broadly similar patterns, and their sales growth is less pronounced than that of Type B.
Assortment Sales Analysis:
- Sales for the extra assortment (b) grow over time, just like Store Type B, while the basic assortment (a) is more consistent and the extended assortment (c) shows slight growth over time.
Store Type + Assortment Sales Analysis:
- B.A and B.B have the highest median sales, meaning that stores of Type B with Assortments A and B tend to perform better than others in terms of sales. Store Type B also has the highest variability, meaning some of these stores perform extremely well while others do not. B.B in particular has a very wide range, suggesting that some of these stores generate exceptionally high sales while others fluctuate. Possible explanation: Store Type B stores might be premium or high-traffic stores that attract more customers or higher spending per visit.
- Store Type A and Store Type D have more stable sales. Their boxes are shorter, meaning sales are less variable across different stores. A.A, A.C, D.A, and D.C have similar medians, suggesting these store-assortment combinations perform consistently across locations.
- Store Type C has moderate sales but high variability. C.A, C.C, and D.C show a wider range of sales values, indicating that some stores perform very well while others lag. They might be located in less predictable markets, where sales depend on external factors like location or competition.

Competition Impact on Sales

Usually, the shorter the distance to competitors, the lower the sales. In this graph, however, stores with nearby competitors tend to have higher sales than those with more distant competitors. It is possible that those stores are located in high-traffic areas, where there are many customers regardless of competition.

Customer Insights

Weekly Customer trend

Weekly Customer Trend by Store Type

Weekly Customer Trend by Assortment

Relationship Between Sales and Customers


The customer trend across the dataset closely mirrors the overall sales trend, indicating a strong and consistent relationship between the number of customers and the revenue of each store. Customer numbers rise during peak seasons, just as sales grow towards the end of the year due to seasonal demand, especially during holidays and promotional periods. The distributions of customers and sales also follow a similar pattern when store performance is examined by StoreType and Assortment, indicating that specific store types and assortment types inherently draw more foot traffic. Store type b and assortment b, for example, typically have higher customer counts, which is consistent with their sales. The plot of sales against customers supports this finding, showing a strong positive linear association. This is expected in a retail setting, where sales are closely tied to foot traffic. In rare cases, some stores have many customers but lower sales per customer, while others have fewer customers who spend more per visit. In general, however, sales increase with the number of customers, confirming the significance of customer volume as a major driver of revenue and supporting the inclusion of customer-related patterns in the demand forecasting model.

Impact of Short-Term Promotion On Sales

Short-term promotions increase sales compared to non-promotional periods.

Long-Term Promotion Impact on Sales

It is interesting to see that the median sales value appears lower for stores with long-term promotions compared to those without.

The median sales value appears almost identical in both cases, indicating no major effect of seasonal promotions on median sales. The spread of sales and the presence of high outliers are similar for both categories. This suggests that while seasonal promotions might impact certain stores or specific times, they do not affect overall sales.

Effect of Holidays

Public, Christmas, and Easter holidays show a drop in sales because stores are closed most of the time on those days. School holidays slightly increase sales, possibly because families shop more when schools are closed.

Data limitations

This research would benefit from economic indicator data, because macroeconomic factors (such as inflation and unemployment) are very important for in-depth analysis of sales history and demand forecasting. Customer behaviour data would also improve forecasting: while the Customers variable is available, more granular data (e.g., basket size, product preferences) would add detail. For example, high demand for certain products (e.g., sunglasses and sunscreen during summer) could explain an increase in sales, and a shortage of those products could decrease sales in the same period.

Anomaly Detection

I used the Interquartile Range (IQR) approach to detect anomalies in the training dataset in order to guarantee the accuracy of the data used for modelling. Extreme values introduced by anomalies, especially in the Sales and Customers variables, can seriously skew prediction models by misleading the learning algorithms.

Values falling below the first quartile (Q1) minus 1.5 times the IQR or above the third quartile (Q3) plus 1.5 times the IQR are flagged by the IQR method, a non-parametric approach to outlier detection. Because it is robust to non-normal distributions, this method is frequently employed (Dr. Vikas, Dr. Darshan, Dr. Sachine, & Dr. Rajivkumar, 2024).
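The rule can be written directly in base R. The following is a minimal sketch on an invented toy vector. quantile() is called with its default interpolation method, which is an assumption about how the quartiles were computed in the study.

```r
# Toy daily sales with one extreme value (invented for illustration)
sales <- c(4800, 5200, 5900, 6100, 6400, 6900, 7300, 8100, 8400, 25000)

q1  <- quantile(sales, 0.25, names = FALSE)
q3  <- quantile(sales, 0.75, names = FALSE)
iqr <- q3 - q1

# Fences at 1.5 * IQR below Q1 and above Q3
lower <- q1 - 1.5 * iqr
upper <- q3 + 1.5 * iqr

# Flag, rather than delete, the values outside the fences
is_anomaly <- sales < lower | sales > upper
sales[is_anomaly]
```

Keeping a logical flag instead of dropping rows matches the decision in this study to inspect anomalies without removing them.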

I visualized these anomalies after computing the IQR thresholds and found high sales values, particularly at the end of 2014, which makes sense given that it was the holiday season. I decided not to exclude these anomalies because they may be seasonal or reflect genuine promotion-driven increases. Instead, their existence was noted and taken into account during the modelling phase, where robust algorithms like Random Forest and XGBoost are better suited to handle this kind of variation.

Modeling

Comparing conventional and sophisticated predictive models for sales forecasting is one of the main goals of this thesis, which ultimately aims to increase supply chain efficiency and demand forecasting accuracy. Better inventory planning, fewer stockouts or overstocking, and smart decision-making in distribution and procurement are all made possible by accurate demand forecasts.

In order to do this, I used and contrasted three different modelling techniques: Random Forest and XGBoost, which are cutting-edge machine learning models renowned for their effectiveness in structured data challenges, and Linear Regression, which is a representative of conventional statistical models. I sought to determine if sophisticated models actually provide quantifiable advantages over more straightforward options by analysing their forecast accuracy and resilience to the characteristics of sales data.

Data Pre-Processing and Feature engineering

Meaningful feature engineering and thorough data pre-processing are essential steps in creating reliable forecasting models. These procedures guarantee that the input data is clear, regularly organised, and enhanced with insights pertinent to the topic.

Consistency in data types

To begin with, a thorough check and transformation of data types was performed across the three datasets: train, test, and store. To enable proper handling during modelling, categorical variables like Store, StateHoliday, and PromoInterval were transformed into factors. All numerical variables (such as sales, customers, and competition distance) were explicitly cast into their corresponding numeric or integer types, while date variables were parsed into date format to allow for temporal changes.

Particularly in R, incorrect data types can result in inaccurate feature interpretations, memory inefficiencies, or modelling problems, which makes this standardization essential.
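As an illustration of this kind of type standardization, the base R sketch below converts a toy data frame whose columns were all read as text. The rows are invented, and the choice of factor levels for StateHoliday follows the variable description given earlier.

```r
# Toy rows where every column was read as character (invented values)
train <- data.frame(
  Store = c("1", "2"),
  Date = c("2015-07-31", "2015-07-30"),
  Sales = c("5263", "6064"),
  StateHoliday = c("0", "a"),
  stringsAsFactors = FALSE
)

# Categorical variables become factors; fixing the levels up front keeps
# train and test consistent even if a level is absent from one of them
train$Store <- factor(train$Store)
train$StateHoliday <- factor(train$StateHoliday, levels = c("0", "a", "b", "c"))

# Dates and numeric fields are cast explicitly
train$Date <- as.Date(train$Date, format = "%Y-%m-%d")
train$Sales <- as.integer(train$Sales)

str(train)
```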

Additional information on each store, including its assortment type, competition distance, and promotional strategies, can be found in the store.csv file. A left join was done on the Store key to add this metadata to the train and test datasets. This made it possible to incorporate outside factors like:

StoreType and Assortment, to differentiate between retail format and assortment breadth.
CompetitionDistance and CompetitionOpenSinceMonth/Year, to capture competitive pressure.
Promo2, Promo2SinceWeek/Year, and PromoInterval, to record recurrent promotional effects.

By combining these variables, the feature set was increased and the models were able to learn from sources other than transaction-level data.
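The left join on the Store key can be sketched with base R's merge(). The data frames below are invented, and dplyr's left_join would be an equivalent alternative.

```r
# Toy sales rows and store metadata (invented values)
train <- data.frame(Store = c(1, 1, 2, 3),
                    Sales = c(5263, 5020, 6064, 8314))
store <- data.frame(Store = c(1, 2),
                    StoreType = c("c", "a"),
                    Assortment = c("a", "a"))

# all.x = TRUE keeps every sales row, even without matching metadata
train_store <- merge(train, store, by = "Store", all.x = TRUE)
print(train_store)
```

Store 3 has no metadata in this toy example, so its StoreType and Assortment come back as NA instead of the row being dropped. Note that merge() sorts the result by the key, so the original row order is not preserved.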

Feature engineering

In order to increase the accuracy of demand forecasting, this study’s feature selection and computation were informed by both domain expertise and machine learning concepts. In order to capture cyclical sales patterns, time-based elements were taken straight out of the Date column. Monthly seasonality, promotional times, holiday effects, and more general temporal sales trends are all reflected in features like month, week, day, and year. Furthermore, the WeekOfMonth feature was developed to differentiate sales differences within specific months, taking into consideration intra-month variances such pay periods and end-of-month spikes. The model can recognise and learn from consistent time-driven behaviours in retail demand thanks to these temporal cues.

Along with temporal components, factors related to competition and promotion were included to measure the impact of outside business dynamics. CompetitionOpenSince and Promo2Since were computed to indicate how long it has been (in months and weeks, respectively) since a rival opened or a long-term promotion started. These features allow the models to learn how sales performance may change with new competitors or prolonged marketing campaigns. Missing values in these fields were replaced with zeros where there was no promotion or competition, so that “not applicable” cases are encoded explicitly rather than left as unknown.

Another significant class of engineered features comprised the mean log-transformed sales aggregated by categorical group. These include features like MeanLogSalesByStateHoliday, MeanLogSalesByStore, and MeanLogSalesByAssortment. Aggregates were used in place of one-hot encoding to avoid high dimensionality and to represent the past behaviour of related entities. log1p(Sales) was used to stabilise variance by reducing skew and lessening the effect of extreme outliers.

Category	Variable Identifier
Time	Day, Week, Month, Year, WeekOfMonth
Sales Info	Sales, LogSales (train only)
Store Info	StoreType, Assortment, CompetitionDistance
Promotion Info	Promo, Promo2, Promo2Since, Promo2SinceYear, Promo2SinceWeek, PromoInterval
Competition Info	CompetitionOpenSince, CompetitionOpenSinceMonth, CompetitionOpenSinceYear
Mean Aggregates	All MeanLogSalesBy.. features (except customer-based)

Time based features

Time-based features are crucial because sales are inherently time-dependent. I extracted a number of granular time variables from the Date field. Standard components to capture seasonal and annual effects include month, week, day, and year. Each extracted temporal feature was kept because it was distinct and added to the predictive power of the model. Seasonal trends, such as the busiest shopping months (like December), are captured by the month feature. The day variable helps capture end-of-month behaviours and pay cycle effects, while the week variable offers resolution for weekly promotions. Growth over time and macroeconomic conditions are accommodated by the year variable. Finally, WeekOfMonth captures fine-grained within-month fluctuations that the other calendar features might overlook by dividing each month into four to five weekly segments. By combining these variables, the model can learn how sales are distributed over the course of the year, covering a range of temporal scales from daily patterns to annual trends.

The model can learn recurrent patterns such as weekend drops, end-of-month surges, or Q4 spikes with the use of these temporal features.
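A base R sketch of this extraction is shown below on three invented dates. The exact WeekOfMonth definition used in the study is not stated, so the ceiling(Day / 7) rule here is an assumption. It yields values from 1 to 5, consistent with the "four to five weekly segments" described above. The ISO week format code is likewise one possible choice for the week number.

```r
# Three example dates (invented for illustration)
dates <- as.Date(c("2014-12-01", "2014-12-24", "2015-01-31"))

features <- data.frame(
  Date  = dates,
  Year  = as.integer(format(dates, "%Y")),
  Month = as.integer(format(dates, "%m")),
  Week  = as.integer(format(dates, "%V")),  # ISO 8601 week of the year
  Day   = as.integer(format(dates, "%d"))
)

# Assumed definition: days 1-7 are week 1, days 8-14 are week 2, and so on
features$WeekOfMonth <- as.integer(ceiling(features$Day / 7))

print(features)
```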

Competition and Promo features

The following features were designed to take into account the age of competition and promotions:

CompetitionOpenSince — the number of months that have passed since the opening of the rival store.
Promo2Since — the number of weeks that have passed since the commencement of the current recurrent offer.

These were computed by comparing the current date with Promo2SinceWeek/Year and CompetitionOpenSinceYear/Month. Missing values were set to zero, such as in cases when there was no competition. This change made it possible for models to evaluate the effects of market saturation or sustained marketing campaigns on sales.
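One way to compute these two durations is sketched below in base R. The rows are invented, a zero year is treated as "not applicable", and a year is approximated as 12 months or 52 weeks. All of these are assumptions made for illustration, not the study's exact formula.

```r
# Toy rows: zeros mark "no competition date" or "no Promo2" (invented values)
df <- data.frame(
  Date = as.Date(c("2015-07-31", "2015-07-31", "2015-07-31")),
  CompetitionOpenSinceYear  = c(2014, 2010, 0),
  CompetitionOpenSinceMonth = c(7, 1, 0),
  Promo2 = c(1, 0, 1),
  Promo2SinceYear = c(2015, 0, 2014),
  Promo2SinceWeek = c(1, 0, 31)
)

year  <- as.integer(format(df$Date, "%Y"))
month <- as.integer(format(df$Date, "%m"))
week  <- as.integer(format(df$Date, "%V"))  # ISO week of the year

# Months since the competitor opened; 0 when there is no competition date
df$CompetitionOpenSince <- ifelse(
  df$CompetitionOpenSinceYear > 0,
  12 * (year - df$CompetitionOpenSinceYear) + (month - df$CompetitionOpenSinceMonth),
  0
)

# Weeks since the store joined Promo2; 0 when the store does not participate
df$Promo2Since <- ifelse(
  df$Promo2 == 1,
  52 * (year - df$Promo2SinceYear) + (week - df$Promo2SinceWeek),
  0
)

# Dates before the recorded start would give negatives; clamp them to zero
df$CompetitionOpenSince <- pmax(df$CompetitionOpenSince, 0)
df$Promo2Since <- pmax(df$Promo2Since, 0)
```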

Mean Log Sales features

Mean log sales features were created based on several groupings rather than depending only on raw categorical variables, which can result in dimensionality problems: MeanLogSalesByAssortment, MeanLogSalesByStore, MeanLogSalesByStateHoliday, and so forth.

The model may learn from previous store, holiday, and assortment-level behaviour thanks to these aggregates, which encode historical sales performance. To manage skewed distributions and stabilise variance, I employed log-transformed sales (log1p(Sales)). Particularly in situations when the test set is devoid of target labels (such as actual sales), these manufactured aggregates enhance generalisation.
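A minimal base R sketch of one such aggregate, MeanLogSalesByStore, is given below on invented data. The aggregate is computed from training rows only and then joined onto the test rows, which is the assumed way of making it available where Sales is unknown.

```r
# Toy training sales and a test set without the target (invented values)
train <- data.frame(Store = c(1, 1, 2, 2, 3),
                    Sales = c(5263, 5020, 6064, 0, 8314))
test <- data.frame(Store = c(1, 2, 3))

# log1p handles zero sales safely: log1p(0) is 0
train$LogSales <- log1p(train$Sales)

# Mean log sales per store, computed on the training rows only
store_means <- aggregate(LogSales ~ Store, data = train, FUN = mean)
names(store_means)[2] <- "MeanLogSalesByStore"

# Attach the same lookup table to both train and test
train <- merge(train, store_means, by = "Store", all.x = TRUE)
test  <- merge(test, store_means, by = "Store", all.x = TRUE)
print(test)
```

Predictions made on the log scale can be mapped back to sales with expm1(), the inverse of log1p().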

Training Models

The models were taught to forecast the sales value for every historical observation (i.e., for every date-store pair) throughout the training phase. This indicates that the models gained an understanding of the relationship between real sales results and factors like calendar patterns, holidays, and promotions. The objective was to train the model to capture generalisable correlations between characteristics and sales levels, not to anticipate the exact sales for the following day. In order to simulate how effectively the trained models will function on fresh, future data, they were evaluated on unseen records from the 2015 sample during validation. Estimating model performance prior to final deployment on the real test set requires this validation step.
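A time-based hold-out of this kind can be sketched in base R as follows. The rows and the cut-off date are hypothetical. The study only states that validation used unseen 2015 records, so the exact boundary is an assumption.

```r
# Toy store-day records (invented values)
train <- data.frame(
  Date = as.Date(c("2014-11-15", "2014-12-20", "2015-03-02",
                   "2015-06-25", "2015-07-30")),
  Sales = c(6100, 9800, 5900, 6400, 7200)
)

# Hypothetical cut-off: everything on or after this date is held out
cutoff <- as.Date("2015-06-15")

fit_set   <- train[train$Date <  cutoff, ]
valid_set <- train[train$Date >= cutoff, ]
```

Splitting by date rather than at random keeps every validation record later than every fitting record, which mimics forecasting on future data.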

Linear Regression (Baseline Model)

Linear Regression is a widely used statistical method that models the relationship between a dependent variable and one or more independent variables, assuming a linear relationship. It is a crucial baseline for forecasting tasks because of its ease of use and interpretability (Udbhav, Karthik, Rohini, & Ramakanth, 2021).
To prevent high-cardinality problems and factor-level mismatches in the test set, Store was removed from the feature set in this investigation. However, I employed engineered features like MeanLogSalesByStore, which provide a numerical summary of store behaviour, in order to preserve store-level effects. With this method, the model could take advantage of store-specific patterns without directly including high-cardinality categorical variables. This model offers a helpful benchmark for comparing performance, despite its inability to represent non-linear interactions.
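The idea of replacing the Store factor with a numeric summary can be sketched with base R's lm(). The data below are simulated, and fitting on the log scale with only two predictors is an assumption made to keep the example small. It is not the study's model specification.

```r
set.seed(42)
n <- 200

# Simulated features: a promotion flag and a per-store mean of log sales
toy <- data.frame(
  Promo = rbinom(n, 1, 0.4),
  MeanLogSalesByStore = rnorm(n, mean = 8.7, sd = 0.3)
)

# Simulated target: log sales rise by 0.3 on promotion days, plus noise
toy$LogSales <- 0.3 * toy$Promo + toy$MeanLogSalesByStore + rnorm(n, sd = 0.1)

# Baseline linear model without any Store factor
fit <- lm(LogSales ~ Promo + MeanLogSalesByStore, data = toy)
coef(fit)

# Map log-scale predictions back to sales
pred_sales <- expm1(predict(fit, newdata = toy))
```

Because MeanLogSalesByStore is a single numeric column, the model needs one coefficient for store effects instead of one per store, and unseen factor levels in the test set cannot cause a mismatch.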

Random Forest

Random Forest is a non-parametric, tree-based ensemble method that builds multiple decision trees and aggregates their results to improve generalisation. It handles outliers, missing values, and variable interactions robustly, making it well suited to identifying complex, non-linear patterns in the data (Lena, Moritz, & Markus, 2024).
The Random Forest model was trained with the caret package using the ranger backend. Parallel processing was used to cut down on training time, and 3-fold cross-validation was used to tune the model and balance bias against variance. Random Forest was chosen because it performs well in regression tasks on structured business data, is resilient to noise, and is interpretable through variable importance.
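The training setup described above might look roughly like the sketch below. It uses synthetic data and an assumed feature set; `tuneLength = 2`, `num.trees = 100`, and the impurity importance mode are illustrative choices rather than the settings used in the study. The block runs only when the required packages are installed.

```r
# Sketch only: caret's "ranger" method may also need e1071 and dplyr.
pkgs <- c("caret", "ranger", "e1071", "dplyr")
if (all(vapply(pkgs, requireNamespace, logical(1), quietly = TRUE))) {
  set.seed(42)
  n <- 300
  toy <- data.frame(
    Promo               = factor(sample(0:1, n, replace = TRUE)),
    DayOfWeek           = factor(sample(1:7, n, replace = TRUE), levels = 1:7),
    MeanLogSalesByStore = rnorm(n, mean = 8.7, sd = 0.3)
  )
  toy$Sales <- 5000 + 1500 * (toy$Promo == "1") +
    800 * (toy$MeanLogSalesByStore - 8.7) + rnorm(n, sd = 300)

  # 3-fold cross-validation. allowParallel only has an effect once a
  # parallel backend (for example via doParallel) has been registered.
  ctrl <- caret::trainControl(method = "cv", number = 3, allowParallel = TRUE)

  rf_model <- caret::train(
    Sales ~ ., data = toy,
    method     = "ranger",
    trControl  = ctrl,
    tuneLength = 2,           # small illustrative tuning grid
    num.trees  = 100,         # passed through to ranger
    importance = "impurity"   # needed later for variable importance
  )
  print(rf_model$bestTune)
}
```

Cross-validation here selects among candidate values of ranger's tuning parameters (mtry, splitrule, min.node.size); the final model is then refit on all training rows with the best combination.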

XGBoost (Advanced Gradient Boosting Model)

Extreme Gradient Boosting, or XGBoost, is a highly optimised implementation of gradient-boosted decision trees. It is known for its accuracy and speed, particularly in production settings and in competitions on structured datasets (Lena, Moritz, & Markus, 2024).
In this thesis, a model matrix was used to train XGBoost on a one-hot encoded version of the data. The model was trained on a cleaned, feature-rich dataset using default hyperparameters. Because it combines regularisation, tree pruning, and second-order optimisation, it is effective at identifying subtle patterns in sales trends. Its built-in handling of missing values also adds to the robustness of the model.
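One way to build such a one-hot encoded model matrix is sketched below on synthetic data. The column names, the `contrasts.arg` idiom for full one-hot coding, and `nrounds = 50` are assumptions for illustration (the number of boosting rounds has no default in `xgb.train`, so it must be chosen). The training call runs only if the xgboost package is installed.

```r
set.seed(3)
n <- 200
toy <- data.frame(
  Promo               = factor(sample(0:1, n, replace = TRUE), levels = 0:1),
  DayOfWeek           = factor(sample(1:7, n, replace = TRUE), levels = 1:7),
  CompetitionDistance = round(runif(n, 100, 5000))
)
toy$Sales <- 5000 + 1500 * (toy$Promo == "1") + rnorm(n, sd = 300)

# Full one-hot encoding: keep every level of every factor (no reference
# level dropped) and omit the intercept column.
factor_cols <- names(Filter(is.factor, toy))
X <- model.matrix(
  Sales ~ . - 1, data = toy,
  contrasts.arg = lapply(toy[factor_cols], contrasts, contrasts = FALSE)
)
colnames(X)

if (requireNamespace("xgboost", quietly = TRUE)) {
  dtrain  <- xgboost::xgb.DMatrix(data = X, label = toy$Sales)
  xgb_fit <- xgboost::xgb.train(
    params  = list(objective = "reg:squarederror"),
    data    = dtrain,
    nrounds = 50   # assumed value for the sketch
  )
  head(predict(xgb_fit, dtrain))
}
```

With this encoding each factor level becomes its own 0/1 column, which is the numeric input format that XGBoost expects.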

Model Performance Comparison

To evaluate model performance, two commonly used regression metrics were used: Root Mean Squared Error (RMSE) and Mean Absolute Percentage Error (MAPE). RMSE is useful for assessing the impact of large deviations in predicted sales because it penalises larger errors more heavily. MAPE expresses forecast error as a percentage of actual sales, which makes accuracy easier to compare across stores and time periods (Jonatasv, 2024). To estimate each model's capacity to generalise to previously unseen data, these measures were calculated on a held-out validation set.
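Both metrics are straightforward to compute. The sketch below shows one possible definition with a small worked example; dropping rows with zero actual sales before computing MAPE is an assumption made here (the percentage error is undefined for them), not necessarily what was done in the study.

```r
# Root Mean Squared Error: squares the errors, so large misses dominate.
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Mean Absolute Percentage Error, in percent.
# Rows with zero actual sales are skipped (assumption for this sketch).
mape <- function(actual, predicted) {
  keep <- actual != 0
  mean(abs((actual[keep] - predicted[keep]) / actual[keep])) * 100
}

actual    <- c(100, 200, 400)
predicted <- c(110, 180, 400)

rmse(actual, predicted)  # sqrt((100 + 400 + 0) / 3), about 12.91
mape(actual, predicted)  # mean(10%, 10%, 0%), about 6.67
```

In the example, the error of 20 units contributes four times as much to RMSE as the error of 10 units, while both count equally (10% each) in MAPE.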

              Model      RMSE      MAPE
1 Linear Regression 1438.3852 16.452824
2     Random Forest  741.3801  7.536129
3           XGBoost  782.8541  8.086746

The results clearly show that the machine learning models outperform conventional regression. Compared with Linear Regression, Random Forest and XGBoost both achieved markedly lower RMSE and MAPE values, demonstrating their greater capacity to capture the intricate dynamics of sales. Interestingly, Random Forest performed marginally better than XGBoost in this instance, perhaps because its inherent averaging mechanism works better with high-variance data. Nevertheless, both tree-based models delivered good predictive performance and proved useful for demand forecasting tasks.

Actual vs. Predicted Sales — Interpretation

To assess model performance beyond metrics such as RMSE and MAPE, line plots and scatter plots across time and stores were used to compare actual and predicted sales. These charts provide important interpretive information about each model's predictive ability and limitations.

Line plots comparing predicted and actual sales over time for the sampled stores showed that both Random Forest and XGBoost captured the general sales trends, including seasonality and promotion effects. Interestingly, XGBoost's predictions tracked the actual sales curves closely and were steadier and smoother. Because Random Forest relies on bootstrapped trees and lacks sequential learning, it occasionally displayed higher variance, especially during high-volume promotional weeks.

The scatter plot of predicted versus actual sales offered a condensed view of model performance. Perfect forecasts would fall on the dashed 45-degree diagonal. Random Forest and XGBoost predictions clustered more tightly around the diagonal, suggesting lower prediction error and better fit, whereas Linear Regression showed wider spread and greater divergence from this line, particularly for larger sales values.

All things considered, the visualisation confirmed that sophisticated models, particularly those based on trees, are better at capturing non-linear links and interactions in sales data. These findings support the model selection procedure and are in line with the evaluation metrics previously provided.

These results provide credence to the idea that sophisticated models like Random Forest and XGBoost can greatly improve sales forecasting in retail settings like Rossmann. Organisations can optimise resource allocation, decrease uncertainty, and become more responsive to shifting consumer demands by incorporating such models into supply chain planning systems. A data-driven argument for switching from conventional forecasting techniques to more contemporary, data-intensive machine learning frameworks is presented in this thesis.

Feature Importance After Prediction

Random Forest Feature Importance

# Packages used below: caret (varImp), dplyr (%>%, filter), ggplot2 (plotting)
library(caret)
library(dplyr)
library(ggplot2)

# Define the features of interest
important_features <- c("DayOfWeek", "Promo", "StateHoliday", "SchoolHoliday", 
                        "CompetitionDistance", "WeekOfMonth", "month", "week", 
                        "day", "year", "CompetitionOpenSince", "Promo2Since", 
                        "MeanLogSalesByStoreType", "MeanLogSalesByStore", 
                        "MeanLogSalesByStateHoliday", "MeanLogSalesByAssortment", 
                        "MeanLogSalesByPromoInterval", "MeanLogSalesByStorePromoDOW", 
                        "MeanLogSalesBySchoolHoliday2Type")

# Extract and filter the variable importance data frame.
# rf_model is the caret Random Forest fit; with the ranger backend, importance
# scores are only available if the model was trained with an importance mode
# (for example importance = "impurity").
importance_rf <- varImp(rf_model, scale = TRUE)
rf_df <- importance_rf$importance
rf_df$Feature <- rownames(rf_df)
rf_df_filtered <- rf_df %>% 
  filter(Feature %in% important_features)

# Horizontal bar chart of the scaled importance scores, largest at the top
ggplot(rf_df_filtered, aes(x = reorder(Feature, Overall), y = Overall, fill = Overall)) +
  geom_col() +
  coord_flip() +
  scale_fill_gradient(low = "lightblue", high = "darkblue") +
  theme_minimal() +
  labs(title = "Random Forest: Filtered Feature Importance", 
       x = "Feature", y = "Importance Score", fill = "Score")

XGBoost Feature Importance

Final predictions on the test data set

Random Forest for Final Sales Prediction

Final sales forecasts for the test dataset were produced using the Random Forest model, as it performed best of the three models. Generating final forecasts with both ensemble methods also makes it possible to compare how consistently each captured sales dynamics.

To guarantee compatibility with the structure used to train the Random Forest, the test data first went through a rigorous preprocessing step. Promo, SchoolHoliday, StateHoliday, and other categorical variables were converted into factors with the same levels as in the training data. To preserve the temporal patterns learnt during training, the variable DayOfWeek, which was originally numeric, was also converted into a labelled factor with consistent ordering.

To flag missing information and avoid prediction errors, missing values in important engineered features, Promo2Since and CompetitionOpenSince in particular, were imputed with -1. The cleaned and aligned test data was then fed to the Random Forest model, which had been trained on the same predictor set used in the final modelling step.
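These two preprocessing steps can be sketched in base R as follows. The toy data frames and the day labels (1 = Monday through 7 = Sunday) are assumptions for illustration; only the column names come from the text above.

```r
dow_labels <- c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")  # assumed labels

# Toy training data with the factor structure the model was fitted on.
train_df <- data.frame(
  StateHoliday = factor(c("0", "a", "b", "c", "0")),
  DayOfWeek    = factor(c(1, 2, 3, 4, 5), levels = 1:7, labels = dow_labels)
)

# Toy raw test data: characters, plain numbers, and missing values.
test_df <- data.frame(
  StateHoliday         = c("0", "a", "0"),
  DayOfWeek            = c(4, 5, 6),
  Promo2Since          = c(12, NA, 30),
  CompetitionOpenSince = c(NA, 48, 7)
)

# 1. Rebuild each factor with the training levels, so that levels absent
#    from the test set (here "b" and "c") still exist in the factor.
test_df$StateHoliday <- factor(test_df$StateHoliday,
                               levels = levels(train_df$StateHoliday))
test_df$DayOfWeek <- factor(test_df$DayOfWeek, levels = 1:7, labels = dow_labels)

# 2. Impute missing engineered features with the sentinel value -1.
for (col in c("Promo2Since", "CompetitionOpenSince")) {
  test_df[[col]][is.na(test_df[[col]])] <- -1
}

str(test_df)
```

Using a sentinel such as -1 lets tree-based models split "missing" off as its own group, since valid values of these duration features are non-negative.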

The final output was created by combining store and date information with the predicted sales values returned by the predict() function. This makes daily store-level forecasts simple to analyse and visualise. The outcomes were meant to test the Random Forest algorithm's stability on unseen future data.

Two plots were made in order to visually assess the model’s predictions. The model’s capacity to track general sales trends was demonstrated in the first plot, which contrasted real and projected daily sales for every store. In the second plot, sales in August and September of several years were the main focus. In order to assess how effectively the model represented seasonal trends and promotional dynamics, it superimposed the 2015 predicted values on top of historical actual data from 2013 and 2014. The accuracy and realism of the random forest model’s predictions were intuitively revealed by these visualisations.

XGBoost for Final Sales Prediction

  Store       Date XGB_Predicted_Sales
1     1 2015-09-17            3813.935
2     3 2015-09-17            5791.308
3     7 2015-09-17            7889.774
4     8 2015-09-17            4521.000
5     9 2015-09-17            5641.348
6    10 2015-09-17            4788.364

To provide final sales projections for the test dataset, the XGBoost model was used in addition to Random Forest because of its strong performance in the earlier evaluation. Several important preprocessing steps were completed before prediction. As with Random Forest, all missing values in the engineered features Promo2Since and CompetitionOpenSince were imputed with -1 to ensure consistency and prevent errors during model matrix creation.

The test set’s categorical variables were then properly matched to the training data’s. To match the levels utilised during model training, factors like Promo, SchoolHoliday, and StateHoliday were explicitly converted. Additionally, in order to replicate the structure of the training data, DayOfWeek, which was numeric in the raw dataset, was converted into a factor with suitably labelled levels.

Following data alignment, the test set was transformed into a matrix format that the XGBoost algorithm could use. To guarantee that the matrices had the same dimensions and column ordering, any columns that were absent from the test matrix but present in the training matrix were added with zero values. In order to prevent mismatch mistakes during prediction, this step was essential.
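The column-alignment step can be illustrated with two small matrices. The column names below are hypothetical examples of what a one-hot encoded model matrix might contain.

```r
# Columns produced when encoding the training data (values irrelevant here).
train_mat <- matrix(
  0, nrow = 2, ncol = 4,
  dimnames = list(NULL, c("Promo1", "StateHolidaya", "StateHolidayb",
                          "CompetitionDistance"))
)

# The test period contains no state holidays, so those columns are missing.
test_mat <- matrix(
  c(1, 0, 570, 1270), nrow = 2,
  dimnames = list(NULL, c("Promo1", "CompetitionDistance"))
)

# Add every column that exists in training but not in test, filled with 0.
missing_cols <- setdiff(colnames(train_mat), colnames(test_mat))
padding <- matrix(
  0, nrow = nrow(test_mat), ncol = length(missing_cols),
  dimnames = list(NULL, missing_cols)
)

# Reorder so the test columns match the training order exactly.
test_aligned <- cbind(test_mat, padding)[, colnames(train_mat), drop = FALSE]
test_aligned
```

Selecting by `colnames(train_mat)` in the last step also drops any column that appears only in the test matrix, so the model never receives a feature it was not trained on.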

Once the data matrices were ready, the test matrix was converted into a DMatrix object, the format that XGBoost requires for prediction, and passed through the trained model to produce sales projections. A detailed results table was then produced by combining these predicted values with test set metadata (such as dates and store identifiers). The two resulting plots closely resemble those produced by the Random Forest model.

These sales forecasts leave retailers better equipped to predict demand and adjust their inventory levels. Applying such forecasting models at the product level could improve accuracy further and help prevent overstocking and understocking. These insights might also be helpful when launching new store locations, by providing a data-driven baseline for calculating the necessary stock levels from anticipated demand patterns.

The study ensured a more thorough examination of the advantages and disadvantages of ensemble-based forecasting models for retail sales prediction by incorporating this alternative model into the prediction pipeline.

Discussion

This study aimed to increase supply chain efficiency by employing sophisticated machine learning approaches to improve demand forecasting accuracy. This discussion addresses the shortcomings of traditional approaches, compares the performance of the forecasting models, and considers the influence of temporal and external factors. Lastly, it considers how the methodology can be used in practice to optimise the supply chain, especially when dealing with overstocking and understocking problems.

Which machine learning techniques among traditional and advanced models produce the most precise supply chain optimization demand forecasts?

The findings clearly show that advanced machine learning models perform noticeably better at predicting retail demand than traditional methods. The Random Forest model had the lowest Mean Absolute Percentage Error (MAPE = 7.54%) and Root Mean Square Error (RMSE = 741.38) of all the models that were examined. With an RMSE of 782.85 and a MAPE of 8.09%, XGBoost came in second. With an RMSE of 1438.39 and a MAPE of 16.45%, the Linear Regression model, which was utilised as the traditional baseline, fared noticeably worse.

These results demonstrate that sophisticated models, in particular ensemble tree-based algorithms, are better able to identify intricate, non-linear relationships in retail data. This is consistent with earlier research such as Lena et al. (2024), which found XGBoost to be better at handling seasonality and promotion effects. It also supports Feizabadi (2020), who suggested using tree-based models to capture regional and hierarchical sales patterns.

What are the main drawbacks of conventional forecasting models, and how may these be addressed by advanced models?

Traditional models, such as linear regression, usually make assumptions about homoscedasticity, predictor independence, and linearity, all of which are commonly broken in retail datasets. Interactions between variables, such as how the impact of promotions may change based on the day of the week or type of store, are difficult for these models to account for. Furthermore, big, feature-rich datasets are difficult for standard methods to scale (Udbhav et al., 2021).

On the other hand, advanced algorithms such as Random Forest and XGBoost are ensemble-based and non-parametric, which allows them to extract complex patterns from high-dimensional data. They don’t require a lot of preprocessing to deal with missing values, multicollinearity, and variable interactions. Additionally, they enable more flexible incorporation of seasonal effects, lag features, and categorical encodings, which makes them ideal for retail settings with variable demand and a variety of influencing factors. As a result, sophisticated models overcome the drawbacks of traditional methods by providing improved generalisation and increased accuracy, especially when working with vast and varied datasets like Rossmann’s.

What effects do outside variables (such as promotions, seasonality, and economic indicators) have on the accuracy of demand forecasting? How do seasonal trends affect inventory needs?

External factors, including promotions, state and school holidays, and the timing of competitor openings, were found to have a major impact on sales trends. The models were able to learn these effects and incorporate them into their forecasts by using calendar-based components (month, week, day, year) together with features such as Promo2Since and CompetitionOpenSince.

Sales were consistently higher during promotional periods. This is consistent with Meshram (2024), who highlights the volatility that promotions add to demand trends. Similarly, the final quarter of the year showed particularly strong seasonality and holiday effects, in line with established patterns in the retail sector.

These elements are essential for inventory planning as well as precise forecasting. Accurate demand forecasting guarantees that there is adequate stock on hand to satisfy client demands during periods of high demand or significant marketing efforts. On the other hand, the same models aid in minimising extra inventory during times of low demand.

What effects does enhanced demand forecasting have on supply chain effectiveness and inventory optimization?

Improved forecasting makes a more cost-effective and responsive supply chain possible. By minimising forecast errors, businesses can lower inventory holding costs (by avoiding overstocking) and stockout costs (by avoiding understocking). As a result, supply chain operations become more stable, operational waste is reduced, and customer satisfaction increases.

Compared with linear regression, the advanced models in this study reduced forecast error by roughly half: between about 46% and 54%, depending on the model and the metric. These gains may translate into better cash flow, more precise ordering decisions, and more efficient use of shelf and warehouse space.
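The size of the improvement over the linear baseline can be recomputed directly from the reported validation metrics. The sketch below uses the RMSE and MAPE values from the model comparison table; the data frame and column names are introduced here only for illustration.

```r
# Validation metrics as reported in the model comparison table.
results <- data.frame(
  Model = c("Linear Regression", "Random Forest", "XGBoost"),
  RMSE  = c(1438.3852, 741.3801, 782.8541),
  MAPE  = c(16.452824, 7.536129, 8.086746)
)

# Relative error reduction versus the linear baseline, in percent.
baseline <- results[results$Model == "Linear Regression", ]
results$RMSE_reduction_pct <- round(100 * (1 - results$RMSE / baseline$RMSE), 1)
results$MAPE_reduction_pct <- round(100 * (1 - results$MAPE / baseline$MAPE), 1)

results[results$Model != "Linear Regression",
        c("Model", "RMSE_reduction_pct", "MAPE_reduction_pct")]
# Random Forest: 48.5 (RMSE), 54.2 (MAPE)
# XGBoost:       45.6 (RMSE), 50.8 (MAPE)
```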

From a managerial standpoint, better forecast accuracy supports personnel planning, logistics scheduling, and procurement decisions. Additionally, it lays the groundwork for future uses that rely on accurate demand signals, including personalised promotions or dynamic pricing (Feizabadi, 2020).

How does the proposed methodology reduce overstocking and understocking in the retail supply chain?

The approach developed for this study directly addresses the main causes of inventory imbalance: inaccurate projections, a lack of store-level personalisation, and a disregard for external influences. By feeding rich engineered features (temporal, store-level, and promotion-based variables) into high-performing machine learning models, the forecasts became more precise and more detailed.

For instance, the model was able to learn past performance under comparable promotional or seasonal conditions thanks to features like MeanLogSalesByStore, MeanLogSalesByPromoInterval, and MeanLogSalesByStateHoliday. The effects of duration and cyclicality were further taken into consideration by time-based features such as month, WeekOfMonth, and Promo2Since.
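Calendar features of this kind can be derived from the date column alone. The sketch below is one possible construction: the ISO 8601 week number for `week` and the definition of `WeekOfMonth` as the 7-day block of the month (days 1 to 7 give 1, days 8 to 14 give 2, and so on) are assumptions, since the exact definitions used in the study are not stated.

```r
dates <- as.Date(c("2015-08-01", "2015-08-09", "2015-09-17"))
day_of_month <- as.integer(format(dates, "%d"))

features <- data.frame(
  Date  = dates,
  year  = as.integer(format(dates, "%Y")),
  month = as.integer(format(dates, "%m")),
  day   = day_of_month,
  week  = as.integer(format(dates, "%V")),       # ISO 8601 week (assumed)
  WeekOfMonth = (day_of_month - 1L) %/% 7L + 1L  # assumed definition
)
features
```

Features such as `month` and `week` let tree-based models isolate seasonal effects, while `WeekOfMonth` can pick up within-month patterns such as spending around paydays.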

Consequently, localised demand at each shop was better reflected in forecasts, enabling customised inventory planning. In the end, this improves product availability, lowers waste, and boosts overall supply chain responsiveness by decreasing overstocking in low-demand areas and understocking in high-demand ones.

How can store-level forecasting be optimized across different regions or store types to reflect local demand variation?

In order to answer this question, this study used characteristics like StoreType, Assortment, and PromoInterval that distinguish stores based on their structure and behaviour. These characteristics gave the model contextual information to allow for store-level forecast customisation, in addition to store-specific historical averages.

Large retailers like Rossmann, which operate in a variety of markets, require this store-level granularity. It ensures that regional variation, including local purchasing patterns, competitive dynamics, and demographic effects, is reflected in demand forecasts. By supporting data-driven replenishment plans, this granularity also lessens the need for rule-of-thumb heuristics or arbitrary thresholds (Price, 2020). Because sales trends were found to vary from store to store, the methods employed in this study can also be applied to individual stores to manage their particular demand closely.

What is the trade-off between model complexity and interpretability in a business context, and how should retailers approach this decision?

Although advanced algorithms such as Random Forest and XGBoost provide better accuracy, their interpretability may be limited by their "black-box" nature (Martins & Kayode, 2024). Conversely, linear regression offers transparency but performs poorly. This trade-off is a significant obstacle to data-driven decision-making, and retailers have to weigh it according to the context of the decision. For strategic planning, where stakeholder trust is crucial, it may be better to use simpler models or explainability techniques for complex models (such as SHAP values). For operational decisions, such as daily replenishment, accuracy may matter more (Mariana & Carlos, 2023). This study takes a hybrid approach: it uses interpretable features (such as Promo, StoreType, and aggregated log sales) within tree-based models and benchmarks them against a linear model. This is in line with contemporary best practice in retail analytics, since it permits performance improvements without giving up transparency entirely.

The results of this study validate how sophisticated machine learning models may revolutionise retail demand forecasting. The approach provides actionable insights for inventory optimisation in addition to increased forecast accuracy through careful feature engineering, model selection, and validation. Improved supply chain efficiency is directly supported by the model’s increased capacity to estimate demand at a fine level thanks to the utilisation of store-specific historical data, promotion tracking, and calendar variables. The approach eventually results in leaner operations, more profitability, and improved customer experience by decreasing overstocking and understocking.

Conclusion

This thesis used the Rossmann store dataset as a case study to investigate the relative efficacy of machine learning and traditional methods for retail demand forecasting. The results provide compelling evidence that sophisticated machine learning models, especially Random Forest and XGBoost, outperform conventional linear regression in terms of accurately forecasting daily retail sales. The Random Forest model had the lowest forecast error (RMSE: 741.38; MAPE: 7.54%), and XGBoost came in second. Both models significantly outperformed linear models.

One of the method's main advantages was its careful integration of temporal, categorical, and external features through feature engineering. Promotions, holidays, competition age, and store metadata were among the variables that added contextual richness to the models. Aggregated historical sales metrics provided a reliable reference point for capturing store-level tendencies, while time-based features such as month, week, and week-of-month enabled the models to identify seasonal and periodic sales patterns.

The study also shows how improved forecasting helps achieve operational goals: more accurate demand forecasts lead to better inventory decisions, reduce the risk of overstocking and understocking, and help ensure that customer needs are met more consistently. This alignment between predictive analytics and business performance is particularly useful in fast-moving retail environments where planning cycles are short and margins for error are small. By forecasting future sales patterns, businesses can proactively plan stock levels to satisfy customer needs while reducing two inefficiencies: overstocking, which ties up capital in excess inventory, and understocking, which results in missed sales opportunities.

Furthermore, although this study concentrated on aggregated sales, applying forecasting models at the individual product level could provide even more accuracy. With such detailed forecasts, retailers may optimise shelf space and improve customer satisfaction by tailoring stock replenishment to each item. Demand forecasting can also be helpful when making strategic choices such as opening new stores. When properly modelled, historical trends from existing stores can provide a solid foundation for projecting sales at new sites. This makes it possible to plan inventory more precisely and to be operationally ready from day one, which ultimately increases the supply chain's overall resilience and efficiency.

Additionally, the results emphasise the importance of model selection in balancing interpretability and performance. Although linear regression provides transparency, it is not powerful enough to handle the complexities of retail data. By contrast, tree-based models like Random Forest and XGBoost are more opaque but can reveal subtle associations that would otherwise go unnoticed. When explainability is required, techniques such as feature importance scores can help close the gap between interpretability and performance.

In summary, this study supports the use of ensemble-based machine learning techniques as effective instruments for supply chain planning and demand forecasting in the retail industry. It also lays the foundation for future research on probabilistic forecasting, real-time data integration, and hybrid models. To stay competitive in an increasingly volatile market, retailers who want to improve operational agility and responsiveness to customers should consider implementing these data-driven forecasting frameworks.