Forecasting Seasonality in Industrial Production: A Comparative Study of Statistical and Machine Learning Models

Author

AS

Abstract

This study evaluates classical and modern forecasting techniques for predicting the monthly industrial production of ice cream and frozen desserts in the U.S., using Federal Reserve data from 1972 to 2025. The analysis compares traditional time series models (ARIMA, ETS), machine learning models (XGBoost), and deep learning methods (Multilayer Perceptron, MLP), along with simple and weighted ensemble forecasts. Models were trained on pre-2015 data and tested on a 10-year holdout set (2015–2025) using RMSE, MAE, and MAPE as performance metrics.

Among individual models, the MLP achieved the lowest forecast errors by capturing nonlinear relationships and seasonal effects. XGBoost also performed well, benefiting from engineered features like lag values and calendar variables. ARIMA provided a reliable statistical baseline, whereas ETS underperformed due to difficulties modeling multiplicative seasonality. A weighted ensemble that blended ARIMA’s interpretability with the accuracy of MLP and XGBoost came close to the MLP’s accuracy while reducing reliance on any single model.

The findings highlight the benefits of combining classical interpretability with modern predictive power. In cyclical industries, ensemble machine learning models can significantly enhance forecast accuracy, supporting improved production planning, inventory control, and marketing decisions. While the results are promising, limitations include the use of a single holdout split, exclusion of external variables, and the black-box nature of neural networks. Future work should incorporate rolling-origin evaluation, additional data sources, and more advanced architectures to further improve robustness and business relevance.

I. Introduction

Nearly seven decades ago, two enduring challenges in sales forecasting were identified (Lorie 1957). The first is combining statistical analysis with subjective judgment. Executives rely on intuitive, market-based insights, while analysts build statistical models from historical or survey data. These groups often work in isolation—executives distrust models that miss business-specific nuances, and statisticians doubt unquantified opinions. The core issue is developing objective methods that integrate both perspectives to improve accuracy and efficiency. The second challenge is evaluating forecasts not only for statistical accuracy—measured by deviations from actual outcomes while considering variability—but also for economic usefulness, recognizing that some errors have higher decision-making or financial costs.

Sales forecasting research has three primary objectives. First, it aims to develop more accurate and reliable prediction methods, incorporating both traditional statistical techniques and emerging approaches such as Artificial Intelligence (AI) and Machine Learning (ML), with accuracy remaining the key criterion in method selection (Lawrence, O’Connor, and Edmundson 2000). This involves integrating quantitative models with qualitative insights from experienced practitioners, while also addressing factors that undermine accuracy—such as market volatility, data quality limitations, human judgment biases, and the inherent complexity of sales cycles. Second, it seeks to improve forecasting performance in dynamic or uncertain environments, including contexts shaped by new product launches, economic fluctuations, or disruptive events such as pandemics. Third, it examines the role of forecasts in supporting strategic and operational decisions, including planning, resource allocation, inventory management, and budgeting (Wacker and Lummus 2002).

The importance of sales forecasting lies in its ability to reduce uncertainty and enable organizations to respond effectively to future market conditions. By anticipating demand patterns, firms can mitigate the risks of overproduction, stockouts, or resource misallocation (Harahap et al. 2025). Accurate forecasts enhance supply chain efficiency, improve customer satisfaction, and contribute to more stable revenue streams (GhorbanTanhaei et al. 2024). At both the industry and macroeconomic levels, reliable forecasting facilitates coordinated production, supports supply chain sustainability, ensures smoother distribution, and strengthens resilience during periods of volatility or disruption (Rahman Mahin et al. 2025). Ultimately, sales forecasting serves not only as a managerial tool but also as a mechanism for sustaining competitiveness and promoting long-term growth in dynamic markets.

Data

In this paper, I examine the Industrial Production of Ice Cream and Frozen Desserts (IPN31152N¹) in the United States, using data from the Federal Reserve Economic Data (FRED) platform. The Federal Reserve’s monthly industrial production index, along with related capacity indexes, capacity utilization rates, and measures from the construction sector, accounts for much of the variation in national output over the business cycle. These measures provide valuable insight into structural developments in the economy.

The Industrial Production (IP) index specifically measures the real output of all relevant establishments located in the United States, regardless of ownership, but excludes those in U.S. territories. For the ice cream and frozen dessert category, the data are classified under Manufacturing: Nondurable Goods. While ice cream sales are intuitively seasonal, the seasonality of production is less obvious. I chose this time series because it connects to an intuitive, relatable product while also challenging assumptions about production patterns. This was one of the first datasets that made the concept of seasonality clear to me, making it a fitting example for exploring forecasting techniques. From a business perspective, this dataset is valuable for firms in the ice cream and frozen dessert industry—and in other cyclical industries—as it can be used to track production trends, assess market conditions, and inform decisions on capacity planning, supply chain management, and marketing strategy.

Industry Context: U.S. Ice Cream Market

In 2024, ice cream makers in the United States produced 1.31 billion gallons of product, generating an estimated $11.6 billion in economic impact. The industry supports over 26,700 direct jobs and contributes nearly $1.9 billion in wages, according to the International Dairy Foods Association (IDFA n.d.). Most manufacturers are long-standing family-owned businesses, with many in operation for over 50 years.

Production is highly seasonal, peaking between March and September, while demand is driven largely by families and local markets. Interestingly, while industry observers note a growing interest in plant-based and non-dairy options, consumer preferences remain anchored in traditional flavors and formats: vanilla, chocolate, and strawberry are the top three, and premium and regular ice cream account for 80% of the market.

From a forecasting standpoint, this industry offers a compelling case study: it has high seasonal variability, strong legacy production patterns, and an increasing influence of marketing, packaging innovation, and environmental factors. As the global ice cream market is projected to grow from $79.08 billion in 2024 to $132.32 billion by 2032, understanding production patterns and demand cycles will become even more critical for maintaining competitiveness and planning operations.

II. Literature Review

Forecasting sales and production in seasonal industries like ice cream and frozen desserts has been widely studied using both classical and modern approaches. Traditional time series models such as Winters’ method (Holt-Winters exponential smoothing) have proven effective in modeling seasonality and trend, delivering low forecasting errors in ice cream sales contexts (Trakroo 2021).

Beyond classical methods, hybrid and fuzzy logic-based models have shown potential. For instance, Firmansyah et al. (2021) combined Trend Moment and Fuzzy Tsukamoto Inference System techniques to achieve high accuracy in forecasting both sales (71%) and inventory (85%) for ice cream products. These hybrid methods are particularly promising for capturing nonlinearities and expert-based rules in data-scarce environments.

More broadly, Makridakis, Hibon, and Moser (1979) provided a foundational comparison of forecasting techniques—such as decomposition, regression, and econometric models—emphasizing that model performance is often context-dependent. A systematic review by Agostino et al. (2020) reinforced this point, highlighting the diversity of forecasting models applied across manufacturing domains and calling for greater alignment between methodological rigor and business relevance.

The emergence of big data has further motivated the use of machine learning and data mining techniques for sales forecasting. Umamaheswari, Raguraman, and Monica (2023) showed that machine learning approaches can enhance accuracy and uncover complex patterns in customer behavior, though they also noted the need for improved interpretability and model transparency. However, challenges persist in model selection, scalability, and the incorporation of domain knowledge, particularly when dealing with real-world uncertainty, irregular seasonality, and structural breaks.

Winklhofer, Diamantopoulos, and Witt (1996) also highlighted the underutilization of forecasting in strategic decision-making, noting that many organizations continue to rely heavily on judgmental methods despite advances in predictive analytics. This disconnect is echoed in Lawrence, O’Connor, and Edmundson (2000), who found that integrating expert input with statistical models often leads to better results—but that methodological integration remains inconsistent in practice. As models have grown more complex, there has been a notable evolution in deep learning architectures for time series forecasting. A recent survey by Kim et al. (2025) traces this shift from traditional recurrent and attention-based models to emerging hybrid, diffusion, and foundation models, expanding the methodological landscape and offering new perspectives on modeling complex temporal dependencies.

Despite the growing toolkit of forecasting methods, several gaps remain. Many studies overlook the potential of ensemble techniques that combine multiple models to balance bias and variance. Others fail to apply modern machine learning algorithms—such as gradient boosting or neural networks—to the industrial production index of ice cream and frozen desserts, enhanced with engineered features like lags and calendar variables. Moreover, few studies adopt robust evaluation strategies like rolling-origin validation or assess the economic consequences of forecast errors beyond conventional statistical metrics.

This project addresses several of these limitations. It evaluates both classical models (ETS, ARIMA) and modern machine learning approaches (XGBoost, MLP) on a seasonal industrial time series—monthly production of ice cream and frozen desserts in the U.S.—using feature engineering and a holdout validation framework. It further incorporates ensemble methods, both simple and weighted, to mitigate model-specific risks, and offers performance metrics aligned with practical business needs. By bridging traditional and contemporary forecasting approaches, this study contributes to both methodological advancement and improved decision-making in a cyclical, data-rich industry.

III. Methodology

Our data is aggregated at a monthly level from January 1972 to June 2025. There are no missing values or gaps in the time series.

We create features such as lag variables, rolling averages, and seasonal indicators. This can be confirmed by the number of observations and the mean of each variable in the table below, where lag_x denotes the x-month lag and roll_x the x-month rolling average.


Descriptive Statistics: Industrial Production (IP) – Ice Cream & Frozen Desserts
=================================================================================
Statistic              Min  Pctl(25) Median   Mean   Pctl(75)  Max   St. Dev.  N 
---------------------------------------------------------------------------------
IP Index              58.85  90.42   108.45  110.43   127.44  196.82  26.94   642
Lag (1 Month)         58.85  90.41   108.42  110.40   127.45  196.82  26.95   641
Lag (12 Mon)          58.85  90.33   108.34  110.42   127.54  196.82  27.15   630
Rolling Mean (3 Mon)  60.38  91.68   108.38  110.48   125.38  189.19  25.47   640
Rolling Mean (12 Mon) 78.32  98.87   109.77  110.69   122.49  156.90  19.19   631
Month Number            1      3       6      6.47      9       12     3.45   642
Year                  1,972  1,985   1,998  1,998.25  2,012   2,025   15.46   642
---------------------------------------------------------------------------------
Source: Federal Reserve Economic Data (FRED). IP Index, 2017=100.                


We do not incorporate any external data, such as holidays or macroeconomic indicators. This can be easily done in future iterations of the paper. Also, for model development, we use a simple train/test split (holdout) instead of our preferred rolling‑origin (walk‑forward) evaluation (expanding or sliding window; repeatedly refit and forecast forward) due to time and processing power constraints.

Data before January 2015 form the training set; data from January 2015 onward (inclusive) form the test set.
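
For concreteness, here is a minimal sketch of the feature construction and holdout split in R (assuming the FRED series has been loaded into a data frame df with columns date and ip; the derived variable names mirror the descriptive statistics table above):

library(dplyr)
library(zoo)

df <- df |>
  arrange(date) |>
  mutate(
    lag_1     = lag(ip, 1),                                    # previous month
    lag_12    = lag(ip, 12),                                   # same month last year
    roll_3    = rollmean(ip, 3, fill = NA, align = "right"),   # 3-month rolling mean
    roll_12   = rollmean(ip, 12, fill = NA, align = "right"),  # 12-month rolling mean
    month_num = as.integer(format(date, "%m")),                # seasonal indicator
    year      = as.integer(format(date, "%Y"))                 # long-term trend proxy
  )

# Holdout split: pre-2015 for training, January 2015 onward for testing
train <- df |> filter(date <  as.Date("2015-01-01"))
test  <- df |> filter(date >= as.Date("2015-01-01"))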

Models

A. Traditional Models

An ARIMA (AutoRegressive Integrated Moving Average) model is a classical time series forecasting method that captures three key components: autocorrelation, differencing, and past errors. The AR (autoregressive) part models the relationship between a variable and its past values; the I (integrated) component involves differencing the series to achieve stationarity (removing trends); and the MA (moving average) part captures the influence of past forecast errors. ARIMA is particularly useful when the data shows autocorrelation and non-stationary behavior but lacks strong seasonality. Seasonal extensions (SARIMA) are also available to handle repeating seasonal patterns. ARIMA models are widely used due to their solid statistical foundation and interpretability.

The ETS (Error, Trend, Seasonality) model is a family of exponential smoothing models used for time series forecasting. It decomposes a time series into three components: the error term, the trend, and seasonality. Each component can be additive or multiplicative, allowing the model to adapt to various data behaviors. ETS models update their forecasts recursively as new data becomes available, placing more weight on recent observations. They are especially effective for series with clear trend or seasonal patterns, and unlike ARIMA, ETS does not require the series to be stationary. ETS models are intuitive, data-driven, and perform well on many practical forecasting problems.

Modelling

We fit ARIMA and ETS models.
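
A minimal sketch of the fitting step with the fable package (assuming train_ts is a monthly tsibble holding the training portion of the series; automatic order selection is used, consistent with the “best model” results reported below):

library(fable)
library(tsibble)

fit <- train_ts |>
  model(
    ARIMA = ARIMA(IPN31152N),   # automatic (p,d,q)(P,D,Q)[12] search
    ETS   = ETS(IPN31152N)      # automatic error/trend/season selection
  )

fit |> select(ARIMA) |> report()   # coefficients, sigma^2, AIC/AICc/BIC
fit |> select(ETS)   |> report()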

ARIMA

Our best ARIMA model is an ARIMA(1,0,4)(0,1,1)[12] model. The ARIMA model includes one non-seasonal autoregressive term, four non-seasonal moving average terms, and one seasonal moving average term with a seasonal difference.

Series: IPN31152N 
Model: ARIMA(1,0,4)(0,1,1)[12] 

Coefficients:
         ar1      ma1      ma2     ma3      ma4     sma1
      0.9582  -0.1714  -0.0374  0.1981  -0.0934  -0.6878
s.e.  0.0158   0.0480   0.0454  0.0473   0.0490   0.0370

sigma^2 estimated as 16.63:  log likelihood=-1425.11
AIC=2864.22   AICc=2864.45   BIC=2893.78
# A mable: 1 x 1
                      ARIMA
                    <model>
1 <ARIMA(1,0,4)(0,1,1)[12]>

ETS

Our best ETS model is an ETS(M,A,M) model, with smoothing parameters \(\alpha = 0.616\), \(\beta = 0.0001\), and \(\gamma = 0.267\). The ETS model includes multiplicative errors, an additive trend, and multiplicative seasonality. This means forecast errors grow proportionally with the level of the series, the trend increases linearly over time, and seasonal effects scale with the level—becoming more pronounced as production rises. This formulation is well-suited for time series where both the trend and seasonal variation evolve over time in a level-dependent way.

Series: IPN31152N 
Model: ETS(M,A,M) 
  Smoothing parameters:
    alpha = 0.616484 
    beta  = 0.0001020388 
    gamma = 0.2665631 

  Initial states:
     l[0]      b[0]     s[0]     s[-1]     s[-2]    s[-3]    s[-4]    s[-5]
 81.20399 0.1836714 0.733484 0.7749967 0.9104114 1.083894 1.217576 1.262227
    s[-6]    s[-7]    s[-8]     s[-9]    s[-10]    s[-11]
 1.281969 1.078896 1.011344 0.9941792 0.8886318 0.7623903

  sigma^2:  0.0012

     AIC     AICc      BIC 
4573.886 4575.114 4646.069 
# A mable: 1 x 1
           ETS
       <model>
1 <ETS(M,A,M)>

Model fit summaries are below; ARIMA shows the better in-sample fit (lower AIC/BIC and higher log-likelihood indicate better in-sample fit). Note that AIC/BIC comparisons across ETS and ARIMA are only indicative here, since the two model classes are estimated on different likelihood scales.

Model Fit Statistics (ETS and ARIMA)
====================================================
Model     Sigma²       AIC       BIC  Log Likelihood
----------------------------------------------------
ETS       0.0012   4573.89   4646.07        -2269.94
ARIMA    16.63     2864.22   2893.78        -1425.11
----------------------------------------------------
Note: The ETS error variance is multiplicative (proportional to the series level) and is not directly comparable to ARIMA’s level-scale variance.

After fitting the models on the training data, we predict onto the test data and evaluate performance using RMSE, MAE, and MAPE. We first plot the fitted and predicted values, along with prediction intervals (PIs).
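
A sketch of the forecasting and evaluation step under the same assumptions (test_ts is a tsibble holding the Jan 2015 – Jun 2025 observations):

fc <- fit |> forecast(h = "10 years 6 months")   # Jan 2015 through Jun 2025

fc |> autoplot(train_ts, level = 95)             # history, forecasts, and PIs

accuracy(fc, test_ts)                            # RMSE, MAE, MAPE per model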

Plot

[Figure: In-sample fits and out-of-sample forecasts with prediction intervals, ARIMA and ETS]

Table: Forecast Error Metrics on Test Data

Then, we evaluate the performance metrics on the test data, which again show ARIMA performing better than ETS on all three (reference implementations of the metrics follow the list):

  1. RMSE (Root Mean Squared Error) penalizes larger errors more. It is sensitive to outliers, and is useful when large deviations are particularly costly.

  2. Mean Absolute Error (MAE) is the average of absolute forecast errors. It is straightforward and interpretable, insensitive to the direction of errors, and, unlike RMSE, does not disproportionately penalize large ones.

  3. Mean Absolute Percentage Error (MAPE) is the average absolute percentage error relative to actual values. It expresses errors in percentage terms, but is unstable when actuals ≈ 0.
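
For reference, the three metrics as plain R functions (actual and pred are equal-length numeric vectors; MAPE is reported in percent):

rmse <- function(actual, pred) sqrt(mean((actual - pred)^2))
mae  <- function(actual, pred) mean(abs(actual - pred))
mape <- function(actual, pred) mean(abs((actual - pred) / actual)) * 100  # unstable near actual = 0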

Forecast Accuracy on Test Data (Jan 2015 – June 2025)
=====================================================
Model                        RMSE      MAE     MAPE
-----------------------------------------------------
ARIMA(1,0,4)(0,1,1)[12]     8.855    7.233    6.752
ETS(M,A,M)                 14.261   11.754   10.738
-----------------------------------------------------

Apart from ARIMA and ETS, dynamic regression variants of either model (including one or more exogenous variables) can also be built. There are many cyclical and anti-cyclical macroeconomic variables we could include.

B. Machine Learning Models

Next, we move to machine learning models. We start with XGBoost (Extreme Gradient Boosting), which is a powerful machine learning algorithm based on decision trees, optimized for speed and performance. It builds models in a sequential, additive way to minimize prediction error. At each stage, we fit a new decision tree to the residuals (errors) of the current model, to correct its mistakes. This process is repeated iteratively, improving the model by learning from past errors, and is widely used for structured/tabular data forecasting and classification tasks.

Modelling

First, we print the xgboost model summary.

##### xgb.Booster
raw: 114.7 Kb 
call:
  xgb.train(params = params, data = dtrain, nrounds = nrounds, 
    watchlist = watchlist, verbose = verbose, print_every_n = print_every_n, 
    early_stopping_rounds = early_stopping_rounds, maximize = maximize, 
    save_period = save_period, save_name = save_name, xgb_model = xgb_model, 
    callbacks = callbacks, objective = "reg:squarederror", eta = 0.1, 
    max_depth = 3)
params (as set within xgb.train):
  objective = "reg:squarederror", eta = "0.1", max_depth = "3", validate_parameters = "TRUE"
xgb.attributes:
  niter
callbacks:
  cb.evaluation.log()
# of features: 7 
niter: 100
nfeatures : 7 
evaluation_log:
  iter train_rmse
 <num>      <num>
     1 102.887868
     2  92.810808
   ---        ---
    99   2.483033
   100   2.467742

The trained XGBoost regression model belongs to class xgb.Booster. It was trained using the xgb.train() function with 100 boosting iterations (niter = 100). The model uses 7 predictor variables, which include lag features, rolling averages, and seasonal indicators derived from the time series data. The training parameters include a learning rate (eta) of 0.1, a maximum tree depth of 3 (max_depth = 3), and a squared error loss function (objective = "reg:squarederror"), which is standard for continuous regression tasks. The model includes a callback for evaluation logging (cb.evaluation.log()), allowing access to the evaluation metrics (e.g., RMSE) recorded during each round of training.
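
For reference, a call of this form reproduces the configuration above (a sketch; X_train and y_train are illustrative names for the matrix of seven engineered predictors and the IP index label vector):

library(xgboost)

dtrain <- xgb.DMatrix(data = as.matrix(X_train), label = y_train)

bst <- xgb.train(
  params    = list(objective = "reg:squarederror", eta = 0.1, max_depth = 3),
  data      = dtrain,
  nrounds   = 100,
  watchlist = list(train = dtrain),   # logs train RMSE each round
  verbose   = 0
)

pred <- predict(bst, as.matrix(X_test))   # forecasts on the test features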

Plot

[Figure: XGBoost fitted values on the training data and forecasts on the test data]

Then, we plot the fitted values on the training data and the predicted values on the test data.

Table: Forecast Error Metrics on Test Data

Finally, we update our metrics table with the XGBoost model performance metrics and combine it with the existing metrics from the ARIMA and ETS models. XGBoost beats the traditional time series models on all three metrics by a wide margin.

Forecast Accuracy on Test Data (Jan 2015 – June 2025)
=====================================================
Model                        RMSE      MAE     MAPE
-----------------------------------------------------
ARIMA(1,0,4)(0,1,1)[12]     8.855    7.233    6.752
ETS(M,A,M)                 14.261   11.754   10.738
XGBoost                     5.832    4.904    4.602
-----------------------------------------------------
Note: XGBoost trained using lag, rolling mean, and calendar features.

Apart from XGBoost, LightGBM or CatBoost models can also be trained with the same pipeline.

C. Deep Learning Models

Deep learning models are a type of machine learning that use neural networks with multiple layers to learn patterns in data. Inspired by the human brain, they are especially useful for handling large and complex datasets.

A neural network typically has three types of layers: the input layer, which takes in features like lagged values or rolling averages; one or more hidden layers, where the model learns patterns by adjusting weights and biases; and an output layer, which makes the final prediction. The more hidden layers a model has, the “deeper” it is, allowing it to learn more complex relationships.

In our project, we used a simple neural network to fit the data and evaluate its performance. Specifically, we used a Multilayer Perceptron (MLP), which is the most basic type of neural network. It has fully connected layers and works well for structured data. While an MLP is one kind of neural network, the term “neural network” also includes more advanced types like CNNs (for image data), RNNs (for sequential data), and LSTMs and GRUs² (which are designed for time-dependent data).
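
A sketch of a fitting call consistent with the summary below (using the nnet package; X_train, y_train, the seed, and the iteration cap are illustrative, while size, decay, and the linear output match the reported network):

library(nnet)

set.seed(42)                         # illustrative; nnet weights are randomly initialized
mlp <- nnet(
  x      = as.matrix(X_train),
  y      = y_train,
  size   = 5,                        # one hidden layer with 5 neurons -> 7-5-1 network
  linout = TRUE,                     # linear output units for regression
  decay  = 0.001,                    # weight decay, as reported
  maxit  = 500                       # illustrative iteration cap
)

summary(mlp)                         # prints the weight table shown below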

a 7-5-1 network with 46 weights
options were - linear output units  decay=0.001
 b->h1 i1->h1 i2->h1 i3->h1 i4->h1 i5->h1 i6->h1 i7->h1 
  1.64  -0.86  -0.14   2.18   0.35  -0.39  -0.07  -0.08 
 b->h2 i1->h2 i2->h2 i3->h2 i4->h2 i5->h2 i6->h2 i7->h2 
 19.43  -1.46   0.24   4.36  -1.88  12.02   0.08   0.37 
 b->h3 i1->h3 i2->h3 i3->h3 i4->h3 i5->h3 i6->h3 i7->h3 
 -7.92  -2.66   3.89   8.63  -8.25  -4.03  -0.43  -0.53 
 b->h4 i1->h4 i2->h4 i3->h4 i4->h4 i5->h4 i6->h4 i7->h4 
 -1.49  -0.11  -0.06   0.78   0.02  -0.41   0.01   0.01 
 b->h5 i1->h5 i2->h5 i3->h5 i4->h5 i5->h5 i6->h5 i7->h5 
  3.75   6.45   6.92 -11.64  -1.15   2.06   2.53   2.59 
  b->o  h1->o  h2->o  h3->o  h4->o  h5->o 
 22.18  36.59  21.81  17.74 176.11   4.85 

The fitted MLP (Multilayer Perceptron) model is a 7-5-1 neural network, meaning it has 7 input features, 1 hidden layer with 5 neurons, and 1 output neuron for regression. The model includes a total of 46 weights, which are the parameters it learns during training. These include weights from inputs to hidden units, biases for hidden and output layers, and weights from the hidden layer to the output. The model uses linear output units, appropriate for continuous regression tasks, and a small weight decay of 0.001 to prevent overfitting.

The summary shows how each of the 7 input features connects to each of the 5 hidden neurons, along with a bias term (b->h). For example, the weights for i1->h1 to i7->h1 indicate the strength and direction of influence each input has on the first hidden neuron. Similarly, the final layer shows the weights from each hidden neuron (h1->o to h5->o) to the output, along with an output bias (b->o). Large positive weights (like 176.11 for h4->o or 36.59 for h1->o) suggest strong influence of those hidden units on the output. Altogether, this network captures nonlinear relationships in the data by transforming inputs through the hidden layer before producing a continuous output forecast.

Plot

[Figure: MLP fitted values on the training data and forecasts on the test data]

Table: Forecast Error Metrics on Test Data

Forecast Accuracy on Test Data (Jan 2015 – June 2025)
=====================================================
Model                        RMSE      MAE     MAPE
-----------------------------------------------------
ARIMA(1,0,4)(0,1,1)[12]     8.855    7.233    6.752
ETS(M,A,M)                 14.261   11.754   10.738
XGBoost                     5.832    4.904    4.602
MLP                         4.613    3.955    3.629
-----------------------------------------------------
Note: XGBoost and MLP trained using lag, rolling mean, and calendar features.

As expected, the neural network beats the traditional models and performs slightly better than the gradient-boosted trees. More complex neural networks like LSTMs or GRUs can be fitted in future work.

D. Ensemble Techniques

Individual models (ARIMA, ETS, XGBoost, MLP) capture different patterns (trend, seasonality, nonlinear relationships). Combining them reduces variance and bias, often resulting in better performance than any single model, especially when no one model consistently outperforms the others over time. The ensemble thus leverages the relative strengths of each model without relying on just one.

To improve predictive accuracy, we construct two ensemble models by combining the forecasts from four individual models: ARIMA, ETS, XGBoost, and MLP (Neural Network).

  1. Simple Average Ensemble: We compute the mean forecast across all four models at each time point. This method gives equal weight to each model and assumes that, on average, their errors may cancel out. It is straightforward, robust, and often performs surprisingly well.

  2. Weighted Average Ensemble: We also compute a weighted average of the forecasts, where the weights are determined by the inverse of each model’s RMSE on the test data. Models with lower RMSE (i.e., more accurate) receive higher weights. This method favors models that performed better historically, while still blending insights from multiple approaches. (Deriving weights from test-set RMSE introduces mild look-ahead bias; in practice, the weights would be estimated on a separate validation period.)

    Mathematically, for the weighted ensemble, the weight for model \(i\) is:

    \[w_i = \frac{1/RMSE_i}{\sum_j 1/RMSE_j} \]

Weights Used in Inverse-RMSE Weighted Ensemble
==============================================
Model        Weight
----------------------------------------------
ARIMA         0.196
ETS           0.122
XGBoost       0.298
MLP           0.384
----------------------------------------------
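
A sketch of the weight computation and the two combinations (RMSE values taken from the accuracy table; fc_arima, fc_ets, fc_xgb, and fc_mlp are illustrative names for the four aligned test-set forecast vectors):

rmse <- c(ARIMA = 8.855, ETS = 14.261, XGBoost = 5.832, MLP = 4.613)

w <- (1 / rmse) / sum(1 / rmse)   # inverse-RMSE weights, summing to 1
round(w, 3)                       # ~0.198, 0.123, 0.300, 0.379; small gaps vs. the
                                  # table come from rounding of the input RMSEs

ens_mean     <- (fc_arima + fc_ets + fc_xgb + fc_mlp) / 4
ens_weighted <- w["ARIMA"] * fc_arima + w["ETS"] * fc_ets +
                w["XGBoost"] * fc_xgb + w["MLP"] * fc_mlp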

Table: Forecast Error Metrics on Test Data

As expected, we put higher weight on the better-performing models. The weights on the worse-performing models are not zero, though, so they still contribute to the overall forecast.

Forecast Accuracy on Test Data (Jan 2015 – June 2025)
=====================================================
Model                        RMSE      MAE     MAPE
-----------------------------------------------------
ARIMA(1,0,4)(0,1,1)[12]     8.855    7.233    6.752
ETS(M,A,M)                 14.261   11.754   10.738
XGBoost                     5.832    4.904    4.602
MLP                         4.613    3.955    3.629
Ensemble (Mean)             6.405    5.201    4.855
Ensemble (Weighted)         5.113    4.182    3.900
-----------------------------------------------------
Note: XGBoost and MLP trained using lag, rolling mean, and calendar features. Ensemble models combine ARIMA, ETS, XGBoost, and MLP forecasts using simple or inverse-RMSE weighted averages.

The weighted ensemble performs better than the vanilla (simple average) ensemble.

E. GARCH Modeling

I am using monthly industrial production data, which is low frequency. The series shows strong seasonality and some trend, but no volatility clustering; the variance looks fairly stable (though amplitudes shift over decades). Thus, GARCH is not the optimal model here.

IV. Model Evaluation

The in-sample fits and out-of-sample forecasts, along with the forecast error metrics tables, are in the previous section.

To complement our forecast error metrics table on the test data, we also examine residuals—the difference between actual values and model predictions. Analyzing residuals helps us assess whether the model has captured the structure in the data or whether meaningful patterns remain unexplained.

For a good forecasting model, residuals should behave like white noise. That means:

  • They should have a mean close to zero

  • They should show no autocorrelation (i.e., no patterns or structure over time)

  • They should have constant variance (no fanning or clustering)

  • They should be normally distributed (at least approximately, for reliable inference)

  • For test residuals, we especially want to ensure there is no bias — if residuals consistently under- or over-predict in a certain range or time period, the model is misfitting

If these conditions hold in both the training and test residual plots, it suggests the model has done a good job and is not missing any obvious signals. Conversely, visible patterns, trends, or non-random structure in the residuals may indicate underfitting, model misspecification, or the presence of features that the model failed to account for.
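
One way to operationalize these checks for the fable models (a sketch; the lag choice of 24 is illustrative):

library(feasts)

fit |>
  augment() |>
  features(.innov, ljung_box, lag = 24)   # H0: residuals are white noise

fit |> select(ARIMA) |> gg_tsresiduals()  # time plot, ACF, and histogram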

Residual Charts

In the in-sample residuals, we observe that ARIMA and MLP maintain relatively tight bands around zero, suggesting effective fitting with minimal bias and no obvious patterns left unexplained. ETS shows slightly larger variability and potential autocorrelation, while XGBoost displays isolated spikes likely due to overfitting localized noise in the training data.

In contrast, the out-of-sample residuals highlight each model’s generalization ability. ARIMA continues to perform reasonably well but with some visible seasonal under- and over-prediction patterns. ETS displays more pronounced fluctuations, indicating possible misspecification. Both MLP and XGBoost produce tighter, less autocorrelated residuals on the test set, suggesting better adaptability to unseen data. Overall, residuals reinforce that ARIMA is a strong baseline, while machine learning models—especially XGBoost—appear to generalize better, capturing nonlinearities not modeled by ETS or ARIMA.

Variable Importance

Machine learning models like XGBoost evaluate the contribution of each feature to predictive performance during training. The feature importance results show that the 12-month lag (lag_12) and 3-month rolling average (roll_3) were the most influential variables in XGBoost’s forecasting decisions.
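
The scores behind such an importance plot can be extracted directly from the fitted booster (a sketch, with bst as in the training sketch above):

library(xgboost)

imp <- xgb.importance(model = bst)   # Gain, Cover, Frequency per feature
print(imp)
xgb.plot.importance(imp)             # bar chart of relative importance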

Similarly, the MLP variable importance chart suggests the value of the series 12 months ago (lag_12) is the most influential. This makes sense for seasonal data like monthly industrial production. The calendar month (month_num) is also very important, highlighting strong seasonal effects. Recent values (the 1-month lag lag_1 and the short rolling average roll_3) have moderate predictive power. The long-term trend (captured via year) has some relevance. The 12-month rolling average (roll_12) is the least influential in the neural network configuration.

Both the XGBoost and MLP models tell a consistent story in terms of feature relevance. lag_12, which captures the value from the same month in the previous year, emerges as the most important predictor in both models — reflecting the strong seasonality inherent in monthly industrial production data. On the other hand, roll_12, the 12-month rolling average, is the least important feature across both models.

This alignment across two fundamentally different modeling approaches — a tree-based ensemble and a neural network — strengthens our confidence in the dominant role of seasonal lag and the limited additional signal from long-term smoothing.

V. Results and Analysis

Overall, the MLP model performs the best individually, achieving the lowest RMSE, MAE, and MAPE, by effectively capturing both nonlinearity and seasonality. These patterns were partially missed by the simpler statistical models. The weighted ensemble model approaches the MLP’s accuracy by blending the robustness of ARIMA with the predictive power of MLP and XGBoost. ETS performs the worst, likely due to its rigid handling of seasonal components. ARIMA, though older, still provides a strong and interpretable baseline for time series forecasting.

Model-by-Model Evaluation: Strengths and Weaknesses

Model Summary Evaluation (Strengths and Weaknesses)

  • ARIMA(1,0,4)(0,1,1)[12]: Provides solid baseline performance (among classical models) with interpretable coefficients and effective seasonal modeling through SARIMA terms. Residuals show stability with small bias, though some under- and over-prediction exists. Misses nonlinear patterns and lags behind newer models in accuracy.

  • ETS(M,A,M): Performs worst overall. While capturing trend and seasonality, its multiplicative structure likely overfits. Residuals are volatile and autocorrelated—clear signs of misspecification. Poor fit to the seasonal scale.

  • XGBoost: Substantial gains over classical models due to its ability to model nonlinear interactions between lag, rolling average, and calendar features. Residuals are tighter, though there is mild overfitting. Top features include lag_12, roll_3, and month_num. Slightly complex to tune.

  • MLP (Neural Network): Best performer on all error metrics (RMSE, MAE, MAPE). Captures complex nonlinear relationships and generalizes well, with tight residuals and low autocorrelation. However, it functions as a black box and is harder to interpret. Feature importance aligns with XGBoost.

  • Simple Mean Ensemble: Improves over ARIMA and ETS by averaging predictions. Simple to implement but underperforms relative to top models like MLP and XGBoost due to equal weighting.

  • Weighted Ensemble (inverse-RMSE): Combines the strengths of MLP and XGBoost while reducing ETS influence. Achieves low overall forecast errors, second only to the MLP. However, it relies on careful weighting, reducing transparency.

VI. Discussion

The Multilayer Perceptron (MLP) outperformed classical models like ARIMA and ETS, as well as gradient boosting. Its ability to capture nonlinear patterns and interactions among lagged, rolling, and seasonal features contributed to superior accuracy and more stable residuals. While ARIMA provided a reasonable baseline and ETS struggled with multiplicative seasonality, both models showed higher residual variance and less adaptability to recent structural changes in the data. That said, the MLP’s performance depends on careful feature engineering, and it lacks the interpretability and transparency of traditional statistical models. Nonetheless, its results—combined with the robustness of the weighted ensemble—highlight the strength of modern machine learning approaches in forecasting complex time series.

Based on the results, businesses in the ice cream and frozen dessert industry should adopt machine learning models—particularly neural networks or ensemble approaches—for forecasting production and demand, as these models demonstrated superior accuracy and robustness to seasonality and structural changes. The strong performance of models using lagged and calendar-based features suggests that firms should systematically track and integrate historical patterns, such as seasonal peaks and rolling trends, into their planning processes. Additionally, adopting ensemble forecasts can help mitigate risk by combining the strengths of different modeling approaches. Businesses should also consider expanding their data infrastructure to incorporate external drivers (e.g., weather, consumer sentiment) that can further improve forecast reliability. Investing in these capabilities can enhance supply chain planning, reduce overproduction or stockouts, and support more responsive marketing and inventory strategies throughout the year.

While the project demonstrated strong forecasting performance—especially from machine learning models like the MLP—several limitations and areas for improvement remain. First, the analysis relied on a single holdout validation split; future work should adopt rolling-origin or time-series cross-validation to better assess out-of-sample stability. Second, the models used only lagged, rolling, and calendar features; incorporating exogenous variables such as temperature, macroeconomic indicators, or promotional data could enhance accuracy, particularly for models like ARIMAX or XGBoost. Third, while machine learning models performed well, they require careful tuning and lack the interpretability of classical models, making it harder to extract business insights. Lastly, the assumption of homoscedastic and normally distributed errors may not always hold in the future (even though it has held historically), for example due to structural breaks or increased volatility from tariffs or wars—suggesting potential value in exploring GARCH or regime-switching models in future work.

VII. Conclusion

This study demonstrates the power of modern machine learning approaches—particularly the Multilayer Perceptron (MLP) and ensemble methods—in improving sales forecast accuracy over classical time series models like ARIMA and ETS. By leveraging lagged, rolling, and seasonal features, these models captured complex patterns that traditional methods missed. While limitations remain—such as the need for better cross-validation and the integration of external drivers—this analysis provides a strong foundation for advancing forecasting practices in cyclical industries. Future research can build on this work by incorporating richer data sources, testing alternative neural architectures, and evaluating performance in real-time production settings.

Bibliography

Agostino, Icaro Romolo Sousa, Wesley Vieira Da Silva, Claudimar Pereira Da Veiga, and Adriano Mendonça Souza. 2020. “Forecasting Models in the Manufacturing Processes and Operations Management: Systematic Literature Review.” Journal of Forecasting 39 (7): 1043–56. https://doi.org/10.1002/for.2674.
International Dairy Foods Association (IDFA). n.d. “Ice Cream Sales & Trends.” Accessed August 15, 2025. https://www.idfa.org/ice-cream-sales-trends.
Firmansyah, Riko, Sukma Puspitorini, Pariyadi Pariyadi, and Tamrin Syah. 2021. “Sales and Stock Purchase Prediction System Using Trend Moment Method and FIS Tsukamoto.” Arcitech: Journal of Computer Science and Artificial Intelligence 1 (1): 15. https://doi.org/10.29240/arcitech.v1i1.3057.
GhorbanTanhaei, Hamed, Payam Boozary, Sogand Sheykhan, Maryam Rabiee, Farzam Rahmani, and Iman Hosseini. 2024. “Predictive Analytics in Customer Behavior: Anticipating Trends and Preferences.” Results in Control and Optimization 17 (December): 100462. https://doi.org/10.1016/j.rico.2024.100462.
Lorie, James H. 1957. “Two Important Problems in Sales Forecasting.” The Journal of Business 30 (3): 172–79. https://www.jstor.org/stable/2350723.
Harahap, Afif Zuhri Muhammad Khodri, Mohd Kamarul Irwan Abdul Rahim, Noor Malinjasari, Suzila Mat Salleh, and Rabiatul Adawiyah Ma’arof. 2025. “Enhancing the Inventory Management Through Demand Forecasting.” International Journal of Research and Innovation in Social Science IX (I): 2737–44. https://doi.org/10.47772/IJRISS.2025.9010221.
Kim, Jongseon, Hyungjoon Kim, HyunGi Kim, Dongjun Lee, and Sungroh Yoon. 2025. “A Comprehensive Survey of Deep Learning for Time Series Forecasting: Architectural Diversity and Open Challenges.” arXiv. https://doi.org/10.48550/arXiv.2411.05793.
Lawrence, Michael, Marcus O’Connor, and Bob Edmundson. 2000. “A Field Study of Sales Forecasting Accuracy and Processes.” European Journal of Operational Research 122 (1): 151–60. https://doi.org/10.1016/S0377-2217(99)00085-5.
Umamaheswari M., Raguraman B., and Monica A. 2023. “Design Thinking Approaches to Industrial Sales Forecasting with Advanced Data Analytical Tool.” In 2023 International Conference on Self Sustainable Artificial Intelligence Systems (ICSSAS), 938–43. Erode, India: IEEE. https://doi.org/10.1109/ICSSAS57918.2023.10331725.
Makridakis, Spyros, Michele Hibon, and Claus Moser. 1979. “Accuracy of Forecasting: An Empirical Investigation.” Journal of the Royal Statistical Society. Series A (General) 142 (2): 97. https://doi.org/10.2307/2345077.
Rahman Mahin, Md. Parvezur, Munem Shahriar, Ritu Rani Das, Anuradha Roy, and Ahmed Wasif Reza. 2025. “Enhancing Sustainable Supply Chain Forecasting Using Machine Learning for Sales Prediction.” Procedia Computer Science 252: 470–79. https://doi.org/10.1016/j.procs.2025.01.006.
Trakroo, Divyanshi. 2021. “Forecasting Seasonal Products: A Case of Ice-Cream Sales.” In Advances in Business and Management Forecasting, edited by Kenneth D. Lawrence and Ronald K. Klimberg, 29–34. Emerald Publishing Limited. https://doi.org/10.1108/S1477-407020210000014003.
Wacker, John G., and Rhonda R. Lummus. 2002. “Sales Forecasting for Strategic Resource Planning.” International Journal of Operations & Production Management 22 (9): 1014–31. https://doi.org/10.1108/01443570210440519.
Winklhofer, Heidi, Adamantios Diamantopoulos, and Stephen F. Witt. 1996. “Forecasting Practice: A Review of the Empirical Literature and an Agenda for Future Research.” International Journal of Forecasting 12 (2): 193–221. https://doi.org/10.1016/0169-2070(95)00647-8.

Footnotes

  1. Board of Governors of the Federal Reserve System (US), Industrial Production: Manufacturing: Nondurable Goods: Ice Cream and Frozen Dessert (NAICS = 31152) [IPN31152N], retrieved from FRED, Federal Reserve Bank of St. Louis; https://fred.stlouisfed.org/series/IPN31152N, August 15, 2025.

  2. GRUs, or Gated Recurrent Units, are a simpler and faster alternative to LSTMs, capable of capturing both short-term and long-term dependencies in time series without requiring as many parameters. They are particularly useful in time series forecasting when computational efficiency is important.