Forecasting Final Project

Author

Tin Vu

#loading packages and API key
library(fpp3)
library(fredr)
library(ggplot2)
library(patchwork)
library(ggtime)
library(kableExtra)
library(knitr)
library(xgboost)
library(nnet)
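# fredr requires a FRED API key before any series can be downloaded.
# One common approach (an assumption here; the original key setup is not shown)
# is to store the key in an environment variable:
fredr_set_key(Sys.getenv("FRED_API_KEY"))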

Abstract

This project forecasts U.S. new defense aircraft and parts orders (UDAPNO) using monthly data from 1995 onward. The goal is to evaluate whether traditional time series models, dynamic regression models, machine learning models, neural network models, and ensembles can produce meaningful short-run forecasts for a defense-related industry that is affected by seasonality, procurement cycles, and irregular contract-driven spikes.

The project compares several forecasting approaches, including NAIVE, SNAIVE, ETS, TSLM, SARIMA, dynamic regression with ARIMA errors, XGBoost, MLP, NNETAR, and multiple ensemble models. External variables such as national security uncertainty, non-defense aircraft orders, and defense unfilled orders serve as contextual predictors, and several models also use lagged versions of these variables.

The results show that the broad ensemble performs best overall, with both the lowest RMSE and the lowest MAE. This suggests that combining forecasts from different models helps minimize dollar-based errors. XGBoost is the strongest individual model by RMSE, suggesting that lag variables, rolling averages, seasonal indicators, trend, and external regressors help reduce forecast errors. The ETS model remains the strongest traditional individual model since it has the lowest MAPE, so it is quite effective at minimizing percentage-based errors.

Overall, the results show that defense aircraft and parts orders are forecastable to a meaningful degree, although all of the models struggle to predict sudden shocks. The strongest results come from combining multiple models, so forecast averaging appears to improve robustness for this series when the goal is to reduce large dollar forecast errors.

Introduction

Motivation and real world relevance

Defense aircraft and parts production is an important component of the U.S. defense industrial base. A challenge for both policy makers and firms, however, is that procurement demand is not smooth: it moves through the budget process and can jump during rearmament cycles, modernization programs, or periods of elevated geopolitical risk. This makes defense aerospace orders an interesting but challenging forecasting problem.

In this setting, an accurate short-horizon forecast of defense aerospace orders could be valuable since it may assist prime contractors and key suppliers with capacity planning, inventory and long-lead-time procurement, and backlog management.

It is also important to note that war or geopolitical tension can require many types of defense production beyond aircraft, including ships, vehicles, munitions, electronics, and other military equipment. Therefore, defense aircraft and parts orders should be viewed as one important piece of the broader defense-demand environment rather than a complete measure of total defense needs.

Forecasting monthly defense aircraft and parts orders is still useful because this series gives insight into demand conditions within the defense aerospace sector. Even if sudden contract awards are difficult to predict exactly, forecasting can still provide a useful guide to the general direction and scale of expected orders.

This project uses the Manufacturers’ New Orders: Defense Aircraft and Parts series as the main forecasting target. Because the series contains both regular seasonal patterns and irregular spikes, I compare several types of models. These include traditional time-series models, dynamic regression models with external predictors, machine learning models, neural network models, and ensemble forecasts.

Research objective

The purpose of this project is to understand how well time series forecasting methods can predict monthly U.S. new defense aircraft and parts orders. The main outcome variable is orders_def, which measures Manufacturers’ New Orders: Defense Aircraft and Parts in millions of dollars.

I compare several different types of models in this project. First, I estimate simple and traditional time series models: NAIVE, SNAIVE, ETS, TSLM, and SARIMA. These models are helpful for determining whether the series can be forecasted using just the trend, seasonal patterns, and autocorrelation structure. Next are the dynamic regression models with ARIMA errors (ARIMAX), which use lagged external predictors such as national security uncertainty, non-defense aircraft orders, and defense unfilled orders. After those come the machine learning and neural network models: XGBoost, MLP, and NNETAR. These more advanced models are used to determine whether non-linear models paired with additional engineered features improve forecast accuracy. Last are the ensemble models, which average predictions from multiple specified models.

To test accuracy, each model is evaluated using an 80/20 train-test split, and forecast accuracy is compared using RMSE, MAE, and MAPE. RMSE and MAE work well here because the series is measured in millions of dollars, so large errors on those metrics have practical significance, especially when it comes to planning and timing new defense aircraft and parts orders. MAPE is also included to compare percentage-based forecast accuracy.

More broadly, this project asks which modeling approaches are suitable for a defense-related industry shaped by seasonality, procurement timing, and sudden shocks in contracting activity. A more practical question is whether these forecasts can provide a useful short-run guide for real-world decisions such as production planning, supplier coordination, backlog management, inventory timing, and labor allocation within the defense aerospace industry. Even imperfect forecasts may still be valuable for identifying the general direction and scale of expected orders, which helps firms and analysts make more informed planning decisions.

Literature Review

Forecasting defense aircraft and parts orders accurately requires a basic understanding of both the defense demand setting and the forecasting methods used in the project, because defense-related spending is often shaped by policy decisions, military needs, and varying procurement timing. Ramey (2011) emphasizes that changes in military spending are often anticipated before they appear in actual spending data, so defense-related orders may respond more to expectations and policy timing than to past order values alone. Nekarda and Ramey (2010) show that government spending can have industry-level effects, so defense aircraft and parts orders can be seen as a specific industry demand series rather than only part of total government spending. This project uses national security uncertainty as one external predictor; Baker, Bloom, and Davis' economic policy uncertainty index is useful here because it provides a way to measure uncertainty related to national security conditions.

Turning to the models used in the project, the first few are traditional time series forecasting methods that provide a standard baseline. Hyndman and Athanasopoulos (2021) describe a forecasting workflow that begins with data visualization, then moves through decomposition, model fitting, and accuracy evaluation; those are the steps followed in this project to get consistent and replicable results. An important model is ETS, which models error, trend, and seasonality through exponential smoothing, while ARIMA/SARIMA models help capture autocorrelation, differencing, and seasonal shock behavior. Dynamic regression with ARIMA errors adds another layer by combining external predictors with an ARIMA structure for the errors.

Machine learning and neural network methods also matter here, since defense aircraft and parts orders most likely contain non-linear relationships that simpler models cannot fully capture. Chen and Guestrin (2016) describe XGBoost as a scalable tree-boosting method able to model complex relationships across many predictors, so including lag variables, rolling averages, trend, seasonal indicators, and external variables in the XGBoost model is a natural approach. Neural networks can also be used for time series forecasting since they model non-linear patterns as well, but they typically require a lot of training data to perform well. Zhang (2003) argues that ARIMA and neural network models capture different types of structure: ARIMA captures more of the linear patterns while neural networks focus more on the non-linear ones.

Lastly, forecast ensembles are important because this project finds that the broad ensemble performs best overall. Bates and Granger (1969) show that combining forecasts can reduce forecast errors relative to any single model. Some of the stronger approaches here are combinations of multiple methods, since each model covers different aspects of the time series structure. Ensemble forecasts are therefore especially relevant for a series like defense aircraft and parts orders, which has both a predictable seasonal structure and sudden procurement shocks.

Methodology

Data preparation

This project uses data from the Federal Reserve Economic Data (FRED) database, retrieved with the fredr package and converted into a tsibble with yearmonth() as the time index. The main series is Manufacturers' New Orders: Defense Aircraft and Parts (UDAPNO). I start the series in January 1995 to focus on more recent history while still capturing important shocks. Besides the main series, Economic Policy Uncertainty: National Security (EPUNATSEC), Manufacturers' New Orders: Nondefense Aircraft and Parts (UNAPNO), Manufacturers' Unfilled Orders: Nondefense Aircraft and Parts (UNAPUO), and Manufacturers' Unfilled Orders: Defense Aircraft and Parts (UDAPUO) help support the interpretation of cyclical U.S. defense spending over time.

Manufacturers' New Orders: Defense Aircraft and Parts (UDAPNO), stored here as def_orders, is a monthly, non-seasonally adjusted (NSA) series containing the value of new defense aircraft and parts orders. Using the NSA form matters because the SNAIVE model relies on the raw seasonal pattern, which is not present in the seasonally adjusted version; it also lets the other models capture the full seasonal pattern over time. The remaining series are used as external regressors in some forecasting models and help provide economic context by comparing defense orders with broader aerospace demand, backlog conditions, and national security uncertainty.

#Manufacturers' New Orders: Defense Aircraft and Parts
UDAPNO_raw <- fredr(series_id = "UDAPNO", 
      observation_start = as.Date("1995-01-01")
) 

def_orders <- UDAPNO_raw |>
  transmute(Month = yearmonth(date), 
            orders_def = value) |>
  as_tsibble(index = Month)

#Economic Policy Uncertainty Index: Categorical Index: National security
EPUNATSEC_raw <- fredr(series_id = "EPUNATSEC", 
      observation_start = as.Date("1995-01-01")
) 

natsec_epu <- EPUNATSEC_raw |>
  transmute(Month = yearmonth(date), 
            national_security_epu = value) |>
  as_tsibble(index = Month)

#Manufacturers' New Orders: Nondefense Aircraft and Parts
UNAPNO_raw <- fredr(series_id = "UNAPNO", 
      observation_start = as.Date("1995-01-01")
) 

nondef_orders <- UNAPNO_raw |>
  transmute(Month = yearmonth(date), 
            orders_nondef = value) |>
  as_tsibble(index = Month)

#Manufacturers' Unfilled Orders: Nondefense Aircraft and Parts
UNAPUO_raw <- fredr(series_id = "UNAPUO", 
      observation_start = as.Date("1995-01-01")
) 

nondef_unfilled_orders <- UNAPUO_raw |>
  transmute(Month = yearmonth(date), 
            orders_nondef_unfilled = value) |>
  as_tsibble(index = Month)

#Manufacturers' Unfilled Orders: Defense Aircraft and Parts
UDAPUO_raw <- fredr(series_id = "UDAPUO", 
      observation_start = as.Date("1995-01-01")
) 

def_unfilled_orders <- UDAPUO_raw |>
  transmute(Month = yearmonth(date), 
            orders_def_unfilled = value) |>
  as_tsibble(index = Month)


#main data set 
#https://fred.stlouisfed.org/series/UDAPNO

#extra data sets
#https://fred.stlouisfed.org/series/EPUNATSEC
#https://fred.stlouisfed.org/series/UNAPNO
#https://fred.stlouisfed.org/series/UNAPUO
#https://fred.stlouisfed.org/series/UDAPUO

Plots

#defense orders
def_orders_plot <- autoplot(def_orders, orders_def) +
  labs(
    title = "Manufacturers' New Orders: Defense Aircraft and Parts",
    y = "Millions of dollars",
    x = "Year"
) + theme_minimal()

#extra plots
natsec_epu_plots <- autoplot(natsec_epu, national_security_epu) +
  labs(
    title = "Economic Policy Uncertainty Index: Categorical Index: National security",
    y = "Index",
    x = "Year"
) + theme_minimal()

nondef_orders_plots <- autoplot(nondef_orders, orders_nondef) +
  labs(
    title = "Manufacturers' New Orders: Nondefense Aircraft and Parts",
    y = "Millions of dollars",
    x = "Year"
) + theme_minimal()

nondef_unfilled_orders_plots <- autoplot(nondef_unfilled_orders, orders_nondef_unfilled) +
  labs(
    title = "Manufacturers' Unfilled Orders: Nondefense Aircraft and Parts",
    y = "Millions of dollars",
    x = "Year"
) + theme_minimal()

def_unfilled_orders_plots <- autoplot(def_unfilled_orders, orders_def_unfilled) +
  labs(
    title = "Manufacturers' Unfilled Orders: Defense Aircraft and Parts",
    y = "Millions of dollars",
    x = "Year"   
) + theme_minimal()
def_orders_plot / nondef_orders_plots / def_unfilled_orders_plots

nondef_orders_plots / nondef_unfilled_orders_plots

The first plot shows the main forecasting series, Manufacturers’ New Orders: Defense Aircraft and Parts (orders_def). The series is measured in millions of dollars and shows both repeated seasonal movement and large irregular spikes. These spikes are important because they suggest that defense aircraft orders may be affected by contract timing, procurement cycles, and occasional large awards.

The additional plots provide economic context rather than direct forecasting evidence at this stage. National security uncertainty is included as a broad measure of the defense-related policy environment, but it should not be interpreted as a direct one-to-one driver of aircraft orders. War and geopolitical tension can require many forms of defense production beyond aircraft, including ships, vehicles, electronics, munitions, and other equipment. Nondefense aircraft orders and defense unfilled orders are included to compare the main series with broader aerospace demand and backlog conditions.

STL decomposition

An STL decomposition of the defense aircraft and parts orders series separates the trend, seasonal, and remainder components. The plot shows a long-run trend and a very noticeable seasonal pattern. The remainder component contains some large irregular movements, which suggests that defense aircraft and parts orders are not driven only by smooth seasonal behavior; the series appears to be affected by sudden shifts, likely from procurement timing and large contract awards. The STL plot suggests that models able to handle both seasonality and changing patterns over time should be effective, so models like SNAIVE, ETS, and TSLM are reasonable choices for this series.
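For reference, STL estimates an additive decomposition of the series (the standard formulation, not specific to this project):

$$y_t = T_t + S_t + R_t$$

where $T_t$ is the trend-cycle, $S_t$ the seasonal component, and $R_t$ the remainder at month $t$.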

STL_def_orders <- def_orders |>
  model(stl = STL(orders_def ~ season(window = "periodic"), robust = TRUE)) |>
  components()

autoplot(STL_def_orders) + labs(
  title = "Defense Orders: STL decomposition",
  x= "Year")

ACF/PACF

The ACF and PACF plots are used to examine the time dependence in orders_def. The STL decomposition shows the trend, seasonal pattern, and irregular remainder visually, while the ACF and PACF help identify whether auto-regressive or moving-average behavior may be present in the series.

The ACF shows how current defense aircraft and parts orders are correlated with past values of the series. Significant spikes in the ACF suggest that past shocks or unusual order movements may continue to affect later months. This is especially relevant for this project because defense aircraft orders can experience sudden contract-driven spikes. If those shocks influence nearby months, then a moving-average component may be useful in a later ARIMA model.

The PACF shows the relationship between current orders and a specific lag after controlling for the shorter lags in between. Significant PACF spikes suggest possible auto-regressive terms, meaning current defense orders may depend directly on recent past orders. So the PACF helps identify candidate AR terms, while the ACF helps identify candidate MA terms and shock behavior.
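As a reminder of what the ACF measures, the sample autocorrelation at lag $k$ (standard definition) is

$$r_k = \frac{\sum_{t=k+1}^{T}(y_t - \bar{y})(y_{t-k} - \bar{y})}{\sum_{t=1}^{T}(y_t - \bar{y})^2},$$

and the PACF at lag $k$ is the correlation that remains after removing the effect of lags $1$ through $k-1$.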

The repeated structure around seasonal lags also supports the earlier evidence from the STL decomposition that the series has a monthly seasonal pattern. This gives us a good reason to attempt applying seasonal forecasting models such as ETS with seasonality, SNAIVE, TSLM with seasonal indicators, or seasonal ARIMA.

def_orders |>
  gg_tsdisplay(orders_def, plot_type = "partial", lag = 36) +
  labs(
    title = "Defense Aircraft and Parts Orders: ACF/PACF"
)

Data train/test split

I divide the defense aircraft and parts orders series into an 80/20 train-test split: the first 80% of the observations are used to train the models and the last 20% are used to test them. This split matters because it shows how well the models perform against values they have not seen. The training split is used to estimate the forecasting models, while the test split is used to compare forecasting accuracy. This gives a more realistic measure of each model's performance than in-sample fit alone.

#80/20 split
n <- nrow(def_orders)
n_train <- floor(0.8 * n)

train <- def_orders |> slice_head(n = n_train)
test  <- def_orders |> slice_tail(n = n - n_train)

h <- nrow(test)

Forecasting Models

After examining the defense aircraft and parts orders series with the plots, STL decomposition, and ACF/PACF, several different forecasting models are used to capture different aspects of the series. The goal is to compare traditional time series models, dynamic regression models, machine learning models, neural network models, and ensemble models to determine which perform best.

The first models are the simplest and most traditional: NAIVE and SNAIVE. NAIVE repeats the most recent observation and SNAIVE repeats the value from the same month in the previous year. These models provide a baseline for determining whether more advanced models are needed and whether they actually explain more of the series.

Next are the ETS, TSLM, and SARIMA models. ETS captures level and seasonal behavior through exponential smoothing. TSLM uses a time trend and monthly seasonal indicators, which makes it more interpretable but less flexible. SARIMA is included to capture autocorrelation, differencing, and seasonal shock behavior.

I also estimate ARIMAX models, which are dynamic regression models with ARIMA errors. These models utilize lagged external predictors such as national security uncertainty, non-defense aircraft orders, and defense unfilled orders. This tests whether outside information improves the forecast beyond the past values of new defense aircraft and parts orders.

Lastly, XGBoost, MLP, and NNETAR models are estimated, along with several ensemble forecasts. XGBoost uses lag variables, rolling averages, seasonal indicators, trend, and external predictors to capture non-linear patterns. MLP and NNETAR serve as neural network benchmarks to determine whether they can outperform the simpler models. The ensemble models average forecasts from multiple models to determine whether combining different model types improves robustness for this volatile series.

Fitting the ETS Model

The automatically selected ETS specification for the defense aircraft and parts orders series is ETS(A,N,A): additive errors, no trend, and additive seasonality. This specification is reasonable for orders_def since the STL decomposition showed a strong seasonal structure, while the long-run pattern is not smooth enough to justify a fixed trend component.

ETS is a strong choice because defense aircraft orders shift frequently over time and do not follow a perfectly smooth pattern, so a more flexible model should forecast more accurately than a simple benchmark.

The additive error component means that forecast errors are measured in the original units of the data, which are millions of dollars. The no-trend component means the model does not impose a persistent upward or downward trend. While the additive seasonal component means the model adds a recurring monthly seasonal effect.

The smoothing parameters also help explain how the model responds to new information. The low value of alpha means that the level updates slowly, so the model does not overreact to one-month contract spikes. The very small value of gamma means that the seasonal pattern is almost fixed over time.

This makes sense for the defense aircraft and parts orders series because the monthly seasonal pattern seems to be relatively stable and many of the largest movements come from irregular shocks rather than changing seasonality.
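Written out, the ETS(A,N,A) model maintains a level and a seasonal component that update with each one-step forecast error (the standard form from Hyndman and Athanasopoulos 2021):

$$y_t = \ell_{t-1} + s_{t-m} + \varepsilon_t, \qquad \ell_t = \ell_{t-1} + \alpha\varepsilon_t, \qquad s_t = s_{t-m} + \gamma\varepsilon_t$$

with $m = 12$ for monthly data. The small estimated $\alpha = 0.087$ and $\gamma = 0.0001$ correspond to a slowly updating level and a nearly fixed seasonal pattern.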

#Fitting models
fit_models <- train |>
  model(
    NAIVE = NAIVE(orders_def),
    SNAIVE = SNAIVE(orders_def),
    ETS = ETS(orders_def),
    TSLM = TSLM(orders_def ~ trend() + season())
)

fc_plot <- fit_models |> forecast(new_data = test)

ets_param_tbl <- tibble(
  Item = c("ETS model", "alpha", "gamma"),
  Value = c("ETS(A,N,A)", "0.087", "0.0001")
)

ets_param_tbl |>
  kable(caption = "ETS Auto-Selected Model and Smoothing Parameters")
ETS Auto-Selected Model and Smoothing Parameters

Item        Value
ETS model   ETS(A,N,A)
alpha       0.087
gamma       0.0001

Forecasting Accuracy and Ensemble Comparison

After fitting the models and generating forecasts over the test split, I compare their performance using RMSE, MAE, and MAPE, which show how close each model's forecasts are to the actual values. RMSE (root mean squared error) measures the average size of the forecast errors but applies more weight to large mistakes, so it is useful when large misses are a bigger concern than small ones. RMSE matters for defense aircraft and parts orders because sharp spikes occur due to sudden contracts, and a model that misses those spikes is penalized more. MAE (mean absolute error) measures the average absolute difference between forecasted and actual values; mistakes are treated more evenly, and it is typically more interpretable since it stays in the original units of the data, millions of dollars in this case. MAPE (mean absolute percentage error) measures the error as a percentage of the actual value, which gives a more relative sense of forecast performance when comparing across models. Its main downside is that percentage errors can be unstable when actual values are very low and the data contain large swings.
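For a test set of $h$ months with actual values $y_t$ and point forecasts $\hat{y}_t$, the three measures are defined as:

$$\text{RMSE} = \sqrt{\frac{1}{h}\sum_{t=1}^{h}(y_t - \hat{y}_t)^2}, \qquad \text{MAE} = \frac{1}{h}\sum_{t=1}^{h}\lvert y_t - \hat{y}_t\rvert, \qquad \text{MAPE} = \frac{100}{h}\sum_{t=1}^{h}\left\lvert\frac{y_t - \hat{y}_t}{y_t}\right\rvert$$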

The results show that the ETS model performed best among the four individual models, with the lowest RMSE at 1353.8, lowest MAE at 933.1, and lowest MAPE at 17.5. The next best model was TSLM, followed by SNAIVE, which was significantly better than NAIVE. This indicates that seasonality is an important aspect of the series, as was already evident in the STL decomposition. More flexible models like ETS and TSLM handle the seasonal structure and irregular contract-driven movement in the series better than SNAIVE and NAIVE.

accuracy_tbl <- fc_plot |>
  accuracy(test, measures = list(RMSE = RMSE, MAE = MAE, MAPE = MAPE)) |>
  select(.model, RMSE, MAE, MAPE) |>
  arrange(RMSE)

ensemble <- fc_plot |>
  as_tibble() |>
  filter(.model %in% c("ETS", "SNAIVE", "TSLM")) |>
  group_by(Month) |>
  summarise(.mean = mean(.mean), .groups = "drop")

ensemble_ts <- ensemble |>
  as_tsibble(index = Month)

ensemble_accuracy_tbl <- test |>
  as_tibble() |>
  left_join(ensemble, by = "Month") |>
  summarise(
    .model = "Ensemble",
    RMSE = sqrt(mean((orders_def - .mean)^2)),
    MAE  = mean(abs(orders_def - .mean)),
    MAPE = mean(abs((orders_def - .mean) / orders_def)) * 100
)

accuracy_all <- bind_rows(
  accuracy_tbl,
  ensemble_accuracy_tbl
) |>
  arrange(RMSE)

accuracy_all |>
  kable(
    digits = 1,
    caption = "Forecast Accuracy Comparison"
  )
Forecast Accuracy Comparison

.model     RMSE    MAE     MAPE
Ensemble   1297.8  957.2   19.2
ETS        1353.8  933.1   17.5
TSLM       1376.8  1144.5  25.9
SNAIVE     1850.0  1544.1  31.6
NAIVE      2866.4  2627.8  63.3

Ensemble

Averaging the point forecasts from ETS, SNAIVE, and TSLM produces an ensemble forecast. The purpose is to check whether combining several reasonable models improves forecasting performance, since each model captures slightly different aspects of the series: ETS is the most flexible and adaptive, SNAIVE emphasizes the repeating seasonal pattern, and TSLM captures trend and seasonality in a more structured regression form. By grouping the forecasts by month and taking the mean forecast for each month, the ensemble can dampen the effect of any single model making an inaccurate forecast in a given month.

Compared to the other forecasting models, the ensemble performed very well: it had the lowest RMSE at 1297.8, the second lowest MAE at 957.2, and the second lowest MAPE at 19.2. The ensemble improved on the ETS model in terms of RMSE but was slightly worse in terms of MAE and MAPE. So forecast averaging improved robustness for the defense aircraft and parts orders series, especially on RMSE, the dollar-based measure that penalizes most heavily the large misses that matter for this volatile procurement series.

Forecast Comparison Plots

To visually compare model performance, I plot the forecasted values against the actual values over the test period, focusing on the two best models: ETS and the ensemble. The actual values in the test period are shown with a dashed line, and the vertical dashed line marks where the training split ends and the test split begins.

The first set of plots compares the ETS forecast and the ensemble forecast. The ETS plot includes 80% and 95% prediction intervals, which show the uncertainty around the forecast. The intervals are fairly wide since defense aircraft and parts orders are volatile and contain large irregular procurement spikes.

Most of the actual test values fall within the ETS prediction intervals, although the model still has difficulty predicting the timing and size of the largest spikes. This is most likely because irregular contract timing makes large defense orders hard for any univariate model to predict fully. Even so, ETS consistently tracks the general direction of the series.

The ensemble forecast also tracks the general movement of the actual series reasonably well, and in some periods it fits better than the ETS forecast, which is consistent with its slightly better RMSE in the accuracy results.

The ensemble plot therefore supports the idea that averaging the point forecasts from ETS, SNAIVE, and TSLM is beneficial for defense aircraft and parts orders. Like ETS, the ensemble still misses some sudden jumps in the actual data because large contract awards are difficult to predict using only past time-series patterns.

Both forecasting plots support the conclusion that ETS and the ensemble perform well, with ETS providing slightly better percentage-based accuracy and the ensemble slightly better dollar-based accuracy.

The second plot compares the three models used in the ensemble: ETS, TSLM, and SNAIVE. ETS provides a stable seasonal forecast, TSLM produces a more structured trend-plus-seasonality forecast, and SNAIVE repeats the seasonal pattern from the previous year. The plot shows how each model captures different aspects of the series: SNAIVE follows the monthly seasonal pattern but does not adjust well and is not as flexible, while ETS and TSLM track the overall level of the series much more closely.

Overall, the forecast plots support the accuracy table results. ETS is the strongest individual model, while the ensemble provides a more robust forecast by combining the strengths of ETS, TSLM, and SNAIVE. Both models are able to capture the general seasonal movement in defense aircraft and parts orders, but neither fully predicts the largest irregular spikes, which are likely connected to contract timing and procurement shocks.

plot_start <- yearmonth("2019 Jun")
forecast_start <- min(test$Month)

ETS_forecast_plot <- autoplot(train, orders_def) + 
  autolayer(test, orders_def, linetype = "dashed") +
  autolayer(fc_plot |> filter(.model == "ETS"), level = c(80, 95)) + 
  geom_vline(xintercept = forecast_start, linetype = "dashed") +
  coord_cartesian(xlim = c(plot_start, max(test$Month))) + 
  labs(
    title = "ETS Forecast vs Actual",
    x = "Month",
    y = "Millions of dollars"
) + theme_minimal()

Ensemble_forecast_plot <- autoplot(train, orders_def) +
  autolayer(test, orders_def, linetype = "dashed") +
  autolayer(ensemble_ts, .mean, colour = "red") +
  geom_vline(xintercept = forecast_start, linetype = "dashed") +
  coord_cartesian(xlim = c(plot_start, max(test$Month))) +
  labs(
    title = "Ensemble Forecast vs Actual",
    x = "Month",
    y = "Millions of dollars"
) + theme_minimal()

ETS_forecast_plot / Ensemble_forecast_plot

fc_plot |>
  filter(.model %in% c("ETS", "TSLM", "SNAIVE")) |>
  autoplot(train, level = NULL) +
  autolayer(test, orders_def, linetype = "dashed") +
  geom_vline(xintercept = forecast_start, linetype = "dashed") +
  coord_cartesian(xlim = c(plot_start, max(test$Month))) +
  labs(
    title = "ETS, TSLM, and SNAIVE Forecasts vs Actual Test Data",
    x = "Month",
    y = "Millions of dollars"
  ) +
  theme_minimal()

Residual diagnostics for ETS

Because ETS is the best individual model, its residual diagnostics are worth examining: they show whether the remaining forecast errors behave like random noise or whether some structure remains. The diagnostics plot the innovation residuals, the residual ACF, and the residual distribution.

The residual plot suggests that the ETS model captures most of the patterns in the series but not all of the irregular variation. The residual series is centered around zero, which means the model is not systematically over- or under-predicting over time. Some large residual spikes remain, which makes sense for a defense procurement series where sudden contract awards can cause jumps that are difficult to forecast accurately.

The residual ACF values are generally small, which suggests that ETS removes most of the time dependence in the data. The residual histogram, however, is not perfectly symmetric, most likely due to the irregular nature of defense orders and large positive order spikes.

The residual diagnostics support ETS as a reasonable forecasting model for this project but also shows that some unpredictability remains. Intuitively, factors like procurement timing, large sudden contracts, and changing geopolitical landscape would create shocks that are difficult for a single univariate forecasting model to accurately capture.

fit_models |>
  select(ETS) |>
  gg_tsresiduals(lag = 24)

fit_models |>
  select(ETS) |>
  augment() |>
  features(.innov, ljung_box, lag = 24)
# A tibble: 1 × 3
  .model lb_stat lb_pvalue
  <chr>    <dbl>     <dbl>
1 ETS       20.8     0.652

The Ljung-Box test gives a p-value of 0.652, well above 0.05, so there is no strong evidence of remaining autocorrelation in the ETS residuals.
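For reference, the Ljung-Box statistic over the first $h$ residual autocorrelations $r_k$ (standard definition, with $h = 24$ here) is

$$Q = T(T+2)\sum_{k=1}^{h}\frac{r_k^2}{T-k},$$

which is compared against a chi-squared distribution. The large p-value means we fail to reject the null hypothesis that the residuals are white noise.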

SARIMA

fit_models <- train |>
  model(
    NAIVE = NAIVE(orders_def),
    SNAIVE = SNAIVE(orders_def),
    ETS = ETS(orders_def),
    TSLM = TSLM(orders_def ~ trend() + season()),
    SARIMA = ARIMA(orders_def)
)

fc_plot <- fit_models |>
  forecast(h = h)

accuracy_tbl <- fc_plot |>
  accuracy(test) |>
  select(.model, RMSE, MAE, MAPE) |>
  arrange(RMSE)

fit_models |>
  select(SARIMA) |>
  report()
Series: orders_def 
Model: ARIMA(0,1,1)(0,0,2)[12] 

Coefficients:
          ma1    sma1    sma2
      -0.9496  0.2645  0.1956
s.e.   0.0185  0.0588  0.0511

sigma^2 estimated as 2859269:  log likelihood=-2646.98
AIC=5301.96   AICc=5302.1   BIC=5316.76

The ARIMA model automatically selected a seasonal specification: ARIMA(0,1,1)(0,0,2)[12]. This uses one regular difference, one non-seasonal moving average term, and two seasonal moving average terms. Because the data are monthly, the seasonal period is 12, so the seasonal part of the model captures shock patterns that repeat at yearly lags.
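Using backshift notation and the sign convention R uses for MA terms, the fitted model can be written as

$$(1-B)\,y_t = (1 + \theta_1 B)(1 + \Theta_1 B^{12} + \Theta_2 B^{24})\,\varepsilon_t$$

with the reported estimates $\theta_1 = -0.950$, $\Theta_1 = 0.265$, and $\Theta_2 = 0.196$.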

This automatic selection supports the earlier ACF/PACF results, since the ACF showed repeated seasonal structure and signs that shocks carry forward over time. The selected SARIMA model does not rely on autoregressive terms and instead uses moving average terms for recent and seasonal shocks. This makes sense for defense aircraft and parts orders: sudden contract awards or shifts in procurement timing create short-term shocks that influence the following months.

Compared to the other models, SARIMA performs better than the SNAIVE and NAIVE benchmarks, so it captures more time series structure than a basic seasonal repeat or last-value forecast. However, SARIMA does not outperform ETS, TSLM, or the ensemble. It is useful as a seasonal autocorrelation model but does not handle the irregular procurement shocks in the test period well enough. Overall, SARIMA is a decent traditional time series model that still performs worse than ETS and the ensemble.

Dynamic Regression/ARIMAX

A dynamic regression expands on univariate forecasting models by adding external predictors. In this project, I estimate dynamic regression models with ARIMA errors, also known as ARIMAX. External regressors may improve the forecast of defense aircraft and parts orders because additional information can explain variation that the series' own history cannot.

The external predictors used in this section are national security uncertainty, nondefense aircraft and parts orders, and defense aircraft and parts unfilled orders. National security uncertainty is included as a broad measure of the defense-related policy environment. Nondefense aircraft orders are included to represent broader aerospace demand. Defense unfilled orders are included as a backlog measure, meaning orders that have already been placed but have not yet been completed or filled.
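In general form, a dynamic regression with ARIMA errors models the target as a linear function of the lagged predictors plus an error term that itself follows an ARIMA process:

$$y_t = \beta_0 + \beta_1 x_{1,t-\ell} + \beta_2 x_{2,t-\ell} + \eta_t, \qquad \eta_t \sim \text{ARIMA}(p,d,q)$$

where $\ell$ is the lag length (1 or 3 months in the specifications below) and the ARIMA structure for $\eta_t$ is selected automatically by fable.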

#combining data sets
model_data <- def_orders |>
  left_join(natsec_epu, by = "Month") |>
  left_join(nondef_orders, by = "Month") |>
  left_join(def_unfilled_orders, by = "Month") |>
  left_join(nondef_unfilled_orders, by = "Month")
#creating lagged predictors
model_data <- model_data |>
  mutate(
    national_security_epu_lag1 = lag(national_security_epu, 1),
    national_security_epu_lag3 = lag(national_security_epu, 3),
    
    orders_nondef_lag1 = lag(orders_nondef, 1),
    orders_nondef_lag3 = lag(orders_nondef, 3),
    
    orders_def_unfilled_lag1 = lag(orders_def_unfilled, 1),
    orders_def_unfilled_lag3 = lag(orders_def_unfilled, 3),
    
    orders_nondef_unfilled_lag1 = lag(orders_nondef_unfilled, 1),
    orders_nondef_unfilled_lag3 = lag(orders_nondef_unfilled, 3)
  ) |>
  drop_na()

I use lagged versions of these variables instead of same-month values because some regressors may act as leading indicators, explaining structure through their past values. A forecasting model should also only use information that would have been available before the forecasted period.

#train/test split for dynamic regression
train_end <- max(train$Month)

dynamic_train <- model_data |>
  filter(Month <= train_end)

dynamic_test <- model_data |>
  filter(Month > train_end)

h_dynamic <- nrow(dynamic_test)
#fitting dynamic ARIMA models
fit_dynamic <- dynamic_train |>
  model(
    ARIMAX_1 = ARIMA(
      orders_def ~ national_security_epu_lag1 +
        orders_nondef_lag1
    ),
    
    ARIMAX_3 = ARIMA(
      orders_def ~ national_security_epu_lag3 +
        orders_nondef_lag3
    ),
    
    ARIMAX_BACKLOG_1 = ARIMA(
      orders_def ~ national_security_epu_lag1 +
        orders_def_unfilled_lag1
    ),
    
    ARIMAX_BACKLOG_3 = ARIMA(
      orders_def ~ national_security_epu_lag3 +
        orders_def_unfilled_lag3
    )
)
#forecast Dynamic Models
fc_dynamic <- fit_dynamic |>
  forecast(new_data = dynamic_test)

#dynamic accuracy table
dynamic_accuracy_tbl <- fc_dynamic |>
  accuracy(dynamic_test) |>
  select(.model, RMSE, MAE, MAPE) |>
  arrange(RMSE)

dynamic_accuracy_tbl
# A tibble: 4 × 4
  .model            RMSE   MAE  MAPE
  <chr>            <dbl> <dbl> <dbl>
1 ARIMAX_3         1480. 1116.  22.8
2 ARIMAX_BACKLOG_3 1572. 1172.  21.3
3 ARIMAX_1         1701. 1266.  25.4
4 ARIMAX_BACKLOG_1 1797. 1382.  26.3

Comparing the 1-month and 3-month lags, the 1-month lag captures a more immediate relationship while the 3-month lag captures a short delayed effect. This is theoretically reasonable: contract timing, budget decisions, and production planning all shape defense procurement, so the response of new defense aircraft orders may not be immediate.

The dynamic regression results show that the 3-month lag models perform better than the 1-month lag models, and the best model is ARIMAX_3, which uses 3-month lagged national security uncertainty and 3-month lagged nondefense aircraft orders. Broader uncertainty and aerospace demand may therefore have a delayed effect on defense aircraft and parts orders.

However, the ARIMAX models still do not outperform ETS, TSLM, or the ensemble. External variables may provide useful economic context, but in practice they do not fully explain the sudden contract-driven spikes in the series. It is also possible that my external predictors are not as closely connected to defense orders as expected, so other data sets could potentially perform better than the ones chosen here.

So the dynamic regression results are helpful since they show that external predictors can improve over some simple benchmarks, but the best performing forecasts still come from models that directly capture the seasonal and time-dependent structure of the defense orders series.

accuracy_all <- bind_rows(
  accuracy_tbl,
  dynamic_accuracy_tbl,
  ensemble_accuracy_tbl
) |>
  arrange(RMSE)

accuracy_all |>
  mutate(
    RMSE = round(RMSE, 1),
    MAE  = round(MAE, 1),
    MAPE = round(MAPE, 1)
  ) |>
  rename(
    Model = .model
  ) |>
  kable(
    format = "latex",
    booktabs = TRUE,
    caption = "Forecast Accuracy Comparison",
    align = c("l", "r", "r", "r")
  ) |>
  kable_styling(
    latex_options = c("hold_position", "striped"),
    full_width = FALSE,
    position = "center"
  )

XGBoost

# XGBoost feature data
set.seed(123)

xgb_data <- model_data |>
  mutate(
    trend_num = row_number(),
    month_factor = factor(month(as.Date(Month)), levels = 1:12),
    
    orders_def_lag1  = lag(orders_def, 1),
    orders_def_lag2  = lag(orders_def, 2),
    orders_def_lag3  = lag(orders_def, 3),
    orders_def_lag6  = lag(orders_def, 6),
    orders_def_lag12 = lag(orders_def, 12),
    
    orders_def_roll3 = (
      orders_def_lag1 +
      orders_def_lag2 +
      orders_def_lag3) / 3,
    
    orders_def_roll6 = (
      lag(orders_def, 1) +
      lag(orders_def, 2) +
      lag(orders_def, 3) +
      lag(orders_def, 4) +
      lag(orders_def, 5) +
      lag(orders_def, 6)) / 6) |>
  drop_na()
#Split XGBoost data
train_end <- max(train$Month)

xgb_train <- xgb_data |>
  filter(Month <= train_end)

xgb_test <- xgb_data |>
  filter(Month > train_end)
#create the XGBoost model matrix
xgb_formula <- ~ month_factor + trend_num +
  orders_def_lag1 + orders_def_lag2 + orders_def_lag3 +
  orders_def_lag6 + orders_def_lag12 +
  orders_def_roll3 + orders_def_roll6 +
  national_security_epu_lag3 + orders_nondef_lag3 +
  orders_def_unfilled_lag3 - 1

x_train <- model.matrix(xgb_formula, data = xgb_train)
y_train <- xgb_train$orders_def
#fit xgboost model
set.seed(123)

xgb_model <- xgboost(
  data = x_train,
  label = y_train,
  objective = "reg:squarederror",
  nrounds = 100,
  eta = 0.05,
  max_depth = 3,
  subsample = 0.8,
  colsample_bytree = 0.8,
  verbose = 0
)
#recursive xgboost forecast
future_months <- xgb_test |>
  as_tibble() |>
  select(
    Month,
    trend_num,
    month_factor,
    national_security_epu_lag3,
    orders_nondef_lag3,
    orders_def_unfilled_lag3
  )

history <- train |>
  as_tibble() |>
  select(Month, orders_def)

xgb_preds <- numeric(nrow(future_months))

for (i in seq_len(nrow(future_months))) {
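  # Recursive step: lag and rolling-average features are rebuilt from
  # "history", which holds actual values through the training period plus
  # the model's own predictions for earlier test months.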
  
  recent_values <- history$orders_def
  
  new_row <- tibble(
    month_factor = factor(
      future_months$month_factor[i],
      levels = 1:12
    ),
    trend_num = future_months$trend_num[i],
    
    orders_def_lag1  = recent_values[length(recent_values)],
    orders_def_lag2  = recent_values[length(recent_values) - 1],
    orders_def_lag3  = recent_values[length(recent_values) - 2],
    orders_def_lag6  = recent_values[length(recent_values) - 5],
    orders_def_lag12 = recent_values[length(recent_values) - 11],
    
    orders_def_roll3 = mean(tail(recent_values, 3)),
    orders_def_roll6 = mean(tail(recent_values, 6)),
    
    national_security_epu_lag3 = future_months$national_security_epu_lag3[i],
    orders_nondef_lag3 = future_months$orders_nondef_lag3[i],
    orders_def_unfilled_lag3 = future_months$orders_def_unfilled_lag3[i]
  )
  
  x_new <- model.matrix(xgb_formula, data = new_row)
  x_new <- x_new[, colnames(x_train), drop = FALSE]
  
  pred <- predict(xgb_model, x_new)
  pred <- max(0, pred)
  
  xgb_preds[i] <- pred
  
  history <- bind_rows(
    history,
    tibble(
      Month = future_months$Month[i],
      orders_def = pred
    )
  )
}
#create xgboost forecast table
xgb_fc_tbl <- test |>
  as_tibble() |>
  select(Month, orders_def) |>
  mutate(
    .model = "XGBoost",
    .mean = xgb_preds
  )
#calculate XGBoost accuracy
xgb_accuracy_tbl <- xgb_fc_tbl |>
  summarise(
    .model = "XGBoost",
    RMSE = sqrt(mean((orders_def - .mean)^2)),
    MAE = mean(abs(orders_def - .mean)),
    MAPE = mean(abs((orders_def - .mean) / orders_def)) * 100
  )

xgb_accuracy_tbl
# A tibble: 1 × 4
  .model   RMSE   MAE  MAPE
  <chr>   <dbl> <dbl> <dbl>
1 XGBoost 1253.  957.  20.2
#calculate xgboost accuracy
accuracy_all <- bind_rows(
  accuracy_tbl,
  dynamic_accuracy_tbl,
  xgb_accuracy_tbl,
  ensemble_accuracy_tbl
) |>
  arrange(RMSE)

accuracy_all
# A tibble: 11 × 4
   .model            RMSE   MAE  MAPE
   <chr>            <dbl> <dbl> <dbl>
 1 XGBoost          1253.  957.  20.2
 2 Ensemble         1298.  957.  19.2
 3 ETS              1354.  933.  17.5
 4 TSLM             1377. 1145.  25.9
 5 ARIMAX_3         1480. 1116.  22.8
 6 ARIMAX_BACKLOG_3 1572. 1172.  21.3
 7 SARIMA           1614. 1182.  22.5
 8 ARIMAX_1         1701. 1266.  25.4
 9 ARIMAX_BACKLOG_1 1797. 1382.  26.3
10 SNAIVE           1850. 1544.  31.6
11 NAIVE            2866. 2628.  63.3

The XGBoost model performs very well and is the best individual model so far. It outperforms the more traditional and simple time series models and produces the lowest RMSE among all the models tested. The lag variables, rolling averages, seasonal indicators, and external predictors appear to help the model reduce large dollar-based forecast errors. This matters for defense aircraft and parts orders because the series contains large contract shocks and RMSE penalizes larger errors more.

XGBoost does not have the lowest MAE or MAPE, but it is still very competitive since both are close to the best values, giving a good balance between the lowest RMSE and low MAE and MAPE. ETS performs better on the percentage-based metric, so it remains the best simple model for proportional forecast accuracy. Overall, XGBoost is very useful because it captures non-linear relationships in the engineered features, while ETS remains useful because it directly models the seasonal time-series structure.

#plot xgboost forecast
xgb_forecast_plot <- autoplot(train, orders_def) +
  autolayer(test, orders_def, linetype = "dashed") +
  geom_line(
    data = xgb_fc_tbl,
    aes(x = Month, y = .mean)
  ) +
  geom_vline(xintercept = forecast_start, linetype = "dashed") +
  coord_cartesian(xlim = c(plot_start, max(test$Month))) +
  labs(
    title = "XGBoost Forecast vs Actual",
    x = "Month",
    y = "Millions of dollars"
  ) +
  theme_minimal()

xgb_forecast_plot

#feature importance plot
xgb_importance <- xgb.importance(
  feature_names = colnames(x_train),
  model = xgb_model
)

xgb_importance |>
  as_tibble() |>
  slice_max(Gain, n = 10) |>
  ggplot(aes(x = reorder(Feature, Gain), y = Gain)) +
  geom_col() +
  coord_flip() +
  labs(
    title = "XGBoost Feature Importance",
    x = "Feature",
    y = "Gain"
  ) +
  theme_minimal()

The XGBoost feature importance plot shows that the most important predictors are trend_num and orders_def_lag12. This suggests the model relies heavily on both the long-run movement of defense aircraft and parts orders and the value from the same month in the previous year. The importance of orders_def_lag12 is strong evidence that the series contains a significant annual seasonal pattern.

Other important predictors are month_factor12 and orders_def_lag1, indicating that short-run order history and monthly seasonal effects also help explain the forecast.

Lagged external variables also appear fairly important: the weight on orders_nondef_lag3 and orders_def_unfilled_lag3 suggests that broader aerospace demand and the defense backlog improve the model and explain additional variation.

Overall, the feature importance results show that XGBoost performs well because it combines trend, seasonality, recent order behavior, and selected external predictors. But the strongest predictors still come from the defense orders series itself, so the past values of defense aircraft and parts orders are the most important input for making accurate predictions.

MLP

#mlp formula
mlp_formula <- ~ month_factor + trend_num +
  orders_def_lag1 + orders_def_lag2 + orders_def_lag3 +
  orders_def_lag6 + orders_def_lag12 +
  orders_def_roll3 + orders_def_roll6 +
  national_security_epu_lag3 + orders_nondef_lag3 +
  orders_def_unfilled_lag3 - 1
# Create training matrix
mlp_x_train_raw <- model.matrix(mlp_formula, data = xgb_train)
mlp_y_train <- xgb_train$orders_def

# Save scaling values from training data
x_means <- apply(mlp_x_train_raw, 2, mean)
x_sds <- apply(mlp_x_train_raw, 2, sd)
x_sds[x_sds == 0] <- 1

# Scale x variables
mlp_x_train <- scale(
  mlp_x_train_raw,
  center = x_means,
  scale = x_sds
)

# Scale y variable
y_mean <- mean(mlp_y_train)
y_sd <- sd(mlp_y_train)

mlp_y_train_scaled <- (mlp_y_train - y_mean) / y_sd

# Combine into one training data frame
mlp_train_tbl <- as_tibble(mlp_x_train) |>
  mutate(orders_def_scaled = mlp_y_train_scaled)
#fit the MLP model
set.seed(123)

mlp_model <- nnet(
  orders_def_scaled ~ .,
  data = mlp_train_tbl,
  size = 5,
  decay = 0.01,
  linout = TRUE,
  maxit = 1000,
  trace = FALSE
)
#recursive mlp forecast
future_months_mlp <- xgb_test |>
  as_tibble() |>
  select(
    Month,
    trend_num,
    month_factor,
    national_security_epu_lag3,
    orders_nondef_lag3,
    orders_def_unfilled_lag3
  )

history <- train |>
  as_tibble() |>
  select(Month, orders_def)

mlp_preds <- numeric(nrow(future_months_mlp))

for (i in seq_len(nrow(future_months_mlp))) {
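  # Same recursive scheme as the XGBoost loop: features come from actuals
  # plus earlier MLP predictions, scaled with the training-set means/sds
  # before predicting; the prediction is then unscaled and clamped at zero.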
  
  recent_values <- history$orders_def
  
  new_row <- tibble(
    month_factor = factor(
      future_months_mlp$month_factor[i],
      levels = 1:12
    ),
    trend_num = future_months_mlp$trend_num[i],
    
    orders_def_lag1  = recent_values[length(recent_values)],
    orders_def_lag2  = recent_values[length(recent_values) - 1],
    orders_def_lag3  = recent_values[length(recent_values) - 2],
    orders_def_lag6  = recent_values[length(recent_values) - 5],
    orders_def_lag12 = recent_values[length(recent_values) - 11],
    
    orders_def_roll3 = mean(tail(recent_values, 3)),
    orders_def_roll6 = mean(tail(recent_values, 6)),
    
    national_security_epu_lag3 = future_months_mlp$national_security_epu_lag3[i],
    orders_nondef_lag3 = future_months_mlp$orders_nondef_lag3[i],
    orders_def_unfilled_lag3 = future_months_mlp$orders_def_unfilled_lag3[i]
  )
  
  x_new_raw <- model.matrix(mlp_formula, data = new_row)
  x_new_raw <- x_new_raw[, colnames(mlp_x_train_raw), drop = FALSE]
  
  x_new_scaled <- scale(
    x_new_raw,
    center = x_means,
    scale = x_sds
  )
  
  pred_scaled <- predict(
    mlp_model,
    newdata = as.data.frame(x_new_scaled)
  )
  
  pred <- as.numeric(pred_scaled) * y_sd + y_mean
  pred <- max(0, pred)
  
  mlp_preds[i] <- pred
  
  history <- bind_rows(
    history,
    tibble(
      Month = future_months_mlp$Month[i],
      orders_def = pred
    )
  )
}
#create mlp forecast table
mlp_fc_tbl <- test |>
  as_tibble() |>
  select(Month, orders_def) |>
  mutate(
    .model = "MLP",
    .mean = mlp_preds
)
#calculate mlp accuracy
mlp_accuracy_tbl <- mlp_fc_tbl |>
  summarise(
    .model = "MLP",
    RMSE = sqrt(mean((orders_def - .mean)^2)),
    MAE = mean(abs(orders_def - .mean)),
    MAPE = mean(abs((orders_def - .mean) / orders_def)) * 100
  )

mlp_accuracy_tbl
# A tibble: 1 × 4
  .model  RMSE   MAE  MAPE
  <chr>  <dbl> <dbl> <dbl>
1 MLP    2827. 2196.  46.0

The MLP results are fairly poor compared to the other forecasting models: the forecasts were unstable and produced large errors during the test period. I expected it to perform better since it uses the same lag variables, rolling averages, seasonal indicators, and external predictors as XGBoost. Likely reasons for the poor performance are that the monthly defense aircraft and parts orders series is too volatile and that there may not be enough data to train even a simple neural network well. The result is still useful because it shows that a more flexible model does not automatically lead to better forecast accuracy.

#adding MLP to combined accuracy table
accuracy_all <- bind_rows(
  accuracy_tbl,
  dynamic_accuracy_tbl,
  xgb_accuracy_tbl,
  mlp_accuracy_tbl,
  ensemble_accuracy_tbl
) |>
  arrange(RMSE)
#plot MLP forecast
mlp_forecast_plot <- autoplot(train, orders_def) +
  autolayer(test, orders_def, linetype = "dashed") +
  geom_line(
    data = mlp_fc_tbl,
    aes(x = Month, y = .mean)
  ) +
  geom_vline(xintercept = forecast_start, linetype = "dashed") +
  coord_cartesian(xlim = c(plot_start, max(test$Month))) +
  labs(
    title = "MLP Forecast vs Actual",
    x = "Month",
    y = "Millions of dollars"
  ) +
  theme_minimal()

mlp_forecast_plot

NNETAR

Because the MLP performed poorly, I use a different neural network approach: a neural network autoregression model, NNETAR. This model fits a feed-forward neural network to lagged values of the series in order to capture non-linear relationships. NNETAR should ideally perform better than the MLP since it is designed directly for time series forecasting and selects its lagged inputs automatically.
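In the fable implementation, an NNETAR model feeds recent and seasonal lags of the series into a single-hidden-layer feed-forward network, roughly

$$y_t = f\left(y_{t-1}, \ldots, y_{t-p},\; y_{t-m}, \ldots, y_{t-Pm}\right) + \varepsilon_t$$

where $m = 12$ and the lag orders $p$ and $P$ and the number of hidden nodes are chosen automatically when NNETAR(orders_def) is called without arguments.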

# Neural network autoregression model
set.seed(123)

fit_nnetar <- train |>
  model(
    NNETAR = NNETAR(orders_def)
  )

fc_nnetar <- fit_nnetar |>
  forecast(h = h)

nnetar_accuracy_tbl <- fc_nnetar |>
  accuracy(test) |>
  select(.model, RMSE, MAE, MAPE)

nnetar_accuracy_tbl
# A tibble: 1 × 4
  .model  RMSE   MAE  MAPE
  <chr>  <dbl> <dbl> <dbl>
1 NNETAR 1797. 1453.  28.0

The NNETAR model performs slightly better than the MLP but still not as well as I expected. Its accuracy metrics are weaker than the best performing models such as XGBoost, the ensemble, ETS, and TSLM. Although neural networks can capture non-linear patterns, the monthly defense aircraft and parts orders series may not be large or stable enough for a neural network to beat more structured forecasting models.

fc_nnetar |>
  autoplot(train) +
  autolayer(test, orders_def, linetype = "dashed") +
  geom_vline(xintercept = forecast_start, linetype = "dashed") +
  coord_cartesian(xlim = c(plot_start, max(test$Month))) +
  labs(
    title = "NNETAR Forecast vs Actual",
    x = "Month",
    y = "Millions of dollars"
  ) +
  theme_minimal()

accuracy_all <- bind_rows(
  accuracy_tbl,
  dynamic_accuracy_tbl,
  xgb_accuracy_tbl,
  mlp_accuracy_tbl,
  nnetar_accuracy_tbl,
  ensemble_accuracy_tbl
) |>
  arrange(RMSE)

tail(accuracy_all)
# A tibble: 6 × 4
  .model            RMSE   MAE  MAPE
  <chr>            <dbl> <dbl> <dbl>
1 ARIMAX_1         1701. 1266.  25.4
2 ARIMAX_BACKLOG_1 1797. 1382.  26.3
3 NNETAR           1797. 1453.  28.0
4 SNAIVE           1850. 1544.  31.6
5 MLP              2827. 2196.  46.0
6 NAIVE            2866. 2628.  63.3

Final Ensemble Models

After estimating the individual forecasting models, I create the final ensemble forecasts, which average point forecasts from multiple models. The idea is that different models capture different parts of the time series and make different types of errors, so averaging their forecasts should reduce the effect of any one model's mistakes in a given month.

This may be helpful for defense aircraft and parts orders since the series contains both regular seasonal structure and sudden contract shocks. For example, ETS is well suited to the stable seasonal behavior, TSLM captures trend and monthly seasonal effects, and XGBoost captures non-linear relationships through the lagged and external predictors. ARIMAX adds a different source of information because it incorporates external predictors with ARIMA errors, and NNETAR provides a neural network benchmark. Each model in the ensemble covers different weaknesses, so combining them could improve robustness.

I assemble three ensemble specifications. The first is the traditional ensemble used previously, which simply averages the ETS, TSLM, and SNAIVE forecasts. The second is the best-model ensemble, which averages XGBoost, ETS, and TSLM, combining the strongest models tested so far. The third is the broad ensemble, which averages XGBoost, ETS, TSLM, ARIMAX_3, and NNETAR, a combination of most of the models.

Each ensemble is created by grouping the forecasts by month and taking the average of the point forecasts for that month. The ensemble forecasts are then evaluated on the same test period using RMSE, MAE, and MAPE.
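Each ensemble is an equal-weight average of its $K$ component point forecasts for each test month:

$$\hat{y}^{\text{ens}}_t = \frac{1}{K}\sum_{k=1}^{K}\hat{y}^{(k)}_t$$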

# Combine point forecasts from the main models
ensemble_inputs <- bind_rows(
  fc_plot |>
    as_tibble() |>
    filter(.model %in% c("ETS", "TSLM", "SNAIVE")) |>
    select(Month, .model, .mean),
  
  fc_dynamic |>
    as_tibble() |>
    filter(.model == "ARIMAX_3") |>
    select(Month, .model, .mean),
  
  xgb_fc_tbl |>
    select(Month, .model, .mean),
  
  fc_nnetar |>
    as_tibble() |>
    select(Month, .model, .mean)
)
# Traditional ensemble: ETS + TSLM + SNAIVE
traditional_ensemble <- ensemble_inputs |>
  filter(.model %in% c("ETS", "TSLM", "SNAIVE")) |>
  group_by(Month) |>
  summarise(.mean = mean(.mean), .groups = "drop") |>
  mutate(.model = "Traditional Ensemble")

# Best models ensemble: XGBoost + ETS + TSLM
best_model_ensemble <- ensemble_inputs |>
  filter(.model %in% c("XGBoost", "ETS", "TSLM")) |>
  group_by(Month) |>
  summarise(.mean = mean(.mean), .groups = "drop") |>
  mutate(.model = "Best Model Ensemble")

# Broad ensemble: XGBoost + ETS + TSLM + ARIMAX_3 + NNETAR
broad_ensemble <- ensemble_inputs |>
  filter(.model %in% c("XGBoost", "ETS", "TSLM", "ARIMAX_3", "NNETAR")) |>
  group_by(Month) |>
  summarise(.mean = mean(.mean), .groups = "drop") |>
  mutate(.model = "Broad Ensemble")
ensemble_accuracy <- function(ensemble_tbl, model_name) {
  test |>
    as_tibble() |>
    left_join(
      ensemble_tbl |> select(Month, .mean),
      by = "Month"
    ) |>
    summarise(
      .model = model_name,
      RMSE = sqrt(mean((orders_def - .mean)^2)),
      MAE = mean(abs(orders_def - .mean)),
      MAPE = mean(abs((orders_def - .mean) / orders_def)) * 100
    )
}
final_ensemble_accuracy_tbl <- bind_rows(
  ensemble_accuracy(traditional_ensemble, "Traditional Ensemble"),
  ensemble_accuracy(best_model_ensemble, "Best Model Ensemble"),
  ensemble_accuracy(broad_ensemble, "Broad Ensemble")
) |>
  arrange(RMSE)

final_ensemble_accuracy_tbl
# A tibble: 3 × 4
  .model                RMSE   MAE  MAPE
  <chr>                <dbl> <dbl> <dbl>
1 Broad Ensemble       1204.  904.  18.2
2 Best Model Ensemble  1207.  913.  19.2
3 Traditional Ensemble 1298.  957.  19.2
accuracy_all_final <- bind_rows(
  accuracy_tbl,
  dynamic_accuracy_tbl,
  xgb_accuracy_tbl,
  mlp_accuracy_tbl,
  nnetar_accuracy_tbl,
  final_ensemble_accuracy_tbl
) |>
  arrange(RMSE)

accuracy_all_final |>
  mutate(
    RMSE = round(RMSE, 1),
    MAE = round(MAE, 1),
    MAPE = round(MAPE, 1)
  ) |>
  rename(Model = .model) |>
  kable(
    format = "latex",
    booktabs = TRUE,
    caption = "Final Forecast Accuracy Comparison",
    align = c("l", "r", "r", "r")
  ) |>
  kable_styling(
    latex_options = c("hold_position", "striped"),
    full_width = FALSE,
    position = "center"
  )
ensemble_plot_tbl <- bind_rows(
  traditional_ensemble,
  best_model_ensemble,
  broad_ensemble
)

ggplot() +
  geom_line(
    data = train |> filter(Month >= plot_start),
    aes(x = Month, y = orders_def)
  ) +
  geom_line(
    data = test,
    aes(x = Month, y = orders_def),
    linetype = "dashed"
  ) +
  geom_line(
    data = ensemble_plot_tbl,
    aes(x = Month, y = .mean, color = .model)
  ) +
  geom_vline(xintercept = forecast_start, linetype = "dashed") +
  labs(
    title = "Final Ensemble Forecasts vs Actual",
    x = "Month",
    y = "Millions of dollars",
    color = "Model"
  ) +
  theme_minimal()

The best ensemble is the broad ensemble, which performs better than the individual models, including the more advanced ones. This is most likely because it averages forecasts from models that err in different ways: XGBoost captures non-linear relationships, ETS captures the stable seasonal structure, TSLM handles trend and seasonal effects, ARIMAX incorporates the external predictors, and NNETAR may pick up patterns the other models miss. Even models that perform poorly individually can still improve the ensemble if their forecast errors differ from those of the better performing models. The broad ensemble has the lowest RMSE and MAE, which are dollar based errors, but ETS still has the lowest MAPE, which is percentage based.

Results and Analysis

The final accuracy comparison shows that the broad ensemble is the best performing model on dollar based error measures, producing the lowest RMSE and MAE. This matters because defense aircraft and parts orders are measured in millions of dollars and contain large contract driven spikes. The best model ensemble performs almost as well, which suggests that combining the strongest individual models already delivers most of the accuracy gain. Adding ARIMAX and NNETAR to form the broad ensemble improves on the best model ensemble slightly, so those two models evidently contribute error patterns that partially offset the others.

Among the individual models, XGBoost performs best by RMSE. This suggests that the engineered features, including lag variables, rolling averages, seasonal indicators, trend, and external predictors, help reduce large forecasting errors. The XGBoost feature importance plot shows that trend_num and orders_def_lag12 are the most important predictors, which adds some transparency to how the model processes the data: it relies heavily on the long run movement of the series and on the value from the same month in the previous year. This is consistent with the earlier ACF/PACF plots, which supported strong annual seasonality and time dependent structure.
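For reference, an importance plot of this kind can be produced with xgboost's built in helpers; a minimal sketch, assuming the fitted booster from the earlier XGBoost section is stored as xgb_model (the object name is illustrative):

# Gain-based feature importance from the trained booster
importance_tbl <- xgb.importance(model = xgb_model)
xgb.plot.importance(importance_tbl, top_n = 10)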

The strongest simple individual model is still ETS. It no longer has the lowest RMSE or MAE, but it does have the lowest MAPE at 17.51, which is impressive since it beats the best overall model on that metric. This may be because it models the level and seasonal structure of the series directly: the selected specification is ETS(A,N,A), which captures additive seasonality with no smooth long run trend. The low alpha and gamma values also suggest that the model updates the level slowly and keeps the seasonal pattern stable over time.
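For reference, following Hyndman & Athanasopoulos (2021), the ETS(A,N,A) model can be written in innovations state space form (with $m = 12$ for monthly data):

$$
\begin{aligned}
y_t &= \ell_{t-1} + s_{t-m} + \varepsilon_t \\
\ell_t &= \ell_{t-1} + \alpha\,\varepsilon_t \\
s_t &= s_{t-m} + \gamma\,\varepsilon_t
\end{aligned}
$$

Small values of $\alpha$ and $\gamma$ mean each new observation moves the level and seasonal components only slightly, which matches the stable behavior described above.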

The comparison between simple models and more flexible models like XGBoost shows the value of flexibility. NAIVE performs poorly because it only carries forward the most recent value and ignores seasonality. SNAIVE performs much better because it repeats the value from the same month in the previous year, which suits monthly seasonal series. TSLM improves on both by including trend and seasonal indicators, but it is still less flexible than ETS because it assumes the trend and seasonal effects are stable over time. A sketch of these specifications appears below.
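To make the comparison concrete, the three benchmarks differ only in what they condition on; a sketch of their fable specifications (the actual fits appear earlier in the document):

fit_benchmarks <- train |>
  model(
    NAIVE  = NAIVE(orders_def),                      # carries forward the last value
    SNAIVE = SNAIVE(orders_def),                     # repeats the value from 12 months earlier
    TSLM   = TSLM(orders_def ~ trend() + season())   # linear trend + monthly dummies
  )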

Moving on to the slightly more flexible ARIMA/SARIMA class: it performs better than NAIVE and SNAIVE since it captures more of the time series structure. The automatic search selected an ARIMA(0,1,1)(0,0,2)[12] specification, which uses first differencing plus non-seasonal and seasonal moving average terms to capture recent and seasonal shock behavior. Although this is in some ways more flexible than ETS, it still does not perform better, suggesting that ETS fits this particular series more closely. SARIMA handles autocorrelation and seasonality well but cannot accurately capture the procurement spikes in the forecasting period.
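In backshift notation, and ignoring any constant, the selected ARIMA(0,1,1)(0,0,2)[12] model is

$$(1 - B)\,y_t = (1 + \theta_1 B)\,(1 + \Theta_1 B^{12} + \Theta_2 B^{24})\,\varepsilon_t,$$

so each forecast of the differenced series combines last month's error through $\theta_1$ with the errors from one and two years earlier through $\Theta_1$ and $\Theta_2$.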

The dynamic regression models introduce external regressors to help explain the series. In particular, the 3 month lag ARIMAX model performs better than the 1 month lag version, which suggests that external predictors affect defense aircraft and parts orders with a delay rather than immediately. The best dynamic regression model is ARIMAX_3, which uses 3 month lagged national security uncertainty and 3 month lagged non-defense aircraft orders. However, the ARIMAX models do not outperform XGBoost, ETS, TSLM, or the ensembles, so the external regressors provide useful economic interpretation but do not fully explain the contract shocks.
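For reference, a dynamic regression of this form is specified in fable by placing the lagged regressors on the right hand side of the ARIMA formula; a minimal sketch, with illustrative column names natsec_uncertainty and nondef_orders standing in for the predictors used earlier:

# Sketch: regression on 3-month-lagged predictors with ARIMA errors
# (column names are hypothetical stand-ins for the actual regressors)
fit_arimax3_sketch <- train |>
  model(
    ARIMAX_3 = ARIMA(orders_def ~ lag(natsec_uncertainty, 3) +
                       lag(nondef_orders, 3))
  )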

The neural network models are included as benchmarks to see how well this approach performs on a fairly small sample with volatile shocks. The MLP performs poorly, beating only NAIVE; it struggles because the series contains large irregular shocks and offers a small sample for the network to learn from. NNETAR performs better since it is designed specifically for time series forecasting, but it still falls short of XGBoost, ETS, and the ensembles. This suggests that more flexible models do not always beat simpler models, especially when their data requirements are not met.

Overall, the results show that defense aircraft and parts orders contain significant forecastable structure, but the series also contains strong irregular procurement shocks that are hard for any model to predict. The broad ensemble performs best because it combines different types of models and averages their monthly forecasts, reducing errors more effectively than any individual model.

Conclusion/Discussion

The results show that monthly new defense aircraft and parts orders contain meaningful forecastable structure, although no model can perfectly predict the irregular procurement shocks and level changes. The series has clear seasonal and time dependent structure, which explains why models such as ETS, TSLM, and XGBoost outperform simpler approaches like NAIVE and SNAIVE.

The strongest overall model is the broad ensemble, which combines XGBoost, ETS, TSLM, ARIMAX_3, and NNETAR. It achieves the lowest RMSE and MAE, so it performs well at reducing dollar based forecasting errors. Its success suggests that combining different model types helps capture different aspects of the data.

The best individual model by RMSE is XGBoost, so the inclusion of lag variables, rolling averages, seasonal indicators, trend, and external predictors appears to reduce some of the large forecast errors. An argument can also be made for ETS: it has the lowest MAE and MAPE among the individual models, and the lowest MAPE of any model including the ensembles, so it performs well at reducing percentage based errors. The ARIMAX models do not score as well on the accuracy metrics but show that external predictors still provide useful economic context. The weak performance of the neural network models also suggests that more flexible models do not always perform better, especially when there is not enough training data.

Overall, the broad ensemble is the most useful final model when the goal is to reduce large dollar based errors, XGBoost is the strongest individual model on that same basis, and ETS remains the strongest model for reducing percentage based errors. These forecasts could support firms in the defense aircraft and parts sector with capacity planning, supplier coordination, and inventory decisions, though they still struggle to fully anticipate irregular procurement shocks.

Limitations

A limitation of this project is that the external predictors are relatively broad rather than defense specific. Variables like national security uncertainty, non-defense aircraft orders, and unfilled orders explain some of the series but may not provide the most timely or precise information for predicting when orders arrive.

Another limitation is that the data contain large shocks that appear to reflect influxes of contract awards, whose timing often looks close to random. This makes the shocks difficult to predict, since models generally struggle with large irregular components in time series data. Even the best model can still miss the exact timing and size of a large shock.

The last limitation is that the neural network models likely need more training data than this series provides, which may explain why both MLP and NNETAR perform poorly. The sample is probably too short, or not stable enough, for these models to perform well.

Real life implications review

Even with these limitations, the project still has real world value. It produces a reasonably accurate short horizon forecast for new defense aircraft and parts orders, which could potentially help firms with capacity planning, supplier coordination, labor scheduling, and inventory decisions. It could also help analysts and policy makers monitor how changes in the defense environment translate into manufacturing activity.

The broad ensemble forecast is most useful when the goal is to reduce large dollar forecasting errors, which matters because large misses in a defense manufacturing series have meaningful implications for production planning and supplier coordination. ETS is useful because it is simple and interpretable, and it is the best choice for reducing percentage based errors. XGBoost has the lowest RMSE among individual models, which makes it attractive if only a single model can be used.

These results suggest that firms should not rely on a single forecasting model, since defense aircraft and parts orders are affected by seasonality, backlogs, broader aerospace demand, and sudden order shocks. Because different models capture different aspects of the series, an ensemble forecast is likely to provide the most stable and accurate predictions.

Future work/improvements

There are several ways this project could be improved in future work. Adding more specific defense related predictors, such as contract award data, industrial production measures, or firm level production schedules, could explain the sudden spikes in defense aircraft and parts orders better than the broader external variables used here.

Another improvement would be scenario based forecasting, which could provide important information during periods of geopolitical stress since defense procurement is fairly volatile. Scenario forecasts could support policy planning by laying out alternative paths, helping policy makers and industrial firms organize and time production.

Overall, the UDAPNO series is meaningfully forecastable in the short run, but the remaining error reflects the fact that defense procurement is driven by both regular seasonal behavior and unpredictable contract driven shocks.

Bibliography

Baker, S. R., Bloom, N., & Davis, S. J. Economic Policy Uncertainty Index: Categorical Index: National Security [EPUNATSEC]. FRED, Federal Reserve Bank of St. Louis.

Bates, J. M., & Granger, C. W. J. (1969). The combination of forecasts. Operational Research Quarterly, 20(4), 451–468.

Chen, T., & Guestrin, C. (2016). XGBoost: A scalable tree boosting system. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 785–794.

Hyndman, R. J., & Athanasopoulos, G. (2021). Forecasting: Principles and Practice (3rd ed.). OTexts.

Makridakis, S., Spiliotis, E., & Assimakopoulos, V. (2018). The M4 Competition: Results, findings, conclusion and way forward. International Journal of Forecasting, 34(4), 802–808.

Nekarda, C. J., & Ramey, V. A. (2010). Industry evidence on the effects of government spending. NBER Working Paper No. 15754.

Ramey, V. A. (2011). Identifying government spending shocks: It’s all in the timing. Quarterly Journal of Economics, 126(1), 1–50.

U.S. Census Bureau. Manufacturers’ New Orders: Defense Aircraft and Parts [UDAPNO]. FRED, Federal Reserve Bank of St. Louis.
Zhang, G. P. (2003). Time series forecasting using a hybrid ARIMA and neural network model. Neurocomputing, 50, 159–175.