Naive, Basic, Dynamic, and Advanced Forecasting Models
Author: AS
Published: today
Categories: basic_models
Libraries
I often like to centralize my libraries, particularly if I am going to cite them later. (I don’t like to forget any package citations.) Use the citation() function for full details.
library(eia)        # for EIA data (not used below, but kept for parity with the original)
library(fredr)      # FRED API access
library(tidyverse)  # dplyr, ggplot2, purrr, lubridate, ...
library(magrittr)   # pipe helpers
library(jsonlite)   # JSON parsing
library(tsibble)    # time-series tibbles
library(fable)      # forecasting models (MEAN, NAIVE, ETS, ARIMA, ...)
library(feasts)     # decomposition and TS graphics (STL, gg_season, ACF, ...)
library(latex2exp)  # TeX() plot labels
library(kableExtra) # table styling
EIA_KEY not found. See `vignette("api", "eia")` for key storage options.
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ magrittr::extract() masks tidyr::extract()
✖ dplyr::filter() masks stats::filter()
✖ purrr::flatten() masks jsonlite::flatten()
✖ kableExtra::group_rows() masks dplyr::group_rows()
✖ tsibble::interval() masks lubridate::interval()
✖ dplyr::lag() masks stats::lag()
✖ purrr::set_names() masks magrittr::set_names()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Keys
I have API keys for foundational TS data (FRED and EIA). Obviously, I keep them hidden. :)
Data
So the easiest way to get FRED data is to invoke the fredr package.
# Keep the key out of the source: store it as an environment variable
# (e.g., FRED_API_KEY in ~/.Renviron) and load it here.
fredr_set_key(Sys.getenv("FRED_API_KEY"))  # replace with your own key please

unemp <- fredr(
  series_id         = "UNRATE",
  observation_start = as.Date("2020-04-01"),
  observation_end   = as.Date("2025-03-31")
)
I keep it easy and drop unnecessary variables. Tsibbles are also particular about dates, so I force the months into a compliant yearmonth index.
unemp$Month <- seq(as.Date("2020/4/1"), as.Date("2025/3/31"), by = "1 month")
unemp$Month <- yearmonth(unemp$Month)
unemp$series_id <- NULL
mydata <- unemp
str(mydata)
I am going to look at the data and fit a LOESS to the train set. I will then (for simplicity) fit a linear model to the last 6 months of LOESS fitted values and use it to forecast the test set. Finally, I calculate metrics for this super simple forecasting method.
library(tsibble)
library(dplyr)
library(ggplot2)

# Create the tsibble
myts <- mydata %>% as_tsibble(index = Month)

# Split train/test
train <- myts[1:48, ]
test  <- myts[49:nrow(myts), ]

# Convert Month to numeric for modeling
myts2  <- myts %>% mutate(x_numeric = as.numeric(Month))
train2 <- myts2[1:48, ]
test2  <- myts2[49:nrow(myts2), ]

# Fit LOESS on training data
loess_model <- loess(value ~ x_numeric, data = train2, span = 0.75)

# Get LOESS fitted values for all train data
train2 <- train2 %>% mutate(loess_fit = predict(loess_model))

# Select last 6 months of training for linear extension
last6 <- tail(train2, 6)

# Fit linear model on last 6 LOESS points
tail_model <- lm(loess_fit ~ x_numeric, data = last6)

# Predict over test set with confidence intervals
predictions <- predict(tail_model, newdata = test2,
                       interval = "confidence", level = 0.95)

# Bind predictions into test2
test2 <- test2 %>%
  mutate(
    linear_extension = predictions[, "fit"],
    lower = predictions[, "lwr"],
    upper = predictions[, "upr"]
  )

# Compute forecast error metrics
errors <- test2$value - test2$linear_extension
ME   <- mean(errors)
MPE  <- mean(errors / test2$value) * 100
MAE  <- mean(abs(errors))
RMSE <- sqrt(mean(errors^2))
MAPE <- mean(abs(errors) / test2$value) * 100

# For MASE: compute naive forecast MAE on training data
train_naive_errors <- train2$value[-1] - train2$value[-nrow(train2)]
naive_mae <- mean(abs(train_naive_errors))
MASE <- MAE / naive_mae

# Create a formatted metrics label
metrics_label <- sprintf(
  "Test Set Forecast Metrics:\nME = %.3f\nMPE = %.2f%%\nMAE = %.3f\nRMSE = %.3f\nMAPE = %.2f%%\nMASE = %.3f",
  ME, MPE, MAE, RMSE, MAPE, MASE
)

# Plot everything with metrics annotation
ggplot() +
  geom_line(data = train2, aes(x = Month, y = value, color = "Train Actual"), linewidth = 0.6) +
  geom_line(data = test2, aes(x = Month, y = value, color = "Test Actual"), linewidth = 0.6) +
  geom_line(data = train2, aes(x = Month, y = loess_fit, color = "LOESS Fit"), linewidth = 0.6) +
  geom_line(data = test2, aes(x = Month, y = linear_extension, color = "Linear Extension"), linewidth = 0.8) +
  geom_ribbon(data = test2,
              aes(x = Month, ymin = lower, ymax = upper, fill = "95% CI"), alpha = 0.3) +
  annotate("text",
           x = min(test2$Month) - 10,            # adjust x position if needed
           y = max(train2$value, na.rm = TRUE),  # adjust y position if needed
           label = metrics_label, hjust = 0, vjust = 1, size = 4) +
  scale_color_manual(
    name = "Legend",
    values = c("Train Actual" = "black", "Test Actual" = "red",
               "LOESS Fit" = "green", "Linear Extension" = "blue")
  ) +
  scale_fill_manual(name = "", values = c("95% CI" = "gray")) +
  scale_y_continuous(limits = c(0, NA)) +
  labs(
    title = "Monthly Unemployment Rate with LOESS, Linear Extension Forecast, and Error Metrics",
    x = "Month",
    y = "Unemployment Rate"
  )
We can see the leveling off from a high of nearly 15% during the COVID spike to under 5%. Pushing a linear forecast forward from a LOESS fit of the previous 6 months beats a simple naive forecast (MASE ≈ 0.8). The forecast is biased downward (negative ME and MPE), and the MAE, RMSE, and MAPE suggest reasonably low variance. And we haven’t even decomposed the data yet.
The gg_season plot reveals two things: a trend and, possibly, a seasonal component. We can see that 2020 exhibited the highest unemployment, followed by 2021, 2025, 2024, 2022, and 2023. So we conclude that 2020 and 2021 exhibit true nonstationarity and that the trend effect is nonlinear. There is no obvious seasonal component from observing the data, but we will look further.
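The plotting chunks themselves are not shown above; a minimal sketch of the seasonal plot with feasts, assuming the myts tsibble built earlier:

library(feasts)  # gg_season(), gg_subseries(), gg_lag(), ACF(), PACF()

# One line per year across the calendar months, so years can be compared directly
myts %>%
  gg_season(value, labels = "both") +
  labs(y = "Unemployment Rate", title = "Seasonal plot: UNRATE")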
The gg_subseries plot appears to show a declining unemployment rate associated with the months April through December, but a word of caution… The outlier years have affected that appearance dramatically.
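Again, a sketch of the subseries plot being described (same assumed tsibble):

# One panel per calendar month; the blue line marks each month's mean across years
myts %>%
  gg_subseries(value) +
  labs(y = "Unemployment Rate", title = "Subseries plot: UNRATE")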
The lag plot shows obvious autocorrelation at lag 1 and some at lags 2 and 3, decreasing rapidly beyond that. The 45-degree reference line is the line $y_t = y_{t-k}$, where $k$ is the lag. You can see how the spread and the deviation from the 45-degree line increase as we increase the lag. We will test which lags are appropriate at some point.
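A sketch of the lag plot (the original chunk is not shown):

# Scatter y_t against y_{t-k} for k = 1..9; the diagonal marks y_t = y_{t-k}
myts %>%
  gg_lag(value, lags = 1:9, geom = "point") +
  labs(x = "lag(value, k)")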
The ACF plot shows significant autocorrelation through 9 periods, finally dropping off at period 10. Recall, the ACF is the correlation of $y_t$ with $y_{t-k}$ (after removing the mean). The PACF plot shows significant correlation only at lag 1. This plot removes the preceding autocorrelation to see if any autocorrelation remains. If we want the PACF at lag $k$, we regress $y_t$ on lags $1, 2, \ldots, k-1$, regress $y_{t-k}$ on the same lags, and correlate the residuals of the two regressions: that correlation is the PACF. What we don’t have is white noise.
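A sketch of the correlograms using the feasts verbs:

# ACF: correlation of y_t with y_{t-k}
myts %>% ACF(value, lag_max = 24) %>% autoplot()

# PACF: correlation of y_t with y_{t-k} after regressing out lags 1..k-1
myts %>% PACF(value, lag_max = 24) %>% autoplot()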
Transformations
Our next step is transformation of the data prior to decomposition. In our case, we can use a Box–Cox transformation since the data are strictly positive. That transformation is defined as:
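$$
w_t =
\begin{cases}
\log(y_t) & \text{if } \lambda = 0, \\
\dfrac{y_t^{\lambda} - 1}{\lambda} & \text{otherwise.}
\end{cases}
$$

The Guerrero method used below picks the $\lambda$ that makes the variation across the series as stable as possible.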
lambda <- myts %>%
  features(value, features = guerrero) %>%
  pull(lambda_guerrero)

# Plot Box–Cox transformed series with a LaTeX lambda label
myts %>%
  autoplot(box_cox(value, lambda)) +
  labs(
    y = "",
    title = latex2exp::TeX(paste0(
      "Transformed series with $\\lambda = ", round(lambda, 2), "$"
    ))
  )
We can see here that the selected lambda is almost a reciprocal transformation. But that lambda did not help at all.
Decomposition
We can decompose our data into different components: Seasonal, Trend, and Residual.
One of the most commonly used methods is STL (Seasonal-Trend decomposition using LOESS), which estimates the trend and seasonal components via LOESS smoothing.
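The decomposition chunk is not shown above; a minimal sketch in the fable/feasts framework:

# Decompose into trend, seasonal, and remainder components via LOESS smoothing
myts %>%
  model(STL(value)) %>%
  components() %>%
  autoplot()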
Looking at the full data, the multiplicative decomposition has the lower ME and MPE (bias measures). It has the lower RMSE, MAE, and MAPE (variance measures). It has the smaller scaled measures (MASE and RMSSE). RMSSE is the square root of the model’s mean squared error scaled by that of a naive forecast, just as MASE scales the absolute errors. Multiplicative wins. We use the log going forward.
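The comparison chunk is not shown; one hedged way it could be reproduced is to fit both classical decompositions, forecast the seasonally adjusted series naively, and compare the in-sample accuracy:

# Additive vs. multiplicative classical decompositions, compared on training accuracy
myts %>%
  model(
    additive = decomposition_model(
      classical_decomposition(value, type = "additive"), NAIVE(season_adjust)),
    multiplicative = decomposition_model(
      classical_decomposition(value, type = "multiplicative"), NAIVE(season_adjust))
  ) %>%
  accuracy()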
Now, we cannot forecast with a traditional STL fit by itself. Why? The LOESS components do not extrapolate beyond the data. To forecast, we either need more traditional decompositions or we must pair the decomposition with models for its components.
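A hedged sketch of the pairing approach, using fable’s decomposition_model() (shown on the raw series for simplicity):

# ETS forecasts the seasonally adjusted series (seasonality forced off); the
# seasonal component is re-applied with SNAIVE, decomposition_model's default
train %>%
  model(stl_ets = decomposition_model(
    STL(value),
    ETS(season_adjust ~ season("N"))
  )) %>%
  forecast(new_data = test) %>%
  autoplot(myts)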
Forecasting Models (Basic)
So next we move to some basic forecasting models. We will model log(value), since the multiplicative decomposition suggested that is the better scale.
The first model is nothing more than a mean model. We acquire the mean of the TS and use that for forecasting. Simple.
The second model is a drift model (not a linear fit). We acquire the first and last observations, and use these to generate a drift component. Easy.
The third model is a naive model. Next period forecast is last period observation. Obviously, it runs out of juice for longer forecasts.
The fourth model is a seasonal naive. Next January forecast is this January’s observation. This type of forecast can work well for data that only has seasonality (no trend).
The fifth model is a time series linear regression model.
The sixth model is an Error, Trend, and Seasonality (ETS) model. We allow the software to auto‑choose the terms.
The seventh model is an ARIMA (autoregressive integrated moving average).
The eighth model is a simple ensemble. A sketch of how all eight models might be specified follows.
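The model-fitting chunk itself is not shown above, so here is a minimal sketch of how m1 through m8 might be specified with fable. The exact TSLM terms and ensemble members are assumptions; only the names m1 through m8 come from the loops below.

# Hedged reconstructions of the eight fits (specifications are assumptions)
m1 <- train %>% model(mean   = MEAN(log(value)))                       # mean
m2 <- train %>% model(drift  = RW(log(value) ~ drift()))               # drift
m3 <- train %>% model(naive  = NAIVE(log(value)))                      # naive
m4 <- train %>% model(snaive = SNAIVE(log(value)))                     # seasonal naive
m5 <- train %>% model(tslm   = TSLM(log(value) ~ trend() + season()))  # TS regression
m6 <- train %>% model(ets    = ETS(log(value)))                        # auto-chosen ETS
m7 <- train %>% model(arima  = ARIMA(log(value)))                      # auto ARIMA
m8 <- train %>%                                                        # ensemble
  model(ets = ETS(log(value)), arima = ARIMA(log(value))) %>%
  mutate(ensemble = (ets + arima) / 2) %>%
  select(ensemble)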
Reports
Of course, we want complete reports for the estimates.
for (i in 1:8) {
  fit <- get(paste0("m", i))  # cleaner than eval(parse(...))
  report(fit)
}
Next, we visualize each model’s forecast against the original data. The custom plotting function compares the fitted values and forecasts over the test period.
# Ensure log-transformed values are present
myts$logvalue <- log(myts$value)
test$logvalue <- log(test$value)

# Define plotting function: overlay fitted values and the test-period forecast
myplot <- function(model) {
  forecast(model, new_data = test) |>
    autoplot(myts) +
    geom_line(aes(y = .fitted, color = .model), data = augment(model)) +
    ggtitle(names(model))
}

# Generate plots for models m1 to m8
for (i in 1:8) {
  print(myplot(get(paste0("m", i))))
}
Accuracy Metrics
We calculate common forecast accuracy metrics for each model using the test set. Results are stored and compared by MAPE.
# Initialize empty object for storing metrics
metrics_all <- NULL

for (i in 1:8) {
  model_forecast <- get(paste0("m", i)) |> forecast(new_data = test)
  model_accuracy <- model_forecast |> accuracy(test)
  metrics_all <- rbind(metrics_all, model_accuracy)
}

# Clean and rank by MAPE
metrics_all <- metrics_all[metrics_all$RMSE > 0, ]
metrics_all$MASE <- metrics_all$RMSSE <- metrics_all$.type <- NULL

# Display ranked accuracy table
metrics_all |> arrange(MAPE)