Machine Learning for Time Series Forecasting

Author

Arvind Sharma

Published

November 23, 2025

This document demonstrates various time series forecasting approaches on monthly unemployment rate data from FRED, comparing traditional statistical methods (ETS, ARIMA), machine learning models (Random Forest, XGBoost), and deep learning architectures (RNN, LSTM, CNN-LSTM, Transformer).


Neural Networks: Quick Reference Guide

WATCH: But what is a neural network? | Deep learning chapter 1

There are four major types of neural networks you need to know about, each designed for different problems:

| Model Type | What It Learns | Best For | Main Weakness |
| --- | --- | --- | --- |
| Feedforward (MLP) | Feature combinations via dense layers | Tabular data | Ignores structure |
| CNN | Spatial patterns | Images, audio | Not suited to text/time series |
| LSTM | Temporal patterns | Sequences, time series | Slow; largely replaced by Transformers |
| Transformer | Global relationships via attention | NLP, vision, multimodal | High compute cost |

Quick decision rule: Feedforward for tables, CNNs for images, LSTMs for sequences when you need something lightweight, Transformers for everything else (especially language).


Understanding Each Architecture

1. Feedforward Networks (MLPs - Multi-Layer Perceptrons)


This is your basic neural network. Data goes in one end, passes through some hidden layers, comes out the other. Pretty straightforward.

How they work: Each layer has neurons that connect to everything in the next layer. The math is simple - you multiply inputs by weights, add a bias term, then run it through an activation function like ReLU or sigmoid.
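As a minimal sketch of that arithmetic (toy numbers only, unrelated to the forecasting code later in this document), one dense layer with two neurons looks like this in base R:

# One dense layer: output = activation(weights %*% inputs + bias)
x <- c(0.5, -1.2, 3.0)                       # 3 input features
W <- matrix(runif(2 * 3, -1, 1), nrow = 2)   # 2 neurons, each with 3 weights
b <- c(0.1, -0.2)                            # one bias per neuron
relu <- function(z) pmax(z, 0)               # ReLU activation
y <- relu(W %*% x + b)                       # 2 outputs, one per neuron
y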

Best for: Regular tabular data - spreadsheets, databases, that kind of stuff. Classification, regression, anything where the order or position of features doesn’t really matter.

Limitations: They’re terrible with images and sequences. An MLP just sees pixels as a flat list of numbers. It has no concept that nearby pixels form shapes or that words in a sentence relate to each other. A 28×28 image needs 784 input connections per neuron - very inefficient.


2. CNNs (Convolutional Neural Networks)

CNNs were basically invented for images. The breakthrough was using convolutions - small filters that slide across an image looking for patterns. Early layers detect edges and textures, deeper layers recognize complex shapes and objects.

The genius of CNNs: They understand spatial relationships. They know that neighboring pixels matter more than distant ones. And they use the same filters everywhere, so a face detector works whether the face is top-left or bottom-right.

Figure: CNN architecture with its key components - https://developersbreach.com/convolution-neural-network-deep-learning/. A similar picture, with several CNN key components explained, is at https://glasswing.vc/blog/ai-atlas-16-convolutional-neural-networks-cnns/.

Average pooling: computes the average of each patch of the feature map.

Figure: flattening a 3×3 image matrix into a 9×1 vector - https://saturncloud.io/blog/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way/

WATCH: convolving a 5×5×1 image with a 3×3×1 kernel to get a 3×3×1 convolved feature - https://saturncloud.io/blog/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way/

Efficiency advantage: Compare a regular neural network on a 28×28 image (needs 784 connections per neuron) to CNNs that use tiny 3×3 filters shared across the whole image. Way more efficient.
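To make the sliding-filter idea concrete, here is a minimal base-R sketch (toy data, not part of the forecasting pipeline) that convolves a 5×5 "image" with a 3×3 kernel using stride 1 and no padding, producing the 3×3 feature map mentioned above:

# Convolve a 5x5 input with a 3x3 kernel (stride 1, no padding) -> 3x3 feature map
img    <- matrix(sample(0:1, 25, replace = TRUE), nrow = 5)   # toy binary "image"
kernel <- matrix(c(1, 0, -1,
                   1, 0, -1,
                   1, 0, -1), nrow = 3, byrow = TRUE)         # vertical-edge filter
feature_map <- matrix(0, nrow = 3, ncol = 3)
for (i in 1:3) {
  for (j in 1:3) {
    patch <- img[i:(i + 2), j:(j + 2)]         # 3x3 patch under the filter
    feature_map[i, j] <- sum(patch * kernel)   # element-wise multiply, then sum
  }
}
feature_map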

Best for: Basically everything visual - MNIST digits, ImageNet classification, object detection systems like YOLO, facial recognition, medical image analysis, even audio spectrograms.


3. LSTMs (Long Short-Term Memory)

LSTMs handle sequences and time-based data. Text, stock prices, sensor readings, music, anything where order matters and past events influence future ones.

The problem they solve: Regular RNNs forget information from early in a sequence because their gradients shrink as they are propagated back through many time steps - the vanishing gradient problem. LSTMs fix this with a gate system:

  • Forget gate - decides what to throw away

  • Input gate - controls new information

  • Output gate - determines what gets passed forward

This allows them to maintain long-term dependencies and remember important information over many time steps.
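In standard notation (shown here only as a reference, with \(\sigma\) the sigmoid function and \(\odot\) element-wise multiplication), the gates combine the current input \(x_t\) and previous hidden state \(h_{t-1}\) as:

\[
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) &&\text{forget gate}\\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) &&\text{input gate}\\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) &&\text{output gate}\\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) &&\text{candidate memory}\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t &&\text{cell state update}\\
h_t &= o_t \odot \tanh(c_t) &&\text{hidden state}
\end{aligned}
\]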

Best for: Time series forecasting, text generation, speech recognition, music composition, sensor data analysis - any problem where past events influence future outcomes.

The downside: They’re slow. You have to process sequences one step at a time, can’t parallelize it, so training takes forever. That’s why Transformers ate their lunch in NLP. Still useful for resource-constrained sequential problems.

This diagram shows how an LSTM processes a sequence one timestep at a time. Each green block represents an LSTM cell that receives the current input \(x_t\) and the previous hidden state \(h_{t-1}\). Inside the expanded middle cell, the forget, input, and output gates work together to decide what information to forget, what new information to store, and what to output. The cell state (horizontal line) carries long-term memory forward, while the hidden state (vertical arrows) passes short-term information to the next timestep. This structure allows LSTMs to capture long-range dependencies in sequential data without suffering from vanishing gradients.

4. Transformers

Transformers run modern AI. GPT, BERT, ChatGPT, Vision Transformers - all the big models now use this architecture.

The key innovation: Self-attention. Instead of reading text word by word, every word can directly look at every other word in the sentence. The model learns which words matter and how they relate. There’s an attention equation that computes relevance scores between all word pairs.
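That equation is the scaled dot-product attention from "Attention is All You Need", where \(Q\), \(K\), and \(V\) are the query, key, and value matrices computed from the input and \(d_k\) is the key dimension:

\[
\text{Attention}(Q, K, V) = \operatorname{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V
\]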

Why they dominate:

  1. Parallelization - Process entire sequences at once on GPUs (unlike sequential LSTMs)
  2. Long-range dependencies - Better at understanding relationships between distant elements
  3. Scalability - More parameters + more data = better results
  4. Versatility - Work across different data types

Best for: Natural Language Processing (translation, summarization, question-answering), Vision Tasks (Vision Transformers/ViT), speech recognition (Whisper), and multimodal stuff (GPT-4 combining text and images).

Limitations: High computational cost and large memory requirements. Can be overkill for simpler problems.


Evolution Timeline

The Historical Arc:

  1. 1980s-1990s - Feedforward networks dominate
  2. 1997 - LSTMs enable sequence modeling
  3. 1998 - CNNs revolutionize computer vision (LeNet)
  4. 2017 - Transformers introduced (“Attention is All You Need”)
  5. 2020s - Transformers dominate most tasks

Current State:

  • Transformers are the default choice for most new research
  • CNNs still excel at pure vision tasks (faster, more efficient than Vision Transformers for some applications)
  • LSTMs remain useful for resource-constrained sequential problems
  • Feedforward networks are the baseline for tabular data

Key Principles
  1. Match architecture to problem structure - Don’t use a CNN for tabular data or an MLP for images
  2. Transformers are powerful but expensive - Great results, but require significant compute
  3. Start simple - Try a feedforward network first for tabular data, then add complexity if needed
  4. CNNs understand space, LSTMs understand time, Transformers understand everything
  5. Modern AI = Mostly Transformers - If you’re learning one architecture deeply, make it Transformers

Practical Implementation: Time Series Forecasting

The remainder of this document demonstrates how to implement and compare all these approaches on real unemployment data from FRED.

Data Preparation

library(fredr)
library(dplyr)
library(lubridate)
library(slider)
set.seed(123)

Get FRED Data

# Set FRED API key
key <- fredr_set_key("8a9ec1330374c1696f05cc8e526233b5")

series_id <- "UNRATE"  # Unemployment Rate

start_date <- floor_date(Sys.Date() %m-% years(5), unit = "month")

raw_data <- fredr(
  series_id = series_id,
  observation_start = start_date,
  frequency = "m"
)

# Keep only date and value, rename for convenience
data <- raw_data %>%
  select(date, value) %>%
  arrange(date)

Feature Engineering

Creating informative features for machine learning models: Raw time series data is just a sequence of values and dates. Feature engineering transforms this into a rich set of predictive variables that machine learning models can learn from. We create lag features (previous values like lag_1), rolling statistics (3-month moving average), differences (change from previous period), and time-based features (year, month as categorical). These features are essential for tree-based models (Random Forest, XGBoost) because trees can’t inherently understand temporal order - they need explicit features that encode historical patterns and trends. Without these engineered features, tree models would just see a single number with no context. Neural networks (RNN, LSTM) can theoretically learn temporal patterns from raw sequences, but engineered features often help them too by providing explicit signals about seasonality and trends. Statistical models (ETS, ARIMA) don’t use these features at all - they model the raw time series structure directly.

# Create time-based features
data <- data %>%
  mutate(
    year = year(date),
    month = month(date),
    month_factor = factor(month, levels = 1:12, labels = month.abb)
  )

# Create lag and rolling features
data <- data %>%
  mutate(
    lag_1 = dplyr::lag(value, 1),
    rolling_mean_3 = slide_dbl(
      value,
      mean,
      .before = 2,
      .complete = TRUE
    ),
    diff_1 = value - lag_1
  )

Handle Missing Values

# Get first non-missing 3-month rolling mean
first_roll_mean <- data %>%
  filter(!is.na(rolling_mean_3)) %>%
  dplyr::slice(1) %>%
  pull(rolling_mean_3)

data_ml <- data %>%
  mutate(
    rolling_mean_3 = if_else(
      is.na(rolling_mean_3),
      first_roll_mean,
      rolling_mean_3
    ),
    lag_1 = if_else(
      is.na(lag_1),
      value,
      lag_1
    ),
    diff_1 = if_else(
      is.na(diff_1),
      0,
      diff_1
    )
  )

Train-Test Split

# Get unique years
years_in_data <- sort(unique(year(data_ml$date)))
years_in_data
[1] 2020 2021 2022 2023 2024 2025
# Last year = test, previous 4 = train
last_year <- max(years_in_data)
train_years <- (last_year - 4):(last_year - 1)

train_data <- data_ml %>% filter(year(date) %in% train_years)
test_data <- data_ml %>% filter(year(date) == last_year)

Scale Features

Why scaling is required: Feature scaling standardizes all variables to the same range, which is critical for models that use distance metrics or gradient-based optimization. Neural networks (RNN, LSTM, CNN-LSTM, Transformer) are extremely sensitive to feature scale because they use gradient descent - unscaled features cause exploding/vanishing gradients and slow convergence. Machine learning models like Random Forest and XGBoost are tree-based and don’t require scaling since they split on feature values directly. However, we scale here because we’ll use the same features across all models. Statistical models (ETS, ARIMA) operate on the raw time series and don’t use these engineered features at all.

# Choose numeric feature columns
num_cols <- c("value", "lag_1", "rolling_mean_3", "diff_1")

# Fit scaler on training data
train_means <- sapply(train_data[num_cols], mean, na.rm = TRUE)
train_sds <- sapply(train_data[num_cols], sd, na.rm = TRUE)

scale_with_params <- function(x, mean_vec, sd_vec) {
  as.data.frame(scale(x, center = mean_vec, scale = sd_vec))
}

train_scaled <- scale_with_params(train_data[num_cols], train_means, train_sds)
test_scaled <- scale_with_params(test_data[num_cols], train_means, train_sds)

# Reattach scaled features
train_final <- train_data %>%
  select(date, year, month, month_factor) %>%
  bind_cols(train_scaled)

test_final <- test_data %>%
  select(date, year, month, month_factor) %>%
  bind_cols(test_scaled)

# Get mean and sd for value only
value_mean <- train_means["value"]
value_sd <- train_sds["value"]

inv_scale <- function(z) z * value_sd + value_mean

Machine Learning Models

Tree-based models like Random Forest and XGBoost work by recursively splitting data based on feature values. They learn non-linear relationships naturally without needing feature scaling. Random Forest builds many independent decision trees and averages their predictions, which reduces overfitting through ensemble voting. XGBoost (Extreme Gradient Boosting) builds trees sequentially, where each new tree corrects errors made by previous trees. XGBoost is generally more powerful but requires careful tuning, while Random Forest is more robust out-of-the-box. Both handle tabular data extremely well and can capture complex interactions between features like lag values, rolling averages, and seasonal patterns.

Random Forest

Ensemble learning through bagging: Random Forest improves upon single decision trees by using bootstrap aggregating (bagging). A single decision tree is fast and interpretable but tends to overfit - it memorizes training data by growing deep and learning noise. Bagging addresses this by training many trees on different random subsets of the data (bootstrap samples) and averaging their predictions. Random Forest adds an extra layer of randomization: at each split, it only considers a random subset of features (mtry = 2 here means each split looks at 2 random features). This decorrelates the trees, preventing them from all making the same mistakes. The result is a more robust ensemble where individual tree errors cancel out, giving better generalization than any single tree. With 500 trees, we get stable predictions without the overfitting problem of deep individual trees.

library(ranger)

y_test <- inv_scale(test_final$value)
rf_formula <- value ~ lag_1 + rolling_mean_3 + diff_1 + month_factor

set.seed(123)
rf_model <- ranger(
  formula = rf_formula,
  data = train_final,
  num.trees = 500,
  mtry = 2,
  min.node.size = 3,
  importance = "impurity"
)

# Predictions on test set
rf_pred <- predict(rf_model, data = test_final)$predictions
rf_pred <- inv_scale(rf_pred)

# Simple accuracy metrics
rmse <- function(actual, pred) sqrt(mean((actual - pred)^2))
mae <- function(actual, pred) mean(abs(actual - pred))

rf_rmse <- rmse(y_test, rf_pred)
rf_mae <- mae(y_test, rf_pred)

cat("RF RMSE:", rf_rmse, "\n")
RF RMSE: 0.1186575 
cat("RF MAE:", rf_mae, "\n")
RF MAE: 0.09785185 
# Variable importance
rf_model$variable.importance
         lag_1 rolling_mean_3         diff_1   month_factor 
    20.4529830     20.3979637      3.8150090      0.9366323 

XGBoost

Boosting: Learning from mistakes sequentially: Unlike Random Forest’s bagging approach where trees are built independently and in parallel, XGBoost uses boosting - it builds trees sequentially where each new tree focuses on correcting the errors of all previous trees. The first tree makes predictions, then the second tree learns to predict the residuals (errors) from the first tree, the third tree learns from the combined errors, and so on. This “error-correction” approach makes boosting more powerful than bagging but also more prone to overfitting if not carefully tuned. Key parameters control this: eta (learning rate) determines how much each tree contributes (0.05 is conservative, preventing overfitting), max_depth = 3 keeps trees shallow (reducing complexity), and subsample/colsample_bytree = 0.8 randomly sample 80% of data/features per tree (adding regularization). The result is typically higher accuracy than Random Forest, but requires more careful hyperparameter tuning. We train for 300 rounds, allowing the model to progressively refine its predictions.

library(xgboost)

# Build model matrices
X_train <- model.matrix(value ~ . - date, data = train_final)[, -1]
X_test <- model.matrix(value ~ . - date, data = test_final)[, -1]

y_train_scaled <- train_final$value
y_test_orig <- inv_scale(test_final$value)

dtrain <- xgb.DMatrix(data = X_train, label = y_train_scaled)
dtest <- xgb.DMatrix(data = X_test)

# XGBoost parameters
params <- list(
  objective = "reg:squarederror",
  eta = 0.05,
  max_depth = 3,
  subsample = 0.8,
  colsample_bytree = 0.8,
  eval_metric = "rmse"
)

set.seed(123)
xgb_model <- xgb.train(
  params = params,
  data = dtrain,
  nrounds = 300,
  watchlist = list(train = dtrain),
  verbose = 0
)

# Predict and inverse-transform
xgb_pred_scaled <- predict(xgb_model, dtest)
xgb_pred <- inv_scale(xgb_pred_scaled)

xgb_rmse <- rmse(y_test_orig, xgb_pred)
xgb_mae <- mae(y_test_orig, xgb_pred)

cat("XGB RMSE:", xgb_rmse, "\n")
XGB RMSE: 0.08563791 
cat("XGB MAE:", xgb_mae, "\n")
XGB MAE: 0.06461507 
# Feature importance
xgb_importance <- xgb.importance(model = xgb_model, feature_names = colnames(X_train))
print(xgb_importance)
            Feature         Gain       Cover   Frequency
             <char>        <num>       <num>       <num>
 1:           lag_1 6.462762e-01 0.327384523 0.268308921
 2:  rolling_mean_3 2.839651e-01 0.209088182 0.160452730
 3:            year 4.249634e-02 0.071895621 0.099201065
 4:           month 2.028056e-02 0.132453509 0.180426099
 5:          diff_1 3.906472e-03 0.136682663 0.130492676
 6: month_factorFeb 1.217013e-03 0.005758848 0.011984021
 7: month_factorAug 5.942214e-04 0.011757648 0.013981358
 8: month_factorSep 3.720197e-04 0.010437912 0.014647137
 9: month_factorOct 2.996390e-04 0.033923215 0.029294274
10: month_factorMay 2.193197e-04 0.018266347 0.030625832
11: month_factorJul 1.472233e-04 0.003299340 0.006657790
12: month_factorNov 5.587807e-05 0.006658668 0.010652463
13: month_factorDec 5.514394e-05 0.008968206 0.008655126
14: month_factorApr 5.150292e-05 0.007618476 0.010652463
15: month_factorJun 4.870558e-05 0.007798440 0.010652463
16: month_factorMar 1.459596e-05 0.008008398 0.013315579

Model Comparison Metrics

# Comprehensive accuracy functions
mape <- function(actual, pred) mean(abs((actual - pred) / actual)) * 100
mpe <- function(actual, pred) mean((actual - pred) / actual) * 100
me <- function(actual, pred) mean(actual - pred)

mase <- function(actual, pred, train_actual) {
  naive_error <- mean(abs(diff(train_actual)))
  mean(abs(actual - pred)) / naive_error
}

aicc <- function(actual, pred, k) {
  n <- length(actual)
  rss <- sum((actual - pred)^2)
  sigma2 <- rss / n
  ll <- -n/2 * (log(2 * pi * sigma2) + 1)
  aic <- -2 * ll + 2 * k
  aic + (2 * k * (k + 1)) / (n - k - 1)
}

bic <- function(actual, pred, k) {
  n <- length(actual)
  rss <- sum((actual - pred)^2)
  sigma2 <- rss / n
  ll <- -n/2 * (log(2 * pi * sigma2) + 1)
  -2 * ll + log(n) * k
}

# Original-scale targets
y_train_orig <- inv_scale(train_final$value)
y_test_orig <- inv_scale(test_final$value)

# Effective parameter counts
k_rf <- rf_model$num.independent.variables
k_xgb <- length(xgb_model$feature_names)

# Random Forest metrics
rf_metrics <- data.frame(
  Model = "Random Forest",
  RMSE = rmse(y_test_orig, rf_pred),
  MAE = mae(y_test_orig, rf_pred),
  MAPE = mape(y_test_orig, rf_pred),
  MPE = mpe(y_test_orig, rf_pred),
  ME = me(y_test_orig, rf_pred),
  MASE = mase(y_test_orig, rf_pred, y_train_orig),
  AICc = aicc(y_test_orig, rf_pred, k_rf),
  BIC = bic(y_test_orig, rf_pred, k_rf)
)

# XGBoost metrics
xgb_metrics <- data.frame(
  Model = "XGBoost",
  RMSE = rmse(y_test_orig, xgb_pred),
  MAE = mae(y_test_orig, xgb_pred),
  MAPE = mape(y_test_orig, xgb_pred),
  MPE = mpe(y_test_orig, xgb_pred),
  ME = me(y_test_orig, xgb_pred),
  MASE = mase(y_test_orig, xgb_pred, y_train_orig),
  AICc = aicc(y_test_orig, xgb_pred, k_xgb),
  BIC = bic(y_test_orig, xgb_pred, k_xgb)
)

comparison_full <- rbind(rf_metrics, xgb_metrics)
comparison_full %>% arrange(RMSE)
          Model       RMSE        MAE     MAPE       MPE         ME      MASE
1       XGBoost 0.08563791 0.06461507 1.523994 0.7790242 0.03461073 0.5147302
2 Random Forest 0.11865750 0.09785185 2.303982 1.7325004 0.07480741 0.7794978
       AICc       BIC
1 -54.69640 16.459198
2   5.17364 -4.037462

Statistical Models (ETS/ARIMA)

Classical time series methods: Unlike machine learning models that use engineered features, statistical models work directly on the raw time series values and model the underlying data-generating process.

  1. ETS (Error, Trend, Seasonality) decomposes the series into these three components and models them separately - it’s excellent for data with clear seasonal patterns.

  2. ARIMA (AutoRegressive Integrated Moving Average) models the series as a function of its own past values (AR), past forecast errors (MA), and differencing to make it stationary (I).

  3. TSLM is time series linear regression that can include trend and seasonal dummy variables.

  4. The baseline methods (MEAN, NAIVE, SNAIVE) provide simple benchmarks: MEAN uses the average of training data, NAIVE uses the last observed value, and SNAIVE (Seasonal Naive) uses the value from the same season last year.

These statistical methods are interpretable, require less data than deep learning, and often perform surprisingly well on simple time series, making them essential benchmarks.

library(fpp3)

# Make a monthly tsibble
fred_ts <- data %>%
  mutate(Month = yearmonth(date)) %>%
  as_tsibble(index = Month) %>%
  arrange(Month)

# Train/test split
years_in_data <- fred_ts %>%
  mutate(yr = year(Month)) %>%
  distinct(yr) %>%
  arrange(yr) %>%
  pull(yr)

last_year <- max(years_in_data)
train_years <- (last_year - 4):(last_year - 1)

train_ts <- fred_ts %>% filter(year(Month) %in% train_years)
test_ts <- fred_ts %>% filter(year(Month) == last_year)

h <- nrow(test_ts)

# Fit multiple models
fit <- train_ts %>%
  model(
    ETS = ETS(value),
    ARIMA = ARIMA(value),
    TSLM = TSLM(value),
    MEAN = MEAN(value),
    NAIVE = NAIVE(value),
    SNAIVE = SNAIVE(value)
  )

# Forecast
fc <- fit %>%
  forecast(h = h)

# Accuracy metrics
acc <- fc %>%
  accuracy(test_ts) %>%
  select(.model, RMSE, MAE, MAPE, MPE, ME, MASE)

# Information criteria
info_raw <- fit %>%
  glance() %>%
  as_tibble()

if ("AICc" %in% names(info_raw)) {
  info <- info_raw %>%
    transmute(.model, AICc = AICc, BIC = BIC)
} else {
  info <- info_raw %>%
    transmute(.model, AICc = AIC, BIC = BIC)
}

# Merge metrics
ets_arima_metrics <- acc %>%
  left_join(info, by = ".model") %>%
  mutate(
    Model = case_when(
      .model == "ETS" ~ "ETS (fpp3)",
      .model == "ARIMA" ~ "ARIMA (fpp3)",
      .model == "TSLM" ~ "TSLM (fpp3)",
      .model == "MEAN" ~ "MEAN (fpp3)",
      .model == "NAIVE" ~ "NAIVE (fpp3)",
      .model == "SNAIVE" ~ "SNAIVE(fpp3)",
      TRUE ~ .model
    )
  ) %>%
  select(Model, RMSE, MAE, MAPE, MPE, ME, MASE, AICc, BIC)

ets_arima_metrics
# A tibble: 6 × 9
  Model         RMSE    MAE  MAPE   MPE     ME  MASE    AICc   BIC
  <chr>        <dbl>  <dbl> <dbl> <dbl>  <dbl> <dbl>   <dbl> <dbl>
1 ARIMA (fpp3) 0.136 0.109   2.56 1.64  0.0716   NaN -43.8   -38.8
2 ETS (fpp3)   0.158 0.127   2.99 2.35  0.101    NaN   0.927  10.1
3 MEAN (fpp3)  0.112 0.0889  2.11 0.462 0.0222   NaN  NA      NA  
4 NAIVE (fpp3) 0.141 0.111   2.61 2.05  0.0889   NaN  NA      NA  
5 SNAIVE(fpp3) 0.224 0.189   4.51 4.51  0.189    NaN  NA      NA  
6 TSLM (fpp3)  0.112 0.0889  2.11 0.462 0.0222   NaN -14.0   -10.5
# Add to comparison
comparison_full <- bind_rows(comparison_full, ets_arima_metrics)
comparison_full %>% arrange(RMSE)
          Model       RMSE        MAE     MAPE       MPE         ME      MASE
1       XGBoost 0.08563791 0.06461507 1.523994 0.7790242 0.03461073 0.5147302
2   TSLM (fpp3) 0.11221672 0.08888889 2.110790 0.4621899 0.02222222       NaN
3   MEAN (fpp3) 0.11221672 0.08888889 2.110790 0.4621899 0.02222222       NaN
4 Random Forest 0.11865750 0.09785185 2.303982 1.7325004 0.07480741 0.7794978
5  ARIMA (fpp3) 0.13613813 0.10880935 2.563920 1.6369887 0.07155127       NaN
6  NAIVE (fpp3) 0.14142136 0.11111111 2.610350 2.0547949 0.08888889       NaN
7    ETS (fpp3) 0.15769849 0.12726794 2.993277 2.3494798 0.10149945       NaN
8  SNAIVE(fpp3) 0.22360680 0.18888889 4.507715 4.5077146 0.18888889       NaN
        AICc        BIC
1 -54.696396  16.459198
2 -14.000658 -10.524923
3         NA         NA
4   5.173640  -4.037462
5 -43.755009 -38.840513
6         NA         NA
7   0.926872  10.105298
8         NA         NA

Deep Learning Models

What makes it “deep”? Deep learning refers to neural networks with multiple hidden layers (typically 3+), allowing them to learn hierarchical representations of data. Shallow networks with 1-2 layers can only learn simple patterns, but deep networks progressively extract more abstract features at each layer.

In image recognition, early layers detect edges, middle layers recognize shapes, and deep layers identify objects.

For time series, early layers capture short-term fluctuations while deeper layers learn long-term trends and seasonal patterns. The “learning” happens through backpropagation - the network computes prediction errors and adjusts millions of weights across all layers to minimize those errors. This requires large amounts of data (thousands to millions of samples) and significant computational power (GPUs), which is why deep learning only became practical in the 2010s.
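The weight adjustment in backpropagation is, at its core, a gradient-descent update: each parameter \(\theta\) is moved a small step against the gradient of the loss \(L\), with the step size set by the learning rate \(\eta\):

\[
\theta \leftarrow \theta - \eta \, \nabla_{\theta} L(\theta)
\]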

The models below (RNN, LSTM, CNN-LSTM, Transformer) are all “deep” because they stack multiple processing layers that transform raw sequences into increasingly sophisticated representations before making predictions.

Data Preparation for Neural Networks

Deep learning libraries explained: We use reticulate to interface between R and Python, since most deep learning frameworks are built in Python.

The keras3 library provides an R interface to Keras (a high-level neural networks API) that runs on top of TensorFlow. Keras makes building neural networks intuitive - you can stack layers like LEGO blocks rather than implementing backpropagation from scratch. It handles all the complex tensor operations, GPU acceleration, and optimization under the hood, letting us focus on architecture design.

Important
ML wouldn’t be a reality without the right tools. Enter Keras and PyTorch - powerful frameworks that simplify building and deploying intelligent models. Keras vs PyTorch: Which ML Framework is the Best for You?

READ - 10 minutes

Note

Keras vs PyTorch: The two dominant deep learning frameworks are Keras (TensorFlow backend) and PyTorch.

  • Keras prioritizes simplicity and quick prototyping - its high-level API with Sequential() and pre-built layers makes it ideal for standard architectures and beginners. PyTorch offers more flexibility and control with dynamic computational graphs, making it preferred for research and custom architectures, but requires more boilerplate code.

  • For production deployment, TensorFlow/Keras has better tooling. For experimentation and debugging, PyTorch’s eager execution is more intuitive.

Both can do the same things ultimately - Keras is “easy to learn, harder to customize” while PyTorch is “steeper learning curve, more flexible.” We use Keras here because its sequential API clearly demonstrates neural network concepts without excessive syntax.

  • Origin: Keras was created by François Chollet and is now integrated into Google’s TensorFlow ecosystem; PyTorch was developed at Facebook AI Research (now Meta AI).

Caution

Setting random seeds for reproducibility: Neural networks initialize weights randomly, and training involves stochastic processes (random batch sampling, dropout). To get reproducible results, we set multiple random seeds: np$random$seed(123L) controls NumPy’s random number generator (used for array operations), random$seed(123L) controls Python’s built-in random module, keras3::set_random_seed(123) sets TensorFlow’s seed, and PYTHONHASHSEED = "123" controls hash randomization. We need all four because different parts of the deep learning stack use different random number generators. Without setting these seeds, you’d get slightly different results each time you train the model, making it impossible to compare results or debug issues.

library(reticulate)
library(keras3)

# Set seeds
np <- import("numpy", convert = FALSE)
random <- import("random", convert = FALSE)

np$random$seed(123L)
None
random$seed(123L)
None
keras3::set_random_seed(123)
Sys.setenv("PYTHONHASHSEED" = "123")

# Build supervised sequences
all_scaled <- bind_rows(train_final, test_final)

make_sequences <- function(x, lag = 12) {
  x <- as.numeric(x)
  n <- length(x)
  n_samples <- n - lag
  
  X <- array(NA_real_, dim = c(n_samples, lag, 1))
  y <- numeric(n_samples)
  
  for (i in seq_len(n_samples)) {
    X[i, , 1] <- x[i:(i + lag - 1)]
    y[i] <- x[i + lag]
  }
  list(X = X, y = y)
}

lag <- 12

train_seq <- make_sequences(train_final$value, lag)
full_seq <- make_sequences(all_scaled$value, lag)
n_test <- nrow(test_final)

X_train <- train_seq$X
y_train_scaled <- train_seq$y

X_test <- tail(full_seq$X, n_test)
y_test_scaled <- tail(full_seq$y, n_test)

y_train_orig <- inv_scale(y_train_scaled)
y_test_orig <- inv_scale(y_test_scaled)

RNN Model

Simple RNN architecture: This is the most basic recurrent neural network. It processes sequences one step at a time, maintaining a hidden state that gets updated at each time step. The hidden state acts as a “memory” that carries information from previous time steps forward. However, simple RNNs suffer from the vanishing gradient problem - they struggle to learn long-term dependencies because gradients get smaller as they propagate back through time. For this reason, simple RNNs work best on short sequences or as a baseline comparison. Here we use 16 hidden units with a lookback window of 12 months (one year), followed by a dense layer that outputs the prediction.
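For reference, the simple RNN’s hidden-state update is a single recurrence (the Keras layer learns the weight matrices \(W_x\), \(W_h\) and bias \(b\); the final dense layer then maps the last hidden state to the forecast):

\[
h_t = \tanh(W_x x_t + W_h h_{t-1} + b)
\]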

Model architecture breakdown:

  1. keras_model_sequential() creates a linear stack of layers where data flows from input to output.

  2. The pipe operator |> chains layers together.

  3. layer_simple_rnn(units = 16, input_shape = c(lag, 1)) is the RNN layer with 16 hidden units that expects sequences of length lag (12 months) with 1 feature per time step. It processes the sequence and outputs a 16-dimensional vector summarizing what it learned.

  4. layer_dense(units = 1) is the output layer that takes those 16 values and produces a single prediction (next month’s unemployment rate).

Compilation settings:

  1. compile() configures the learning process.

  2. loss = "mse" (mean squared error) measures how far predictions are from actual values - we minimize this during training.

  3. optimizer = optimizer_adam(learning_rate = 0.01) specifies the algorithm for updating weights. Adam is an adaptive optimizer that adjusts learning rates automatically (better than basic gradient descent).

  4. The learning_rate = 0.01 controls step size - too high causes instability, too low makes training slow. 0.01 is a moderate, safe choice for RNNs.

Training parameters explained:

  1. epochs = 50 means the model sees the entire training dataset 50 times - each pass allows it to learn patterns better, though too many epochs can cause overfitting.

  2. batch_size = 8 means we update weights after processing 8 samples at a time rather than one-by-one (faster, more stable gradients) or all at once (memory intensive). Smaller batches add noise that can help escape local minima.

  3. verbose = 0 suppresses training progress output to keep the document clean - set to 1 to see epoch-by-epoch loss values during training.

rnn_model <- keras_model_sequential() |>
  layer_simple_rnn(units = 16, input_shape = c(lag, 1)) |>
  layer_dense(units = 1)

rnn_model |>
  compile(
    loss = "mse",
    optimizer = optimizer_adam(learning_rate = 0.01)
  )

history_rnn <- rnn_model |>
  fit(
    X_train, y_train_scaled,
    epochs = 50,
    batch_size = 8,
    verbose = 0
  )

rnn_pred_scaled <- rnn_model |>
  predict(X_test) |>
  as.numeric()
1/1 - 0s - 37ms/step
rnn_pred <- inv_scale(rnn_pred_scaled)

LSTM Model

LSTM architecture improvement: Long Short-Term Memory networks solve the vanishing gradient problem that plagues simple RNNs. They use a sophisticated gating mechanism with three gates: a forget gate (decides what to discard from memory), an input gate (controls what new information to store), and an output gate (determines what to output). This gate system allows LSTMs to selectively remember or forget information over long sequences, making them much better at capturing long-term dependencies. Here we use 32 LSTM units (more than the RNN since LSTMs are more complex) with the same 12-month lookback. The additional parameters and gating mechanisms make LSTMs slower to train but significantly more accurate for time series with long-range patterns.

lstm_model <- keras_model_sequential() |>
  layer_lstm(units = 32, input_shape = c(lag, 1)) |>
  layer_dense(units = 1)

lstm_model |>
  compile(
    loss = "mse",
    optimizer = optimizer_adam(learning_rate = 0.005)
  )

history_lstm <- lstm_model |>
  fit(
    X_train, y_train_scaled,
    epochs = 60,
    batch_size = 8,
    verbose = 0
  )

lstm_pred_scaled <- lstm_model |>
  predict(X_test) |>
  as.numeric()
1/1 - 0s - 47ms/step
lstm_pred <- inv_scale(lstm_pred_scaled)

CNN-LSTM Model

Hybrid architecture combining spatial and temporal learning: This model stacks a CNN layer before the LSTM to create a powerful hybrid. The convolutional layer (with 16 filters and kernel size 3) first scans the input sequence looking for local patterns and short-term trends - think of it as detecting “motifs” or repeating patterns in the time series data. The max pooling layer then downsamples this information, keeping only the most important features while reducing dimensionality. The LSTM layer receives these extracted features and models the long-term temporal dependencies. This combination often outperforms pure LSTMs because the CNN handles local pattern detection efficiently, while the LSTM focuses on learning how these patterns evolve over time. It’s particularly effective when your time series has both short-term recurring patterns and long-term trends.

cnn_lstm_model <- keras_model_sequential() |>
  layer_conv_1d(
    filters = 16,
    kernel_size = 3,
    activation = "relu",
    padding = "causal",
    input_shape = c(lag, 1)
  ) |>
  layer_max_pooling_1d(pool_size = 2) |>
  layer_lstm(units = 32) |>
  layer_dense(units = 1)

cnn_lstm_model |>
  compile(
    loss = "mse",
    optimizer = optimizer_adam(learning_rate = 0.003)
  )

history_cnn_lstm <- cnn_lstm_model |>
  fit(
    X_train, y_train_scaled,
    epochs = 60,
    batch_size = 8,
    verbose = 0
  )

cnn_lstm_pred_scaled <- cnn_lstm_model |>
  predict(X_test) |>
  as.numeric()
1/1 - 0s - 53ms/step
cnn_lstm_pred <- inv_scale(cnn_lstm_pred_scaled)

Neural Network Metrics

nn_metrics <- function(model_name, actual, pred, train_actual) {
  mae <- mean(abs(actual - pred))
  rmse <- sqrt(mean((actual - pred)^2))
  mape <- mean(abs((actual - pred) / actual)) * 100
  mpe <- mean((actual - pred) / actual) * 100
  me <- mean(actual - pred)
  mase <- mae / mean(abs(diff(train_actual)))
  
  data.frame(
    Model = model_name,
    RMSE = rmse,
    MAE = mae,
    MAPE = mape,
    MPE = mpe,
    ME = me,
    MASE = mase,
    AICc = NA_real_,
    BIC = NA_real_
  )
}

rnn_metrics <- nn_metrics("RNN (keras3)", y_test_orig, rnn_pred, y_train_orig)
lstm_metrics <- nn_metrics("LSTM (keras3)", y_test_orig, lstm_pred, y_train_orig)
cnn_lstm_metrics <- nn_metrics("CNN_LSTM (keras3)", y_test_orig, cnn_lstm_pred, y_train_orig)

nn_results <- bind_rows(rnn_metrics, lstm_metrics, cnn_lstm_metrics)
nn_results
              Model      RMSE       MAE     MAPE      MPE         ME     MASE
1      RNN (keras3) 0.2096823 0.1864638 4.402199 4.109723 0.17476474 2.105236
2     LSTM (keras3) 0.1278936 0.1087412 2.573026 1.540689 0.06728356 1.227723
3 CNN_LSTM (keras3) 0.1329619 0.1130907 2.684733 1.051413 0.04723354 1.276831
  AICc BIC
1   NA  NA
2   NA  NA
3   NA  NA
# Add to comparison
comparison_full <- bind_rows(comparison_full, nn_results)
comparison_full %>% arrange(RMSE)
               Model       RMSE        MAE     MAPE       MPE         ME
1            XGBoost 0.08563791 0.06461507 1.523994 0.7790242 0.03461073
2        TSLM (fpp3) 0.11221672 0.08888889 2.110790 0.4621899 0.02222222
3        MEAN (fpp3) 0.11221672 0.08888889 2.110790 0.4621899 0.02222222
4      Random Forest 0.11865750 0.09785185 2.303982 1.7325004 0.07480741
5      LSTM (keras3) 0.12789355 0.10874117 2.573026 1.5406892 0.06728356
6  CNN_LSTM (keras3) 0.13296194 0.11309071 2.684733 1.0514133 0.04723354
7       ARIMA (fpp3) 0.13613813 0.10880935 2.563920 1.6369887 0.07155127
8       NAIVE (fpp3) 0.14142136 0.11111111 2.610350 2.0547949 0.08888889
9         ETS (fpp3) 0.15769849 0.12726794 2.993277 2.3494798 0.10149945
10      RNN (keras3) 0.20968228 0.18646377 4.402199 4.1097235 0.17476474
11      SNAIVE(fpp3) 0.22360680 0.18888889 4.507715 4.5077146 0.18888889
        MASE       AICc        BIC
1  0.5147302 -54.696396  16.459198
2        NaN -14.000658 -10.524923
3        NaN         NA         NA
4  0.7794978   5.173640  -4.037462
5  1.2277229         NA         NA
6  1.2768306         NA         NA
7        NaN -43.755009 -38.840513
8        NaN         NA         NA
9        NaN   0.926872  10.105298
10 2.1052361         NA         NA
11       NaN         NA         NA

Transformer Model

Self-attention for time series: This transformer-style model uses self-attention mechanisms instead of recurrence, allowing it to look at all time steps simultaneously rather than processing sequentially like RNNs/LSTMs.

The architecture first projects the input through a dense layer, then applies a self-attention layer where each time step can “attend to” every other time step, learning which past observations are most relevant for prediction. This attention mechanism computes relevance scores between all pairs of time steps, essentially asking “which past months matter most for predicting the future?” The model then flattens these attention-weighted features and passes them through dense layers for the final prediction.

Transformers can capture complex long-range dependencies that LSTMs might miss, and they parallelize better during training (though this single-sequence forecast doesn’t fully exploit that advantage).

The trade-off is higher computational cost and more parameters to tune.

# Transformer-style self-attention model
transformer_input <- keras_input(shape = c(lag, 1))

x <- transformer_input |>
  layer_dense(units = 16, activation = "relu", name = "proj_dense")

attn_out <- layer_attention(
  list(x, x, x),
  use_scale = TRUE,
  score_mode = "dot",
  name = "self_attention"
)

x_out <- attn_out |>
  layer_flatten(name = "flatten") |>
  layer_dense(units = 16, activation = "relu", name = "post_attn_dense") |>
  layer_dense(units = 1, name = "output")

transformer_model <- keras_model(
  inputs = transformer_input,
  outputs = x_out,
  name = "transformer_style_model"
)

summary(transformer_model)
Model: "transformer_style_model"
┏━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┓
┃ Layer (type)          ┃ Output Shape      ┃     Param # ┃ Connected to       ┃
┡━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━┩
│ input_layer_3         │ (None, 12, 1)     │           0 │ -                  │
│ (InputLayer)          │                   │             │                    │
├───────────────────────┼───────────────────┼─────────────┼────────────────────┤
│ proj_dense (Dense)    │ (None, 12, 16)    │          32 │ input_layer_3[0][… │
├───────────────────────┼───────────────────┼─────────────┼────────────────────┤
│ self_attention        │ (None, 12, 16)    │           1 │ proj_dense[0][0],  │
│ (Attention)           │                   │             │ proj_dense[0][0],  │
│                       │                   │             │ proj_dense[0][0]   │
├───────────────────────┼───────────────────┼─────────────┼────────────────────┤
│ flatten (Flatten)     │ (None, 192)       │           0 │ self_attention[0]… │
├───────────────────────┼───────────────────┼─────────────┼────────────────────┤
│ post_attn_dense       │ (None, 16)        │       3,088 │ flatten[0][0]      │
│ (Dense)               │                   │             │                    │
├───────────────────────┼───────────────────┼─────────────┼────────────────────┤
│ output (Dense)        │ (None, 1)         │          17 │ post_attn_dense[0… │
└───────────────────────┴───────────────────┴─────────────┴────────────────────┘
 Total params: 3,138 (12.26 KB)
 Trainable params: 3,138 (12.26 KB)
 Non-trainable params: 0 (0.00 B)
transformer_model |>
  compile(
    loss = "mse",
    optimizer = optimizer_adam(learning_rate = 0.005)
  )

history_transformer <- transformer_model |>
  fit(
    X_train, y_train_scaled,
    epochs = 60,
    batch_size = 8,
    verbose = 0
  )

transformer_pred_scaled <- transformer_model |>
  predict(X_test) |>
  as.numeric()
1/1 - 0s - 18ms/step
transformer_pred <- inv_scale(transformer_pred_scaled)

transformer_metrics <- nn_metrics(
  model_name = "Transformer (keras3)",
  actual = y_test_orig,
  pred = transformer_pred,
  train_actual = y_train_orig
)

transformer_metrics
                 Model     RMSE        MAE     MAPE       MPE         ME
1 Transformer (keras3) 0.124949 0.09712288 2.301205 0.6669475 0.03103196
      MASE AICc BIC
1 1.096549   NA  NA

Final Model Comparison

library(knitr)

comparison_full <- bind_rows(comparison_full, transformer_metrics)

comparison_full %>%
  arrange(RMSE) %>%
  kable(digits = 4, caption = "Model Performance Comparison (Sorted by RMSE)")
Model Performance Comparison (Sorted by RMSE)

| Model | RMSE | MAE | MAPE | MPE | ME | MASE | AICc | BIC |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| XGBoost | 0.0856 | 0.0646 | 1.5240 | 0.7790 | 0.0346 | 0.5147 | -54.6964 | 16.4592 |
| TSLM (fpp3) | 0.1122 | 0.0889 | 2.1108 | 0.4622 | 0.0222 | NaN | -14.0007 | -10.5249 |
| MEAN (fpp3) | 0.1122 | 0.0889 | 2.1108 | 0.4622 | 0.0222 | NaN | NA | NA |
| Random Forest | 0.1187 | 0.0979 | 2.3040 | 1.7325 | 0.0748 | 0.7795 | 5.1736 | -4.0375 |
| Transformer (keras3) | 0.1249 | 0.0971 | 2.3012 | 0.6669 | 0.0310 | 1.0965 | NA | NA |
| LSTM (keras3) | 0.1279 | 0.1087 | 2.5730 | 1.5407 | 0.0673 | 1.2277 | NA | NA |
| CNN_LSTM (keras3) | 0.1330 | 0.1131 | 2.6847 | 1.0514 | 0.0472 | 1.2768 | NA | NA |
| ARIMA (fpp3) | 0.1361 | 0.1088 | 2.5639 | 1.6370 | 0.0716 | NaN | -43.7550 | -38.8405 |
| NAIVE (fpp3) | 0.1414 | 0.1111 | 2.6104 | 2.0548 | 0.0889 | NaN | NA | NA |
| ETS (fpp3) | 0.1577 | 0.1273 | 2.9933 | 2.3495 | 0.1015 | NaN | 0.9269 | 10.1053 |
| RNN (keras3) | 0.2097 | 0.1865 | 4.4022 | 4.1097 | 0.1748 | 2.1052 | NA | NA |
| SNAIVE(fpp3) | 0.2236 | 0.1889 | 4.5077 | 4.5077 | 0.1889 | NaN | NA | NA |

Summary

This analysis compared 12 different forecasting approaches:

  • Machine Learning: Random Forest, XGBoost
  • Statistical: ETS, ARIMA, TSLM, MEAN, NAIVE, SNAIVE
  • Deep Learning: RNN, LSTM, CNN-LSTM, Transformer

The best performing model based on RMSE was XGBoost with an RMSE of 0.0856.