library(fredr)
library(dplyr)
library(lubridate)
library(slider)
set.seed(123)
Machine Learning for Time Series Forecasting
This document demonstrates various time series forecasting approaches on monthly unemployment rate data from FRED, comparing traditional statistical methods (ETS, ARIMA), machine learning models (Random Forest, XGBoost), and deep learning architectures (RNN, LSTM, CNN-LSTM, Transformer).
Neural Networks: Quick Reference Guide
WATCH: But what is a neural network? | Deep learning chapter 1
There are four major types of neural networks you need to know about, each designed for different problems:
| Model Type | What It Learns | Best For | Main Weakness |
|---|---|---|---|
| Feedforward (MLP) | General combinations of input features (via dense layers) | Tabular data | Ignores spatial/sequential structure |
| CNN | Local spatial patterns | Images, audio | No built-in handling of sequence order |
| LSTM | Temporal patterns | Sequences, time series | Slow to train; largely superseded by Transformers |
| Transformer | Global relationships via attention | NLP, vision, multimodal | High compute cost |
Quick decision rule: Feedforward for tables, CNNs for images, LSTMs for sequences when you need something lightweight, Transformers for everything else (especially language).
Understanding Each Architecture
1. Feedforward Networks (MLPs - Multi-Layer Perceptrons)
This is your basic neural network. Data goes in one end, passes through some hidden layers, comes out the other. Pretty straightforward.
How they work: Each layer has neurons that connect to everything in the next layer. The math is simple - you multiply inputs by weights, add a bias term, then run it through an activation function like ReLU or sigmoid.
Best for: Regular tabular data - spreadsheets, databases, that kind of stuff. Classification, regression, anything where the order or position of features doesn’t really matter.
Limitations: They’re terrible with images and sequences. An MLP just sees pixels as a flat list of numbers. It has no concept that nearby pixels form shapes or that words in a sentence relate to each other. A 28×28 image needs 784 input connections per neuron - very inefficient.
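To make the arithmetic concrete, here is a minimal sketch of a single dense layer in base R: inputs times weights, plus a bias, through ReLU. It is illustrative only (the input and weight values are made up) and is not part of the forecasting pipeline below.
# Minimal sketch of one dense (fully connected) layer: inputs x weights + bias, then ReLU
relu <- function(z) pmax(z, 0)
dense_layer <- function(x, W, b) {
  # x: input vector (n_in), W: weight matrix (n_in x n_out), b: bias vector (n_out)
  relu(as.numeric(x %*% W) + b)
}
set.seed(1)
x <- rnorm(4)                    # 4 input features
W <- matrix(rnorm(4 * 3), 4, 3)  # weights connecting 4 inputs to 3 neurons
b <- rnorm(3)                    # one bias per neuron
dense_layer(x, W, b)             # 3 activations handed to the next layer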
2. CNNs (Convolutional Neural Networks)
CNNs were basically invented for images. The breakthrough was using convolutions - small filters that slide across an image looking for patterns. Early layers detect edges and textures, deeper layers recognize complex shapes and objects.
The genius of CNNs: They understand spatial relationships. They know that neighboring pixels matter more than distant ones. And they use the same filters everywhere, so a face detector works whether the face is top-left or bottom-right.
Further reading on CNNs:
- https://developersbreach.com/convolution-neural-network-deep-learning/ (annotated CNN architecture diagram)
- https://glasswing.vc/blog/ai-atlas-16-convolutional-neural-networks-cnns/ (a similar picture with several key CNN components explained)
- https://saturncloud.io/blog/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way/ (ELI5 guide; watch the animation of convolving a 5x5x1 image with a 3x3x1 kernel to get a 3x3x1 convolved feature)
Efficiency advantage: Compare a regular neural network on a 28×28 image (needs 784 connections per neuron) to CNNs that use tiny 3×3 filters shared across the whole image. Way more efficient.
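To see the sliding-filter idea in code, here is a toy base R convolution of a 5x5 matrix with a 3x3 kernel (no padding, stride 1), mirroring the 5x5x1 / 3x3x1 example linked above. The image and kernel values are made up and this is not part of the forecasting pipeline.
# Toy 2D convolution: slide a 3x3 kernel over a 5x5 "image" to get a 3x3 feature map
conv2d <- function(image, k) {
  out_dim <- nrow(image) - nrow(k) + 1
  out <- matrix(0, out_dim, out_dim)
  for (i in seq_len(out_dim)) {
    for (j in seq_len(out_dim)) {
      patch <- image[i:(i + nrow(k) - 1), j:(j + ncol(k) - 1)]
      out[i, j] <- sum(patch * k)  # element-wise multiply and sum = one filter response
    }
  }
  out
}
img <- matrix(sample(0:1, 25, replace = TRUE), nrow = 5)
kernel <- matrix(c(1, 0, 1,
                   0, 1, 0,
                   1, 0, 1), nrow = 3, byrow = TRUE)
conv2d(img, kernel)  # the same small filter is reused at every position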
Best for: Basically everything visual - MNIST digits, ImageNet classification, object detection systems like YOLO, facial recognition, medical image analysis, even audio spectrograms.
3. LSTMs (Long Short-Term Memory)
LSTMs handle sequences and time-based data. Text, stock prices, sensor readings, music, anything where order matters and past events influence future ones.
The problem they solve: Regular RNNs have this problem where they forget stuff from earlier in the sequence - it’s called vanishing gradients. LSTMs fix this with a gate system:
Forget gate - decides what to throw away
Input gate - controls new information
Output gate - determines what gets passed forward
This allows them to maintain long-term dependencies and remember important information over many time steps.
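For intuition, here is a toy scalar version of one LSTM step showing how the three gates combine. The real keras3 layer uses learned weight matrices over vectors, but the gating logic is the same idea; all weight values below are made up.
# Toy scalar LSTM step (real LSTMs use learned weight matrices, not single numbers)
sigmoid <- function(z) 1 / (1 + exp(-z))
lstm_step <- function(x_t, h_prev, c_prev, w) {
  f <- sigmoid(w$f_x * x_t + w$f_h * h_prev)  # forget gate: what to throw away
  i <- sigmoid(w$i_x * x_t + w$i_h * h_prev)  # input gate: how much new information to store
  g <- tanh(w$g_x * x_t + w$g_h * h_prev)     # candidate update to the cell state
  o <- sigmoid(w$o_x * x_t + w$o_h * h_prev)  # output gate: what gets passed forward
  c_new <- f * c_prev + i * g                 # cell state = long-term memory
  h_new <- o * tanh(c_new)                    # hidden state = what the next step sees
  list(h = h_new, c = c_new)
}
w <- list(f_x = 0.5, f_h = 0.1, i_x = 0.4, i_h = 0.2,
          g_x = 0.3, g_h = 0.3, o_x = 0.6, o_h = 0.1)
lstm_step(x_t = 1.2, h_prev = 0.05, c_prev = 0.1, w)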
Best for: Time series forecasting, text generation, speech recognition, music composition, sensor data analysis - any problem where past events influence future outcomes.
The downside: They’re slow. You have to process sequences one step at a time, can’t parallelize it, so training takes forever. That’s why Transformers ate their lunch in NLP. Still useful for resource-constrained sequential problems.
4. Transformers
Transformers run modern AI. GPT, BERT, ChatGPT, Vision Transformers - all the big models now use this architecture.
The key innovation: Self-attention. Instead of reading text word by word, every word can directly look at every other word in the sentence. The model learns which words matter and how they relate. There’s an attention equation that computes relevance scores between all word pairs.
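That equation is scaled dot-product attention, softmax(QK'/sqrt(d_k))V. Here is a toy base R version for a 4-token sequence with made-up 3-dimensional representations; it is separate from the forecasting code later in the document.
# Toy scaled dot-product attention for 4 tokens with 3-dimensional representations
set.seed(1)
Q <- matrix(rnorm(12), nrow = 4)  # queries: what each token is looking for
K <- matrix(rnorm(12), nrow = 4)  # keys: what each token offers
V <- matrix(rnorm(12), nrow = 4)  # values: the information to be mixed
d_k <- ncol(K)
scores <- Q %*% t(K) / sqrt(d_k)                                  # pairwise relevance scores
weights <- t(apply(scores, 1, function(s) exp(s) / sum(exp(s))))  # softmax: each row sums to 1
attended <- weights %*% V  # each token's output is a weighted mix of every token's value
round(weights, 2)          # the 4x4 attention matrix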
Why they dominate:
- Parallelization - Process entire sequences at once on GPUs (unlike sequential LSTMs)
- Long-range dependencies - Better at understanding relationships between distant elements
- Scalability - More parameters + more data = better results
- Versatility - Work across different data types
Best for: Natural Language Processing (translation, summarization, question-answering), Vision Tasks (Vision Transformers/ViT), speech recognition (Whisper), and multimodal stuff (GPT-4 combining text and images).
Limitations: High computational cost and large memory requirements. Can be overkill for simpler problems.
Evolution Timeline
The Historical Arc:
- 1980s-1990s - Feedforward networks dominate
- 1997 - LSTMs enable sequence modeling
- 1998 - CNNs revolutionize computer vision (LeNet)
- 2017 - Transformers introduced (“Attention is All You Need”)
- 2020s - Transformers dominate most tasks
Current State:
- Transformers are the default choice for most new research
- CNNs still excel at pure vision tasks (faster, more efficient than Vision Transformers for some applications)
- LSTMs remain useful for resource-constrained sequential problems
- Feedforward networks are the baseline for tabular data
Key Takeaways:
- Match architecture to problem structure - Don’t use a CNN for tabular data or an MLP for images
- Transformers are powerful but expensive - Great results, but require significant compute
- Start simple - Try a feedforward network first for tabular data, then add complexity if needed
- CNNs understand space, LSTMs understand time, Transformers understand everything
- Modern AI = Mostly Transformers - If you’re learning one architecture deeply, make it Transformers
Practical Implementation: Time Series Forecasting
The remainder of this document demonstrates how to implement and compare all these approaches on real unemployment data from FRED.
Data Preparation
Get FRED Data
# Set FRED API key (keep the key out of the script, e.g. in an environment variable)
fredr_set_key(Sys.getenv("FRED_API_KEY"))
series_id <- "UNRATE" # Unemployment Rate
start_date <- floor_date(Sys.Date() %m-% years(5), unit = "month")
raw_data <- fredr(
series_id = series_id,
observation_start = start_date,
frequency = "m"
)
# Keep only date and value, rename for convenience
data <- raw_data %>%
select(date, value) %>%
arrange(date)
Feature Engineering
Creating informative features for machine learning models: Raw time series data is just a sequence of values and dates. Feature engineering transforms this into a rich set of predictive variables that machine learning models can learn from. We create lag features (previous values like lag_1), rolling statistics (3-month moving average), differences (change from previous period), and time-based features (year, month as categorical). These features are essential for tree-based models (Random Forest, XGBoost) because trees can’t inherently understand temporal order - they need explicit features that encode historical patterns and trends. Without these engineered features, tree models would just see a single number with no context. Neural networks (RNN, LSTM) can theoretically learn temporal patterns from raw sequences, but engineered features often help them too by providing explicit signals about seasonality and trends. Statistical models (ETS, ARIMA) don’t use these features at all - they model the raw time series structure directly.
# Create time-based features
data <- data %>%
mutate(
year = year(date),
month = month(date),
month_factor = factor(month, levels = 1:12, labels = month.abb)
)
# Create lag and rolling features
data <- data %>%
mutate(
lag_1 = dplyr::lag(value, 1),
rolling_mean_3 = slide_dbl(
value,
mean,
.before = 2,
.complete = TRUE
),
diff_1 = value - lag_1
)
Handle Missing Values
# Get first non-missing 3-month rolling mean
first_roll_mean <- data %>%
filter(!is.na(rolling_mean_3)) %>%
dplyr::slice(1) %>%
pull(rolling_mean_3)
data_ml <- data %>%
mutate(
rolling_mean_3 = if_else(
is.na(rolling_mean_3),
first_roll_mean,
rolling_mean_3
),
lag_1 = if_else(
is.na(lag_1),
value,
lag_1
),
diff_1 = if_else(
is.na(diff_1),
0,
diff_1
)
)
Train-Test Split
# Get unique years
years_in_data <- sort(unique(year(data_ml$date)))
years_in_data
[1] 2020 2021 2022 2023 2024 2025
# Last year = test, previous 4 = train
last_year <- max(years_in_data)
train_years <- (last_year - 4):(last_year - 1)
train_data <- data_ml %>% filter(year(date) %in% train_years)
test_data <- data_ml %>% filter(year(date) == last_year)
Scale Features
Why scaling is required: Feature scaling standardizes all variables to the same range, which is critical for models that use distance metrics or gradient-based optimization. Neural networks (RNN, LSTM, CNN-LSTM, Transformer) are extremely sensitive to feature scale because they use gradient descent - unscaled features cause exploding/vanishing gradients and slow convergence. Machine learning models like Random Forest and XGBoost are tree-based and don’t require scaling since they split on feature values directly. However, we scale here because we’ll use the same features across all models. Statistical models (ETS, ARIMA) operate on the raw time series and don’t use these engineered features at all.
# Choose numeric feature columns
num_cols <- c("value", "lag_1", "rolling_mean_3", "diff_1")
# Fit scaler on training data
train_means <- sapply(train_data[num_cols], mean, na.rm = TRUE)
train_sds <- sapply(train_data[num_cols], sd, na.rm = TRUE)
scale_with_params <- function(x, mean_vec, sd_vec) {
as.data.frame(scale(x, center = mean_vec, scale = sd_vec))
}
train_scaled <- scale_with_params(train_data[num_cols], train_means, train_sds)
test_scaled <- scale_with_params(test_data[num_cols], train_means, train_sds)
# Reattach scaled features
train_final <- train_data %>%
select(date, year, month, month_factor) %>%
bind_cols(train_scaled)
test_final <- test_data %>%
select(date, year, month, month_factor) %>%
bind_cols(test_scaled)
# Get mean and sd for value only
value_mean <- train_means["value"]
value_sd <- train_sds["value"]
inv_scale <- function(z) z * value_sd + value_mean
Machine Learning Models
Tree-based models like Random Forest and XGBoost work by recursively splitting data based on feature values. They learn non-linear relationships naturally without needing feature scaling. Random Forest builds many independent decision trees and averages their predictions, which reduces overfitting through ensemble voting. XGBoost (Extreme Gradient Boosting) builds trees sequentially, where each new tree corrects errors made by previous trees. XGBoost is generally more powerful but requires careful tuning, while Random Forest is more robust out-of-the-box. Both handle tabular data extremely well and can capture complex interactions between features like lag values, rolling averages, and seasonal patterns.
Random Forest
Ensemble learning through bagging: Random Forest improves upon single decision trees by using bootstrap aggregating (bagging). A single decision tree is fast and interpretable but tends to overfit - it memorizes training data by growing deep and learning noise. Bagging addresses this by training many trees on different random subsets of the data (bootstrap samples) and averaging their predictions. Random Forest adds an extra layer of randomization: at each split, it only considers a random subset of features (mtry = 2 here means each split looks at 2 random features). This decorrelates the trees, preventing them from all making the same mistakes. The result is a more robust ensemble where individual tree errors cancel out, giving better generalization than any single tree. With 500 trees, we get stable predictions without the overfitting problem of deep individual trees.
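Before handing the work to ranger, here is a toy bagging sketch that fits a few rpart trees on bootstrap resamples of train_final and averages their predictions. It only illustrates the averaging idea; ranger additionally samples mtry features at each split and is far more efficient. It assumes the rpart package is installed.
# Toy bagging sketch: a few rpart trees on bootstrap resamples, predictions averaged
# (ranger, used below, does this properly and adds per-split feature sampling)
library(rpart)
set.seed(123)
bag_preds <- sapply(1:25, function(i) {
  boot_idx <- sample(nrow(train_final), replace = TRUE)  # bootstrap resample of training rows
  tree <- rpart(value ~ lag_1 + rolling_mean_3 + diff_1,
                data = train_final[boot_idx, ])
  predict(tree, newdata = test_final)                    # each tree predicts the test months
})
bagged_pred <- inv_scale(rowMeans(bag_preds))            # ensemble = average across trees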
library(ranger)
y_test <- inv_scale(test_final$value)
rf_formula <- value ~ lag_1 + rolling_mean_3 + diff_1 + month_factor
set.seed(123)
rf_model <- ranger(
formula = rf_formula,
data = train_final,
num.trees = 500,
mtry = 2,
min.node.size = 3,
importance = "impurity"
)
# Predictions on test set
rf_pred <- predict(rf_model, data = test_final)$predictions
rf_pred <- inv_scale(rf_pred)
# Simple accuracy metrics
rmse <- function(actual, pred) sqrt(mean((actual - pred)^2))
mae <- function(actual, pred) mean(abs(actual - pred))
rf_rmse <- rmse(y_test, rf_pred)
rf_mae <- mae(y_test, rf_pred)
cat("RF RMSE:", rf_rmse, "\n")RF RMSE: 0.1186575
cat("RF MAE:", rf_mae, "\n")RF MAE: 0.09785185
# Variable importance
rf_model$variable.importance
lag_1 rolling_mean_3 diff_1 month_factor
20.4529830 20.3979637 3.8150090 0.9366323
XGBoost
Boosting: Learning from mistakes sequentially: Unlike Random Forest’s bagging approach where trees are built independently and in parallel, XGBoost uses boosting - it builds trees sequentially where each new tree focuses on correcting the errors of all previous trees. The first tree makes predictions, then the second tree learns to predict the residuals (errors) from the first tree, the third tree learns from the combined errors, and so on. This “error-correction” approach makes boosting more powerful than bagging but also more prone to overfitting if not carefully tuned. Key parameters control this: eta (learning rate) determines how much each tree contributes (0.05 is conservative, preventing overfitting), max_depth = 3 keeps trees shallow (reducing complexity), and subsample/colsample_bytree = 0.8 randomly sample 80% of data/features per tree (adding regularization). The result is typically higher accuracy than Random Forest, but requires more careful hyperparameter tuning. We train for 300 rounds, allowing the model to progressively refine its predictions.
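To make the residual-fitting idea tangible, here is a toy two-round boosting sketch with shallow rpart trees. It only illustrates the mechanism, not the xgboost algorithm itself (which adds regularization, second-order gradient information, and column subsampling), and it assumes the rpart package is installed.
# Toy boosting sketch: each shallow tree fits the errors left by the previous round
library(rpart)
eta <- 0.5                              # learning rate: how much each tree contributes
base_pred <- mean(train_final$value)    # round 0: a constant prediction
step1 <- train_final
step1$res <- step1$value - base_pred    # residuals of the constant model
tree1 <- rpart(res ~ lag_1 + rolling_mean_3, data = step1,
               control = rpart.control(maxdepth = 2))
step2 <- step1
step2$res <- step1$res - eta * predict(tree1, step1)  # residuals left after tree 1
tree2 <- rpart(res ~ lag_1 + rolling_mean_3, data = step2,
               control = rpart.control(maxdepth = 2))
boost_toy <- base_pred +
  eta * predict(tree1, test_final) +
  eta * predict(tree2, test_final)      # base prediction plus weighted corrections
inv_scale(boost_toy)                    # back to the original unemployment-rate scale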
library(xgboost)
# Build model matrices
X_train <- model.matrix(value ~ . - date, data = train_final)[, -1]
X_test <- model.matrix(value ~ . - date, data = test_final)[, -1]
y_train_scaled <- train_final$value
y_test_orig <- inv_scale(test_final$value)
dtrain <- xgb.DMatrix(data = X_train, label = y_train_scaled)
dtest <- xgb.DMatrix(data = X_test)
# XGBoost parameters
params <- list(
objective = "reg:squarederror",
eta = 0.05,
max_depth = 3,
subsample = 0.8,
colsample_bytree = 0.8,
eval_metric = "rmse"
)
set.seed(123)
xgb_model <- xgb.train(
params = params,
data = dtrain,
nrounds = 300,
watchlist = list(train = dtrain),
verbose = 0
)
# Predict and inverse-transform
xgb_pred_scaled <- predict(xgb_model, dtest)
xgb_pred <- inv_scale(xgb_pred_scaled)
xgb_rmse <- rmse(y_test_orig, xgb_pred)
xgb_mae <- mae(y_test_orig, xgb_pred)
cat("XGB RMSE:", xgb_rmse, "\n")XGB RMSE: 0.08563791
cat("XGB MAE:", xgb_mae, "\n")XGB MAE: 0.06461507
# Feature importance
xgb_importance <- xgb.importance(model = xgb_model, feature_names = colnames(X_train))
print(xgb_importance)
Feature Gain Cover Frequency
<char> <num> <num> <num>
1: lag_1 6.462762e-01 0.327384523 0.268308921
2: rolling_mean_3 2.839651e-01 0.209088182 0.160452730
3: year 4.249634e-02 0.071895621 0.099201065
4: month 2.028056e-02 0.132453509 0.180426099
5: diff_1 3.906472e-03 0.136682663 0.130492676
6: month_factorFeb 1.217013e-03 0.005758848 0.011984021
7: month_factorAug 5.942214e-04 0.011757648 0.013981358
8: month_factorSep 3.720197e-04 0.010437912 0.014647137
9: month_factorOct 2.996390e-04 0.033923215 0.029294274
10: month_factorMay 2.193197e-04 0.018266347 0.030625832
11: month_factorJul 1.472233e-04 0.003299340 0.006657790
12: month_factorNov 5.587807e-05 0.006658668 0.010652463
13: month_factorDec 5.514394e-05 0.008968206 0.008655126
14: month_factorApr 5.150292e-05 0.007618476 0.010652463
15: month_factorJun 4.870558e-05 0.007798440 0.010652463
16: month_factorMar 1.459596e-05 0.008008398 0.013315579
Model Comparison Metrics
# Comprehensive accuracy functions
mape <- function(actual, pred) mean(abs((actual - pred) / actual)) * 100
mpe <- function(actual, pred) mean((actual - pred) / actual) * 100
me <- function(actual, pred) mean(actual - pred)
mase <- function(actual, pred, train_actual) {
naive_error <- mean(abs(diff(train_actual)))
mean(abs(actual - pred)) / naive_error
}
aicc <- function(actual, pred, k) {
n <- length(actual)
rss <- sum((actual - pred)^2)
sigma2 <- rss / n
ll <- -n/2 * (log(2 * pi * sigma2) + 1)
aic <- -2 * ll + 2 * k
aic + (2 * k * (k + 1)) / (n - k - 1)
}
bic <- function(actual, pred, k) {
n <- length(actual)
rss <- sum((actual - pred)^2)
sigma2 <- rss / n
ll <- -n/2 * (log(2 * pi * sigma2) + 1)
-2 * ll + log(n) * k
}
# Original-scale targets
y_train_orig <- inv_scale(train_final$value)
y_test_orig <- inv_scale(test_final$value)
# Effective parameter counts
k_rf <- rf_model$num.independent.variables
k_xgb <- length(xgb_model$feature_names)
# Random Forest metrics
rf_metrics <- data.frame(
Model = "Random Forest",
RMSE = rmse(y_test_orig, rf_pred),
MAE = mae(y_test_orig, rf_pred),
MAPE = mape(y_test_orig, rf_pred),
MPE = mpe(y_test_orig, rf_pred),
ME = me(y_test_orig, rf_pred),
MASE = mase(y_test_orig, rf_pred, y_train_orig),
AICc = aicc(y_test_orig, rf_pred, k_rf),
BIC = bic(y_test_orig, rf_pred, k_rf)
)
# XGBoost metrics
xgb_metrics <- data.frame(
Model = "XGBoost",
RMSE = rmse(y_test_orig, xgb_pred),
MAE = mae(y_test_orig, xgb_pred),
MAPE = mape(y_test_orig, xgb_pred),
MPE = mpe(y_test_orig, xgb_pred),
ME = me(y_test_orig, xgb_pred),
MASE = mase(y_test_orig, xgb_pred, y_train_orig),
AICc = aicc(y_test_orig, xgb_pred, k_xgb),
BIC = bic(y_test_orig, xgb_pred, k_xgb)
)
comparison_full <- rbind(rf_metrics, xgb_metrics)
comparison_full %>% arrange(RMSE)
Model RMSE MAE MAPE MPE ME MASE
1 XGBoost 0.08563791 0.06461507 1.523994 0.7790242 0.03461073 0.5147302
2 Random Forest 0.11865750 0.09785185 2.303982 1.7325004 0.07480741 0.7794978
AICc BIC
1 -54.69640 16.459198
2 5.17364 -4.037462
Statistical Models (ETS/ARIMA)
Classical time series methods: Unlike machine learning models that use engineered features, statistical models work directly on the raw time series values and model the underlying data-generating process.
ETS (Error, Trend, Seasonality) decomposes the series into these three components and models them separately - it’s excellent for data with clear seasonal patterns.
ARIMA (AutoRegressive Integrated Moving Average) models the series as a function of its own past values (AR), past forecast errors (MA), and differencing to make it stationary (I).
TSLM is time series linear regression that can include trend and seasonal dummy variables.
The baseline methods (MEAN, NAIVE, SNAIVE) provide simple benchmarks: MEAN uses the average of training data, NAIVE uses the last observed value, and SNAIVE (Seasonal Naive) uses the value from the same season last year.
These statistical methods are interpretable, require less data than deep learning, and often perform surprisingly well on simple time series, making them essential benchmarks.
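For intuition, the three baselines can be written by hand in a few lines (fable's MEAN(), NAIVE(), and SNAIVE() used below also produce full forecast distributions). This sketch reuses train_data and test_data from the earlier split and assumes the training period ends in December so that the last 12 training months line up with the test months.
# Hand-rolled baseline forecasts, for intuition only
h_simple  <- nrow(test_data)
mean_fc   <- rep(mean(train_data$value), h_simple)          # MEAN: average of the training data
naive_fc  <- rep(tail(train_data$value, 1), h_simple)       # NAIVE: last observed value carried forward
snaive_fc <- tail(train_data$value, 12)[seq_len(h_simple)]  # SNAIVE: same month one year earlier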
library(fpp3)
# Make a monthly tsibble
fred_ts <- data %>%
mutate(Month = yearmonth(date)) %>%
as_tsibble(index = Month) %>%
arrange(Month)
# Train/test split
years_in_data <- fred_ts %>%
mutate(yr = year(Month)) %>%
distinct(yr) %>%
arrange(yr) %>%
pull(yr)
last_year <- max(years_in_data)
train_years <- (last_year - 4):(last_year - 1)
train_ts <- fred_ts %>% filter(year(Month) %in% train_years)
test_ts <- fred_ts %>% filter(year(Month) == last_year)
h <- nrow(test_ts)
# Fit multiple models
fit <- train_ts %>%
model(
ETS = ETS(value),
ARIMA = ARIMA(value),
TSLM = TSLM(value),
MEAN = MEAN(value),
NAIVE = NAIVE(value),
SNAIVE = SNAIVE(value)
)
# Forecast
fc <- fit %>%
forecast(h = h)
# Accuracy metrics
acc <- fc %>%
accuracy(test_ts) %>%
select(.model, RMSE, MAE, MAPE, MPE, ME, MASE)
# Information criteria
info_raw <- fit %>%
glance() %>%
as_tibble()
if ("AICc" %in% names(info_raw)) {
info <- info_raw %>%
transmute(.model, AICc = AICc, BIC = BIC)
} else {
info <- info_raw %>%
transmute(.model, AICc = AIC, BIC = BIC)
}
# Merge metrics
ets_arima_metrics <- acc %>%
left_join(info, by = ".model") %>%
mutate(
Model = case_when(
.model == "ETS" ~ "ETS (fpp3)",
.model == "ARIMA" ~ "ARIMA (fpp3)",
.model == "TSLM" ~ "TSLM (fpp3)",
.model == "MEAN" ~ "MEAN (fpp3)",
.model == "NAIVE" ~ "NAIVE (fpp3)",
.model == "SNAIVE" ~ "SNAIVE(fpp3)",
TRUE ~ .model
)
) %>%
select(Model, RMSE, MAE, MAPE, MPE, ME, MASE, AICc, BIC)
ets_arima_metrics
# A tibble: 6 × 9
Model RMSE MAE MAPE MPE ME MASE AICc BIC
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 ARIMA (fpp3) 0.136 0.109 2.56 1.64 0.0716 NaN -43.8 -38.8
2 ETS (fpp3) 0.158 0.127 2.99 2.35 0.101 NaN 0.927 10.1
3 MEAN (fpp3) 0.112 0.0889 2.11 0.462 0.0222 NaN NA NA
4 NAIVE (fpp3) 0.141 0.111 2.61 2.05 0.0889 NaN NA NA
5 SNAIVE(fpp3) 0.224 0.189 4.51 4.51 0.189 NaN NA NA
6 TSLM (fpp3) 0.112 0.0889 2.11 0.462 0.0222 NaN -14.0 -10.5
# Add to comparison
comparison_full <- bind_rows(comparison_full, ets_arima_metrics)
comparison_full %>% arrange(RMSE)
Model RMSE MAE MAPE MPE ME MASE
1 XGBoost 0.08563791 0.06461507 1.523994 0.7790242 0.03461073 0.5147302
2 TSLM (fpp3) 0.11221672 0.08888889 2.110790 0.4621899 0.02222222 NaN
3 MEAN (fpp3) 0.11221672 0.08888889 2.110790 0.4621899 0.02222222 NaN
4 Random Forest 0.11865750 0.09785185 2.303982 1.7325004 0.07480741 0.7794978
5 ARIMA (fpp3) 0.13613813 0.10880935 2.563920 1.6369887 0.07155127 NaN
6 NAIVE (fpp3) 0.14142136 0.11111111 2.610350 2.0547949 0.08888889 NaN
7 ETS (fpp3) 0.15769849 0.12726794 2.993277 2.3494798 0.10149945 NaN
8 SNAIVE(fpp3) 0.22360680 0.18888889 4.507715 4.5077146 0.18888889 NaN
AICc BIC
1 -54.696396 16.459198
2 -14.000658 -10.524923
3 NA NA
4 5.173640 -4.037462
5 -43.755009 -38.840513
6 NA NA
7 0.926872 10.105298
8 NA NA
Deep Learning Models
What makes it “deep”? Deep learning refers to neural networks with multiple hidden layers (typically 3+), allowing them to learn hierarchical representations of data. Shallow networks with 1-2 layers can only learn simple patterns, but deep networks progressively extract more abstract features at each layer.
In image recognition, early layers detect edges, middle layers recognize shapes, and deep layers identify objects.
For time series, early layers capture short-term fluctuations while deeper layers learn long-term trends and seasonal patterns. The “learning” happens through backpropagation - the network computes prediction errors and adjusts millions of weights across all layers to minimize those errors. This requires large amounts of data (thousands to millions of samples) and significant computational power (GPUs), which is why deep learning only became practical in the 2010s.
The models below (RNN, LSTM, CNN-LSTM, Transformer) are all “deep” because they stack multiple processing layers that transform raw sequences into increasingly sophisticated representations before making predictions.
Data Preparation for Neural Networks
Deep learning libraries explained: We use reticulate to interface between R and Python, since most deep learning frameworks are built in Python.
The keras3 library provides an R interface to Keras (a high-level neural networks API) that runs on top of TensorFlow. Keras makes building neural networks intuitive - you can stack layers like LEGO blocks rather than implementing backpropagation from scratch. It handles all the complex tensor operations, GPU acceleration, and optimization under the hood, letting us focus on architecture design.
READ (10 minutes): Keras vs PyTorch: Which ML Framework is the Best for You? - ML wouldn't be a reality without the right tools; Keras and PyTorch are powerful frameworks that simplify building and deploying intelligent models.
Keras vs PyTorch: The two dominant deep learning frameworks are Keras (TensorFlow backend) and PyTorch. Keras prioritizes simplicity and quick prototyping - its high-level API with Sequential() and pre-built layers makes it ideal for standard architectures and beginners. PyTorch offers more flexibility and control with dynamic computational graphs, making it preferred for research and custom architectures, but it requires more boilerplate code. For production deployment, TensorFlow/Keras has better tooling; for experimentation and debugging, PyTorch's eager execution is more intuitive.
Both can ultimately do the same things - Keras is "easy to learn, harder to customize" while PyTorch is "steeper learning curve, more flexible." We use Keras here because its sequential API clearly demonstrates neural network concepts without excessive syntax.
Setting random seeds for reproducibility: Neural networks initialize weights randomly, and training involves stochastic processes (random batch sampling, dropout). To get reproducible results, we set multiple random seeds: np$random$seed(123L) controls NumPy’s random number generator (used for array operations), random$seed(123L) controls Python’s built-in random module, keras3::set_random_seed(123) sets TensorFlow’s seed, and PYTHONHASHSEED = "123" controls hash randomization. We need all four because different parts of the deep learning stack use different random number generators. Without setting these seeds, you’d get slightly different results each time you train the model, making it impossible to compare results or debug issues.
library(reticulate)
library(keras3)
# Set seeds
np <- import("numpy", convert = FALSE)
random <- import("random", convert = FALSE)
np$random$seed(123L)
random$seed(123L)
keras3::set_random_seed(123)
Sys.setenv("PYTHONHASHSEED" = "123")
# Build supervised sequences
all_scaled <- bind_rows(train_final, test_final)
make_sequences <- function(x, lag = 12) {
x <- as.numeric(x)
n <- length(x)
n_samples <- n - lag
X <- array(NA_real_, dim = c(n_samples, lag, 1))
y <- numeric(n_samples)
for (i in seq_len(n_samples)) {
X[i, , 1] <- x[i:(i + lag - 1)]
y[i] <- x[i + lag]
}
list(X = X, y = y)
}
lag <- 12
train_seq <- make_sequences(train_final$value, lag)
full_seq <- make_sequences(all_scaled$value, lag)
n_test <- nrow(test_final)
X_train <- train_seq$X
y_train_scaled <- train_seq$y
X_test <- tail(full_seq$X, n_test)
y_test_scaled <- tail(full_seq$y, n_test)
y_train_orig <- inv_scale(y_train_scaled)
y_test_orig <- inv_scale(y_test_scaled)
RNN Model
Simple RNN architecture: This is the most basic recurrent neural network. It processes sequences one step at a time, maintaining a hidden state that gets updated at each time step. The hidden state acts as a “memory” that carries information from previous time steps forward. However, simple RNNs suffer from the vanishing gradient problem - they struggle to learn long-term dependencies because gradients get smaller as they propagate back through time. For this reason, simple RNNs work best on short sequences or as a baseline comparison. Here we use 16 hidden units with a lookback window of 12 months (one year), followed by a dense layer that outputs the prediction.
Model architecture breakdown:
- keras_model_sequential() creates a linear stack of layers where data flows from input to output. The pipe operator |> chains layers together.
- layer_simple_rnn(units = 16, input_shape = c(lag, 1)) is the RNN layer: 16 hidden units that expect sequences of length lag (12 months) with 1 feature per time step. It processes the sequence and outputs a 16-dimensional vector summarizing what it learned.
- layer_dense(units = 1) is the output layer that takes those 16 values and produces a single prediction (next month's unemployment rate).
Compilation settings:
- compile() configures the learning process.
- loss = "mse" (mean squared error) measures how far predictions are from actual values; we minimize this during training.
- optimizer = optimizer_adam(learning_rate = 0.01) specifies the algorithm for updating weights. Adam is an adaptive optimizer that adjusts learning rates automatically (better than basic gradient descent).
- learning_rate = 0.01 controls the step size - too high causes instability, too low makes training slow. 0.01 is a moderate, safe choice for RNNs.
Training parameters explained:
- epochs = 50 means the model sees the entire training dataset 50 times - each pass allows it to learn patterns better, though too many epochs can cause overfitting.
- batch_size = 8 means we update weights after processing 8 samples at a time rather than one-by-one or all at once (which is memory intensive); mini-batches give faster, more stable gradients, and the noise from small batches can help escape local minima.
- verbose = 0 suppresses training progress output to keep the document clean - set it to 1 to see epoch-by-epoch loss values during training.
rnn_model <- keras_model_sequential() |>
layer_simple_rnn(units = 16, input_shape = c(lag, 1)) |>
layer_dense(units = 1)
rnn_model |>
compile(
loss = "mse",
optimizer = optimizer_adam(learning_rate = 0.01)
)
history_rnn <- rnn_model |>
fit(
X_train, y_train_scaled,
epochs = 50,
batch_size = 8,
verbose = 0
)
rnn_pred_scaled <- rnn_model |>
predict(X_test) |>
as.numeric()
1/1 - 0s - 37ms/step
rnn_pred <- inv_scale(rnn_pred_scaled)
LSTM Model
LSTM architecture improvement: Long Short-Term Memory networks solve the vanishing gradient problem that plagues simple RNNs. They use a sophisticated gating mechanism with three gates: a forget gate (decides what to discard from memory), an input gate (controls what new information to store), and an output gate (determines what to output). This gate system allows LSTMs to selectively remember or forget information over long sequences, making them much better at capturing long-term dependencies. Here we use 32 LSTM units (more than the RNN since LSTMs are more complex) with the same 12-month lookback. The additional parameters and gating mechanisms make LSTMs slower to train but significantly more accurate for time series with long-range patterns.
lstm_model <- keras_model_sequential() |>
layer_lstm(units = 32, input_shape = c(lag, 1)) |>
layer_dense(units = 1)
lstm_model |>
compile(
loss = "mse",
optimizer = optimizer_adam(learning_rate = 0.005)
)
history_lstm <- lstm_model |>
fit(
X_train, y_train_scaled,
epochs = 60,
batch_size = 8,
verbose = 0
)
lstm_pred_scaled <- lstm_model |>
predict(X_test) |>
as.numeric()
1/1 - 0s - 47ms/step
lstm_pred <- inv_scale(lstm_pred_scaled)
CNN-LSTM Model
Hybrid architecture combining spatial and temporal learning: This model stacks a CNN layer before the LSTM to create a powerful hybrid. The convolutional layer (with 16 filters and kernel size 3) first scans the input sequence looking for local patterns and short-term trends - think of it as detecting “motifs” or repeating patterns in the time series data. The max pooling layer then downsamples this information, keeping only the most important features while reducing dimensionality. The LSTM layer receives these extracted features and models the long-term temporal dependencies. This combination often outperforms pure LSTMs because the CNN handles local pattern detection efficiently, while the LSTM focuses on learning how these patterns evolve over time. It’s particularly effective when your time series has both short-term recurring patterns and long-term trends.
cnn_lstm_model <- keras_model_sequential() |>
layer_conv_1d(
filters = 16,
kernel_size = 3,
activation = "relu",
padding = "causal",
input_shape = c(lag, 1)
) |>
layer_max_pooling_1d(pool_size = 2) |>
layer_lstm(units = 32) |>
layer_dense(units = 1)
cnn_lstm_model |>
compile(
loss = "mse",
optimizer = optimizer_adam(learning_rate = 0.003)
)
history_cnn_lstm <- cnn_lstm_model |>
fit(
X_train, y_train_scaled,
epochs = 60,
batch_size = 8,
verbose = 0
)
cnn_lstm_pred_scaled <- cnn_lstm_model |>
predict(X_test) |>
as.numeric()
1/1 - 0s - 53ms/step
cnn_lstm_pred <- inv_scale(cnn_lstm_pred_scaled)
Neural Network Metrics
nn_metrics <- function(model_name, actual, pred, train_actual) {
mae <- mean(abs(actual - pred))
rmse <- sqrt(mean((actual - pred)^2))
mape <- mean(abs((actual - pred) / actual)) * 100
mpe <- mean((actual - pred) / actual) * 100
me <- mean(actual - pred)
mase <- mae / mean(abs(diff(train_actual)))
data.frame(
Model = model_name,
RMSE = rmse,
MAE = mae,
MAPE = mape,
MPE = mpe,
ME = me,
MASE = mase,
AICc = NA_real_,
BIC = NA_real_
)
}
rnn_metrics <- nn_metrics("RNN (keras3)", y_test_orig, rnn_pred, y_train_orig)
lstm_metrics <- nn_metrics("LSTM (keras3)", y_test_orig, lstm_pred, y_train_orig)
cnn_lstm_metrics <- nn_metrics("CNN_LSTM (keras3)", y_test_orig, cnn_lstm_pred, y_train_orig)
nn_results <- bind_rows(rnn_metrics, lstm_metrics, cnn_lstm_metrics)
nn_results
Model RMSE MAE MAPE MPE ME MASE
1 RNN (keras3) 0.2096823 0.1864638 4.402199 4.109723 0.17476474 2.105236
2 LSTM (keras3) 0.1278936 0.1087412 2.573026 1.540689 0.06728356 1.227723
3 CNN_LSTM (keras3) 0.1329619 0.1130907 2.684733 1.051413 0.04723354 1.276831
AICc BIC
1 NA NA
2 NA NA
3 NA NA
# Add to comparison
comparison_full <- bind_rows(comparison_full, nn_results)
comparison_full %>% arrange(RMSE)
Model RMSE MAE MAPE MPE ME
1 XGBoost 0.08563791 0.06461507 1.523994 0.7790242 0.03461073
2 TSLM (fpp3) 0.11221672 0.08888889 2.110790 0.4621899 0.02222222
3 MEAN (fpp3) 0.11221672 0.08888889 2.110790 0.4621899 0.02222222
4 Random Forest 0.11865750 0.09785185 2.303982 1.7325004 0.07480741
5 LSTM (keras3) 0.12789355 0.10874117 2.573026 1.5406892 0.06728356
6 CNN_LSTM (keras3) 0.13296194 0.11309071 2.684733 1.0514133 0.04723354
7 ARIMA (fpp3) 0.13613813 0.10880935 2.563920 1.6369887 0.07155127
8 NAIVE (fpp3) 0.14142136 0.11111111 2.610350 2.0547949 0.08888889
9 ETS (fpp3) 0.15769849 0.12726794 2.993277 2.3494798 0.10149945
10 RNN (keras3) 0.20968228 0.18646377 4.402199 4.1097235 0.17476474
11 SNAIVE(fpp3) 0.22360680 0.18888889 4.507715 4.5077146 0.18888889
MASE AICc BIC
1 0.5147302 -54.696396 16.459198
2 NaN -14.000658 -10.524923
3 NaN NA NA
4 0.7794978 5.173640 -4.037462
5 1.2277229 NA NA
6 1.2768306 NA NA
7 NaN -43.755009 -38.840513
8 NaN NA NA
9 NaN 0.926872 10.105298
10 2.1052361 NA NA
11 NaN NA NA
Transformer Model
Self-attention for time series: This transformer-style model uses self-attention mechanisms instead of recurrence, allowing it to look at all time steps simultaneously rather than processing sequentially like RNNs/LSTMs.
The architecture first projects the input through a dense layer, then applies a self-attention layer where each time step can “attend to” every other time step, learning which past observations are most relevant for prediction. This attention mechanism computes relevance scores between all pairs of time steps, essentially asking “which past months matter most for predicting the future?” The model then flattens these attention-weighted features and passes them through dense layers for the final prediction.
Transformers can capture complex long-range dependencies that LSTMs might miss, and they parallelize better during training (though this single-sequence forecast doesn’t fully exploit that advantage).
The trade-off is higher computational cost and more parameters to tune.
# Transformer-style self-attention model
transformer_input <- keras_input(shape = c(lag, 1))
x <- transformer_input |>
layer_dense(units = 16, activation = "relu", name = "proj_dense")
attn_out <- layer_attention(
list(x, x, x),
use_scale = TRUE,
score_mode = "dot",
name = "self_attention"
)
x_out <- attn_out |>
layer_flatten(name = "flatten") |>
layer_dense(units = 16, activation = "relu", name = "post_attn_dense") |>
layer_dense(units = 1, name = "output")
transformer_model <- keras_model(
inputs = transformer_input,
outputs = x_out,
name = "transformer_style_model"
)
summary(transformer_model)
Model: "transformer_style_model"
┏━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┓
┃ Layer (type) ┃ Output Shape ┃ Param # ┃ Connected to ┃
┡━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━┩
│ input_layer_3 │ (None, 12, 1) │ 0 │ - │
│ (InputLayer) │ │ │ │
├───────────────────────┼───────────────────┼─────────────┼────────────────────┤
│ proj_dense (Dense) │ (None, 12, 16) │ 32 │ input_layer_3[0][… │
├───────────────────────┼───────────────────┼─────────────┼────────────────────┤
│ self_attention │ (None, 12, 16) │ 1 │ proj_dense[0][0], │
│ (Attention) │ │ │ proj_dense[0][0], │
│ │ │ │ proj_dense[0][0] │
├───────────────────────┼───────────────────┼─────────────┼────────────────────┤
│ flatten (Flatten) │ (None, 192) │ 0 │ self_attention[0]… │
├───────────────────────┼───────────────────┼─────────────┼────────────────────┤
│ post_attn_dense │ (None, 16) │ 3,088 │ flatten[0][0] │
│ (Dense) │ │ │ │
├───────────────────────┼───────────────────┼─────────────┼────────────────────┤
│ output (Dense) │ (None, 1) │ 17 │ post_attn_dense[0… │
└───────────────────────┴───────────────────┴─────────────┴────────────────────┘
Total params: 3,138 (12.26 KB)
Trainable params: 3,138 (12.26 KB)
Non-trainable params: 0 (0.00 B)
transformer_model |>
compile(
loss = "mse",
optimizer = optimizer_adam(learning_rate = 0.005)
)
history_transformer <- transformer_model |>
fit(
X_train, y_train_scaled,
epochs = 60,
batch_size = 8,
verbose = 0
)
transformer_pred_scaled <- transformer_model |>
predict(X_test) |>
as.numeric()
1/1 - 0s - 18ms/step
transformer_pred <- inv_scale(transformer_pred_scaled)
transformer_metrics <- nn_metrics(
model_name = "Transformer (keras3)",
actual = y_test_orig,
pred = transformer_pred,
train_actual = y_train_orig
)
transformer_metrics
Model RMSE MAE MAPE MPE ME
1 Transformer (keras3) 0.124949 0.09712288 2.301205 0.6669475 0.03103196
MASE AICc BIC
1 1.096549 NA NA
Final Model Comparison
library(knitr)
comparison_full <- bind_rows(comparison_full, transformer_metrics)
comparison_full %>%
arrange(RMSE) %>%
kable(digits = 4, caption = "Model Performance Comparison (Sorted by RMSE)")
| Model | RMSE | MAE | MAPE | MPE | ME | MASE | AICc | BIC |
|---|---|---|---|---|---|---|---|---|
| XGBoost | 0.0856 | 0.0646 | 1.5240 | 0.7790 | 0.0346 | 0.5147 | -54.6964 | 16.4592 |
| TSLM (fpp3) | 0.1122 | 0.0889 | 2.1108 | 0.4622 | 0.0222 | NaN | -14.0007 | -10.5249 |
| MEAN (fpp3) | 0.1122 | 0.0889 | 2.1108 | 0.4622 | 0.0222 | NaN | NA | NA |
| Random Forest | 0.1187 | 0.0979 | 2.3040 | 1.7325 | 0.0748 | 0.7795 | 5.1736 | -4.0375 |
| Transformer (keras3) | 0.1249 | 0.0971 | 2.3012 | 0.6669 | 0.0310 | 1.0965 | NA | NA |
| LSTM (keras3) | 0.1279 | 0.1087 | 2.5730 | 1.5407 | 0.0673 | 1.2277 | NA | NA |
| CNN_LSTM (keras3) | 0.1330 | 0.1131 | 2.6847 | 1.0514 | 0.0472 | 1.2768 | NA | NA |
| ARIMA (fpp3) | 0.1361 | 0.1088 | 2.5639 | 1.6370 | 0.0716 | NaN | -43.7550 | -38.8405 |
| NAIVE (fpp3) | 0.1414 | 0.1111 | 2.6104 | 2.0548 | 0.0889 | NaN | NA | NA |
| ETS (fpp3) | 0.1577 | 0.1273 | 2.9933 | 2.3495 | 0.1015 | NaN | 0.9269 | 10.1053 |
| RNN (keras3) | 0.2097 | 0.1865 | 4.4022 | 4.1097 | 0.1748 | 2.1052 | NA | NA |
| SNAIVE(fpp3) | 0.2236 | 0.1889 | 4.5077 | 4.5077 | 0.1889 | NaN | NA | NA |
Summary
This analysis compared 12 different forecasting approaches:
- Machine Learning: Random Forest, XGBoost
- Statistical: ETS, ARIMA, TSLM, MEAN, NAIVE, SNAIVE
- Deep Learning: RNN, LSTM, CNN-LSTM, Transformer
The best performing model based on RMSE was XGBoost with an RMSE of 0.0856.