BTC Bridge: Multi-Exchange Decomposition Feature Engineering

Author

Adrian Aziza

library(fpp3)
library(tidyverse)
library(lubridate)
library(patchwork)
library(knitr)

1 1. Concept: The Collapse Layer

The collapse layer is the foundation of multi-feed time series feature engineering:

  1. Define a single temporal grid (e.g., 1-minute bars)
  2. Enforce a consistent missing-data policy (forward-fill, interpolate, or drop)
  3. Align all feeds to this grid
  4. Generate features on the aligned data

This enables: - Time-aligned feature matrices (X_t) where each row is a timestamp - Cross-feed analysis (correlations, spreads, leads/lags) - Decomposition-based features (trend, seasonal, remainder)

2 2. Simulated Multi-Exchange BTC Data

Since we don’t have live API access in this environment, we’ll simulate realistic multi-exchange data with: - Different sampling frequencies - Missing timestamps (gaps) - Noise and drift between exchanges

set.seed(42)

# Base BTC price signal (shared trend)
start_time <- ymd_hms("2024-01-01 00:00:00")
n_minutes <- 1440  # 24 hours of minute data

base_grid <- tibble(
  timestamp = start_time + minutes(0:(n_minutes-1)),
  base_price = 42000 + cumsum(rnorm(n_minutes, mean = 0.5, sd = 50))
)

# Exchange 1: Binance (complete data, slight premium)
binance <- base_grid %>%
  mutate(
    exchange = "binance",
    price = base_price * 1.002 + rnorm(n(), 0, 10)  # 0.2% premium + noise
  ) %>%
  select(timestamp, exchange, price)

# Exchange 2: Coinbase (90% completeness, slight discount)
coinbase <- base_grid %>%
  sample_frac(0.90) %>%  # Missing 10% of timestamps
  mutate(
    exchange = "coinbase",
    price = base_price * 0.998 + rnorm(n(), 0, 15)  # 0.2% discount + more noise
  ) %>%
  select(timestamp, exchange, price)

# Exchange 3: Kraken (85% completeness, variable spread)
kraken <- base_grid %>%
  sample_frac(0.85) %>%  # Missing 15% of timestamps
  mutate(
    exchange = "kraken",
    price = base_price * (1 + rnorm(n(), 0, 0.003)) + rnorm(n(), 0, 20)  # Variable spread
  ) %>%
  select(timestamp, exchange, price)

# Combine all exchanges
raw_data <- bind_rows(binance, coinbase, kraken)

# Summary
raw_data %>%
  count(exchange) %>%
  kable(caption = "Raw data counts by exchange (before alignment)")
Raw data counts by exchange (before alignment)
exchange n
binance 1440
coinbase 1296
kraken 1224

3 3. The Collapse Layer: Grid Alignment

# Step 1: Define the master temporal grid (every minute)
master_grid <- tibble(
  timestamp = start_time + minutes(0:(n_minutes-1))
)

# Step 2: Pivot to wide format (one column per exchange)
wide_data <- raw_data %>%
  pivot_wider(
    names_from = exchange,
    values_from = price,
    names_prefix = "price_"
  )

# Step 3: Align to master grid and handle missing data
aligned_data <- master_grid %>%
  left_join(wide_data, by = "timestamp") %>%
  # Forward-fill missing values (last observation carried forward)
  fill(starts_with("price_"), .direction = "down") %>%
  # Drop any remaining leading NAs
  drop_na()

# Verify alignment
cat("Master grid rows:", nrow(master_grid), "\n")
Master grid rows: 1440 
cat("Aligned data rows:", nrow(aligned_data), "\n")
Aligned data rows: 1440 
cat("Missing values:\n")
Missing values:
aligned_data %>% 
  summarise(across(starts_with("price_"), ~sum(is.na(.)))) %>%
  kable()
price_binance price_coinbase price_kraken
0 0 0
# Visualize the aligned price feeds
aligned_data %>%
  pivot_longer(
    cols = starts_with("price_"),
    names_to = "exchange",
    values_to = "price",
    names_prefix = "price_"
  ) %>%
  ggplot(aes(x = timestamp, y = price, color = exchange)) +
  geom_line(alpha = 0.7) +
  labs(
    title = "Aligned BTC Prices Across Exchanges (Post-Collapse)",
    x = "Time",
    y = "Price (USD)",
    color = "Exchange"
  ) +
  theme_minimal()

4 4. STL Decomposition on Each Feed

Now we decompose each exchange feed into trend + seasonal + remainder components.

# Convert to tsibble format (required for fable/feasts)
# We'll use 15-minute aggregation for seasonality detection
btc_ts <- aligned_data %>%
  mutate(
    time_15min = floor_date(timestamp, "15 minutes")
  ) %>%
  group_by(time_15min) %>%
  summarise(
    across(starts_with("price_"), mean, na.rm = TRUE)
  ) %>%
  ungroup() %>%
  as_tsibble(index = time_15min)

# Verify tsibble properties
cat("Tsibble interval:", format(interval(btc_ts)), "\n")
Tsibble interval: 15m 
cat("Has gaps:", has_gaps(btc_ts)$.gaps, "\n")
Has gaps: FALSE 
# Decompose each exchange feed using STL
# Note: We need at least 2 full seasonal periods for STL
# With 15-min bars over 24 hours, we have 96 observations
# We'll use a 4-hour seasonal window (16 bars)

decomp_models <- btc_ts %>%
  pivot_longer(
    cols = starts_with("price_"),
    names_to = "exchange",
    values_to = "price",
    names_prefix = "price_"
  ) %>%
  as_tsibble(key = exchange, index = time_15min) %>%
  model(
    stl = STL(price ~ trend(window = 21) + season(window = "periodic"))
  )

# Extract components
decomp_components <- decomp_models %>%
  components()

# View structure
decomp_components %>%
  head(10) %>%
  kable(digits = 2, caption = "Sample STL components (first 10 rows)")
Sample STL components (first 10 rows)
exchange .model time_15min price trend season_hour remainder season_adjust
binance stl 2024-01-01 00:00:00 42309.90 42320.97 -9.68 -1.40 42319.57
binance stl 2024-01-01 00:15:00 42278.66 42289.26 1.41 -12.01 42277.25
binance stl 2024-01-01 00:30:00 42137.62 42257.55 20.91 -140.84 42116.71
binance stl 2024-01-01 00:45:00 42053.16 42225.84 -12.63 -160.04 42065.80
binance stl 2024-01-01 01:00:00 42179.54 42192.25 -9.68 -3.03 42189.22
binance stl 2024-01-01 01:15:00 42273.51 42158.66 1.41 113.44 42272.10
binance stl 2024-01-01 01:30:00 42376.58 42125.07 20.91 230.61 42355.68
binance stl 2024-01-01 01:45:00 42362.75 42093.26 -12.63 282.12 42375.39
binance stl 2024-01-01 02:00:00 42056.52 42061.46 -9.68 4.74 42066.20
binance stl 2024-01-01 02:15:00 41997.11 42029.65 1.41 -33.94 41995.70

5 5. Visualize Decompositions

# Plot all three exchange decompositions
decomp_components %>%
  autoplot() +
  facet_wrap(~ exchange, ncol = 1, scales = "free_y") +
  labs(title = "STL Decomposition: All Exchanges") +
  theme_minimal()

6 6. Feature Matrix Construction

Extract features from each component and create the time-aligned feature matrix (X_t).

# Check what columns STL produced
cat("STL decomposition columns:\n")
STL decomposition columns:
cat(paste(colnames(decomp_components), collapse = "\n"))
exchange
.model
time_15min
price
trend
season_hour
remainder
season_adjust
cat("\n\n")
# Show sample of decomposition
decomp_components %>%
  head(3) %>%
  kable(digits = 2, caption = "Sample decomposition output")
Sample decomposition output
exchange .model time_15min price trend season_hour remainder season_adjust
binance stl 2024-01-01 00:00:00 42309.90 42320.97 -9.68 -1.40 42319.57
binance stl 2024-01-01 00:15:00 42278.66 42289.26 1.41 -12.01 42277.25
binance stl 2024-01-01 00:30:00 42137.62 42257.55 20.91 -140.84 42116.71
# First, let's see what columns we have
decomp_df <- decomp_components %>% as_tibble()
cat("Available columns:\n")
Available columns:
print(colnames(decomp_df))
[1] "exchange"      ".model"        "time_15min"    "price"        
[5] "trend"         "season_hour"   "remainder"     "season_adjust"
cat("\n")
# Get the seasonal column name (it varies based on the period)
seasonal_cols <- colnames(decomp_df)[grepl("season", colnames(decomp_df))]
seasonal_col <- seasonal_cols[1]  # Take the first seasonal column
cat("Seasonal column found:", seasonal_col, "\n\n")
Seasonal column found: season_hour 
# Extract trend features: level, slope (first difference), curvature (second difference)
trend_features <- decomp_df %>%
  group_by(exchange) %>%
  arrange(time_15min) %>%
  mutate(
    trend_level = trend,
    trend_slope = trend - lag(trend, default = first(trend)),
    trend_curvature = trend_slope - lag(trend_slope, default = first(trend_slope))
  ) %>%
  ungroup() %>%
  select(time_15min, exchange, trend_level, trend_slope, trend_curvature)

# Extract seasonal and remainder features
# Use the seasonal column we found dynamically
seasonal_remainder <- decomp_df %>%
  select(time_15min, exchange, season_component = all_of(seasonal_col), remainder)

# Combine all features
features_long <- trend_features %>%
  left_join(seasonal_remainder, by = c("time_15min", "exchange"))

# Pivot to wide format (one column per exchange-feature combination)
feature_matrix <- features_long %>%
  pivot_wider(
    names_from = exchange,
    values_from = c(trend_level, trend_slope, trend_curvature, season_component, remainder),
    names_sep = "_"
  ) %>%
  arrange(time_15min)

# Display feature matrix structure
cat("Feature matrix dimensions:", nrow(feature_matrix), "x", ncol(feature_matrix), "\n\n")
Feature matrix dimensions: 96 x 16 
cat("Feature columns:\n")
Feature columns:
colnames(feature_matrix) %>% head(10) %>% cat(sep = "\n")
time_15min
trend_level_binance
trend_level_coinbase
trend_level_kraken
trend_slope_binance
trend_slope_coinbase
trend_slope_kraken
trend_curvature_binance
trend_curvature_coinbase
trend_curvature_kraken
# Show sample
feature_matrix %>%
  head(5) %>%
  select(1:8) %>%  # First 8 columns only for display
  kable(digits = 3, caption = "Sample feature matrix (first 5 rows, 8 cols)")
Sample feature matrix (first 5 rows, 8 cols)
time_15min trend_level_binance trend_level_coinbase trend_level_kraken trend_slope_binance trend_slope_coinbase trend_slope_kraken trend_curvature_binance
2024-01-01 00:00:00 42320.97 42158.79 42261.32 0.000 0.000 0.000 0.000
2024-01-01 00:15:00 42289.26 42126.31 42226.80 -31.712 -32.475 -34.518 -31.712
2024-01-01 00:30:00 42257.55 42093.84 42192.28 -31.712 -32.475 -34.518 0.000
2024-01-01 00:45:00 42225.84 42061.36 42157.76 -31.712 -32.475 -34.518 0.000
2024-01-01 01:00:00 42192.25 42027.08 42121.32 -33.589 -34.282 -36.444 -1.878

7 7. Cross-Feed Analysis: Spreads and Correlations

# Calculate exchange spreads (price differences)
spread_data <- aligned_data %>%
  mutate(
    spread_binance_coinbase = price_binance - price_coinbase,
    spread_binance_kraken = price_binance - price_kraken,
    spread_coinbase_kraken = price_coinbase - price_kraken
  )

# Visualize spreads
spread_data %>%
  select(timestamp, starts_with("spread_")) %>%
  pivot_longer(-timestamp, names_to = "spread", values_to = "value") %>%
  ggplot(aes(x = timestamp, y = value, color = spread)) +
  geom_line(alpha = 0.7) +
  labs(
    title = "Inter-Exchange Price Spreads",
    x = "Time",
    y = "Spread (USD)",
    color = "Spread"
  ) +
  theme_minimal()

# Calculate rolling correlations between exchanges
library(zoo)

# 60-minute rolling window correlations
roll_window <- 60

correlations <- aligned_data %>%
  mutate(
    corr_binance_coinbase = rollapply(
      cbind(price_binance, price_coinbase),
      width = roll_window,
      FUN = function(x) cor(x[,1], x[,2]),
      by.column = FALSE,
      fill = NA,
      align = "right"
    ),
    corr_binance_kraken = rollapply(
      cbind(price_binance, price_kraken),
      width = roll_window,
      FUN = function(x) cor(x[,1], x[,2]),
      by.column = FALSE,
      fill = NA,
      align = "right"
    ),
    corr_coinbase_kraken = rollapply(
      cbind(price_coinbase, price_kraken),
      width = roll_window,
      FUN = function(x) cor(x[,1], x[,2]),
      by.column = FALSE,
      fill = NA,
      align = "right"
    )
  ) %>%
  drop_na()

# Visualize rolling correlations
correlations %>%
  select(timestamp, starts_with("corr_")) %>%
  pivot_longer(-timestamp, names_to = "pair", values_to = "correlation") %>%
  ggplot(aes(x = timestamp, y = correlation, color = pair)) +
  geom_line(alpha = 0.7) +
  labs(
    title = paste("Rolling Correlation (", roll_window, "-minute window)"),
    x = "Time",
    y = "Correlation",
    color = "Exchange Pair"
  ) +
  ylim(0.9, 1.0) +
  theme_minimal()

8 8. ACF/PACF Analysis on Remainder Components

# ACF/PACF on remainder to check for remaining autocorrelation
remainder_data <- decomp_components %>%
  as_tibble() %>%
  select(time_15min, exchange, remainder) %>%
  as_tsibble(key = exchange, index = time_15min)

# Plot ACF for each exchange
p_acf <- remainder_data %>%
  ACF(remainder, lag_max = 48) %>%
  autoplot() +
  facet_wrap(~ exchange, ncol = 1) +
  labs(title = "ACF of Remainder (checking for leftover structure)") +
  theme_minimal()

# Plot PACF for each exchange
p_pacf <- remainder_data %>%
  PACF(remainder, lag_max = 48) %>%
  autoplot() +
  facet_wrap(~ exchange, ncol = 1) +
  labs(title = "PACF of Remainder") +
  theme_minimal()

p_acf / p_pacf

9 9. Export Feature Matrix for Modeling

# Export the feature matrix for downstream modeling
library(writexl)

write_xlsx(feature_matrix, "btc_feature_matrix.xlsx")
write.csv(feature_matrix, "btc_feature_matrix.csv", row.names = FALSE)

cat("Feature matrix exported to:\n")
Feature matrix exported to:
cat("- btc_feature_matrix.xlsx\n")
- btc_feature_matrix.xlsx
cat("- btc_feature_matrix.csv\n\n")
- btc_feature_matrix.csv
cat("Ready for:\n")
Ready for:
cat("- ML models (regression, classification, reinforcement learning)\n")
- ML models (regression, classification, reinforcement learning)
cat("- ARIMA/VAR models on trend components\n")
- ARIMA/VAR models on trend components
cat("- Volatility forecasting on remainder\n")
- Volatility forecasting on remainder
cat("- Spread/arbitrage signal generation\n")
- Spread/arbitrage signal generation

10 10. Summary: The Complete Pipeline

10.1 What We Built

  1. Multi-source data ingestion (simulated 3 exchanges)
  2. Collapse layer (alignment to master grid + forward-fill)
  3. STL decomposition (trend + seasonal + remainder per exchange)
  4. Feature engineering (trend level/slope/curvature, seasonal, remainder)
  5. Cross-feed analysis (spreads, rolling correlations)
  6. Persistence diagnostics (ACF/PACF on remainder)
  7. Time-aligned feature matrix (X_t) (ready for modeling)

10.2 Why This Matters

Traditional approaches: - Use raw prices directly → noisy, non-stationary - Ignore cross-feed structure → miss arbitrage opportunities - No systematic decomposition → hard to interpret model features

This pipeline: - ✅ Separates signal (trend/seasonal) from noise (remainder) - ✅ Creates interpretable features (trend slope = momentum) - ✅ Enables multi-timescale analysis (trend for position, remainder for risk) - ✅ Generalizes to any multi-feed scenario (crypto indices, FX, commodities)

10.3 Next Steps

  1. Feature selection: Use feature importance (RF, LASSO) on (X_t)
  2. Forecasting: Build ARIMAX/VAR models on trend components
  3. Regime detection: Cluster on remainder variance (high vol = regime shift)
  4. Signal generation: Spread crossovers, trend slope thresholds
  5. Backtesting: Simulate trades using feature-based rules

10.4 Connection to Discussion 3

This is the direct extension of the PAYNSA decomposition work:

PAYNSA (Single Series) BTC Engine (Multi-Feed)
1 series (employment) N series (exchanges)
Monthly frequency Minute/tick frequency
Additive vs multiplicative Additive (price levels)
Forecast next value Generate trading signals
Excel verification Production pipeline

Core insight remains the same: Decomposition transforms raw data into structured features that are easier to model, interpret, and act upon.