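1. Bars

To apply machine learning algorithms to unstructured data, the data must first be converted into a regularized form. This is usually a table, and practitioners call the rows of such a table bars. Bars fall into two groups: standard bars and information-driven bars.

1.1 Standard Bars

1.1.1 Time Bars

Time bars are built by sampling information at fixed time intervals. Most financial time series come in this form. They are the most common type of bar, but they are best avoided for the following reasons.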

- Markets do not process information at a constant time interval.
  - Time bars oversample information when trading is thin and undersample it when trading is heavy.
- Time-sampled series tend to exhibit poor statistical properties:
  - serial correlation
  - heteroscedasticity
  - non-normality of returns
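
For comparison with the bar types that follow, here is a minimal sketch of how time bars could be built. The helper `func_TimeBar`, its `seconds` argument, and the toy data are assumptions made for illustration and are not part of the original analysis; the sketch uses base R only so that it runs without any packages.

```r
# Hypothetical helper: fixed-interval (time) bars from a tick table
# with columns Date (POSIXct), Price and Volume.
func_TimeBar <- function(df, seconds = 300) {
  # Label every tick with the start (epoch seconds) of its interval
  bucket <- floor(as.numeric(df$Date) / seconds) * seconds
  groups <- split(df, bucket)
  bars <- lapply(groups, function(g) {
    data.frame(
      open   = g$Price[1],
      high   = max(g$Price),
      low    = min(g$Price),
      close  = g$Price[nrow(g)],
      volume = sum(g$Volume),
      n_tick = nrow(g)            # ticks that fell into this interval
    )
  })
  out <- do.call(rbind, bars)
  out$date_time <- as.POSIXct(as.numeric(names(groups)),
                              origin = "1970-01-01", tz = "UTC")
  rownames(out) <- NULL
  out
}

# Toy data: five ticks in the first five minutes, one tick in the next five
toy <- data.frame(
  Date   = as.POSIXct("2013-09-03 09:00:00", tz = "UTC") + c(0, 10, 20, 30, 40, 310),
  Price  = c(100, 101, 99, 100, 102, 103),
  Volume = c(1, 2, 1, 3, 1, 5)
)
time_bars <- func_TimeBar(toy)
time_bars
```

The `n_tick` column makes the first objection visible: each bar spans the same amount of time but summarizes a different amount of trading activity.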

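1.1.2 Tick Bars

A tick is the smallest unit of trade in a financial time series; for example, buying 5 shares of Samsung Electronics generates one tick. Tick bars sample an observation every time a preset number of transactions has occurred. Because price changes measured over a fixed number of transactions yield returns closer to IID (independent and identically distributed) normal, tick bars have much better statistical properties than time bars.

Tick bars do require care with outliers, however. Many exchanges run an auction at the open and at the close, so a large volume of trades can be recorded as a single tick.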

library(tidyverse)
library(quantmod)
library(PerformanceAnalytics)
library(ggpubr)
library(tseries)

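# Load the trade file and derive the dollar value and signed quantities per tick
E_mini_SPX <- read_delim("ES_Trades.csv", col_select = c(2:5)) |> 
  mutate(Date = mdy_hms(paste0(Date, " ",Time)),
         Dollar = Price * Volume,
         # tick rule: +1 on an up-tick, -1 on a down-tick, 0 if the price is unchanged
         tick_rule = c(0, sign(diff(Price))),
         Volume_tick = Volume * tick_rule,
         Dollar_tick = Dollar * tick_rule) |> 
  # keep only the trades that moved the price
  filter(tick_rule != 0) |> 
  select(Date, Price, Volume, Dollar, tick_rule, Volume_tick, Dollar_tick)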

E_mini_SPX

## # A tibble: 270,590 × 7
##    Date                Price Volume Dollar tick_rule Volume_tick Dollar_tick
##    <dttm>              <dbl>  <dbl>  <dbl>     <dbl>       <dbl>       <dbl>
##  1 2013-09-01 17:00:00 1640       1  1640         -1          -1      -1640 
##  2 2013-09-01 17:00:00 1640.      2  3280.         1           2       3280.
##  3 2013-09-01 17:00:00 1640       1  1640         -1          -1      -1640 
##  4 2013-09-01 17:00:00 1640.      1  1640.         1           1       1640.
##  5 2013-09-01 17:00:00 1640       2  3280         -1          -2      -3280 
##  6 2013-09-01 17:00:00 1640.      3  4919.        -1          -3      -4919.
##  7 2013-09-01 17:00:00 1640       1  1640          1           1       1640 
##  8 2013-09-01 17:00:00 1640.      1  1640.        -1          -1      -1640.
##  9 2013-09-01 17:00:00 1640       1  1640          1           1       1640 
## 10 2013-09-01 17:00:00 1640.      2  3280.        -1          -2      -3280.
## # ℹ 270,580 more rows

The example data is as follows.

E-mini S&P 500 futures data
- Period: 2013-09-01 to 2013-09-20
- Rows: 270,590
- Source: github-AFML

# Build tick bars
func_TickBar <- function(df, tickSize = 5500) {
  
  TickBar <- df |> 
    mutate(bar_id = floor((row_number() - 1) / tickSize)) |> 
    group_by(bar_id) |> 
    summarise(
      date_time = last(Date),
      open = first(Price),
      high = max(Price),
      low = min(Price),
      close = last(Price),
      volume = sum(Volume),
      .groups = "drop"
    ) |> 
    select(-bar_id)
  
  return(TickBar)
  
}

# Compute tick-bar log returns
TickBar <- func_TickBar(E_mini_SPX)
TickBar <- TickBar |> 
  mutate(ret = c(NA, diff(log(close))))

TickBar

## # A tibble: 50 × 7
##    date_time            open  high   low close volume       ret
##    <dttm>              <dbl> <dbl> <dbl> <dbl>  <dbl>     <dbl>
##  1 2013-09-03 01:03:01 1640  1648. 1639  1648.  20801 NA       
##  2 2013-09-03 08:41:36 1648  1650  1642. 1648   19242 -0.000152
##  3 2013-09-03 10:23:37 1648. 1649. 1640. 1640.  20242 -0.00456 
##  4 2013-09-03 12:35:58 1641. 1642. 1632  1636.  21095 -0.00305 
##  5 2013-09-03 14:54:37 1635. 1640. 1631. 1637   19546  0.000917
##  6 2013-09-04 07:25:30 1637. 1642. 1636. 1638.  26517  0.000305
##  7 2013-09-04 09:25:21 1638. 1647. 1635  1646   18380  0.00518 
##  8 2013-09-04 12:16:39 1646. 1654. 1646. 1653   22428  0.00424 
##  9 2013-09-04 14:58:37 1653. 1655. 1650. 1651   19348 -0.00121 
## 10 2013-09-05 08:31:27 1651. 1656. 1650. 1652.  26478  0.000757
## # ℹ 40 more rows

The roughly 270,000 ticks have been converted into 50 tick bars.

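1.1.3 Volume Bars

Volume bars work on the same principle as tick bars: a sample is drawn every time a predefined number of units has been traded. Sampling returns by volume has been found to give even better statistical properties than sampling by tick.

# Build volume bars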
func_VolumeBar <- function(df, Volume_threshold = 28000) {
  
  VolumeBar <- df |> 
    mutate(bar_id = floor(cumsum(Volume) / Volume_threshold)) |> 
    group_by(bar_id) |> 
    summarise(
      date_time = last(Date),
      open = first(Price),
      high = max(Price),
      low = min(Price),
      close = last(Price),
      volume = sum(Volume),
      .groups = "drop"
    ) |> 
    select(-bar_id)
  
  return(VolumeBar)
  
}

VolumeBar <- func_VolumeBar(E_mini_SPX)
VolumeBar <- VolumeBar |> 
  mutate(ret = c(NA, diff(log(close))))
VolumeBar

## # A tibble: 41 × 7
##    date_time            open  high   low close volume       ret
##    <dttm>              <dbl> <dbl> <dbl> <dbl>  <dbl>     <dbl>
##  1 2013-09-03 04:38:40 1640  1649  1639  1644.  27998 NA       
##  2 2013-09-03 09:55:10 1645. 1650  1642. 1646.  27976  0.000760
##  3 2013-09-03 12:57:44 1646. 1647. 1632. 1633.  28020 -0.00793 
##  4 2013-09-03 15:05:58 1633  1640. 1631. 1638.  27947  0.00336 
##  5 2013-09-04 08:53:38 1638  1644. 1635  1643   28035  0.00290 
##  6 2013-09-04 12:10:58 1643. 1654. 1642. 1653.  28012  0.00592 
##  7 2013-09-04 15:04:17 1652. 1655. 1650. 1653.  27993  0       
##  8 2013-09-05 09:17:28 1652. 1658. 1650. 1658.  28018  0.00302 
##  9 2013-09-05 15:01:26 1658. 1658. 1652. 1653   27985 -0.00287 
## 10 2013-09-06 07:59:35 1653. 1664. 1648. 1662.  28012  0.00558 
## # ℹ 31 more rows

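Each bar now contains roughly the same volume.

1.1.4 Dollar Bars

Dollar bars sample an observation every time a preset market value has been exchanged. Sampling by dollar value traded is more reasonable than sampling by ticks or volume, particularly when the analysis involves large price fluctuations. The number of tick or volume bars per day varies widely, whereas for dollar bars of fixed size the range and speed of that variation over the years is meaningfully reduced. Dollar bars also tend to be robust to corporate actions that affect ticks and volume, such as stock splits, reverse splits, and new share issuance.

# Build dollar bars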
func_DollarBar <- function(df, Dollar_threshold = 7e+7) {
  
  DollarBar <- df |> 
    mutate(
      bar_id = floor(cumsum(Dollar) / Dollar_threshold)) |> 
    group_by(bar_id) |> 
    summarise(
      date_time = last(Date),
      open = first(Price),
      high = max(Price),
      low = min(Price),
      close = last(Price),
      volume = sum(Volume),
      .groups = "drop"
    ) |> 
    select(-bar_id)
  
  return(DollarBar)
  
}

DollarBar <- func_DollarBar(E_mini_SPX)
DollarBar <- DollarBar |> 
  mutate(ret = c(NA, diff(log(close))))
DollarBar

## # A tibble: 28 × 7
##    date_time            open  high   low close volume       ret
##    <dttm>              <dbl> <dbl> <dbl> <dbl>  <dbl>     <dbl>
##  1 2013-09-03 08:51:27 1640  1650  1639  1648.  42532 NA       
##  2 2013-09-03 13:10:58 1647. 1649. 1632. 1635.  42680 -0.00777 
##  3 2013-09-04 07:36:53 1635  1642. 1631. 1636.  42728  0.00107 
##  4 2013-09-04 12:39:25 1637. 1655. 1635  1654.  42540  0.0103  
##  5 2013-09-05 08:29:54 1653. 1656. 1650. 1652.  42364 -0.000756
##  6 2013-09-05 15:30:24 1652. 1658. 1651. 1653.  42294  0.000605
##  7 2013-09-06 08:46:47 1653  1664. 1646  1646   42268 -0.00439 
##  8 2013-09-06 10:13:34 1646. 1660. 1639. 1658.  42462  0.00741 
##  9 2013-09-06 15:30:00 1658. 1664  1652. 1654   42185 -0.00257 
## 10 2013-09-09 13:07:23 1654. 1670. 1652  1669.  42157  0.00918 
## # ℹ 18 more rows

1.1.5 Weekly Bar Counts

bar_count <- tibble(
  week = floor_date(E_mini_SPX$Date, "week")
) |> count(week, name = "count", .drop = TRUE)

bar_count <- bar_count |> 
  full_join(
    TickBar |> mutate(week = floor_date(date_time, "week")) |> count(week, name = "tick"),
    by = "week"
  ) |> 
  full_join(
    VolumeBar |> mutate(week = floor_date(date_time, "week")) |> count(week, name = "volume"),
    by = "week"
  ) |> 
  full_join(
    DollarBar |> mutate(week = floor_date(date_time, "week")) |> count(week, name = "dollar"),
    by = "week"
  ) |> 
  select(-count)

bar_count_long <- pivot_longer(bar_count, -week, names_to = "bar_type", values_to = "count")

ggplot(bar_count_long, aes(x = week, y = count, fill = bar_type)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Number of Bars per Week", y = "Count", x = "Week") +
  theme_minimal()

1.1.6 Comparing Return Normality

Bar_list <- list(
  TickBar,
  VolumeBar,
  DollarBar
)

log_ret <- map(Bar_list, ~ diff(log(.x$close))) |> 
  map(~scale(.x)) |> 
  map(~as.vector(.x)) |> 
  set_names(c("Tick", "Volume", "Dollar"))

log_ret |> map(~jarque.bera.test(.x)) |> 
  map("statistic") |> 
  reduce(cbind) |> 
  `colnames<-`(c("Tick", "Volume", "Dollar")) |> 
  `rownames<-`("jb_stats")

##              Tick    Volume   Dollar
## jb_stats 3.398232 0.9282544 1.537217
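
As a reading aid for the numbers above: the Jarque-Bera statistic combines the sample skewness $S$ and the sample kurtosis $K$ of the $n$ returns, and is zero when both match the normal distribution.

\[ JB = \frac{n}{6} \left( S^2 + \frac{(K-3)^2}{4} \right) \]

Under the null hypothesis of normality, $JB$ is asymptotically $\chi^2$ distributed with 2 degrees of freedom, whose 5% critical value is about 5.99. All three statistics fall below that value, so none of the three series rejects normality at the 5% level. Each series has fewer than 50 returns, so the asymptotic approximation should be treated as rough.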

bind_rows(
  lapply(names(log_ret), function(name) {
    tibble(ret = log_ret[[name]], bar_type = name)
    })
  ) |> 
  ggplot(aes(x = ret, color = bar_type)) +
    geom_density(lwd = 1) +
    stat_function(fun = dnorm, args = list(mean = 0, sd = 1), color = "black", linetype = "dashed") +
    xlim(c(-5, 5)) +
    theme_minimal() +
    labs(title = "Standardized Log Returns[Standard Bar]", x = "Z-score", color = "Bar Type")

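1.2 Information-Driven Bars

Information-driven bars build on standard bars; their purpose is to sample more frequently when new information arrives in the market. They are grounded in market microstructure theory and focus on the persistence of imbalanced signed volume. Synchronizing sampling with the arrival of informed traders makes it possible to take decisions before prices reach a new equilibrium level.

1.2.1 Tick Imbalance Bars

The basic idea behind tick imbalance bars (TIBs) is to sample whenever the tick imbalance exceeds expectations. The basic tick rule is as follows.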

\[ \Delta p_t : \text{price change at tick } t \]
\[ b_t = \left\{ \begin{array}{ll} 1 & \text{if } \Delta p_t > 0 \\ -1 & \text{if } \Delta p_t < 0 \end{array} \right. \]
\[ \theta_T = \sum_{t=1}^T b_t \]

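It then remains to find the index $T$ at which the cumulative signed ticks $\theta_T$ exceed a given threshold. The threshold is the expected value of $\theta_T$ at the start of the bar.

\[ \begin{aligned} E_0[\theta_T] & = E_0[T] \times (P[b_t = 1]-P[b_t = -1]) \\ & = E_0[T] \times (2P[b_t = 1]-1) \end{aligned} \]

- $E_0[T]$: the expected size of the tick bar, estimated as an exponentially weighted moving average of $T$ values from prior bars
- $(2P[b_t = 1]-1)$: the expected tick imbalance, where $P[b_t = 1]$ is the unconditional probability that a tick is classified as a buy; estimated as an exponentially weighted moving average of $b_t$ values from prior bars

\[ T^* = \arg\min_{T} \left\{ \left|\theta_T \right| \ge E_0[T] \left|2P[b_t = 1]-1\right| \right\} \]

TIBs are produced more frequently when $\theta_T$ is more imbalanced than expected. A TIB can in fact be understood as a bucket of trades containing equal amounts of information.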
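
As a rough illustration of the stopping rule, the sketch below applies it to a made-up price path in base R. The values of `expected_T` and `p_buy` are fixed by assumption here instead of being estimated by an exponentially weighted moving average, so this shows only the rule for a single bar, not the full adaptive procedure.

```r
# Toy illustration of the TIB stopping rule with a fixed threshold.
price <- c(100, 101, 102, 101, 102, 103, 104, 105, 104, 105)  # made-up prices
b_t   <- sign(diff(price))      # tick rule: +1 up-tick, -1 down-tick
theta <- cumsum(b_t)            # running signed tick imbalance theta_T

expected_T <- 8                 # assumed E_0[T]
p_buy      <- 0.75              # assumed P[b_t = 1]
threshold  <- expected_T * abs(2 * p_buy - 1)

# First index T at which |theta_T| reaches the threshold
T_star <- which(abs(theta) >= threshold)[1]

threshold   # 4
T_star      # 6
```

With these assumed values the bar closes at the sixth tick, the first point at which buys outnumber sells by four.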

EWMA <- function(x_vec, span) {
  alpha <- 2 / (span + 1)
  accumulate(x_vec, function(prev, xi) alpha * xi + (1 - alpha) * prev)
}
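
The recursion in the EWMA helper above can be checked by hand on a short vector. The sketch below is a base R equivalent written with `Reduce`; the name `ewma_base` is hypothetical and introduced only for this check. The first value seeds the recursion, and each later value is `alpha * x + (1 - alpha) * previous`.

```r
# Base R equivalent of the EWMA helper (Reduce instead of purrr::accumulate)
ewma_base <- function(x_vec, span) {
  alpha <- 2 / (span + 1)
  Reduce(function(prev, xi) alpha * xi + (1 - alpha) * prev,
         x_vec, accumulate = TRUE)
}

# span = 3 gives alpha = 0.5:
#   1
#   0.5 * 2 + 0.5 * 1   = 1.5
#   0.5 * 3 + 0.5 * 1.5 = 2.25
ewma_check <- ewma_base(c(1, 2, 3), span = 3)
ewma_check
```
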

func_TIB <- function(df, 
                     init_expected_T = 1000, 
                     max_expected_T = 1000, 
                     span_T = 3,
                     span_p = 100) {
  
  t_events <- c()
  cum_theta_list <- vector("numeric", nrow(df) - 1)
  threshold_list <- vector("numeric", nrow(df) - 1)
  
  bar_start <- 1
  num_ticks_in_bar <- 0
  num_ticks_bar_list <- c()
  cum_theta <- 0
  
  # Initial expectations, estimated from the first init_expected_T ticks
  expected_T <- init_expected_T
  imbalance_expectation <- last(EWMA(df$tick_rule[1:init_expected_T], span = span_p))
  threshold <- init_expected_T * abs(imbalance_expectation)
  
  for (i in 2:nrow(df)) {
    
    # Signed tick for this trade
    b_t <- df$tick_rule[i]
    
    num_ticks_in_bar <- num_ticks_in_bar + 1
    cum_theta <- cum_theta + b_t
    cum_theta_list[i] <- cum_theta
    threshold_list[i] <- threshold
    
    if (threshold != 0 && abs(cum_theta) >= threshold) {

      # Bar closes at tick i
      t_events <- c(t_events, i)
      num_ticks_bar_list <- c(num_ticks_bar_list, num_ticks_in_bar)

      # Update the EWMA estimates
      expected_T <- min(last(EWMA(num_ticks_bar_list, span = span_T)), max_expected_T)
      imbalance_expectation <- last(EWMA(df$tick_rule[bar_start:i], span = span_p))
      
      threshold <- expected_T * abs(imbalance_expectation)
      
      # reset
      bar_start <- i + 1
      num_ticks_in_bar <- 0
      cum_theta <- 0
    }
  }

  TIBBar <- df |> 
              mutate(bar_id = cut(c(1:nrow(df)), breaks = c(0, t_events, nrow(df)), labels = c(0:length(t_events)))) |> 
              group_by(bar_id) |> 
              summarise(
                date_time = last(Date),
                open = first(Price),
                high = max(Price),
                low = min(Price),
                close = last(Price),
                volume = sum(Volume),
                .groups = "drop"
              ) |> 
    select(-bar_id) |> 
    mutate(ret = c(NA, diff(log(close))))
  
  res <- list(
    t_events = t_events,
    TIBBar = TIBBar,
    cum_theta = cum_theta_list,
    threshold = threshold_list
  )
  
  return(res)
}

TIBBar_res <- func_TIB(E_mini_SPX)

TIBBar_res$TIBBar

## # A tibble: 5 × 7
##   date_time            open  high   low close volume      ret
##   <dttm>              <dbl> <dbl> <dbl> <dbl>  <dbl>    <dbl>
## 1 2013-09-06 07:08:58 1640  1659. 1631. 1659. 265552 NA      
## 2 2013-09-10 06:36:11 1659  1680. 1639. 1680. 198743  0.0126 
## 3 2013-09-15 17:00:17 1680  1696  1674  1695  221113  0.00874
## 4 2013-09-18 13:00:55 1695. 1712. 1688. 1712. 176933  0.0101 
## 5 2013-09-20 16:14:58 1712  1727. 1701. 1704. 268848 -0.00498

plot(E_mini_SPX$Date, E_mini_SPX$Price, 
     type = "l", 
     main = "Tick Imbalanced Bar(Red)",
     xlab = "Date",
     ylab = "Price")
points(TIBBar_res$TIBBar$date_time, TIBBar_res$TIBBar$close, col = "red", lwd = 2)

plot(TIBBar_res$cum_theta, 
     type = "l",
     main = "Cum theta vs. Threshold",
     xlab = "Index",
     ylab = "Count")
lines(TIBBar_res$threshold, col = "red", lwd = 2)

1.2.2 Volume Imbalance Bars / Dollar Imbalance Bars

Volume imbalance bars (VIBs) and dollar imbalance bars (DIBs) extend the concept of TIBs: a sample is drawn whenever the volume or dollar imbalance diverges from its expected value. The imbalance at time $T$ is defined as follows.

\[ \theta_T = \sum_{t=1}^T b_t v_t \]

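Here $v_t$ is either the number of securities traded (VIB) or the dollar amount exchanged (DIB). The expected value of $\theta_T$ at the start of the bar is as follows.

\[ \begin{aligned} E_0[\theta_T] & = E_0 \left[ \sum_{t|b_t=1}^T v_t \right] - E_0 \left[ \sum_{t|b_t=-1}^T v_t \right] \\ & = E_0[T] \times (P[b_t = 1]E_0[v_t|b_t=1] - P[b_t = -1]E_0[v_t|b_t=-1]) \\ & = E_0[T] \times (v^+ - v^-) \\ & = E_0[T] \times (2v^+ - E_0[v_t]) \end{aligned} \]

\[ E_0[v_t] = v^+ + v^- \]

where $v^+ = P[b_t = 1]E_0[v_t|b_t=1]$ and $v^- = P[b_t = -1]E_0[v_t|b_t=-1]$. The expected volume per tick across both directions, $E_0[v_t]$, is the sum of the expected volume contributed by buys, $v^+$, and the expected volume contributed by sells, $v^-$.

- $(2v^+ - E_0[v_t])$: estimated as an exponentially weighted moving average of $b_t v_t$ values from prior bars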

\[ T^* = \arg\min_{T} \left\{ \left|\theta_T \right| \ge E_0[T] \left|2v^+ - E_0[v_t]\right| \right\} \]
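
The step from $v^+ - v^-$ to $2v^+ - E_0[v_t]$ follows from substituting $v^- = E_0[v_t] - v^+$. A small numerical check, using values assumed purely for illustration ($P[b_t = 1] = 0.6$, $E_0[v_t|b_t=1] = 10$, $E_0[v_t|b_t=-1] = 5$, $E_0[T] = 100$):

\[ \begin{aligned} v^+ & = 0.6 \times 10 = 6, \qquad v^- = 0.4 \times 5 = 2, \qquad E_0[v_t] = v^+ + v^- = 8 \\ v^+ - v^- & = 6 - 2 = 4 = 2 \times 6 - 8 = 2v^+ - E_0[v_t] \\ E_0[T] \left| 2v^+ - E_0[v_t] \right| & = 100 \times 4 = 400 \end{aligned} \]

Under these assumed values a bar would close once the absolute signed volume $|\theta_T|$ reaches 400.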

1.2.2.1 Volume Imbalance Bars (VIB)

func_VIB <- function(df, 
                     init_expected_T = 1000, 
                     max_expected_T = 1000, 
                     span_T = 3,
                     span_p = 100) {
  
  t_events <- c()
  cum_theta_list <- vector("numeric", nrow(df) - 1)
  threshold_list <- vector("numeric", nrow(df) - 1)
  
  bar_start <- 1
  num_ticks_in_bar <- 0
  num_ticks_bar_list <- c()
  cum_theta <- 0
  
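  # Initial expectations, estimated from the first init_expected_T ticks
  expected_T <- init_expected_T
  imbalance_expectation <- last(EWMA(df$Volume_tick[1:init_expected_T], span = span_p))
  threshold <- init_expected_T * abs(imbalance_expectation)
  # record the initial threshold without discarding the preallocated vector
  threshold_list[1] <- threshold
  
  for (i in 2:nrow(df)) {
    
    # Signed volume for this trade
    theta <- df$Volume_tick[i]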
    
    num_ticks_in_bar <- num_ticks_in_bar + 1
    cum_theta <- cum_theta + theta
    cum_theta_list[i] <- cum_theta
    threshold_list[i] <- threshold
    
    if (threshold != 0 && abs(cum_theta) >= threshold) {

      # Bar closes at tick i
      t_events <- c(t_events, i)
      num_ticks_bar_list <- c(num_ticks_bar_list, num_ticks_in_bar)

      # Update the EWMA estimates
      expected_T <- min(last(EWMA(num_ticks_bar_list, span = span_T)), max_expected_T)
      imbalance_expectation <- last(EWMA(df$Volume_tick[bar_start:i], span = span_p))
      
      threshold <- expected_T * abs(imbalance_expectation)
      
      # reset
      bar_start <- i + 1
      num_ticks_in_bar <- 0
      cum_theta <- 0
    }
  }

  VIBBar <- df |> 
              mutate(bar_id = cut(c(1:nrow(df)), breaks = c(0, t_events, nrow(df)), labels = c(0:length(t_events)))) |> 
              group_by(bar_id) |> 
              summarise(
                date_time = last(Date),
                open = first(Price),
                high = max(Price),
                low = min(Price),
                close = last(Price),
                volume = sum(Volume),
                .groups = "drop"
              ) |> 
    select(-bar_id) |> 
    mutate(ret = c(NA, diff(log(close))))
  
  res <- list(
    t_events = t_events,
    VIBBar = VIBBar,
    cum_theta = cum_theta_list,
    threshold = threshold_list
  )
  
  return(res)
}

VIBBar_res <- func_VIB(E_mini_SPX)

VIBBar_res$VIBBar

## # A tibble: 6 × 7
##   date_time            open  high   low close volume       ret
##   <dttm>              <dbl> <dbl> <dbl> <dbl>  <dbl>     <dbl>
## 1 2013-09-06 09:03:25 1640  1664. 1631. 1642. 311219 NA       
## 2 2013-09-06 10:02:10 1642  1658. 1639. 1658   25284  0.00985 
## 3 2013-09-10 13:08:22 1658. 1683  1652  1681  169589  0.0138  
## 4 2013-09-10 14:55:32 1681. 1683. 1679. 1681.  12028  0.000149
## 5 2013-09-19 04:32:42 1682. 1726. 1674  1726. 440287  0.0266  
## 6 2013-09-20 16:14:58 1726. 1727. 1701. 1704. 172782 -0.0133

1.2.2.2 Dollar Imbalance Bars (DIB)

func_DIB <- function(df, 
                     init_expected_T = 1000, 
                     max_expected_T = 1000, 
                     span_T = 3,
                     span_p = 100) {
  
  t_events <- c()
  cum_theta_list <- vector("numeric", nrow(df) - 1)
  threshold_list <- vector("numeric", nrow(df) - 1)
  
  bar_start <- 1
  num_ticks_in_bar <- 0
  num_ticks_bar_list <- c()
  cum_theta <- 0
  
  # Initial expectations, estimated from the first init_expected_T ticks
  expected_T <- init_expected_T
  imbalance_expectation <- last(EWMA(df$Dollar_tick[1:init_expected_T], span = span_p))
  threshold <- init_expected_T * abs(imbalance_expectation)
  
  for (i in 2:nrow(df)) {
    
    # Signed dollar value for this trade
    theta <- df$Dollar_tick[i]
    
    num_ticks_in_bar <- num_ticks_in_bar + 1
    cum_theta <- cum_theta + theta
    cum_theta_list[i] <- cum_theta
    threshold_list[i] <- threshold
    
    if (threshold != 0 && abs(cum_theta) >= threshold) {

      # Bar closes at tick i
      t_events <- c(t_events, i)
      num_ticks_bar_list <- c(num_ticks_bar_list, num_ticks_in_bar)

      # Update the EWMA estimates
      expected_T <- min(last(EWMA(num_ticks_bar_list, span = span_T)), max_expected_T)
      imbalance_expectation <- last(EWMA(df$Dollar_tick[bar_start:i], span = span_p))
      
      threshold <- expected_T * abs(imbalance_expectation)
      
      # reset
      bar_start <- i + 1
      num_ticks_in_bar <- 0
      cum_theta <- 0
    }
  }

  DIBBar <- df |> 
              mutate(bar_id = cut(c(1:nrow(df)), breaks = c(0, t_events, nrow(df)), labels = c(0:length(t_events)))) |> 
              group_by(bar_id) |> 
              summarise(
                date_time = last(Date),
                open = first(Price),
                high = max(Price),
                low = min(Price),
                close = last(Price),
                volume = sum(Volume),
                .groups = "drop"
              ) |> 
    select(-bar_id) |> 
    mutate(ret = c(NA, diff(log(close))))
  
  res <- list(
    t_events = t_events,
    DIBBar = DIBBar,
    cum_theta = cum_theta_list,
    threshold = threshold_list
  )
  
  return(res)
}

DIBBar_res <- func_DIB(E_mini_SPX)

DIBBar_res$DIBBar

## # A tibble: 3 × 7
##   date_time            open  high   low close volume     ret
##   <dttm>              <dbl> <dbl> <dbl> <dbl>  <dbl>   <dbl>
## 1 2013-09-06 09:03:20 1640  1664. 1631. 1642  311091 NA     
## 2 2013-09-09 13:20:11 1642. 1672. 1639. 1671. 115990  0.0177
## 3 2013-09-20 16:14:58 1672. 1727. 1668. 1704. 704108  0.0193

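1.3 Run Bars

Large traders execute by sweeping the order book, using iceberg orders, or slicing a parent order into multiple child orders. This makes it useful to monitor the sequence of buys in the overall volume, and the basic idea of run bars is to sample when that sequence diverges from expectations.

1.3.1 Tick Run Bars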

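\[ \theta_T = \max \left\{ \sum_{t|b_t=1}^T b_t, -\sum_{t|b_t=-1}^T b_t\right\} \]

The expected value of $\theta_T$ at the start of the bar is as follows.

\[ E_0[\theta_T] = E_0 \left[ T \right] \max \left\{ P[b_t = 1], 1 - P[b_t=1] \right\} \]

- $E_0[T]$: the expected size of the tick bar, estimated as an exponentially weighted moving average of $T$ values from prior bars
- $P[b_t = 1]$: estimated as an exponentially weighted moving average of the proportion of buy ticks from prior bars

\[ T^* = \arg\min_{T} \left\{ \theta_T \ge E_0 \left[ T \right] \max \left\{ P[b_t = 1], 1 - P[b_t=1] \right\} \right\} \]

This definition allows for sequence breaks. That is, instead of measuring the length of the longest sequence, it counts the number of ticks on each side without offsetting them against ticks on the other side. For the purpose of forming bars, this is more useful than measuring sequence lengths.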
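
The difference between a run count and a net imbalance is easiest to see on a short made-up sequence. In the sketch below `expected_T` and `p_buy` are assumed values, not exponentially weighted estimates, and the tick signs are invented for illustration.

```r
# Toy illustration of the tick run rule: count each side separately, no netting.
b_t <- c(1, 1, -1, 1, -1, 1, 1, -1, 1, 1)   # made-up tick signs

n_buy  <- cumsum(b_t == 1)     # buy ticks seen so far
n_sell <- cumsum(b_t == -1)    # sell ticks seen so far
theta  <- pmax(n_buy, n_sell)  # theta_T: the larger of the two counts

expected_T <- 8                # assumed E_0[T]
p_buy      <- 0.75             # assumed P[b_t = 1]
threshold  <- expected_T * max(p_buy, 1 - p_buy)

T_star <- which(theta >= threshold)[1]

threshold            # 6
T_star               # 9
cumsum(b_t)[T_star]  # 3: the net imbalance at the same tick
```

At the ninth tick six buys have accumulated, so the run bar closes, while the net imbalance used by a TIB is only 3 because the three sells have been offset against the buys.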

func_TRB <- function(df, 
                     init_expected_T = 1000, 
                     max_expected_T = 1000, 
                     span_T = 3,
                     span_p = 100) {
  
  t_events <- c()
  cum_theta_list <- vector("numeric", nrow(df) - 1)
  threshold_list <- vector("numeric", nrow(df) - 1)
  
  theta_pos <- 0
  theta_neg <- 0
  bar_start <- 1
  num_ticks_in_bar <- 0
  num_ticks_bar_list <- c()
  P_bt_1_list <- c()
  cum_theta <- 0
  
  # Initial expectations, estimated from the first init_expected_T ticks
  expected_T <- init_expected_T
  int_bar <- df$tick_rule[1:init_expected_T]
  
  P_bt_1_EWMA <- last(EWMA(mean(int_bar == 1, na.rm = TRUE), span = span_p))
  imbalance_expectation <- max(P_bt_1_EWMA, (1 - P_bt_1_EWMA))
  threshold <- expected_T * imbalance_expectation
  
  for (i in 2:nrow(df)) {
    
    # Update the buy and sell tick counts
    b_t <- df$tick_rule[i]
    if (b_t == 1) {
      theta_pos <- theta_pos + 1
    } else if (b_t == -1) {
      theta_neg <- theta_neg + 1
    }
    cum_theta <- max(theta_pos, theta_neg)
    
    num_ticks_in_bar <- num_ticks_in_bar + 1
    cum_theta_list[i] <- cum_theta
    threshold_list[i] <- threshold
    
    if (threshold != 0 && cum_theta >= threshold) {
      
      # Bar closes at tick i
      t_events <- c(t_events, i)
      num_ticks_bar_list <- c(num_ticks_bar_list, num_ticks_in_bar)
      P_bt_1 <- mean(df$tick_rule[bar_start:i] == 1, na.rm = TRUE)
      P_bt_1_list <- c(P_bt_1_list, P_bt_1)

      # Update the EWMA estimates
      expected_T <- min(last(EWMA(num_ticks_bar_list, span = span_T)), max_expected_T)
      P_bt_1_EWMA <- last(EWMA(P_bt_1_list, span = span_p))
      imbalance_expectation <- max(P_bt_1_EWMA, (1 - P_bt_1_EWMA))
      
      threshold <- expected_T * imbalance_expectation
      
      # reset
      bar_start <- i + 1
      num_ticks_in_bar <- 0
      cum_theta <- 0
      theta_pos <- 0
      theta_neg <- 0
    }
  }

  TRBBar <- df |> 
              mutate(bar_id = cut(c(1:nrow(df)), breaks = c(0, t_events, nrow(df)), labels = c(0:length(t_events)))) |> 
              group_by(bar_id) |> 
              summarise(
                date_time = last(Date),
                open = first(Price),
                high = max(Price),
                low = min(Price),
                close = last(Price),
                volume = sum(Volume),
                .groups = "drop"
              ) |> 
    select(-bar_id) |> 
    mutate(ret = c(NA, diff(log(close))))
  
  res <- list(
    t_events = t_events,
    TRBBar = TRBBar,
    cum_theta = cum_theta_list,
    threshold = threshold_list
  )
  
  return(res)
}

TRBBar_res <- func_TRB(E_mini_SPX)

TRBBar_res$TRBBar

## # A tibble: 359 × 7
##    date_time            open  high   low close volume       ret
##    <dttm>              <dbl> <dbl> <dbl> <dbl>  <dbl>     <dbl>
##  1 2013-09-01 19:55:22 1640  1643. 1639  1643.   4736 NA       
##  2 2013-09-02 02:27:52 1643  1645  1640. 1644.   2622  0.000608
##  3 2013-09-02 06:44:00 1644  1648. 1644. 1648    3430  0.00228 
##  4 2013-09-02 09:48:27 1648. 1648. 1645. 1647.   4877 -0.000455
##  5 2013-09-02 20:47:48 1648. 1648. 1645. 1646.   3563 -0.000455
##  6 2013-09-03 02:57:06 1647. 1649  1646. 1648.   3293  0.00121 
##  7 2013-09-03 04:09:15 1648. 1648. 1642. 1644.   4197 -0.00258 
##  8 2013-09-03 06:33:10 1644  1647  1642. 1644.   3013 -0.000456
##  9 2013-09-03 08:05:05 1644. 1647  1643. 1647    3013  0.00213 
## 10 2013-09-03 08:32:08 1647. 1649. 1646. 1648    3560  0.000607
## # ℹ 349 more rows

1.3.2 Volume Run Bars / Dollar Run Bars

\[ \theta_T = \max \left\{ \sum_{t|b_t=1}^T b_t v_t, -\sum_{t|b_t=-1}^T b_t v_t\right\} \]

As before, $v_t$ is either the number of securities traded (volume run bars) or the dollar amount exchanged (dollar run bars). The expected value of $\theta_T$ at the start of the bar is as follows.

\[ E_0[\theta_T] = E_0 \left[ T \right] \max \left\{ P[b_t = 1]E_0[v_t|b_t=1], (1 - P[b_t=1])E_0[v_t|b_t=-1] \right\} \]

\[ T^* = \arg\min_{T} \left\{ \theta_T \ge E_0 \left[ T \right] \max \left\{ P[b_t = 1]E_0[v_t|b_t=1], (1 - P[b_t=1])E_0[v_t|b_t=-1] \right\} \right\} \]
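
A small made-up example may help with the signs in $\theta_T$: because $b_t v_t$ is negative for sells, the leading minus sign turns the sell-side sum into a positive total. In the sketch below `expected_T`, `p_buy`, `v_buy`, and `v_sell` are assumed values standing in for $E_0[T]$, $P[b_t = 1]$, $E_0[v_t|b_t=1]$, and $E_0[v_t|b_t=-1]$.

```r
# Toy illustration of the volume run rule on a sell-dominated sequence.
b_t <- c(1, -1, -1, 1, -1, -1)   # made-up tick signs
v_t <- c(2,  5,  4, 1,  6,  3)   # made-up traded volumes

buy_vol  <- cumsum(ifelse(b_t == 1,  v_t, 0))   # sum of b_t * v_t over buys
sell_vol <- cumsum(ifelse(b_t == -1, v_t, 0))   # minus the sum of b_t * v_t over sells
theta    <- pmax(buy_vol, sell_vol)

expected_T <- 10     # assumed E_0[T]
p_buy      <- 0.25   # assumed P[b_t = 1]
v_buy      <- 2      # assumed E_0[v_t | b_t = 1]
v_sell     <- 2      # assumed E_0[v_t | b_t = -1]
threshold  <- expected_T * max(p_buy * v_buy, (1 - p_buy) * v_sell)

T_star <- which(theta >= threshold)[1]

theta       # 2 5 9 9 15 18
threshold   # 15
T_star      # 5
```

Here the bar is closed by the sell side: the cumulative sell volume reaches 15 at the fifth tick while the buy side has accumulated only 3.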

1.3.2.1 Volume Run Bars

func_VRB <- function(df, 
                     init_expected_T = 1000, 
                     max_expected_T = 1000, 
                     span_T = 3,
                     span_p = 100) {
  
  t_events <- c()
  cum_theta_list <- vector("numeric", nrow(df) - 1)
  threshold_list <- vector("numeric", nrow(df) - 1)
  
  theta_pos <- 0
  theta_neg <- 0
  bar_start <- 1
  num_ticks_in_bar <- 0
  num_ticks_bar_list <- c()
  P_bt_1_list <- c()
  cum_theta <- 0
  
  # Initial expectations, estimated from the first init_expected_T ticks
  expected_T <- init_expected_T
  
  df_sub <- df[1:init_expected_T, ]
  int_tic <- df_sub$tick_rule
  vol_buy <- df_sub$Volume[int_tic == 1]
  vol_sell <- df_sub$Volume[int_tic == -1]
  
  P_bt_1_EWMA <- last(EWMA(mean(int_tic == 1, na.rm = TRUE), span = span_p))
  P_vt_buy_EWMA <- last(EWMA(vol_buy, span = span_p))
  P_vt_sell_EWMA <- last(EWMA(vol_sell, span = span_p))
  
  imbalance_expectation <- max(P_bt_1_EWMA * P_vt_buy_EWMA, (1 - P_bt_1_EWMA) * P_vt_sell_EWMA)
  threshold <- expected_T * imbalance_expectation
  
  for (i in 2:nrow(df)) {
    
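    # Update the buy and sell volume totals
    b_t <- df$tick_rule[i]
    bv_t <- df$Volume_tick[i]
    
    if (b_t == 1) {
      theta_pos <- theta_pos + bv_t
    } else if (b_t == -1) {
      # bv_t is negative for sells; subtract so the sell-side total is positive
      theta_neg <- theta_neg - bv_t
    }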
    cum_theta <- max(theta_pos, theta_neg)
    
    num_ticks_in_bar <- num_ticks_in_bar + 1
    cum_theta_list[i] <- cum_theta
    threshold_list[i] <- threshold
    
    if (threshold != 0 && cum_theta >= threshold) {
      
      # Bar closes at tick i
      t_events <- c(t_events, i)
      num_ticks_bar_list <- c(num_ticks_bar_list, num_ticks_in_bar)
      
      df_sub <- df[bar_start:i, ]
      int_tic <- df_sub$tick_rule
      P_bt_1 <- mean(int_tic == 1, na.rm = TRUE)
      P_bt_1_list <- c(P_bt_1_list, P_bt_1)
      
      vol_buy <- df_sub$Volume[int_tic == 1]
      vol_sell <- df_sub$Volume[int_tic == -1]
      
      # Update the EWMA estimates
      expected_T <- min(last(EWMA(num_ticks_bar_list, span = span_T)), max_expected_T)
      P_bt_1_EWMA <- last(EWMA(P_bt_1_list, span = span_p))
      P_vt_buy_EWMA <- last(EWMA(vol_buy, span = span_p))
      P_vt_sell_EWMA <- last(EWMA(vol_sell, span = span_p))
      
      imbalance_expectation <- max(P_bt_1_EWMA * P_vt_buy_EWMA, (1 - P_bt_1_EWMA) * P_vt_sell_EWMA)
      
      threshold <- expected_T * imbalance_expectation
      
      # reset
      bar_start <- i + 1
      num_ticks_in_bar <- 0
      cum_theta <- 0
      theta_pos <- 0
      theta_neg <- 0
    }
  }

  VRBBar <- df |> 
              mutate(bar_id = cut(c(1:nrow(df)), breaks = c(0, t_events, nrow(df)), labels = c(0:length(t_events)))) |> 
              group_by(bar_id) |> 
              summarise(
                date_time = last(Date),
                open = first(Price),
                high = max(Price),
                low = min(Price),
                close = last(Price),
                volume = sum(Volume),
                .groups = "drop"
              ) |> 
    select(-bar_id) |> 
    mutate(ret = c(NA, diff(log(close))))
  
  res <- list(
    t_events = t_events,
    VRBBar = VRBBar,
    cum_theta = cum_theta_list,
    threshold = threshold_list
  )
  
  return(res)
}

VRBBar_res <- func_VRB(E_mini_SPX)

VRBBar_res$VRBBar

## # A tibble: 197 × 7
##    date_time            open  high   low close volume       ret
##    <dttm>              <dbl> <dbl> <dbl> <dbl>  <dbl>     <dbl>
##  1 2013-09-02 02:25:26 1640  1645  1639  1645    7304 NA       
##  2 2013-09-02 06:40:19 1645. 1648. 1644. 1648.   3210  0.00167 
##  3 2013-09-02 09:34:32 1648  1648. 1645. 1647    4738 -0.000455
##  4 2013-09-03 02:20:05 1647. 1649  1645. 1647.   6740  0.000152
##  5 2013-09-03 04:04:49 1647  1648. 1642. 1646    4267 -0.000759
##  6 2013-09-03 08:35:06 1646. 1650. 1642. 1649   11742  0.00182 
##  7 2013-09-03 08:51:04 1649. 1650  1647. 1648.   4495 -0.000910
##  8 2013-09-03 09:06:42 1647. 1648  1644. 1647    4313 -0.000304
##  9 2013-09-03 09:13:30 1647. 1649. 1646. 1648.   2449  0.000910
## 10 2013-09-03 10:00:23 1648. 1648. 1644. 1645.   7170 -0.00197 
## # ℹ 187 more rows

1.3.2.2 Dollar Run Bars

func_DRB <- function(df, 
                     init_expected_T = 1000, 
                     max_expected_T = 1000, 
                     span_T = 3,
                     span_p = 100) {
  
  t_events <- c()
  cum_theta_list <- vector("numeric", nrow(df) - 1)
  threshold_list <- vector("numeric", nrow(df) - 1)
  
  theta_pos <- 0
  theta_neg <- 0
  bar_start <- 1
  num_ticks_in_bar <- 0
  num_ticks_bar_list <- c()
  P_bt_1_list <- c()
  cum_theta <- 0
  
  # Initial expectations, estimated from the first init_expected_T ticks
  expected_T <- init_expected_T
  
  df_sub <- df[1:init_expected_T, ]
  int_tic <- df_sub$tick_rule
  dollar_buy <- df_sub$Dollar[int_tic == 1]
  dollar_sell <- df_sub$Dollar[int_tic == -1]
  
  P_bt_1_EWMA <- last(EWMA(mean(int_tic == 1, na.rm = TRUE), span = span_p))
  P_dt_buy_EWMA <- last(EWMA(dollar_buy, span = span_p))
  P_dt_sell_EWMA <- last(EWMA(dollar_sell, span = span_p))
  
  imbalance_expectation <- max(P_bt_1_EWMA * P_dt_buy_EWMA, (1 - P_bt_1_EWMA) * P_dt_sell_EWMA)
  threshold <- expected_T * imbalance_expectation
  
  for (i in 2:nrow(df)) {
    
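    # Update the buy and sell dollar totals
    b_t <- df$tick_rule[i]
    bd_t <- df$Dollar_tick[i]
    
    if (b_t == 1) {
      theta_pos <- theta_pos + bd_t
    } else if (b_t == -1) {
      # bd_t is negative for sells; subtract so the sell-side total is positive
      theta_neg <- theta_neg - bd_t
    }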
    cum_theta <- max(theta_pos, theta_neg)
    
    num_ticks_in_bar <- num_ticks_in_bar + 1
    cum_theta_list[i] <- cum_theta
    threshold_list[i] <- threshold
    
    if (threshold != 0 && cum_theta >= threshold) {
      
      # Bar closes at tick i
      t_events <- c(t_events, i)
      num_ticks_bar_list <- c(num_ticks_bar_list, num_ticks_in_bar)
      
      df_sub <- df[bar_start:i, ]
      int_tic <- df_sub$tick_rule
      P_bt_1 <- mean(int_tic == 1, na.rm = TRUE)
      P_bt_1_list <- c(P_bt_1_list, P_bt_1)
      
      dollar_buy <- df_sub$Dollar[int_tic == 1]
      dollar_sell <- df_sub$Dollar[int_tic == -1]
      
      # Update the EWMA estimates
      expected_T <- min(last(EWMA(num_ticks_bar_list, span = span_T)), max_expected_T)
      P_bt_1_EWMA <- last(EWMA(P_bt_1_list, span = span_p))
      P_dt_buy_EWMA <- last(EWMA(dollar_buy, span = span_p))
      P_dt_sell_EWMA <- last(EWMA(dollar_sell, span = span_p))
      
      imbalance_expectation <- max(P_bt_1_EWMA * P_dt_buy_EWMA, (1 - P_bt_1_EWMA) * P_dt_sell_EWMA)
      
      threshold <- expected_T * imbalance_expectation
      
      # reset
      bar_start <- i + 1
      num_ticks_in_bar <- 0
      cum_theta <- 0
      theta_pos <- 0
      theta_neg <- 0
    }
  }

  DRBBar <- df |> 
              mutate(bar_id = cut(c(1:nrow(df)), breaks = c(0, t_events, nrow(df)), labels = c(0:length(t_events)))) |> 
              group_by(bar_id) |> 
              summarise(
                date_time = last(Date),
                open = first(Price),
                high = max(Price),
                low = min(Price),
                close = last(Price),
                volume = sum(Volume),
                .groups = "drop"
              ) |> 
    select(-bar_id) |> 
    mutate(ret = c(NA, diff(log(close))))
  
  res <- list(
    t_events = t_events,
    DRBBar = DRBBar,
    cum_theta = cum_theta_list,
    threshold = threshold_list
  )
  
  return(res)
}

DRBBar_res <- func_DRB(E_mini_SPX)

DRBBar_res$DRBBar

## # A tibble: 186 × 7
##    date_time            open  high   low close volume       ret
##    <dttm>              <dbl> <dbl> <dbl> <dbl>  <dbl>     <dbl>
##  1 2013-09-02 02:25:26 1640  1645  1639  1645    7304 NA       
##  2 2013-09-02 06:40:19 1645. 1648. 1644. 1648.   3210  0.00167 
##  3 2013-09-02 09:34:32 1648  1648. 1645. 1647    4738 -0.000455
##  4 2013-09-03 02:19:41 1647. 1649  1645. 1647.   6724  0.000152
##  5 2013-09-03 04:04:49 1647  1648. 1642. 1646    4283 -0.000759
##  6 2013-09-03 08:35:05 1646. 1650. 1642. 1649   11736  0.00182 
##  7 2013-09-03 08:50:59 1649. 1650  1647. 1648.   4451 -0.000758
##  8 2013-09-03 09:06:31 1648. 1648  1644. 1647.   4304 -0.000303
##  9 2013-09-03 09:13:30 1647  1649. 1646. 1648.   2508  0.000759
## 10 2013-09-03 10:00:23 1648. 1648. 1644. 1645.   7170 -0.00197 
## # ℹ 176 more rows

1.3.3 Comparing the Normality of Returns

Bar_list <- list(
  TRBBar_res$TRBBar,
  VRBBar_res$VRBBar,
  DRBBar_res$DRBBar
)

log_ret <- map(Bar_list, ~scale(.x$ret)) |> 
  map(~as.vector(.x)) |> 
  set_names(c("Tick", "Volume", "Dollar"))

log_ret |> map(~jarque.bera.test(.x[-1])) |> 
  map("statistic") |> 
  reduce(cbind) |> 
  `colnames<-`(c("Tick", "Volume", "Dollar")) |> 
  `rownames<-`("jb_stats")

##             Tick  Volume   Dollar
## jb_stats 538.169 79.9498 174.2682
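
For reference, the Jarque-Bera statistic reported above combines the sample skewness $S$ and the sample kurtosis $K$ of the $n$ standardized returns. This is the standard textbook definition, not something specific to this post. A smaller value means the returns are closer to a normal distribution, and under normality the statistic is asymptotically chi-squared with 2 degrees of freedom.

\[ JB = \frac{n}{6}\left( S^2 + \frac{(K-3)^2}{4} \right) \]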

bind_rows(
  lapply(names(log_ret), function(name) {
    tibble(ret = log_ret[[name]], bar_type = name)
    })
  ) |> 
  ggplot(aes(x = ret, color = bar_type)) +
    geom_density(lwd = 1) +
    stat_function(fun = dnorm, args = list(mean = 0, sd = 1), color = "black", linetype = "dashed") +
    xlim(c(-5, 5)) +
    theme_minimal() +
    labs(title = "Standardized Log Returns[Run Bar]", x = "Z-score", color = "Bar Type")

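2. The ETF Trick

Sometimes we need to model a time series of instruments whose weights must be adjusted dynamically over time, or that involve irregular coupons, irregular dividend payments, or corporate actions. In such cases, the events that alter the nature of the series under study must be handled properly.

Professor López de Prado addresses this by modeling a basket of securities as if it were a single cash product, an approach he calls the ‘ETF trick’. The goal is to transform a complex multi-product dataset into a single dataset that resembles a total-return ETF. This is quite useful because the code can then always assume that only cash-like products are traded.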

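2.1 The ETF Trick

Suppose we want to develop a strategy that trades a spread of futures. Dealing with a spread raises several problems:

The spread is characterized by a vector of weights that changes over time. As a result, the spread itself may converge even if prices do not change.
Spreads can take negative values, because they do not represent a price.
Trading times do not align exactly for all constituents, so the spread is not always tradable at the last published levels, nor without latency. Execution costs must also be considered.

One way to solve these problems is to produce a time series that reflects the value of 1 dollar invested in the spread. Changes in the series reflect changes in PnL, the series is always positive, and execution costs are taken into account.

Suppose we are given a history of bars produced by any of the bar methods described earlier. The dollar value $\{K_t\}$ of a futures basket characterized by an allocation vector $\omega_{i,t}$ that is rebalanced (or rolled) on bars $B \subseteq \{ 1, ~...~ ,T\}$ can be derived as follows.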

\[ h_{i,t}= \begin{cases} \dfrac{\omega_{i,t}K_t} {o_{i, t+1}\phi_{i,t}\sum_{i=1}^I\lvert\omega_{i,t}\rvert} & \text{if } t \in B \\[1ex] h_{i,t-1} & \text{otherwise} \end{cases} \]

\[ \delta_{i,t}= \begin{cases} p_{i,t}-o_{i,t} & \text{if } (t-1) \in B \\[1ex] \Delta p_{i,t} & \text{otherwise} \end{cases} \]

\[ K_t = K_{t-1} + \sum_{i=1}^I \, h_{i,t-1} \phi_{i,t}(\delta_{i,t}+d_{i,t}) \]

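Every instrument $i = 1,~...,~I$ is tradable at every bar $t = 1,~...,~T$.
The initial assets under management are $K_0 = 1$.
$o_{i,t}$: open price of instrument $i$ at bar $t$
$p_{i,t}$: close price of instrument $i$ at bar $t$
$\phi_{i,t}$: dollar value of one point of instrument $i$ (including the foreign exchange rate)
$v_{i,t}$: volume of instrument $i$
$d_{i,t}$: carry, dividend, or coupon paid by instrument $i$ (can also be used to charge margin or funding costs)
$h_{i,t}$: holdings of instrument $i$ at bar $t$ (number of shares or contracts)
$\omega_{i,t}$: target weight of each instrument
$\delta_{i,t}$: change in the market price of instrument $i$ between bars $t-1$ and $t$
Whenever $t \in B$, profits or losses are reinvested, which prevents negative prices.
Dividends $d_{i,t}$ are already embedded in $K_t$, so the strategy does not need to account for them separately.

The purpose of $\omega_{i,t}(\sum_{i=1}^I\lvert\omega_{i,t}\rvert)^{-1}$ is to de-lever the allocations.
For futures data, the price $p_{i,t}$ of the new contract may not be known at a roll time $t$, so $o_{i, t+1}$ is used as the closest price in time.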

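Let $\tau_i$ be the transaction cost of trading 1 dollar of instrument $i$ (e.g. 1bp). For every observed bar $t$, there are three additional variables the strategy needs to know.

Rebalance costs
- The variable cost $\{c_t\}$ associated with rebalancing the allocation
- $c_t$ is not embedded in $K_t$; otherwise, shorting the spread would generate fictitious profits whenever the allocation is rebalanced.
  
  \[ c_t = \sum_{i=1}^I(\lvert h_{i,t-1} \rvert p_{i,t} + \lvert h_{i,t} \rvert o_{i,t+1})\phi_{i,t}\tau_i, \quad \forall t \in B \]
Bid-ask spread
- The cost $\{\tilde{c_t}\}$ of buying or selling one unit of the virtual ETF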
  
  \[ \tilde{c_t} = \sum_{i=1}^I\lvert h_{i,t-1} \rvert p_{i,t} \phi_{i,t} \tau_i \]
Volume
- Determined by the least actively traded instrument in the basket
  
  \[ v_t = \min_{i} \left\{ \frac{v_{i,t}}{\lvert h_{i, t-1} \rvert} \right\} \]
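
As a rough illustration of the three quantities above, the following R sketch evaluates them at a single rebalance bar for a two-instrument basket. All input numbers and variable names (`h_prev`, `h_new`, `tau`, and so on) are made up for this example and are not taken from the original data.

```r
# Hypothetical inputs for two instruments at a rebalance bar t
h_prev <- c(0.0049, -0.0025)  # holdings h_{i,t-1}
h_new  <- c(0.0025, -0.0013)  # holdings h_{i,t} after the rebalance
p_t    <- c(103, 199)         # close prices p_{i,t}
o_next <- c(103, 199)         # next open prices o_{i,t+1}
phi    <- c(1, 1)             # dollar value of one point
tau    <- c(1e-4, 1e-4)       # cost per dollar traded (1bp, assumed)
vol    <- c(1000, 500)        # traded volume v_{i,t}

# Rebalance cost c_t: unwind the old holdings and open the new ones
rebalance_cost <- sum((abs(h_prev) * p_t + abs(h_new) * o_next) * phi * tau)

# Bid-ask cost: trade one unit of the virtual ETF
bid_ask_cost <- sum(abs(h_prev) * p_t * phi * tau)

# ETF volume v_t: limited by the least active instrument
etf_volume <- min(vol / abs(h_prev))

c(rebalance_cost = rebalance_cost,
  bid_ask_cost = bid_ask_cost,
  etf_volume = etf_volume)
```

The rebalance cost is always at least as large as the bid-ask cost here, because it adds the cost of opening the new holdings on top of closing the old ones.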

# Create example data
df <- tibble::tibble(
  date = seq.Date(as.Date("2022-01-01"), by = "day", length.out = 5),
  asset_1_open = c(100, 102, 101, 103, 104),
  asset_1_close = c(102, 101, 103, 104, 105),
  asset_2_open = c(200, 198, 197, 199, 200),
  asset_2_close = c(198, 197, 199, 200, 201),
  phi_1 = 1,
  phi_2 = 1,
  div_1 = 0,
  div_2 = 0
)

# Initial settings: K[t] is the ETF value at bar t, starting from K = 1
n <- nrow(df)
K <- numeric(n)
K[1] <- 1
h_1 <- numeric(n)
h_2 <- numeric(n)

# Rebalance bars and target weights
B <- c(1, 3)  # 1-based indexing (t=0,2 in Python)
omega <- list(
  `1` = c(1, -1),     # asset 1 long, asset 2 short
  `3` = c(0.5, -0.5)  # equal weight
)

# Compute the ETF value
for (t in 1:n) {
  if (t > 1) {
    # Price change: close minus open right after a rebalance, close-to-close otherwise
    if ((t - 1) %in% B) {
      delta_1 <- df$asset_1_close[t] - df$asset_1_open[t]
      delta_2 <- df$asset_2_close[t] - df$asset_2_open[t]
    } else {
      delta_1 <- df$asset_1_close[t] - df$asset_1_close[t - 1]
      delta_2 <- df$asset_2_close[t] - df$asset_2_close[t - 1]
    }
    
    # Update the ETF value using the holdings carried from the previous bar
    K[t] <- K[t - 1] +
      h_1[t - 1] * df$phi_1[t] * (delta_1 + df$div_1[t]) +
      h_2[t - 1] * df$phi_2[t] * (delta_2 + df$div_2[t])
  }
  
  # Rebalance at the next bar's open (the last bar has no next open, so it cannot rebalance)
  if (t %in% B && t < n) {
    w <- omega[[as.character(t)]]
    total_abs_weight <- sum(abs(w))
    h_1[t] <- (w[1] * K[t]) / (df$asset_1_open[t + 1] * df$phi_1[t] * total_abs_weight)
    h_2[t] <- (w[2] * K[t]) / (df$asset_2_open[t + 1] * df$phi_2[t] * total_abs_weight)
  } else if (t > 1) {
    h_1[t] <- h_1[t - 1]
    h_2[t] <- h_2[t - 1]
  }
}

# Combine results
df$K <- K
df$h_1 <- h_1
df$h_2 <- h_2

print(df)

## # A tibble: 5 × 12
##   date       asset_1_open asset_1_close asset_2_open asset_2_close phi_1 phi_2
##   <date>            <dbl>         <dbl>        <dbl>         <dbl> <dbl> <dbl>
## 1 2022-01-01          100           102          200           198     1     1
## 2 2022-01-02          102           101          198           197     1     1
## 3 2022-01-03          101           103          197           199     1     1
## 4 2022-01-04          103           104          199           200     1     1
## 5 2022-01-05          104           105          200           201     1     1
## # ℹ 5 more variables: div_1 <dbl>, div_2 <dbl>, K <dbl>, h_1 <dbl>, h_2 <dbl>

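2.2 PCA Weights

This is one way to derive the weight vector $\omega_t$ used in 2.1.

For a random vector of size $N \times 1$, the covariance matrix $V$ has size $N \times N$.
The covariance matrix $V$ is symmetric and positive semi-definite, so it admits an eigendecomposition.

Perform the spectral decomposition of the covariance matrix.
- $V = W \times \Lambda \times W^T$
  - $W$: an $N \times N$ matrix whose columns are the eigenvectors of $V$ (the risk directions)
  - $\Lambda$: an $N \times N$ diagonal matrix whose diagonal entries are the eigenvalues of $V$ (the risk magnitude along each direction)
Project the weight vector $\omega$ onto the eigenspace, i.e. express it in the orthogonal basis.
- This re-expresses $\omega$ in the new basis ($W$).
- $\beta = W^T \times \omega$
- $\beta$ holds the coefficients that indicate how much of $\omega$ is projected onto each principal component direction.
Re-express the variance in terms of the weight vector $\omega$.
- The risk of the portfolio can be written as follows: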
  
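  \[ \begin{aligned} \sigma^2 &= \omega^TV\omega \\ &= \omega^TW\Lambda W^T\omega \\ &= \beta^T \Lambda \beta \\ &= (\Lambda^{\frac{1}{2}}\beta)^T(\Lambda^{\frac{1}{2}}\beta) \\ &= \| \Lambda^{\frac{1}{2}}\beta \| ^2 \\ &= \sum^N_{n=1} \Lambda_n \beta_n^2 \end{aligned} \]
- $\beta_n$: size of the bet (exposure) along principal component direction $n$
- $\Lambda_n^{1/2}$: risk scale (standard deviation) along that direction
- The risk attributable to the $n$-th component, $R_n$, is:
  
  \[ R_n = \frac{\Lambda_n \beta_n^2}{\sigma^2} \]
- The entries of the risk contribution vector $R$ sum to 1, so it can be interpreted as a distribution of risk across the orthogonal components.
For a user-defined risk distribution $R$, $\beta$ is the allocation in the new orthogonal basis.

\[ \beta = \left\{ \sigma \sqrt{\frac{R_n}{\Lambda_n}} \right\}_{n=1,...,N} \]

The allocation in the original basis is then given by $\omega = W\beta$.
- Re-scaling $\omega$ merely re-scales $\sigma$, so the risk distribution stays constant.

In the inverse-variance portfolio almost all components contribute risk. In contrast, in the PCA portfolio (with all risk assigned to the last component) only the component with the lowest variance contributes risk.

The following function computes the PCA weights from a risk distribution R.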

pca_weights <- function(cov_mat, riskDist = NULL, riskTarget = 1) {
  # Eigendecomposition (eigen() already returns eigenvalues in decreasing order)
  eig <- eigen(cov_mat, symmetric = TRUE)
  values <- eig$values
  vectors <- eig$vectors

  # Explicit descending sort, kept as a safeguard
  idx <- order(values, decreasing = TRUE)
  lambda <- values[idx]
  V <- vectors[, idx]
  
  N <- length(lambda)

  # Risk distribution (default: all risk on the last, lowest-variance PC)
  if (is.null(riskDist)) {
    riskDist <- rep(0, N)
    riskDist[N] <- 1
  }
  
  # beta: size of the bet along each principal component direction
  beta <- riskTarget * sqrt(riskDist / lambda)
  
  # Weights: map beta back to the asset space
  omega <- V %*% beta
  
  return(as.vector(omega))
}
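
The derivation can be checked numerically. The sketch below is self-contained and does not call `pca_weights()`; it repeats the same steps on a small covariance matrix whose values are invented for illustration, then confirms that the realized risk contributions match the requested distribution. The names `risk_dist`, `port_var`, and `risk_contrib` are assumptions of this example.

```r
# Toy 3 x 3 covariance matrix (hypothetical numbers, positive definite)
cov_mat <- matrix(c(0.04, 0.01, 0.00,
                    0.01, 0.09, 0.02,
                    0.00, 0.02, 0.16), nrow = 3, byrow = TRUE)

# Spectral decomposition V = W Lambda W^T
eig <- eigen(cov_mat, symmetric = TRUE)
lambda <- eig$values
W <- eig$vectors

# Requested risk distribution R and target risk sigma
risk_dist <- c(0.5, 0.3, 0.2)
sigma <- 1

# Allocation in the orthogonal basis, then back in the asset basis
beta <- sigma * sqrt(risk_dist / lambda)
omega <- as.vector(W %*% beta)

# Realized portfolio variance and risk contribution per component
port_var <- as.numeric(t(omega) %*% cov_mat %*% omega)
risk_contrib <- lambda * as.vector(t(W) %*% omega)^2 / port_var

round(risk_contrib, 6)
```

If the algebra holds, `port_var` equals `sigma^2` and `risk_contrib` reproduces `risk_dist` up to floating-point error.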

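2.3 Single Future Roll

Futures contracts expire, so it is difficult to analyze a single contract month over a long period. The data of several contract months therefore has to be stitched together (‘rolling’), and handling the price gaps that arise in the process is important. The key point of this section is to preprocess futures data into a continuous, analyzable time series.

Professor López de Prado treats the roll of a single futures contract as a particular case of the ETF trick, and notes that it can be handled by forming a time series of cumulative roll gaps and subtracting it from the raw prices.

When a futures contract is replaced by another contract month, the price is usually not continuous but jumps; this jump is the roll gap. Accumulating these gaps yields a time series aligned with the price series. Subtracting the cumulative gap series from the original futures price series produces a ‘continuous price series’ with the artificial jumps removed.

2.3.1 Basic Functions

The following functions implement this.

roll_gaps(): builds the cumulative roll gap series.
- match_end = T: backward roll; the end of the rolled series matches the end of the raw series
- match_end = F: forward roll; the start of the rolled series matches the start of the raw series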

roll_gaps <- function(df,
                      inst_col = "Instrument",
                      open_col = "Open",
                      close_col = "Close",
                      match_end = TRUE) {
  
  # Roll dates: timestamp of the first row of each instrument
  roll_dates <- df %>%
    distinct(!!sym(inst_col), .keep_all = TRUE) %>%
    pull(Time)

  # Initialize the gap series
  gaps <- rep(0, nrow(df))
  names(gaps) <- df$Time
  
  # Gap = open of the new contract minus the last close before the roll
  # (seq_along()[-1] skips the first contract and is empty when there is no roll)
  for (i in seq_along(roll_dates)[-1]) {
    t_now <- roll_dates[i]
    idx_now <- which(df$Time == t_now)[1]
    idx_prev <- idx_now - 1
    
    open_now <- df[[open_col]][idx_now]
    close_prev <- df[[close_col]][idx_prev]
    
    gaps[idx_now] <- open_now - close_prev
  }
  
  # Cumulative gaps
  cum_gaps <- cumsum(gaps)
  
  # match_end: shift so that the last value matches the raw price (backward roll)
  if (match_end) {
    cum_gaps <- cum_gaps - cum_gaps[length(cum_gaps)]
  }
  
  return(cum_gaps)

}

get_rolled_series(): subtracts the cumulative roll gaps from the raw series to produce a continuous series.

get_rolled_series <- function(df, 
                              inst_col = "Instrument",
                              fields = c("Close", "VWAP"),
                              match_end = TRUE) {
  df <- df %>%
    arrange(Time)
  
  # Compute the cumulative roll gap series
  gaps <- roll_gaps(df, inst_col = inst_col, 
                    open_col = "Open", close_col = "Close", 
                    match_end = match_end)
  
  # Remove the gaps
  for (fld in fields) {
    df[[fld]] <- df[[fld]] - gaps
  }
  
  return(df)
}

# Example data frame
df <- tibble(
  Time = as.POSIXct(seq(1, 10, 1), origin = "2023-01-01"),
  Instrument = c(rep("F1", 5), rep("F2", 5)),
  Open = c(100, 101, 102, 103, 104, 98, 99, 100, 101, 102),
  Close = c(101, 102, 103, 104, 105, 99, 100, 101, 102, 103),
  VWAP = c(100.5, 101.5, 102.5, 103.5, 104.5, 98.5, 99.5, 100.5, 101.5, 102.5)
)

get_rolled_series(df, match_end = T)

## # A tibble: 10 × 5
##    Time                Instrument  Open Close  VWAP
##    <dttm>              <chr>      <dbl> <dbl> <dbl>
##  1 2023-01-01 09:00:01 F1           100    94  93.5
##  2 2023-01-01 09:00:02 F1           101    95  94.5
##  3 2023-01-01 09:00:03 F1           102    96  95.5
##  4 2023-01-01 09:00:04 F1           103    97  96.5
##  5 2023-01-01 09:00:05 F1           104    98  97.5
##  6 2023-01-01 09:00:06 F2            98    99  98.5
##  7 2023-01-01 09:00:07 F2            99   100  99.5
##  8 2023-01-01 09:00:08 F2           100   101 100. 
##  9 2023-01-01 09:00:09 F2           101   102 102. 
## 10 2023-01-01 09:00:10 F2           102   103 102.
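
To make the difference between the two `match_end` settings concrete, here is a small standalone sketch that applies both adjustments to the same close prices as the example above. It works directly on a gap vector instead of calling `roll_gaps()`, and the names `forward_close` and `backward_close` are introduced only for this illustration.

```r
# Raw close prices with a roll between bar 5 and bar 6
close <- c(101, 102, 103, 104, 105, 99, 100, 101, 102, 103)

# Roll gap at bar 6: new open (98) minus previous close (105)
gaps <- c(0, 0, 0, 0, 0, 98 - 105, 0, 0, 0, 0)
cum_gaps <- cumsum(gaps)

# Forward roll: the start of the series matches the raw prices
forward_close <- close - cum_gaps

# Backward roll: the end of the series matches the raw prices
backward_close <- close - (cum_gaps - cum_gaps[length(cum_gaps)])

rbind(forward_close, backward_close)
```

Both adjusted series are free of the jump at the roll and differ only by a constant shift, which is why the choice mainly depends on whether the oldest or the most recent prices should stay on their raw level.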

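2.3.2 When a Negative Series Is Produced

Futures prices normally cannot be negative. However, when a contract in contango sells off persistently, subtracting the cumulative roll gaps can push the rolled series below zero. A negative series causes problems for several reasons, so it has to be converted into a non-negative one. Professor López de Prado proposes the following procedure.

Compute a time series of rolled futures prices.
Compute the return (r) as the rolled price change divided by the previous raw price.
Form a price series using those returns.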
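
Written out, with $\tilde{P}_t$ denoting the rolled price, $P_t$ the raw price, and $\hat{P}_t$ the rebuilt series (notation introduced here only for illustration), the three steps amount to:

\[ r_t = \frac{\tilde{P}_t - \tilde{P}_{t-1}}{P_{t-1}}, \qquad \hat{P}_t = \prod_{s \le t} (1 + r_s) \]

The rebuilt series $\hat{P}_t$ stays positive as long as every return satisfies $r_t > -1$.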

# 1. Load raw data (example)
raw <- tibble::tibble(
  Time = as.Date('2023-01-01') + 0:4,
  Instrument = c("F1", "F1", "F2", "F2", "F2"),
  Open = c(100, 101, 105, 104, 103),
  Close = c(101, 102, 99, 97, 95)
)

# 2. Build the rolled series (match_end = FALSE because returns are computed over the whole history)
rolled <- get_rolled_series(raw, fields = "Close", match_end = F)

# 3. Compute returns: rolled price change divided by the previous raw price
rolled$Return <- c(0, diff(rolled$Close) / dplyr::lag(raw$Close)[-1])

# 4. Rebuild a non-negative price series from cumulative returns
rolled$rPrices <- cumprod(rolled$Return+1)

rolled

## # A tibble: 5 × 6
##   Time       Instrument  Open Close   Return rPrices
##   <date>     <chr>      <dbl> <dbl>    <dbl>   <dbl>
## 1 2023-01-01 F1           100   101  0         1    
## 2 2023-01-02 F1           101   102  0.00990   1.01 
## 3 2023-01-03 F2           105    96 -0.0588    0.950
## 4 2023-01-04 F2           104    94 -0.0202    0.931
## 5 2023-01-05 F2           103    92 -0.0206    0.912

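3. Sampling Features

3.1 Sampling for Reduction

So far we have seen how to produce a continuous, homogeneous, and structured dataset from unstructured financial data. Feeding such a dataset directly into a machine learning algorithm can still cause problems, the most typical being a sample that is too large. The remedy is to reduce the amount of data used to fit the algorithm, which is called downsampling.

linspace sampling: sample sequentially at a constant step size
uniform sampling: sample randomly using a uniform distribution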
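
The two downsampling schemes can be sketched in a few lines of base R. The sample sizes (`n_obs`, `n_keep`) and the seed are arbitrary choices for this example; the resulting index vectors would be used to subset the rows of a bar data frame.

```r
set.seed(42)    # arbitrary seed, only to make the random draw repeatable

n_obs  <- 1000  # number of rows in the full dataset (assumed)
n_keep <- 50    # number of rows to keep (assumed)

# linspace sampling: evenly spaced row indices from the first to the last row
linspace_idx <- unique(round(seq(1, n_obs, length.out = n_keep)))

# uniform sampling: random row indices drawn without replacement, kept in time order
uniform_idx <- sort(sample.int(n_obs, n_keep))

head(linspace_idx)
head(uniform_idx)
```

Neither scheme looks at the content of the observations, which is the motivation for the event-based sampling discussed next.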

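3.2 Event-Based Sampling

Portfolio managers typically place a bet after some event takes place, such as a structural break, an extracted signal, or a microstructural phenomenon. It would be useful if an algorithm could learn from these events. The following is the method proposed by Professor López de Prado.

3.2.1 The CUSUM Filter

The CUSUM (cumulative sum) filter is a quality-control method designed to detect a shift in the mean of a measured quantity away from a target value.

Consider IID observations $\{y_t\}_{t=1,...,T}$ arising from a locally stationary process.

\[ S_t = \max \{0,~S_{t-1} + y_t - \mathbb{E}_{t-1}[y_t] \} \]

The filter identifies the first $t$ for which $S_t \ge h$, for a given threshold $h$.
The boundary condition is $S_0 = 0$, and the zero floor keeps $S_t$ from becoming negative.
$\mathbb{E}_{t-1}[y_t]$: the expectation formed from information up to the previous bar. In practice $y_{t-1}$ is commonly used.
$y_t - \mathbb{E}_{t-1}[y_t]$: the excess of the observation over its expected value

The threshold is activated in the following case.

\[ S_t \ge h \Leftrightarrow \exists \tau \in [1, t]\left| \sum_{i=\tau}^t (y_i - \mathbb{E}_{i-1}[y_i]) \ge h \right. \]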
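
A minimal sketch of this one-sided filter in R follows. It assumes $\mathbb{E}_{t-1}[y_t] = y_{t-1}$ and resets the cumulative sum after each event; both are choices made for this illustration, and `one_sided_cusum` is a hypothetical helper name, not a function from the book or from any package.

```r
# One-sided (upward) CUSUM filter: returns the indices t where S_t >= h
one_sided_cusum <- function(y, h) {
  s <- 0
  events <- integer(0)
  for (t in seq_along(y)[-1]) {
    # S_t = max(0, S_{t-1} + y_t - E_{t-1}[y_t]), with E_{t-1}[y_t] = y_{t-1} (assumed)
    s <- max(0, s + y[t] - y[t - 1])
    if (s >= h) {
      events <- c(events, t)
      s <- 0  # reset after an event (assumed)
    }
  }
  events
}

# Toy series: a rise, a fall that the zero floor absorbs, then a sharp rise
y <- c(0, 1, 2, 1, 0, 2, 4)
one_sided_cusum(y, h = 2)
# → 3 6 7
```

The fall between observations 3 and 5 never triggers an event because the zero floor discards downward deviations, which is what motivates the symmetric version.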

This can be extended to include run-ups and run-downs, which gives the symmetric CUSUM filter.

\[ \begin{aligned} S_t^+ &= \max \{0,~S_{t-1}^+ + y_t - \mathbb{E}_{t-1}[y_t] \}, & S_0^+ &= 0 \\ S_t^- &= \min \{0,~S_{t-1}^- + y_t - \mathbb{E}_{t-1}[y_t] \}, & S_0^- &= 0 \\ S_t &= \max \{S_t^+, -S_t^-\} \end{aligned} \]

At each time where $S_t \ge h$, the sampling timestamp is stored and the cumulative sum is reset.
get_t_events(): the symmetric CUSUM filter function

get_t_events <- function(gRaw, h) {
  # gRaw: numeric vector or time series of raw values (e.g. log prices)
  # h: threshold (positive)
  
  # First differences, i.e. y_t - y_{t-1}
  delta <- diff(gRaw)
  tEvents <- c()
  sPos <- 0
  sNeg <- 0

  # delta has length n - 1, so delta[i] corresponds to observation i + 1
  for (i in seq_along(delta)) {
    sPos <- max(0, sPos + delta[i])
    sNeg <- min(0, sNeg + delta[i])
    
    if (sNeg < -h) {
      sNeg <- 0
      tEvents <- c(tEvents, i + 1)  # index correction (diff is one element shorter)
    } else if (sPos > h) {
      sPos <- 0
      tEvents <- c(tEvents, i + 1)
    }
  }

  return(tEvents)  # can be mapped to a time index if the series has one
}

The DAX column of datasets::EuStockMarkets is used as example data.

price <- datasets::EuStockMarkets[, 1]

# get_t_events() differences its input, so pass log prices to filter on log returns
event_idx <- get_t_events(log(price), h = 0.05)  # threshold = 0.05
event_price <- price[event_idx]

plot(price, type = "l", main = "The Symmetric CUSUM filter")
points(time(price)[event_idx], event_price, col = "red", pch = 16)
legend("topleft",
       legend = c("Price", "Sampled Observations"),
       col = c("black", "red"),
       lty = c(1, NA),
       pch = c(NA, 16))

Advances in Financial Machine Learning - Part 1.2 Financial Data Structures(M.lopez.de.prado)

lumpen95

2025-08-06