Synopsis

In their book Forecasting: Principles and Practice, Rob Hyndman and George Athanasopoulos state that forecasting is a difficult activity, and that businesses that do it well have a big advantage over those whose forecasts fail.

This analysis uses some of the most advanced methods for producing forecasts. The emphasis will be on methods that are replicable and testable, and have been shown to work across various business sectors and different contexts.

Time Series Forecasting

A time series is a discrete or continuous sequence of observations that depend on time. Time is an important feature in natural processes such as air temperature, the beating of the heart, or waves crashing on a sandy beach.

The mechanism generating a time series yt is often thought to consist of three components: a seasonal component St, a trend component Tt, and a random error or remainder component et. Provided that the seasonal variation around the trend-cycle does not vary with the level of the time series, the following additive decomposition holds: \[y_{t} = S_{t} + T_{t} + e_{t}\]
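When the seasonal variation does vary with the level of the series, a multiplicative decomposition is more appropriate: \[y_{t} = S_{t} \times T_{t} \times e_{t}\] In practice this case is often handled by log-transforming the series, which turns the multiplicative relationship into an additive one.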

Of course, there are time series that are irregular, with no distinct trend or seasonal pattern, which makes them much more difficult to forecast. Fortunately, many state-of-the-art algorithms such as XGBoost, Prophet, and others enable data scientists to handle these constraints.

For example, XGBoost’s gradient boosting algorithm works on a given data set with n examples and m features, D = {(xi, yi)} (|D| = n, xi ∈ R^m, yi ∈ R). A tree ensemble model such as the one listed below uses K additive functions to predict the output:

\[\hat{y}_{i} = \phi(x_{i}) = \sum_{k=1}^{K}f_{k}(x_{i}), \quad f_{k} \in F,\]

where F = {f(x) = w_q(x)} (q : R^m → T, w ∈ R^T) is the space of regression trees (also known as classification and regression trees, or CART).

Here q represents the structure of each tree, mapping an example to the corresponding leaf index, and T is the number of leaves in the tree. Each fk corresponds to an independent tree structure q and leaf weights w. Unlike decision trees, each regression tree contains a continuous score on each leaf, with wi representing the score on the i-th leaf.

By using the decision rules in the trees (given by q), the algorithm classifies an example into the leaves and calculates the final prediction by summing up the scores in the corresponding leaves (given by w).
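For completeness, the same paper learns these functions by minimising a regularised objective, where the second term penalises the complexity of each tree (its number of leaves T and the magnitude of its leaf weights w):

\[L(\phi) = \sum_{i} l(\hat{y}_{i}, y_{i}) + \sum_{k} \Omega(f_{k}), \qquad \Omega(f) = \gamma T + \tfrac{1}{2}\lambda \lVert w \rVert^{2}\]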

This illustrates how XGBoost runs under the hood, regardless of the programming context.

1. Load required libraries and set up parallel processing compute nodes

## [1] 24
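The setup chunk itself is not shown; a minimal sketch of what it might look like, assuming the tidymodels/modeltime stack used throughout this blueprint (the output above suggests 24 cores were detected):

```r
# Minimal sketch of the setup step (package list and helper choices are assumptions)
library(tidyverse)
library(tidymodels)
library(modeltime)
library(timetk)

# Detect available cores and register a parallel backend for fitting and tuning
cores <- parallel::detectCores()
cores
parallel_start(cores)  # modeltime helper that registers a doParallel backend
```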

2. Pre-processing Pipeline - Load, Transform & Wrangle

## # A tibble: 110,710 × 8
##    account_number stock_ticker country filled_quantity status    filled_price
##             <dbl> <chr>        <chr>             <dbl> <chr>            <dbl>
##  1        3433920 AML          UK                 -150 FILLED            18.0
##  2        5051374 AML          UK                  -10 FILLED            18.0
##  3        9098558 AML          UK                  550 FILLED            18.3
##  4        8692010 AML          UK                   -8 FILLED            18.0
##  5        5728512 AML          UK                  -10 FILLED            20.6
##  6        5728512 AML          UK                  -10 FILLED            20.5
##  7        4964040 AML          UK                    0 CANCELLED         NA  
##  8        5728512 AML          UK                  -10 FILLED            20.7
##  9        6303034 AML          UK                    8 FILLED            18.2
## 10        4801054 AML          UK                    0 CANCELLED         NA  
## # ℹ 110,700 more rows
## # ℹ 2 more variables: time_created <dttm>, time_executed <dttm>

Prior to modeling, the raw data requires transformation and wrangling. There are several categories (groups). The first (parent) group is “country”, with two distinct labels, UK and EU; then there is the status column with three sub-categories - “filled”, “cancelled”, and “new”; and finally the stock ticker identifier column containing six individual stocks.

There are ways to perform multi-categorical forecasts, but they tend to perform poorly in terms of accuracy as the number of group categories grows. Therefore, the suggested approach going forward is to develop ensemble and nested forecasting methods to process the panel data.

Moreover, the target variable contains positive and negative integers, which correspond to two additional groups - the number of shares bought and the number sold.

Due to my limited knowledge of finance and equity markets, I will make an assumption and split the target variable column in two - negative values will be flagged as bought, and positive values will be flagged as sold. This segregation could help with identifying patterns in how equity shares are traded.

Because of time constraints, this blueprint examines stocks that are “sold” within the UK. An identical analysis is performed using the same approach on the remaining categories (bought - UK, and bought/sold - EU), but it is not presented in this report.

2.1. External Regressors

Visualizing the time series signatures indicates that there are recurring patterns at hourly, daily and weekly intervals. The following code chunk generates events that are introduced downstream during modeling as external regressor features.

## # A tibble: 9 × 2
##   event_time_executed reg_event
##   <dttm>                  <dbl>
## 1 2021-07-05 10:00:00         1
## 2 2021-07-12 10:00:00         1
## 3 2021-07-19 10:00:00         1
## 4 2021-07-26 10:00:00         1
## 5 2021-08-02 10:00:00         1
## 6 2021-08-09 10:00:00         1
## 7 2021-08-16 10:00:00         1
## 8 2021-08-23 10:00:00         1
## 9 2021-08-30 10:00:00         1
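The event-generating chunk is not shown; a sketch of how such a table could be built, assuming lubridate and a weekly recurring event at 10:00 starting on 2021-07-05 (object names are hypothetical):

```r
library(lubridate)

# One recurring weekly event at 10:00, spanning the sample period
event_tbl <- tibble(
  event_time_executed = seq(
    from = ymd_hms("2021-07-05 10:00:00"),
    to   = ymd_hms("2021-08-30 10:00:00"),
    by   = "1 week"
  ),
  reg_event = 1
)
```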

2.2. Wrangle & Transform

## # A tibble: 109,567 × 4
## # Groups:   time_executed, stock_ticker [44,428]
##    stock_ticker  sold bought time_executed      
##    <fct>        <int>  <int> <dttm>             
##  1 AML             NA    150 2021-07-19 11:34:00
##  2 AML             NA     10 2021-07-19 11:34:00
##  3 AML            550     NA 2021-07-08 10:05:00
##  4 AML             NA      8 2021-07-19 11:34:00
##  5 AML             NA     10 2021-08-06 15:32:00
##  6 AML             NA     10 2021-08-06 10:25:00
##  7 AML             NA     10 2021-08-12 10:04:00
##  8 AML              8     NA 2021-07-08 17:02:00
##  9 AML             NA     10 2021-07-20 16:02:00
## 10 AML             NA     20 2021-08-06 10:02:00
## # ℹ 109,557 more rows

Considering the limited data sample, the optimal way of aggregating the date and target variables is in daily increments. This reduces the available observations to just 258, but it greatly reduces variance compared to second, minute and hour increments.

This step is a trade-off. The aggregation groups orders by day instead of by minute or hour, so most of the intra-day seasonality is removed - periods with multiple orders within a single hour are summed up for the whole day rather than being evaluated as unique events.

Alternatively, avoiding the daily aggregation results in variance that is too high for a machine learning model to handle accurately, especially in data this volatile.

The optimal solution would be to obtain a much larger sample covering one or two years of data and to aggregate filled orders in daily increments. That way a trend and additional seasonal patterns could emerge that would be suitable as external regressors, making the forecast more accurate.
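A sketch of the daily aggregation that produces the tibble below, assuming timetk's summarise_by_time() and hypothetical object names (wrangled_tbl, event_tbl):

```r
daily_tbl <- wrangled_tbl %>%
  group_by(stock_ticker) %>%
  # collapse filled orders to daily totals per stock
  summarise_by_time(
    .date_var = time_executed,
    .by       = "day",
    sold      = sum(sold, na.rm = TRUE)
  ) %>%
  ungroup() %>%
  # flag the days on which an external-regressor event falls
  left_join(
    event_tbl %>%
      transmute(time_executed = floor_date(event_time_executed, "day"), reg_event),
    by = "time_executed"
  ) %>%
  mutate(reg_event = replace_na(reg_event, 0))
```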

## # A tibble: 258 × 4
##    time_executed       stock_ticker  sold reg_event
##    <dttm>              <fct>        <int>     <dbl>
##  1 2021-07-01 00:00:00 AML            263         0
##  2 2021-07-01 00:00:00 CCL            192         0
##  3 2021-07-01 00:00:00 HSBA           213         0
##  4 2021-07-01 00:00:00 INRG          1648         0
##  5 2021-07-01 00:00:00 JDW            114         0
##  6 2021-07-01 00:00:00 OCDO           169         0
##  7 2021-07-02 00:00:00 AML            193         0
##  8 2021-07-02 00:00:00 CCL            157         0
##  9 2021-07-02 00:00:00 HSBA           421         0
## 10 2021-07-02 00:00:00 INRG          1133         0
## # ℹ 248 more rows

2.3. Inspect temporal dynamics

The variance appears to be quite high between the different groups. This can pose challenges during the modeling phase. However, variance-reduction techniques can help.

Generally, ensemble models may struggle with high variance, whereas nested forecasting techniques may prove to be better because they evaluate each time series group separately. Another blueprint reviews this possibility.
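The temporal summary below most likely comes from timetk; a one-line sketch (hypothetical object name):

```r
daily_tbl %>%
  tk_summary_diagnostics(.date_var = time_executed)
```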

## # A tibble: 1 × 12
##   n.obs start               end                 units scale  tzone diff.minimum
##   <int> <dttm>              <dttm>              <chr> <chr>  <chr>        <dbl>
## 1   258 2021-07-01 00:00:00 2021-08-31 00:00:00 secs  second UTC              0
## # ℹ 5 more variables: diff.q1 <dbl>, diff.median <dbl>, diff.mean <dbl>,
## #   diff.q3 <dbl>, diff.maximum <dbl>

2.4. Full Data tbl

## # A tibble: 300 × 20
##    rowid stock_ticker time_executed        sold time_executed_sin2_K1
##    <int> <fct>        <dttm>              <dbl>                 <dbl>
##  1     1 AML          2021-07-01 00:00:00  5.58             -3.86e-13
##  2     2 AML          2021-07-02 00:00:00  5.27             -4.74e-12
##  3     3 AML          2021-07-05 00:00:00  5.11              5.57e-12
##  4     4 AML          2021-07-06 00:00:00  5.75             -3.42e-12
##  5     5 AML          2021-07-07 00:00:00  5.52              1.27e-12
##  6     6 AML          2021-07-08 00:00:00  5.84              8.79e-13
##  7     7 AML          2021-07-09 00:00:00  4.85              4.25e-12
##  8     8 AML          2021-07-12 00:00:00  5.12             -5.08e-12
##  9     9 AML          2021-07-13 00:00:00  5.12              2.93e-12
## 10    10 AML          2021-07-14 00:00:00  4.77             -7.77e-13
## # ℹ 290 more rows
## # ℹ 15 more variables: time_executed_cos2_K1 <dbl>,
## #   time_executed_sin4_K1 <dbl>, time_executed_cos4_K1 <dbl>,
## #   time_executed_sin7_K1 <dbl>, time_executed_cos7_K1 <dbl>,
## #   time_executed_sin14_K1 <dbl>, time_executed_cos14_K1 <dbl>,
## #   time_executed_sin30_K1 <dbl>, time_executed_cos30_K1 <dbl>,
## #   sold_lag7 <dbl>, sold_lag7_roll_2 <dbl>, sold_lag7_roll_4 <dbl>, …

This step takes care of most of the pre-processing and feature engineering workflow. Introducing Fourier features helps greatly in time series forecasting. However, it can be quite difficult to create the “proper” features that are effective for each group and fit the overall temporal dynamics.

The Fourier series gives an expansion of a periodic function f(x) as an infinite sum of sines and cosines. It is used to decompose any periodic function or signal into a sum of simple oscillating functions, and it makes use of the orthogonality relationships of the sine and cosine functions.

Fourier features are defined through this expression:

\[\large f(x)=\frac{1}{2}a_{0}+\sum_{n=1}^{\infty}a_{n}\cos nx+\sum_{n=1}^{\infty}b_{n}\sin nx\]

Moreover, the addition of lags further helps stabilize the mapping of the seasonal patterns and boosts model accuracy.

A “lag” is a fixed amount of passing time: one set of observations in a time series is plotted (lagged) against a second, later set of data. The kth lag is the value observed k time points earlier. For example: Lag1(Y2) = Y1 and Lag4(Y9) = Y5.
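A sketch of the feature engineering that could yield the Full Data tbl above, assuming timetk helpers, a seven-day horizon, and a log transform of the target (object names and parameter choices are assumptions, not the exact chunk):

```r
horizon <- 7  # assumed forecast horizon in days

full_data_tbl <- daily_tbl %>%
  mutate(sold = log1p(sold)) %>%   # assumed variance-stabilising transform
  group_by(stock_ticker) %>%
  # extend each stock's series by the horizon so future rows can be predicted later
  future_frame(.date_var = time_executed, .length_out = horizon, .bind_data = TRUE) %>%
  # Fourier terms for the 2, 4, 7, 14 and 30-day cycles (K = 1 harmonics)
  tk_augment_fourier(.date_var = time_executed, .periods = c(2, 4, 7, 14, 30), .K = 1) %>%
  # 7-day lag of the target plus rolling means of that lag
  tk_augment_lags(.value = sold, .lags = horizon) %>%
  tk_augment_slidify(
    .value   = sold_lag7,
    .f       = mean,
    .period  = c(2, 4, 7, 14, 30),
    .align   = "center",
    .partial = TRUE
  ) %>%
  ungroup() %>%
  rowid_to_column(var = "rowid")
```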

2.5. Data Prepared

## # A tibble: 216 × 20
##    rowid stock_ticker time_executed        sold time_executed_sin2_K1
##    <int> <fct>        <dttm>              <dbl>                 <dbl>
##  1     8 AML          2021-07-12 00:00:00  5.12             -5.08e-12
##  2     9 AML          2021-07-13 00:00:00  5.12              2.93e-12
##  3    10 AML          2021-07-14 00:00:00  4.77             -7.77e-13
##  4    11 AML          2021-07-15 00:00:00  5.20              5.90e-12
##  5    12 AML          2021-07-16 00:00:00  5.02             -3.75e-12
##  6    13 AML          2021-07-19 00:00:00  6.28              4.58e-12
##  7    14 AML          2021-07-20 00:00:00  6.00             -2.43e-12
##  8    15 AML          2021-07-21 00:00:00  5.06              2.84e-13
##  9    16 AML          2021-07-22 00:00:00  5.12             -5.41e-12
## 10    17 AML          2021-07-23 00:00:00  4.93              3.26e-12
## # ℹ 206 more rows
## # ℹ 15 more variables: time_executed_cos2_K1 <dbl>,
## #   time_executed_sin4_K1 <dbl>, time_executed_cos4_K1 <dbl>,
## #   time_executed_sin7_K1 <dbl>, time_executed_cos7_K1 <dbl>,
## #   time_executed_sin14_K1 <dbl>, time_executed_cos14_K1 <dbl>,
## #   time_executed_sin30_K1 <dbl>, time_executed_cos30_K1 <dbl>,
## #   sold_lag7 <dbl>, sold_lag7_roll_2 <dbl>, sold_lag7_roll_4 <dbl>, …

A tibble without empty rows, containing the engineered features.

2.6. Time Split

Initial time split - approx 60/40.
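A sketch of this split, assuming timetk's time_series_split() with an assessment window chosen to give roughly 40% of the prepared data to the test set (the window length is an assumption):

```r
splits <- data_prepared_tbl %>%
  time_series_split(
    date_var   = time_executed,
    assess     = "3 weeks",   # assumed test window (~40% of the sample)
    cumulative = TRUE         # everything before the test window goes to training
  )
```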

2.7. Future Data

## # A tibble: 42 × 20
##    rowid stock_ticker time_executed        sold time_executed_sin2_K1
##    <int> <fct>        <dttm>              <dbl>                 <dbl>
##  1    44 AML          2021-09-01 00:00:00    NA              4.60e-12
##  2    45 AML          2021-09-02 00:00:00    NA             -2.45e-12
##  3    46 AML          2021-09-03 00:00:00    NA              3.04e-13
##  4    47 AML          2021-09-04 00:00:00    NA             -5.43e-12
##  5    48 AML          2021-09-05 00:00:00    NA              3.28e-12
##  6    49 AML          2021-09-06 00:00:00    NA             -1.13e-12
##  7    50 AML          2021-09-07 00:00:00    NA             -1.02e-12
##  8    94 CCL          2021-09-01 00:00:00    NA              4.60e-12
##  9    95 CCL          2021-09-02 00:00:00    NA             -2.45e-12
## 10    96 CCL          2021-09-03 00:00:00    NA              3.04e-13
## # ℹ 32 more rows
## # ℹ 15 more variables: time_executed_cos2_K1 <dbl>,
## #   time_executed_sin4_K1 <dbl>, time_executed_cos4_K1 <dbl>,
## #   time_executed_sin7_K1 <dbl>, time_executed_cos7_K1 <dbl>,
## #   time_executed_sin14_K1 <dbl>, time_executed_cos14_K1 <dbl>,
## #   time_executed_sin30_K1 <dbl>, time_executed_cos30_K1 <dbl>,
## #   sold_lag7 <dbl>, sold_lag7_roll_2 <dbl>, sold_lag7_roll_4 <dbl>, …

This neat function preps the future tibble to store the predictions.
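Assuming the function in question is timetk's future_frame() used with .bind_data = TRUE (as sketched earlier), the future tibble that will hold the predictions can be recovered with a simple filter:

```r
future_tbl <- full_data_tbl %>%
  filter(is.na(sold))   # rows beyond the observed data, where the target is still empty
```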

3. Recipe

The normalized time series look completely different. Peak periods are now more or less within the same range as the rest of the observed values, which reduces the variance significantly. This process is reversible.

## Rows: 132
## Columns: 64
## $ rowid                      <int> 8, 58, 108, 158, 208, 258, 9, 59, 109, 159,…
## $ time_executed              <dttm> 2021-07-12, 2021-07-12, 2021-07-12, 2021-0…
## $ time_executed_sin2_K1      <dbl> -5.075826e-12, -5.075826e-12, -5.075826e-12…
## $ time_executed_cos2_K1      <dbl> 1, 1, 1, 1, 1, 1, -1, -1, -1, -1, -1, -1, 1…
## $ time_executed_sin4_K1      <dbl> -2.537913e-12, -2.537913e-12, -2.537913e-12…
## $ time_executed_cos4_K1      <dbl> 1.000000e+00, 1.000000e+00, 1.000000e+00, 1…
## $ time_executed_sin7_K1      <dbl> -4.338837e-01, -4.338837e-01, -4.338837e-01…
## $ time_executed_cos7_K1      <dbl> -0.9009689, -0.9009689, -0.9009689, -0.9009…
## $ time_executed_sin14_K1     <dbl> 9.749279e-01, 9.749279e-01, 9.749279e-01, 9…
## $ time_executed_cos14_K1     <dbl> -0.2225209, -0.2225209, -0.2225209, -0.2225…
## $ time_executed_sin30_K1     <dbl> 0.8660254, 0.8660254, 0.8660254, 0.8660254,…
## $ time_executed_cos30_K1     <dbl> -0.5000000, -0.5000000, -0.5000000, -0.5000…
## $ sold_lag7                  <dbl> 5.575949, 5.262690, 5.365976, 7.407924, 4.7…
## $ sold_lag7_roll_2           <dbl> 5.421904, 5.162643, 5.705491, 7.220715, 4.4…
## $ sold_lag7_roll_4           <dbl> 5.316584, 5.236593, 5.603048, 7.265904, 4.5…
## $ sold_lag7_roll_7           <dbl> 5.424786, 5.252024, 5.737795, 7.332048, 4.6…
## $ sold_lag7_roll_14          <dbl> 5.378408, 5.588113, 5.710531, 7.112832, 4.8…
## $ sold_lag7_roll_30          <dbl> 5.349945, 5.714226, 5.554215, 6.926324, 4.9…
## $ sold                       <dbl> 5.117994, 5.476464, 5.645447, 6.975414, 5.4…
## $ time_executed_index.num    <dbl> -1.567231, -1.567231, -1.567231, -1.567231,…
## $ time_executed_half         <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ time_executed_quarter      <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3…
## $ time_executed_month        <int> 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7…
## $ time_executed_day          <int> 12, 12, 12, 12, 12, 12, 13, 13, 13, 13, 13,…
## $ time_executed_hour         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ time_executed_minute       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ time_executed_second       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ time_executed_hour12       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ time_executed_am.pm        <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ time_executed_wday         <int> 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4…
## $ time_executed_mday         <int> 12, 12, 12, 12, 12, 12, 13, 13, 13, 13, 13,…
## $ time_executed_qday         <int> 12, 12, 12, 12, 12, 12, 13, 13, 13, 13, 13,…
## $ time_executed_yday         <int> 193, 193, 193, 193, 193, 193, 194, 194, 194…
## $ time_executed_mweek        <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3…
## $ time_executed_week         <int> 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28,…
## $ time_executed_week2        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ time_executed_week3        <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ time_executed_week4        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ time_executed_mday7        <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3…
## $ stock_ticker_AML           <dbl> 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0…
## $ stock_ticker_CCL           <dbl> 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0…
## $ stock_ticker_HSBA          <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1…
## $ stock_ticker_INRG          <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0…
## $ stock_ticker_JDW           <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0…
## $ stock_ticker_OCDO          <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0…
## $ time_executed_month.lbl_01 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ time_executed_month.lbl_02 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ time_executed_month.lbl_03 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ time_executed_month.lbl_04 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ time_executed_month.lbl_05 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ time_executed_month.lbl_06 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ time_executed_month.lbl_07 <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ time_executed_month.lbl_08 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ time_executed_month.lbl_09 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ time_executed_month.lbl_10 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ time_executed_month.lbl_11 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ time_executed_month.lbl_12 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ time_executed_wday.lbl_1   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ time_executed_wday.lbl_2   <dbl> 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ time_executed_wday.lbl_3   <dbl> 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0…
## $ time_executed_wday.lbl_4   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1…
## $ time_executed_wday.lbl_5   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ time_executed_wday.lbl_6   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ time_executed_wday.lbl_7   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…

We end up with 58 features in our recipe spec.
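The recipe itself is not shown; a sketch of a specification that could produce a feature set like the one above (steps and selectors are assumptions, not the exact spec):

```r
recipe_spec <- recipe(sold ~ ., data = training(splits)) %>%
  update_role(rowid, new_role = "indicator") %>%         # keep the row id out of the predictors
  step_timeseries_signature(time_executed) %>%           # calendar features from the timestamp
  step_rm(matches("(.iso$)|(.xts$)|(year)")) %>%          # drop unused calendar fields
  step_normalize(time_executed_index.num) %>%            # centre/scale the numeric time index
  step_dummy(all_nominal_predictors(), one_hot = TRUE)   # one-hot encode ticker and label columns

recipe_spec %>% prep() %>% juice() %>% glimpse()
```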

4. Modeling

## # A tibble: 4 × 9
##   .model_id .model_desc               .type   mae  mape  mase smape  rmse   rsq
##       <int> <chr>                     <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1         4 RANGER                    Test  0.295  6.11 0.299  6.04 0.403 0.812
## 2         2 XGBOOST                   Test  0.317  6.38 0.322  6.58 0.437 0.807
## 3         1 PROPHET W/ REGRESSORS     Test  0.476  9.55 0.483  9.13 0.582 0.774
## 4         3 PROPHET W/ XGBOOST ERRORS Test  0.459  9.06 0.466  9.71 0.612 0.734

Out of the box, all models except Prophet Boost seem to be a good and consistent fit to the testing data. This step shows how important it is to use the proper algorithm for the available data. During the next step, all models are tuned iteratively using a Latin hypercube grid of size 5, which gives a good balance between coverage of the available hyperparameter space and computation time. A grid of 10 is also a good alternative, but it roughly doubles the computation cost.
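For reference, the four baseline workflows compared in the table above might be specified along these lines; the engines follow the model descriptions in the accuracy table, while everything else (recipes, defaults, object names) is assumed:

```r
# Prophet with external regressors
wflw_fit_prophet <- workflow() %>%
  add_model(prophet_reg() %>% set_engine("prophet")) %>%
  add_recipe(recipe_spec) %>%
  fit(training(splits))

# XGBoost (the raw date column is removed because tree models need numeric features)
wflw_fit_xgboost <- workflow() %>%
  add_model(boost_tree(mode = "regression") %>% set_engine("xgboost")) %>%
  add_recipe(recipe_spec %>% step_rm(time_executed)) %>%
  fit(training(splits))

# Prophet with XGBoost-modelled errors
wflw_fit_prophet_boost <- workflow() %>%
  add_model(prophet_boost() %>% set_engine("prophet_xgboost")) %>%
  add_recipe(recipe_spec) %>%
  fit(training(splits))

# Random forest via ranger
wflw_fit_ranger <- workflow() %>%
  add_model(rand_forest(mode = "regression") %>% set_engine("ranger")) %>%
  add_recipe(recipe_spec %>% step_rm(time_executed)) %>%
  fit(training(splits))

submodels_tbl <- modeltime_table(
  wflw_fit_prophet, wflw_fit_xgboost, wflw_fit_prophet_boost, wflw_fit_ranger
)

submodels_tbl %>%
  modeltime_calibrate(new_data = testing(splits)) %>%
  modeltime_accuracy()
```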

5. Hyper Parameter Tuning

Due to the limited sample, 2-fold cross-validation may not be enough to ensure a sufficient number of samples for proper validation and testing of the models.

## 13.92 sec elapsed
## # A tibble: 5 × 12
##    mtry trees min_n tree_depth learn_rate loss_reduction .metric .estimator
##   <int> <int> <int>      <int>      <dbl>          <dbl> <chr>   <chr>     
## 1    51    63     4         14     0.331   0.000846      rmse    standard  
## 2    26  1816    29          2     0.0152  0.00000000223 rmse    standard  
## 3    12   698    16          7     0.157   3.86          rmse    standard  
## 4    15  1127    22          6     0.305   0.000000100   rmse    standard  
## 5    43  1445    35         12     0.192   0.000115      rmse    standard  
## # ℹ 4 more variables: mean <dbl>, n <int>, std_err <dbl>, .config <chr>
## 2.68 sec elapsed
## # A tibble: 5 × 12
##    mtry trees min_n tree_depth learn_rate loss_reduction .metric .estimator
##   <int> <int> <int>      <int>      <dbl>          <dbl> <chr>   <chr>     
## 1    50    63     4         14     0.331   0.000846      rmse    standard  
## 2    12   698    16          7     0.157   3.86          rmse    standard  
## 3    15  1127    22          6     0.305   0.000000100   rmse    standard  
## 4    25  1816    29          2     0.0152  0.00000000223 rmse    standard  
## 5    42  1445    35         12     0.192   0.000115      rmse    standard  
## # ℹ 4 more variables: mean <dbl>, n <int>, std_err <dbl>, .config <chr>
## 1.52 sec elapsed
## # A tibble: 5 × 9
##    mtry trees min_n .metric .estimator  mean     n std_err .config             
##   <int> <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
## 1    39   728    15 rmse    standard   0.396     2  0.0204 Preprocessor1_Model3
## 2    24  1498     9 rmse    standard   0.398     2  0.0338 Preprocessor1_Model5
## 3    50   863    20 rmse    standard   0.402     2  0.0151 Preprocessor1_Model2
## 4    30  1845    27 rmse    standard   0.424     2  0.0145 Preprocessor1_Model1
## 5     1   217    37 rmse    standard   0.741     2  0.0448 Preprocessor1_Model4
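A sketch of the tuning step for the XGBoost workflow; the resample and grid sizes mirror the output above (2 CV folds, a space-filling grid of 5 candidates), the tic()/toc() timing assumes the tictoc package, and the object names are hypothetical:

```r
library(tictoc)

set.seed(123)
resamples_cv <- vfold_cv(training(splits), v = 2)

wflw_spec_xgboost_tune <- workflow() %>%
  add_model(
    boost_tree(
      mode           = "regression",
      mtry           = tune(),
      trees          = tune(),
      min_n          = tune(),
      tree_depth     = tune(),
      learn_rate     = tune(),
      loss_reduction = tune()
    ) %>% set_engine("xgboost")
  ) %>%
  add_recipe(recipe_spec %>% step_rm(time_executed))

tic()
tune_results_xgboost <- wflw_spec_xgboost_tune %>%
  tune_grid(
    resamples = resamples_cv,
    grid      = 5,  # tune_grid builds a space-filling grid over the marked parameters
    metrics   = metric_set(rmse),
    control   = control_grid(verbose = FALSE, allow_par = TRUE)
  )
toc()

tune_results_xgboost %>% show_best(metric = "rmse", n = 5)
```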

6. Evaluate Panel Forecasts

## # A tibble: 7 × 9
##   .model_id .model_desc               .type   mae  mape  mase smape  rmse   rsq
##       <int> <chr>                     <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1         3 RANGER - Tuned            Test  0.286  5.74 0.290  5.77 0.381 0.830
## 2         7 RANGER                    Test  0.295  6.11 0.299  6.04 0.403 0.812
## 3         2 XGBOOST - Tuned           Test  0.300  6.01 0.305  6.16 0.416 0.812
## 4         5 XGBOOST                   Test  0.317  6.38 0.322  6.58 0.437 0.807
## 5         4 PROPHET W/ REGRESSORS     Test  0.476  9.55 0.483  9.13 0.582 0.774
## 6         6 PROPHET W/ XGBOOST ERRORS Test  0.459  9.06 0.466  9.71 0.612 0.734
## 7         1 PROPHET - Tuned           Test  0.502 10.0  0.510 10.8  0.642 0.743

Good accuracy for the first four models on the testing set. Using the tune package to automate this process is a precious time saver, as it spares the data scientist from manually running and tweaking multiple iterations with different sets of hyperparameters.

Generally, most advanced machine learning models are able to fit the testing data well. This, however, is not a clear indication that they will be able to forecast accurately. For that reason, resampling can be evaluated on different time intervals to assess model performance further into the future.

It is also clear that all of the models have difficulty “catching up” with the big spike on the last day of the time series in the HSBA group.

This could be caused by month-end closure activities, which could potentially be captured by an external regressor feature in the pre-processing pipeline, given that more data is available.
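For reference, the panel evaluation above could be produced by combining the tuned and untuned workflows into one modeltime table and calibrating on the test split with the panel identifier; a sketch with hypothetical object names:

```r
submodels_all_tbl <- combine_modeltime_tables(
  submodels_tbl,
  modeltime_table(wflw_fit_prophet_tuned, wflw_fit_xgboost_tuned, wflw_fit_ranger_tuned)
)

calibration_all_tbl <- submodels_all_tbl %>%
  modeltime_calibrate(new_data = testing(splits), id = "stock_ticker")

calibration_all_tbl %>%
  modeltime_accuracy() %>%
  arrange(rmse)
```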

7. Resampling - Assess the stability of our models over time and help strategize for an ensemble approach

Due to the limited data sample, only 2 resamples are used for evaluation. For long term forecasts, 5-8 resamples are recommended.
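A sketch of the resampling step, assuming modeltime.resample and two time-series CV slices (the assessment window and object names are assumptions):

```r
library(modeltime.resample)

resamples_tscv <- time_series_cv(
  data        = data_prepared_tbl,
  date_var    = time_executed,
  assess      = "2 weeks",   # assumed assessment window per slice
  skip        = "2 weeks",
  cumulative  = TRUE,
  slice_limit = 2
)

resample_results <- submodels_all_tbl %>%
  modeltime_fit_resamples(
    resamples = resamples_tscv,
    control   = control_resamples(verbose = FALSE, allow_par = TRUE)
  )

resample_results %>%
  modeltime_resample_accuracy(summary_fns = list(mean = mean, sd = sd)) %>%
  arrange(rmse_mean)
```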

## ── Fitting Resamples ────────────────────────────────────────────
## 
## 6.7 sec elapsed
## # A tibble: 7 × 8
##   .model_id .model_desc            .type     n rmse_mean rmse_sd rsq_mean rsq_sd
##       <int> <chr>                  <chr> <int>     <dbl>   <dbl>    <dbl>  <dbl>
## 1         5 XGBOOST                Resa…     2     0.414  0.0468    0.796 0.0908
## 2         2 XGBOOST - Tuned        Resa…     2     0.458  0.0926    0.756 0.126 
## 3         3 RANGER - Tuned         Resa…     2     0.468  0.120     0.814 0.0609
## 4         7 RANGER                 Resa…     2     0.497  0.0711    0.814 0.0227
## 5         1 PROPHET - Tuned        Resa…     2     0.616  0.234     0.748 0.0201
## 6         6 PROPHET W/ XGBOOST ER… Resa…     2     0.719  0.229     0.640 0.0516
## 7         4 PROPHET W/ REGRESSORS  Resa…     2     2.00   0.676     0.212 0.143

A clear pattern emerges when evaluating the resampled accuracy of the models: XGBoost and Ranger sit at the top of the list.

This graph indicates that some of the models are not good, e.g. Prophet with Regressors. This finding would matter even more if the data sample were bigger - the more slices are added (simulating the test of time in the future), the worse this model performs.

It is obvious that accuracy degrades across the board for all models in the second resample slice. This isn’t necessarily bad, as equity forecasting targets shorter future intervals and we may not care whether our model is still accurate 10 weeks or more into the future.

8. Ensemble Panel Model

The fitted ensemble model is the average prediction across the 3 best performing (lowest RMSE) models - XGBoost, XGBoost - Tuned, and Ranger - Tuned.
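A sketch of the ensembling step, assuming modeltime.ensemble; the three member models follow the text above, while the object names are hypothetical:

```r
library(modeltime.ensemble)

ensemble_fit_mean <- modeltime_table(
  wflw_fit_xgboost,         # untuned XGBoost
  wflw_fit_xgboost_tuned,   # tuned XGBoost
  wflw_fit_ranger_tuned     # tuned Ranger
) %>%
  ensemble_average(type = "mean")

modeltime_table(ensemble_fit_mean) %>%
  modeltime_calibrate(new_data = testing(splits), id = "stock_ticker") %>%
  modeltime_accuracy()
# modeltime_accuracy(acc_by_id = TRUE) would break the metrics down per stock_ticker
```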

## # A tibble: 1 × 9
##   .model_id .model_desc               .type   mae  mape  mase smape  rmse   rsq
##       <int> <chr>                     <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1         1 ENSEMBLE (MEAN): 3 MODELS Test  0.289  5.78 0.294  5.90 0.396 0.826

The observed metrics of the ensemble model on the testing set indicate that roughly 82% of the variance in the time series is explained, which is quite remarkable. An RMSE of roughly 0.4 is also very good. With additional feature engineering this metric could be reduced even further.

Tuning the models and ensembling them seems to marginally improve accuracy and reliability compared to using the default hyperparameters. As always - any gains count.

## # A tibble: 6 × 4
##   stock_ticker   mae  rmse     rsq
##   <fct>        <dbl> <dbl>   <dbl>
## 1 AML           41.0  59.8 0.00588
## 2 CCL           27.4  32.8 0.431  
## 3 HSBA         139.  207.  0.00363
## 4 INRG         149.  242.  0.270  
## 5 JDW           12.2  16.2 0.122  
## 6 OCDO          21.8  28.1 0.0273

Predicted values for the next seven-day window for each stock seem to be a good fit to the overall seasonal pattern and, with a few exceptions, they fall within the range of the temporal dynamics of the time series. Boosting the forecast with more features could yield even better results.