Synopsis

In their book Forecasting: Principles and Practice, Rob Hyndman & George Athanasopoulos state that forecasting is a difficult activity, and businesses that do it well have a big advantage over those whose forecasts fail.

This analysis uses some of the most advanced methods for producing forecasts. The emphasis will be on methods that are replicable and testable, and have been shown to work across various business sectors and different contexts.

Time Series Forecasting

A time series is a discrete or continuous sequence of observations that depend on time. Time is an important feature in natural processes such as air temperature, the pulse of the heart, or waves crashing on a sandy beach.

The mechanism generating a time series \(y_t\) is often thought to consist of three components: a seasonal component \(S_t\), a trend component \(T_t\), and a random error or remainder component \(e_t\). Provided that the seasonal variation around the trend cycle does not vary with the level of the time series, the following additive decomposition can be written: \[y_t = S_t + T_t + e_t\]

There are cases where time series are irregular and lack a distinct trend or any seasonal pattern, which makes them much more difficult to forecast. In such scenarios simple models such as ARIMA, TBATS, ETS, etc., are not going to do a good job. Fortunately, many state-of-the-art algorithms such as XGBoost, Prophet, and others enable data scientists to handle these constraints.

For example, XGBoost's gradient boosting algorithm works as follows: for a given data set with \(n\) examples and \(m\) features, \(\mathcal{D} = \{(x_i, y_i)\}\) (\(|\mathcal{D}| = n\), \(x_i \in \mathbb{R}^m\), \(y_i \in \mathbb{R}\)), a tree ensemble model such as the one listed below uses \(K\) additive functions to predict the output:

\[\hat{y}_i = \phi(x_i) = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in \mathcal{F},\]

where \(\mathcal{F} = \{f(x) = w_{q(x)}\}\) (\(q : \mathbb{R}^m \rightarrow T\), \(w \in \mathbb{R}^T\)) is the space of regression trees (also known as classification and regression trees, or CART).

Here \(q\) represents the structure of each tree, mapping an example to the corresponding leaf index, and \(T\) is the number of leaves in the tree. Each \(f_k\) corresponds to an independent tree structure \(q\) and leaf weights \(w\). Unlike decision trees, each regression tree contains a continuous score on each leaf, with \(w_i\) representing the score on the \(i\)-th leaf.

Using the decision rules in the trees (given by \(q\)), the algorithm classifies an example into the leaves and calculates the final prediction by summing up the scores in the corresponding leaves (given by \(w\)).

This illustrates how XGBoost runs under the hood, regardless of the programming context.
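To make the mechanics concrete, below is a minimal, self-contained R sketch (not part of this analysis) that fits an XGBoost regressor on lagged values of a toy series and produces a one-step-ahead prediction. The lag count, learning rate, and number of boosting rounds are illustrative assumptions.

library(xgboost)

set.seed(123)
y <- as.numeric(arima.sim(list(ar = 0.7), n = 300)) + 10   # toy autocorrelated series

# Build a supervised matrix of lag features: predict y_t from y_{t-1}, ..., y_{t-7}
lags <- 1:7
X    <- sapply(lags, function(k) c(rep(NA, k), head(y, -k)))
keep <- complete.cases(X)

dtrain <- xgb.DMatrix(data = X[keep, ], label = y[keep])

# Each boosting round adds one regression tree f_k; the prediction is their sum
bst <- xgb.train(
  params  = list(objective = "reg:squarederror", max_depth = 4, eta = 0.05),
  data    = dtrain,
  nrounds = 200
)

# One-step-ahead prediction from the 7 most recent observations
x_new <- matrix(rev(tail(y, 7)), nrow = 1)
predict(bst, x_new)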

For comparison, a simplified example of a basic ARIMA(0,1,0) random walk model (ARIMA stands for AutoRegressive Integrated Moving Average) with drift \(\mu\) is denoted like this: \[\hat{Y}_t - Y_{t-1} = \mu\]

or equivalently \[\hat{Y}_t = \mu + Y_{t-1}\]
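As a quick illustration (not part of the analysis itself), the drift term \(\mu\) of such a model can be estimated and used for forecasting with the forecast package; the simulated series and horizon below are arbitrary.

library(forecast)

set.seed(1)
y <- cumsum(rnorm(200, mean = 0.3))                   # toy random walk with drift

fit <- Arima(y, order = c(0, 1, 0), include.drift = TRUE)
coef(fit)["drift"]                                    # estimate of mu

forecast(fit, h = 7)$mean                             # next 7 steps: last value + 1:7 * mu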

1. Load required libraries and set up parallel processing compute nodes
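The setup chunk itself is not shown, so below is a hedged sketch. The exact package list is an assumption inferred from the objects used later in the workflow (tidyverse, timetk, tidymodels, modeltime), and the printed value that follows is the detected core count.

library(tidyverse)            # data wrangling
library(timetk)               # time series feature engineering
library(tidymodels)           # recipes, resampling, tuning
library(modeltime)            # forecasting workflow
library(modeltime.ensemble)
library(modeltime.resample)
library(skimr)                # data summaries
library(tictoc)               # timing of long-running chunks

# Register parallel workers for model tuning and resampling
cores <- parallel::detectCores()
doParallel::registerDoParallel(cores = cores)
cores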

## [1] 24

2. Pre-processing Pipeline - Load, Transform & Wrangle

This dataset represents two months of equity orders from six stocks of the S&P 500 index. There are many ways to harness and explore this dataset. In this analysis, we will use the data to explore one very specific question:

Given recent sold quantity, what is the expected sold quantity for the 1st of September 2021?

This requires that a predictive model forecast the total filled orders of each stock for each day.

Technically, this framing is referred to as a multi-step time series forecasting problem, given the multiple forecast steps. A model that makes use of multiple input variables (such as engineered features) to predict the target variable is also referred to as a multivariate multi-step time series forecasting model.

## # A tibble: 110,710 x 8
##    account_number stock_ticker country filled_quantity status    filled_price
##             <dbl> <chr>        <chr>             <dbl> <chr>            <dbl>
##  1        3433920 AML          UK                 -150 FILLED            18.0
##  2        5051374 AML          UK                  -10 FILLED            18.0
##  3        9098558 AML          UK                  550 FILLED            18.3
##  4        8692010 AML          UK                   -8 FILLED            18.0
##  5        5728512 AML          UK                  -10 FILLED            20.6
##  6        5728512 AML          UK                  -10 FILLED            20.5
##  7        4964040 AML          UK                    0 CANCELLED         NA  
##  8        5728512 AML          UK                  -10 FILLED            20.7
##  9        6303034 AML          UK                    8 FILLED            18.2
## 10        4801054 AML          UK                    0 CANCELLED         NA  
## # ... with 110,700 more rows, and 2 more variables: time_created <dttm>,
## #   time_executed <dttm>
Data summary
Name equity_test_tbl
Number of rows 110710
Number of columns 8
_______________________
Column type frequency:
character 3
numeric 3
POSIXct 2
________________________
Group variables None

Variable type: character

skim_variable  n_missing  complete_rate  min  max  empty  n_unique  whitespace
stock_ticker           0              1    3    4      0         6           0
country                0              1    2    2      0         2           0
status                 0              1    3    9      0         4           0

Variable type: numeric

skim_variable    n_missing  complete_rate        mean          sd          p0         p25         p50         p75        p100   hist
account_number           0           1.00  5834151.84  1956311.33  1489016.00  4150806.00  5754036.00  7563634.00  9327748.00  ▂▇▇▆▆
filled_quantity          0           1.00        1.08      169.31   -18068.00       -0.25        0.50        2.80    30000.00  ▁▇▁▁▁
filled_price           603           0.99       11.53        4.81        3.83        9.32        9.67       15.19       21.19  ▂▇▁▂▃

Variable type: POSIXct

skim_variable  n_missing  complete_rate  min                  max                  median               n_unique
time_created           0              1  2021-04-01 10:11:00  2021-08-31 23:01:00  2021-07-28 10:15:00     27952
time_executed          0              1  2021-07-01 00:00:00  2021-08-31 23:53:00  2021-07-28 10:15:00     20012

Preliminary analysis suggests that data quality is good and that there are no apparent anomalies.

Prior to modeling, the raw data requires transformation and wrangling. There are several categories (groups). The first (parent) group is “country” with two distinct labels, UK & EU; then there is the status column with four sub-categories - “filled”, “cancelled”, “rejected”, and “new”; and finally the stock ticker identifier column containing six individual stocks.

There are ways to perform multi-categorical forecasts, but they tend to perform poorly in terms of accuracy as the number of group categories grows (for reference - https://metodisimeonov.shinyapps.io/Hierarchical_Forecaster/). Therefore, the suggested approach going forward is the development of an ensemble (the current analysis) and nested forecasting methods to process panel data.

Moreover, the target variable contains positive and negative integers, which correspond to two additional groups - number of bought and sold shares.

For the purpose of this analysis the target variable will be transformed to its absolute value.

The forecast will account for the UK region only.
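A minimal dplyr sketch of that filtering and transformation is shown below; the actual chunk is not included in the output, the column names come from the data summary above, and the object name equity_filtered_tbl is illustrative.

equity_filtered_tbl <- equity_test_tbl %>%
  filter(country == "UK", status == "FILLED") %>%      # UK region, filled orders only
  mutate(
    stock_ticker    = as.factor(stock_ticker),
    filled_quantity = abs(filled_quantity)             # fold buys and sells into one group
  ) %>%
  select(stock_ticker, filled_quantity, time_executed)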

2.1. External Regressors

Visualizing time series signatures indicates that there are recurring patterns at hourly, daily and weekly intervals.

The following code chunk generates events that are introduced downstream as external regressor features.
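The chunk itself is not echoed; below is a hedged sketch that would reproduce the event table printed underneath, with the weekly Monday 10:00 timestamps read directly off that output.

library(lubridate)

event_tbl <- tibble(
  event_time_executed = seq(
    from = ymd_hms("2021-07-05 10:00:00"),
    to   = ymd_hms("2021-08-30 10:00:00"),
    by   = "1 week"
  ),
  reg_event = 1
)
event_tbl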

## # A tibble: 9 x 2
##   event_time_executed reg_event
##   <dttm>                  <dbl>
## 1 2021-07-05 10:00:00         1
## 2 2021-07-12 10:00:00         1
## 3 2021-07-19 10:00:00         1
## 4 2021-07-26 10:00:00         1
## 5 2021-08-02 10:00:00         1
## 6 2021-08-09 10:00:00         1
## 7 2021-08-16 10:00:00         1
## 8 2021-08-23 10:00:00         1
## 9 2021-08-30 10:00:00         1

2.2. Wrangle & Transform

Considering the limited data sample, the optimal way of aggregating the date and target variables is in daily increments. This reduces the available observations to just 258, but greatly reduces variance compared to second, minute, and hour increments.

This step is a trade-off. Such aggregation groups orders by day instead of by minute or hour, so most of the intraday seasonality is removed - periods with multiple orders within a single hour are summed up for the whole day rather than being evaluated as unique events.

Alternatively, avoiding the daily aggregation results in variance that is too high for a machine learning model to handle accurately with traditional approaches, especially with data this volatile.

The optimal solution would be to obtain a much larger sample covering one or two years of data and aggregate filled orders in daily increments. This way a trend and additional seasonal patterns could emerge that would be suitable for inclusion as external regressors, making the forecast more accurate. Additionally, with more data available, completely removing the outliers would become a viable option.
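A hedged sketch of the daily aggregation step is shown ahead of the two tibbles printed below (first the cleaned orders, then the daily aggregate with the event regressor joined in; the join itself is not sketched and the object names are illustrative).

equity_daily_tbl <- equity_filtered_tbl %>%
  group_by(stock_ticker) %>%
  summarise_by_time(
    .date_var       = time_executed,
    .by             = "day",
    filled_quantity = sum(filled_quantity)
  ) %>%
  ungroup()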

## # A tibble: 109,567 x 3
##    stock_ticker filled_quantity time_executed      
##    <fct>                  <dbl> <dttm>             
##  1 AML                      150 2021-07-19 11:34:00
##  2 AML                       10 2021-07-19 11:34:00
##  3 AML                      550 2021-07-08 10:05:00
##  4 AML                        8 2021-07-19 11:34:00
##  5 AML                       10 2021-08-06 15:32:00
##  6 AML                       10 2021-08-06 10:25:00
##  7 AML                       10 2021-08-12 10:04:00
##  8 AML                        8 2021-07-08 17:02:00
##  9 AML                       10 2021-07-20 16:02:00
## 10 AML                       20 2021-08-06 10:02:00
## # ... with 109,557 more rows
## # A tibble: 258 x 4
##    stock_ticker time_executed       filled_quantity reg_event
##    <fct>        <dttm>                        <dbl>     <dbl>
##  1 AML          2021-07-01 00:00:00           5011.         0
##  2 AML          2021-07-02 00:00:00           4305.         0
##  3 AML          2021-07-05 00:00:00           3983.         0
##  4 AML          2021-07-06 00:00:00           6367.         0
##  5 AML          2021-07-07 00:00:00           8725.         0
##  6 AML          2021-07-08 00:00:00           6576.         0
##  7 AML          2021-07-09 00:00:00          13319.         0
##  8 AML          2021-07-12 00:00:00          10934.         0
##  9 AML          2021-07-13 00:00:00           6543.         0
## 10 AML          2021-07-14 00:00:00           4206.         0
## # ... with 248 more rows

2.3. Inspect temporal dynamics

Rename the columns to “date”, “value” and “item_id” for easier reusability of the workflow.

Aggregating daily still results in high variance between and within the groups. This will introduce challenges during the modeling phase. However, techniques for variance reduction and outlier detection can help.

Generally, all types of models struggle with high variance. Another blueprint reviews the possibility of applying a nested forecasting technique and later compares the differences between the two approaches.

## # A tibble: 1 x 12
##   n.obs start               end                 units scale  tzone diff.minimum
##   <int> <dttm>              <dttm>              <chr> <chr>  <chr>        <dbl>
## 1   258 2021-07-01 00:00:00 2021-08-31 00:00:00 secs  second UTC       -5270400
## # ... with 5 more variables: diff.q1 <dbl>, diff.median <dbl>, diff.mean <dbl>,
## #   diff.q3 <dbl>, diff.maximum <dbl>

Time series diagnostics (TS Diag) to analyze seasonal patterns:

ACF - Autocorrelation between a target variable and lagged versions of itself.

PACF - Partial Autocorrelation removes the dependence of lags on other lags highlighting key seasonalities.

CCF - Shows how lagged predictors can be used for prediction of a target variable.

The sample is too small to draw definitive conclusions, but it seems that some additional features can be added.
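A hedged sketch of producing those diagnostics with timetk is given below; the object name daily_renamed_tbl is illustrative (columns date, value, item_id after the renaming step), and the lag window is an assumption.

daily_renamed_tbl %>%
  group_by(item_id) %>%
  plot_acf_diagnostics(
    .date_var    = date,
    .value       = value,
    .lags        = 30,
    .interactive = FALSE
  )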

2.4. Full Data tbl

## # A tibble: 300 x 20
##    rowid item_id date                value date_sin2_K1 date_cos2_K1
##    <int> <fct>   <dttm>              <dbl>        <dbl>        <dbl>
##  1     1 AML     2021-07-01 00:00:00  8.52    -3.86e-13           -1
##  2     2 AML     2021-07-02 00:00:00  8.37    -4.74e-12            1
##  3     3 AML     2021-07-05 00:00:00  8.29     5.57e-12           -1
##  4     4 AML     2021-07-06 00:00:00  8.76    -3.42e-12            1
##  5     5 AML     2021-07-07 00:00:00  9.07     1.27e-12           -1
##  6     6 AML     2021-07-08 00:00:00  8.79     8.79e-13            1
##  7     7 AML     2021-07-09 00:00:00  9.50     4.25e-12           -1
##  8     8 AML     2021-07-12 00:00:00  9.30    -5.08e-12            1
##  9     9 AML     2021-07-13 00:00:00  8.79     2.93e-12           -1
## 10    10 AML     2021-07-14 00:00:00  8.34    -7.77e-13            1
## # ... with 290 more rows, and 14 more variables: date_sin4_K1 <dbl>,
## #   date_cos4_K1 <dbl>, date_sin7_K1 <dbl>, date_cos7_K1 <dbl>,
## #   date_sin14_K1 <dbl>, date_cos14_K1 <dbl>, date_sin30_K1 <dbl>,
## #   date_cos30_K1 <dbl>, value_lag7 <dbl>, value_lag7_roll_2 <dbl>,
## #   value_lag7_roll_4 <dbl>, value_lag7_roll_7 <dbl>, value_lag7_roll_14 <dbl>,
## #   value_lag7_roll_30 <dbl>

This step takes care of most of the pre-processing and feature engineering workflow. Introducing Fourier features greatly helps in time series forecasting. However, it can be quite difficult to create the “proper” features that are effective for each group and fit the overall temporal dynamics.

The Fourier series formula gives an expansion of a periodic function f(x) in terms of an infinite sum of sines and cosines. It is used to decompose any periodic function or periodic signal into the sum of a set of simple oscillating functions, namely sines and cosines, and it makes use of the orthogonality relationships of the cosine and sine functions.

Fourier features can be defined through a similar expression:

\[\large f(x)=\frac{1}{2}a_{0}+\sum_{n=1}^{\infty}a_{n}\cos nx+\sum_{n=1}^{\infty}b_{n}\sin nx\]

Moreover, the addition of lags further helps stabilize the mapping of the seasonal patterns and boosts model accuracy.

A “lag” is a fixed amount of passing time: one set of observations in a time series is plotted (lagged) against a second, later set of data. The kth lag is the time period that occurred k time points before time i. For example, Lag1(Y2) = Y1 and Lag4(Y9) = Y5.
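The column names in the full data tibble above suggest timetk's augmentation helpers. Below is a hedged sketch of that feature engineering step: the 7-day horizon is inferred from the future frame, the Fourier periods and lag/rolling windows are read off the printed column names, and the log transform and row id are assumptions about steps not shown.

horizon <- 7

full_data_tbl <- daily_renamed_tbl %>%
  mutate(value = log(value)) %>%                        # assumed variance-stabilising transform
  group_by(item_id) %>%
  # Extend each group 7 days into the future and bind back to the history
  future_frame(.date_var = date, .length_out = horizon, .bind_data = TRUE) %>%
  # Fourier terms for the 2-, 4-, 7-, 14- and 30-day cycles
  tk_augment_fourier(.date_var = date, .periods = c(2, 4, 7, 14, 30), .K = 1) %>%
  # 7-day lag of the target plus rolling means of that lag
  tk_augment_lags(.value = value, .lags = horizon) %>%
  tk_augment_slidify(
    .value   = value_lag7,
    .f       = mean,
    .period  = c(2, 4, 7, 14, 30),
    .align   = "center",
    .partial = TRUE
  ) %>%
  ungroup() %>%
  rowid_to_column(var = "rowid")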

2.5. Data Prepared

## # A tibble: 216 x 20
##    rowid item_id date                value date_sin2_K1 date_cos2_K1
##    <int> <fct>   <dttm>              <dbl>        <dbl>        <dbl>
##  1     8 AML     2021-07-12 00:00:00  9.30    -5.08e-12            1
##  2     9 AML     2021-07-13 00:00:00  8.79     2.93e-12           -1
##  3    10 AML     2021-07-14 00:00:00  8.34    -7.77e-13            1
##  4    11 AML     2021-07-15 00:00:00  8.64     5.90e-12           -1
##  5    12 AML     2021-07-16 00:00:00  7.41    -3.75e-12            1
##  6    13 AML     2021-07-19 00:00:00  9.40     4.58e-12           -1
##  7    14 AML     2021-07-20 00:00:00  8.95    -2.43e-12            1
##  8    15 AML     2021-07-21 00:00:00  8.82     2.84e-13           -1
##  9    16 AML     2021-07-22 00:00:00  9.02    -5.41e-12            1
## 10    17 AML     2021-07-23 00:00:00  8.17     3.26e-12           -1
## # ... with 206 more rows, and 14 more variables: date_sin4_K1 <dbl>,
## #   date_cos4_K1 <dbl>, date_sin7_K1 <dbl>, date_cos7_K1 <dbl>,
## #   date_sin14_K1 <dbl>, date_cos14_K1 <dbl>, date_sin30_K1 <dbl>,
## #   date_cos30_K1 <dbl>, value_lag7 <dbl>, value_lag7_roll_2 <dbl>,
## #   value_lag7_roll_4 <dbl>, value_lag7_roll_7 <dbl>, value_lag7_roll_14 <dbl>,
## #   value_lag7_roll_30 <dbl>

The prepared tibble has the empty rows removed and contains the engineered features.

2.6. Time Split

The initial time split is approximately 60/40 due to the limited sample.
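A hedged sketch of the split using timetk is shown below; the assess window is an approximation of the 60/40 proportion, since the exact value used originally is not shown.

splits <- data_prepared_tbl %>%
  time_series_split(date_var = date, assess = "3 weeks", cumulative = TRUE)

train_tbl <- training(splits)
test_tbl  <- testing(splits)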

2.7. Future Data

## # A tibble: 42 x 20
##    rowid item_id date                value date_sin2_K1 date_cos2_K1
##    <int> <fct>   <dttm>              <dbl>        <dbl>        <dbl>
##  1    44 AML     2021-09-01 00:00:00    NA     4.60e-12           -1
##  2    45 AML     2021-09-02 00:00:00    NA    -2.45e-12            1
##  3    46 AML     2021-09-03 00:00:00    NA     3.04e-13           -1
##  4    47 AML     2021-09-04 00:00:00    NA    -5.43e-12            1
##  5    48 AML     2021-09-05 00:00:00    NA     3.28e-12           -1
##  6    49 AML     2021-09-06 00:00:00    NA    -1.13e-12            1
##  7    50 AML     2021-09-07 00:00:00    NA    -1.02e-12           -1
##  8    94 CCL     2021-09-01 00:00:00    NA     4.60e-12           -1
##  9    95 CCL     2021-09-02 00:00:00    NA    -2.45e-12            1
## 10    96 CCL     2021-09-03 00:00:00    NA     3.04e-13           -1
## # ... with 32 more rows, and 14 more variables: date_sin4_K1 <dbl>,
## #   date_cos4_K1 <dbl>, date_sin7_K1 <dbl>, date_cos7_K1 <dbl>,
## #   date_sin14_K1 <dbl>, date_cos14_K1 <dbl>, date_sin30_K1 <dbl>,
## #   date_cos30_K1 <dbl>, value_lag7 <dbl>, value_lag7_roll_2 <dbl>,
## #   value_lag7_roll_4 <dbl>, value_lag7_roll_7 <dbl>, value_lag7_roll_14 <dbl>,
## #   value_lag7_roll_30 <dbl>

This neat function preps the future tibble that will store the predictions.

3. Recipe

A wrapper function is introduced to estimate outlier replacement, using linear interpolation on the seasonally adjusted series. As the series is multiplicative in some groups, meaning the variance grows with the level, a Box-Cox transform with automatic lambda selection is included.

The normalized time series look completely different: peak periods are now more or less within the same range as the rest of the observed values, which reduces the variance significantly. This process is reversible.

Some of the bigger spikes could be caused by month-end closure activities, which could potentially be regressed out if more data were available.
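A hedged sketch of a recipe in that spirit, using recipes and timetk steps, is shown below; the exact steps and their order in the original are assumptions, and the removed/normalized columns are inferred from the glimpse output that follows.

recipe_spec <- recipe(value ~ ., data = training(splits)) %>%
  # Outlier replacement via seasonal adjustment + interpolation, then Box-Cox
  step_ts_clean(value, period = 7) %>%
  step_box_cox(value) %>%
  # Calendar features derived from the date column
  step_timeseries_signature(date) %>%
  step_rm(matches("(.iso$)|(.xts$)")) %>%
  step_normalize(date_index.num) %>%
  # One-hot encode the stock identifier and label columns
  step_dummy(all_nominal(), one_hot = TRUE)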

## Rows: 132
## Columns: 65
## $ rowid              <int> 8, 58, 108, 158, 208, 258, 9, 59, 109, 159, 209, 25~
## $ date               <dttm> 2021-07-12, 2021-07-12, 2021-07-12, 2021-07-12, 20~
## $ date_sin2_K1       <dbl> -5.075826e-12, -5.075826e-12, -5.075826e-12, -5.075~
## $ date_cos2_K1       <dbl> 1, 1, 1, 1, 1, 1, -1, -1, -1, -1, -1, -1, 1, 1, 1, ~
## $ date_sin4_K1       <dbl> -2.537913e-12, -2.537913e-12, -2.537913e-12, -2.537~
## $ date_cos4_K1       <dbl> 1.000000e+00, 1.000000e+00, 1.000000e+00, 1.000000e~
## $ date_sin7_K1       <dbl> -4.338837e-01, -4.338837e-01, -4.338837e-01, -4.338~
## $ date_cos7_K1       <dbl> -0.9009689, -0.9009689, -0.9009689, -0.9009689, -0.~
## $ date_sin14_K1      <dbl> 9.749279e-01, 9.749279e-01, 9.749279e-01, 9.749279e~
## $ date_cos14_K1      <dbl> -0.2225209, -0.2225209, -0.2225209, -0.2225209, -0.~
## $ date_sin30_K1      <dbl> 0.8660254, 0.8660254, 0.8660254, 0.8660254, 0.86602~
## $ date_cos30_K1      <dbl> -0.5000000, -0.5000000, -0.5000000, -0.5000000, -0.~
## $ value_lag7         <dbl> 8.519602, 8.742095, 9.074949, 10.150222, 7.465358, ~
## $ value_lag7_roll_2  <dbl> 8.443690, 8.284885, 9.666220, 9.864192, 7.540808, 7~
## $ value_lag7_roll_4  <dbl> 8.392505, 8.088322, 9.657075, 9.825064, 7.906218, 7~
## $ value_lag7_roll_7  <dbl> 8.484152, 8.155853, 9.814177, 9.743540, 7.935658, 7~
## $ value_lag7_roll_14 <dbl> 8.824833, 8.499091, 9.722932, 9.712857, 7.939522, 8~
## $ value_lag7_roll_30 <dbl> 8.748093, 8.740054, 9.509160, 9.661446, 7.855750, 8~
## $ value              <dbl> 9.299692, 8.313068, 9.852617, 9.419039, 8.181474, 8~
## $ date_index.num     <dbl> -1.567231, -1.567231, -1.567231, -1.567231, -1.5672~
## $ date_year          <int> 2021, 2021, 2021, 2021, 2021, 2021, 2021, 2021, 202~
## $ date_half          <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ~
## $ date_quarter       <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ~
## $ date_month         <int> 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, ~
## $ date_day           <int> 12, 12, 12, 12, 12, 12, 13, 13, 13, 13, 13, 13, 14,~
## $ date_hour          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ date_minute        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ date_second        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ date_hour12        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ date_am.pm         <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ~
## $ date_wday          <int> 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, ~
## $ date_mday          <int> 12, 12, 12, 12, 12, 12, 13, 13, 13, 13, 13, 13, 14,~
## $ date_qday          <int> 12, 12, 12, 12, 12, 12, 13, 13, 13, 13, 13, 13, 14,~
## $ date_yday          <int> 193, 193, 193, 193, 193, 193, 194, 194, 194, 194, 1~
## $ date_mweek         <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ~
## $ date_week          <int> 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28, 28,~
## $ date_week2         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ date_week3         <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ~
## $ date_week4         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ date_mday7         <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, ~
## $ item_id_AML        <dbl> 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, ~
## $ item_id_CCL        <dbl> 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, ~
## $ item_id_HSBA       <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, ~
## $ item_id_INRG       <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, ~
## $ item_id_JDW        <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, ~
## $ item_id_OCDO       <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, ~
## $ date_month.lbl_01  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ date_month.lbl_02  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ date_month.lbl_03  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ date_month.lbl_04  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ date_month.lbl_05  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ date_month.lbl_06  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ date_month.lbl_07  <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ~
## $ date_month.lbl_08  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ date_month.lbl_09  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ date_month.lbl_10  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ date_month.lbl_11  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ date_month.lbl_12  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ date_wday.lbl_1    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ date_wday.lbl_2    <dbl> 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ date_wday.lbl_3    <dbl> 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, ~
## $ date_wday.lbl_4    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, ~
## $ date_wday.lbl_5    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ date_wday.lbl_6    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ date_wday.lbl_7    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~

The result of the recipe specification and pre-processing pipeline is a dataset with 64 predictors.

4. Modeling workflow
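The workflow code is not shown; below is a hedged sketch of how two of the four models could be specified and evaluated with modeltime (engines inferred from the model descriptions in the accuracy table; Ranger and Prophet Boost follow the same pattern).

# XGBoost: machine learning model, date column removed before fitting
wflw_fit_xgboost <- workflow() %>%
  add_model(boost_tree(mode = "regression") %>% set_engine("xgboost")) %>%
  add_recipe(recipe_spec %>% step_rm(date)) %>%
  fit(training(splits))

# Prophet with regressors: sequential model, keeps the date column
wflw_fit_prophet <- workflow() %>%
  add_model(prophet_reg(mode = "regression") %>% set_engine("prophet")) %>%
  add_recipe(recipe_spec) %>%
  fit(training(splits))

# Collect, calibrate on the test split, and score
submodels_tbl <- modeltime_table(wflw_fit_prophet, wflw_fit_xgboost) %>%
  modeltime_calibrate(new_data = testing(splits))

submodels_tbl %>% modeltime_accuracy() %>% arrange(rmse)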

## # A tibble: 4 x 9
##   .model_id .model_desc               .type   mae  mape  mase smape  rmse   rsq
##       <int> <chr>                     <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1         4 RANGER                    Test  0.384  4.53 0.318  4.53 0.504 0.808
## 2         2 XGBOOST                   Test  0.477  5.62 0.394  5.73 0.596 0.743
## 3         3 PROPHET W/ XGBOOST ERRORS Test  0.520  6.06 0.430  6.26 0.659 0.720
## 4         1 PROPHET W/ REGRESSORS     Test  0.988 11.9  0.817 13.2  1.22  0.593

Out of the box, all models except Prophet with Regressors seem to be a good and consistent fit to the testing data. This step indicates how important it is to use the proper algorithm for the available data. During the next step, the models are tuned using a 10-candidate Latin hypercube grid to iterate through the hyperparameter space. A grid of 20 is also a good alternative, but it doubles the computation cost and can have diminishing returns.

5. Hyper Parameter Tuning

Due to the limited sample, 2-fold cross-validation may not be enough to ensure a sufficient number of samples for proper validation and testing of the models.
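A hedged sketch of the tuning step for the XGBoost workflow is shown below (the other models follow the same pattern); the 2-fold CV and the 10-candidate Latin hypercube grid mirror the description above, and the remaining settings are assumptions.

set.seed(123)
resamples_kfold <- vfold_cv(training(splits), v = 2)

wflw_spec_xgboost_tune <- workflow() %>%
  add_model(
    boost_tree(
      mode           = "regression",
      mtry           = tune(),
      trees          = tune(),
      min_n          = tune(),
      tree_depth     = tune(),
      learn_rate     = tune(),
      loss_reduction = tune()
    ) %>% set_engine("xgboost")
  ) %>%
  add_recipe(recipe_spec %>% step_rm(date))

tic()
tune_results_xgboost <- wflw_spec_xgboost_tune %>%
  tune_grid(
    resamples = resamples_kfold,
    grid      = 10,                           # space-filling (Latin hypercube) design
    metrics   = metric_set(rmse),
    control   = control_grid(allow_par = TRUE, verbose = FALSE)
  )
toc()

tune_results_xgboost %>% show_best(metric = "rmse", n = 10)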

## 15.56 sec elapsed
## # A tibble: 10 x 12
##     mtry trees min_n tree_depth learn_rate loss_reduction .metric .estimator
##    <int> <int> <int>      <int>      <dbl>          <dbl> <chr>   <chr>     
##  1    48  1570    12          8     0.141        2.09e- 1 rmse    standard  
##  2    13  1773     6         10     0.181        1.02e- 5 rmse    standard  
##  3     4   470     7          6     0.273        4.13e- 2 rmse    standard  
##  4    57   866    14          8     0.419        3.46e- 3 rmse    standard  
##  5    32  1324    18         12     0.0420       6.43e- 8 rmse    standard  
##  6    25   325    22          2     0.349        2.02e- 6 rmse    standard  
##  7    39  1950    32          3     0.0863       3.56e- 4 rmse    standard  
##  8    18  1041    27          5     0.360        1.39e-10 rmse    standard  
##  9    52   737    39         13     0.470        1.46e- 9 rmse    standard  
## 10    29   105    31         15     0.247        2.85e+ 1 rmse    standard  
## # ... with 4 more variables: mean <dbl>, n <int>, std_err <dbl>, .config <chr>
## 5.67 sec elapsed
## # A tibble: 10 x 12
##     mtry trees min_n tree_depth learn_rate loss_reduction .metric .estimator
##    <int> <int> <int>      <int>      <dbl>          <dbl> <chr>   <chr>     
##  1    47  1570    12          8     0.141        2.09e- 1 rmse    standard  
##  2     4   470     7          6     0.273        4.13e- 2 rmse    standard  
##  3    12  1773     6         10     0.181        1.02e- 5 rmse    standard  
##  4    57   866    14          8     0.419        3.46e- 3 rmse    standard  
##  5    32  1324    18         12     0.0420       6.43e- 8 rmse    standard  
##  6    24   325    22          2     0.349        2.02e- 6 rmse    standard  
##  7    39  1950    32          3     0.0863       3.56e- 4 rmse    standard  
##  8    51   737    39         13     0.470        1.46e- 9 rmse    standard  
##  9    29   105    31         15     0.247        2.85e+ 1 rmse    standard  
## 10    18  1041    27          5     0.360        1.39e-10 rmse    standard  
## # ... with 4 more variables: mean <dbl>, n <int>, std_err <dbl>, .config <chr>
## 3.57 sec elapsed
## # A tibble: 10 x 9
##     mtry trees min_n .metric .estimator  mean     n std_err .config             
##    <int> <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
##  1    61   773     6 rmse    standard   0.478     2  0.0304 Preprocessor1_Model~
##  2    44   467    10 rmse    standard   0.479     2  0.0360 Preprocessor1_Model~
##  3    51   950    25 rmse    standard   0.490     2  0.0197 Preprocessor1_Model~
##  4    18   325     7 rmse    standard   0.491     2  0.0260 Preprocessor1_Model~
##  5    29  1904    23 rmse    standard   0.492     2  0.0242 Preprocessor1_Model~
##  6    33  1737    35 rmse    standard   0.502     2  0.0202 Preprocessor1_Model~
##  7    22    70    30 rmse    standard   0.503     2  0.0296 Preprocessor1_Model~
##  8    38  1124    37 rmse    standard   0.503     2  0.0202 Preprocessor1_Model~
##  9    12  1441    16 rmse    standard   0.503     2  0.0295 Preprocessor1_Model~
## 10     5  1370    20 rmse    standard   0.569     2  0.0449 Preprocessor1_Model~

6. Evaluate Panel Forecasts

## # A tibble: 7 x 9
##   .model_id .model_desc               .type   mae  mape  mase smape  rmse   rsq
##       <int> <chr>                     <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1         7 RANGER                    Test  0.384  4.53 0.318  4.53 0.504 0.808
## 2         3 RANGER - Tuned            Test  0.416  4.89 0.344  4.91 0.524 0.785
## 3         2 XGBOOST - Tuned           Test  0.463  5.40 0.383  5.49 0.586 0.740
## 4         5 XGBOOST                   Test  0.477  5.62 0.394  5.73 0.596 0.743
## 5         6 PROPHET W/ XGBOOST ERRORS Test  0.520  6.06 0.430  6.26 0.659 0.720
## 6         1 PROPHET - Tuned           Test  0.517  5.97 0.428  6.25 0.702 0.722
## 7         4 PROPHET W/ REGRESSORS     Test  0.988 11.9  0.817 13.2  1.22  0.593

Good accuracy for the first two models on the testing set. Using tune() to automate this process is a precious time saver, as it spares the data scientist from manually running and tweaking multiple iterations over sets of hyperparameters.

With some exceptions, the models are able to pick up the seasonal trends. Generally, advanced machine learning models are able to fit the testing data well most of the time. This, however, is not a clear indication that they will forecast accurately. For that reason, resampling over different time intervals can be used to assess model performance further into the future.

7. Resampling - Assess the stability of the models over time and strategize for an ensemble approach

Due to the limited data sample, only 2 resamples are used for evaluation. For long term forecasts, 5-8 resamples are recommended.
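A hedged sketch of the resampling plan and fit using modeltime.resample is shown below; the window sizes are assumptions given the short history, and submodels_all_tbl is an illustrative name for the modeltime table holding all seven models.

resamples_tscv <- training(splits) %>%
  time_series_cv(
    date_var    = date,
    assess      = "2 weeks",
    skip        = "2 weeks",
    cumulative  = TRUE,
    slice_limit = 2
  )

tic()
resample_results <- submodels_all_tbl %>%
  modeltime_fit_resamples(
    resamples = resamples_tscv,
    control   = control_resamples(verbose = FALSE, allow_par = TRUE)
  )
toc()

resample_results %>%
  modeltime_resample_accuracy(summary_fns = list(mean = mean, sd = sd)) %>%
  arrange(rmse_mean)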

## -- Fitting Resamples --------------------------------------------
## 
## 57.74 sec elapsed
## # A tibble: 7 x 8
##   .model_id .model_desc         .type       n rmse_mean rmse_sd rsq_mean  rsq_sd
##       <int> <chr>               <chr>   <int>     <dbl>   <dbl>    <dbl>   <dbl>
## 1         3 RANGER - Tuned      Resamp~     2     0.573  0.0607  0.689   2.11e-1
## 2         5 XGBOOST             Resamp~     2     0.587  0.0359  0.642   1.71e-1
## 3         7 RANGER              Resamp~     2     0.627  0.0358  0.666   2.02e-1
## 4         2 XGBOOST - Tuned     Resamp~     2     0.628  0.0739  0.659   2.10e-1
## 5         1 PROPHET - Tuned     Resamp~     2     0.704  0.117   0.602   2.08e-1
## 6         6 PROPHET W/ XGBOOST~ Resamp~     2     0.726  0.0947  0.578   2.27e-1
## 7         4 PROPHET W/ REGRESS~ Resamp~     2     8.09   5.35    0.00874 4.66e-4

A clear pattern emerges when evaluating the resampled accuracy of the models: Ranger & XGBoost are at the top of the list.

The resampling plot indicates that some of the models are not good, e.g. Prophet with Regressors. This finding would grow in importance with a bigger data sample - the more slices are added (simulating the test of time in the future), the worse this model performs.

Accuracy clearly degrades across the board for all models in the second resample slice. This isn't necessarily bad, as equity forecasting targets shorter future intervals, and we may not care whether the model remains accurate 10 or more weeks into the future.

8. Ensemble Panel Model

The ensemble model is fitted as the average prediction across the top-performing models (those with the lowest RMSE): in this case Ranger - Tuned, XGBoost, and Ranger.
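A hedged sketch of that ensemble using modeltime.ensemble is shown below; the model ids follow the accuracy tables above, and submodels_all_tbl / ensemble_calibrated_tbl are illustrative names.

ensemble_fit_mean <- submodels_all_tbl %>%
  filter(.model_id %in% c(3, 5, 7)) %>%      # Ranger - Tuned, XGBoost, Ranger
  ensemble_average(type = "mean")

ensemble_calibrated_tbl <- modeltime_table(ensemble_fit_mean) %>%
  modeltime_calibrate(new_data = testing(splits))

ensemble_calibrated_tbl %>% modeltime_accuracy()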

## # A tibble: 1 x 9
##   .model_id .model_desc               .type   mae  mape  mase smape  rmse   rsq
##       <int> <chr>                     <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1         1 ENSEMBLE (MEAN): 3 MODELS Test  0.415  4.88 0.344  4.92 0.526 0.785

The observed metrics of the ensemble model on the testing set indicate that 79% of the variance in the time series is explained, which is moderately good. The RMSE of roughly 0.52 is also good. With additional feature engineering these metrics could be improved even further.

Tuning the models and ensembling them seems to marginally decrease accuracy on the test set but increases reliability compared to using the default hyperparameters. Any gains count.

## # A tibble: 6 x 4
##   item_id    mae   rmse         rsq
##   <fct>    <dbl>  <dbl>       <dbl>
## 1 AML      1787.  2164. 0.0409     
## 2 CCL      2093.  2531. 0.00469    
## 3 HSBA    12678. 17388. 0.146      
## 4 INRG     4238.  5732. 0.0646     
## 5 JDW       274.   338. 0.000000205
## 6 OCDO     1149.  2070. 0.0768

The predicted values for the next 7-day window for each stock seem to be a relatively good fit to the overall seasonal pattern and, with a few exceptions, fall within the range of the temporal dynamics of the time series. Boosting the forecast with more features could yield even better results.

9. Save & export forecast results
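A hedged sketch of producing and persisting the forecast shown below: the ensemble is refit on the full prepared data, the future frame from section 2.7 (future_tbl, an illustrative name) is forecast, and the result is written out. The file paths, the keep_data option, and the log back-transformation are assumptions.

refit_tbl <- ensemble_calibrated_tbl %>%
  modeltime_refit(data = data_prepared_tbl)

ensemble_forecast_tbl <- refit_tbl %>%
  modeltime_forecast(new_data = future_tbl, keep_data = TRUE) %>%
  mutate(.value = exp(.value))                 # invert the assumed log transform

ensemble_forecast_tbl %>% write_rds("artifacts/ensemble_forecast_tbl.rds")
ensemble_forecast_tbl %>%
  filter(.key == "prediction") %>%
  write_csv("artifacts/ensemble_forecast.csv")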

## # A tibble: 42 x 25
##    .model_id .model_desc        .key    .index              .value rowid item_id
##        <int> <chr>              <fct>   <dttm>               <dbl> <int> <fct>  
##  1         1 ENSEMBLE (MEAN): ~ predic~ 2021-09-01 00:00:00  5861.    44 AML    
##  2         1 ENSEMBLE (MEAN): ~ predic~ 2021-09-02 00:00:00  5295.    45 AML    
##  3         1 ENSEMBLE (MEAN): ~ predic~ 2021-09-03 00:00:00  4673.    46 AML    
##  4         1 ENSEMBLE (MEAN): ~ predic~ 2021-09-04 00:00:00  4680.    47 AML    
##  5         1 ENSEMBLE (MEAN): ~ predic~ 2021-09-05 00:00:00  4405.    48 AML    
##  6         1 ENSEMBLE (MEAN): ~ predic~ 2021-09-06 00:00:00  3817.    49 AML    
##  7         1 ENSEMBLE (MEAN): ~ predic~ 2021-09-07 00:00:00  3967.    50 AML    
##  8         1 ENSEMBLE (MEAN): ~ predic~ 2021-09-01 00:00:00  4103.    94 CCL    
##  9         1 ENSEMBLE (MEAN): ~ predic~ 2021-09-02 00:00:00  4341.    95 CCL    
## 10         1 ENSEMBLE (MEAN): ~ predic~ 2021-09-03 00:00:00  4028.    96 CCL    
## # ... with 32 more rows, and 18 more variables: date <dttm>, value <dbl>,
## #   date_sin2_K1 <dbl>, date_cos2_K1 <dbl>, date_sin4_K1 <dbl>,
## #   date_cos4_K1 <dbl>, date_sin7_K1 <dbl>, date_cos7_K1 <dbl>,
## #   date_sin14_K1 <dbl>, date_cos14_K1 <dbl>, date_sin30_K1 <dbl>,
## #   date_cos30_K1 <dbl>, value_lag7 <dbl>, value_lag7_roll_2 <dbl>,
## #   value_lag7_roll_4 <dbl>, value_lag7_roll_7 <dbl>, value_lag7_roll_14 <dbl>,
## #   value_lag7_roll_30 <dbl>