Synopsis
In their book Forecasting: Principles and Practice, Rob Hyndman and George Athanasopoulos state that forecasting is a difficult activity, and that businesses that do it well have a big advantage over those whose forecasts fail.
This analysis uses some of the most advanced methods for producing forecasts. The emphasis will be on methods that are replicable and testable, and have been shown to work across various business sectors and different contexts.
Time Series Forecasting
A time series is a discrete or continuous sequence of observations that depend on time. Time is an important feature of natural processes such as air temperature, the pulse of the heart, or waves crashing on a sandy beach.
The mechanism generating a time series \(y_t\) is often thought to consist of three components: a seasonal component \(S_t\), a trend component \(T_t\), and a random error or remainder component \(e_t\). Provided that the seasonal variation around the trend cycle does not vary with the level of the time series, the series can be written as an additive decomposition: \[y_t = S_t + T_t + e_t\]
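As a minimal sketch of the additive model, the following Python snippet builds a toy series from a known linear trend and a repeating seasonal pattern (no noise), then recovers the components with a centered moving average. The function name `decompose_additive`, the period of 3, and the toy data are illustrative assumptions and are not taken from the analysis itself.

```python
from statistics import mean

def decompose_additive(y, period):
    """Classical additive decomposition for an odd seasonal period.

    Returns (trend, seasonal, remainder). Trend and remainder are None at
    the edges, where the centered moving average is undefined.
    """
    if period % 2 == 0:
        raise ValueError("this sketch only handles odd periods")
    half = period // 2
    n = len(y)
    # Trend T_t: centered moving average over one full seasonal cycle.
    trend = [None] * n
    for t in range(half, n - half):
        trend[t] = mean(y[t - half:t + half + 1])
    # Seasonal S_t: average detrended value at each position in the cycle,
    # shifted so the seasonal indices average to zero.
    buckets = [[] for _ in range(period)]
    for t in range(n):
        if trend[t] is not None:
            buckets[t % period].append(y[t] - trend[t])
    raw = [mean(b) for b in buckets]
    offset = mean(raw)
    seasonal = [raw[t % period] - offset for t in range(n)]
    # Remainder e_t: what is left after removing trend and seasonality.
    remainder = [
        None if trend[t] is None else y[t] - trend[t] - seasonal[t]
        for t in range(n)
    ]
    return trend, seasonal, remainder

# Toy series: linear trend plus a zero-mean pattern of period 3, no noise.
pattern = [2.0, -1.0, -1.0]
y = [10.0 + 0.5 * t + pattern[t % 3] for t in range(12)]
trend, seasonal, remainder = decompose_additive(y, period=3)
```

Because the toy series contains no noise, the recovered trend and seasonal components match the ones used to build it, and the remainder is zero wherever the trend is defined.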
Some time series are irregular and have no distinct trend or seasonal pattern, which makes them much more difficult to forecast. In such scenarios, classical models such as ARIMA, TBATS, and ETS tend to perform poorly. Fortunately, more recent algorithms such as XGBoost and Prophet enable data scientists to handle these constraints.
For example, consider XGBoost's gradient boosting algorithm. For a given data set with n examples and m features, \(D = \{(x_i, y_i)\}\) (\(|D| = n\), \(x_i \in \mathbb{R}^m\), \(y_i \in \mathbb{R}\)), a tree ensemble model such as the one listed below uses K additive functions to predict the output:
\[\hat{y}_i = \phi(x_i) = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in \mathcal{F},\]
where \(\mathcal{F} = \{f(x) = w_{q(x)}\}\) (\(q : \mathbb{R}^m \to T\), \(w \in \mathbb{R}^T\)) is the space of regression trees (also known as classification and regression trees, or CART).
Here q represents the structure of each tree, which maps an example to the corresponding leaf index, and T is the number of leaves in the tree. Each \(f_k\) corresponds to an independent tree structure q and leaf weights w. Unlike decision trees, each regression tree contains a continuous score on each leaf, and \(w_i\) represents the score on the i-th leaf.
For a given example, the algorithm uses the decision rules in the trees (given by q) to classify the example into leaves, and calculates the final prediction by summing up the scores in the corresponding leaves (given by w).
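To make the notation concrete, here is a minimal Python sketch of the ensemble prediction: each tree is a structure function q that maps an example to a leaf index, paired with a weight vector w, and the prediction is the sum of the selected leaf weights over all K trees. The two hand-written trees, their split thresholds, and their weights are invented for illustration. This is not XGBoost's implementation or API, and it leaves out training entirely.

```python
from typing import Callable, List, Sequence, Tuple

# A tree f_k is a pair (q, w): q maps a feature vector to a leaf index,
# and w holds one continuous score per leaf.
Tree = Tuple[Callable[[Sequence[float]], int], List[float]]

def q1(x):
    # Hypothetical tree with T = 2 leaves: a single split on feature 0.
    return 0 if x[0] < 5.0 else 1

def q2(x):
    # Hypothetical tree with T = 3 leaves: split on feature 1, then feature 0.
    if x[1] < 1.0:
        return 0
    return 1 if x[0] < 2.0 else 2

# K = 2 trees with made-up leaf weights.
ensemble: List[Tree] = [(q1, [-0.4, 0.9]), (q2, [0.1, -0.2, 0.3])]

def predict(x, trees):
    """Sum the leaf scores w[q(x)] over all K trees."""
    return sum(w[q(x)] for q, w in trees)

# Falls into leaf 1 of the first tree and leaf 2 of the second tree.
y_hat = predict([6.0, 2.5], ensemble)
```

Keeping q and w separate mirrors the definition of the tree space: the structure decides which leaf an example lands in, and the weights decide what that leaf contributes to the sum.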
This illustrates how XGBoost runs under the hood, regardless of the programming context.
For comparison, a basic ARIMA(0,1,0) model (ARIMA stands for AutoRegressive Integrated Moving Average), which is a random walk with a constant drift term μ, is written as \[\hat{Y}_t - Y_{t-1} = \mu\]
or equivalently \[\hat{Y}_t = \mu + Y_{t-1}\]
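The forecast equation can be sketched in a few lines of Python. One common estimate of μ is the mean of the first differences, and a multi-step forecast repeatedly adds μ to the previous value. The function names and the `history` values below are made up for illustration.

```python
def fit_drift(y):
    """Estimate mu as the mean of the first differences Y_t - Y_{t-1}."""
    diffs = [b - a for a, b in zip(y, y[1:])]
    return sum(diffs) / len(diffs)

def forecast(y, horizon):
    """Iterate Y_hat_t = mu + Y_{t-1}, feeding each forecast back in."""
    mu = fit_drift(y)
    last, out = y[-1], []
    for _ in range(horizon):
        last = mu + last
        out.append(last)
    return out

history = [100.0, 102.0, 101.0, 105.0, 108.0]  # made-up observations
mu = fit_drift(history)                # mean of [2, -1, 4, 3] = 2.0
steps = forecast(history, horizon=3)   # [110.0, 112.0, 114.0]
```

With μ set to zero the forecast is simply the last observed value, which is why this model is a useful baseline against which more complex models can be compared.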
1. Load required libraries and set up parallel processing compute nodes
## [1] 24
2. Pre-processing Pipeline - Load, Transform & Wrangle
This dataset represents two months of equity orders from six stocks of the S&P 500 index. There are many ways to explore this dataset. In this analysis, we will use the data to answer one specific question:
Given recent sold quantity, what is the expected sold quantity for the 1st of September 2021?
This requires that a predictive model forecast the total filled orders of each stock for each day.
Technically, this framing of the problem is referred to as a multi-step time series forecasting problem, given the multiple forecast steps. A model that makes use of multiple input variables to predict the target variable (such as engineered features) is also referred to as a multivariate multi-step time series forecasting model.
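One way to picture the multi-step framing is as a sliding window over the daily series: each training example pairs the most recent observations (the inputs) with the next several values (the targets). The sketch below is a hypothetical Python helper, not the pipeline used in this analysis. The names `make_windows`, `n_lags`, and `horizon`, as well as the daily totals, are assumptions made for illustration.

```python
def make_windows(series, n_lags, horizon):
    """Slice a series into (inputs, targets) pairs for multi-step forecasting.

    inputs  = the n_lags most recent observations
    targets = the next `horizon` observations to be predicted
    """
    pairs = []
    for start in range(len(series) - n_lags - horizon + 1):
        inputs = series[start:start + n_lags]
        targets = series[start + n_lags:start + n_lags + horizon]
        pairs.append((inputs, targets))
    return pairs

daily_totals = [120, 95, 130, 110, 150, 140, 160]  # made-up daily quantities
windows = make_windows(daily_totals, n_lags=3, horizon=2)
```

In a multivariate setting, each input window would carry additional columns (for example engineered calendar features) alongside the lagged target values.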
## # A tibble: 110,710 x 8
## account_number stock_ticker country filled_quantity status filled_price
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 3433920 AML UK -150 FILLED 18.0
## 2 5051374 AML UK -10 FILLED 18.0
## 3 9098558 AML UK 550 FILLED 18.3
## 4 8692010 AML UK -8 FILLED 18.0
## 5 5728512 AML UK -10 FILLED 20.6
## 6 5728512 AML UK -10 FILLED 20.5
## 7 4964040 AML UK 0 CANCELLED NA
## 8 5728512 AML UK -10 FILLED 20.7
## 9 6303034 AML UK 8 FILLED 18.2
## 10 4801054 AML UK 0 CANCELLED NA
## # ... with 110,700 more rows, and 2 more variables: time_created <dttm>,
## # time_executed <dttm>
| Property | Value |
|---|---|
| Name | equity_test_tbl |
| Number of rows | 110710 |
| Number of columns | 8 |
| Column type frequency: | |
| character | 3 |
| numeric | 3 |
| POSIXct | 2 |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| stock_ticker | 0 | 1 | 3 | 4 | 0 | 6 | 0 |
| country | 0 | 1 | 2 | 2 | 0 | 2 | 0 |
| status | 0 | 1 | 3 | 9 | 0 | 4 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| account_number | 0 | 1.00 | 5834151.84 | 1956311.33 | 1489016.00 | 4150806.00 | 5754036.00 | 7563634.00 | 9327748.00 | ▂▇▇▆▆ |
| filled_quantity | 0 | 1.00 | 1.08 | 169.31 | -18068.00 | -0.25 | 0.50 | 2.80 | 30000.00 | ▁▇▁▁▁ |
| filled_price | 603 | 0.99 | 11.53 | 4.81 | 3.83 | 9.32 | 9.67 | 15.19 | 21.19 | ▂▇▁▂▃ |
Variable type: POSIXct
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| time_created | 0 | 1 | 2021-04-01 10:11:00 | 2021-08-31 23:01:00 | 2021-07-28 10:15:00 | 27952 |
| time_executed | 0 | 1 | 2021-07-01 00:00:00 | 2021-08-31 23:53:00 | 2021-07-28 10:15:00 | 20012 |
Preliminary analysis suggests that data quality is good: the only missing values are 603 filled_price entries (complete rate 0.99), and there are no apparent anomalies.
Prior to modeling, the raw data requires transformation and wrangling. There are several categories (groups). The first (parent) group is “country”, with two distinct labels, UK and EU. Next is the status column, with 4 sub-categories: “filled”, “cancelled”, “rejected”, and “new”. Finally, the stock ticker identifier column contains 6 individual stocks.
There are ways to perform multi-categorical forecasts, but their accuracy tends to deteriorate as the number of group categories grows (for reference - https://metodisimeonov.shinyapps.io/Hierarchical_Forecaster/). Therefore, the suggested approach going forward is to develop an ensemble (the current analysis) and nested forecasting methods to process panel data.
Moreover, the target variable contains positive and negative integers, which correspond to two additional groups: the number of shares bought and the number of shares sold.
For the purpose of this analysis, the target variable will be transformed to its absolute value.
The forecast will account for the UK region only.
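The transformation steps described above can be sketched as follows: keep UK orders, take the absolute value of the quantity, and sum per stock and day. This is a plain Python illustration rather than the code used in the analysis. The column names follow the data preview, but the sample rows, the ticker `XYZ`, the function name `daily_absolute_volume`, and the choice to keep only rows with status `FILLED` are assumptions.

```python
from collections import defaultdict
from datetime import datetime

# A few made-up rows that mimic the columns of the order data.
orders = [
    {"stock_ticker": "AML", "country": "UK", "filled_quantity": -150.0,
     "status": "FILLED", "time_executed": datetime(2021, 7, 1, 9, 30)},
    {"stock_ticker": "AML", "country": "UK", "filled_quantity": 550.0,
     "status": "FILLED", "time_executed": datetime(2021, 7, 1, 14, 5)},
    {"stock_ticker": "AML", "country": "UK", "filled_quantity": 0.0,
     "status": "CANCELLED", "time_executed": datetime(2021, 7, 1, 15, 0)},
    {"stock_ticker": "AML", "country": "UK", "filled_quantity": -10.0,
     "status": "FILLED", "time_executed": datetime(2021, 7, 2, 10, 0)},
    {"stock_ticker": "XYZ", "country": "EU", "filled_quantity": 40.0,
     "status": "FILLED", "time_executed": datetime(2021, 7, 1, 11, 0)},
]

def daily_absolute_volume(rows, country="UK"):
    """Sum |filled_quantity| per (ticker, day) for one country."""
    totals = defaultdict(float)
    for row in rows:
        # Assumption: only filled orders count towards the daily total.
        if row["country"] != country or row["status"] != "FILLED":
            continue
        key = (row["stock_ticker"], row["time_executed"].date())
        totals[key] += abs(row["filled_quantity"])
    return dict(totals)

volumes = daily_absolute_volume(orders)
```

Taking the absolute value merges bought and sold shares into a single traded-volume series per stock, which removes the two extra sign-based groups at the cost of losing trade direction.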
2.1. External Regressors
Visualizing the time series signatures indicates that there are recurring patterns at hourly, daily, and weekly intervals.
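Patterns at these intervals can be exposed to a model as calendar features derived from each timestamp. The following Python sketch shows one possible set of such features. The function name `time_signature` and the chosen feature names are hypothetical and do not reflect the feature set used in this analysis.

```python
from datetime import datetime

def time_signature(ts):
    """Derive calendar features that can serve as external regressors."""
    return {
        "hour": ts.hour,                  # captures intraday patterns
        "weekday": ts.weekday(),          # Monday = 0 ... Sunday = 6
        "iso_week": ts.isocalendar()[1],  # ISO 8601 week number
        "is_weekend": ts.weekday() >= 5,
    }

# Example: the latest execution timestamp in the data summary.
features = time_signature(datetime(2021, 8, 31, 23, 53))
```

Features like these are known in advance for any future date, so they remain available at forecast time, unlike regressors that would themselves need to be forecast.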