1. Environment Setup & Parallel Processing
The set of libraries and features used in the nested forecast analysis differs from that of the ensembled blueprint. This approach is a lighter and faster alternative to the ensemble technique. Instead of building a superlearner model, the aim here is to select the best-performing model for each time series group.
2. Pre-processing Pipeline
This step is identical to the ensembled blueprint.
## Rows: 110,710
## Columns: 8
## $ account_number <dbl> 3433920, 5051374, 9098558, 8692010, 5728512, 5728512, ~
## $ stock_ticker <chr> "AML", "AML", "AML", "AML", "AML", "AML", "AML", "AML"~
## $ country <chr> "UK", "UK", "UK", "UK", "UK", "UK", "UK", "UK", "UK", ~
## $ filled_quantity <dbl> -150.000000, -10.000000, 550.000000, -8.000000, -10.00~
## $ status <chr> "FILLED", "FILLED", "FILLED", "FILLED", "FILLED", "FIL~
## $ filled_price <dbl> 17.975, 17.960, 18.300, 17.975, 20.620, 20.460, NA, 20~
## $ time_created <dttm> 2021-04-28 16:56:00, 2021-05-13 23:25:00, 2021-05-14 ~
## $ time_executed <dttm> 2021-07-19 11:34:00, 2021-07-19 11:34:00, 2021-07-08 ~
2.1. Wrangle & Transform
The same assumptions about categories and groups apply here. The data is grouped on a daily basis, as this proved to be the most accurate way to forecast this particular dataset.
## # A tibble: 109,567 x 3
## # Groups: time_executed, stock_ticker [44,428]
## stock_ticker filled_quantity time_executed
## <fct> <dbl> <dttm>
## 1 AML 150 2021-07-19 11:34:00
## 2 AML 10 2021-07-19 11:34:00
## 3 AML 550 2021-07-08 10:05:00
## 4 AML 8 2021-07-19 11:34:00
## 5 AML 10 2021-08-06 15:32:00
## 6 AML 10 2021-08-06 10:25:00
## 7 AML 10 2021-08-12 10:04:00
## 8 AML 8 2021-07-08 17:02:00
## 9 AML 10 2021-07-20 16:02:00
## 10 AML 20 2021-08-06 10:02:00
## # ... with 109,557 more rows
## # A tibble: 258 x 3
## stock_ticker time_executed filled_quantity
## <fct> <dttm> <dbl>
## 1 AML 2021-07-01 00:00:00 5011.
## 2 AML 2021-07-02 00:00:00 4305.
## 3 AML 2021-07-05 00:00:00 3983.
## 4 AML 2021-07-06 00:00:00 6367.
## 5 AML 2021-07-07 00:00:00 8725.
## 6 AML 2021-07-08 00:00:00 6576.
## 7 AML 2021-07-09 00:00:00 13319.
## 8 AML 2021-07-12 00:00:00 10934.
## 9 AML 2021-07-13 00:00:00 6543.
## 10 AML 2021-07-14 00:00:00 4206.
## # ... with 248 more rows
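The wrangling step above can be sketched roughly as follows. This is a minimal base R illustration, not the report's actual code: the `orders` data frame is hypothetical, and the choices to keep only FILLED orders, take absolute quantities, and sum per ticker and day are assumptions inferred from the printed output.

```r
# Hypothetical sample of raw orders; column names follow the printed data.
orders <- data.frame(
  stock_ticker    = c("AML", "AML", "AML", "CCL"),
  filled_quantity = c(-150, -10, 550, 20),
  status          = c("FILLED", "FILLED", "FILLED", "CANCELLED"),
  time_executed   = as.POSIXct(
    c("2021-07-19 11:34:00", "2021-07-19 11:34:00",
      "2021-07-08 10:05:00", "2021-07-08 09:00:00"),
    tz = "UTC"
  )
)

# Assumption: keep filled orders only and treat buys and sells alike.
filled <- orders[orders$status == "FILLED", ]
filled$filled_quantity <- abs(filled$filled_quantity)

# Floor each timestamp to its calendar day.
filled$day <- as.Date(filled$time_executed, tz = "UTC")

# Assumption: daily volume is the sum of absolute quantities per ticker.
daily <- aggregate(filled_quantity ~ stock_ticker + day, data = filled, FUN = sum)
daily <- daily[order(daily$stock_ticker, daily$day), ]
print(daily)
```

If the report used a different summary (for example a mean), only the `FUN` argument would change.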
A major drawback of nested forecasting is that external regressors are very difficult to introduce into the modeling workflow. For the time being, none have been added.
2.2. Inspect temporal dynamics
There are apparent events (spikes) that can drastically worsen the predictions. These events reflect seasonal patterns and result from periodic orders placed on those days. Some are more pronounced and more regular than others.
2.3. Summary Diagnostics
## # A tibble: 1 x 12
## n.obs start end units scale tzone diff.minimum
## <int> <dttm> <dttm> <chr> <chr> <chr> <dbl>
## 1 258 2021-07-01 00:00:00 2021-08-31 00:00:00 secs second UTC -5270400
## # ... with 5 more variables: diff.q1 <dbl>, diff.median <dbl>, diff.mean <dbl>,
## # diff.q3 <dbl>, diff.maximum <dbl>
2.4. TS Diag - analyze seasonal patterns
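ACF - Autocorrelation: the correlation between the target variable and lagged versions of itself.
PACF - Partial autocorrelation: the correlation at a given lag after removing the effect of the shorter lags, which highlights the key seasonalities.
CCF - Cross-correlation: the correlation between the target variable and lagged predictors, showing which lags of a predictor are useful for forecasting the target.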
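As a minimal sketch of these three diagnostics, the base R functions `acf()`, `pacf()` and `ccf()` from the stats package can be applied to a simulated series. The data below is hypothetical (a 5-step cycle standing in for a trading week, and a predictor that leads a second target by two steps); the report's own plots may have been produced with a different helper.

```r
set.seed(42)
n <- 100
step <- seq_len(n)

# Hypothetical target with a 5-step cycle plus noise.
value <- 10 + 3 * sin(2 * pi * step / 5) + rnorm(n, sd = 0.5)

# ACF: correlation of the series with its own lags.
acf_res <- acf(value, lag.max = 20, plot = FALSE)

# PACF: correlation at each lag after accounting for shorter lags.
pacf_res <- pacf(value, lag.max = 20, plot = FALSE)

# Hypothetical predictor x that leads a second target y by two steps.
x <- rnorm(n)
y <- c(0, 0, x[1:(n - 2)]) + rnorm(n, sd = 0.1)

# CCF: ccf(x, y) at lag k estimates the correlation of x[t + k] with y[t].
ccf_res <- ccf(x, y, lag.max = 10, plot = FALSE)

acf_lag5 <- acf_res$acf[6]                          # element 1 is lag 0
best_lag <- ccf_res$lag[which.max(ccf_res$acf)]     # lag with strongest link
print(c(acf_lag5 = acf_lag5, best_lag = best_lag))
```

A strong ACF value at the cycle length points to seasonality, and a negative `best_lag` means the predictor moves before the target.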
2.5. Rename columns to “date”, “value” and “item_id”
## # A tibble: 258 x 3
## item_id date value
## <fct> <dttm> <dbl>
## 1 AML 2021-07-01 00:00:00 5011.
## 2 AML 2021-07-02 00:00:00 4305.
## 3 AML 2021-07-05 00:00:00 3983.
## 4 AML 2021-07-06 00:00:00 6367.
## 5 AML 2021-07-07 00:00:00 8725.
## 6 AML 2021-07-08 00:00:00 6576.
## 7 AML 2021-07-09 00:00:00 13319.
## 8 AML 2021-07-12 00:00:00 10934.
## 9 AML 2021-07-13 00:00:00 6543.
## 10 AML 2021-07-14 00:00:00 4206.
## # ... with 248 more rows
3. Nested Time Series Data Tibble
## # A tibble: 6 x 4
## item_id .actual_data .future_data .splits
## <fct> <list> <list> <list>
## 1 AML <tibble [43 x 2]> <tibble [14 x 2]> <split [29|14]>
## 2 CCL <tibble [43 x 2]> <tibble [14 x 2]> <split [29|14]>
## 3 HSBA <tibble [43 x 2]> <tibble [14 x 2]> <split [29|14]>
## 4 INRG <tibble [43 x 2]> <tibble [14 x 2]> <split [29|14]>
## 5 JDW <tibble [43 x 2]> <tibble [14 x 2]> <split [29|14]>
## 6 OCDO <tibble [43 x 2]> <tibble [14 x 2]> <split [29|14]>
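A nested table of this shape can be built with the modeltime helpers `extend_timeseries()`, `nest_timeseries()` and `split_nested_timeseries()`. The sketch below is an assumption about how the report did it: the data is simulated, and the 14-step future horizon and 14-step test window are read off the printed tibble sizes.

```r
library(dplyr)
library(modeltime)

# Hypothetical daily data: 2 series x 43 observations, columns as in section 2.5.
set.seed(1)
dates <- seq(as.Date("2021-07-01"), by = "day", length.out = 43)
data_tbl <- tibble(
  item_id = factor(rep(c("AML", "CCL"), each = 43)),
  date    = rep(dates, times = 2),
  value   = runif(86, 1000, 10000)
)

nested_data_tbl <- data_tbl %>%
  # Append 14 future dates (with missing values) to each series.
  extend_timeseries(.id_var = item_id, .date_var = date, .length_future = 14) %>%
  # Collapse each series into one row holding .actual_data and .future_data.
  nest_timeseries(.id_var = item_id, .length_future = 14) %>%
  # Hold out the last 14 actual observations of each series as the test set.
  split_nested_timeseries(.length_test = 14)

print(nested_data_tbl)
```

With 43 actual observations and a 14-step test window, each split holds 29 training and 14 testing rows, matching the `<split [29|14]>` entries above.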
4. Modeling
Recipe Specification
## # A tibble: 29 x 52
## value date_index.num date_month date_day date_wday date_mday date_qday
## <dbl> <dbl> <int> <int> <int> <int> <int>
## 1 5011. 1625097600 7 1 5 1 1
## 2 4305. 1625184000 7 2 6 2 2
## 3 3983. 1625443200 7 5 2 5 5
## 4 6367. 1625529600 7 6 3 6 6
## 5 8725. 1625616000 7 7 4 7 7
## 6 6576. 1625702400 7 8 5 8 8
## 7 13319. 1625788800 7 9 6 9 9
## 8 10934. 1626048000 7 12 2 12 12
## 9 6543. 1626134400 7 13 3 13 13
## 10 4206. 1626220800 7 14 4 14 14
## # ... with 19 more rows, and 45 more variables: date_yday <int>,
## # date_mweek <int>, date_week <int>, date_week2 <int>, date_week3 <int>,
## # date_week4 <int>, date_mday7 <int>, date_sin2_K1 <dbl>, date_cos2_K1 <dbl>,
## # date_sin2_K2 <dbl>, date_sin4_K1 <dbl>, date_cos4_K1 <dbl>,
## # date_sin4_K2 <dbl>, date_cos4_K2 <dbl>, date_sin7_K1 <dbl>,
## # date_cos7_K1 <dbl>, date_sin7_K2 <dbl>, date_cos7_K2 <dbl>,
## # date_sin14_K1 <dbl>, date_cos14_K1 <dbl>, date_sin14_K2 <dbl>, ...
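A recipe producing features like those printed above might look as follows. This is a sketch under assumptions: the training data is simulated, the set of removed columns is a guess, and the Fourier periods (2, 4, 7, 14) and K = 2 are inferred from the visible column names such as `date_sin7_K1`; the truncated column list may contain further terms.

```r
library(recipes)
library(timetk)

# Hypothetical training split for one series (29 daily rows, as in section 3).
set.seed(1)
train_tbl <- tibble::tibble(
  date  = seq(as.Date("2021-07-01"), by = "day", length.out = 29),
  value = runif(29, 1000, 10000)
)

recipe_spec <- recipe(value ~ ., data = train_tbl) %>%
  # Calendar features: date_index.num, date_month, date_wday, date_mday, ...
  step_timeseries_signature(date) %>%
  # Drop features that carry no signal for daily data (assumed selection).
  step_rm(matches("(\\.iso$)|(\\.xts$)|(hour)|(minute)|(second)|(am\\.pm)|(\\.lbl$)")) %>%
  # Fourier terms; the periods mirror the printed column names.
  step_fourier(date, period = c(2, 4, 7, 14), K = 2) %>%
  # Tree-based engines cannot use the raw date column.
  step_rm(date) %>%
  # Remove constant columns, e.g. a calendar or Fourier term that never varies.
  step_zv(all_predictors())

baked <- recipe_spec %>% prep() %>% bake(new_data = NULL)
print(names(baked))
```

The zero-variance step would explain why a term such as `date_cos2_K2` is absent from the printed columns, although that is an inference rather than something the report states.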
Using a combination of XGBoost, Random Forest & Prophet Boost for the Workflow
## # Nested Modeltime Table
## # A tibble: 6 x 5
## item_id .actual_data .future_data .splits .modeltime_tables
## <fct> <list> <list> <list> <list>
## 1 AML <tibble [43 x 2]> <tibble [14 x 2]> <split [29|14]> <mdl_time_tbl [3 ~
## 2 CCL <tibble [43 x 2]> <tibble [14 x 2]> <split [29|14]> <mdl_time_tbl [3 ~
## 3 HSBA <tibble [43 x 2]> <tibble [14 x 2]> <split [29|14]> <mdl_time_tbl [3 ~
## 4 INRG <tibble [43 x 2]> <tibble [14 x 2]> <split [29|14]> <mdl_time_tbl [3 ~
## 5 JDW <tibble [43 x 2]> <tibble [14 x 2]> <split [29|14]> <mdl_time_tbl [3 ~
## 6 OCDO <tibble [43 x 2]> <tibble [14 x 2]> <split [29|14]> <mdl_time_tbl [3 ~
4.2. Check for Errors
## # A tibble: 0 x 4
## # ... with 4 variables: item_id <fct>, .model_id <int>, .model_desc <chr>,
## # .error_desc <chr>
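The fitting and error-check steps can be sketched with `modeltime_nested_fit()` and `extract_nested_error_report()`. The example below is illustrative only: the data is simulated, the recipe is a simplified stand-in, and only the XGBoost and Random Forest workflows are included (Prophet Boost is left out to keep the sketch light).

```r
library(dplyr)
library(parsnip)
library(recipes)
library(workflows)
library(timetk)
library(modeltime)

# Hypothetical nested data: 2 series x 43 daily observations.
set.seed(1)
dates <- seq(as.Date("2021-07-01"), by = "day", length.out = 43)
data_tbl <- tibble(
  item_id = factor(rep(c("AML", "CCL"), each = 43)),
  date    = rep(dates, times = 2),
  value   = runif(86, 1000, 10000)
)

nested_data_tbl <- data_tbl %>%
  extend_timeseries(.id_var = item_id, .date_var = date, .length_future = 14) %>%
  nest_timeseries(.id_var = item_id, .length_future = 14) %>%
  split_nested_timeseries(.length_test = 14)

# Simplified recipe built on one training split and reused for every series.
rec <- recipe(value ~ ., extract_nested_train_split(nested_data_tbl)) %>%
  step_timeseries_signature(date) %>%
  step_rm(date, matches("\\.lbl$")) %>%
  step_zv(all_predictors())

wflw_xgb <- workflow() %>%
  add_model(boost_tree(mode = "regression") %>% set_engine("xgboost")) %>%
  add_recipe(rec)

wflw_rf <- workflow() %>%
  add_model(rand_forest(mode = "regression") %>% set_engine("ranger")) %>%
  add_recipe(rec)

# Fit every workflow to every series and score it on the test split.
nested_modeltime_tbl <- modeltime_nested_fit(
  nested_data = nested_data_tbl,
  wflw_xgb, wflw_rf,
  control = control_nested_fit(verbose = FALSE, allow_par = FALSE)
)

# An empty report means every model trained on every series.
error_tbl    <- extract_nested_error_report(nested_modeltime_tbl)
accuracy_tbl <- extract_nested_test_accuracy(nested_modeltime_tbl)
print(accuracy_tbl)
```

The accuracy table has one row per series and model, which is the input for choosing a winner per series later on.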
4.3. Review Test Accuracy
4.4. Visualize Test Forecast
5. Select the Best Models
## # Nested Modeltime Table
## # A tibble: 6 x 5
## item_id .actual_data .future_data .splits .modeltime_tables
## <fct> <list> <list> <list> <list>
## 1 AML <tibble [43 x 2]> <tibble [14 x 2]> <split [29|14]> <mdl_time_tbl [1 ~
## 2 CCL <tibble [43 x 2]> <tibble [14 x 2]> <split [29|14]> <mdl_time_tbl [1 ~
## 3 HSBA <tibble [43 x 2]> <tibble [14 x 2]> <split [29|14]> <mdl_time_tbl [1 ~
## 4 INRG <tibble [43 x 2]> <tibble [14 x 2]> <split [29|14]> <mdl_time_tbl [1 ~
## 5 JDW <tibble [43 x 2]> <tibble [14 x 2]> <split [29|14]> <mdl_time_tbl [1 ~
## 6 OCDO <tibble [43 x 2]> <tibble [14 x 2]> <split [29|14]> <mdl_time_tbl [1 ~
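The selection logic itself is simple: for each series, keep the model with the lowest test error. The base R sketch below shows this on a hypothetical accuracy table; the RMSE values are invented and the choice of RMSE as the metric is an assumption. In modeltime the analogous step is `modeltime_nested_select_best()`, which takes a `metric` and a `minimize` argument.

```r
# Hypothetical test-accuracy table: one row per series and model.
accuracy <- data.frame(
  item_id     = rep(c("AML", "CCL"), each = 3),
  .model_desc = rep(c("XGBOOST", "RANGER", "PROPHET"), times = 2),
  rmse        = c(4200, 3100, 3900, 2800, 2950, 2500)
)

# For each series, keep the row with the smallest RMSE.
best <- do.call(rbind, lapply(split(accuracy, accuracy$item_id), function(g) {
  g[which.min(g$rmse), ]
}))
rownames(best) <- NULL
print(best)
```

After selection each series carries a single model, which is why the modeltime tables above shrink from three entries to one.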
5.1. Visualize the Best Models
The fit of the best model for each time series on the test set is far from perfect, but it is reasonable given the limited data (29 training observations per series).
6. Refit
6.1. Check for Errors
## # A tibble: 0 x 4
## # ... with 4 variables: item_id <fct>, .model_id <int>, .model_desc <chr>,
## # .error_desc <chr>
6.2. Visualize Future Forecast
For the most part, the future forecast appears to provide a reasonable approximation of the future values, staying within the range of variance observed in each time series.
7. Save and export forecast results
## # A tibble: 6 x 5
## item_id .model_desc .conf_hi .conf_lo predicted
## <fct> <chr> <dbl> <dbl> <dbl>
## 1 AML RANGER 12149. 1123. 6636.
## 2 CCL PROPHET 11266. -960. 5153.
## 3 HSBA PROPHET 51815. -7055. 22380.
## 4 INRG PROPHET 20101. 1435. 10768.
## 5 JDW PROPHET 3157. -846. 1155.
## 6 OCDO RANGER 9203. -1990. 3607.
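A summary of this shape could be produced by collapsing the forecast rows to one row per series and writing the result to disk. The base R sketch below is an assumption about that step: the `forecast` data frame is hypothetical (it mimics the `.value`, `.conf_lo` and `.conf_hi` columns of a modeltime forecast), averaging over the horizon is a guess, and the output file name is invented.

```r
# Hypothetical forecast rows: one row per series and future date.
forecast <- data.frame(
  item_id     = rep(c("AML", "CCL"), each = 2),
  .model_desc = rep(c("RANGER", "PROPHET"), each = 2),
  .value      = c(6000, 7000, 5000, 5400),
  .conf_lo    = c(1000, 1200, -900, -1000),
  .conf_hi    = c(12000, 12400, 11000, 11500)
)

# Rename the point forecast to match the exported column name.
forecast$predicted <- forecast$.value

# Assumption: the summary averages each column over the forecast horizon.
summary_tbl <- aggregate(
  cbind(.conf_hi, .conf_lo, predicted) ~ item_id + .model_desc,
  data = forecast, FUN = mean
)

# Write the summary to a CSV file (hypothetical file name, temp directory).
out_file <- file.path(tempdir(), "nested_forecast_summary.csv")
write.csv(summary_tbl, out_file, row.names = FALSE)
print(summary_tbl)
```

Negative lower bounds, as in the table above, are a reminder that the intervals are not constrained to non-negative quantities.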