spooky: a brief introduction

Giancarlo Vercellino

13-April-2022

I will live in the Past, the Present, and the Future.” (Ebenezer Scrooge, A Christmas Carol)

Expanding time features using Spectral analysis and Jack-Knife resampling

spooky uses Discrete Fast Fourier Transformation to extrapolate time features beyond their boundaries. As you can see below, moving inside out the picture, the main points of spooky are the following:

  1. We can decompose a time feature signal in all its components using a Discrete Fast-Fourier Transformation: some components will be highly related to future trends and will be relevant for any purpose of future generalizations, while others will probably be no more than noises and errors without any representative power for extrapolation: we sift through all the possible frequencies using a jack-knife resampling to magnify the predicting power of best frequencies. Each time feature is expanded inside each validation window: after the spectral analysis, the signal of each time feature is extended into the future for the requested sequence of time steps (seq_len), then the expansion is being carried out on the resampled set from a jack-knife leaving N points out (lno).

  2. Some basic transformation are directly managed in background. Differentiation and integration are automatically managed by spooky using the maximal p-value in a recursive F-test for de-trending each time-feature: this allows to easily determine the different dynamic characteristics of each time feature (random walk, trend, exponential). Maybe not cool like Augmented Dickey-Fuller or Ljung Box Test, but it works for our humble purpose. If you have limited missing values in your time features, spooky automatically proceeds with the imputation using the Kalman filter method1. If you prefer to project into the future the smoothed version, you can set smoother = TRUE to use loess2 function.

  3. The test errors are cross-validated through an expanding validation n_windows: the default value is set to 10, meaning that the time features are divided into 10 + 1 segments guaranteeing at least ten validation sets to measure the error on unforeseen data. For each point in the prediction sequence, a thousand samples are collected for the calculation of quantiles, mean, mode, standard deviation, skewness and kurtosis, and other less common measures that are provided for each time step (see below).

  4. Concluding this brief excursus inside out the mode, beyond the windowing phase, spooky manages the random search sampling n_samp models, randomly selected within the cartesian space delimited by seq_len and lno. The process is really fast.

    Easy, that’s all.

The process flow of spooky

The process flow of spooky

A look at the price dynamics of tech giants’ stocks

The dataset time features included with spooky is a recent take on IBM and Microsoft stock prices (source: Yahoo Finance). The data is expected in a dataframe format, where each column represents a different time series (the date information is not mandatory and could be provided separately).

Examples of time features: IBM and MICROSOFT Close Prices
IBM.Close MSFT.Close
2017-01-03 159.8375 62.58
2017-01-04 161.8164 62.30
2017-01-05 161.2811 62.30
2017-01-06 162.0746 62.84
2017-01-09 160.2773 62.64
2017-01-10 158.2409 62.62
2017-01-11 160.3728 63.19
2017-01-12 160.5641 62.61
2017-01-13 159.9809 62.70
2017-01-17 160.5067 62.53

In the first example, we are predicting the close price for IBM and Microsoft. In this example we try to set seq_len= 100 (sequence length) and lno= 1 (leave-N-out), using a cross-validation scheme of 10 n_windows for error measurement.


example1 <- spooky(time_features, n_samp = 1, seq_len = 50, n_windows = 10,  dates = as.Date(rownames(time_features)), lno = 1)
  time: 10.11 sec elapsed

The result is a list of different components, as you can see below.

names(example1)
  [1] "exploration" "history"     "best_model"  "time_log"

The first variable, exploration, includes all the models evaluated during the random search within the boundaries decided by users: besides the predictions for each feature, exploration includes the test error statistics for each time sequence (length equals to seq_len) and model information. The second variable, history, summarizes the hyper-parameters selected by users, the relative max error metrics across different time features (me, mae, rmsse, mpe, mape, rmae, rrmse, rame, mase, smse, sce, gmrae 3) and the weighted average rank used to sort out the best model. The last variable, best_model, collects testing_errors, preds, plots for the selected model.

The prediction is a list including the predicted results for each time-feature (quantile, min, max, mean, mode, sd, skewness, kurtosis, iqr to range, risk ratio, upside probability, divergence for each time point in the seq_len sequence). The IQR to range is the interquartile range to the min-max range, the risk ratio is the range above median to the range below it, the upside probability is the probability of growth compared to the former point in the time sequence, the divergence is the maximum distance of cumulative normal curve of each point to the former point in the sequence.

Examples of prediction for IBM Close Prices
min 10% 25% 50% 75% 90% max mean sd mode kurtosis skewness iqr_to_range risk_ratio upside_prob divergence
2022-04-05 123.291 124.218 127.284 127.630 129.509 130.075 130.711 127.933 1.879 127.974 3.141 -0.870 0.300 0.710 NA NA
2022-04-06 125.432 126.038 128.061 129.656 132.766 134.182 134.520 129.884 2.757 128.825 1.855 0.307 0.518 1.152 0.736 0.002
2022-04-07 122.122 125.304 127.102 129.556 131.683 133.359 134.253 129.367 3.006 131.076 2.352 -0.408 0.378 0.632 0.439 0.079
2022-04-09 124.420 126.501 129.360 131.037 132.109 133.458 133.810 130.665 2.262 131.462 3.468 -1.034 0.293 0.419 0.658 0.005
2022-04-10 124.202 127.552 129.701 130.403 133.145 136.468 139.255 130.984 3.143 130.105 3.036 0.334 0.229 1.428 0.522 0.053
2022-04-12 116.776 117.239 125.797 131.133 134.117 134.889 138.390 129.220 6.045 132.026 2.806 -1.026 0.385 0.505 0.416 0.255
2022-04-13 116.053 116.383 124.565 130.119 135.276 136.724 138.422 128.558 6.938 131.590 2.322 -0.670 0.479 0.590 0.457 0.062
2022-04-15 116.745 117.101 124.420 130.333 132.792 133.730 136.335 128.007 5.905 132.052 2.453 -0.892 0.427 0.442 0.488 0.062
2022-04-16 117.312 117.605 124.523 131.834 135.699 137.980 139.263 129.911 7.181 135.024 2.028 -0.635 0.509 0.512 0.570 0.005
2022-04-18 115.223 115.576 126.419 130.807 134.335 139.075 140.759 128.658 7.594 128.949 2.338 -0.504 0.310 0.639 0.475 0.071
2022-04-19 115.664 116.105 125.830 131.873 138.004 144.732 146.708 130.285 9.052 129.497 2.179 -0.122 0.392 0.915 0.539 0.010
2022-04-21 114.624 115.039 122.820 128.364 138.167 146.664 148.452 129.973 10.096 127.120 2.115 0.191 0.454 1.462 0.479 0.035
2022-04-22 115.823 116.287 120.958 126.506 138.906 152.756 154.379 130.110 11.487 122.010 2.522 0.689 0.466 2.609 0.504 0.028
2022-04-24 113.660 115.374 121.403 126.972 136.247 153.948 155.031 129.599 11.550 130.326 2.933 0.726 0.359 2.108 0.492 0.018
2022-04-25 115.752 116.243 121.544 127.314 136.236 145.722 146.593 129.349 9.492 130.122 2.118 0.287 0.476 1.667 0.488 0.053
2022-04-27 114.999 116.134 126.372 128.135 136.108 148.476 149.468 130.624 9.988 130.807 2.415 0.281 0.282 1.624 0.508 0.000
2022-04-28 112.883 115.959 123.957 124.776 135.817 139.903 142.994 127.736 8.203 121.056 1.783 0.053 0.394 1.532 0.418 0.146
2022-04-29 114.891 116.189 121.962 124.254 139.427 145.720 147.639 129.188 10.492 120.690 1.713 0.364 0.533 2.498 0.551 0.028
2022-05-01 109.913 110.997 122.091 123.697 139.137 145.209 146.779 126.925 11.265 124.611 1.994 0.241 0.462 1.674 0.438 0.087
2022-05-02 108.429 108.918 120.886 124.215 137.317 141.225 143.251 125.629 10.958 124.663 1.894 -0.112 0.472 1.206 0.500 0.048
2022-05-04 107.332 107.716 119.847 125.924 135.959 140.008 142.550 124.754 10.634 123.869 2.026 -0.192 0.458 0.894 0.459 0.034
2022-05-05 107.833 108.467 121.217 127.392 131.165 137.919 143.016 124.996 9.829 125.016 2.271 -0.330 0.283 0.799 0.521 0.014
2022-05-07 107.280 108.062 118.923 124.399 130.997 138.203 140.394 124.131 10.155 124.228 1.912 -0.244 0.365 0.934 0.471 0.037
2022-05-08 105.531 106.093 117.417 123.886 130.399 137.078 139.115 122.990 10.639 124.373 1.899 -0.382 0.387 0.830 0.486 0.047
2022-05-10 107.562 109.164 121.019 130.978 135.424 143.467 144.917 127.628 11.799 137.613 1.836 -0.299 0.386 0.595 0.603 0.004
2022-05-11 108.599 110.260 121.968 129.937 136.626 141.369 143.628 127.944 11.343 137.399 1.761 -0.403 0.418 0.642 0.522 0.004
2022-05-13 108.665 108.969 123.950 130.296 137.776 142.943 144.433 129.147 11.083 129.750 2.307 -0.659 0.387 0.654 0.530 0.001
2022-05-14 110.037 110.395 123.012 129.699 138.215 142.718 143.598 128.394 10.729 128.560 2.126 -0.389 0.453 0.707 0.471 0.030
2022-05-16 108.876 109.498 122.273 129.333 136.187 141.189 141.876 126.725 10.773 135.420 1.975 -0.364 0.422 0.613 0.453 0.062
2022-05-17 108.552 110.412 120.509 126.952 135.333 141.401 142.254 126.290 10.225 125.635 1.946 -0.173 0.440 0.832 0.493 0.024
2022-05-19 108.521 109.124 121.614 126.610 135.248 138.392 139.764 125.740 10.000 133.624 2.071 -0.434 0.436 0.727 0.476 0.023
2022-05-20 110.892 112.294 124.817 128.389 135.515 141.067 141.872 128.265 9.216 127.673 2.318 -0.445 0.345 0.771 0.581 0.000
2022-05-22 109.897 111.294 126.043 127.756 137.501 139.531 146.453 128.488 10.359 128.462 2.325 -0.202 0.313 1.047 0.518 0.023
2022-05-23 109.773 110.401 127.153 128.851 135.375 138.804 147.791 128.199 10.041 127.903 2.768 -0.365 0.216 0.993 0.502 0.016
2022-05-24 108.138 109.786 124.906 129.702 140.268 144.722 149.081 129.683 12.381 130.612 2.051 -0.307 0.375 0.899 0.536 0.024
2022-05-26 110.257 111.699 125.345 128.316 138.698 146.659 149.671 129.579 11.595 129.372 2.168 -0.108 0.339 1.183 0.519 0.018
2022-05-27 110.915 112.697 123.521 128.045 140.102 146.726 150.733 129.436 12.028 120.362 1.947 0.094 0.416 1.325 0.468 0.012
2022-05-29 112.964 114.502 121.570 127.502 140.362 150.480 150.932 130.296 12.237 120.112 1.925 0.321 0.495 1.612 0.537 0.002
2022-05-30 113.753 115.900 119.003 125.171 141.510 148.352 149.944 129.289 12.141 118.785 1.710 0.421 0.622 2.170 0.485 0.033
2022-06-01 114.559 116.623 122.662 129.952 141.969 147.978 151.934 130.947 11.399 122.484 1.816 0.361 0.517 1.428 0.538 0.000
2022-06-02 113.436 117.261 121.820 128.589 140.516 147.806 153.463 130.029 11.602 121.422 2.084 0.597 0.467 1.641 0.478 0.033
2022-06-04 112.127 114.354 121.302 127.910 140.282 148.327 152.344 130.213 12.259 120.393 1.923 0.370 0.472 1.548 0.506 0.010
2022-06-05 114.654 115.648 123.411 128.819 144.328 148.091 154.735 131.665 12.104 127.914 1.918 0.337 0.522 1.830 0.534 0.003
2022-06-07 115.035 115.334 125.738 130.378 144.334 147.042 150.160 132.188 11.679 122.441 1.710 -0.067 0.529 1.289 0.526 0.002
2022-06-08 116.846 119.064 125.014 134.047 147.141 149.536 152.449 134.079 11.259 131.993 1.656 0.062 0.621 1.070 0.530 0.003
2022-06-10 116.519 121.699 127.446 135.198 146.573 150.488 152.567 135.130 10.783 127.208 1.763 0.222 0.531 0.930 0.509 0.000
2022-06-11 115.928 118.430 123.879 133.948 140.346 150.998 153.221 132.994 11.539 125.937 1.921 0.285 0.442 1.070 0.460 0.081
2022-06-13 114.817 119.164 120.716 135.881 140.683 152.237 155.009 134.392 12.775 125.108 1.703 0.187 0.497 0.908 0.536 0.005
2022-06-14 106.221 114.362 123.434 134.880 140.841 156.219 158.294 134.105 14.002 130.288 1.963 0.221 0.334 0.817 0.480 0.028
2022-06-16 108.645 110.757 122.794 132.532 141.286 153.733 156.718 132.076 13.910 129.217 1.999 0.049 0.385 1.013 0.469 0.058

For each time features included in the model, you get a plot of the median with the chosen confidence interval (ci default is 0.8). As in other packages4, we provide different stats to give a better hint on the different dynamics related to aleatoric and epistemic uncertainty.

  $IBM.Close

  
  $MSFT.Close

Fast model optimization in a compact space

Now, let’s try a fast random search to optimize our estimations. The following example show how to sample 30 different models for a sequence of 50 time steps. Setting lno = NULL starts a search between 1 and the square root of time feature length.

example2 <- spooky(time_features, n_samp = 30, seq_len = 50, n_windows = 10,  dates = as.Date(rownames(time_features)), lno = NULL)
  time: 489.36 sec elapsed
History table with ranking of 30 different models
seq_len lno max_me max_mae max_mse max_rmsse max_mpe max_mape max_rmae max_rrmse max_rame max_mase max_smse max_sce max_gmrae wgt_avg_rank
20 50 21 4.23 7.76 114.80 6.80 0.03 0.06 0.06 0.07 0.05 4.27 58.03 110.94 0.05 7.51
5 50 23 4.31 7.77 115.35 6.80 0.03 0.06 0.06 0.07 0.05 4.26 57.73 113.45 0.05 8.18
9 50 24 4.37 7.77 116.10 6.80 0.03 0.06 0.06 0.07 0.05 4.27 57.62 115.56 0.05 8.87
6 50 19 4.22 7.80 115.36 6.82 0.03 0.06 0.06 0.07 0.05 4.29 58.31 110.42 0.05 8.92
26 50 19 4.22 7.80 115.36 6.82 0.03 0.06 0.06 0.07 0.05 4.29 58.31 110.42 0.05 8.92
7 50 27 4.56 7.80 119.68 6.79 0.03 0.06 0.06 0.07 0.05 4.28 56.86 118.91 0.05 9.47
10 50 26 4.51 7.81 118.33 6.81 0.03 0.06 0.06 0.07 0.05 4.28 57.26 119.12 0.05 10.13
12 50 26 4.51 7.81 118.33 6.81 0.03 0.06 0.06 0.07 0.05 4.28 57.26 119.12 0.05 10.13
19 50 18 4.24 7.83 116.07 6.83 0.03 0.06 0.06 0.07 0.05 4.31 58.49 111.43 0.05 10.62
11 50 17 4.27 7.84 116.96 6.82 0.03 0.06 0.06 0.07 0.05 4.32 58.45 112.18 0.05 10.99
15 50 17 4.27 7.84 116.96 6.82 0.03 0.06 0.06 0.07 0.05 4.32 58.45 112.18 0.05 10.99
29 50 17 4.27 7.84 116.96 6.82 0.03 0.06 0.06 0.07 0.05 4.32 58.45 112.18 0.05 10.99
4 50 30 4.74 7.91 125.26 6.81 0.03 0.06 0.06 0.07 0.05 4.37 57.16 122.23 0.05 12.13
30 50 30 4.74 7.91 125.26 6.81 0.03 0.06 0.06 0.07 0.05 4.37 57.16 122.23 0.05 12.13
27 50 15 4.36 7.86 119.08 6.85 0.03 0.06 0.06 0.07 0.05 4.32 58.90 113.67 0.05 13.23
1 50 33 4.87 8.09 130.40 6.83 0.03 0.06 0.06 0.07 0.05 4.44 58.03 124.46 0.05 15.83
21 50 33 4.87 8.09 130.40 6.83 0.03 0.06 0.06 0.07 0.05 4.44 58.03 124.46 0.05 15.83
28 50 33 4.87 8.09 130.40 6.83 0.03 0.06 0.06 0.07 0.05 4.44 58.03 124.46 0.05 15.83
2 50 34 4.89 8.11 131.29 6.87 0.03 0.06 0.06 0.07 0.05 4.42 58.59 124.13 0.05 17.81
13 50 34 4.89 8.11 131.29 6.87 0.03 0.06 0.06 0.07 0.05 4.42 58.59 124.13 0.05 17.81
16 50 34 4.89 8.11 131.29 6.87 0.03 0.06 0.06 0.07 0.05 4.42 58.59 124.13 0.05 17.81
24 50 34 4.89 8.11 131.29 6.87 0.03 0.06 0.06 0.07 0.05 4.42 58.59 124.13 0.05 17.81
17 50 35 4.91 8.11 131.72 6.88 0.03 0.06 0.06 0.07 0.05 4.41 58.94 124.66 0.05 19.31
23 50 36 4.91 8.11 131.65 6.88 0.03 0.06 0.06 0.07 0.05 4.42 59.04 125.77 0.05 19.86
14 50 10 4.72 8.50 124.99 7.49 0.03 0.07 0.07 0.08 0.06 4.77 67.88 123.80 0.05 23.76
3 50 11 4.65 8.51 123.89 7.49 0.03 0.07 0.07 0.08 0.06 4.78 68.25 123.36 0.05 23.85
8 50 6 4.86 8.57 128.90 7.57 0.03 0.07 0.07 0.08 0.06 4.80 68.27 125.30 0.05 26.25
22 50 6 4.86 8.57 128.90 7.57 0.03 0.07 0.07 0.08 0.06 4.80 68.27 125.30 0.05 26.25
18 50 5 4.88 8.57 129.80 7.57 0.03 0.07 0.07 0.08 0.06 4.80 68.06 125.41 0.05 26.65
25 50 4 4.90 8.56 130.65 7.57 0.03 0.07 0.07 0.08 0.06 4.79 67.76 125.82 0.05 27.17

If we compare the error statistics from the best model in example2 with the model in example1, we see an interesting improvement in the testing errors for IBM and Microsoft (the default error measurement for relative and scaled error is naive). The error statistics from example1, evaluated as the mean on 10 validation windows, are the following:

example1$best_model$testing_errors
                me   mae     mse rmsse   mpe  mape  rmae rrmse  rame  mase   smse
  IBM.Close  4.283 8.575 122.207 7.580 0.032 0.067 0.068 0.080 0.062 4.793 67.670
  MSFT.Close 4.911 7.782 132.215 6.302 0.021 0.042 0.042 0.048 0.039 4.330 57.677
                 sce gmrae
  IBM.Close  117.787 0.048
  MSFT.Close 124.995 0.034

The error statistics from example2.

example2$best_model$testing_errors
                me   mae     mse rmsse   mpe  mape  rmae rrmse  rame  mase   smse
  IBM.Close  3.478 7.763 107.472 6.804 0.027 0.061 0.062 0.073 0.051 4.266 58.027
  MSFT.Close 4.229 7.439 114.796 6.092 0.018 0.041 0.041 0.046 0.037 4.189 51.030
                 sce gmrae
  IBM.Close   92.959 0.046
  MSFT.Close 110.937 0.033

The improvement is clear for both the time features, but we are still using a naive approach to measure scaled and relative errors. Let’s try to shift to deviation as scale, and average as benchmark, that are more challenging evaluation criteria.

example3 <- spooky(time_features, n_samp = 30, seq_len = 50, n_windows = 10,  dates = as.Date(rownames(time_features)), lno = NULL, error_scale = "deviation", error_benchmark = "average")
  time: 736.65 sec elapsed

As you can see, the relative and scaled measures change sensibly as we raise the bar of our expectations

example3$best_model$testing_errors
                me   mae     mse rmsse   mpe  mape  rmae rrmse rame  mase   smse
  IBM.Close  3.478 7.763 107.472  6.71 0.027 0.061 1.141 1.180    1 4.151 56.380
  MSFT.Close 4.229 7.439 114.796  6.02 0.018 0.041 0.837 0.843    1 4.143 49.313
                 sce gmrae
  IBM.Close   88.589 1.099
  MSFT.Close 109.244 0.840

Prospecting the forecasting horizon

You can also explore for the best possible forecasting horizon, choosing a range for seq_len besides lno. Let’s see the best foreseeability using the ranges lno = c(1, 50) and seq_len = c(10, 20).

example4 <- spooky(time_features, n_samp = 10, n_windows = 10,  dates = as.Date(rownames(time_features)), lno = c(1, 50), seq_len = c(10, 20))
  time: 117.86 sec elapsed

As we can see from history, 13 days and 47 left-out are the best parameters to reduce the max joint error on the time features.

Prospecting the joint foreseeability of IBM and Microsoft stock prices
seq_len lno max_me max_mae max_mse max_rmsse max_mpe max_mape max_rmae max_rrmse max_rame max_mase max_smse max_sce max_gmrae wgt_avg_rank
3 13 47 -0.19 8.27 152.32 6.61 0.00 0.04 0.04 0.05 0.04 4.32 64.82 0.05 0.04 2.92
8 11 7 -0.59 6.52 88.80 5.45 0.00 0.04 0.04 0.05 0.04 3.46 39.51 -3.71 0.03 3.83
1 19 23 0.03 8.52 186.20 6.92 0.00 0.05 0.05 0.05 0.04 4.53 78.47 -0.54 0.04 3.96
4 18 14 -0.04 8.72 188.35 7.06 0.00 0.05 0.05 0.05 0.04 4.68 79.87 0.43 0.04 4.17
7 17 49 -0.50 8.61 182.56 6.91 0.00 0.05 0.05 0.05 0.04 4.49 76.97 -4.70 0.04 5.32
2 19 36 -0.11 8.52 183.66 6.95 -0.01 0.05 0.05 0.05 0.04 4.52 77.76 -2.37 0.04 5.99
10 17 28 -0.71 8.37 178.40 6.83 -0.01 0.04 0.04 0.05 0.04 4.45 75.44 -4.97 0.03 6.63
9 17 24 -0.79 8.35 178.53 6.82 -0.01 0.04 0.04 0.05 0.04 4.45 75.53 -5.73 0.03 7.13
6 15 47 -1.12 7.98 165.54 6.47 -0.01 0.04 0.04 0.05 0.04 4.19 69.83 -8.34 0.03 7.33
5 16 24 -1.14 8.21 172.40 6.68 -0.01 0.04 0.04 0.05 0.04 4.40 73.33 -6.74 0.04 7.73

Let’s see the plots.

  $IBM.Close

  
  $MSFT.Close

Some useful references


  1. The missing imputation is managed through imputeTS package. For any information: https://cran.r-project.org/web/packages/imputeTS/index.html.↩︎

  2. In some cases, maybe you want to operate on smoothed time-features. In this case, spooky calls on fANCOVA package. Here you can find all the latest: https://cran.r-project.org/web/packages/fANCOVA/index.html↩︎

  3. The metrics are calculated using the greybox package. For any reference, please take a look here: https://cran.r-project.org/web/packages/greybox/index.html↩︎

  4. Other packages focused on time feature analysis that could be of interest here:

    - AUDREX, https://cran.r-project.org/web/packages/audrex/index.html
    - PROTEUS, https://cran.r-project.org/web/packages/proteus/index.html
    - JENGA, https://cran.r-project.org/web/packages/jenga/index.html
    - TETRAGON, https://cran.r-project.org/web/packages/tetragon/index.html
    ↩︎