spooky: a brief introduction

Expanding time features using Spectral analysis and Jack-Knife resampling

spooky uses Discrete Fast Fourier Transformation to extrapolate time features beyond their boundaries. As you can see below, moving inside out the picture, the main points of spooky are the following:

We can decompose a time feature signal in all its components using a Discrete Fast-Fourier Transformation: some components will be highly related to future trends and will be relevant for any purpose of future generalizations, while others will probably be no more than noises and errors without any representative power for extrapolation: we sift through all the possible frequencies using a jack-knife resampling to magnify the predicting power of best frequencies. Each time feature is expanded inside each validation window: after the spectral analysis, the signal of each time feature is extended into the future for the requested sequence of time steps (seq_len), then the expansion is being carried out on the resampled set from a jack-knife leaving N points out (lno).
Some basic transformation are directly managed in background. Differentiation and integration are automatically managed by spooky using the maximal p-value in a recursive F-test for de-trending each time-feature: this allows to easily determine the different dynamic characteristics of each time feature (random walk, trend, exponential). Maybe not cool like Augmented Dickey-Fuller or Ljung Box Test, but it works for our humble purpose. If you have limited missing values in your time features, spooky automatically proceeds with the imputation using the Kalman filter method¹. If you prefer to project into the future the smoothed version, you can set smoother = TRUE to use loess² function.
The test errors are cross-validated through an expanding validation n_windows: the default value is set to 10, meaning that the time features are divided into 10 + 1 segments guaranteeing at least ten validation sets to measure the error on unforeseen data. For each point in the prediction sequence, a thousand samples are collected for the calculation of quantiles, mean, mode, standard deviation, skewness and kurtosis, and other less common measures that are provided for each time step (see below).
Concluding this brief excursus inside out the mode, beyond the windowing phase, spooky manages the random search sampling n_samp models, randomly selected within the cartesian space delimited by seq_len and lno. The process is really fast.

Easy, that’s all.

The process flow of spooky

A look at the price dynamics of tech giants’ stocks

The dataset time features included with spooky is a recent take on IBM and Microsoft stock prices (source: Yahoo Finance). The data is expected in a dataframe format, where each column represents a different time series (the date information is not mandatory and could be provided separately).

Examples of time features: IBM and MICROSOFT Close Prices
	IBM.Close	MSFT.Close
2017-01-03	159.8375	62.58
2017-01-04	161.8164	62.30
2017-01-05	161.2811	62.30
2017-01-06	162.0746	62.84
2017-01-09	160.2773	62.64
2017-01-10	158.2409	62.62
2017-01-11	160.3728	63.19
2017-01-12	160.5641	62.61
2017-01-13	159.9809	62.70
2017-01-17	160.5067	62.53

In the first example, we are predicting the close price for IBM and Microsoft. In this example we try to set seq_len= 100 (sequence length) and lno= 1 (leave-N-out), using a cross-validation scheme of 10 n_windows for error measurement.


example1 <- spooky(time_features, n_samp = 1, seq_len = 50, n_windows = 10,  dates = as.Date(rownames(time_features)), lno = 1)
  time: 10.11 sec elapsed

The result is a list of different components, as you can see below.

names(example1)
  [1] "exploration" "history"     "best_model"  "time_log"

The first variable, exploration, includes all the models evaluated during the random search within the boundaries decided by users: besides the predictions for each feature, exploration includes the test error statistics for each time sequence (length equals to seq_len) and model information. The second variable, history, summarizes the hyper-parameters selected by users, the relative max error metrics across different time features (me, mae, rmsse, mpe, mape, rmae, rrmse, rame, mase, smse, sce, gmrae ³) and the weighted average rank used to sort out the best model. The last variable, best_model, collects testing_errors, preds, plots for the selected model.

The prediction is a list including the predicted results for each time-feature (quantile, min, max, mean, mode, sd, skewness, kurtosis, iqr to range, risk ratio, upside probability, divergence for each time point in the seq_len sequence). The IQR to range is the interquartile range to the min-max range, the risk ratio is the range above median to the range below it, the upside probability is the probability of growth compared to the former point in the time sequence, the divergence is the maximum distance of cumulative normal curve of each point to the former point in the sequence.

Examples of prediction for IBM Close Prices
	min	10%	25%	50%	75%	90%	max	mean	sd	mode	kurtosis	skewness	iqr_to_range	risk_ratio	upside_prob	divergence
2022-04-05	123.291	124.218	127.284	127.630	129.509	130.075	130.711	127.933	1.879	127.974	3.141	-0.870	0.300	0.710	NA	NA
2022-04-06	125.432	126.038	128.061	129.656	132.766	134.182	134.520	129.884	2.757	128.825	1.855	0.307	0.518	1.152	0.736	0.002
2022-04-07	122.122	125.304	127.102	129.556	131.683	133.359	134.253	129.367	3.006	131.076	2.352	-0.408	0.378	0.632	0.439	0.079
2022-04-09	124.420	126.501	129.360	131.037	132.109	133.458	133.810	130.665	2.262	131.462	3.468	-1.034	0.293	0.419	0.658	0.005
2022-04-10	124.202	127.552	129.701	130.403	133.145	136.468	139.255	130.984	3.143	130.105	3.036	0.334	0.229	1.428	0.522	0.053
2022-04-12	116.776	117.239	125.797	131.133	134.117	134.889	138.390	129.220	6.045	132.026	2.806	-1.026	0.385	0.505	0.416	0.255
2022-04-13	116.053	116.383	124.565	130.119	135.276	136.724	138.422	128.558	6.938	131.590	2.322	-0.670	0.479	0.590	0.457	0.062
2022-04-15	116.745	117.101	124.420	130.333	132.792	133.730	136.335	128.007	5.905	132.052	2.453	-0.892	0.427	0.442	0.488	0.062
2022-04-16	117.312	117.605	124.523	131.834	135.699	137.980	139.263	129.911	7.181	135.024	2.028	-0.635	0.509	0.512	0.570	0.005
2022-04-18	115.223	115.576	126.419	130.807	134.335	139.075	140.759	128.658	7.594	128.949	2.338	-0.504	0.310	0.639	0.475	0.071
2022-04-19	115.664	116.105	125.830	131.873	138.004	144.732	146.708	130.285	9.052	129.497	2.179	-0.122	0.392	0.915	0.539	0.010
2022-04-21	114.624	115.039	122.820	128.364	138.167	146.664	148.452	129.973	10.096	127.120	2.115	0.191	0.454	1.462	0.479	0.035
2022-04-22	115.823	116.287	120.958	126.506	138.906	152.756	154.379	130.110	11.487	122.010	2.522	0.689	0.466	2.609	0.504	0.028
2022-04-24	113.660	115.374	121.403	126.972	136.247	153.948	155.031	129.599	11.550	130.326	2.933	0.726	0.359	2.108	0.492	0.018
2022-04-25	115.752	116.243	121.544	127.314	136.236	145.722	146.593	129.349	9.492	130.122	2.118	0.287	0.476	1.667	0.488	0.053
2022-04-27	114.999	116.134	126.372	128.135	136.108	148.476	149.468	130.624	9.988	130.807	2.415	0.281	0.282	1.624	0.508	0.000
2022-04-28	112.883	115.959	123.957	124.776	135.817	139.903	142.994	127.736	8.203	121.056	1.783	0.053	0.394	1.532	0.418	0.146
2022-04-29	114.891	116.189	121.962	124.254	139.427	145.720	147.639	129.188	10.492	120.690	1.713	0.364	0.533	2.498	0.551	0.028
2022-05-01	109.913	110.997	122.091	123.697	139.137	145.209	146.779	126.925	11.265	124.611	1.994	0.241	0.462	1.674	0.438	0.087
2022-05-02	108.429	108.918	120.886	124.215	137.317	141.225	143.251	125.629	10.958	124.663	1.894	-0.112	0.472	1.206	0.500	0.048
2022-05-04	107.332	107.716	119.847	125.924	135.959	140.008	142.550	124.754	10.634	123.869	2.026	-0.192	0.458	0.894	0.459	0.034
2022-05-05	107.833	108.467	121.217	127.392	131.165	137.919	143.016	124.996	9.829	125.016	2.271	-0.330	0.283	0.799	0.521	0.014
2022-05-07	107.280	108.062	118.923	124.399	130.997	138.203	140.394	124.131	10.155	124.228	1.912	-0.244	0.365	0.934	0.471	0.037
2022-05-08	105.531	106.093	117.417	123.886	130.399	137.078	139.115	122.990	10.639	124.373	1.899	-0.382	0.387	0.830	0.486	0.047
2022-05-10	107.562	109.164	121.019	130.978	135.424	143.467	144.917	127.628	11.799	137.613	1.836	-0.299	0.386	0.595	0.603	0.004
2022-05-11	108.599	110.260	121.968	129.937	136.626	141.369	143.628	127.944	11.343	137.399	1.761	-0.403	0.418	0.642	0.522	0.004
2022-05-13	108.665	108.969	123.950	130.296	137.776	142.943	144.433	129.147	11.083	129.750	2.307	-0.659	0.387	0.654	0.530	0.001
2022-05-14	110.037	110.395	123.012	129.699	138.215	142.718	143.598	128.394	10.729	128.560	2.126	-0.389	0.453	0.707	0.471	0.030
2022-05-16	108.876	109.498	122.273	129.333	136.187	141.189	141.876	126.725	10.773	135.420	1.975	-0.364	0.422	0.613	0.453	0.062
2022-05-17	108.552	110.412	120.509	126.952	135.333	141.401	142.254	126.290	10.225	125.635	1.946	-0.173	0.440	0.832	0.493	0.024
2022-05-19	108.521	109.124	121.614	126.610	135.248	138.392	139.764	125.740	10.000	133.624	2.071	-0.434	0.436	0.727	0.476	0.023
2022-05-20	110.892	112.294	124.817	128.389	135.515	141.067	141.872	128.265	9.216	127.673	2.318	-0.445	0.345	0.771	0.581	0.000
2022-05-22	109.897	111.294	126.043	127.756	137.501	139.531	146.453	128.488	10.359	128.462	2.325	-0.202	0.313	1.047	0.518	0.023
2022-05-23	109.773	110.401	127.153	128.851	135.375	138.804	147.791	128.199	10.041	127.903	2.768	-0.365	0.216	0.993	0.502	0.016
2022-05-24	108.138	109.786	124.906	129.702	140.268	144.722	149.081	129.683	12.381	130.612	2.051	-0.307	0.375	0.899	0.536	0.024
2022-05-26	110.257	111.699	125.345	128.316	138.698	146.659	149.671	129.579	11.595	129.372	2.168	-0.108	0.339	1.183	0.519	0.018
2022-05-27	110.915	112.697	123.521	128.045	140.102	146.726	150.733	129.436	12.028	120.362	1.947	0.094	0.416	1.325	0.468	0.012
2022-05-29	112.964	114.502	121.570	127.502	140.362	150.480	150.932	130.296	12.237	120.112	1.925	0.321	0.495	1.612	0.537	0.002
2022-05-30	113.753	115.900	119.003	125.171	141.510	148.352	149.944	129.289	12.141	118.785	1.710	0.421	0.622	2.170	0.485	0.033
2022-06-01	114.559	116.623	122.662	129.952	141.969	147.978	151.934	130.947	11.399	122.484	1.816	0.361	0.517	1.428	0.538	0.000
2022-06-02	113.436	117.261	121.820	128.589	140.516	147.806	153.463	130.029	11.602	121.422	2.084	0.597	0.467	1.641	0.478	0.033
2022-06-04	112.127	114.354	121.302	127.910	140.282	148.327	152.344	130.213	12.259	120.393	1.923	0.370	0.472	1.548	0.506	0.010
2022-06-05	114.654	115.648	123.411	128.819	144.328	148.091	154.735	131.665	12.104	127.914	1.918	0.337	0.522	1.830	0.534	0.003
2022-06-07	115.035	115.334	125.738	130.378	144.334	147.042	150.160	132.188	11.679	122.441	1.710	-0.067	0.529	1.289	0.526	0.002
2022-06-08	116.846	119.064	125.014	134.047	147.141	149.536	152.449	134.079	11.259	131.993	1.656	0.062	0.621	1.070	0.530	0.003
2022-06-10	116.519	121.699	127.446	135.198	146.573	150.488	152.567	135.130	10.783	127.208	1.763	0.222	0.531	0.930	0.509	0.000
2022-06-11	115.928	118.430	123.879	133.948	140.346	150.998	153.221	132.994	11.539	125.937	1.921	0.285	0.442	1.070	0.460	0.081
2022-06-13	114.817	119.164	120.716	135.881	140.683	152.237	155.009	134.392	12.775	125.108	1.703	0.187	0.497	0.908	0.536	0.005
2022-06-14	106.221	114.362	123.434	134.880	140.841	156.219	158.294	134.105	14.002	130.288	1.963	0.221	0.334	0.817	0.480	0.028
2022-06-16	108.645	110.757	122.794	132.532	141.286	153.733	156.718	132.076	13.910	129.217	1.999	0.049	0.385	1.013	0.469	0.058

For each time features included in the model, you get a plot of the median with the chosen confidence interval (ci default is 0.8). As in other packages⁴, we provide different stats to give a better hint on the different dynamics related to aleatoric and epistemic uncertainty.

  $IBM.Close

  
  $MSFT.Close

Fast model optimization in a compact space

Now, let’s try a fast random search to optimize our estimations. The following example show how to sample 30 different models for a sequence of 50 time steps. Setting lno = NULL starts a search between 1 and the square root of time feature length.

example2 <- spooky(time_features, n_samp = 30, seq_len = 50, n_windows = 10,  dates = as.Date(rownames(time_features)), lno = NULL)
  time: 489.36 sec elapsed

History table with ranking of 30 different models
	seq_len	lno	max_me	max_mae	max_mse	max_rmsse	max_mpe	max_mape	max_rmae	max_rrmse	max_rame	max_mase	max_smse	max_sce	max_gmrae	wgt_avg_rank
20	50	21	4.23	7.76	114.80	6.80	0.03	0.06	0.06	0.07	0.05	4.27	58.03	110.94	0.05	7.51
5	50	23	4.31	7.77	115.35	6.80	0.03	0.06	0.06	0.07	0.05	4.26	57.73	113.45	0.05	8.18
9	50	24	4.37	7.77	116.10	6.80	0.03	0.06	0.06	0.07	0.05	4.27	57.62	115.56	0.05	8.87
6	50	19	4.22	7.80	115.36	6.82	0.03	0.06	0.06	0.07	0.05	4.29	58.31	110.42	0.05	8.92
26	50	19	4.22	7.80	115.36	6.82	0.03	0.06	0.06	0.07	0.05	4.29	58.31	110.42	0.05	8.92
7	50	27	4.56	7.80	119.68	6.79	0.03	0.06	0.06	0.07	0.05	4.28	56.86	118.91	0.05	9.47
10	50	26	4.51	7.81	118.33	6.81	0.03	0.06	0.06	0.07	0.05	4.28	57.26	119.12	0.05	10.13
12	50	26	4.51	7.81	118.33	6.81	0.03	0.06	0.06	0.07	0.05	4.28	57.26	119.12	0.05	10.13
19	50	18	4.24	7.83	116.07	6.83	0.03	0.06	0.06	0.07	0.05	4.31	58.49	111.43	0.05	10.62
11	50	17	4.27	7.84	116.96	6.82	0.03	0.06	0.06	0.07	0.05	4.32	58.45	112.18	0.05	10.99
15	50	17	4.27	7.84	116.96	6.82	0.03	0.06	0.06	0.07	0.05	4.32	58.45	112.18	0.05	10.99
29	50	17	4.27	7.84	116.96	6.82	0.03	0.06	0.06	0.07	0.05	4.32	58.45	112.18	0.05	10.99
4	50	30	4.74	7.91	125.26	6.81	0.03	0.06	0.06	0.07	0.05	4.37	57.16	122.23	0.05	12.13
30	50	30	4.74	7.91	125.26	6.81	0.03	0.06	0.06	0.07	0.05	4.37	57.16	122.23	0.05	12.13
27	50	15	4.36	7.86	119.08	6.85	0.03	0.06	0.06	0.07	0.05	4.32	58.90	113.67	0.05	13.23
1	50	33	4.87	8.09	130.40	6.83	0.03	0.06	0.06	0.07	0.05	4.44	58.03	124.46	0.05	15.83
21	50	33	4.87	8.09	130.40	6.83	0.03	0.06	0.06	0.07	0.05	4.44	58.03	124.46	0.05	15.83
28	50	33	4.87	8.09	130.40	6.83	0.03	0.06	0.06	0.07	0.05	4.44	58.03	124.46	0.05	15.83
2	50	34	4.89	8.11	131.29	6.87	0.03	0.06	0.06	0.07	0.05	4.42	58.59	124.13	0.05	17.81
13	50	34	4.89	8.11	131.29	6.87	0.03	0.06	0.06	0.07	0.05	4.42	58.59	124.13	0.05	17.81
16	50	34	4.89	8.11	131.29	6.87	0.03	0.06	0.06	0.07	0.05	4.42	58.59	124.13	0.05	17.81
24	50	34	4.89	8.11	131.29	6.87	0.03	0.06	0.06	0.07	0.05	4.42	58.59	124.13	0.05	17.81
17	50	35	4.91	8.11	131.72	6.88	0.03	0.06	0.06	0.07	0.05	4.41	58.94	124.66	0.05	19.31
23	50	36	4.91	8.11	131.65	6.88	0.03	0.06	0.06	0.07	0.05	4.42	59.04	125.77	0.05	19.86
14	50	10	4.72	8.50	124.99	7.49	0.03	0.07	0.07	0.08	0.06	4.77	67.88	123.80	0.05	23.76
3	50	11	4.65	8.51	123.89	7.49	0.03	0.07	0.07	0.08	0.06	4.78	68.25	123.36	0.05	23.85
8	50	6	4.86	8.57	128.90	7.57	0.03	0.07	0.07	0.08	0.06	4.80	68.27	125.30	0.05	26.25
22	50	6	4.86	8.57	128.90	7.57	0.03	0.07	0.07	0.08	0.06	4.80	68.27	125.30	0.05	26.25
18	50	5	4.88	8.57	129.80	7.57	0.03	0.07	0.07	0.08	0.06	4.80	68.06	125.41	0.05	26.65
25	50	4	4.90	8.56	130.65	7.57	0.03	0.07	0.07	0.08	0.06	4.79	67.76	125.82	0.05	27.17

If we compare the error statistics from the best model in example2 with the model in example1, we see an interesting improvement in the testing errors for IBM and Microsoft (the default error measurement for relative and scaled error is naive). The error statistics from example1, evaluated as the mean on 10 validation windows, are the following:

example1$best_model$testing_errors
                me   mae     mse rmsse   mpe  mape  rmae rrmse  rame  mase   smse
  IBM.Close  4.283 8.575 122.207 7.580 0.032 0.067 0.068 0.080 0.062 4.793 67.670
  MSFT.Close 4.911 7.782 132.215 6.302 0.021 0.042 0.042 0.048 0.039 4.330 57.677
                 sce gmrae
  IBM.Close  117.787 0.048
  MSFT.Close 124.995 0.034

The error statistics from example2.

example2$best_model$testing_errors
                me   mae     mse rmsse   mpe  mape  rmae rrmse  rame  mase   smse
  IBM.Close  3.478 7.763 107.472 6.804 0.027 0.061 0.062 0.073 0.051 4.266 58.027
  MSFT.Close 4.229 7.439 114.796 6.092 0.018 0.041 0.041 0.046 0.037 4.189 51.030
                 sce gmrae
  IBM.Close   92.959 0.046
  MSFT.Close 110.937 0.033

The improvement is clear for both the time features, but we are still using a naive approach to measure scaled and relative errors. Let’s try to shift to deviation as scale, and average as benchmark, that are more challenging evaluation criteria.

example3 <- spooky(time_features, n_samp = 30, seq_len = 50, n_windows = 10,  dates = as.Date(rownames(time_features)), lno = NULL, error_scale = "deviation", error_benchmark = "average")
  time: 736.65 sec elapsed

As you can see, the relative and scaled measures change sensibly as we raise the bar of our expectations

example3$best_model$testing_errors
                me   mae     mse rmsse   mpe  mape  rmae rrmse rame  mase   smse
  IBM.Close  3.478 7.763 107.472  6.71 0.027 0.061 1.141 1.180    1 4.151 56.380
  MSFT.Close 4.229 7.439 114.796  6.02 0.018 0.041 0.837 0.843    1 4.143 49.313
                 sce gmrae
  IBM.Close   88.589 1.099
  MSFT.Close 109.244 0.840

Prospecting the forecasting horizon

You can also explore for the best possible forecasting horizon, choosing a range for seq_len besides lno. Let’s see the best foreseeability using the ranges lno = c(1, 50) and seq_len = c(10, 20).

example4 <- spooky(time_features, n_samp = 10, n_windows = 10,  dates = as.Date(rownames(time_features)), lno = c(1, 50), seq_len = c(10, 20))
  time: 117.86 sec elapsed

As we can see from history, 13 days and 47 left-out are the best parameters to reduce the max joint error on the time features.

Prospecting the joint foreseeability of IBM and Microsoft stock prices
	seq_len	lno	max_me	max_mae	max_mse	max_rmsse	max_mpe	max_mape	max_rmae	max_rrmse	max_rame	max_mase	max_smse	max_sce	max_gmrae	wgt_avg_rank
3	13	47	-0.19	8.27	152.32	6.61	0.00	0.04	0.04	0.05	0.04	4.32	64.82	0.05	0.04	2.92
8	11	7	-0.59	6.52	88.80	5.45	0.00	0.04	0.04	0.05	0.04	3.46	39.51	-3.71	0.03	3.83
1	19	23	0.03	8.52	186.20	6.92	0.00	0.05	0.05	0.05	0.04	4.53	78.47	-0.54	0.04	3.96
4	18	14	-0.04	8.72	188.35	7.06	0.00	0.05	0.05	0.05	0.04	4.68	79.87	0.43	0.04	4.17
7	17	49	-0.50	8.61	182.56	6.91	0.00	0.05	0.05	0.05	0.04	4.49	76.97	-4.70	0.04	5.32
2	19	36	-0.11	8.52	183.66	6.95	-0.01	0.05	0.05	0.05	0.04	4.52	77.76	-2.37	0.04	5.99
10	17	28	-0.71	8.37	178.40	6.83	-0.01	0.04	0.04	0.05	0.04	4.45	75.44	-4.97	0.03	6.63
9	17	24	-0.79	8.35	178.53	6.82	-0.01	0.04	0.04	0.05	0.04	4.45	75.53	-5.73	0.03	7.13
6	15	47	-1.12	7.98	165.54	6.47	-0.01	0.04	0.04	0.05	0.04	4.19	69.83	-8.34	0.03	7.33
5	16	24	-1.14	8.21	172.40	6.68	-0.01	0.04	0.04	0.05	0.04	4.40	73.33	-6.74	0.04	7.73

Let’s see the plots.

  $IBM.Close

  
  $MSFT.Close

The missing imputation is managed through imputeTS package. For any information: https://cran.r-project.org/web/packages/imputeTS/index.html.↩︎
In some cases, maybe you want to operate on smoothed time-features. In this case, spooky calls on fANCOVA package. Here you can find all the latest: https://cran.r-project.org/web/packages/fANCOVA/index.html ↩︎
The metrics are calculated using the greybox package. For any reference, please take a look here: https://cran.r-project.org/web/packages/greybox/index.html ↩︎
Other packages focused on time feature analysis that could be of interest here:

- AUDREX, https://cran.r-project.org/web/packages/audrex/index.html
- PROTEUS, https://cran.r-project.org/web/packages/proteus/index.html
- JENGA, https://cran.r-project.org/web/packages/jenga/index.html
- TETRAGON, https://cran.r-project.org/web/packages/tetragon/index.html
↩︎

spooky: a brief introduction

Giancarlo Vercellino

13-April-2022

Expanding time features using Spectral analysis and Jack-Knife resampling

A look at the price dynamics of tech giants’ stocks

Fast model optimization in a compact space

Prospecting the forecasting horizon

Some useful references