segen: a brief introduction

Giancarlo Vercellino

08-July-2022

“Data without generalization is just gossip.” (Robert Pirsig)

“We continually judge the whole from the part we are familiar with.” (Henry Louis Mencken)

“All generalizations are false, including this one.” (Mark Twain)

When is a sequence considered “general”?

Segen is a model for sequence generalization that uses the “network” of similarities among sequences to extrapolate the next sequence. The notion of “network” here refers to a matrix of distances converted into the equivalent of an adjacency matrix through a similarity threshold. The idea behind segen is that the lower triangle of such a matrix is enough to predict (to some extent) the behavior of a sequence of sequences.

A brief overview of the process:

  1. The test errors are cross-validated through expanding validation windows: the default value of n_windows is 10, meaning that the time features are divided into 10 + 1 segments, guaranteeing at least ten validation sets to measure the error on unforeseen data (see the first sketch after this list).

  2. Some basic transformations are handled directly in the background. Differentiation and integration are managed automatically by segen using the maximal p-value in a recursive F-test for de-trending each time feature: this makes it easy to determine the dynamic characteristics of each time feature (random walk, trend, exponential) and is somewhat simpler and more practical than formal approaches like the Augmented Dickey-Fuller or Ljung-Box tests. If your time features have a limited number of missing values, segen automatically imputes them using the Kalman filter method1. If you prefer to project the smoothed version into the future, you can set smoother = TRUE to use the loess2 function.

  3. After differentiation, each time feature is reframed according to the sequence length (each time feature is segmented into sequences of seq_len points). A distance metric is calculated between each pair of sequences, and the values are compared to a similarity threshold (the quantile value of the distances, where 0 and 1 mean maximum and minimum difference, respectively). Location and scale parameters are calculated for each column and used to simulate the weights for computing the generalized sequence as a weighted average (see the second sketch after this overview). The weights can optionally be rescaled using min-max normalization, making the generalization somewhat more “fuzzy”.

  4. For each point in the prediction sequence, a thousand samples are collected for the calculation of quantiles, mean, mode, standard deviation, skewness and kurtosis, and other less common measures.
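
To make the expanding-window scheme of step 1 concrete, here is a minimal sketch in plain R with toy numbers (illustrative only, not the package internals):

# Sketch of expanding validation windows: with n_windows = 3 the series is
# cut into 3 + 1 segments; each model is fitted on an expanding head of the
# series and validated on the segment that follows.
n <- 100
n_windows <- 3
cuts <- round(seq(0, n, length.out = n_windows + 2))
for (w in seq_len(n_windows)) {
  cat(sprintf("window %d: train on 1..%d, validate on %d..%d\n",
              w, cuts[w + 1], cuts[w + 1] + 1, cuts[w + 2]))
}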

That’s all. The process is quite easy and relatively fast (as always, the only bottleneck is the quadratic complexity of the distance-matrix calculation for very large sets of sequences, but that’s another issue).
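
The generalization step itself (step 3) can also be pictured in a few lines. The following is an illustrative sketch under simplified assumptions (toy data, a single distance metric, plain inverse-distance weighting); the actual package internals differ:

# Sketch of the generalization idea: segment a toy series into sequences,
# compute the distance matrix, keep the neighbors of the latest sequence
# within a quantile threshold, and average what followed the neighbors.
set.seed(42)
seq_len <- 20
similarity <- 0.3
x <- cumsum(rnorm(500))                                  # toy time feature
n_seq <- floor(length(x) / seq_len)
sequences <- matrix(x[1:(n_seq * seq_len)], ncol = seq_len, byrow = TRUE)
d <- as.matrix(dist(sequences, method = "euclidean"))    # distance matrix
threshold <- quantile(d[lower.tri(d)], probs = similarity)
last <- n_seq                                            # the most recent sequence
neighbors <- which(d[last, -last] <= threshold)          # "adjacent" sequences
followers <- (neighbors + 1)[neighbors + 1 <= last]      # what came right after them
weights <- 1 / (d[last, followers - 1] + 1e-9)           # closer neighbors weigh more
weights <- weights / sum(weights)
generalized <- colSums(sequences[followers, , drop = FALSE] * weights)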

[Figure: the process flow of segen]

Standing on the shoulders of (Tech) giants

The dataset time_features included with segen is a recent take on some Big Techs’ stock prices (source: Yahoo Finance). The data is expected in data frame format, where each column represents a different time series (the date information is not mandatory and can be provided separately).

Examples of time features: Tech Giants Share
IBM.Close MSFT.Close
2020-04-14 118.4608 173.70
2020-04-15 113.4704 171.88
2020-04-16 110.6405 177.04
2020-04-17 114.8375 178.60
2020-04-20 115.1147 175.06
2020-04-21 111.6252 167.82
2020-04-22 114.0631 173.52
2020-04-23 116.0134 171.42
2020-04-24 119.2352 174.55
2020-04-27 120.3824 174.05
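
A minimal frame in the expected format can be built directly; the name time_features_demo is just for illustration, with values copied from the table above:

# A toy data frame in the expected format: one column per time feature,
# dates in the row names (to be passed via the dates argument).
time_features_demo <- data.frame(
  IBM.Close  = c(118.4608, 113.4704, 110.6405),
  MSFT.Close = c(173.70, 171.88, 177.04),
  row.names  = c("2020-04-14", "2020-04-15", "2020-04-16")
)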

In the first example, we predict the close prices for IBM and Microsoft, setting seq_len = 20 (sequence length) and using a cross-validation scheme with n_windows = 5 for error measurement.


example1 <- segen(time_features, seq_len = 20, n_windows = 5,  n_samp = 10, dates = rownames(time_features))
  time: 163.96 sec elapsed

The result is a list of different components, as you can see below.

names(example1)
  [1] "exploration" "history"     "best_model"  "time_log"
names(example1$best_model)
  [1] "exploration"    "predictions"    "testing_errors" "plots"

exploration includes all the models tested during the exploration. history includes the selected parameters and error metrics for the explored space during the random search (besides the prediction score: me, mae, mse, rmsse, mpe, mape, rmae, rrmse, rame, mase, smse, sce, gmrae3, averaged across features and validation windows). best_model collects the information for the best model selected according to the average error metric: there you will find the prediction intervals (predictions), the visualizations (plots) and the testing error metrics for each time feature (testing_errors).
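
For example, you can drill into the result using the component names listed above:

example1$history                     # ranked parameter sets with error metrics
example1$best_model$testing_errors   # testing error metrics per time feature
example1$best_model$predictions      # prediction intervals per time feature
example1$best_model$plots            # forecast plots per time feature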

predictions is a list including the predicted results for each time feature (quantiles, min, max, mean, mode, sd, skewness, kurtosis, IQR to range, median range ratio, upside probability, and divergence for each point in the seq_len sequence). The IQR to range is the ratio of the interquartile range to the min-max range; the median range ratio is the ratio of the range above the median to the range below it; the upside probability is the probability of growth compared to the former point in the time sequence; the divergence is the maximum distance between the cumulative normal curves of each point and the former point in the sequence.
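
As a concrete illustration of the last two measures, here is one plausible way to derive them from the samples of two consecutive prediction points (a hypothetical helper, not the package code):

# Hypothetical helper: upside probability and divergence from the sampled
# values of the current point (curr) and the former point (prev).
point_measures <- function(curr, prev) {
  grid <- seq(min(curr, prev), max(curr, prev), length.out = 1000)
  cdf_curr <- pnorm(grid, mean(curr), sd(curr))  # cumulative normal of current point
  cdf_prev <- pnorm(grid, mean(prev), sd(prev))  # cumulative normal of former point
  c(upside_prob = mean(curr > prev),             # probability of growth vs. former point
    divergence  = max(abs(cdf_curr - cdf_prev))) # max distance between the two curves
}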

Examples of prediction for IBM Close Prices
min 10% 25% 50% 75% 90% max mean sd mode kurtosis skewness iqr_to_range median_range_ratio upside_prob divergence pred_scores
t1 127.8516 128.2531 128.3254 128.4082 128.4903 128.5640 128.8226 128.4073 0.1237 128.4242 3.2527 -0.1179 0.1699 0.7443 NA NA 0.0000
t2 126.9680 127.4031 127.5324 127.6759 127.8195 127.9464 128.2141 127.6719 0.2093 127.6846 2.8055 -0.1445 0.2303 0.7603 0.000 0.980 0.0032
t3 127.3178 127.6683 127.7810 127.8939 128.0134 128.1245 128.3855 127.8963 0.1747 127.8936 2.8452 -0.0880 0.2177 0.8534 1.000 0.000 0.0628
t4 127.3703 127.7146 127.8111 127.9325 128.0434 128.1447 128.3939 127.9254 0.1667 127.9076 2.8223 -0.1259 0.2269 0.8209 0.630 0.006 0.0560
t5 127.8910 128.2116 128.3412 128.4801 128.6170 128.7466 129.2714 128.4815 0.2077 128.4584 2.9902 0.1464 0.1999 1.3435 1.000 0.000 0.1624
t6 127.7559 128.1905 128.3525 128.5371 128.7100 128.8641 129.1736 128.5293 0.2564 128.5108 2.6854 -0.1378 0.2522 0.8147 0.580 0.028 0.1528
t7 128.0041 128.5290 128.6945 128.8855 129.0639 129.2214 129.5156 128.8783 0.2687 128.8833 2.7146 -0.1801 0.2444 0.7150 1.000 0.000 0.0680
t8 128.0279 128.5882 128.7722 128.9775 129.1817 129.3474 129.7463 128.9714 0.2868 129.0125 2.6691 -0.1429 0.2383 0.8097 0.894 0.000 0.0632
t9 127.5506 128.3522 128.5615 128.8017 129.0163 129.2372 129.7563 128.7896 0.3404 128.8004 2.8880 -0.1356 0.2062 0.7630 0.068 0.240 0.0076
t10 127.9141 128.5187 128.7343 128.9614 129.2012 129.4071 129.8295 128.9597 0.3395 129.0172 2.7016 -0.1387 0.2438 0.8288 0.945 0.000 0.1936
t11 127.6921 128.3722 128.5972 128.8525 129.0872 129.3067 129.8348 128.8406 0.3571 128.8363 2.7317 -0.0199 0.2287 0.8465 0.258 0.150 0.1356
t12 128.2199 128.8153 129.0177 129.2782 129.5136 129.7353 130.4222 129.2736 0.3527 129.3141 2.8526 -0.0116 0.2252 1.0810 1.000 0.000 0.1032
t13 128.4538 128.9156 129.1316 129.3742 129.6215 129.8130 130.6285 129.3748 0.3406 129.4079 2.7068 0.0534 0.2253 1.3629 0.762 0.000 0.1548
t14 127.9915 128.5318 128.7430 128.9666 129.2299 129.3995 130.1690 128.9728 0.3397 128.9976 2.7477 0.0003 0.2236 1.2330 0.000 0.413 0.1152
t15 128.0190 128.5794 128.8006 129.0241 129.2810 129.4668 130.2045 129.0292 0.3433 129.0196 2.8140 -0.0171 0.2198 1.1743 0.893 0.002 0.0208
t16 128.0201 128.6425 128.8484 129.0883 129.3754 129.5724 130.2160 129.1075 0.3631 129.0246 2.7381 0.0472 0.2400 1.0558 0.817 0.001 0.1860
t17 128.5602 129.2172 129.4342 129.7232 130.0100 130.2227 130.9997 129.7193 0.3964 129.6997 2.6407 -0.0238 0.2360 1.0977 1.000 0.000 0.0156
t18 128.5751 129.4400 129.7254 130.0500 130.3792 130.6583 131.5158 130.0529 0.4719 130.0280 2.7332 -0.0006 0.2223 0.9938 0.997 0.000 0.0860
t19 128.2061 129.1017 129.4036 129.7729 130.1543 130.4623 131.3085 129.7759 0.5292 129.7075 2.6832 -0.0060 0.2420 0.9801 0.011 0.245 0.1908
t20 128.8143 129.5700 129.8824 130.2327 130.6225 130.8888 131.7961 130.2401 0.5130 130.2154 2.6916 -0.0211 0.2482 1.1022 1.000 0.000 0.0000

For each time feature included in the model, you get a plot of the median with the chosen confidence interval (the ci default is 0.8). As in other packages4, we provide different stats to give a better hint of the different dynamics related to aleatoric and epistemic uncertainty.

  [Plots of the median forecast with 80% confidence intervals for IBM.Close and MSFT.Close]

Wandering around the hyper-parameter space

The hyper-parameter space is defined by seq_len, dist_method, similarity and rescale. Now, let’s try a random search for the best parameter settings. The following example shows how to sample 30 different models for a sequence of 20 time steps.

example2 <- segen(time_features, seq_len = 20, n_samp = 30, n_windows = 5, dates = rownames(time_features))
  time: 360.14 sec elapsed
History table with ranking of 30 different models
seq_len similarity dist_method rescale me mae mse rmsse mpe mape rmae rrmse rame mase smse sce gmrae
23 20 0.3003704 manhattan, minkowski TRUE 2.3806 5.0645 41.3617 4.0693 0.0147 0.0280 0.9702 1.0020 1.7991 2.4432 19.6314 26.2307 0.8760
18 20 0.5309009 manhattan, manhattan FALSE 2.5255 5.0408 40.8800 4.0377 0.0154 0.0281 0.9808 1.0011 1.6266 2.4423 19.5230 27.2844 0.9493
7 20 0.1522422 manhattan, manhattan FALSE 2.2284 5.0719 42.0416 4.0896 0.0142 0.0281 0.9744 1.0111 1.9361 2.4502 19.9772 25.1723 0.8954
4 20 0.1591091 euclidean, manhattan TRUE 2.1947 5.0696 41.9408 4.0939 0.0140 0.0281 0.9807 1.0198 1.9671 2.4469 19.8683 24.7886 0.8752
16 20 0.1708809 euclidean, manhattan TRUE 2.1998 5.0668 41.8181 4.0924 0.0141 0.0281 0.9845 1.0225 1.9646 2.4499 19.8622 24.9219 0.8855
3 20 0.3239139 maximum , manhattan FALSE 2.3029 5.0752 41.6974 4.0923 0.0144 0.0279 0.9846 1.0211 1.8848 2.4548 19.7961 25.6876 0.8952
11 20 0.3062563 minkowski, euclidean TRUE 2.3869 5.0926 41.6240 4.0999 0.0147 0.0281 0.9840 1.0183 1.8285 2.4632 19.8150 26.2914 0.8798
15 20 0.5985886 maximum , euclidean FALSE 2.5675 5.0735 41.0758 4.0600 0.0157 0.0284 0.9859 1.0055 1.5592 2.4619 19.6821 27.5574 0.9608
14 20 0.3582482 maximum , minkowski TRUE 2.4170 5.0956 41.3851 4.0894 0.0148 0.0283 0.9830 1.0132 1.7917 2.4655 19.7753 26.5630 0.9030
1 20 0.5593493 minkowski, manhattan FALSE 2.4612 5.0563 41.2578 4.0678 0.0155 0.0283 0.9918 1.0160 1.6723 2.4579 19.7827 26.8860 0.9591
10 20 0.1345846 manhattan, euclidean TRUE 2.3393 5.1443 43.2689 4.1402 0.0146 0.0282 0.9792 1.0186 1.8985 2.4691 20.3400 25.8723 0.8752
24 20 0.5985886 euclidean, euclidean FALSE 2.6001 5.0996 41.4231 4.0822 0.0160 0.0286 0.9965 1.0149 1.5871 2.4787 19.9242 27.9292 0.9749
5 20 0.0816116 minkowski, manhattan TRUE 2.3567 5.1664 43.5632 4.1778 0.0145 0.0285 0.9920 1.0364 1.9341 2.4862 20.5938 25.9480 0.8367
17 20 0.6191892 manhattan, manhattan TRUE 2.5605 5.1186 42.3399 4.1150 0.0159 0.0285 0.9966 1.0224 1.7418 2.4806 20.2108 27.7833 0.9475
9 20 0.0570871 minkowski, manhattan TRUE 2.3912 5.1831 43.8489 4.1908 0.0147 0.0287 0.9930 1.0375 1.9397 2.4937 20.7392 26.2293 0.8486
30 20 0.5161862 maximum, maximum TRUE 2.4082 5.1895 43.2879 4.1574 0.0148 0.0286 0.9991 1.0292 1.9112 2.5027 20.5427 26.6881 0.9282
29 20 0.6182082 minkowski, euclidean FALSE 2.6350 5.1538 42.2233 4.1224 0.0162 0.0288 1.0103 1.0275 1.5924 2.4991 20.2152 28.2977 0.9874
12 20 0.0325626 minkowski, manhattan TRUE 2.4141 5.2051 44.3456 4.2127 0.0148 0.0288 0.9956 1.0401 1.9424 2.5098 21.0985 26.5579 0.8485
8 20 0.6309610 minkowski, minkowski TRUE 2.5307 5.2073 43.1336 4.1686 0.0156 0.0290 1.0080 1.0332 1.8107 2.5206 20.6629 27.8829 0.9546
25 20 0.2866366 euclidean, maximum TRUE 2.4947 5.2637 44.5242 4.1988 0.0153 0.0289 1.0043 1.0368 1.9772 2.5138 20.7724 27.1552 0.9378
6 20 0.2326827 maximum, maximum TRUE 2.6533 5.2996 44.5266 4.2317 0.0158 0.0291 1.0128 1.0462 1.8788 2.5365 20.8903 28.2270 0.9169
19 20 0.4112212 minkowski, maximum TRUE 2.5731 5.2898 44.4426 4.2270 0.0156 0.0292 1.0199 1.0527 1.9393 2.5359 20.8520 27.7506 0.9320
21 20 0.8713013 manhattan, maximum TRUE 2.8299 5.3540 45.9143 4.2717 0.0171 0.0294 1.0193 1.0450 1.8228 2.5741 21.7734 30.0763 0.9817
26 20 0.9232933 minkowski, euclidean TRUE 2.8954 5.3907 46.5403 4.2988 0.0176 0.0297 1.0217 1.0467 1.8128 2.5934 22.1322 30.7822 0.9829
2 20 0.9870571 minkowski, minkowski TRUE 2.9195 5.4071 47.0234 4.3173 0.0178 0.0299 1.0224 1.0478 1.8153 2.6045 22.4494 31.0731 0.9725
20 20 0.8742442 manhattan, euclidean FALSE 2.8778 5.4222 46.8438 4.2977 0.0173 0.0300 1.0397 1.0571 1.7345 2.6075 22.2516 30.5022 1.0234
13 20 0.8320621 minkowski, manhattan FALSE 2.8321 5.4193 46.8758 4.3164 0.0174 0.0301 1.0476 1.0658 1.7919 2.6179 22.3652 30.4645 1.0289
22 20 0.8909209 manhattan, manhattan FALSE 2.9624 5.4719 47.6212 4.3346 0.0178 0.0303 1.0447 1.0604 1.6839 2.6374 22.7997 31.3705 1.0284
27 20 0.9870571 manhattan, euclidean FALSE 3.0344 5.5304 49.2702 4.3868 0.0183 0.0307 1.0472 1.0628 1.7189 2.6734 23.8156 32.4001 1.0379
28 20 0.9811712 euclidean, minkowski FALSE 3.0344 5.5304 49.2702 4.3868 0.0183 0.0307 1.0472 1.0628 1.7189 2.6734 23.8156 32.4001 1.0379

If we compare the error statistics of the best model in example2 with the model in example1, the differences for IBM and Microsoft are marginal. By default, the relative and scaled error metrics are computed against segen’s naive benchmarks, but you can choose more challenging baselines (like the deviation of the whole time feature as scale, or the average of the whole predicted sequence as benchmark).

The error statistics from example1 (averaged across the 5 expanding validation windows):

example1$best_model$testing_errors
                 me    mae     mse  rmsse    mpe   mape   rmae  rrmse   rame
  IBM.Close  3.0814 4.3986 32.5930 4.0150 0.0234 0.0338 1.1202 1.1316 1.2604
  MSFT.Close 1.7566 5.7480 50.7096 4.1506 0.0064 0.0222 0.8200 0.8740 2.3276
               mase    smse     sce  gmrae
  IBM.Close  2.9098 22.1456 41.0728 1.0792
  MSFT.Close 1.9926 17.5462 12.3304 0.6554

The error statistics from example2 (as above, averaged across the 5 expanding validation windows):

example2$best_model$testing_errors
                 me   mae     mse  rmsse    mpe   mape   rmae  rrmse   rame
  IBM.Close  3.0196 4.371 31.9238 3.9840 0.0230 0.0336 1.1188 1.1292 1.2572
  MSFT.Close 1.7416 5.758 50.7996 4.1546 0.0064 0.0224 0.8216 0.8748 2.3410
               mase    smse     sce  gmrae
  IBM.Close  2.8904 21.6836 40.2366 1.0872
  MSFT.Close 1.9960 17.5792 12.2248 0.6648

The improvement between the two examples is somewhat unclear, but we are still using a naive approach to measure scaled and relative errors. Let’s shift to the deviation as scale and the average as benchmark, which are more challenging evaluation criteria, and extend the search to 100 samples.

example3 <- segen(time_features, seq_len = 20, n_windows = 5, dates = rownames(time_features), error_scale = "deviation", error_benchmark = "average", n_samp = 100)
  time: 1282.79 sec elapsed

As you can see, the relative and scaled measures shift considerably as we raise the bar of our expectations:

example3$best_model$testing_errors
                 me    mae     mse  rmsse    mpe   mape   rmae  rrmse rame   mase
  IBM.Close  3.0090 4.4084 32.3898 1.9778 0.0230 0.0338 1.0928 1.0548    1 0.7206
  MSFT.Close 1.3992 5.7070 50.8614 1.5518 0.0052 0.0222 0.9808 0.9970    1 0.2892
               smse     sce  gmrae
  IBM.Close  4.8464 10.4262 1.1220
  MSFT.Close 2.6318  1.4278 0.9256

Some useful references


  1. The missing value imputation is managed through the imputeTS package. For more information: https://cran.r-project.org/web/packages/imputeTS/index.html

  2. In some cases you may want to operate on smoothed time features; in that case, segen calls on the fANCOVA package. Here you can find the latest version: https://cran.r-project.org/web/packages/fANCOVA/index.html

  3. The metrics are calculated using the greybox package. For reference, please take a look here: https://cran.r-project.org/web/packages/greybox/index.html

  4. Other packages focused on time feature analysis that could be of interest here:

    - AUDREX, https://cran.r-project.org/web/packages/audrex/index.html
    - PROTEUS, https://cran.r-project.org/web/packages/proteus/index.html
    - JENGA, https://cran.r-project.org/web/packages/jenga/index.html
    - TETRAGON, https://cran.r-project.org/web/packages/tetragon/index.html
    - SPOOKY, https://cran.r-project.org/web/packages/spooky/index.html
    - DYMO, https://cran.r-project.org/web/packages/dymo/index.html
    - NAIVE, https://cran.r-project.org/web/packages/naive/index.html