segen: a brief introduction

Giancarlo Vercellino

08-July-2022

“Data without generalization is just gossip.” (Robert Pirsig)

“We continually judge the whole from the part we are familiar with.” (Henry Louis Mencken)

“All generalizations are false, including this one.” (Mark Twain)

When is a sequence considered “general”?

Segen is a model for sequence generalization that uses the “network” of similarities among sequences to extrapolate the next sequence. The notion of “network” here refers to a matrix of distances converted into the equivalent of an adjacency matrix through a similarity threshold. The idea behind segen is that the lower triangle of such a matrix is enough to predict (to some extent) the behavior of a sequence of sequences.

A brief overview of the process:

  1. The test errors are cross-validated through expanding validation windows: the default value of n_windows is 10, meaning that the time features are divided into 10 + 1 segments, guaranteeing at least ten validation sets to measure the error on unforeseen data (see the first sketch after this list).

  2. Some basic transformations are handled directly in the background. Differentiation and integration are managed automatically by segen using the maximal p-value in a recursive F-test for de-trending each time feature: this makes it easy to determine the dynamic characteristics of each time feature (random walk, trend, exponential) and is somewhat simpler and more practical than formal approaches like the Augmented Dickey-Fuller or Ljung-Box tests. If your time features have a limited number of missing values, segen automatically imputes them using the Kalman filter method1. If you prefer to project the smoothed version into the future, you can set smoother = TRUE to use the loess2 function.

  3. After differentiation, each time feature is reframed according to the sequence length (each time feature is segmented into sequences of seq_len points). A distance metric is calculated between each pair of sequences, and the values are compared to a similarity threshold (the quantile value of the distances, where 0 and 1 mean maximum and minimum difference, respectively). Location and scale parameters are calculated for each column and used to simulate the weights for computing the generalized sequence as a weighted average (see the second sketch after this overview). The weights can optionally be rescaled using min-max normalization, making the generalization somewhat more “fuzzy”.

  4. For each point in the prediction sequence, a thousand samples are collected for the calculation of quantiles, mean, mode, standard deviation, skewness and kurtosis, and other less common measures.
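
To make the expanding-window scheme of step 1 concrete, here is a minimal sketch in plain R with toy numbers (illustrative only, not the package internals):

# Sketch of expanding validation windows: with n_windows = 3 the series is
# cut into 3 + 1 segments; each model is fitted on an expanding head of the
# series and validated on the segment that follows.
n <- 100
n_windows <- 3
cuts <- round(seq(0, n, length.out = n_windows + 2))
for (w in seq_len(n_windows)) {
  cat(sprintf("window %d: train on 1..%d, validate on %d..%d\n",
              w, cuts[w + 1], cuts[w + 1] + 1, cuts[w + 2]))
}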

That’s all. The process is quite easy and relatively fast (as always, the only bottleneck is the quadratic complexity of the distance-matrix calculation for very large sets of sequences, but that’s another issue).
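
The generalization step itself (step 3) can also be pictured in a few lines. The following is an illustrative sketch under simplified assumptions (toy data, a single distance metric, plain inverse-distance weighting); the actual package internals differ:

# Sketch of the generalization idea: segment a toy series into sequences,
# compute the distance matrix, keep the neighbors of the latest sequence
# within a quantile threshold, and average what followed the neighbors.
set.seed(42)
seq_len <- 20
similarity <- 0.3
x <- cumsum(rnorm(500))                                  # toy time feature
n_seq <- floor(length(x) / seq_len)
sequences <- matrix(x[1:(n_seq * seq_len)], ncol = seq_len, byrow = TRUE)
d <- as.matrix(dist(sequences, method = "euclidean"))    # distance matrix
threshold <- quantile(d[lower.tri(d)], probs = similarity)
last <- n_seq                                            # the most recent sequence
neighbors <- which(d[last, -last] <= threshold)          # "adjacent" sequences
followers <- (neighbors + 1)[neighbors + 1 <= last]      # what came right after them
weights <- 1 / (d[last, followers - 1] + 1e-9)           # closer neighbors weigh more
weights <- weights / sum(weights)
generalized <- colSums(sequences[followers, , drop = FALSE] * weights)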

[Figure: the process flow of segen]

Standing on the shoulders of (Tech) giants

The dataset time_features included with segen is a recent take on some Big Techs’ stock prices (source: Yahoo Finance). The data is expected in data frame format, where each column represents a different time series (the date information is not mandatory and can be provided separately).

Examples of time features: Tech Giants Share
IBM.Close MSFT.Close
2020-04-14 118.4608 173.70
2020-04-15 113.4704 171.88
2020-04-16 110.6405 177.04
2020-04-17 114.8375 178.60
2020-04-20 115.1147 175.06
2020-04-21 111.6252 167.82
2020-04-22 114.0631 173.52
2020-04-23 116.0134 171.42
2020-04-24 119.2352 174.55
2020-04-27 120.3824 174.05
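
A minimal frame in the expected format can be built directly; the name time_features_demo is just for illustration, with values copied from the table above:

# A toy data frame in the expected format: one column per time feature,
# dates in the row names (to be passed via the dates argument).
time_features_demo <- data.frame(
  IBM.Close  = c(118.4608, 113.4704, 110.6405),
  MSFT.Close = c(173.70, 171.88, 177.04),
  row.names  = c("2020-04-14", "2020-04-15", "2020-04-16")
)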

In the first example, we predict the close prices for IBM and Microsoft, setting seq_len = 20 (sequence length) and using a cross-validation scheme with n_windows = 5 for error measurement.


example1 <- segen(time_features, seq_len = 20, n_windows = 5,  n_samp = 10, dates = rownames(time_features))
  time: 163.96 sec elapsed

The result is a list of different components, as you can see below.

names(example1)
  [1] "exploration" "history"     "best_model"  "time_log"
names(example1$best_model)
  [1] "exploration"    "predictions"    "testing_errors" "plots"

exploration includes all the models tested during the exploration. history includes the selected parameters and error metrics for the explored space during the random search (besides the prediction score: me, mae, mse, rmsse, mpe, mape, rmae, rrmse, rame, mase, smse, sce, gmrae3, averaged across features and validation windows). best_model collects the information for the best model selected according to the average error metric: there you will find the prediction intervals (predictions), the visualizations (plots) and the testing error metrics for each time feature (testing_errors).
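
For example, you can drill into the result using the component names listed above:

example1$history                     # ranked parameter sets with error metrics
example1$best_model$testing_errors   # testing error metrics per time feature
example1$best_model$predictions      # prediction intervals per time feature
example1$best_model$plots            # forecast plots per time feature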

predictions is a list including the predicted results for each time feature (quantiles, min, max, mean, mode, sd, skewness, kurtosis, IQR to range, median range ratio, upside probability, and divergence for each point in the seq_len sequence). The IQR to range is the ratio of the interquartile range to the min-max range; the median range ratio is the ratio of the range above the median to the range below it; the upside probability is the probability of growth compared to the former point in the time sequence; the divergence is the maximum distance between the cumulative normal curves of each point and the former point in the sequence.
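
As a concrete illustration of the last two measures, here is one plausible way to derive them from the samples of two consecutive prediction points (a hypothetical helper, not the package code):

# Hypothetical helper: upside probability and divergence from the sampled
# values of the current point (curr) and the former point (prev).
point_measures <- function(curr, prev) {
  grid <- seq(min(curr, prev), max(curr, prev), length.out = 1000)
  cdf_curr <- pnorm(grid, mean(curr), sd(curr))  # cumulative normal of current point
  cdf_prev <- pnorm(grid, mean(prev), sd(prev))  # cumulative normal of former point
  c(upside_prob = mean(curr > prev),             # probability of growth vs. former point
    divergence  = max(abs(cdf_curr - cdf_prev))) # max distance between the two curves
}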

Examples of prediction for IBM Close Prices
min 10% 25% 50% 75% 90% max mean sd mode kurtosis skewness iqr_to_range median_range_ratio upside_prob divergence pred_scores
t1 127.8516 128.2531 128.3254 128.4082 128.4903 128.5640 128.8226 128.4073 0.1237 128.4242 3.2527 -0.1179 0.1699 0.7443 NA NA 0.0000
t2 126.9680 127.4031 127.5324 127.6759 127.8195 127.9464 128.2141 127.6719 0.2093 127.6846 2.8055 -0.1445 0.2303 0.7603 0.000 0.980 0.0032
t3 127.3178 127.6683 127.7810 127.8939 128.0134 128.1245 128.3855 127.8963 0.1747 127.8936 2.8452 -0.0880 0.2177 0.8534 1.000 0.000 0.0628
t4 127.3703 127.7146 127.8111 127.9325 128.0434 128.1447 128.3939 127.9254 0.1667 127.9076 2.8223 -0.1259 0.2269 0.8209 0.630 0.006 0.0560
t5 127.8910 128.2116 128.3412 128.4801 128.6170 128.7466 129.2714 128.4815 0.2077 128.4584 2.9902 0.1464 0.1999 1.3435 1.000 0.000 0.1624
t6 127.7559 128.1905 128.3525 128.5371 128.7100 128.8641 129.1736 128.5293 0.2564 128.5108 2.6854 -0.1378 0.2522 0.8147 0.580 0.028 0.1528
t7 128.0041 128.5290 128.6945 128.8855 129.0639 129.2214 129.5156 128.8783 0.2687 128.8833 2.7146 -0.1801 0.2444 0.7150 1.000 0.000 0.0680
t8 128.0279 128.5882 128.7722 128.9775 129.1817 129.3474 129.7463 128.9714 0.2868 129.0125 2.6691 -0.1429 0.2383 0.8097 0.894 0.000 0.0632
t9 127.5506 128.3522 128.5615 128.8017 129.0163 129.2372 129.7563 128.7896 0.3404 128.8004 2.8880 -0.1356 0.2062 0.7630 0.068 0.240 0.0076
t10 127.9141 128.5187 128.7343 128.9614 129.2012 129.4071 129.8295 128.9597 0.3395 129.0172 2.7016 -0.1387 0.2438 0.8288 0.945 0.000 0.1936
t11 127.6921 128.3722 128.5972 128.8525 129.0872 129.3067 129.8348 128.8406 0.3571 128.8363 2.7317 -0.0199 0.2287 0.8465 0.258 0.150 0.1356
t12 128.2199 128.8153 129.0177 129.2782 129.5136 129.7353 130.4222 129.2736 0.3527 129.3141 2.8526 -0.0116 0.2252 1.0810 1.000 0.000 0.1032
t13 128.4538 128.9156 129.1316 129.3742 129.6215 129.8130 130.6285 129.3748 0.3406 129.4079 2.7068 0.0534 0.2253 1.3629 0.762 0.000 0.1548
t14 127.9915 128.5318 128.7430 128.9666 129.2299 129.3995 130.1690 128.9728 0.3397 128.9976 2.7477 0.0003 0.2236 1.2330 0.000 0.413 0.1152
t15 128.0190 128.5794 128.8006 129.0241 129.2810 129.4668 130.2045 129.0292 0.3433 129.0196 2.8140 -0.0171 0.2198 1.1743 0.893 0.002 0.0208
t16 128.0201 128.6425 128.8484 129.0883 129.3754 129.5724 130.2160 129.1075 0.3631 129.0246 2.7381 0.0472 0.2400 1.0558 0.817 0.001 0.1860
t17 128.5602 129.2172 129.4342 129.7232 130.0100 130.2227 130.9997 129.7193 0.3964 129.6997 2.6407 -0.0238 0.2360 1.0977 1.000 0.000 0.0156
t18 128.5751 129.4400 129.7254 130.0500 130.3792 130.6583 131.5158 130.0529 0.4719 130.0280 2.7332 -0.0006 0.2223 0.9938 0.997 0.000 0.0860
t19 128.2061 129.1017 129.4036 129.7729 130.1543 130.4623 131.3085 129.7759 0.5292 129.7075 2.6832 -0.0060 0.2420 0.9801 0.011 0.245 0.1908
t20 128.8143 129.5700 129.8824 130.2327 130.6225 130.8888 131.7961 130.2401 0.5130 130.2154 2.6916 -0.0211 0.2482 1.1022 1.000 0.000 0.0000

For each time feature included in the model, you get a plot of the median with the chosen confidence interval (the ci default is 0.8). As in other packages4, we provide different stats to give a better hint of the different dynamics related to aleatoric and epistemic uncertainty.

  [Plots of the median forecast with 80% confidence intervals for IBM.Close and MSFT.Close]

Wandering around the hyper-parameter space

The hyper-parameter space is defined by seq_len, dist_method, similarity and rescale. Now, let’s try a random search for the best parameter settings. The following example shows how to sample 30 different models for a sequence of 20 time steps.

example2 <- segen(time_features, seq_len = 20, n_samp = 30, n_windows = 5, dates = rownames(time_features))
  time: 360.14 sec elapsed
History table with ranking of 30 different models
seq_len similarity dist_method rescale me mae mse rmsse mpe mape rmae rrmse rame mase smse sce gmrae
23 20 0.3003704 manhattan, minkowski TRUE 2.3806 5.0645 41.3617 4.0693 0.0147 0.0280 0.9702 1.0020 1.7991 2.4432 19.6314 26.2307 0.8760
18 20 0.5309009 manhattan, manhattan FALSE 2.5255 5.0408 40.8800 4.0377 0.0154 0.0281 0.9808 1.0011 1.6266 2.4423 19.5230 27.2844 0.9493
7 20 0.1522422 manhattan, manhattan FALSE 2.2284 5.0719 42.0416 4.0896 0.0142 0.0281 0.9744 1.0111 1.9361 2.4502 19.9772 25.1723 0.8954
4 20 0.1591091 euclidean, manhattan TRUE 2.1947 5.0696 41.9408 4.0939 0.0140 0.0281 0.9807 1.0198 1.9671 2.4469 19.8683 24.7886 0.8752
16 20 0.1708809 euclidean, manhattan TRUE 2.1998 5.0668 41.8181 4.0924 0.0141 0.0281 0.9845 1.0225 1.9646 2.4499 19.8622 24.9219 0.8855
3 20 0.3239139 maximum , manhattan FALSE 2.3029 5.0752 41.6974 4.0923 0.0144 0.0279 0.9846 1.0211 1.8848 2.4548 19.7961 25.6876 0.8952
11 20 0.3062563 minkowski, euclidean TRUE 2.3869 5.0926 41.6240 4.0999 0.0147 0.0281 0.9840 1.0183 1.8285 2.4632 19.8150 26.2914 0.8798
15 20 0.5985886 maximum , euclidean FALSE 2.5675 5.0735 41.0758 4.0600 0.0157 0.0284 0.9859 1.0055 1.5592 2.4619 19.6821 27.5574 0.9608
14 20 0.3582482 maximum , minkowski TRUE 2.4170 5.0956 41.3851 4.0894 0.0148 0.0283 0.9830 1.0132 1.7917 2.4655 19.7753 26.5630 0.9030
1 20 0.5593493 minkowski, manhattan FALSE 2.4612 5.0563 41.2578 4.0678 0.0155 0.0283 0.9918 1.0160 1.6723 2.4579 19.7827 26.8860 0.9591
10 20 0.1345846 manhattan, euclidean TRUE 2.3393 5.1443 43.2689 4.1402 0.0146 0.0282 0.9792 1.0186 1.8985 2.4691 20.3400 25.8723 0.8752
24 20 0.5985886 euclidean, euclidean FALSE 2.6001 5.0996 41.4231 4.0822 0.0160 0.0286 0.9965 1.0149 1.5871 2.4787 19.9242 27.9292 0.9749
5 20 0.0816116 minkowski, manhattan TRUE 2.3567 5.1664 43.5632 4.1778 0.0145 0.0285 0.9920 1.0364 1.9341 2.4862 20.5938 25.9480 0.8367
17 20 0.6191892 manhattan, manhattan TRUE 2.5605 5.1186 42.3399 4.1150 0.0159 0.0285 0.9966 1.0224 1.7418 2.4806 20.2108 27.7833 0.9475
9 20 0.0570871 minkowski, manhattan TRUE 2.3912 5.1831 43.8489 4.1908 0.0147 0.0287 0.9930 1.0375 1.9397 2.4937 20.7392 26.2293 0.8486
30 20 0.5161862 maximum, maximum TRUE 2.4082 5.1895 43.2879 4.1574 0.0148 0.0286 0.9991 1.0292 1.9112 2.5027 20.5427 26.6881 0.9282
29 20 0.6182082 minkowski, euclidean FALSE 2.6350 5.1538 42.2233 4.1224 0.0162 0.0288 1.0103 1.0275 1.5924 2.4991 20.2152 28.2977 0.9874
12 20 0.0325626 minkowski, manhattan TRUE 2.4141 5.2051 44.3456 4.2127 0.0148 0.0288 0.9956 1.0401 1.9424 2.5098 21.0985 26.5579 0.8485
8 20 0.6309610 minkowski, minkowski TRUE 2.5307 5.2073 43.1336 4.1686 0.0156 0.0290 1.0080 1.0332 1.8107 2.5206 20.6629 27.8829 0.9546
25 20 0.2866366 euclidean, maximum TRUE 2.4947 5.2637 44.5242 4.1988 0.0153 0.0289 1.0043 1.0368 1.9772 2.5138 20.7724 27.1552 0.9378
6 20 0.2326827 maximum, maximum TRUE 2.6533 5.2996 44.5266 4.2317 0.0158 0.0291 1.0128 1.0462 1.8788 2.5365 20.8903 28.2270 0.9169
19 20 0.4112212 minkowski, maximum TRUE 2.5731 5.2898 44.4426 4.2270 0.0156 0.0292 1.0199 1.0527 1.9393 2.5359 20.8520 27.7506 0.9320
21 20 0.8713013 manhattan, maximum TRUE 2.8299 5.3540 45.9143 4.2717 0.0171 0.0294 1.0193 1.0450 1.8228 2.5741 21.7734 30.0763 0.9817
26 20 0.9232933 minkowski, euclidean TRUE 2.8954 5.3907 46.5403 4.2988 0.0176 0.0297 1.0217 1.0467 1.8128 2.5934 22.1322 30.7822 0.9829
2 20 0.9870571 minkowski, minkowski TRUE 2.9195 5.4071 47.0234 4.3173 0.0178 0.0299 1.0224 1.0478 1.8153 2.6045 22.4494 31.0731 0.9725
20 20 0.8742442 manhattan, euclidean FALSE 2.8778 5.4222 46.8438 4.2977 0.0173 0.0300 1.0397 1.0571 1.7345 2.6075 22.2516 30.5022 1.0234
13 20 0.8320621 minkowski, manhattan FALSE 2.8321 5.4193 46.8758 4.3164 0.0174 0.0301 1.0476 1.0658 1.7919 2.6179 22.3652 30.4645 1.0289
22 20 0.8909209 manhattan, manhattan FALSE 2.9624 5.4719 47.6212 4.3346 0.0178 0.0303 1.0447 1.0604 1.6839 2.6374 22.7997 31.3705 1.0284
27 20 0.9870571 manhattan, euclidean FALSE 3.0344 5.5304 49.2702 4.3868 0.0183 0.0307 1.0472 1.0628 1.7189 2.6734 23.8156 32.4001 1.0379
28 20 0.9811712 euclidean, minkowski FALSE 3.0344 5.5304 49.2702 4.3868 0.0183 0.0307 1.0472 1.0628 1.7189 2.6734 23.8156 32.4001 1.0379

If we compare the error statistics of the best model in example2 with the model in example1, the differences for IBM and Microsoft are marginal. By default, the relative and scaled error metrics are computed against segen’s naive benchmarks, but you can choose more challenging baselines (like the deviation of the whole time feature as scale, or the average of the whole predicted sequence as benchmark).

The error statistics from example1 (averaged across the 5 expanding validation windows):

example1$best_model$testing_errors
                 me    mae     mse  rmsse    mpe   mape   rmae  rrmse   rame
  IBM.Close  3.0814 4.3986 32.5930 4.0150 0.0234 0.0338 1.1202 1.1316 1.2604
  MSFT.Close 1.7566 5.7480 50.7096 4.1506 0.0064 0.0222 0.8200 0.8740 2.3276
               mase    smse     sce  gmrae
  IBM.Close  2.9098 22.1456 41.0728 1.0792
  MSFT.Close 1.9926 17.5462 12.3304 0.6554

The error statistics from example2 (as above, averaged across the 5 expanding validation windows):

example2$best_model$testing_errors
                 me   mae     mse  rmsse    mpe   mape   rmae  rrmse   rame
  IBM.Close  3.0196 4.371 31.9238 3.9840 0.0230 0.0336 1.1188 1.1292 1.2572
  MSFT.Close 1.7416 5.758 50.7996 4.1546 0.0064 0.0224 0.8216 0.8748 2.3410
               mase    smse     sce  gmrae
  IBM.Close  2.8904 21.6836 40.2366 1.0872
  MSFT.Close 1.9960 17.5792 12.2248 0.6648

The improvement between the two examples is somewhat unclear, but we are still using a naive approach to measure scaled and relative errors. Let’s shift to the deviation as scale and the average as benchmark, which are more challenging evaluation criteria, and extend the search to 100 samples.

example3 <- segen(time_features, seq_len = 20, n_windows = 5, dates = rownames(time_features), error_scale = "deviation", error_benchmark = "average", n_samp = 100)
  time: 1282.79 sec elapsed

As you can see, the relative and scaled measures shift considerably as we raise the bar of our expectations:

example3$best_model$testing_errors
                 me    mae     mse  rmsse    mpe   mape   rmae  rrmse rame   mase
  IBM.Close  3.0090 4.4084 32.3898 1.9778 0.0230 0.0338 1.0928 1.0548    1 0.7206
  MSFT.Close 1.3992 5.7070 50.8614 1.5518 0.0052 0.0222 0.9808 0.9970    1 0.2892
               smse     sce  gmrae
  IBM.Close  4.8464 10.4262 1.1220
  MSFT.Close 2.6318  1.4278 0.9256

Some useful references


  1. The missing value imputation is managed through the imputeTS package. For more information: https://cran.r-project.org/web/packages/imputeTS/index.html

  2. In some cases you may want to operate on smoothed time features; in that case, segen calls on the fANCOVA package. Here you can find the latest version: https://cran.r-project.org/web/packages/fANCOVA/index.html

  3. The metrics are calculated using the greybox package. For reference, please take a look here: https://cran.r-project.org/web/packages/greybox/index.html

  4. Other packages focused on time feature analysis that could be of interest here:

    - AUDREX, https://cran.r-project.org/web/packages/audrex/index.html
    - PROTEUS, https://cran.r-project.org/web/packages/proteus/index.html
    - JENGA, https://cran.r-project.org/web/packages/jenga/index.html
    - TETRAGON, https://cran.r-project.org/web/packages/tetragon/index.html
    - SPOOKY, https://cran.r-project.org/web/packages/spooky/index.html
    - DYMO, https://cran.r-project.org/web/packages/dymo/index.html
    - NAIVE, https://cran.r-project.org/web/packages/naive/index.html