“Data without generalization is just gossip.” (Robert Pirsig)
“We continually judge the whole from the part we are familiar with.” (Henry Louis Mencken)
“All generalizations are false, including this one.” (Mark Twain)
Segen is a model for sequence generalization that uses the “network” of similarities among sequences to extrapolate the next sequence. The “network” here comes from computing a matrix of distances, which is converted to the equivalent of an adjacency matrix using a similarity threshold. The idea behind segen is that the lower triangle of such a matrix is enough to predict (to some extent) the behavior of a sequence of sequences.
A brief overview of the process:
The test errors are cross-validated through expanding validation
windows. The default value of n_windows is 10,
meaning that the time features are divided into 10 + 1 segments,
which guarantees at least ten validation sets for measuring the error on
unseen data.
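The expanding-window scheme described above can be sketched in a few lines of base R. The helper `expanding_windows()` is hypothetical and not part of segen; it only illustrates how n_windows + 1 segments yield n_windows train/validation pairs, under the assumption that segments are contiguous and of near-equal size.

```r
# Hypothetical helper, not part of segen: split n observations into
# n_windows + 1 contiguous segments and build expanding train/validation pairs.
expanding_windows <- function(n, n_windows = 10) {
  # Assign each observation to one of n_windows + 1 segments
  segment <- ceiling(seq_len(n) * (n_windows + 1) / n)
  lapply(seq_len(n_windows), function(k) {
    list(
      train = which(segment <= k),     # all segments up to k
      valid = which(segment == k + 1)  # the next, unseen segment
    )
  })
}

splits <- expanding_windows(n = 110, n_windows = 10)
length(splits)             # number of validation sets
range(splits[[1]]$train)   # first training window
range(splits[[10]]$valid)  # last validation window
```

Each validation segment lies strictly after its training window, so the error is always measured on data the model has not seen.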
Some basic transformations are handled directly in the background.
Differencing and integration are managed automatically by segen, using
the maximal p-value in a recursive F-test to de-trend each
time feature. This makes it easy to handle the different dynamics
of each time feature, such as random walk, trend, or exponential growth
(a simpler and more practical route than formal approaches
such as the Augmented Dickey-Fuller or Ljung-Box tests). If your time
features have a limited number of missing values, segen automatically
imputes them using the Kalman filter method¹. If you prefer to
project the smoothed version into the future, you can set
smoother = TRUE to use the loess² function.
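As a minimal illustration of the differencing and integration round trip, here is a base R sketch with a fixed first-order difference. This is an assumption made for brevity: segen chooses the transformation automatically through its F-test, which is not reproduced here, and the values below are toy numbers.

```r
# Base R only: difference a series before modelling, then integrate the
# predicted differences back to the original level.
x <- c(100, 102, 101, 105, 107)    # toy series

dx <- diff(x)                       # first differences
restored <- diffinv(dx, xi = x[1])  # cumulative sum anchored at the first value
all.equal(restored, x)

# Integrating hypothetical predicted differences from the last observed value
pred_diff <- c(1, -0.5, 2)
forecast <- tail(x, 1) + cumsum(pred_diff)
forecast
```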
After differencing, each time feature is reframed according
to the sequence length (each time feature is segmented into sequences of
seq_len points). A distance metric is calculated between each pair of
sequences, and the values are compared to a similarity
threshold (a quantile of the distances, where 0 and 1 correspond to maximum
and minimum difference respectively). Location and scale parameters are
calculated for each column and used to simulate the weights with which
the generalized sequence is computed as a weighted average. The
weights can optionally be rescaled using min-max normalization to make the
generalization somewhat more “fuzzy”.
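The reframing, distance and threshold steps can be sketched with base R. Several choices below are assumptions for illustration only: the sequences are overlapping windows, the threshold is the 0.3 quantile, and the weights are simply neighbour counts standing in for the simulated weights that segen derives from location and scale parameters.

```r
set.seed(1)
x  <- cumsum(rnorm(60))   # toy time feature (a random walk)
dx <- diff(x)             # differenced series, 59 points
k  <- 5                   # stand-in for seq_len

# Reframe: each row is one sequence of k consecutive points.
# embed() returns lags in reverse order, so flip the columns.
seqs <- embed(dx, k)[, k:1]

# Pairwise distances; dist() stores only the lower triangle
d <- dist(seqs, method = "euclidean")

# Similarity threshold taken as a quantile of the distances
threshold <- quantile(as.vector(d), probs = 0.3)
adjacency <- as.matrix(d) <= threshold
diag(adjacency) <- FALSE

# Stand-in weights: number of similar neighbours per sequence,
# optionally min-max rescaled (not segen's simulated weights)
w <- rowSums(adjacency)
w_rescaled <- (w - min(w)) / (max(w) - min(w))

# Generalized sequence as a weighted average of the rows
generalized <- colSums(seqs * w) / sum(w)
generalized
```

The `method` argument of `dist()` also accepts "manhattan", "maximum" and "minkowski", which are the distance names that appear in the random search results.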
For each point in the prediction sequence, a thousand samples are collected for the calculation of quantiles, mean, mode, standard deviation, skewness and kurtosis, and other less common measures.
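The summary statistics for a single prediction point can be sketched as follows. The normal draws are only a stand-in for the thousand simulated values, and the moment-based skewness and kurtosis and the density-based mode are assumptions about the estimators, not segen's documented formulas.

```r
set.seed(42)
samples <- rnorm(1000, mean = 128.4, sd = 0.12)  # stand-in for 1000 samples

m <- mean(samples)
s <- sd(samples)
z <- (samples - m) / s

# Mode estimated as the peak of a kernel density
dens <- density(samples)
mode_est <- dens$x[which.max(dens$y)]

stats <- c(
  quantile(samples, probs = c(0, 0.1, 0.25, 0.5, 0.75, 0.9, 1)),
  mean = m,
  sd = s,
  mode = mode_est,
  skewness = mean(z^3),  # moment-based, not bias-corrected
  kurtosis = mean(z^4)   # close to 3 for a normal distribution
)
round(stats, 4)
```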
That’s all. The process is quite easy and relatively fast (as always, the only bottleneck is the quadratic complexity of computing the distance matrix for very large sets of sequences, but that’s another issue).
[Figure: the process flow of segen]
The time_features dataset included with segen is a
recent snapshot of some Big Tech stock prices (source: Yahoo Finance). The
data is expected in data frame format, where each column represents a
different time series (the date information is not mandatory and can
be provided separately).
| | IBM.Close | MSFT.Close |
|---|---|---|
| 2020-04-14 | 118.4608 | 173.70 |
| 2020-04-15 | 113.4704 | 171.88 |
| 2020-04-16 | 110.6405 | 177.04 |
| 2020-04-17 | 114.8375 | 178.60 |
| 2020-04-20 | 115.1147 | 175.06 |
| 2020-04-21 | 111.6252 | 167.82 |
| 2020-04-22 | 114.0631 | 173.52 |
| 2020-04-23 | 116.0134 | 171.42 |
| 2020-04-24 | 119.2352 | 174.55 |
| 2020-04-27 | 120.3824 | 174.05 |
In the first example, we predict the close price for IBM and
Microsoft. We set seq_len = 20 (sequence length) and use a
cross-validation scheme with n_windows = 5 for error measurement.
example1 <- segen(time_features, seq_len = 20, n_windows = 5, n_samp = 10, dates = rownames(time_features))
time: 163.96 sec elapsed

The result is a list of different components, as you can see below.
names(example1)
[1] "exploration" "history" "best_model" "time_log"
names(example1$best_model)
[1] "exploration" "predictions" "testing_errors" "plots"

- exploration includes all models tested during the exploration.
- history includes the selected parameters and error metrics for the space explored during random search (besides the prediction score, me, mae, mse, rmsse, mpe, mape, rmae, rrmse, rame, mase, smse, sce, gmrae³, averaged across features and validation windows).
- best_model collects information on the best model, selected according to the average error metric: you will find the prediction intervals (predictions), the visualizations (plots) and the testing error metrics for each time feature (testing_errors).
predictions is a list containing the predicted
results for each time feature (quantiles, min, max, mean, mode, sd,
skewness, kurtosis, IQR to range, median range ratio, upside
probability and divergence for each time point in the seq_len
sequence). The IQR to range is the ratio of the interquartile range to the
min-max range; the median range ratio is the ratio of the range above the
median to the range below it; the upside probability is the probability of
growth compared to the previous point in the time sequence; the divergence
is the maximum distance between the cumulative normal curve of each point
and that of the previous point in the sequence.
| | min | 10% | 25% | 50% | 75% | 90% | max | mean | sd | mode | kurtosis | skewness | iqr_to_range | median_range_ratio | upside_prob | divergence | pred_scores |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| t1 | 127.8516 | 128.2531 | 128.3254 | 128.4082 | 128.4903 | 128.5640 | 128.8226 | 128.4073 | 0.1237 | 128.4242 | 3.2527 | -0.1179 | 0.1699 | 0.7443 | NA | NA | 0.0000 |
| t2 | 126.9680 | 127.4031 | 127.5324 | 127.6759 | 127.8195 | 127.9464 | 128.2141 | 127.6719 | 0.2093 | 127.6846 | 2.8055 | -0.1445 | 0.2303 | 0.7603 | 0.000 | 0.980 | 0.0032 |
| t3 | 127.3178 | 127.6683 | 127.7810 | 127.8939 | 128.0134 | 128.1245 | 128.3855 | 127.8963 | 0.1747 | 127.8936 | 2.8452 | -0.0880 | 0.2177 | 0.8534 | 1.000 | 0.000 | 0.0628 |
| t4 | 127.3703 | 127.7146 | 127.8111 | 127.9325 | 128.0434 | 128.1447 | 128.3939 | 127.9254 | 0.1667 | 127.9076 | 2.8223 | -0.1259 | 0.2269 | 0.8209 | 0.630 | 0.006 | 0.0560 |
| t5 | 127.8910 | 128.2116 | 128.3412 | 128.4801 | 128.6170 | 128.7466 | 129.2714 | 128.4815 | 0.2077 | 128.4584 | 2.9902 | 0.1464 | 0.1999 | 1.3435 | 1.000 | 0.000 | 0.1624 |
| t6 | 127.7559 | 128.1905 | 128.3525 | 128.5371 | 128.7100 | 128.8641 | 129.1736 | 128.5293 | 0.2564 | 128.5108 | 2.6854 | -0.1378 | 0.2522 | 0.8147 | 0.580 | 0.028 | 0.1528 |
| t7 | 128.0041 | 128.5290 | 128.6945 | 128.8855 | 129.0639 | 129.2214 | 129.5156 | 128.8783 | 0.2687 | 128.8833 | 2.7146 | -0.1801 | 0.2444 | 0.7150 | 1.000 | 0.000 | 0.0680 |
| t8 | 128.0279 | 128.5882 | 128.7722 | 128.9775 | 129.1817 | 129.3474 | 129.7463 | 128.9714 | 0.2868 | 129.0125 | 2.6691 | -0.1429 | 0.2383 | 0.8097 | 0.894 | 0.000 | 0.0632 |
| t9 | 127.5506 | 128.3522 | 128.5615 | 128.8017 | 129.0163 | 129.2372 | 129.7563 | 128.7896 | 0.3404 | 128.8004 | 2.8880 | -0.1356 | 0.2062 | 0.7630 | 0.068 | 0.240 | 0.0076 |
| t10 | 127.9141 | 128.5187 | 128.7343 | 128.9614 | 129.2012 | 129.4071 | 129.8295 | 128.9597 | 0.3395 | 129.0172 | 2.7016 | -0.1387 | 0.2438 | 0.8288 | 0.945 | 0.000 | 0.1936 |
| t11 | 127.6921 | 128.3722 | 128.5972 | 128.8525 | 129.0872 | 129.3067 | 129.8348 | 128.8406 | 0.3571 | 128.8363 | 2.7317 | -0.0199 | 0.2287 | 0.8465 | 0.258 | 0.150 | 0.1356 |
| t12 | 128.2199 | 128.8153 | 129.0177 | 129.2782 | 129.5136 | 129.7353 | 130.4222 | 129.2736 | 0.3527 | 129.3141 | 2.8526 | -0.0116 | 0.2252 | 1.0810 | 1.000 | 0.000 | 0.1032 |
| t13 | 128.4538 | 128.9156 | 129.1316 | 129.3742 | 129.6215 | 129.8130 | 130.6285 | 129.3748 | 0.3406 | 129.4079 | 2.7068 | 0.0534 | 0.2253 | 1.3629 | 0.762 | 0.000 | 0.1548 |
| t14 | 127.9915 | 128.5318 | 128.7430 | 128.9666 | 129.2299 | 129.3995 | 130.1690 | 128.9728 | 0.3397 | 128.9976 | 2.7477 | 0.0003 | 0.2236 | 1.2330 | 0.000 | 0.413 | 0.1152 |
| t15 | 128.0190 | 128.5794 | 128.8006 | 129.0241 | 129.2810 | 129.4668 | 130.2045 | 129.0292 | 0.3433 | 129.0196 | 2.8140 | -0.0171 | 0.2198 | 1.1743 | 0.893 | 0.002 | 0.0208 |
| t16 | 128.0201 | 128.6425 | 128.8484 | 129.0883 | 129.3754 | 129.5724 | 130.2160 | 129.1075 | 0.3631 | 129.0246 | 2.7381 | 0.0472 | 0.2400 | 1.0558 | 0.817 | 0.001 | 0.1860 |
| t17 | 128.5602 | 129.2172 | 129.4342 | 129.7232 | 130.0100 | 130.2227 | 130.9997 | 129.7193 | 0.3964 | 129.6997 | 2.6407 | -0.0238 | 0.2360 | 1.0977 | 1.000 | 0.000 | 0.0156 |
| t18 | 128.5751 | 129.4400 | 129.7254 | 130.0500 | 130.3792 | 130.6583 | 131.5158 | 130.0529 | 0.4719 | 130.0280 | 2.7332 | -0.0006 | 0.2223 | 0.9938 | 0.997 | 0.000 | 0.0860 |
| t19 | 128.2061 | 129.1017 | 129.4036 | 129.7729 | 130.1543 | 130.4623 | 131.3085 | 129.7759 | 0.5292 | 129.7075 | 2.6832 | -0.0060 | 0.2420 | 0.9801 | 0.011 | 0.245 | 0.1908 |
| t20 | 128.8143 | 129.5700 | 129.8824 | 130.2327 | 130.6225 | 130.8888 | 131.7961 | 130.2401 | 0.5130 | 130.2154 | 2.6916 | -0.0211 | 0.2482 | 1.1022 | 1.000 | 0.000 | 0.0000 |
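As a worked check of two of the derived measures, the definitions given above can be applied to the t1 row of the table. The values are copied from that row; small differences in the last digits come from the rounding of the displayed inputs.

```r
# Values copied from row t1 of the predictions table above
q <- c(min = 127.8516, q25 = 128.3254, median = 128.4082,
       q75 = 128.4903, max = 128.8226)

# IQR to range: interquartile range over the min-max range
iqr_to_range <- (q[["q75"]] - q[["q25"]]) / (q[["max"]] - q[["min"]])

# Median range ratio: range above the median over the range below it
median_range_ratio <- (q[["max"]] - q[["median"]]) /
  (q[["median"]] - q[["min"]])

c(iqr_to_range = iqr_to_range, median_range_ratio = median_range_ratio)
```

A median range ratio below 1, as in this row, means the simulated values spread further below the median than above it.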
For each time feature included in the model, you get a plot of the
median with the chosen confidence interval (the default for ci is
0.8). As in other packages⁴, we provide different statistics to give a
better hint of the different dynamics related to aleatoric and epistemic
uncertainty.
[Plots of the median with the confidence interval: $IBM.Close, $MSFT.Close]
The hyper-parameter space is defined by seq_len,
dist_method, similarity and
rescale. Now, let’s try a random search for the best
parameter settings. The following example shows how to sample 30
different models for a sequence of 20 time steps.
example2 <- segen(time_features, seq_len = 20, n_samp = 30, n_windows = 5, dates = rownames(time_features))
time: 360.14 sec elapsed

| | seq_len | similarity | dist_method | rescale | me | mae | mse | rmsse | mpe | mape | rmae | rrmse | rame | mase | smse | sce | gmrae |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 23 | 20 | 0.3003704 | manhattan, minkowski | TRUE | 2.3806 | 5.0645 | 41.3617 | 4.0693 | 0.0147 | 0.0280 | 0.9702 | 1.0020 | 1.7991 | 2.4432 | 19.6314 | 26.2307 | 0.8760 |
| 18 | 20 | 0.5309009 | manhattan, manhattan | FALSE | 2.5255 | 5.0408 | 40.8800 | 4.0377 | 0.0154 | 0.0281 | 0.9808 | 1.0011 | 1.6266 | 2.4423 | 19.5230 | 27.2844 | 0.9493 |
| 7 | 20 | 0.1522422 | manhattan, manhattan | FALSE | 2.2284 | 5.0719 | 42.0416 | 4.0896 | 0.0142 | 0.0281 | 0.9744 | 1.0111 | 1.9361 | 2.4502 | 19.9772 | 25.1723 | 0.8954 |
| 4 | 20 | 0.1591091 | euclidean, manhattan | TRUE | 2.1947 | 5.0696 | 41.9408 | 4.0939 | 0.0140 | 0.0281 | 0.9807 | 1.0198 | 1.9671 | 2.4469 | 19.8683 | 24.7886 | 0.8752 |
| 16 | 20 | 0.1708809 | euclidean, manhattan | TRUE | 2.1998 | 5.0668 | 41.8181 | 4.0924 | 0.0141 | 0.0281 | 0.9845 | 1.0225 | 1.9646 | 2.4499 | 19.8622 | 24.9219 | 0.8855 |
| 3 | 20 | 0.3239139 | maximum , manhattan | FALSE | 2.3029 | 5.0752 | 41.6974 | 4.0923 | 0.0144 | 0.0279 | 0.9846 | 1.0211 | 1.8848 | 2.4548 | 19.7961 | 25.6876 | 0.8952 |
| 11 | 20 | 0.3062563 | minkowski, euclidean | TRUE | 2.3869 | 5.0926 | 41.6240 | 4.0999 | 0.0147 | 0.0281 | 0.9840 | 1.0183 | 1.8285 | 2.4632 | 19.8150 | 26.2914 | 0.8798 |
| 15 | 20 | 0.5985886 | maximum , euclidean | FALSE | 2.5675 | 5.0735 | 41.0758 | 4.0600 | 0.0157 | 0.0284 | 0.9859 | 1.0055 | 1.5592 | 2.4619 | 19.6821 | 27.5574 | 0.9608 |
| 14 | 20 | 0.3582482 | maximum , minkowski | TRUE | 2.4170 | 5.0956 | 41.3851 | 4.0894 | 0.0148 | 0.0283 | 0.9830 | 1.0132 | 1.7917 | 2.4655 | 19.7753 | 26.5630 | 0.9030 |
| 1 | 20 | 0.5593493 | minkowski, manhattan | FALSE | 2.4612 | 5.0563 | 41.2578 | 4.0678 | 0.0155 | 0.0283 | 0.9918 | 1.0160 | 1.6723 | 2.4579 | 19.7827 | 26.8860 | 0.9591 |
| 10 | 20 | 0.1345846 | manhattan, euclidean | TRUE | 2.3393 | 5.1443 | 43.2689 | 4.1402 | 0.0146 | 0.0282 | 0.9792 | 1.0186 | 1.8985 | 2.4691 | 20.3400 | 25.8723 | 0.8752 |
| 24 | 20 | 0.5985886 | euclidean, euclidean | FALSE | 2.6001 | 5.0996 | 41.4231 | 4.0822 | 0.0160 | 0.0286 | 0.9965 | 1.0149 | 1.5871 | 2.4787 | 19.9242 | 27.9292 | 0.9749 |
| 5 | 20 | 0.0816116 | minkowski, manhattan | TRUE | 2.3567 | 5.1664 | 43.5632 | 4.1778 | 0.0145 | 0.0285 | 0.9920 | 1.0364 | 1.9341 | 2.4862 | 20.5938 | 25.9480 | 0.8367 |
| 17 | 20 | 0.6191892 | manhattan, manhattan | TRUE | 2.5605 | 5.1186 | 42.3399 | 4.1150 | 0.0159 | 0.0285 | 0.9966 | 1.0224 | 1.7418 | 2.4806 | 20.2108 | 27.7833 | 0.9475 |
| 9 | 20 | 0.0570871 | minkowski, manhattan | TRUE | 2.3912 | 5.1831 | 43.8489 | 4.1908 | 0.0147 | 0.0287 | 0.9930 | 1.0375 | 1.9397 | 2.4937 | 20.7392 | 26.2293 | 0.8486 |
| 30 | 20 | 0.5161862 | maximum, maximum | TRUE | 2.4082 | 5.1895 | 43.2879 | 4.1574 | 0.0148 | 0.0286 | 0.9991 | 1.0292 | 1.9112 | 2.5027 | 20.5427 | 26.6881 | 0.9282 |
| 29 | 20 | 0.6182082 | minkowski, euclidean | FALSE | 2.6350 | 5.1538 | 42.2233 | 4.1224 | 0.0162 | 0.0288 | 1.0103 | 1.0275 | 1.5924 | 2.4991 | 20.2152 | 28.2977 | 0.9874 |
| 12 | 20 | 0.0325626 | minkowski, manhattan | TRUE | 2.4141 | 5.2051 | 44.3456 | 4.2127 | 0.0148 | 0.0288 | 0.9956 | 1.0401 | 1.9424 | 2.5098 | 21.0985 | 26.5579 | 0.8485 |
| 8 | 20 | 0.6309610 | minkowski, minkowski | TRUE | 2.5307 | 5.2073 | 43.1336 | 4.1686 | 0.0156 | 0.0290 | 1.0080 | 1.0332 | 1.8107 | 2.5206 | 20.6629 | 27.8829 | 0.9546 |
| 25 | 20 | 0.2866366 | euclidean, maximum | TRUE | 2.4947 | 5.2637 | 44.5242 | 4.1988 | 0.0153 | 0.0289 | 1.0043 | 1.0368 | 1.9772 | 2.5138 | 20.7724 | 27.1552 | 0.9378 |
| 6 | 20 | 0.2326827 | maximum, maximum | TRUE | 2.6533 | 5.2996 | 44.5266 | 4.2317 | 0.0158 | 0.0291 | 1.0128 | 1.0462 | 1.8788 | 2.5365 | 20.8903 | 28.2270 | 0.9169 |
| 19 | 20 | 0.4112212 | minkowski, maximum | TRUE | 2.5731 | 5.2898 | 44.4426 | 4.2270 | 0.0156 | 0.0292 | 1.0199 | 1.0527 | 1.9393 | 2.5359 | 20.8520 | 27.7506 | 0.9320 |
| 21 | 20 | 0.8713013 | manhattan, maximum | TRUE | 2.8299 | 5.3540 | 45.9143 | 4.2717 | 0.0171 | 0.0294 | 1.0193 | 1.0450 | 1.8228 | 2.5741 | 21.7734 | 30.0763 | 0.9817 |
| 26 | 20 | 0.9232933 | minkowski, euclidean | TRUE | 2.8954 | 5.3907 | 46.5403 | 4.2988 | 0.0176 | 0.0297 | 1.0217 | 1.0467 | 1.8128 | 2.5934 | 22.1322 | 30.7822 | 0.9829 |
| 2 | 20 | 0.9870571 | minkowski, minkowski | TRUE | 2.9195 | 5.4071 | 47.0234 | 4.3173 | 0.0178 | 0.0299 | 1.0224 | 1.0478 | 1.8153 | 2.6045 | 22.4494 | 31.0731 | 0.9725 |
| 20 | 20 | 0.8742442 | manhattan, euclidean | FALSE | 2.8778 | 5.4222 | 46.8438 | 4.2977 | 0.0173 | 0.0300 | 1.0397 | 1.0571 | 1.7345 | 2.6075 | 22.2516 | 30.5022 | 1.0234 |
| 13 | 20 | 0.8320621 | minkowski, manhattan | FALSE | 2.8321 | 5.4193 | 46.8758 | 4.3164 | 0.0174 | 0.0301 | 1.0476 | 1.0658 | 1.7919 | 2.6179 | 22.3652 | 30.4645 | 1.0289 |
| 22 | 20 | 0.8909209 | manhattan, manhattan | FALSE | 2.9624 | 5.4719 | 47.6212 | 4.3346 | 0.0178 | 0.0303 | 1.0447 | 1.0604 | 1.6839 | 2.6374 | 22.7997 | 31.3705 | 1.0284 |
| 27 | 20 | 0.9870571 | manhattan, euclidean | FALSE | 3.0344 | 5.5304 | 49.2702 | 4.3868 | 0.0183 | 0.0307 | 1.0472 | 1.0628 | 1.7189 | 2.6734 | 23.8156 | 32.4001 | 1.0379 |
| 28 | 20 | 0.9811712 | euclidean, minkowski | FALSE | 3.0344 | 5.5304 | 49.2702 | 4.3868 | 0.0183 | 0.0307 | 1.0472 | 1.0628 | 1.7189 | 2.6734 | 23.8156 | 32.4001 | 1.0379 |
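The kind of random search behind a table like the one above can be sketched as a sampled parameter grid. The sampling distributions here are assumptions (uniform similarity, equal-probability choices), and a single distance method is drawn per model, whereas the table reports one method per time feature.

```r
set.seed(123)
n_samp  <- 30
methods <- c("euclidean", "manhattan", "maximum", "minkowski")

# One row per candidate model (hypothetical sampling scheme)
grid <- data.frame(
  seq_len     = rep(20, n_samp),                 # fixed in this example
  similarity  = runif(n_samp, min = 0, max = 1), # quantile threshold
  dist_method = sample(methods, n_samp, replace = TRUE),
  rescale     = sample(c(TRUE, FALSE), n_samp, replace = TRUE)
)
head(grid)
```

Each row would then be fitted and scored on the validation windows, and the rows ranked by their average error.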
If we compare the error statistics from the best model in
example2 with the model in example1, we see a slight improvement for IBM
and essentially unchanged errors for Microsoft. All the relative and scaled
error metrics default to a naive approach, but you can choose more
challenging thresholds (like the deviation of the whole time feature or
the average of the whole predicted sequence).
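To see why the choice of scale changes the scaled metrics, here is a toy sketch of a scaled absolute error under two scales. The definitions are assumptions for illustration: the naive scale is taken as the mean absolute one-step change of the training series and the deviation scale as its standard deviation. The exact formulas used by segen and greybox may differ.

```r
actual   <- c(10, 12, 11, 13, 15)  # toy holdout values
forecast <- c(11, 11, 12, 12, 14)  # toy predictions
history  <- c(5, 7, 6, 8, 9, 10)   # toy training series

mae <- mean(abs(actual - forecast))

# Naive scale: mean absolute one-step change in the training series
scale_naive <- mean(abs(diff(history)))
# Deviation scale: standard deviation of the training series (assumed)
scale_dev <- sd(history)

c(mae = mae,
  scaled_naive = mae / scale_naive,
  scaled_deviation = mae / scale_dev)
```

The unscaled error is identical in both cases; only the yardstick it is compared against changes.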
The error statistics from example1 (averaged across the 5
expanding validation windows set by n_windows = 5):
example1$best_model$testing_errors
me mae mse rmsse mpe mape rmae rrmse rame
IBM.Close 3.0814 4.3986 32.5930 4.0150 0.0234 0.0338 1.1202 1.1316 1.2604
MSFT.Close 1.7566 5.7480 50.7096 4.1506 0.0064 0.0222 0.8200 0.8740 2.3276
mase smse sce gmrae
IBM.Close 2.9098 22.1456 41.0728 1.0792
MSFT.Close 1.9926 17.5462 12.3304 0.6554

The error statistics from example2 (as above, averaged
across the expanding validation windows):
example2$best_model$testing_errors
me mae mse rmsse mpe mape rmae rrmse rame
IBM.Close 3.0196 4.371 31.9238 3.9840 0.0230 0.0336 1.1188 1.1292 1.2572
MSFT.Close 1.7416 5.758 50.7996 4.1546 0.0064 0.0224 0.8216 0.8748 2.3410
mase smse sce gmrae
IBM.Close 2.8904 21.6836 40.2366 1.0872
MSFT.Close 1.9960 17.5792 12.2248 0.6648

The improvement between the two examples is rather unclear, but we
are still using a naive approach to measure scaled and
relative errors. Let’s shift to deviation as the scale
and average as the benchmark, which are more challenging
evaluation criteria, and extend the search to 100 samples.
example3 <- segen(time_features, seq_len = 20, n_windows = 5, dates = rownames(time_features), error_scale = "deviation", error_benchmark = "average", n_samp = 100)
time: 1282.79 sec elapsed

As you can see, the relative and scaled measures shift noticeably as we raise the bar of our expectations:
example3$best_model$testing_errors
me mae mse rmsse mpe mape rmae rrmse rame mase
IBM.Close 3.0090 4.4084 32.3898 1.9778 0.0230 0.0338 1.0928 1.0548 1 0.7206
MSFT.Close 1.3992 5.7070 50.8614 1.5518 0.0052 0.0222 0.9808 0.9970 1 0.2892
smse sce gmrae
IBM.Close 4.8464 10.4262 1.1220
MSFT.Close 2.6318 1.4278 0.9256

¹ Missing-value imputation is handled through the imputeTS package. For more information: https://cran.r-project.org/web/packages/imputeTS/index.html
² In some cases you may want to operate on smoothed time features. In this case, segen calls on the fANCOVA package. The latest information is available here: https://cran.r-project.org/web/packages/fANCOVA/index.html
³ The metrics are calculated using the greybox package. For reference, please take a look here: https://cran.r-project.org/web/packages/greybox/index.html
⁴ Other packages focused on time feature analysis that could be of interest here:
- AUDREX, https://cran.r-project.org/web/packages/audrex/index.html
- PROTEUS, https://cran.r-project.org/web/packages/proteus/index.html
- JENGA, https://cran.r-project.org/web/packages/jenga/index.html
- TETRAGON, https://cran.r-project.org/web/packages/tetragon/index.html
- SPOOKY, https://cran.r-project.org/web/packages/spooky/index.html
- DYMO, https://cran.r-project.org/web/packages/dymo/index.html
- NAIVE, https://cran.r-project.org/web/packages/naive/index.html