“Data without generalization is just gossip.” (Robert Pirsig)
“We continually judge the whole from the part we are familiar with.” (Henry Louis Mencken)
“All generalizations are false, including this one.” (Mark Twain)
Segen is a model for sequence generalization that uses the “network” of similarities among sequences to extrapolate the next sequence. The notion of “network” here refers to the computation of a distance matrix that is converted to the equivalent of an adjacency matrix using a similarity threshold. The idea behind segen is that the lower triangle of such a matrix is enough to predict (to some extent) the behavior of a sequence of sequences.
A brief overview of the process:
The test errors are cross-validated through expanding validation windows, where the default value of n_windows is set to 10, meaning that the time features are divided into 10 + 1 segments, guaranteeing at least ten validation sets to measure the error on unforeseen data.
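The expanding-window scheme can be sketched in base R as follows. This is only an illustration of the idea: `expanding_windows` is a hypothetical helper and not part of segen, and the equal-size split is an assumption about how the segments are formed.

```r
# Hypothetical helper (not part of segen): split n time points into
# n_windows + 1 contiguous segments and build expanding train/validation folds.
expanding_windows <- function(n, n_windows = 10) {
  # Segment label (1 .. n_windows + 1) for every time index
  segment <- ceiling(seq_len(n) * (n_windows + 1) / n)
  lapply(seq_len(n_windows), function(w) {
    list(
      train = which(segment <= w),          # everything seen so far
      validation = which(segment == w + 1)  # the next, unseen segment
    )
  })
}

folds <- expanding_windows(n = 110, n_windows = 10)
length(folds)                   # one fold per validation window
range(folds[[1]]$train)         # first training block
range(folds[[1]]$validation)    # first validation block
```

Each fold trains on all segments seen so far and validates on the next one, so the training set grows while the validation data always lies in the future.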
Some basic transformations are handled directly in the background.
Differentiation and integration are automatically managed by segen using
the maximal p-value in a recursive F-test for de-trending each
time feature: this makes it easy to determine the dynamic
characteristics of each time feature (random walk, trend, exponential),
in a way that is somewhat simpler and more practical than formal approaches
such as the Augmented Dickey-Fuller or Ljung-Box tests. If you have a limited
number of missing values in your time features, segen automatically imputes
them using the Kalman filter method¹. If you prefer to project the smoothed
version into the future, you can set smoother = TRUE to use the loess² function.
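As a rough illustration of what differentiation and integration mean here, the sketch below differences a short series and then integrates increments back to levels using base R only. It does not reproduce segen's recursive F-test; the fixed first-order differencing and the "predicted" increments are assumptions made for the example. The values are the first IBM.Close rows from the sample data.

```r
# Minimal sketch: first-order differencing and its inverse (integration).
x <- c(118.4608, 113.4704, 110.6405, 114.8375, 115.1147)

dx <- diff(x)                      # increments, length(x) - 1 values

# Pretend these are predicted increments (here: simply the last two observed)
predicted_dx <- tail(dx, 2)

# Integration: cumulate the increments starting from the last known level
predicted_x <- tail(x, 1) + cumsum(predicted_dx)

# Round trip: integrating the observed increments recovers the original series
recovered <- c(x[1], x[1] + cumsum(dx))
all.equal(recovered, x)
```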
After differentiation, each time feature is reframed according
to the sequence length (each time feature is segmented into sequences of
length seq_len). A distance metric is calculated between each pair of
sequences and the values are compared to a similarity
threshold (the quantile value of distances, where 0 means maximum and 1
minimum difference respectively). Location and scale parameters are
calculated for each column and used to simulate the weights that
produce the generalized sequence as a weighted average. The weights
can optionally be rescaled using min-max normalization to make the
generalization somewhat more “fuzzy”.
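The steps above can be sketched in base R. This is a simplified stand-in, not segen's implementation: the non-overlapping segmentation, the direction of the threshold comparison, and the neighbour-count weights (used here instead of weights simulated from location and scale parameters) are all assumptions made for illustration.

```r
set.seed(1)
dx <- rnorm(60)        # stand-in for a differenced time feature
seq_length <- 5        # plays the role of segen's seq_len

# Reframe: one row per non-overlapping sequence of seq_length points
n_seq <- length(dx) %/% seq_length
seqs <- matrix(dx[seq_len(n_seq * seq_length)], nrow = n_seq, byrow = TRUE)

# Pairwise distances between sequences (quadratic in the number of sequences)
d <- as.matrix(dist(seqs, method = "euclidean"))

# Similarity threshold as a quantile of the lower-triangle distances
similarity <- 0.3
threshold <- quantile(d[lower.tri(d)], probs = similarity)

# Adjacency-like matrix: TRUE where two distinct sequences are close enough
adj <- d <= threshold
diag(adj) <- FALSE

# Simplified weights (assumption): number of neighbours of each sequence
w <- rowSums(adj)

# Optional min-max rescaling of the weights
rescale <- TRUE
if (rescale && diff(range(w)) > 0) w <- (w - min(w)) / diff(range(w))

# Generalized sequence: weighted average across sequences, point by point
generalized <- colSums(seqs * w) / sum(w)
generalized
```

The `method` argument of `dist` accepts the same names that appear in the dist_method column of the search results ("euclidean", "manhattan", "maximum", "minkowski").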
For each point in the prediction sequence, a thousand samples are collected for the calculation of quantiles, mean, mode, standard deviation, skewness and kurtosis, and other less common measures.
That’s all. The process is quite easy and relatively fast (as always, the only bottleneck is the quadratic complexity of computing the distance matrix for very large sets of sequences, but that’s another issue).
The process flow of segen
The time_features dataset included with segen is a
recent take on some Big Tech stock prices (source: Yahoo Finance). The
data is expected in a data frame format, where each column represents a
different time series (the date information is not mandatory and can
be provided separately).
Date | IBM.Close | MSFT.Close |
---|---|---|
2020-04-14 | 118.4608 | 173.70 |
2020-04-15 | 113.4704 | 171.88 |
2020-04-16 | 110.6405 | 177.04 |
2020-04-17 | 114.8375 | 178.60 |
2020-04-20 | 115.1147 | 175.06 |
2020-04-21 | 111.6252 | 167.82 |
2020-04-22 | 114.0631 | 173.52 |
2020-04-23 | 116.0134 | 171.42 |
2020-04-24 | 119.2352 | 174.55 |
2020-04-27 | 120.3824 | 174.05 |
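For reference, a data frame in this shape can be built as in the following sketch, which uses the first three rows of the table above. The name `time_features_demo` is hypothetical; the dates are stored as row names, matching the `rownames(time_features)` call used in the examples.

```r
# One column per time series, dates kept as row names
time_features_demo <- data.frame(
  IBM.Close  = c(118.4608, 113.4704, 110.6405),
  MSFT.Close = c(173.70, 171.88, 177.04),
  row.names  = c("2020-04-14", "2020-04-15", "2020-04-16")
)

str(time_features_demo)

# The date information can be extracted and passed separately
dates <- rownames(time_features_demo)
dates
```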
In the first example, we predict the close price for IBM and
Microsoft. We set seq_len = 20 (sequence length) and use a
cross-validation scheme with n_windows = 5 for error measurement.
example1 <- segen(time_features, seq_len = 20, n_windows = 5, n_samp = 10, dates = rownames(time_features))
time: 163.96 sec elapsed
The result is a list of different components, as you can see below.
names(example1)
[1] "exploration" "history" "best_model" "time_log"
names(example1$best_model)
[1] "exploration" "predictions" "testing_errors" "plots"
exploration includes all models tested during the exploration.
history includes the selected parameters and error
metrics for the explored space during random search (besides the prediction
score: me, mae, mse, rmsse, mpe, mape, rmae, rrmse, rame, mase, smse,
sce, gmrae³, averaged across features and validation
windows). best_model collects a list of information for the
best model selected according to the average error metric: you will find
the prediction intervals (predictions), the visualizations
(plots) and the testing error metrics for each time feature
(testing_errors).
The predictions element is a list including the predicted
results for each time feature (quantiles, min, max, mean, mode, sd,
skewness, kurtosis, IQR to range, median range ratio, upside
probability and divergence for each time point in the seq_len
sequence). The IQR to range is the ratio of the interquartile range to the
min-max range, the median range ratio is the ratio of the range above the
median to the range below it, the upside probability is the probability of
growth compared to the previous point in the time sequence, and the
divergence is the maximum distance between the cumulative normal curve of
each point and that of the previous point in the sequence.
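To make these definitions concrete, here is a hedged base R sketch that computes the four less common measures from simulated samples. The sample values are invented stand-ins, and the exact formulas segen uses may differ: in particular, the paired comparison for the upside probability and the normal approximation for the divergence are assumptions based on the descriptions above.

```r
set.seed(42)
# Stand-ins for 1000 sampled values at two consecutive prediction points
prev_point <- rnorm(1000, mean = 128.4, sd = 0.12)
this_point <- rnorm(1000, mean = 127.7, sd = 0.21)

q <- unname(quantile(this_point, probs = c(0.25, 0.5, 0.75)))

# IQR to range: interquartile range over the min-max range
iqr_to_range <- (q[3] - q[1]) / diff(range(this_point))

# Median range ratio: range above the median over the range below it
median_range_ratio <- (max(this_point) - q[2]) / (q[2] - min(this_point))

# Upside probability (assumption): share of samples above the paired
# sample drawn for the previous point
upside_prob <- mean(this_point > prev_point)

# Divergence (assumption): largest gap between the two fitted normal
# cumulative curves, evaluated on the pooled sample values
grid <- sort(c(prev_point, this_point))
divergence <- max(abs(
  pnorm(grid, mean(this_point), sd(this_point)) -
    pnorm(grid, mean(prev_point), sd(prev_point))
))

round(c(iqr_to_range = iqr_to_range,
        median_range_ratio = median_range_ratio,
        upside_prob = upside_prob,
        divergence = divergence), 4)
```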
t | min | 10% | 25% | 50% | 75% | 90% | max | mean | sd | mode | kurtosis | skewness | iqr_to_range | median_range_ratio | upside_prob | divergence | pred_scores |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
t1 | 127.8516 | 128.2531 | 128.3254 | 128.4082 | 128.4903 | 128.5640 | 128.8226 | 128.4073 | 0.1237 | 128.4242 | 3.2527 | -0.1179 | 0.1699 | 0.7443 | NA | NA | 0.0000 |
t2 | 126.9680 | 127.4031 | 127.5324 | 127.6759 | 127.8195 | 127.9464 | 128.2141 | 127.6719 | 0.2093 | 127.6846 | 2.8055 | -0.1445 | 0.2303 | 0.7603 | 0.000 | 0.980 | 0.0032 |
t3 | 127.3178 | 127.6683 | 127.7810 | 127.8939 | 128.0134 | 128.1245 | 128.3855 | 127.8963 | 0.1747 | 127.8936 | 2.8452 | -0.0880 | 0.2177 | 0.8534 | 1.000 | 0.000 | 0.0628 |
t4 | 127.3703 | 127.7146 | 127.8111 | 127.9325 | 128.0434 | 128.1447 | 128.3939 | 127.9254 | 0.1667 | 127.9076 | 2.8223 | -0.1259 | 0.2269 | 0.8209 | 0.630 | 0.006 | 0.0560 |
t5 | 127.8910 | 128.2116 | 128.3412 | 128.4801 | 128.6170 | 128.7466 | 129.2714 | 128.4815 | 0.2077 | 128.4584 | 2.9902 | 0.1464 | 0.1999 | 1.3435 | 1.000 | 0.000 | 0.1624 |
t6 | 127.7559 | 128.1905 | 128.3525 | 128.5371 | 128.7100 | 128.8641 | 129.1736 | 128.5293 | 0.2564 | 128.5108 | 2.6854 | -0.1378 | 0.2522 | 0.8147 | 0.580 | 0.028 | 0.1528 |
t7 | 128.0041 | 128.5290 | 128.6945 | 128.8855 | 129.0639 | 129.2214 | 129.5156 | 128.8783 | 0.2687 | 128.8833 | 2.7146 | -0.1801 | 0.2444 | 0.7150 | 1.000 | 0.000 | 0.0680 |
t8 | 128.0279 | 128.5882 | 128.7722 | 128.9775 | 129.1817 | 129.3474 | 129.7463 | 128.9714 | 0.2868 | 129.0125 | 2.6691 | -0.1429 | 0.2383 | 0.8097 | 0.894 | 0.000 | 0.0632 |
t9 | 127.5506 | 128.3522 | 128.5615 | 128.8017 | 129.0163 | 129.2372 | 129.7563 | 128.7896 | 0.3404 | 128.8004 | 2.8880 | -0.1356 | 0.2062 | 0.7630 | 0.068 | 0.240 | 0.0076 |
t10 | 127.9141 | 128.5187 | 128.7343 | 128.9614 | 129.2012 | 129.4071 | 129.8295 | 128.9597 | 0.3395 | 129.0172 | 2.7016 | -0.1387 | 0.2438 | 0.8288 | 0.945 | 0.000 | 0.1936 |
t11 | 127.6921 | 128.3722 | 128.5972 | 128.8525 | 129.0872 | 129.3067 | 129.8348 | 128.8406 | 0.3571 | 128.8363 | 2.7317 | -0.0199 | 0.2287 | 0.8465 | 0.258 | 0.150 | 0.1356 |
t12 | 128.2199 | 128.8153 | 129.0177 | 129.2782 | 129.5136 | 129.7353 | 130.4222 | 129.2736 | 0.3527 | 129.3141 | 2.8526 | -0.0116 | 0.2252 | 1.0810 | 1.000 | 0.000 | 0.1032 |
t13 | 128.4538 | 128.9156 | 129.1316 | 129.3742 | 129.6215 | 129.8130 | 130.6285 | 129.3748 | 0.3406 | 129.4079 | 2.7068 | 0.0534 | 0.2253 | 1.3629 | 0.762 | 0.000 | 0.1548 |
t14 | 127.9915 | 128.5318 | 128.7430 | 128.9666 | 129.2299 | 129.3995 | 130.1690 | 128.9728 | 0.3397 | 128.9976 | 2.7477 | 0.0003 | 0.2236 | 1.2330 | 0.000 | 0.413 | 0.1152 |
t15 | 128.0190 | 128.5794 | 128.8006 | 129.0241 | 129.2810 | 129.4668 | 130.2045 | 129.0292 | 0.3433 | 129.0196 | 2.8140 | -0.0171 | 0.2198 | 1.1743 | 0.893 | 0.002 | 0.0208 |
t16 | 128.0201 | 128.6425 | 128.8484 | 129.0883 | 129.3754 | 129.5724 | 130.2160 | 129.1075 | 0.3631 | 129.0246 | 2.7381 | 0.0472 | 0.2400 | 1.0558 | 0.817 | 0.001 | 0.1860 |
t17 | 128.5602 | 129.2172 | 129.4342 | 129.7232 | 130.0100 | 130.2227 | 130.9997 | 129.7193 | 0.3964 | 129.6997 | 2.6407 | -0.0238 | 0.2360 | 1.0977 | 1.000 | 0.000 | 0.0156 |
t18 | 128.5751 | 129.4400 | 129.7254 | 130.0500 | 130.3792 | 130.6583 | 131.5158 | 130.0529 | 0.4719 | 130.0280 | 2.7332 | -0.0006 | 0.2223 | 0.9938 | 0.997 | 0.000 | 0.0860 |
t19 | 128.2061 | 129.1017 | 129.4036 | 129.7729 | 130.1543 | 130.4623 | 131.3085 | 129.7759 | 0.5292 | 129.7075 | 2.6832 | -0.0060 | 0.2420 | 0.9801 | 0.011 | 0.245 | 0.1908 |
t20 | 128.8143 | 129.5700 | 129.8824 | 130.2327 | 130.6225 | 130.8888 | 131.7961 | 130.2401 | 0.5130 | 130.2154 | 2.6916 | -0.0211 | 0.2482 | 1.1022 | 1.000 | 0.000 | 0.0000 |
For each time feature included in the model, you get a plot of the
median with the chosen confidence interval (the default for ci is
0.8). As in other packages⁴, we provide different stats to give a
better hint of the different dynamics related to aleatoric and epistemic
uncertainty.
[Plots: predicted median with confidence interval for $IBM.Close and $MSFT.Close]
The hyper-parameter space is defined by seq_len, dist_method, similarity
and rescale. Now, let’s try a random search for the best
parameter settings. The following example shows how to sample 30
different models for a sequence of 20 time steps.
example2 <- segen(time_features, seq_len = 20, n_samp = 30, n_windows = 5, dates = rownames(time_features))
time: 360.14 sec elapsed
model | seq_len | similarity | dist_method | rescale | me | mae | mse | rmsse | mpe | mape | rmae | rrmse | rame | mase | smse | sce | gmrae |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
23 | 20 | 0.3003704 | manhattan, minkowski | TRUE | 2.3806 | 5.0645 | 41.3617 | 4.0693 | 0.0147 | 0.0280 | 0.9702 | 1.0020 | 1.7991 | 2.4432 | 19.6314 | 26.2307 | 0.8760 |
18 | 20 | 0.5309009 | manhattan, manhattan | FALSE | 2.5255 | 5.0408 | 40.8800 | 4.0377 | 0.0154 | 0.0281 | 0.9808 | 1.0011 | 1.6266 | 2.4423 | 19.5230 | 27.2844 | 0.9493 |
7 | 20 | 0.1522422 | manhattan, manhattan | FALSE | 2.2284 | 5.0719 | 42.0416 | 4.0896 | 0.0142 | 0.0281 | 0.9744 | 1.0111 | 1.9361 | 2.4502 | 19.9772 | 25.1723 | 0.8954 |
4 | 20 | 0.1591091 | euclidean, manhattan | TRUE | 2.1947 | 5.0696 | 41.9408 | 4.0939 | 0.0140 | 0.0281 | 0.9807 | 1.0198 | 1.9671 | 2.4469 | 19.8683 | 24.7886 | 0.8752 |
16 | 20 | 0.1708809 | euclidean, manhattan | TRUE | 2.1998 | 5.0668 | 41.8181 | 4.0924 | 0.0141 | 0.0281 | 0.9845 | 1.0225 | 1.9646 | 2.4499 | 19.8622 | 24.9219 | 0.8855 |
3 | 20 | 0.3239139 | maximum , manhattan | FALSE | 2.3029 | 5.0752 | 41.6974 | 4.0923 | 0.0144 | 0.0279 | 0.9846 | 1.0211 | 1.8848 | 2.4548 | 19.7961 | 25.6876 | 0.8952 |
11 | 20 | 0.3062563 | minkowski, euclidean | TRUE | 2.3869 | 5.0926 | 41.6240 | 4.0999 | 0.0147 | 0.0281 | 0.9840 | 1.0183 | 1.8285 | 2.4632 | 19.8150 | 26.2914 | 0.8798 |
15 | 20 | 0.5985886 | maximum , euclidean | FALSE | 2.5675 | 5.0735 | 41.0758 | 4.0600 | 0.0157 | 0.0284 | 0.9859 | 1.0055 | 1.5592 | 2.4619 | 19.6821 | 27.5574 | 0.9608 |
14 | 20 | 0.3582482 | maximum , minkowski | TRUE | 2.4170 | 5.0956 | 41.3851 | 4.0894 | 0.0148 | 0.0283 | 0.9830 | 1.0132 | 1.7917 | 2.4655 | 19.7753 | 26.5630 | 0.9030 |
1 | 20 | 0.5593493 | minkowski, manhattan | FALSE | 2.4612 | 5.0563 | 41.2578 | 4.0678 | 0.0155 | 0.0283 | 0.9918 | 1.0160 | 1.6723 | 2.4579 | 19.7827 | 26.8860 | 0.9591 |
10 | 20 | 0.1345846 | manhattan, euclidean | TRUE | 2.3393 | 5.1443 | 43.2689 | 4.1402 | 0.0146 | 0.0282 | 0.9792 | 1.0186 | 1.8985 | 2.4691 | 20.3400 | 25.8723 | 0.8752 |
24 | 20 | 0.5985886 | euclidean, euclidean | FALSE | 2.6001 | 5.0996 | 41.4231 | 4.0822 | 0.0160 | 0.0286 | 0.9965 | 1.0149 | 1.5871 | 2.4787 | 19.9242 | 27.9292 | 0.9749 |
5 | 20 | 0.0816116 | minkowski, manhattan | TRUE | 2.3567 | 5.1664 | 43.5632 | 4.1778 | 0.0145 | 0.0285 | 0.9920 | 1.0364 | 1.9341 | 2.4862 | 20.5938 | 25.9480 | 0.8367 |
17 | 20 | 0.6191892 | manhattan, manhattan | TRUE | 2.5605 | 5.1186 | 42.3399 | 4.1150 | 0.0159 | 0.0285 | 0.9966 | 1.0224 | 1.7418 | 2.4806 | 20.2108 | 27.7833 | 0.9475 |
9 | 20 | 0.0570871 | minkowski, manhattan | TRUE | 2.3912 | 5.1831 | 43.8489 | 4.1908 | 0.0147 | 0.0287 | 0.9930 | 1.0375 | 1.9397 | 2.4937 | 20.7392 | 26.2293 | 0.8486 |
30 | 20 | 0.5161862 | maximum, maximum | TRUE | 2.4082 | 5.1895 | 43.2879 | 4.1574 | 0.0148 | 0.0286 | 0.9991 | 1.0292 | 1.9112 | 2.5027 | 20.5427 | 26.6881 | 0.9282 |
29 | 20 | 0.6182082 | minkowski, euclidean | FALSE | 2.6350 | 5.1538 | 42.2233 | 4.1224 | 0.0162 | 0.0288 | 1.0103 | 1.0275 | 1.5924 | 2.4991 | 20.2152 | 28.2977 | 0.9874 |
12 | 20 | 0.0325626 | minkowski, manhattan | TRUE | 2.4141 | 5.2051 | 44.3456 | 4.2127 | 0.0148 | 0.0288 | 0.9956 | 1.0401 | 1.9424 | 2.5098 | 21.0985 | 26.5579 | 0.8485 |
8 | 20 | 0.6309610 | minkowski, minkowski | TRUE | 2.5307 | 5.2073 | 43.1336 | 4.1686 | 0.0156 | 0.0290 | 1.0080 | 1.0332 | 1.8107 | 2.5206 | 20.6629 | 27.8829 | 0.9546 |
25 | 20 | 0.2866366 | euclidean, maximum | TRUE | 2.4947 | 5.2637 | 44.5242 | 4.1988 | 0.0153 | 0.0289 | 1.0043 | 1.0368 | 1.9772 | 2.5138 | 20.7724 | 27.1552 | 0.9378 |
6 | 20 | 0.2326827 | maximum, maximum | TRUE | 2.6533 | 5.2996 | 44.5266 | 4.2317 | 0.0158 | 0.0291 | 1.0128 | 1.0462 | 1.8788 | 2.5365 | 20.8903 | 28.2270 | 0.9169 |
19 | 20 | 0.4112212 | minkowski, maximum | TRUE | 2.5731 | 5.2898 | 44.4426 | 4.2270 | 0.0156 | 0.0292 | 1.0199 | 1.0527 | 1.9393 | 2.5359 | 20.8520 | 27.7506 | 0.9320 |
21 | 20 | 0.8713013 | manhattan, maximum | TRUE | 2.8299 | 5.3540 | 45.9143 | 4.2717 | 0.0171 | 0.0294 | 1.0193 | 1.0450 | 1.8228 | 2.5741 | 21.7734 | 30.0763 | 0.9817 |
26 | 20 | 0.9232933 | minkowski, euclidean | TRUE | 2.8954 | 5.3907 | 46.5403 | 4.2988 | 0.0176 | 0.0297 | 1.0217 | 1.0467 | 1.8128 | 2.5934 | 22.1322 | 30.7822 | 0.9829 |
2 | 20 | 0.9870571 | minkowski, minkowski | TRUE | 2.9195 | 5.4071 | 47.0234 | 4.3173 | 0.0178 | 0.0299 | 1.0224 | 1.0478 | 1.8153 | 2.6045 | 22.4494 | 31.0731 | 0.9725 |
20 | 20 | 0.8742442 | manhattan, euclidean | FALSE | 2.8778 | 5.4222 | 46.8438 | 4.2977 | 0.0173 | 0.0300 | 1.0397 | 1.0571 | 1.7345 | 2.6075 | 22.2516 | 30.5022 | 1.0234 |
13 | 20 | 0.8320621 | minkowski, manhattan | FALSE | 2.8321 | 5.4193 | 46.8758 | 4.3164 | 0.0174 | 0.0301 | 1.0476 | 1.0658 | 1.7919 | 2.6179 | 22.3652 | 30.4645 | 1.0289 |
22 | 20 | 0.8909209 | manhattan, manhattan | FALSE | 2.9624 | 5.4719 | 47.6212 | 4.3346 | 0.0178 | 0.0303 | 1.0447 | 1.0604 | 1.6839 | 2.6374 | 22.7997 | 31.3705 | 1.0284 |
27 | 20 | 0.9870571 | manhattan, euclidean | FALSE | 3.0344 | 5.5304 | 49.2702 | 4.3868 | 0.0183 | 0.0307 | 1.0472 | 1.0628 | 1.7189 | 2.6734 | 23.8156 | 32.4001 | 1.0379 |
28 | 20 | 0.9811712 | euclidean, minkowski | FALSE | 3.0344 | 5.5304 | 49.2702 | 4.3868 | 0.0183 | 0.0307 | 1.0472 | 1.0628 | 1.7189 | 2.6734 | 23.8156 | 32.4001 | 1.0379 |
If we compare the error statistics of the best model in
example2 with the model in example1, the improvement is marginal:
most IBM metrics improve slightly, while the Microsoft metrics are
essentially unchanged. All the relative and scaled
error metrics default to the naive approach, but you can choose more
challenging thresholds (like the deviation of the whole time feature or
the average of the whole predicted sequence).
The error statistics from example1 (averaged across the
expanding validation windows):
example1$best_model$testing_errors
feature | me | mae | mse | rmsse | mpe | mape | rmae | rrmse | rame | mase | smse | sce | gmrae |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
IBM.Close | 3.0814 | 4.3986 | 32.5930 | 4.0150 | 0.0234 | 0.0338 | 1.1202 | 1.1316 | 1.2604 | 2.9098 | 22.1456 | 41.0728 | 1.0792 |
MSFT.Close | 1.7566 | 5.7480 | 50.7096 | 4.1506 | 0.0064 | 0.0222 | 0.8200 | 0.8740 | 2.3276 | 1.9926 | 17.5462 | 12.3304 | 0.6554 |
The error statistics from example2 (as above, averaged
across the expanding validation windows):
example2$best_model$testing_errors
feature | me | mae | mse | rmsse | mpe | mape | rmae | rrmse | rame | mase | smse | sce | gmrae |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
IBM.Close | 3.0196 | 4.371 | 31.9238 | 3.9840 | 0.0230 | 0.0336 | 1.1188 | 1.1292 | 1.2572 | 2.8904 | 21.6836 | 40.2366 | 1.0872 |
MSFT.Close | 1.7416 | 5.758 | 50.7996 | 4.1546 | 0.0064 | 0.0224 | 0.8216 | 0.8748 | 2.3410 | 1.9960 | 17.5792 | 12.2248 | 0.6648 |
The improvement between the two examples is somewhat unclear, but we
are still using a naive approach to measure scaled and
relative errors. Let’s shift to deviation as the scale
and average as the benchmark, which are more challenging
evaluation criteria, and extend the search to 100 samples.
example3 <- segen(time_features, seq_len = 20, n_windows = 5, dates = rownames(time_features), error_scale = "deviation", error_benchmark = "average", n_samp = 100)
time: 1282.79 sec elapsed
As you can see, the relative and scaled measures shift noticeably as we raise the bar of our expectations:
example3$best_model$testing_errors
feature | me | mae | mse | rmsse | mpe | mape | rmae | rrmse | rame | mase | smse | sce | gmrae |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
IBM.Close | 3.0090 | 4.4084 | 32.3898 | 1.9778 | 0.0230 | 0.0338 | 1.0928 | 1.0548 | 1 | 0.7206 | 4.8464 | 10.4262 | 1.1220 |
MSFT.Close | 1.3992 | 5.7070 | 50.8614 | 1.5518 | 0.0052 | 0.0222 | 0.9808 | 0.9970 | 1 | 0.2892 | 2.6318 | 1.4278 | 0.9256 |
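Why the scaled measures move when the scale changes can be seen with a small base R sketch. It is not the greybox implementation used by segen: `scaled_mae` is a hypothetical helper, the two scale definitions are assumptions about what "naive" and "deviation" mean, and the forecasts are made up. The in-sample and actual values are the IBM.Close rows from the sample table.

```r
# Hypothetical helper: mean absolute error divided by a chosen scale
scaled_mae <- function(actual, predicted, insample,
                       scale = c("naive", "deviation")) {
  scale <- match.arg(scale)
  denom <- switch(scale,
    naive = mean(abs(diff(insample))),                # one-step naive forecast error
    deviation = mean(abs(insample - mean(insample)))  # spread of the whole series
  )
  mean(abs(actual - predicted)) / denom
}

insample <- c(118.4608, 113.4704, 110.6405, 114.8375,
              115.1147, 111.6252, 114.0631, 116.0134)
actual <- c(119.2352, 120.3824)
predicted <- c(117, 118)   # made-up forecasts, for illustration only

scaled_mae(actual, predicted, insample, "naive")
scaled_mae(actual, predicted, insample, "deviation")
```

The absolute error is identical in both calls; only the denominator changes, which is why the unscaled metrics (me, mae, mse) stay comparable across examples while the scaled ones do not. Which scale is larger depends on the series.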
1. Missing-value imputation is handled by the imputeTS package. For more information: https://cran.r-project.org/web/packages/imputeTS/index.html
2. In some cases, you may want to operate on smoothed time features. In this case, segen relies on the fANCOVA package. More details: https://cran.r-project.org/web/packages/fANCOVA/index.html
3. The metrics are calculated using the greybox package. For reference: https://cran.r-project.org/web/packages/greybox/index.html
4. Other packages focused on time feature analysis that could be of interest here:
- AUDREX, https://cran.r-project.org/web/packages/audrex/index.html
- PROTEUS, https://cran.r-project.org/web/packages/proteus/index.html
- JENGA, https://cran.r-project.org/web/packages/jenga/index.html
- TETRAGON, https://cran.r-project.org/web/packages/tetragon/index.html
- SPOOKY, https://cran.r-project.org/web/packages/spooky/index.html
- DYMO, https://cran.r-project.org/web/packages/dymo/index.html
- NAIVE, https://cran.r-project.org/web/packages/naive/index.html