“In Greek mythology, Proteus (/ˈproʊtiəs, -tjuːs/; Ancient Greek: Πρωτεύς, Prōteus) is an early prophetic sea-god or god of rivers and oceanic bodies of water, one of several deities whom Homer calls the ‘Old Man of the Sea’ (halios gerôn). Some who ascribe a specific domain to Proteus call him the god of ‘elusive sea change’, which suggests the constantly changing nature of the sea or the liquid quality of water. He can foretell the future, but, in a mytheme familiar to several cultures, will change his shape to avoid doing so; he answers only to those who are capable of capturing him. From this feature of Proteus comes the adjective protean, meaning ‘versatile’, ‘mutable’, or ‘capable of assuming many forms’. ‘Protean’ has positive connotations of flexibility, versatility and adaptability.” (Wikipedia)
Proteus is a Sequence-to-Sequence Variational Model for time-feature analysis based on a wide range of different distributions. The main point of Proteus is that the normal distribution is not enough to properly support any kind of prediction process. If you want to catch the elusive future, you need to change the default normal distribution, choosing other latent models that better fit specific processes.
Here is a description of Proteus’s architecture. A number of neural network models are created to estimate each shape/location/scale parameter of the chosen latent distribution: moving from left to right, the tensors with the past sequences are transformed into the tensors of parameters of the future sequences. The latent model is then used to generate the future sequences, estimate the measurement error and assess the confidence interval.
The overall model architecture
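The last step of this pipeline, drawing from the latent model once its parameters are available, can be illustrated with a minimal base-R sketch. It assumes a normal latent distribution; the `loc` and `scale` vectors are made-up values for illustration, not outputs of proteus, and the package's own sampling code may differ.

```r
# Minimal sketch (not the package internals): summarising samples drawn from
# a normal latent distribution whose parameters are given per future step.
set.seed(42)

future <- 5                                  # number of future time steps
loc    <- c(100, 101, 102, 103, 104)         # hypothetical location parameters
scale  <- c(1, 1.5, 2, 2.5, 3)               # hypothetical scale parameters
n_samp <- 1000                               # samples drawn per time step

# One row per sample, one column per future step
samples <- sapply(seq_len(future), function(i) rnorm(n_samp, loc[i], scale[i]))

# Median and an 80% interval (10th and 90th percentiles) for each step
summary_tab <- t(apply(samples, 2, quantile, probs = c(0.1, 0.5, 0.9)))
colnames(summary_tab) <- c("q10", "q50", "q90")
print(round(summary_tab, 2))
```

The interval widens as the scale parameter grows, which is the kind of behaviour the confidence bands in the plots are meant to convey.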
The time features, structured in columnar order in a dataframe, are “horizontally” reframed to create a 3D tensor. The 3D tensor passes through three main steps: a time2vec embedding (inspired by reference 1), adaptive normalization (inspired by reference 2), and a simple neural network with three linear transformations that produces the target size. At the end of these three steps, you have the tensor of estimated parameters for the chosen latent model, and the sampling process can begin.
How each parameter for the latent distribution is calculated
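The time2vec step described above can be sketched in a few lines of base R. This follows the general form given by Kazemi et al. (one linear component plus sine components); the function name `time2vec` and the random `omega` and `phi` values are placeholders for illustration, since in a trained model these weights are learned and the package's implementation is not shown here.

```r
# Minimal sketch of a Time2Vec-style embedding (after Kazemi et al., 2019).
# In a trained model omega and phi are learned; here they are random placeholders.
time2vec <- function(tau, omega, phi) {
  # tau: numeric vector of time values; omega, phi: vectors of length k
  z <- outer(tau, omega) +
    matrix(phi, nrow = length(tau), ncol = length(phi), byrow = TRUE)
  z[, -1] <- sin(z[, -1, drop = FALSE])      # columns 2..k are periodic
  z                                          # column 1 stays linear (trend)
}

set.seed(1)
t_embed <- 20                                # 1 trend + 19 periodic components
tau     <- seq(0, 1, length.out = 60)        # 60 past time steps, rescaled to [0, 1]
emb     <- time2vec(tau, omega = rnorm(t_embed), phi = rnorm(t_embed))
dim(emb)                                     # 60 x 20
```

Each time step is mapped to a vector of `t_embed` values, so a single series of length 60 becomes a 60 x 20 matrix before the later layers see it.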
In our introduction to Proteus, we are going to use the Close Price series for Amazon, Google and Facebook (from Yahoo Finance). As shown in amzn_aapl_fb, the time features are expected as ordered columns in a dataframe.
| | Date | AMZN | GOOGL | FB |
|---|---|---|---|---|
| 3779 | 2012-05-18 | 213.85 | 300.5005 | 38.23 |
| 3780 | 2012-05-21 | 218.11 | 307.3624 | 34.03 |
| 3781 | 2012-05-22 | 215.33 | 300.7007 | 31.00 |
| 3782 | 2012-05-23 | 217.28 | 305.0350 | 32.00 |
| 3783 | 2012-05-24 | 215.24 | 302.1321 | 33.03 |
| 3784 | 2012-05-25 | 212.89 | 296.0611 | 31.91 |
| 3785 | 2012-05-29 | 214.75 | 297.4675 | 28.84 |
| 3786 | 2012-05-30 | 209.23 | 294.4094 | 28.19 |
| 3787 | 2012-05-31 | 212.91 | 290.7207 | 29.60 |
| 3788 | 2012-06-01 | 208.22 | 285.7758 | 27.72 |
In the first example, we predict a single time feature, Amazon. Proteus is a seq2seq model, meaning that each time feature is reframed into a matrix of sequences with dimension n_sequences x (past + future). The number of sequences depends on the sequence_stride variable: when the flag is TRUE, only distinct (non-overlapping) sequences are extracted for each time feature (the sequence stride equals past + future); when the flag is FALSE, each sequence is shifted by a single position in time (the sequence stride equals one).
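The effect of the stride on the number of sequences can be seen with a small base-R sketch. The helper `reframe_series` is hypothetical and written only for illustration; in particular, it assumes windows are counted from the start of the series, which may not match how the package aligns them.

```r
# Minimal sketch (hypothetical helper, not the package's code): reframing a
# single series into sequences of length past + future.
reframe_series <- function(x, past, future, sequence_stride = TRUE) {
  len <- past + future
  stopifnot(length(x) >= len)
  stride <- if (sequence_stride) len else 1  # TRUE: distinct windows; FALSE: shift by one
  starts <- seq(1, length(x) - len + 1, by = stride)
  t(sapply(starts, function(s) x[s:(s + len - 1)]))
}

x <- 1:20
strided <- reframe_series(x, past = 3, future = 2, sequence_stride = TRUE)
shifted <- reframe_series(x, past = 3, future = 2, sequence_stride = FALSE)
dim(strided)   # 4 x 5: windows start at 1, 6, 11, 16
dim(shifted)   # 16 x 5: windows start at 1, 2, ..., 16
```

With the same 20 points, the full stride yields 4 sequences and the unit stride yields 16, which is why setting the flag to FALSE produces much larger tensors.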
We predict sequences of 30 time periods in the future based on sequences of 60 time periods in the past, using 20 temporal embeddings (i.e., we decompose the original time feature into 1 trend and 19 periodic components), a feed-forward neural network of 32 nodes and a variational model based on the normal distribution. The optimization method is another important hyper-parameter with a clear impact on the error performance, which is measured by back-testing on four blocks (with n_blocks equal to 4 and rolling_blocks set to TRUE, the error is sampled on three different measurements with a rolling window scheme; setting rolling_blocks to FALSE means that the error is sampled three times using an incremental window scheme).
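One way to picture the two back-testing schemes is as different choices of training indices for each test block. The sketch below is an assumption about the splitting logic, not code taken from the package: `backtest_splits` is a hypothetical helper that cuts the sequences into contiguous blocks and tests on each block after the first.

```r
# Minimal sketch (assumed splitting logic, not taken from the package):
# train/test index sets for rolling vs incremental back-testing blocks.
backtest_splits <- function(n_sequences, n_blocks, rolling_blocks = TRUE) {
  # Assign each sequence to one of n_blocks contiguous blocks
  block_id <- cut(seq_len(n_sequences), breaks = n_blocks, labels = FALSE)
  lapply(seq_len(n_blocks - 1), function(b) {
    # Rolling: train on the previous block only; incremental: on all previous blocks
    train_blocks <- if (rolling_blocks) b else seq_len(b)
    list(train = which(block_id %in% train_blocks),
         test  = which(block_id == b + 1))
  })
}

rolling     <- backtest_splits(17, n_blocks = 4, rolling_blocks = TRUE)
incremental <- backtest_splits(17, n_blocks = 4, rolling_blocks = FALSE)
length(rolling)                                    # 3 train/test measurements
sapply(rolling, function(s) length(s$train))       # roughly constant training size
sapply(incremental, function(s) length(s$train))   # training size grows with each block
```

Under this reading, four blocks give three error measurements in both schemes; the incremental scheme simply trains on more sequences at each step, which is consistent with its longer run time.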
Setting verbose to TRUE gives detailed information on the training and validation process (number of sequences, max batch size, loss metric). The available loss metrics are the Evidence Lower Bound (plus a reconstruction error based on MAE) and the Continuous Ranked Probability Score (CRPS). Both metrics lead to very similar results, but CRPS is simpler and more elegant, which is why it is the default.
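To make the CRPS more concrete, here is a minimal sketch of the standard sample-based estimator for a single observation. It is not the package's loss implementation, which is not shown in this document; `crps_sample` is a name chosen for illustration.

```r
# Minimal sketch: sample-based CRPS estimate for one observation.
# CRPS(F, y) = E|X - y| - 0.5 * E|X - X'|, with X, X' independent draws from F.
crps_sample <- function(samples, y) {
  term1 <- mean(abs(samples - y))                     # distance to the observation
  term2 <- mean(abs(outer(samples, samples, "-")))    # spread of the forecast itself
  term1 - 0.5 * term2
}

set.seed(123)
draws <- rnorm(500, mean = 0, sd = 1)
crps_sample(draws, y = 0)    # forecast centred on the observation: lower score
crps_sample(draws, y = 3)    # observation far in the tail: higher score
```

When all samples are identical the second term vanishes and the score reduces to the absolute error, which is one reason CRPS is often described as a probabilistic generalisation of MAE.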
example1 <- proteus(amzn_aapl_fb, target = "AMZN", future = 30, past = 60, t_embed = 20, activ = "linear", nodes = 32, distr = "normal", optim = "adam", loss_metric = "crps", rolling_blocks = TRUE, n_blocks = 4, sequence_stride = TRUE, verbose = TRUE, dates = amzn_aapl_fb$Date, days_off = c("saturday", "sunday"))
block 1
5 sequence for training
4 sequence for testing
setting max batch size to 4
epoch: 3 Train loss: 0.3175983 Test loss: 0.3284948
epoch: 6 Train loss: 0.3379224 Test loss: 0.3695798
epoch: 9 Train loss: 0.3481375 Test loss: 0.3160852
epoch: 12 Train loss: 0.3030055 Test loss: 0.4029772
early stop at epoch: 12 Train loss: 0.292647 Test loss: 0.4029772
block 2
4 sequence for training
4 sequence for testing
setting max batch size to 4
epoch: 3 Train loss: 0.3288874 Test loss: 0.3136515
epoch: 6 Train loss: 0.3316765 Test loss: 0.3180925
epoch: 9 Train loss: 0.308616 Test loss: 0.3132755
early stop at epoch: 10 Train loss: 0.2995778 Test loss: 0.3528911
block 3
4 sequence for training
4 sequence for testing
setting max batch size to 4
epoch: 3 Train loss: 0.3604256 Test loss: 0.3602472
epoch: 6 Train loss: 0.3751796 Test loss: 0.3895938
epoch: 9 Train loss: 0.3314459 Test loss: 0.376107
epoch: 12 Train loss: 0.2679351 Test loss: 0.3170523
early stop at epoch: 14 Train loss: 0.2885902 Test loss: 0.3565183
final training on all 4
17 sequence for training
setting max batch size to 17
epoch: 3 Train loss: 0.3474164
epoch: 6 Train loss: 0.3335282
epoch: 9 Train loss: 0.3325986
epoch: 12 Train loss: 0.3279028
epoch: 15 Train loss: 0.3235377
epoch: 18 Train loss: 0.312183
epoch: 21 Train loss: 0.3309314
epoch: 24 Train loss: 0.322084
epoch: 27 Train loss: 0.3340648
epoch: 30 Train loss: 0.3435601
proteus: 25.19 sec elapsed
variational model based on normal latent distribution with 104 tensors and 167850 parameters
The result is a list of different components, as you can see below.
names(example1)
[1] "model_descr" "prediction" "plot" "learning_error"
[5] "features_errors" "pred_stats" "time_log"
The first variable is a simple high-level description of the model.
example1$model_descr
[1] "variational model based on normal latent distribution with 104 tensors and 167850 parameters"
The prediction is a list including the predicted results for each time feature (quantiles, min, max, mean, mode, sd, skewness and kurtosis for each time point in the future sequence).
example1$prediction
$AMZN
dates min q10 q25 q50 q75 q90 max
t1 2019-07-15 1947.048 1972.820 2004.121 2008.133 2010.628 2016.053 2059.355
t2 2019-07-16 1887.705 1958.359 1991.765 2009.173 2020.239 2022.959 2062.272
t3 2019-07-17 1924.641 1971.100 1985.011 2004.669 2023.003 2041.329 2062.229
t4 2019-07-18 1933.760 1974.908 1984.659 2004.046 2019.431 2040.763 2042.731
t5 2019-07-19 1949.507 1954.793 1978.063 1999.376 2009.094 2035.146 2049.761
t6 2019-07-22 1909.000 1986.733 1996.835 2010.115 2018.833 2040.313 2050.868
t7 2019-07-23 1980.636 1986.671 1996.289 2012.278 2029.511 2054.748 2056.156
t8 2019-07-24 1973.103 1998.108 2000.788 2016.063 2035.966 2048.329 2058.482
t9 2019-07-25 1940.995 1975.335 1995.218 2010.327 2022.236 2026.003 2049.924
t10 2019-07-26 1947.921 1983.859 2004.503 2013.217 2022.302 2030.653 2064.791
t11 2019-07-29 1973.390 1991.212 2003.827 2021.872 2026.144 2044.475 2064.307
t12 2019-07-30 1954.717 2005.250 2008.093 2022.075 2038.974 2046.755 2080.567
t13 2019-07-31 1928.817 1982.854 2004.587 2024.878 2035.095 2080.861 2095.865
t14 2019-08-01 1902.480 1985.288 1992.836 2027.192 2048.379 2072.498 2103.087
t15 2019-08-02 1905.155 1988.567 1994.409 2036.033 2056.467 2074.677 2112.431
t16 2019-08-05 1952.391 1989.473 2000.900 2039.772 2056.011 2061.836 2105.646
t17 2019-08-06 1954.487 1993.442 2000.255 2031.110 2058.969 2060.511 2108.397
t18 2019-08-07 1936.893 1983.050 2001.540 2020.368 2055.646 2060.209 2072.619
t19 2019-08-08 1922.195 1965.068 2000.309 2020.496 2052.643 2065.103 2092.243
t20 2019-08-09 1936.475 1947.165 1996.691 2029.618 2065.609 2096.311 2101.932
t21 2019-08-12 1936.613 1947.313 2000.668 2025.971 2068.809 2087.958 2103.855
t22 2019-08-13 1827.491 1939.799 2001.028 2019.493 2075.201 2106.257 2117.230
t23 2019-08-14 1791.617 1950.965 1996.500 2020.233 2073.151 2126.488 2138.192
t24 2019-08-15 1861.185 1951.407 1994.614 2020.033 2074.116 2131.857 2138.105
t25 2019-08-16 1833.235 1954.589 1997.995 2017.132 2084.279 2129.137 2138.173
t26 2019-08-19 1891.939 1958.903 1992.291 2016.516 2081.429 2123.861 2147.246
t27 2019-08-20 1903.661 1957.260 1999.249 2023.286 2086.663 2128.051 2183.203
t28 2019-08-21 1842.617 1970.182 2002.738 2026.433 2072.271 2137.156 2212.329
t29 2019-08-22 1836.043 1971.771 2011.977 2031.169 2080.700 2142.093 2196.207
t30 2019-08-23 1860.897 1982.150 2008.734 2031.313 2079.767 2158.128 2205.688
mean sd mode skewness kurtosis
t1 2004.362 26.714 2008.113 -0.297 4.157
t2 1998.606 43.109 2016.164 -1.359 4.947
t3 2004.211 36.390 2005.515 -0.536 3.194
t4 2002.017 31.069 1997.223 -0.596 3.069
t5 1996.451 30.147 2002.708 0.073 2.396
t6 2004.322 35.612 2008.761 -1.492 5.528
t7 2015.245 26.009 2005.898 0.409 1.915
t8 2019.470 24.719 2007.266 -0.127 2.269
t9 2005.087 28.376 2018.482 -0.767 3.430
t10 2011.309 28.201 2013.669 -0.484 3.970
t11 2018.115 24.415 2023.394 0.000 2.807
t12 2023.026 30.430 2013.375 -0.379 3.887
t13 2022.443 43.981 2022.503 -0.286 3.305
t14 2020.911 50.980 2033.521 -0.712 3.748
t15 2025.433 53.315 2051.354 -0.625 3.440
t16 2029.531 41.300 2050.114 -0.090 2.595
t17 2029.626 41.389 2048.819 0.049 2.605
t18 2022.285 40.249 2055.602 -0.616 2.602
t19 2020.714 47.245 2034.569 -0.568 2.796
t20 2026.855 54.242 2017.932 -0.278 2.125
t21 2028.163 54.534 2031.947 -0.332 2.070
t22 2020.438 80.875 2015.621 -1.024 3.775
t23 2021.837 92.852 2015.313 -1.097 4.274
t24 2027.279 79.829 2025.834 -0.378 2.771
t25 2027.482 85.207 1996.713 -0.722 3.347
t26 2031.072 73.730 2012.007 -0.079 2.392
t27 2037.556 77.757 2015.593 0.217 2.519
t28 2036.792 92.222 2018.024 -0.106 3.488
t29 2039.185 90.387 2024.659 -0.437 3.698
t30 2043.696 88.199 2023.857 -0.038 3.361
For each time feature included in the model, you get a plot of the median values with the chosen confidence interval (the default ci is 0.8).
example1$plot
$AMZN
Sequence Plot of Historical and Predicted Close Prices
It is possible to select any number of time features from the starting dataset. In the following example, we select Amazon, Google and Facebook for a joint prediction.
example2 <- proteus(amzn_aapl_fb, target = c("AMZN", "GOOGL", "FB"), future = 30, past = 60, t_embed = 20, activ = "linear", nodes = 64, distr = "normal", optim = "adam", rolling_blocks = FALSE, sequence_stride = TRUE, verbose = FALSE, dates = amzn_aapl_fb$Date, days_off = c("saturday", "sunday"))
proteus: 35.37 sec elapsed
example2$plot
$AMZN
Sequence Plot of Historical and Predicted Close Prices
$GOOGL
Sequence Plot of Historical and Predicted Close Prices
$FB
Sequence Plot of Historical and Predicted Close Prices
The learning error is the error of the joint variational model, including all the time features, for both the training and the validation process (in this case, based on an incremental window over the available blocks with full-strided sequences). You get a wide range of common metrics (rmse, mae, mdae, mpe, mape, smape, rrse, rae). A good model should ideally score below 1 for rrse and rae (or at least approximately equal to 1); a bad model is consistently above 1 (“consistently” meaning that the score is a multiple of 1).
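The threshold of 1 for rrse and rae follows from how these relative metrics are usually defined: the model's error is divided by the error of a naive predictor that always returns the mean of the actual values. The sketch below uses those usual definitions on made-up data; whether the package computes them with exactly the same conventions is an assumption.

```r
# Minimal sketch: relative errors against a naive mean predictor.
rrse <- function(actual, predicted) {
  sqrt(sum((actual - predicted)^2) / sum((actual - mean(actual))^2))
}
rae <- function(actual, predicted) {
  sum(abs(actual - predicted)) / sum(abs(actual - mean(actual)))
}

actual <- c(10, 12, 11, 13, 15, 14)                     # made-up observations
good   <- actual + c(0.2, -0.1, 0.3, -0.2, 0.1, -0.3)   # close to the truth
naive  <- rep(mean(actual), length(actual))             # always predicts the mean

c(rrse = rrse(actual, good),  rae = rae(actual, good))   # both well below 1
c(rrse = rrse(actual, naive), rae = rae(actual, naive))  # both equal to 1
```

A score near 1 therefore means the model does no better than predicting the mean, and a score that is a multiple of 1 means it does several times worse.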
example2$learning_error
rmse mae mdae mpe mape smape rrse rae
train 6.688533 3.714400 1.9285 Inf Inf 1.933433 1.004867 0.9897333
test 11.637167 6.970333 3.6140 -Inf Inf 1.950033 1.003467 0.9935000
The feature errors include the same standard metrics (rmse, mae, mdae, mpe, mape, smape, rrse, rae), this time measured for each time feature.
example2$features_errors
$AMZN
rmse mae mdae mpe mape smape rrse
train 27.39053 20.46910 13.67537 0.01616667 0.05490000 0.05526667 0.3032667
test 49.36047 38.58493 30.07370 0.01086667 0.03713333 0.03933333 0.3780667
rae
train 0.2765
test 0.3219
$GOOGL
rmse mae mdae mpe mape smape rrse
train 30.81497 20.71213 13.9444 0.02540000 0.03780000 0.03936667 0.2333333
test 48.16187 34.44870 25.7233 0.02333333 0.04066667 0.04160000 0.8156333
rae
train 0.1937333
test 0.6923000
$FB
rmse mae mdae mpe mape smape rrse
train 4.944333 3.784233 2.987633 0.03143333 0.07430000 0.07306667 0.1965333
test 6.735700 5.513033 5.104233 0.02416667 0.04043333 0.04133333 0.4512000
rae
train 0.1674
test 0.4162
It is possible to select among different latent models (twelve different distributions, with two or three parameters), which affects the number of parameters to be estimated and the computation time. Other important variables that affect the computation time are rolling_blocks (when set to FALSE, the back-testing is performed on an incremental block scheme with an increasing number of sequences from previous blocks: you get a more precise result, but more time is required) and sequence_stride (when set to FALSE, each sequence is shifted by a single point in time instead of the whole past + future length, resulting in a large number of sequences for each block: as above, you get larger tensors and more precision, but more computation time is required). You can easily compare example1 and example3: in this case, example3 overfits slightly.
example3 <- proteus(amzn_aapl_fb, target = "AMZN", future = 30, past = 60, t_embed = 20, activ = "linear", nodes = 32, distr = "genbeta", optim = "adam", loss_metric = "crps", rolling_blocks = FALSE, sequence_stride = FALSE, verbose = FALSE, dates = amzn_aapl_fb$Date, days_off = c("saturday", "sunday"))
proteus: 408.65 sec elapsed
example1$time_log
[1] "25S"
example3$time_log
[1] "6M 49S"
example1$model_descr
[1] "variational model based on normal latent distribution with 104 tensors and 167850 parameters"
example3$model_descr
[1] "variational model based on genbeta latent distribution with 156 tensors and 251775 parameters"
example1$features_errors
$AMZN
rmse mae mdae mpe mape smape rrse
train 28.2626 21.04407 14.01620 0.01403333 0.04426667 0.04726667 0.3636667
test 49.6028 38.85787 30.15487 0.01036667 0.03756667 0.03966667 0.3802333
rae
train 0.3098333
test 0.3244333
example3$features_errors
$AMZN
rmse mae mdae mpe mape smape rrse
train 25.45427 17.84980 12.00577 -0.004700000 0.04746667 0.0471 0.2604000
test 61.71610 47.26393 37.57153 0.008266667 0.04896667 0.0490 0.4428667
rae
train 0.2247333
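test 0.4049667
For each point in time of the predicted features, six measures are taken; they are reported in pred_stats as avg_iqr_to_range, terminal_iqr_ratio, avg_kl_divergence, terminal_kl_divergence, avg_upside_prob and terminal_upside_prob.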
While the IQR ratio could be helpful in understanding the aleatoric uncertainty affecting the predicted features, the KL divergence could be useful in estimating the epistemic uncertainty (just an idea). Comparing example1 (normal) with example3 (the overfitting genbeta), notice how much the terminal KL divergence depends on the specific distribution.
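One reading of terminal_iqr_ratio, inferred from the printed numbers rather than from the package documentation, is the interquartile range at the last predicted step divided by the interquartile range at the first one. The sketch below recomputes it for AMZN from the q25 and q75 values printed in example1$prediction; treat the interpretation as an assumption.

```r
# Minimal sketch: recomputing the terminal IQR ratio for AMZN from the
# quantiles printed in example1$prediction (values copied from that output).
q25_t1  <- 2004.121; q75_t1  <- 2010.628   # first predicted step (t1)
q25_t30 <- 2008.734; q75_t30 <- 2079.767   # last predicted step (t30)

iqr_t1  <- q75_t1  - q25_t1                # spread of the central 50% at t1
iqr_t30 <- q75_t30 - q25_t30               # spread of the central 50% at t30
terminal_iqr_ratio <- iqr_t30 / iqr_t1
round(terminal_iqr_ratio, 3)   # 10.916, the value reported in example1$pred_stats
```

Read this way, the ratio says that the central half of the predictive distribution is about eleven times wider at the end of the horizon than at its start.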
example1$pred_stats
AMZN
avg_iqr_to_range 0.278
terminal_iqr_ratio 10.916
avg_kl_divergence 0.077
terminal_kl_divergence 11.554
avg_upside_prob 0.552
terminal_upside_prob 0.752
example2$pred_stats
AMZN GOOGL FB
avg_iqr_to_range 0.277 0.281 0.308
terminal_iqr_ratio 10.514 6.428 5.111
avg_kl_divergence 0.101 0.168 0.765
terminal_kl_divergence 10.250 25.049 125.414
avg_upside_prob 0.551 0.558 0.562
terminal_upside_prob 0.755 0.744 0.802
example3$pred_stats
AMZN
avg_iqr_to_range 0.092
terminal_iqr_ratio 7.485
avg_kl_divergence 0.005
terminal_kl_divergence 2.539
avg_upside_prob 0.535
terminal_upside_prob 0.666
Last but not least, you can forecast long sequences (if you have enough data points for the number of blocks and the sequence stride), and you can choose among the different activation functions implemented in Proteus (in example4, we tried “gelu”).
example4 <- proteus(amzn_aapl_fb, target = c("AMZN", "GOOGL", "FB"), future = 100, past = 300, t_embed = 20, activ = "gelu", nodes = 32, distr = "gev", optim = "rmsprop", rolling_blocks = TRUE, sequence_stride = FALSE, verbose = FALSE, dates = amzn_aapl_fb$Date, days_off = c("saturday", "sunday"))
proteus: 237.69 sec elapsed
example4$plot
$AMZN
Sequence Plot of Historical and Predicted Close Prices
$GOOGL
Sequence Plot of Historical and Predicted Close Prices
$FB
Sequence Plot of Historical and Predicted Close Prices
1. Seyed Mehran Kazemi, Rishab Goel, Sepehr Eghbali, Janahan Ramanan, Jaspreet Sahota, Sanjay Thakur, Stella Wu, Cathal Smyth, Pascal Poupart, Marcus Brubaker, Time2Vec: Learning a Vector Representation of Time, arXiv:1907.05321v1 [cs.LG], 11 Jul 2019.
2. Nikolaos Passalis, Anastasios Tefas, Juho Kanniainen, Moncef Gabbouj, and Alexandros Iosifidis, Deep Adaptive Input Normalization for Time Series Forecasting, arXiv:1902.07892v2 [q-fin.CP], 22 Sep 2019.