Intro to proteus

Author

Giancarlo Vercellino

Published

March 2, 2023

In Greek mythology, Proteus (/ˈproʊtiəs, -tjuːs/; Ancient Greek: Πρωτεύς, Prōteus) is an early prophetic sea-god or god of rivers and oceanic bodies of water, one of several deities whom Homer calls the “Old Man of the Sea” (halios gerôn). Some who ascribe a specific domain to Proteus call him the god of “elusive sea change”, which suggests the constantly changing nature of the sea or the liquid quality of water. He can foretell the future but, in a mytheme familiar to several cultures, will change his shape to avoid doing so; he answers only to those who are capable of capturing him. From this feature of Proteus comes the adjective protean, meaning “versatile”, “mutable”, or “capable of assuming many forms”. “Protean” has positive connotations of flexibility, versatility and adaptability. (Wikipedia)

Multiform like Proteus

Proteus is a Sequence-to-Sequence Variational Model designed for time-feature analysis, leveraging a wide range of distributions for improved accuracy. Unlike traditional methods that rely solely on the normal distribution, Proteus uses various latent models to better capture and predict complex processes. To achieve this, Proteus employs a neural network architecture that estimates the shape, location, and scale parameters of the chosen distribution. This approach transforms past sequence data into future sequence parameters, improving the model’s prediction capabilities. Proteus also assesses the accuracy of its predictions by estimating the error of measurement and calculating the confidence interval. By utilizing a range of distributions and advanced modeling techniques, Proteus provides a more accurate and comprehensive approach to time-feature analysis.

Here is a description of Proteus's architecture. A set of neural network models is created to estimate each shape/location/scale parameter of the chosen latent distribution: moving from left to right, the tensors of past sequences are transformed into the tensors of parameters for the future sequences. The latent model is then used to produce the future sequences, to estimate the measurement error, and to assess the confidence interval.

Figure 1 - The overall model architecture

The time features, structured in a dataframe in columnar order, are “horizontally” reframed into a 3D tensor. The 3D tensor passes through three main steps: a time2vec embedding (inspired by this reference1), adaptive normalization (inspired by this reference2), and a simple neural network with three linear transformations to reach the target size. At the end of these three steps you have the tensor of estimated parameters for the chosen latent model, and the sampling process may begin.

Figure 2 - How each parameter for the latent distribution is calculated
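
To make the embedding step concrete, here is a minimal sketch of the time2vec idea from the first reference: one linear component captures the trend and the remaining k - 1 components are periodic. The function time2vec and its random weights w and p below are illustrative stand-ins for the parameters that proteus learns during training, not the package's internal code.

time2vec <- function(tau, k, w = rnorm(k), p = rnorm(k)) {
  linear <- w[1] * tau + p[1]                                          # trend component
  periodic <- sin(outer(tau, w[-1]) + rep(p[-1], each = length(tau)))  # k - 1 periodic components
  cbind(linear, periodic)                                              # length(tau) x k embedding
}
emb <- time2vec(1:60, k = 20)   # 60 past steps, 20 embeddings as in the examples below
dim(emb)                        # 60 x 20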

Starting from scratch

In our introduction to Proteus, we are going to use the Close Price series for Amazon, Google and Facebook (from Yahoo Finance). As shown in amzn_aapl_fb, the time features are expected in ordered columns in a dataframe format.

library(proteus)
knitr::kable(head(amzn_aapl_fb, 10), align = "ccc", caption = "Examples of time features: close prices for Amazon, Google and Facebook")
Examples of time features: close prices for Amazon, Google and Facebook
Date AMZN GOOGL FB
3779 2012-05-18 213.85 300.5005 38.23
3780 2012-05-21 218.11 307.3624 34.03
3781 2012-05-22 215.33 300.7007 31.00
3782 2012-05-23 217.28 305.0350 32.00
3783 2012-05-24 215.24 302.1321 33.03
3784 2012-05-25 212.89 296.0611 31.91
3785 2012-05-29 214.75 297.4675 28.84
3786 2012-05-30 209.23 294.4094 28.19
3787 2012-05-31 212.91 290.7207 29.60
3788 2012-06-01 208.22 285.7758 27.72

In our first example, we use proteus to predict a single time feature, namely Amazon. Proteus reframes each time feature as a matrix of sequences with dimensions n_sequences x (past + future). The number of sequences is determined by the stride variable, which in this case is set to 1: each past + future sequence is shifted by a single position in time.
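
As a rough illustration of this reframing (a hedged sketch, not the package's internal code), the helper below builds the sequence matrix for a single feature:

reframe <- function(x, past, future, stride = 1) {
  len <- past + future
  starts <- seq(1, length(x) - len + 1, by = stride)   # one window start per stride step
  t(sapply(starts, function(s) x[s:(s + len - 1)]))    # n_sequences x (past + future)
}
seqs <- reframe(amzn_aapl_fb$AMZN, past = 60, future = 30, stride = 1)
dim(seqs)                                              # one row per shifted window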

We aim to predict sequences of 30 time periods into the future based on 60 time periods in the past, using 20 different temporal embeddings. These embeddings decompose the original time feature into 1 trend component and 19 different periodic components. We use a feed-forward neural network with 32 nodes and a variational model based on the normal distribution.

The optimization method is another important hyper-parameter that impacts the error performance of the model. We measure error using back-testing, with n_blocks set to 4 and rolling_blocks set to TRUE: the error is sampled on three different train/test measurements using a rolling window scheme. If rolling_blocks is set to FALSE, the error is sampled three times using an incremental window scheme.
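
The two schemes can be sketched with plain indices (a hedged reading of the verbose log below, not the package's actual implementation): with 2164 sequences and n_blocks = 4, each block holds 541 sequences and blocks 2 to 4 serve in turn as test sets, yielding the three measurements.

blocks <- split(seq_len(2164), rep(1:4, each = 541))   # 4 blocks of 541 sequences each
for (i in 1:3) {
  train_rolling     <- blocks[[i]]                     # rolling window: the single previous block
  train_incremental <- unlist(blocks[1:i])             # incremental window: all previous blocks
  test_set          <- blocks[[i + 1]]                 # the next block is always the test set
}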

Setting verbose to TRUE provides detailed information on the training and validation process, including the number of sequences, the max batch size, and the loss metric. Proteus offers several loss metrics, including the Evidence Lower Bound, the Continuous Ranked Probability Score, and a Custom Score (the latter is the absolute difference in cdf between prediction and actual, computed on the estimated latent parameters for the chosen distribution).
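
For the normal latent distribution the CRPS has a well-known closed form (Gneiting & Raftery, 2007); the function below is a minimal sketch of that formula, not proteus's internal implementation.

crps_normal <- function(y, mu, sigma) {
  z <- (y - mu) / sigma
  sigma * (z * (2 * pnorm(z) - 1) + 2 * dnorm(z) - 1 / sqrt(pi))
}
crps_normal(y = 2015, mu = 2011.8, sigma = 16)   # smaller is better; illustrative values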

example1 <- proteus(amzn_aapl_fb, target = "AMZN", future = 30, past = 60, t_embed = 20, activ = "linear", nodes = 32, distr = "normal", optim = "adam", loss_metric = "crps", rolling_blocks = T, n_blocks = 4, stride = 1, verbose = T, dates = "Date")
date and value gaps filled with kalman imputation

block 1 
541 sequence for training
541 sequence for testing
epoch:  3    Train loss:  0.3282238    Test loss:  0.3254851 
epoch:  6    Train loss:  0.3331435    Test loss:  0.3290568 
epoch:  9    Train loss:  0.3325217    Test loss:  0.3298825 
epoch:  12    Train loss:  0.3241959    Test loss:  0.3331832 
early stop at epoch:  12    Train loss:  0.1797601    Test loss:  0.295964 

block 2 
541 sequence for training
541 sequence for testing
epoch:  3    Train loss:  0.3333673    Test loss:  0.3357568 
epoch:  6    Train loss:  0.327236    Test loss:  0.3315799 
epoch:  9    Train loss:  0.3264316    Test loss:  0.3358205 
epoch:  12    Train loss:  0.3347558    Test loss:  0.328617 
epoch:  15    Train loss:  0.3318962    Test loss:  0.3311511 
epoch:  18    Train loss:  0.3319906    Test loss:  0.3360375 
epoch:  21    Train loss:  0.3293301    Test loss:  0.3344645 
epoch:  24    Train loss:  0.3340255    Test loss:  0.332931 
epoch:  27    Train loss:  0.3338883    Test loss:  0.3309637 
epoch:  30    Train loss:  0.3275637    Test loss:  0.325176 

block 3 
541 sequence for training
541 sequence for testing
epoch:  3    Train loss:  0.3300303    Test loss:  0.3347785 
epoch:  6    Train loss:  0.3347132    Test loss:  0.3320392 
epoch:  9    Train loss:  0.3323337    Test loss:  0.3323828 
epoch:  12    Train loss:  0.3296569    Test loss:  0.3317752 
early stop at epoch:  13    Train loss:  0.2757971    Test loss:  0.3399342 

final training on all 4 
2164 sequence for training
epoch:  3    Train loss:  0.3334524 
epoch:  6    Train loss:  0.3327346 
epoch:  9    Train loss:  0.3326941 
epoch:  12    Train loss:  0.3342977 
epoch:  15    Train loss:  0.3330329 
epoch:  18    Train loss:  0.3315944 
epoch:  21    Train loss:  0.3305171 
epoch:  24    Train loss:  0.3344898 
epoch:  27    Train loss:  0.3330322 
epoch:  30    Train loss:  0.3325613 

variational model based on normal latent distribution with 104 tensors and 167850 parameters
proteus: 969.31 sec elapsed

The result is a list of different components, as you can see below.

names(example1)
[1] "model_descr"     "prediction"      "plot"            "features_errors"
[5] "history"         "time_log"       

The first variable is a simple high-level description of the model.

example1$model_descr
[1] "variational model based on normal latent distribution with 104 tensors and 167850 parameters"

The prediction is a list including the predicted results for each time feature (quantiles, min, max, mean, mode, sd, skewness, kurtosis, iqr to range, above to below range, upside probability and divergence for each time point in the sequence). The IQR to range is the ratio of the interquartile range to the min-max range; the above to below range is the ratio of the range above the median to the range below it; the upside probability is the probability of growth compared to the former point in the time sequence; the divergence is the maximum distance of the cumulative normal curve of each point from that of the former point in the sequence.
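
As a hedged illustration of how such statistics can be computed (here from sampled draws rather than the fitted latent parameters, using the mean and sd of the first two prediction rows below):

prev <- rnorm(1000, mean = 2011.8, sd = 16.0)             # illustrative draws for time t - 1
curr <- rnorm(1000, mean = 2012.7, sd = 22.6)             # illustrative draws for time t
IQR(curr) / (max(curr) - min(curr))                       # iqr_to_range
(max(curr) - median(curr)) / (median(curr) - min(curr))   # above_to_below_range
mean(curr > prev)                                         # upside_prob: growth vs the former point
ks.test(curr, prev)$statistic                             # divergence: max distance between the cdfs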

knitr::kable(example1$prediction$AMZN[1:10,], align = "ccc", caption = "Examples of time-feature prediction (first ten rows)")
Examples of time-feature prediction (first ten rows)
min 10% 25% 50% 75% 90% max mean sd mode kurtosis skewness iqr_to_range above_to_below_range upside_prob divergence
2019-07-12 1871.676 2000.441 2008.193 2011.581 2015.973 2023.987 2128.774 2011.812 15.9701 2011.240 18.8059 -0.4415 0.0303 0.8377 0.5625 0.3590
2019-07-13 1846.539 1993.865 2006.659 2012.422 2019.670 2031.785 2146.090 2012.695 22.6081 2012.371 12.7274 -0.3956 0.0434 0.8058 0.5632 0.0536
2019-07-14 1810.182 1988.702 2005.666 2013.443 2023.462 2039.838 2146.244 2013.838 27.7922 2012.857 10.1913 -0.5391 0.0530 0.6534 0.5829 0.0271
2019-07-15 1767.992 1986.050 2004.821 2014.537 2027.263 2047.659 2157.775 2014.943 32.5280 2013.479 9.5906 -0.7504 0.0576 0.5810 0.5816 0.0197
2019-07-16 1759.234 1981.934 2004.203 2015.539 2030.340 2053.713 2186.508 2015.691 36.5173 2013.907 9.0279 -0.7370 0.0612 0.6671 0.5508 0.0203
2019-07-17 1781.287 1981.259 2003.228 2015.434 2032.437 2056.735 2201.561 2016.197 40.0886 2013.796 8.6095 -0.6986 0.0695 0.7949 0.5311 0.0228
2019-07-18 1759.668 1979.082 2002.642 2016.389 2035.440 2060.172 2202.072 2016.932 43.2282 2013.050 8.9574 -0.7630 0.0741 0.7233 0.5514 0.0216
2019-07-19 1729.402 1977.678 2001.741 2017.217 2037.231 2063.003 2235.560 2017.841 46.3900 2014.703 9.4186 -0.7207 0.0701 0.7586 0.5656 0.0136
2019-07-20 1724.631 1977.317 2001.581 2017.070 2039.155 2067.486 2251.566 2018.706 49.2971 2015.534 9.4522 -0.7050 0.0713 0.8019 0.5576 0.0148
2019-07-21 1718.067 1975.772 2001.306 2017.695 2041.739 2070.241 2280.978 2019.581 51.8477 2015.648 9.2776 -0.7569 0.0718 0.8787 0.5607 0.0092

For each time feature included in the model, you get a plot of the median values with the chosen confidence interval (the ci default is 0.8).

example1$plot
$AMZN

Adding any number of time features

It is possible to select any number of time features from the starting dataset. In the following example, we select Amazon, Google and Facebook for a joint prediction.

example2 <- proteus(amzn_aapl_fb, target = c("AMZN", "GOOGL", "FB"), future = 30, past = 60, t_embed = 20, activ = "linear", nodes = 64, distr = "normal", optim = "adam", rolling_blocks = F, stride = 10, verbose = F, dates = "Date")
proteus: 249.98 sec elapsed
example2$plot
$AMZN


$GOOGL


$FB

The history plot reports the average selected loss across the validation blocks (in this case, based on an incremental window scheme with 10-strided sequences).

example2$history

The features_errors component includes the standard error metrics (me, mae, mse, rmsse, mpe, mape, rmae, rrmse, rame, mase, smse, sce) for each time feature.

example2$features_errors
$AMZN
            me      mae      mse    rmsse   mpe       mape      rmae     rrmse
train  5.62200 15.79933  525.156 11.42867 0.011 0.04200000 0.0850000 0.1083333
test  15.64367 40.34967 4017.532 25.79433 0.016 0.04133333 0.1613333 0.1816667
            rame     mase     smse      sce
train 0.02233333 4.191000 131.0683 4826.621
test  0.10733333 9.393667 813.8523 6407.464

$GOOGL
         me      mae      mse    rmsse        mpe       mape      rmae
train 6.337 16.91867  584.695 12.26367 0.01133333 0.03300000 0.0700000
test  5.176 26.83400 1460.180 18.52433 0.00500000 0.03066667 0.2703333
           rrmse       rame     mase     smse      sce
train 0.08933333 0.02633333 4.526333 150.6217 5185.446
test  0.32466667 0.13833333 6.868333 361.1113 2257.251

$FB
            me      mae      mse rmsse        mpe       mape       rmae
train 1.573000 2.759000 15.71833 4.739 0.02600000 0.05533333 0.05533333
test  1.597667 4.945333 54.01133 7.686 0.01133333 0.03666667 0.28300000
           rrmse       rame     mase     smse      sce
train 0.07166667 0.02966667 4.020667 22.57467 7611.148
test  0.31300000 0.22766667 6.735333 68.88967 3741.401

Shifting skin, enhancing precision and getting a better understanding of uncertainty

Proteus offers a selection of twelve different latent models, each with two or three parameters, that can be used to improve the accuracy of predictions. However, selecting a specific model may increase the number of parameters to be estimated and the computation time required.

Other important variables that impact computation time include rolling_blocks and stride. When rolling_blocks is set to FALSE, back-testing is performed using an incremental block scheme that accumulates the sequences from previous blocks, resulting in more accurate results but longer computation time. The stride parameter operates as a thinning factor, reducing tensor size and computation time. To illustrate, consider the comparison between example1 and example3: by using a larger stride, we are able to reduce both computation time and overfitting. Therefore, selecting the appropriate latent model and adjusting key parameters can significantly impact the performance of time-feature analysis.
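
The thinning effect of stride is easy to quantify: working back from the 2164 training sequences reported in example1's log, the reframed series offers 2253 window positions, so a stride of 20 keeps roughly one position in twenty (a back-of-the-envelope check, not package code).

n_seq <- function(n_points, past, future, stride) {
  length(seq(1, n_points - (past + future) + 1, by = stride))
}
n_seq(2253, past = 60, future = 30, stride = 1)    # 2164 sequences, as in example1
n_seq(2253, past = 60, future = 30, stride = 20)   # 109 sequences for example3's setting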

example3 <- proteus(amzn_aapl_fb, target = "AMZN", future = 30, past = 60, t_embed = 20, activ = "linear", nodes = 32, distr = "genbeta", optim = "adam", loss_metric = "crps", rolling_blocks = F, stride = 20, verbose = F, dates = "Date")
proteus: 173.34 sec elapsed
example1$time_log
[1] "16M 9S"
example3$time_log
[1] "2M 53S"
example1$model_descr
[1] "variational model based on normal latent distribution with 104 tensors and 167850 parameters"
example3$model_descr
[1] "variational model based on genbeta latent distribution with 156 tensors and 251775 parameters"
example1$features_errors
$AMZN
             me      mae       mse    rmsse        mpe       mape      rmae
train  8.101667 18.41333  711.8557 12.84200 0.01400000 0.03833333 0.1306667
test  13.773333 40.40367 3969.9887 25.80133 0.01433333 0.04133333 0.1623333
          rrmse  rame     mase     smse      sce
train 0.1586667 0.056 4.691333 170.1490 31275.20
test  0.1843333 0.101 9.390667 806.6403 56542.38
example3$features_errors
$AMZN
             me    mae       mse    rmsse          mpe       mape       rmae
train -3.061000 15.395  469.3167 11.03867 -0.014000000 0.04266667 0.08566667
test   5.324667 35.400 2957.0780 22.29100  0.002333333 0.03733333 0.14166667
          rrmse       rame     mase     smse        sce
train 0.1096667 0.02866667 4.146000 122.1493 -1177.1557
test  0.1560000 0.02266667 8.274333 601.1843   975.2163

With the release of version 1.1, we implemented a dedicated function for hyper-parameter tuning using random search. To see if we can improve our results with a limited number of models, let’s begin a random search with a sample size of 3. This will not only provide us with potential improvements but also give us valuable insights on how to further refine the tuning process.

example4 <- proteus_random_search(3, amzn_aapl_fb, target = "AMZN", future = 30,  loss_metric = "crps", rolling_blocks = F, verbose = F, dates = "Date")
proteus: 409.89 sec elapsed
proteus: 377.65 sec elapsed
proteus: 217.81 sec elapsed
random search: 1005.41 sec elapsed

If we take a look inside the random_search table, we can get an idea of the best hyper-parameters.

knitr::kable(example4$random_search, align = "ccc", caption = "Examples of random search into the hyper-parameter space of proteus")
Examples of random search into the hyper-parameter space of proteus
model past t_embed activ nodes distr optim lr stride avg_me avg_mae avg_mse avg_rmsse avg_mpe avg_mape avg_rmae avg_rrmse avg_rame avg_mase avg_smse avg_sce
3 30 5 linear 25 chisq sgd 0.088 9 9.030500e+00 2.780270e+01 2.281405e+03 1.830150e+01 0.0103 0.0403 0.1197 0.1413 0.0543 6.681300e+00 4.704435e+02 5.084861e+03
2 34 11 softmax 304 gpd rprop 0.088 9 -2.055587e+02 2.069192e+02 3.465990e+05 2.165030e+02 -0.4363 0.4370 1.1597 1.6680 1.5997 5.506780e+01 6.896170e+04 -1.581803e+05
1 46 26 mish 129 exp asgd 0.042 3 -5.547500e+07 5.547500e+07 1.769282e+18 3.942615e+08 -136643.7175 136643.7175 225947.5537 2951937.4700 271004.4677 1.269292e+07 3.371283e+17 -1.400557e+11
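
Behind a table like the one above, random search simply draws candidate configurations from the hyper-parameter space and back-tests each one. Here is a minimal sketch of such a draw (the ranges below are illustrative assumptions, not the actual sampling bounds used by proteus_random_search):

set.seed(42)
n <- 3
data.frame(
  past    = sample(30:90, n),                                             # past window length
  t_embed = sample(5:30, n),                                              # number of temporal embeddings
  activ   = sample(c("linear", "softmax", "mish"), n, replace = TRUE),    # activation function
  nodes   = sample(25:512, n),                                            # network width
  distr   = sample(c("normal", "genbeta", "chisq", "gpd", "exp"), n, replace = TRUE),
  optim   = sample(c("adam", "sgd", "rprop", "asgd"), n, replace = TRUE),
  lr      = round(runif(n, 0.001, 0.1), 3),                               # learning rate
  stride  = sample(1:20, n)                                               # thinning factor
)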

Footnotes

  1. Seyed Mehran Kazemi, Rishab Goel, Sepehr Eghbali, Janahan Ramanan, Jaspreet Sahota, Sanjay Thakur, Stella Wu, Cathal Smyth, Pascal Poupart, Marcus Brubaker, Time2Vec: Learning a Vector Representation of Time, arXiv:1907.05321v1 [cs.LG] 11 Jul 2019↩︎

  2. Nikolaos Passalis, Anastasios Tefas, Juho Kanniainen, Moncef Gabbouj, and Alexandros Iosifidis, Deep Adaptive Input Normalization for Time Series Forecasting, arXiv:1902.07892v2 [q-fin.CP] 22 Sep 2019↩︎