Proteus: a brief introduction

Giancarlo Vercellino

23-June-2021

“In Greek mythology, Proteus (/ˈproʊtiəs, -tjuːs/; Ancient Greek: Πρωτεύς, Prōteus) is an early prophetic sea-god or god of rivers and oceanic bodies of water, one of several deities whom Homer calls the “Old Man of the Sea” (halios gerôn). Some who ascribe a specific domain to Proteus call him the god of “elusive sea change”, which suggests the constantly changing nature of the sea or the liquid quality of water. He can foretell the future, but, in a mytheme familiar to several cultures, will change his shape to avoid doing so; he answers only to those who are capable of capturing him. From this feature of Proteus comes the adjective protean, meaning “versatile”, “mutable”, or “capable of assuming many forms”. “Protean” has positive connotations of flexibility, versatility and adaptability.” (Wikipedia)

Multiform like Proteus

Proteus is a sequence-to-sequence variational model for time-feature analysis that supports a wide range of different latent distributions. The main idea behind Proteus is that the normal distribution is not enough to properly support every kind of prediction process: if you want to catch the elusive future, you may need to replace the default normal distribution with other latent models that better fit the specific process at hand.

Here is a description of Proteus’s architecture. A separate neural network is created to estimate each shape/location/scale parameter of the chosen latent distribution: moving from left to right, the tensors holding the past sequences are transformed into the tensors of parameters for the future sequences. The latent model is then used to sample the future sequences, estimate the measurement error and assess the confidence intervals.

The overall model architecture
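
To give an intuition of the sampling step, here is a minimal base-R sketch (not the package’s internal code): assuming the networks have produced a location and a scale parameter for each future time step of a normal latent model, the future sequence is obtained by drawing samples and summarizing them.

# Illustrative only: hypothetical estimated parameters for three future steps
# of a normal latent model; in Proteus these come from the neural networks.
set.seed(1)
mu    <- c(2008, 2010, 2012)   # hypothetical location parameters
sigma <- c(25, 30, 35)         # hypothetical scale parameters
samples <- mapply(function(m, s) rnorm(1000, m, s), mu, sigma)   # 1000 draws x 3 steps
apply(samples, 2, quantile, probs = c(0.1, 0.5, 0.9))            # quantile summary per step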

The time features, structured in a dataframe in columnar order, are “horizontally” reframed into a 3D tensor. The 3D tensor then passes through three main steps: a time2vec embedding (inspired by reference1), an adaptive normalization (inspired by reference2), and a simple neural network with three linear transformations to reach the target size. At the end of these three steps you have the tensor of estimated parameters for the chosen latent model, and the sampling process can begin.

How each parameter for the latent distribution is calculated
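
For reference, the time2vec idea from reference1 can be sketched in a few lines of base R. This is an illustration of the concept, not the package’s implementation; omega and phi are learned weights in the real model and random placeholders here.

# One linear (trend) component plus k - 1 periodic components (illustrative sketch).
time2vec <- function(x, k = 20, omega = rnorm(k), phi = rnorm(k)) {
  linear   <- omega[1] * x + phi[1]                                 # trend component
  periodic <- sapply(2:k, function(i) sin(omega[i] * x + phi[i]))   # periodic components
  cbind(linear, periodic)
}
emb <- time2vec(as.numeric(scale(amzn_aapl_fb$AMZN)), k = 20)
dim(emb)   # length of the series x 20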

Starting from scratch

In our introduction to Proteus, we are going to use the Close Price series for Amazon, Google and Facebook (from Yahoo Finance). As shown in amzn_aapl_fb, the time features are expected as ordered columns of a dataframe.

Examples of time features: close prices for Amazon, Google and Facebook
Date AMZN GOOGL FB
3779 2012-05-18 213.85 300.5005 38.23
3780 2012-05-21 218.11 307.3624 34.03
3781 2012-05-22 215.33 300.7007 31.00
3782 2012-05-23 217.28 305.0350 32.00
3783 2012-05-24 215.24 302.1321 33.03
3784 2012-05-25 212.89 296.0611 31.91
3785 2012-05-29 214.75 297.4675 28.84
3786 2012-05-30 209.23 294.4094 28.19
3787 2012-05-31 212.91 290.7207 29.60
3788 2012-06-01 208.22 285.7758 27.72

In the first example, we are predicting a single time feature, Amazon. Proteus is a seq2seq model, meaning that each time feature is reframed into a matrix of sequences of dimension n_sequences x (past + future). The number of sequences depends on the sequence_stride flag: when it is TRUE, only distinct sequences are extracted for each time feature (the sequence stride equals past + future), while when it is FALSE each sequence is shifted by a single position in time (the sequence stride equals one). A minimal sketch of this reframing is shown below.
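
As a rough illustration of what this reframing does (not the package’s internal code, just a sketch under the same definitions):

# Reframe a single time feature into a matrix of sequences (illustrative sketch).
reframe <- function(x, past, future, sequence_stride = TRUE) {
  len    <- past + future
  stride <- if (sequence_stride) len else 1
  starts <- seq(1, length(x) - len + 1, by = stride)
  t(sapply(starts, function(s) x[s:(s + len - 1)]))   # n_sequences x (past + future)
}
dim(reframe(amzn_aapl_fb$AMZN, past = 60, future = 30, sequence_stride = TRUE))
dim(reframe(amzn_aapl_fb$AMZN, past = 60, future = 30, sequence_stride = FALSE))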

We are predicting sequences of 30 time periods in the future based on sequences of 60 time periods in the past, using 20 different temporal embeddings (i.e., we are decomposing the original time feature into 1 trend and 19 different periodic components), a feed-forward neural network with 32 nodes, and a variational model based on the normal distribution. The optimization method is another important hyper-parameter with a clear impact on the error performance measured with back-testing on four blocks (with n_blocks equal to 4 and rolling_blocks set to TRUE, the error is sampled on three different measurements with a rolling window scheme; setting rolling_blocks to FALSE means the error is sampled three times using an incremental window scheme), as sketched below.
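
One plausible way to picture the two back-testing schemes (an illustrative sketch, not the package’s internal logic): with four blocks there are three train/test rounds, and the difference is whether the training set rolls forward or accumulates. The 20 sequence indices below are hypothetical.

# Illustrative block schemes for n_blocks = 4 on a hypothetical set of 20 sequence indices.
n_seq  <- 20
blocks <- split(seq_len(n_seq), cut(seq_len(n_seq), 4, labels = FALSE))
lapply(1:3, function(i) list(
  rolling_train     = blocks[[i]],           # rolling_blocks = TRUE: train on block i
  incremental_train = unlist(blocks[1:i]),   # rolling_blocks = FALSE: train on blocks 1..i
  test              = blocks[[i + 1]]        # always test on the next block
))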

Setting verbose to TRUE, you will get detailed information on the training and validation process (number of sequences, max batch size, loss metric). The available loss metrics are the Evidence Lower Bound (plus a reconstruction error based on mae) and the Continuous Ranked Probability Score. Both metrics lead to very similar results, but CRPS is simpler and more elegant (that is why it is the default measure).
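
For the curious, the CRPS can be approximated from samples of the predictive distribution with the well-known identity CRPS(F, y) = E|X - y| - 0.5 * E|X - X'|. A minimal base-R sketch, not the package’s implementation, with hypothetical draws:

# Sample-based CRPS approximation (illustrative only).
crps_sample <- function(y, samples) {
  mean(abs(samples - y)) - 0.5 * mean(abs(outer(samples, samples, "-")))
}
set.seed(42)
draws <- rnorm(1000, mean = 2000, sd = 50)   # draws from a hypothetical predictive distribution
crps_sample(y = 1980, samples = draws)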

example1 <- proteus(amzn_aapl_fb, target = "AMZN", future = 30, past = 60, t_embed = 20, activ = "linear", nodes = 32, distr = "normal", optim = "adam", loss_metric = "crps", rolling_blocks = T, n_blocks = 4, sequence_stride = T, verbose = T, dates = amzn_aapl_fb$Date, days_off = c("saturday", "sunday"))
  
  block 1 
  5 sequence for training
  4 sequence for testing
  setting max batch size to 4 
  epoch:  3    Train loss:  0.3175983    Test loss:  0.3284948 
  epoch:  6    Train loss:  0.3379224    Test loss:  0.3695798 
  epoch:  9    Train loss:  0.3481375    Test loss:  0.3160852 
  epoch:  12    Train loss:  0.3030055    Test loss:  0.4029772 
  early stop at epoch:  12    Train loss:  0.292647    Test loss:  0.4029772 
  
  block 2 
  4 sequence for training
  4 sequence for testing
  setting max batch size to 4 
  epoch:  3    Train loss:  0.3288874    Test loss:  0.3136515 
  epoch:  6    Train loss:  0.3316765    Test loss:  0.3180925 
  epoch:  9    Train loss:  0.308616    Test loss:  0.3132755 
  early stop at epoch:  10    Train loss:  0.2995778    Test loss:  0.3528911 
  
  block 3 
  4 sequence for training
  4 sequence for testing
  setting max batch size to 4 
  epoch:  3    Train loss:  0.3604256    Test loss:  0.3602472 
  epoch:  6    Train loss:  0.3751796    Test loss:  0.3895938 
  epoch:  9    Train loss:  0.3314459    Test loss:  0.376107 
  epoch:  12    Train loss:  0.2679351    Test loss:  0.3170523 
  early stop at epoch:  14    Train loss:  0.2885902    Test loss:  0.3565183 
  
  final training on all 4 
  17 sequence for training
  setting max batch size to 17 
  epoch:  3    Train loss:  0.3474164 
  epoch:  6    Train loss:  0.3335282 
  epoch:  9    Train loss:  0.3325986 
  epoch:  12    Train loss:  0.3279028 
  epoch:  15    Train loss:  0.3235377 
  epoch:  18    Train loss:  0.312183 
  epoch:  21    Train loss:  0.3309314 
  epoch:  24    Train loss:  0.322084 
  epoch:  27    Train loss:  0.3340648 
  epoch:  30    Train loss:  0.3435601 
  proteus: 25.19 sec elapsed
  
  variational model based on normal latent distribution with 104 tensors and 167850 parameters

The result is a list of different components, as you can see below.

names(example1)
  [1] "model_descr"     "prediction"      "plot"            "learning_error" 
  [5] "features_errors" "pred_stats"      "time_log"

The first variable is a simple high-level description of the model.

example1$model_descr
  [1] "variational model based on normal latent distribution with 104 tensors and 167850 parameters"

The prediction is a list including the predicted results for each time feature (quantiles, min, max, mean, mode, sd, skewness and kurtosis for each time point in the future sequence).

example1$prediction
  $AMZN
           dates      min      q10      q25      q50      q75      q90      max
  t1  2019-07-15 1947.048 1972.820 2004.121 2008.133 2010.628 2016.053 2059.355
  t2  2019-07-16 1887.705 1958.359 1991.765 2009.173 2020.239 2022.959 2062.272
  t3  2019-07-17 1924.641 1971.100 1985.011 2004.669 2023.003 2041.329 2062.229
  t4  2019-07-18 1933.760 1974.908 1984.659 2004.046 2019.431 2040.763 2042.731
  t5  2019-07-19 1949.507 1954.793 1978.063 1999.376 2009.094 2035.146 2049.761
  t6  2019-07-22 1909.000 1986.733 1996.835 2010.115 2018.833 2040.313 2050.868
  t7  2019-07-23 1980.636 1986.671 1996.289 2012.278 2029.511 2054.748 2056.156
  t8  2019-07-24 1973.103 1998.108 2000.788 2016.063 2035.966 2048.329 2058.482
  t9  2019-07-25 1940.995 1975.335 1995.218 2010.327 2022.236 2026.003 2049.924
  t10 2019-07-26 1947.921 1983.859 2004.503 2013.217 2022.302 2030.653 2064.791
  t11 2019-07-29 1973.390 1991.212 2003.827 2021.872 2026.144 2044.475 2064.307
  t12 2019-07-30 1954.717 2005.250 2008.093 2022.075 2038.974 2046.755 2080.567
  t13 2019-07-31 1928.817 1982.854 2004.587 2024.878 2035.095 2080.861 2095.865
  t14 2019-08-01 1902.480 1985.288 1992.836 2027.192 2048.379 2072.498 2103.087
  t15 2019-08-02 1905.155 1988.567 1994.409 2036.033 2056.467 2074.677 2112.431
  t16 2019-08-05 1952.391 1989.473 2000.900 2039.772 2056.011 2061.836 2105.646
  t17 2019-08-06 1954.487 1993.442 2000.255 2031.110 2058.969 2060.511 2108.397
  t18 2019-08-07 1936.893 1983.050 2001.540 2020.368 2055.646 2060.209 2072.619
  t19 2019-08-08 1922.195 1965.068 2000.309 2020.496 2052.643 2065.103 2092.243
  t20 2019-08-09 1936.475 1947.165 1996.691 2029.618 2065.609 2096.311 2101.932
  t21 2019-08-12 1936.613 1947.313 2000.668 2025.971 2068.809 2087.958 2103.855
  t22 2019-08-13 1827.491 1939.799 2001.028 2019.493 2075.201 2106.257 2117.230
  t23 2019-08-14 1791.617 1950.965 1996.500 2020.233 2073.151 2126.488 2138.192
  t24 2019-08-15 1861.185 1951.407 1994.614 2020.033 2074.116 2131.857 2138.105
  t25 2019-08-16 1833.235 1954.589 1997.995 2017.132 2084.279 2129.137 2138.173
  t26 2019-08-19 1891.939 1958.903 1992.291 2016.516 2081.429 2123.861 2147.246
  t27 2019-08-20 1903.661 1957.260 1999.249 2023.286 2086.663 2128.051 2183.203
  t28 2019-08-21 1842.617 1970.182 2002.738 2026.433 2072.271 2137.156 2212.329
  t29 2019-08-22 1836.043 1971.771 2011.977 2031.169 2080.700 2142.093 2196.207
  t30 2019-08-23 1860.897 1982.150 2008.734 2031.313 2079.767 2158.128 2205.688
          mean     sd     mode skewness kurtosis
  t1  2004.362 26.714 2008.113   -0.297    4.157
  t2  1998.606 43.109 2016.164   -1.359    4.947
  t3  2004.211 36.390 2005.515   -0.536    3.194
  t4  2002.017 31.069 1997.223   -0.596    3.069
  t5  1996.451 30.147 2002.708    0.073    2.396
  t6  2004.322 35.612 2008.761   -1.492    5.528
  t7  2015.245 26.009 2005.898    0.409    1.915
  t8  2019.470 24.719 2007.266   -0.127    2.269
  t9  2005.087 28.376 2018.482   -0.767    3.430
  t10 2011.309 28.201 2013.669   -0.484    3.970
  t11 2018.115 24.415 2023.394    0.000    2.807
  t12 2023.026 30.430 2013.375   -0.379    3.887
  t13 2022.443 43.981 2022.503   -0.286    3.305
  t14 2020.911 50.980 2033.521   -0.712    3.748
  t15 2025.433 53.315 2051.354   -0.625    3.440
  t16 2029.531 41.300 2050.114   -0.090    2.595
  t17 2029.626 41.389 2048.819    0.049    2.605
  t18 2022.285 40.249 2055.602   -0.616    2.602
  t19 2020.714 47.245 2034.569   -0.568    2.796
  t20 2026.855 54.242 2017.932   -0.278    2.125
  t21 2028.163 54.534 2031.947   -0.332    2.070
  t22 2020.438 80.875 2015.621   -1.024    3.775
  t23 2021.837 92.852 2015.313   -1.097    4.274
  t24 2027.279 79.829 2025.834   -0.378    2.771
  t25 2027.482 85.207 1996.713   -0.722    3.347
  t26 2031.072 73.730 2012.007   -0.079    2.392
  t27 2037.556 77.757 2015.593    0.217    2.519
  t28 2036.792 92.222 2018.024   -0.106    3.488
  t29 2039.185 90.387 2024.659   -0.437    3.698
  t30 2043.696 88.199 2023.857   -0.038    3.361
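
The prediction data frame can be post-processed like any other data frame. For instance, a quick look at how the 10%-90% interval widens along the horizon, using only the columns shown above (an illustrative snippet, not part of the package output):

pred <- example1$prediction$AMZN
interval_width <- pred$q90 - pred$q10   # width of the 80% interval at each time point
plot(as.Date(pred$dates), interval_width, type = "l",
     xlab = "date", ylab = "q90 - q10", main = "AMZN: predicted interval width")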

For each time feature included in the model, you get a plot of the median values with the chosen confidence interval (the ci default is 0.8).

example1$plot
  $AMZN
Sequence Plot of Historical and Predicted Close Prices

Adding any number of time features

It is possible to select any number of time features from the starting dataset. In the following example, we are going to select Amazon, Google and Facebook for a joint prediction.

example2 <- proteus(amzn_aapl_fb, target = c("AMZN", "GOOGL", "FB"), future = 30, past = 60, t_embed = 20, activ = "linear", nodes = 64, distr = "normal", optim = "adam", rolling_blocks = F, sequence_stride = T, verbose = F, dates = amzn_aapl_fb$Date, days_off = c("saturday", "sunday"))
  proteus: 35.37 sec elapsed
example2$plot
  $AMZN
Sequence Plot of Historical and Predicted Close Prices

  $GOOGL
Sequence Plot of Historical and Predicted Close Prices

  $FB
Sequence Plot of Historical and Predicted Close Prices

The learning error is the error of the joint variational model, including all the time features, for both the training and validation process (in this case based on incremental windows across the available blocks, with full-stride sequences). You get a wide range of common metrics (rmse, mae, mdae, mpe, mape, smape, rrse, rae). A good model should ideally score below 1 for rrse and rae (or at least close to 1); a bad model scores consistently above 1 (here “consistently” means by a multiple).

example2$learning_error
             rmse      mae   mdae  mpe mape    smape     rrse       rae
  train  6.688533 3.714400 1.9285  Inf  Inf 1.933433 1.004867 0.9897333
  test  11.637167 6.970333 3.6140 -Inf  Inf 1.950033 1.003467 0.9935000
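
As a reminder of what those last two metrics mean, rrse and rae compare the model’s error with the error of a naive mean predictor; the textbook definitions are sketched below (the package may aggregate them slightly differently across sequences and blocks):

# Relative root squared error and relative absolute error versus a naive mean benchmark.
rrse <- function(actual, predicted) {
  sqrt(sum((predicted - actual)^2) / sum((actual - mean(actual))^2))
}
rae <- function(actual, predicted) {
  sum(abs(predicted - actual)) / sum(abs(actual - mean(actual)))
}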

The features errors include the same standard metrics (rmse, mae, mdae, mpe, mape, smape, rrse, rae), this time measured separately for each time feature.

example2$features_errors
  $AMZN
            rmse      mae     mdae        mpe       mape      smape      rrse
  train 27.39053 20.46910 13.67537 0.01616667 0.05490000 0.05526667 0.3032667
  test  49.36047 38.58493 30.07370 0.01086667 0.03713333 0.03933333 0.3780667
           rae
  train 0.2765
  test  0.3219
  
  $GOOGL
            rmse      mae    mdae        mpe       mape      smape      rrse
  train 30.81497 20.71213 13.9444 0.02540000 0.03780000 0.03936667 0.2333333
  test  48.16187 34.44870 25.7233 0.02333333 0.04066667 0.04160000 0.8156333
              rae
  train 0.1937333
  test  0.6923000
  
  $FB
            rmse      mae     mdae        mpe       mape      smape      rrse
  train 4.944333 3.784233 2.987633 0.03143333 0.07430000 0.07306667 0.1965333
  test  6.735700 5.513033 5.104233 0.02416667 0.04043333 0.04133333 0.4512000
           rae
  train 0.1674
  test  0.4162

Shifting skin, enhancing precision and getting a better understanding of uncertainty

It is possible to select among different latent models (twelve different distributions, with two or three parameters), which affects the number of parameters to be estimated and the computation time. Other important variables that impact the computation time are rolling_blocks (when set to FALSE, the back-testing is performed with an incremental block scheme that accumulates sequences from previous blocks: you get a more precise result, but more time is required) and sequence_stride (when set to FALSE, each sequence is shifted by a single point in time instead of the whole past + future length, resulting in a much larger number of sequences for each block: as above, you get larger tensors and more precision, but more computation time is required). You can easily compare the difference between example1 and example3: in this case, example3 is overfitting a little.

example3 <- proteus(amzn_aapl_fb, target = "AMZN", future = 30, past = 60, t_embed = 20, activ = "linear", nodes = 32, distr = "genbeta", optim = "adam", loss_metric = "crps", rolling_blocks = F, sequence_stride = F, verbose = F, dates = amzn_aapl_fb$Date, days_off = c("saturday", "sunday"))
  proteus: 408.65 sec elapsed
example1$time_log
  [1] "25S"
example3$time_log
  [1] "6M 49S"

example1$model_descr
  [1] "variational model based on normal latent distribution with 104 tensors and 167850 parameters"
example3$model_descr
  [1] "variational model based on genbeta latent distribution with 156 tensors and 251775 parameters"

example1$features_errors
  $AMZN
           rmse      mae     mdae        mpe       mape      smape      rrse
  train 28.2626 21.04407 14.01620 0.01403333 0.04426667 0.04726667 0.3636667
  test  49.6028 38.85787 30.15487 0.01036667 0.03756667 0.03966667 0.3802333
              rae
  train 0.3098333
  test  0.3244333
example3$features_errors
  $AMZN
            rmse      mae     mdae          mpe       mape  smape      rrse
  train 25.45427 17.84980 12.00577 -0.004700000 0.04746667 0.0471 0.2604000
  test  61.71610 47.26393 37.57153  0.008266667 0.04896667 0.0490 0.4428667
              rae
  train 0.2247333
  test  0.4049667

For each predicted feature, six summary measures are computed over the prediction points:

  1. the ratio of IQR to range (averaged across the prediction points);
  2. the terminal IQR ratio (the ratio between the last IQR-to-range and the first one);
  3. the average Kullback-Leibler divergence (comparing each point in time with the previous one);
  4. the terminal KL divergence (the divergence calculated between the last and the first point in prediction);
  5. the upside probability (probability of getting a larger value compared to the previous point, averaged across all points);
  6. the terminal upside probability (probability of getting a larger value comparing the last and the first point in prediction).

While the IQR ratio could be helpful in understanding the aleatoric uncertainty affecting the predicted features, the KL divergence could be useful in estimating the epistemic uncertainty (just an idea). Notice how much the terminal KL divergence depends on the specific distribution when comparing example1 (normal) and example3 (the overfitting genbeta).
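
To make the first two measures concrete, here is a rough approximation from the quantile output of example1 (the package computes these statistics from the full sampled distributions, so the numbers will not match exactly):

pred <- example1$prediction$AMZN
iqr_to_range <- (pred$q75 - pred$q25) / (pred$max - pred$min)
mean(iqr_to_range)                           # approximates avg_iqr_to_range
iqr_to_range[nrow(pred)] / iqr_to_range[1]   # approximates terminal_iqr_ratio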

example1$pred_stats
                           AMZN
  avg_iqr_to_range        0.278
  terminal_iqr_ratio     10.916
  avg_kl_divergence       0.077
  terminal_kl_divergence 11.554
  avg_upside_prob         0.552
  terminal_upside_prob    0.752
example2$pred_stats
                           AMZN  GOOGL      FB
  avg_iqr_to_range        0.277  0.281   0.308
  terminal_iqr_ratio     10.514  6.428   5.111
  avg_kl_divergence       0.101  0.168   0.765
  terminal_kl_divergence 10.250 25.049 125.414
  avg_upside_prob         0.551  0.558   0.562
  terminal_upside_prob    0.755  0.744   0.802
example3$pred_stats
                          AMZN
  avg_iqr_to_range       0.092
  terminal_iqr_ratio     7.485
  avg_kl_divergence      0.005
  terminal_kl_divergence 2.539
  avg_upside_prob        0.535
  terminal_upside_prob   0.666

Last but not least, you can look into the future over long sequences (provided you have enough data points for the chosen number of blocks and sequence stride), and you can choose among the different activation functions implemented in Proteus (in example4 we tried “gelu”).

example4 <- proteus(amzn_aapl_fb, target = c("AMZN", "GOOGL", "FB"), future = 100, past = 300, t_embed = 20, activ = "gelu", nodes = 32, distr = "gev", optim = "rmsprop", rolling_blocks = T, sequence_stride = F, verbose = F, dates = amzn_aapl_fb$Date, days_off = c("saturday", "sunday"))
  proteus: 237.69 sec elapsed
example4$plot
  $AMZN
Sequence Plot of Historical and Predicted Close Prices

  $GOOGL
Sequence Plot of Historical and Predicted Close Prices

  $FB
Sequence Plot of Historical and Predicted Close Prices

Some references that inspired this work


  1. Seyed Mehran Kazemi, Rishab Goel, Sepehr Eghbali, Janahan Ramanan, Jaspreet Sahota, Sanjay Thakur, Stella Wu, Cathal Smyth, Pascal Poupart, Marcus Brubaker, Time2Vec: Learning a Vector Representation of Time, arXiv:1907.05321v1 [cs.LG] 11 Jul 2019

  2. Nikolaos Passalis, Anastasios Tefas, Juho Kanniainen, Moncef Gabbouj, Alexandros Iosifidis, Deep Adaptive Input Normalization for Time Series Forecasting, arXiv:1902.07892v2 [q-fin.CP] 22 Sep 2019