Intro to proteus

Author

Giancarlo Vercellino

Published

March 2, 2023

In Greek mythology, Proteus (/ˈproʊtiəs, -tjuːs/; Ancient Greek: Πρωτεύς, Prōteus) is an early prophetic sea-god or god of rivers and oceanic bodies of water, one of several deities whom Homer calls the “Old Man of the Sea” (halios gerôn). Some who ascribe a specific domain to Proteus call him the god of “elusive sea change”, which suggests the constantly changing nature of the sea or the liquid quality of water. He can foretell the future but, in a mytheme familiar to several cultures, will change his shape to avoid doing so; he answers only to those who are capable of capturing him. From this feature of Proteus comes the adjective protean, meaning “versatile”, “mutable”, or “capable of assuming many forms”. “Protean” has positive connotations of flexibility, versatility and adaptability. (Wikipedia)

Multiform like Proteus

Proteus is a Sequence-to-Sequence Variational Model designed for time-feature analysis, leveraging a wide range of distributions for improved accuracy. Unlike traditional methods that rely solely on the normal distribution, Proteus uses various latent models to better capture and predict complex processes. To achieve this, Proteus employs a neural network architecture that estimates the shape, location, and scale parameters of the chosen distribution. This approach transforms past sequence data into future sequence parameters, improving the model’s prediction capabilities. Proteus also assesses the accuracy of its predictions by estimating the error of measurement and calculating the confidence interval. By utilizing a range of distributions and advanced modeling techniques, Proteus provides a more accurate and comprehensive approach to time-feature analysis.

Here is a description of Proteus's architecture. A set of neural network models is created to estimate each shape/location/scale parameter of the chosen latent distribution: moving from left to right, the tensors of past sequences are transformed into the tensors of parameters for the future sequences. The latent model is then used to produce the future sequences, to estimate the measurement error, and to assess the confidence interval.

Figure 1 - The overall model architecture

The time features, structured in a dataframe in columnar order, are “horizontally” reframed into a 3D tensor. The 3D tensor passes through three main steps: a time2vec embedding (inspired by this reference1), adaptive normalization (inspired by this reference2), and a simple neural network with three linear transformations to reach the target size. At the end of these three steps you have the tensor of estimated parameters for the chosen latent model, and the sampling process may begin.

Figure 2 - How each parameter for the latent distribution is calculated
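
To make the embedding step concrete, here is a minimal sketch of the time2vec idea from the first reference: one linear component captures the trend and the remaining k - 1 components are periodic. The function time2vec and its random weights w and p below are illustrative stand-ins for the parameters that proteus learns during training, not the package's internal code.

time2vec <- function(tau, k, w = rnorm(k), p = rnorm(k)) {
  linear <- w[1] * tau + p[1]                                          # trend component
  periodic <- sin(outer(tau, w[-1]) + rep(p[-1], each = length(tau)))  # k - 1 periodic components
  cbind(linear, periodic)                                              # length(tau) x k embedding
}
emb <- time2vec(1:60, k = 20)   # 60 past steps, 20 embeddings as in the examples below
dim(emb)                        # 60 x 20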

Starting from scratch

In our introduction to Proteus, we are going to use the Close Price series for Amazon, Google and Facebook (from Yahoo Finance). As shown in amzn_aapl_fb, the time features are expected in ordered columns in a dataframe format.

library(proteus)
knitr::kable(head(amzn_aapl_fb, 10), align = "ccc", caption = "Examples of time features: close prices for Amazon, Google and Facebook")
Examples of time features: close prices for Amazon, Google and Facebook
Date AMZN GOOGL FB
3779 2012-05-18 213.85 300.5005 38.23
3780 2012-05-21 218.11 307.3624 34.03
3781 2012-05-22 215.33 300.7007 31.00
3782 2012-05-23 217.28 305.0350 32.00
3783 2012-05-24 215.24 302.1321 33.03
3784 2012-05-25 212.89 296.0611 31.91
3785 2012-05-29 214.75 297.4675 28.84
3786 2012-05-30 209.23 294.4094 28.19
3787 2012-05-31 212.91 290.7207 29.60
3788 2012-06-01 208.22 285.7758 27.72

In our first example, we use proteus to predict a single time feature, namely Amazon. Proteus reframes each time feature as a matrix of sequences with dimensions n_sequences x (past + future). The number of sequences is determined by the stride variable, which in this case is set to 1: each past + future sequence is shifted by a single position in time.
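
As a rough illustration of this reframing (a hedged sketch, not the package's internal code), the helper below builds the sequence matrix for a single feature:

reframe <- function(x, past, future, stride = 1) {
  len <- past + future
  starts <- seq(1, length(x) - len + 1, by = stride)   # one window start per stride step
  t(sapply(starts, function(s) x[s:(s + len - 1)]))    # n_sequences x (past + future)
}
seqs <- reframe(amzn_aapl_fb$AMZN, past = 60, future = 30, stride = 1)
dim(seqs)                                              # one row per shifted window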

We aim to predict sequences of 30 time periods into the future based on 60 time periods in the past, using 20 different temporal embeddings. These embeddings decompose the original time feature into 1 trend component and 19 different periodic components. We use a feed-forward neural network with 32 nodes and a variational model based on the normal distribution.

The optimization method is another important hyper-parameter that impacts the error performance of the model. We measure error using back-testing, with n_blocks set to 4 and rolling_blocks set to TRUE: the error is sampled on three different train/test measurements using a rolling window scheme. If rolling_blocks is set to FALSE, the error is sampled three times using an incremental window scheme.
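
The two schemes can be sketched with plain indices (a hedged reading of the verbose log below, not the package's actual implementation): with 2164 sequences and n_blocks = 4, each block holds 541 sequences and blocks 2 to 4 serve in turn as test sets, yielding the three measurements.

blocks <- split(seq_len(2164), rep(1:4, each = 541))   # 4 blocks of 541 sequences each
for (i in 1:3) {
  train_rolling     <- blocks[[i]]                     # rolling window: the single previous block
  train_incremental <- unlist(blocks[1:i])             # incremental window: all previous blocks
  test_set          <- blocks[[i + 1]]                 # the next block is always the test set
}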

Setting verbose to TRUE provides detailed information on the training and validation process, including the number of sequences, the max batch size, and the loss metric. Proteus offers several loss metrics, including the Evidence Lower Bound, the Continuous Ranked Probability Score, and a Custom Score (the latter is the absolute difference in cdf between prediction and actual, computed on the estimated latent parameters for the chosen distribution).
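
For the normal latent distribution the CRPS has a well-known closed form (Gneiting & Raftery, 2007); the function below is a minimal sketch of that formula, not proteus's internal implementation.

crps_normal <- function(y, mu, sigma) {
  z <- (y - mu) / sigma
  sigma * (z * (2 * pnorm(z) - 1) + 2 * dnorm(z) - 1 / sqrt(pi))
}
crps_normal(y = 2015, mu = 2011.8, sigma = 16)   # smaller is better; illustrative values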

example1 <- proteus(amzn_aapl_fb, target = "AMZN", future = 30, past = 60, t_embed = 20, activ = "linear", nodes = 32, distr = "normal", optim = "adam", loss_metric = "crps", rolling_blocks = T, n_blocks = 4, stride = 1, verbose = T, dates = "Date")
date and value gaps filled with kalman imputation

block 1 
541 sequence for training
541 sequence for testing
epoch:  3    Train loss:  0.3282238    Test loss:  0.3254851 
epoch:  6    Train loss:  0.3331435    Test loss:  0.3290568 
epoch:  9    Train loss:  0.3325217    Test loss:  0.3298825 
epoch:  12    Train loss:  0.3241959    Test loss:  0.3331832 
early stop at epoch:  12    Train loss:  0.1797601    Test loss:  0.295964 

block 2 
541 sequence for training
541 sequence for testing
epoch:  3    Train loss:  0.3333673    Test loss:  0.3357568 
epoch:  6    Train loss:  0.327236    Test loss:  0.3315799 
epoch:  9    Train loss:  0.3264316    Test loss:  0.3358205 
epoch:  12    Train loss:  0.3347558    Test loss:  0.328617 
epoch:  15    Train loss:  0.3318962    Test loss:  0.3311511 
epoch:  18    Train loss:  0.3319906    Test loss:  0.3360375 
epoch:  21    Train loss:  0.3293301    Test loss:  0.3344645 
epoch:  24    Train loss:  0.3340255    Test loss:  0.332931 
epoch:  27    Train loss:  0.3338883    Test loss:  0.3309637 
epoch:  30    Train loss:  0.3275637    Test loss:  0.325176 

block 3 
541 sequence for training
541 sequence for testing
epoch:  3    Train loss:  0.3300303    Test loss:  0.3347785 
epoch:  6    Train loss:  0.3347132    Test loss:  0.3320392 
epoch:  9    Train loss:  0.3323337    Test loss:  0.3323828 
epoch:  12    Train loss:  0.3296569    Test loss:  0.3317752 
early stop at epoch:  13    Train loss:  0.2757971    Test loss:  0.3399342 

final training on all 4 
2164 sequence for training
epoch:  3    Train loss:  0.3334524 
epoch:  6    Train loss:  0.3327346 
epoch:  9    Train loss:  0.3326941 
epoch:  12    Train loss:  0.3342977 
epoch:  15    Train loss:  0.3330329 
epoch:  18    Train loss:  0.3315944 
epoch:  21    Train loss:  0.3305171 
epoch:  24    Train loss:  0.3344898 
epoch:  27    Train loss:  0.3330322 
epoch:  30    Train loss:  0.3325613 

variational model based on normal latent distribution with 104 tensors and 167850 parameters
proteus: 969.31 sec elapsed

The result is a list of different components, as you can see below.

names(example1)
[1] "model_descr"     "prediction"      "plot"            "features_errors"
[5] "history"         "time_log"       

The first variable is a simple high-level description of the model.

example1$model_descr
[1] "variational model based on normal latent distribution with 104 tensors and 167850 parameters"

The prediction is a list including the predicted results for each time feature (quantiles, min, max, mean, mode, sd, skewness, kurtosis, iqr to range, above to below range, upside probability and divergence for each time point in the sequence). The IQR to range is the ratio of the interquartile range to the min-max range; the above to below range is the ratio of the range above the median to the range below it; the upside probability is the probability of growth compared to the former point in the time sequence; the divergence is the maximum distance of the cumulative normal curve of each point from that of the former point in the sequence.
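
As a hedged illustration of how such statistics can be computed (here from sampled draws rather than the fitted latent parameters, using the mean and sd of the first two prediction rows below):

prev <- rnorm(1000, mean = 2011.8, sd = 16.0)             # illustrative draws for time t - 1
curr <- rnorm(1000, mean = 2012.7, sd = 22.6)             # illustrative draws for time t
IQR(curr) / (max(curr) - min(curr))                       # iqr_to_range
(max(curr) - median(curr)) / (median(curr) - min(curr))   # above_to_below_range
mean(curr > prev)                                         # upside_prob: growth vs the former point
ks.test(curr, prev)$statistic                             # divergence: max distance between the cdfs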

knitr::kable(example1$prediction$AMZN[1:10,], align = "ccc", caption = "Examples of time-feature prediction (first ten rows)")
Examples of time-feature prediction (first ten rows)
min 10% 25% 50% 75% 90% max mean sd mode kurtosis skewness iqr_to_range above_to_below_range upside_prob divergence
2019-07-12 1871.676 2000.441 2008.193 2011.581 2015.973 2023.987 2128.774 2011.812 15.9701 2011.240 18.8059 -0.4415 0.0303 0.8377 0.5625 0.3590
2019-07-13 1846.539 1993.865 2006.659 2012.422 2019.670 2031.785 2146.090 2012.695 22.6081 2012.371 12.7274 -0.3956 0.0434 0.8058 0.5632 0.0536
2019-07-14 1810.182 1988.702 2005.666 2013.443 2023.462 2039.838 2146.244 2013.838 27.7922 2012.857 10.1913 -0.5391 0.0530 0.6534 0.5829 0.0271
2019-07-15 1767.992 1986.050 2004.821 2014.537 2027.263 2047.659 2157.775 2014.943 32.5280 2013.479 9.5906 -0.7504 0.0576 0.5810 0.5816 0.0197
2019-07-16 1759.234 1981.934 2004.203 2015.539 2030.340 2053.713 2186.508 2015.691 36.5173 2013.907 9.0279 -0.7370 0.0612 0.6671 0.5508 0.0203
2019-07-17 1781.287 1981.259 2003.228 2015.434 2032.437 2056.735 2201.561 2016.197 40.0886 2013.796 8.6095 -0.6986 0.0695 0.7949 0.5311 0.0228
2019-07-18 1759.668 1979.082 2002.642 2016.389 2035.440 2060.172 2202.072 2016.932 43.2282 2013.050 8.9574 -0.7630 0.0741 0.7233 0.5514 0.0216
2019-07-19 1729.402 1977.678 2001.741 2017.217 2037.231 2063.003 2235.560 2017.841 46.3900 2014.703 9.4186 -0.7207 0.0701 0.7586 0.5656 0.0136
2019-07-20 1724.631 1977.317 2001.581 2017.070 2039.155 2067.486 2251.566 2018.706 49.2971 2015.534 9.4522 -0.7050 0.0713 0.8019 0.5576 0.0148
2019-07-21 1718.067 1975.772 2001.306 2017.695 2041.739 2070.241 2280.978 2019.581 51.8477 2015.648 9.2776 -0.7569 0.0718 0.8787 0.5607 0.0092

For each time feature included in the model, you get a plot of the median values with the chosen confidence interval (the ci default is 0.8).

example1$plot
$AMZN

Adding any number of time features

It is possible to select any number of time features from the starting dataset. In the following example, we select Amazon, Google and Facebook for a joint prediction.

example2 <- proteus(amzn_aapl_fb, target = c("AMZN", "GOOGL", "FB"), future = 30, past = 60, t_embed = 20, activ = "linear", nodes = 64, distr = "normal", optim = "adam", rolling_blocks = F, stride = 10, verbose = F, dates = "Date")
proteus: 249.98 sec elapsed
example2$plot
$AMZN


$GOOGL


$FB

The history plot reports the average selected loss across the validation blocks (in this case, based on an incremental window scheme with 10-strided sequences).

example2$history

The features_errors component includes the standard error metrics (me, mae, mse, rmsse, mpe, mape, rmae, rrmse, rame, mase, smse, sce) for each time feature.

example2$features_errors
$AMZN
            me      mae      mse    rmsse   mpe       mape      rmae     rrmse
train  5.62200 15.79933  525.156 11.42867 0.011 0.04200000 0.0850000 0.1083333
test  15.64367 40.34967 4017.532 25.79433 0.016 0.04133333 0.1613333 0.1816667
            rame     mase     smse      sce
train 0.02233333 4.191000 131.0683 4826.621
test  0.10733333 9.393667 813.8523 6407.464

$GOOGL
         me      mae      mse    rmsse        mpe       mape      rmae
train 6.337 16.91867  584.695 12.26367 0.01133333 0.03300000 0.0700000
test  5.176 26.83400 1460.180 18.52433 0.00500000 0.03066667 0.2703333
           rrmse       rame     mase     smse      sce
train 0.08933333 0.02633333 4.526333 150.6217 5185.446
test  0.32466667 0.13833333 6.868333 361.1113 2257.251

$FB
            me      mae      mse rmsse        mpe       mape       rmae
train 1.573000 2.759000 15.71833 4.739 0.02600000 0.05533333 0.05533333
test  1.597667 4.945333 54.01133 7.686 0.01133333 0.03666667 0.28300000
           rrmse       rame     mase     smse      sce
train 0.07166667 0.02966667 4.020667 22.57467 7611.148
test  0.31300000 0.22766667 6.735333 68.88967 3741.401

Shifting skin, enhancing precision and getting a better understanding of uncertainty

Proteus offers a selection of twelve different latent models, each with two or three parameters, that can be used to improve the accuracy of predictions. However, selecting a specific model may increase the number of parameters to be estimated and the computation time required.

Other important variables that impact computation time include rolling_blocks and stride. When rolling_blocks is set to FALSE, back-testing is performed using an incremental block scheme that accumulates the sequences from previous blocks, resulting in more accurate results but longer computation time. The stride parameter operates as a thinning factor, reducing tensor size and computation time. To illustrate, consider the comparison between example1 and example3: by using a larger stride, we are able to reduce both computation time and overfitting. Therefore, selecting the appropriate latent model and adjusting key parameters can significantly impact the performance of time-feature analysis.
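
The thinning effect of stride is easy to quantify: working back from the 2164 training sequences reported in example1's log, the reframed series offers 2253 window positions, so a stride of 20 keeps roughly one position in twenty (a back-of-the-envelope check, not package code).

n_seq <- function(n_points, past, future, stride) {
  length(seq(1, n_points - (past + future) + 1, by = stride))
}
n_seq(2253, past = 60, future = 30, stride = 1)    # 2164 sequences, as in example1
n_seq(2253, past = 60, future = 30, stride = 20)   # 109 sequences for example3's setting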

example3 <- proteus(amzn_aapl_fb, target = "AMZN", future = 30, past = 60, t_embed = 20, activ = "linear", nodes = 32, distr = "genbeta", optim = "adam", loss_metric = "crps", rolling_blocks = F, stride = 20, verbose = F, dates = "Date")
proteus: 173.34 sec elapsed
example1$time_log
[1] "16M 9S"
example3$time_log
[1] "2M 53S"
example1$model_descr
[1] "variational model based on normal latent distribution with 104 tensors and 167850 parameters"
example3$model_descr
[1] "variational model based on genbeta latent distribution with 156 tensors and 251775 parameters"
example1$features_errors
$AMZN
             me      mae       mse    rmsse        mpe       mape      rmae
train  8.101667 18.41333  711.8557 12.84200 0.01400000 0.03833333 0.1306667
test  13.773333 40.40367 3969.9887 25.80133 0.01433333 0.04133333 0.1623333
          rrmse  rame     mase     smse      sce
train 0.1586667 0.056 4.691333 170.1490 31275.20
test  0.1843333 0.101 9.390667 806.6403 56542.38
example3$features_errors
$AMZN
             me    mae       mse    rmsse          mpe       mape       rmae
train -3.061000 15.395  469.3167 11.03867 -0.014000000 0.04266667 0.08566667
test   5.324667 35.400 2957.0780 22.29100  0.002333333 0.03733333 0.14166667
          rrmse       rame     mase     smse        sce
train 0.1096667 0.02866667 4.146000 122.1493 -1177.1557
test  0.1560000 0.02266667 8.274333 601.1843   975.2163

With the release of version 1.1, we implemented a dedicated function for hyper-parameter tuning using random search. To see if we can improve our results with a limited number of models, let’s begin a random search with a sample size of 3. This will not only provide us with potential improvements but also give us valuable insights on how to further refine the tuning process.

example4 <- proteus_random_search(3, amzn_aapl_fb, target = "AMZN", future = 30,  loss_metric = "crps", rolling_blocks = F, verbose = F, dates = "Date")
proteus: 409.89 sec elapsed
proteus: 377.65 sec elapsed
proteus: 217.81 sec elapsed
random search: 1005.41 sec elapsed

If we take a look inside the random_search table, we can get an idea of the best hyper-parameters.

knitr::kable(example4$random_search, align = "ccc", caption = "Examples of random search into the hyper-parameter space of proteus")
Examples of random search into the hyper-parameter space of proteus
model past t_embed activ nodes distr optim lr stride avg_me avg_mae avg_mse avg_rmsse avg_mpe avg_mape avg_rmae avg_rrmse avg_rame avg_mase avg_smse avg_sce
3 30 5 linear 25 chisq sgd 0.088 9 9.030500e+00 2.780270e+01 2.281405e+03 1.830150e+01 0.0103 0.0403 0.1197 0.1413 0.0543 6.681300e+00 4.704435e+02 5.084861e+03
2 34 11 softmax 304 gpd rprop 0.088 9 -2.055587e+02 2.069192e+02 3.465990e+05 2.165030e+02 -0.4363 0.4370 1.1597 1.6680 1.5997 5.506780e+01 6.896170e+04 -1.581803e+05
1 46 26 mish 129 exp asgd 0.042 3 -5.547500e+07 5.547500e+07 1.769282e+18 3.942615e+08 -136643.7175 136643.7175 225947.5537 2951937.4700 271004.4677 1.269292e+07 3.371283e+17 -1.400557e+11
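
Behind a table like the one above, random search simply draws candidate configurations from the hyper-parameter space and back-tests each one. Here is a minimal sketch of such a draw (the ranges below are illustrative assumptions, not the actual sampling bounds used by proteus_random_search):

set.seed(42)
n <- 3
data.frame(
  past    = sample(30:90, n),                                             # past window length
  t_embed = sample(5:30, n),                                              # number of temporal embeddings
  activ   = sample(c("linear", "softmax", "mish"), n, replace = TRUE),    # activation function
  nodes   = sample(25:512, n),                                            # network width
  distr   = sample(c("normal", "genbeta", "chisq", "gpd", "exp"), n, replace = TRUE),
  optim   = sample(c("adam", "sgd", "rprop", "asgd"), n, replace = TRUE),
  lr      = round(runif(n, 0.001, 0.1), 3),                               # learning rate
  stride  = sample(1:20, n)                                               # thinning factor
)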

Footnotes

  1. Seyed Mehran Kazemi, Rishab Goel, Sepehr Eghbali, Janahan Ramanan, Jaspreet Sahota, Sanjay Thakur, Stella Wu, Cathal Smyth, Pascal Poupart, Marcus Brubaker, Time2Vec: Learning a Vector Representation of Time, arXiv:1907.05321v1 [cs.LG] 11 Jul 2019↩︎

  2. Nikolaos Passalis, Anastasios Tefas, Juho Kanniainen, Moncef Gabbouj, and Alexandros Iosifidis, Deep Adaptive Input Normalization for Time Series Forecasting, arXiv:1902.07892v2 [q-fin.CP] 22 Sep 2019↩︎