Intro to codez

Author

Giancarlo Vercellino

“Geometry is not true, it is convenient” (Henri Poincaré)

“Kernel methods owe their name to the use of kernel functions, which enable them to operate in a high-dimensional, implicit feature space without ever computing the coordinates of the data in that space, but rather by simply computing the inner products between the images of all pairs of data in the feature space. This operation is often computationally cheaper than the explicit computation of the coordinates. This approach is called the ‘kernel trick’” (Wikipedia)

“Geometry in the humming of the strings, music in the spacing of the spheres” (Pythagoras)

Moving prediction to the right space

Many academics1 have recently observed that, for the most part, deep neural networks operate as a kind of kernel machine, projecting data into different geometric spaces to solve their tasks (not exactly the “kernel trick”, because here the projection is usually computed in a lower-dimensional space and the coordinates are actually calculated). In codez we use the same “geometric reasoning” to simplify the solution of Seq2Seq models: codez uses two deep neural networks to predict sequences in time. The first network is an autoencoder that encodes the tensor of sequences into a lower-dimensional subspace; the second is a forward neural net that predicts the next sequence directly in the latent space, and the prediction is then decoded back into the original space. A brief explanation follows:

  • Some basic transformations are managed directly in the background. Differencing and integration are handled automatically by codez, which uses the maximal p-value of a recursive F-test to de-trend each time feature: this makes it easy to determine the different dynamic characteristics of each time feature (random walk, trend, exponential) and is somewhat simpler and more practical than formal approaches such as the Augmented Dickey-Fuller or Ljung-Box tests. If your time features have a limited number of missing values, codez automatically imputes them using the Kalman filter method2. If you prefer to project the smoothed version into the future, you can set smoother = TRUE to use the loess3 function.

  • The test errors are cross-validated through an expanding validation scheme controlled by n_windows: the default value is 10, meaning that the time features are divided into 10 + 1 segments, guaranteeing at least ten validation sets for measuring the error on unforeseen data. For each point in the prediction sequence, the empirical error distribution is sampled a thousand times to calculate quantiles, mean, mode, standard deviation, skewness, kurtosis and other less common measures, which are provided for each time step (more details in the following paragraphs). A minimal sketch of the preprocessing and splitting described in these two points follows this list.
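The sketch below, in plain R, illustrates what the two points above describe: Kalman imputation via imputeTS, a recursive F-test on a linear trend to decide how many times a feature should be differenced, and expanding validation windows. The helpers prepare_feature and expanding_windows are hypothetical stand-ins reflecting one plausible reading of the description, not the internal codez implementation.

library(imputeTS)

prepare_feature <- function(x, p_threshold = 0.05, max_diff = 2) {
  x <- na_kalman(x)                          # impute limited missing values (Kalman smoothing)
  n_diff <- 0
  while (n_diff < max_diff) {
    fit <- lm(x ~ seq_along(x))              # linear trend fit on the (possibly differenced) series
    p <- anova(fit)[["Pr(>F)"]][1]           # F-test p-value for the trend term
    if (is.na(p) || p >= p_threshold) break  # no significant trend left: stop differencing
    x <- diff(x)
    n_diff <- n_diff + 1
  }
  list(series = x, n_diff = n_diff)          # n_diff differences are re-integrated after prediction
}

expanding_windows <- function(n, n_windows = 10) {
  # split 1:n into n_windows + 1 segments: the training slice expands,
  # the segment right after it is used for validation
  cuts <- floor(seq(0, n, length.out = n_windows + 2))
  lapply(seq_len(n_windows), function(i)
    list(train = 1:cuts[i + 1], validation = (cuts[i + 1] + 1):cuts[i + 2]))
}

# e.g. expanding_windows(nrow(amzn_aapl_fb), n_windows = 5) yields five train/validation splits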

Figure 1: The process flow of codez
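To make the process flow of Figure 1 concrete, here is a minimal sketch of the encode-predict-decode idea using the keras R package. The layer sizes, activations and the purely dense layout are arbitrary toy choices for illustration, not the networks actually sampled by codez (the history tables below show the hyper-parameters it explores).

library(keras)

win_len <- 32   # length of each windowed sequence (the seq_len of codez)
latent  <- 8    # size of the latent space

# shared decoder layers, so the decoder can be reused on the forward net output
dec_dense <- layer_dense(units = 64, activation = "selu")
dec_out   <- layer_dense(units = win_len)

inp   <- layer_input(shape = win_len)
code  <- inp %>% layer_dense(units = 64, activation = "selu") %>% layer_dense(units = latent)
recon <- code %>% dec_dense() %>% dec_out()

autoencoder <- keras_model(inp, recon)   # trained to reconstruct the windows
encoder     <- keras_model(inp, code)    # maps a window to its latent coordinates

latent_in <- layer_input(shape = latent)
decoder   <- keras_model(latent_in, latent_in %>% dec_dense() %>% dec_out())

# forward net: from the latent code of a window, predict the latent code of the next window
forward_net <- keras_model_sequential() %>%
  layer_dense(units = 32, activation = "relu", input_shape = latent) %>%
  layer_dense(units = latent)

# Training sketch (X = matrix of past windows, Y = matrix of the following windows):
# autoencoder %>% compile(optimizer = "adam", loss = "mse"); autoencoder %>% fit(X, X, epochs = 30)
# Z_x <- predict(encoder, X); Z_y <- predict(encoder, Y)
# forward_net %>% compile(optimizer = "adam", loss = "mse"); forward_net %>% fit(Z_x, Z_y, epochs = 30)
# pred <- predict(decoder, predict(forward_net, Z_x))   # decoded back into the original space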

A showcase with stock price prediction

The dataset amzn_aapl_fb included with codez is a recent take on some Big Tech stock prices (source: Yahoo Finance). The data is expected in a data frame format, where each column represents a different time series (the date information is not mandatory and can be provided separately).

library(codez)
#| echo: false
#| include: false

knitr::kable(head(amzn_aapl_fb, 10), align = "ccc", caption = "Examples of stock prices from Tech Companies")
Examples of stock prices from Tech Companies
Date AMZN GOOGL FB
3779 2012-05-18 213.85 300.5005 38.23
3780 2012-05-21 218.11 307.3624 34.03
3781 2012-05-22 215.33 300.7007 31.00
3782 2012-05-23 217.28 305.0350 32.00
3783 2012-05-24 215.24 302.1321 33.03
3784 2012-05-25 212.89 296.0611 31.91
3785 2012-05-29 214.75 297.4675 28.84
3786 2012-05-30 209.23 294.4094 28.19
3787 2012-05-31 212.91 290.7207 29.60
3788 2012-06-01 208.22 285.7758 27.72

In the first example, we predict the close price for Amazon, setting the sequence length seq_len to 75 and using a validation scheme of 5 windows for error measurement. Let’s try to sample 3 random models using the standard parameters and the presets.

example1 <- codez(amzn_aapl_fb[, 2, drop = FALSE], seq_len = 75, n_windows = 5, dates = as.Date(amzn_aapl_fb$Date), n_samp = 3)
time: 219.22 sec elapsed

The result is a list of different components, as you can see below.

names(example1)
[1] "history"    "best_model" "time_log"  
names(example1$best_model)
[1] "predictions" "errors"      "plot"       
knitr::kable(example1$history, align = "ccc", caption = "History with 3 random samples")
History with 3 random samples
seq_len latent autoencoder_layers autoencoder_activations autoencoder_optimizer forward_net_layers forward_net_activations forward_net_reg_L1 forward_net_reg_L2 forward_net_dropout forward_net_optimizer me mae mse rmsse mpe mape rmae rrmse rame mase smse sce gmrae
3 75 26 363, 608 selu, elu rmsprop 918, 413, 321, 655 swish , tanh , linear, elu 869.870, 697.698, 649.650, 39.040 776.777, 490.491, 546.547, 732.733 0.8047, 0.8311, 0.3595, 0.5973 adamax 14.7248 63.0482 8465.934 32.6550 -0.0096 0.0836 0.9368 0.9408 0.7212 12.1302 1499.677 94.2746 0.9870
2 75 66 135, 310, 31, 846 relu , leaky_relu, gelu , gelu sgd 488, 305, 31, 165, 306 selu , relu , swish , relu , leaky_relu 371.372, 2.003, 373.374, 404.405, 769.770 108.109, 328.329, 668.669, 75.076, 979.980 0.8904, 0.6037, 0.6133, 0.1649, 0.8359 rmsprop 14.7252 63.0478 8466.597 32.6614 -0.0098 0.0836 0.9368 0.9412 0.7212 12.1300 1499.885 94.2800 0.9756
1 75 50 641, 56 elu , selu adamax 937, 867, 879, 266, 321 selu , selu , sigmoid, relu , elu 921.922, 625.626, 261.262, 389.390, 129.130 516.517, 32.033, 944.945, 102.103, 227.228 0.1120, 0.3851, 0.6854, 0.2978, 0.3595 adam 14.7262 63.1206 8472.620 32.7196 -0.0098 0.0838 0.9402 0.9448 0.7212 12.1528 1501.775 94.3036 0.9900

history includes the hyper-parameters and error metrics for the models sampled during the random search (me, mae, mse, rmsse, mpe, mape, rmae, rrmse, rame, mase, smse, sce, gmrae4). best_model collects the best combination of hyper-parameters, together with the error metrics for each time feature (errors), the prediction intervals (predictions) and the visualizations (plot).

predictions is a list including the predicted results for each time feature (quantiles, min, max, mean/proportion, mode, sd, skewness, kurtosis, IQR to range, above to below range, upside/upgrade probability and divergence for each time point in the sequence). The IQR to range is the ratio of the interquartile range to the min-max range; the above to below range is the ratio of the range above the median to the range below it; the upside/upgrade probability is the probability of growth (or of scaling up, in the case of a categorical variable) compared to the former point in the time sequence; the divergence is the maximum distance between the cumulative normal curve of each point and that of the former point in the sequence.

#| echo: false

knitr::kable(head(example1$best_model$predictions$AMZN, 10), align = "ccc", caption = "Examples of prediction for Amazon Close Prices (first 10 points)")
Examples of prediction for Amazon Close Prices (first 10 points)
min 10% 25% 50% 75% 90% max mean sd mode kurtosis skewness iqr_to_range above_to_below_range upside_prob divergence pred_scores
2019-07-12 1907.04 2003.70 2009.79 2011.28 2020.07 2021.78 2030.10 2005.088 31.5678 2011.951 8.4411 -2.6260 0.0835 0.1805 0.626 0.3483 0.4376
2019-07-13 1937.83 2005.79 2007.57 2017.04 2027.19 2029.44 2066.30 2015.195 27.2575 2011.912 6.0219 -1.1534 0.1527 0.6219 0.567 0.1940 0.2880
2019-07-14 1867.77 1996.24 2003.81 2014.60 2026.24 2059.76 2059.76 2008.402 43.1106 2016.677 8.6311 -2.3682 0.1168 0.3076 0.460 0.1860 0.3300
2019-07-16 1879.67 1980.93 2004.81 2011.69 2027.68 2033.05 2050.51 2005.165 43.1076 2017.546 6.6956 -2.1076 0.1339 0.2940 0.482 0.2580 0.3172
2019-07-17 1881.88 1968.33 1979.74 2014.26 2027.27 2028.61 2050.57 2001.786 43.0450 2021.639 5.5307 -1.7584 0.2818 0.2743 0.479 0.1790 0.2900
2019-07-19 1902.18 1981.83 1992.01 2006.96 2032.72 2042.98 2042.98 2003.801 36.3766 1998.559 5.4061 -1.5846 0.2891 0.3438 0.481 0.2830 0.4088
2019-07-20 1897.02 1969.38 1979.96 2011.76 2039.15 2044.07 2063.90 2004.300 43.5385 2024.587 4.1129 -1.1916 0.3547 0.4544 0.497 0.1950 0.3904
2019-07-22 1830.55 1957.57 1995.84 2010.45 2035.73 2044.24 2058.67 2002.828 55.1986 2020.108 7.2123 -2.1403 0.1749 0.2680 0.536 0.1010 0.4116
2019-07-23 1759.55 1959.04 1991.52 2010.14 2040.86 2042.39 2063.21 1991.567 80.0152 2010.310 6.8237 -2.2013 0.1625 0.2118 0.452 0.2040 0.2692
2019-07-25 1790.12 1947.06 1979.22 2010.89 2039.88 2040.30 2061.59 1993.765 72.1616 2027.311 6.0451 -1.9643 0.2235 0.2297 0.521 0.1660 0.6280
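As a minimal illustration of the less common measures defined above, here is how they could be computed from vectors of sampled values for two consecutive points in the predicted sequence. This is a plausible reading of the definitions, not necessarily the internal codez code; point_stats is a hypothetical helper.

point_stats <- function(samples, prev_samples) {
  qs   <- quantile(samples, c(0.25, 0.5, 0.75))
  rng  <- range(samples)
  grid <- seq(min(samples, prev_samples), max(samples, prev_samples), length.out = 200)
  list(
    iqr_to_range         = (qs[3] - qs[1]) / (rng[2] - rng[1]),  # interquartile range over min-max range
    above_to_below_range = (rng[2] - qs[2]) / (qs[2] - rng[1]),  # range above the median over range below it
    upside_prob          = mean(samples > prev_samples),         # probability of growth vs the former point
    divergence           = max(abs(pnorm(grid, mean(samples), sd(samples)) -
                                   pnorm(grid, mean(prev_samples), sd(prev_samples))))  # max distance between the two normal CDFs
  )
}

# e.g. with a thousand sampled values for two consecutive time points:
# set.seed(42); point_stats(rnorm(1000, 102, 6), rnorm(1000, 100, 5))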

For each time feature included in the model, you get a plot of the median with the chosen confidence interval (the ci default is 0.8). As in other packages5, we provide different stats to give a better hint of the different dynamics related to aleatoric and epistemic uncertainty.

#| echo: false
#| include: true
#| fig-dpi: 300

example1$best_model$plot
$AMZN

Random explorations of the hyper-parameter space

You can use codez in different ways. If you have a clear idea about the right hyper-parameters but want to understand the best sequence length to explore, you can enter the desired parameters and leave seq_len set to NULL:

example2 <- codez(amzn_aapl_fb[, 2:4], n_windows = 5, dates = as.Date(amzn_aapl_fb$Date), n_samp = 5, seq_len = NULL, latent = 3, autoencoder_layers_n = 3, autoencoder_activ = "gelu", autoencoder_layers_size = 64, autoencoder_optimizer = "nadam", forward_net_layers_n = 2, forward_net_activ = "relu", forward_net_layers_size = 32, forward_net_drop = 0.5, forward_net_reg_L1 = 0, forward_net_reg_L2 = 10, forward_net_optimizer = "adam")
time: 278.5 sec elapsed

As we can see from history, with these hyper-parameters the best prediction horizon is 20 points.

#| echo: false

knitr::kable(example2$history, align = "ccc", caption = "The best horizon of prediction for a given set of hyper-parameters")
The best horizon of prediction for a given set of hyper-parameters
seq_len latent autoencoder_layers autoencoder_activations autoencoder_optimizer forward_net_layers forward_net_activations forward_net_reg_L1 forward_net_reg_L2 forward_net_dropout forward_net_optimizer me mae mse rmsse mpe mape rmae rrmse rame mase smse sce gmrae
4 20 3 64, 64, 64 gelu, gelu, gelu nadam 32, 32 relu, relu 0, 0 0, 0 0.5, 0.5 adam -3.8131 17.2822 710.0303 10.1980 -0.0163 0.0394 1.0500 1.0417 1.3007 4.8694 153.2963 -26.7311 1.0938
3 27 3 64, 64, 64 gelu, gelu, gelu nadam 32, 32 relu, relu 0, 0 0, 0 0.5, 0.5 adam -5.2803 19.4685 964.5117 11.5791 -0.0131 0.0444 1.0163 1.0029 0.9586 5.4499 202.7442 -33.7634 1.0865
1 51 3 64, 64, 64 gelu, gelu, gelu nadam 32, 32 relu, relu 0, 0 0, 0 0.5, 0.5 adam -5.8973 31.2829 2296.6398 18.1277 -0.0081 0.0694 1.1315 1.1163 1.7449 8.5629 482.1925 -32.5311 1.2239
5 51 3 64, 64, 64 gelu, gelu, gelu nadam 32, 32 relu, relu 0, 0 0, 0 0.5, 0.5 adam -5.8973 31.2829 2296.6398 18.1277 -0.0081 0.0694 1.1315 1.1163 1.7449 8.5629 482.1925 -32.5311 1.2239
2 67 3 64, 64, 64 gelu, gelu, gelu nadam 32, 32 relu, relu 0, 0 0, 0 0.5, 0.5 adam 3.9278 44.0684 4497.5283 24.7915 -0.0002 0.0875 1.1874 1.1431 6.0500 11.4571 872.4610 66.2902 1.3136

As anticipated in the previous example, you can set a range of hyper-parameters for the random search instead of using the presets. Let’s see if we can improve the accuracy of our predictions with a horizon of 20 points and a latent space of size 3. This time we collect 10 samples.

example3 <- codez(amzn_aapl_fb[, 2:4], n_windows = 5, dates = as.Date(amzn_aapl_fb$Date), n_samp = 10, seq_len = 20, latent = 3, autoencoder_layers_n = 2:3, autoencoder_activ = c("gelu", "relu", "leaky_relu"), autoencoder_layers_size = 64:128, autoencoder_optimizer = c("nadam", "sgd"), forward_net_layers_n = 1, forward_net_activ = c("swish", "selu", "relu"), forward_net_layers_size = 32:64, forward_net_drop = c(0.3, 0.5, 0.7), forward_net_reg_L1 = 0:100, forward_net_reg_L2 = 0:100, forward_net_optimizer = c("adam", "adagrad"))
time: 548.78 sec elapsed

Peeking at the history table, we can see that improving on the previous result is not so easy with only 10 samples: you would have to run a larger random search.

knitr::kable(example3$history, align = "ccc", caption = "Random search over hyper-parameter ranges with a fixed horizon of 20 points")
Random search over hyper-parameter ranges with a fixed horizon of 20 points
seq_len latent autoencoder_layers autoencoder_activations autoencoder_optimizer forward_net_layers forward_net_activations forward_net_reg_L1 forward_net_reg_L2 forward_net_dropout forward_net_optimizer me mae mse rmsse mpe mape rmae rrmse rame mase smse sce gmrae
6 20 3 66, 121, 105 gelu , relu , leaky_relu sgd 59 swish 31 28 0.7 adam -3.3819 17.5356 722.7776 10.3271 -0.0150 0.0398 1.0709 1.0539 1.3068 4.9968 156.0633 -22.2168 1.1047
3 20 3 89, 66 relu, relu nadam 40 swish 52 42 0.3 adam -3.7153 17.3611 713.8214 10.2400 -0.0164 0.0399 1.0651 1.0497 1.2045 4.9490 154.0029 -26.1311 1.1235
8 20 3 78, 85, 121 gelu, gelu, gelu sgd 59 relu 59 37 0.5 adam -3.3775 17.5427 723.3525 10.3307 -0.0150 0.0398 1.0717 1.0547 1.3144 4.9997 156.1595 -22.2491 1.1116
9 20 3 71, 99 leaky_relu, gelu sgd 33 selu 28 0 0.3 adagrad -3.4874 17.4959 720.8613 10.3108 -0.0150 0.0399 1.0716 1.0565 1.3853 4.9796 155.6377 -23.2523 1.0992
4 20 3 104, 90 relu, relu nadam 60 selu 53 57 0.7 adam -3.6871 17.4627 712.7081 10.3446 -0.0167 0.0403 1.0825 1.0676 1.3362 5.0100 154.8929 -25.7569 1.1239
2 20 3 100, 83 gelu, gelu nadam 51 selu 12 84 0.3 adagrad -4.8590 16.7736 644.8797 10.1287 -0.0201 0.0407 1.1071 1.0773 1.0993 5.0662 145.9539 -40.4989 1.2145
5 20 3 99, 68, 97 relu , leaky_relu, relu nadam 47 relu 82 71 0.5 adagrad -2.9513 17.9651 743.1797 10.5637 -0.0123 0.0415 1.1159 1.1129 1.3762 5.1852 159.7197 -17.5879 1.1353
10 20 3 81, 67, 113 relu, gelu, gelu nadam 49 selu 80 12 0.7 adagrad -3.7495 17.3979 719.3010 10.2531 -0.0183 0.0399 1.1045 1.0969 2.2890 4.8835 152.5588 -31.3684 1.1474
7 20 3 87, 93, 106 relu, relu, relu nadam 36 swish 79 54 0.5 adagrad -5.2575 17.5125 720.5752 10.5403 -0.0220 0.0449 1.1811 1.1371 1.1357 5.4475 166.9612 -41.3945 1.3471
1 20 3 110, 87 leaky_relu, leaky_relu nadam 43 swish 90 72 0.5 adagrad -2.0398 19.5391 850.9036 11.8973 -0.0104 0.0460 1.2091 1.2137 1.8679 5.9249 206.2210 -6.1713 1.2410

Footnotes

  1. On arXiv you can find some interesting papers on the subject. Just to cite a few: Neural Networks as Kernel Learners: The Silent Alignment Effect, Every Model Learned by Gradient Descent Is Approximately a Kernel Machine, On the Equivalence between Neural Network and Support Vector Machine.↩︎

  2. Missing value imputation is managed through the imputeTS package. For more information: https://cran.r-project.org/web/packages/imputeTS/index.html.↩︎

  3. In some cases you may want to operate on smoothed time features. In this case, codez relies on the fANCOVA package. Here you can find the latest: https://cran.r-project.org/web/packages/fANCOVA/index.html↩︎

  4. The metrics are calculated using the greybox package. For reference, please take a look here: https://cran.r-project.org/web/packages/greybox/index.html↩︎

  5. Other packages focused on time feature analysis that could be of interest:

    ↩︎