library(codez)
Intro to codez
“Geometry is not true, it is convenient” (Henri Poincaré)
“Kernel methods owe their name to the use of kernel functions, which enable them to operate in a high-dimensional, implicit feature space without ever computing the coordinates of the data in that space, but rather by simply computing the inner products between the images of all pairs of data in the feature space. This operation is often computationally cheaper than the explicit computation of the coordinates. This approach is called the ‘kernel trick’” (Wikipedia)
“Geometry in the humming of the strings, music in the spacing of the spheres” (Pythagoras)
Moving prediction to the right space
Many academics¹ have recently argued that, for the most part, deep neural networks operate as a kind of kernel machine, projecting data into different geometric spaces to solve their tasks (not exactly the “kernel trick”, because here the projection is usually calculated in a lower-dimensional space and the coordinates are actually computed). In codez we use the same “geometric reasoning” to simplify the solution of Seq2Seq models: codez uses two deep neural networks to predict sequences in time. The first network is an autoencoder that encodes the tensor of sequences into a lower-dimensional subspace; the second is a forward neural net that predicts the next sequence directly in the latent space; the prediction is then decoded back into the original space. The toy sketch below illustrates this two-stage idea; a brief explanation of the preprocessing follows.
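To make the two-stage idea concrete, here is a hedged toy sketch using the keras R package, with made-up layer sizes and random data. It only illustrates the concept and is not the codez implementation (codez tunes these choices via random search):

library(keras)

seq_length <- 30; latent_dim <- 4
X <- matrix(rnorm(200 * seq_length), ncol = seq_length)  # toy sequence windows

# 1. autoencoder: compress each sequence into a low-dimensional code
inp  <- layer_input(shape = seq_length)
code <- inp  %>% layer_dense(16, activation = "relu") %>% layer_dense(latent_dim)
out  <- code %>% layer_dense(16, activation = "relu") %>% layer_dense(seq_length)
autoencoder <- keras_model(inp, out)
encoder     <- keras_model(inp, code)
autoencoder %>% compile(optimizer = "adam", loss = "mse")
autoencoder %>% fit(X, X, epochs = 30, verbose = 0)

# 2. forward net: predict the next latent code from the current one
Z <- predict(encoder, X)
fwd <- keras_model_sequential() %>%
  layer_dense(8, activation = "relu", input_shape = latent_dim) %>%
  layer_dense(latent_dim)
fwd %>% compile(optimizer = "adam", loss = "mse")
fwd %>% fit(Z[-nrow(Z), ], Z[-1, ], epochs = 30, verbose = 0)

# 3. decode the predicted code back into the original space, reusing the
#    decoder half of the trained autoencoder (layers 4 and 5 of 5)
lat_in  <- layer_input(shape = latent_dim)
decoder <- keras_model(lat_in, autoencoder$layers[[5]](autoencoder$layers[[4]](lat_in)))
z_next  <- predict(fwd, Z[nrow(Z), , drop = FALSE])
y_hat   <- predict(decoder, z_next)  # predicted next sequence, original scale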
Some basic transformations are managed directly in the background. Differencing and integration are handled automatically by codez, using the maximal p-value in a recursive F-test for de-trending each time feature: this makes it easy to determine the different dynamic characteristics of each time feature (random walk, trend, exponential), and is somewhat simpler and more practical than formal approaches like the Augmented Dickey-Fuller or Ljung-Box tests. If your time features have a limited number of missing values, codez automatically imputes them using the Kalman filter method². If you prefer to project the smoothed version into the future, you can set `smoother = TRUE` to use the loess³ function.

The test errors are cross-validated through an expanding validation scheme controlled by `n_windows`: the default value is 10, meaning that the time features are divided into 10 + 1 segments, guaranteeing at least ten validation sets to measure the error on unforeseen data. For each point in the prediction sequence, the empirical error distribution is sampled a thousand times to calculate quantiles, mean, mode, standard deviation, skewness, kurtosis, and other less common measures provided for each time step (more details in the following paragraphs). A minimal sketch of the imputation and expanding-window ideas follows.
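Here is a hedged base-R sketch of those two preprocessing steps, assuming the imputeTS package (which codez relies on for imputation, see footnote 2); `expanding_windows()` is a hypothetical helper for illustration, not the codez internals:

library(imputeTS)

x <- cumsum(rnorm(500))          # a toy random walk
x[sample(length(x), 20)] <- NA   # simulate a few missing values
x <- na_kalman(x)                # Kalman-filter imputation

# expanding validation: n_windows + 1 segments, the training set grows and
# the segment right after it is used for validation
expanding_windows <- function(x, n_windows = 10) {
  cuts <- round(seq(1, length(x), length.out = n_windows + 2))
  lapply(seq_len(n_windows), function(i)
    list(train = x[1:cuts[i + 1]],
         test  = x[(cuts[i + 1] + 1):cuts[i + 2]]))
}
splits <- expanding_windows(x, n_windows = 10)   # ten train/test pairs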
A showcase with stock price prediction
The dataset `amzn_aapl_fb` included with codez is a recent take on some Big Techs’ stock prices (source: Yahoo Finance). The data is expected in dataframe format, where each column represents a different time series (the date information is not mandatory and can be provided separately).
#| echo: false
#| include: false
knitr::kable(head(amzn_aapl_fb, 10), align = "ccc", caption = "Examples of stock prices from Tech Companies")
|      | Date       | AMZN   | GOOGL    | FB    |
|------|------------|--------|----------|-------|
| 3779 | 2012-05-18 | 213.85 | 300.5005 | 38.23 |
| 3780 | 2012-05-21 | 218.11 | 307.3624 | 34.03 |
| 3781 | 2012-05-22 | 215.33 | 300.7007 | 31.00 |
| 3782 | 2012-05-23 | 217.28 | 305.0350 | 32.00 |
| 3783 | 2012-05-24 | 215.24 | 302.1321 | 33.03 |
| 3784 | 2012-05-25 | 212.89 | 296.0611 | 31.91 |
| 3785 | 2012-05-29 | 214.75 | 297.4675 | 28.84 |
| 3786 | 2012-05-30 | 209.23 | 294.4094 | 28.19 |
| 3787 | 2012-05-31 | 212.91 | 290.7207 | 29.60 |
| 3788 | 2012-06-01 | 208.22 | 285.7758 | 27.72 |
In the first example, we predict the close price for Amazon, setting the sequence length `seq_len` to 75 and using a validation scheme of 5 windows for error measurement. Let’s try to sample 3 random models using the standard parameters and the presets.
example1 <- codez(amzn_aapl_fb[, 2, drop = FALSE], seq_len = 75, n_windows = 5, dates = as.Date(amzn_aapl_fb$Date), n_samp = 3)
time: 219.22 sec elapsed
The result is a list of different components, as you can see below.
names(example1)
[1] "history" "best_model" "time_log"
names(example1$best_model)
[1] "predictions" "errors" "plot"
knitr::kable(example1$history, align = "ccc", caption = "History with 3 random samples")
sample | seq_len | latent | autoencoder_layers | autoencoder_activations | autoencoder_optimizer | forward_net_layers | forward_net_activations | forward_net_reg_L1 | forward_net_reg_L2 | forward_net_dropout | forward_net_optimizer | me | mae | mse | rmsse | mpe | mape | rmae | rrmse | rame | mase | smse | sce | gmrae
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
3 | 75 | 26 | 363, 608 | selu, elu | rmsprop | 918, 413, 321, 655 | swish , tanh , linear, elu | 869.870, 697.698, 649.650, 39.040 | 776.777, 490.491, 546.547, 732.733 | 0.8047, 0.8311, 0.3595, 0.5973 | adamax | 14.7248 | 63.0482 | 8465.934 | 32.6550 | -0.0096 | 0.0836 | 0.9368 | 0.9408 | 0.7212 | 12.1302 | 1499.677 | 94.2746 | 0.9870 |
2 | 75 | 66 | 135, 310, 31, 846 | relu , leaky_relu, gelu , gelu | sgd | 488, 305, 31, 165, 306 | selu , relu , swish , relu , leaky_relu | 371.372, 2.003, 373.374, 404.405, 769.770 | 108.109, 328.329, 668.669, 75.076, 979.980 | 0.8904, 0.6037, 0.6133, 0.1649, 0.8359 | rmsprop | 14.7252 | 63.0478 | 8466.597 | 32.6614 | -0.0098 | 0.0836 | 0.9368 | 0.9412 | 0.7212 | 12.1300 | 1499.885 | 94.2800 | 0.9756 |
1 | 75 | 50 | 641, 56 | elu , selu | adamax | 937, 867, 879, 266, 321 | selu , selu , sigmoid, relu , elu | 921.922, 625.626, 261.262, 389.390, 129.130 | 516.517, 32.033, 944.945, 102.103, 227.228 | 0.1120, 0.3851, 0.6854, 0.2978, 0.3595 | adam | 14.7262 | 63.1206 | 8472.620 | 32.7196 | -0.0098 | 0.0838 | 0.9402 | 0.9448 | 0.7212 | 12.1528 | 1501.775 | 94.3036 | 0.9900 |
`history` includes the hyper-parameters and error metrics for the models sampled during the random search (me, mae, mse, rmsse, mpe, mape, rmae, rrmse, rame, mase, smse, sce, gmrae⁴). `best_model` collects the best combination of hyper-parameters, together with the error metrics for each time feature (`errors`), prediction intervals (`predictions`) and visualizations (`plot`).
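You can drill into these components directly; for instance (component names as printed above):

example1$best_model$errors              # error metrics per time feature
example1$best_model$predictions$AMZN    # prediction table for AMZN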
`predictions` is a list including the predicted results for each time feature (quantile, min, max, mean/proportion, mode, sd, skewness, kurtosis, iqr to range, above to below range, upside/upgrade probability, and divergence for each time point in the sequence). The IQR to range is the ratio of the interquartile range to the min-max range; the above to below range is the ratio of the range above the median to the range below it; the upside/upgrade probability is the probability of growth (or scale up, for a categorical variable) compared to the previous point in the time sequence; the divergence is the maximum distance between the cumulative normal curve of each point and that of the previous point in the sequence. A rough sketch of some of these stats follows the table below.
#| echo: false
knitr::kable(head(example1$best_model$predictions$AMZN, 10), align = "ccc", caption = "Examples of predictions for Amazon Close Prices (first 10 points)")
date | min | 10% | 25% | 50% | 75% | 90% | max | mean | sd | mode | kurtosis | skewness | iqr_to_range | above_to_below_range | upside_prob | divergence | pred_scores
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
2019-07-12 | 1907.04 | 2003.70 | 2009.79 | 2011.28 | 2020.07 | 2021.78 | 2030.10 | 2005.088 | 31.5678 | 2011.951 | 8.4411 | -2.6260 | 0.0835 | 0.1805 | 0.626 | 0.3483 | 0.4376 |
2019-07-13 | 1937.83 | 2005.79 | 2007.57 | 2017.04 | 2027.19 | 2029.44 | 2066.30 | 2015.195 | 27.2575 | 2011.912 | 6.0219 | -1.1534 | 0.1527 | 0.6219 | 0.567 | 0.1940 | 0.2880 |
2019-07-14 | 1867.77 | 1996.24 | 2003.81 | 2014.60 | 2026.24 | 2059.76 | 2059.76 | 2008.402 | 43.1106 | 2016.677 | 8.6311 | -2.3682 | 0.1168 | 0.3076 | 0.460 | 0.1860 | 0.3300 |
2019-07-16 | 1879.67 | 1980.93 | 2004.81 | 2011.69 | 2027.68 | 2033.05 | 2050.51 | 2005.165 | 43.1076 | 2017.546 | 6.6956 | -2.1076 | 0.1339 | 0.2940 | 0.482 | 0.2580 | 0.3172 |
2019-07-17 | 1881.88 | 1968.33 | 1979.74 | 2014.26 | 2027.27 | 2028.61 | 2050.57 | 2001.786 | 43.0450 | 2021.639 | 5.5307 | -1.7584 | 0.2818 | 0.2743 | 0.479 | 0.1790 | 0.2900 |
2019-07-19 | 1902.18 | 1981.83 | 1992.01 | 2006.96 | 2032.72 | 2042.98 | 2042.98 | 2003.801 | 36.3766 | 1998.559 | 5.4061 | -1.5846 | 0.2891 | 0.3438 | 0.481 | 0.2830 | 0.4088 |
2019-07-20 | 1897.02 | 1969.38 | 1979.96 | 2011.76 | 2039.15 | 2044.07 | 2063.90 | 2004.300 | 43.5385 | 2024.587 | 4.1129 | -1.1916 | 0.3547 | 0.4544 | 0.497 | 0.1950 | 0.3904 |
2019-07-22 | 1830.55 | 1957.57 | 1995.84 | 2010.45 | 2035.73 | 2044.24 | 2058.67 | 2002.828 | 55.1986 | 2020.108 | 7.2123 | -2.1403 | 0.1749 | 0.2680 | 0.536 | 0.1010 | 0.4116 |
2019-07-23 | 1759.55 | 1959.04 | 1991.52 | 2010.14 | 2040.86 | 2042.39 | 2063.21 | 1991.567 | 80.0152 | 2010.310 | 6.8237 | -2.2013 | 0.1625 | 0.2118 | 0.452 | 0.2040 | 0.2692 |
2019-07-25 | 1790.12 | 1947.06 | 1979.22 | 2010.89 | 2039.88 | 2040.30 | 2061.59 | 1993.765 | 72.1616 | 2027.311 | 6.0451 | -1.9643 | 0.2235 | 0.2297 | 0.521 | 0.1660 | 0.6280 |
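To clarify how the less common stats can be read, here is a rough base-R sketch mirroring the definitions given above; `point_stats()` is a hypothetical helper operating on an empirical sample `s` for one time point and `s_prev` for the previous one, not the codez internals:

point_stats <- function(s, s_prev) {
  q <- quantile(s, c(0.25, 0.5, 0.75))
  grid <- seq(min(s, s_prev), max(s, s_prev), length.out = 200)
  c(iqr_to_range         = unname((q[3] - q[1]) / (max(s) - min(s))),
    above_to_below_range = unname((max(s) - q[2]) / (q[2] - min(s))),
    upside_prob          = mean(s > s_prev),   # probability of growth
    divergence           = max(abs(pnorm(grid, mean(s), sd(s)) -
                                   pnorm(grid, mean(s_prev), sd(s_prev)))))
}
set.seed(42)
point_stats(s = rnorm(1000, 101, 2), s_prev = rnorm(1000, 100, 2))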
For each time feature included in the model, you get a plot of the median with the chosen confidence interval (`ci`, default 0.8). As in other packages⁵, we provide different stats to give a better hint at the different dynamics related to aleatoric and epistemic uncertainty.
#| echo: false
#| include: true
#| fig-dpi: 300
example1$best_model$plot
$AMZN
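If you prefer building your own chart from the prediction table, here is a quick ggplot2 sketch; it assumes (as in the table above) that each feature’s predictions come back with dates as row names and quantile columns like `10%`, `50%`, `90%`:

library(ggplot2)

# optional = TRUE keeps quantile column names like "10%" intact
pred <- as.data.frame(example1$best_model$predictions$AMZN, optional = TRUE)
pred$date <- as.Date(rownames(pred))

# median line with an 80% ribbon, mirroring the default ci of 0.8
ggplot(pred, aes(date, `50%`)) +
  geom_ribbon(aes(ymin = `10%`, ymax = `90%`), fill = "grey70", alpha = 0.5) +
  geom_line(colour = "steelblue") +
  labs(x = NULL, y = "AMZN close price",
       title = "Median prediction with 80% interval")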
Random explorations of the hyper-parameter space
You can use codez in different ways. If you have a clear idea about the right hyper-parameters but want to find the best sequence length to explore, you can enter the desired parameters and leave `seq_len` set to `NULL`:
example2 <- codez(amzn_aapl_fb[, 2:4], n_windows = 5, dates = as.Date(amzn_aapl_fb$Date), n_samp = 5, seq_len = NULL, latent = 3, autoencoder_layers_n = 3, autoencoder_activ = "gelu", autoencoder_layers_size = 64, autoencoder_optimizer = "nadam", forward_net_layers_n = 2, forward_net_activ = "relu", forward_net_layers_size = 32, forward_net_drop = 0.5, forward_net_reg_L1 = 0, forward_net_reg_L2 = 10, forward_net_optimizer = "adam")
time: 278.5 sec elapsed
As we can see looking at `history`, with these hyper-parameters the best horizon of exploration is 20.
#| echo: false
knitr::kable(example2$history, align = "ccc", caption = "The best horizon of prediction for a given set of hyper-parameters")
sample | seq_len | latent | autoencoder_layers | autoencoder_activations | autoencoder_optimizer | forward_net_layers | forward_net_activations | forward_net_reg_L1 | forward_net_reg_L2 | forward_net_dropout | forward_net_optimizer | me | mae | mse | rmsse | mpe | mape | rmae | rrmse | rame | mase | smse | sce | gmrae
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
4 | 20 | 3 | 64, 64, 64 | gelu, gelu, gelu | nadam | 32, 32 | relu, relu | 0, 0 | 0, 0 | 0.5, 0.5 | adam | -3.8131 | 17.2822 | 710.0303 | 10.1980 | -0.0163 | 0.0394 | 1.0500 | 1.0417 | 1.3007 | 4.8694 | 153.2963 | -26.7311 | 1.0938 |
3 | 27 | 3 | 64, 64, 64 | gelu, gelu, gelu | nadam | 32, 32 | relu, relu | 0, 0 | 0, 0 | 0.5, 0.5 | adam | -5.2803 | 19.4685 | 964.5117 | 11.5791 | -0.0131 | 0.0444 | 1.0163 | 1.0029 | 0.9586 | 5.4499 | 202.7442 | -33.7634 | 1.0865 |
1 | 51 | 3 | 64, 64, 64 | gelu, gelu, gelu | nadam | 32, 32 | relu, relu | 0, 0 | 0, 0 | 0.5, 0.5 | adam | -5.8973 | 31.2829 | 2296.6398 | 18.1277 | -0.0081 | 0.0694 | 1.1315 | 1.1163 | 1.7449 | 8.5629 | 482.1925 | -32.5311 | 1.2239 |
5 | 51 | 3 | 64, 64, 64 | gelu, gelu, gelu | nadam | 32, 32 | relu, relu | 0, 0 | 0, 0 | 0.5, 0.5 | adam | -5.8973 | 31.2829 | 2296.6398 | 18.1277 | -0.0081 | 0.0694 | 1.1315 | 1.1163 | 1.7449 | 8.5629 | 482.1925 | -32.5311 | 1.2239 |
2 | 67 | 3 | 64, 64, 64 | gelu, gelu, gelu | nadam | 32, 32 | relu, relu | 0, 0 | 0, 0 | 0.5, 0.5 | adam | 3.9278 | 44.0684 | 4497.5283 | 24.7915 | -0.0002 | 0.0875 | 1.1874 | 1.1431 | 6.0500 | 11.4571 | 872.4610 | 66.2902 | 1.3136 |
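Since `history` is a table, you can also pull the winning sequence length programmatically, for example ranking by mean absolute error (assuming `history` behaves as a data frame with the columns shown above):

example2$history$seq_len[which.min(example2$history$mae)]   # 20, as in the table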
As anticipated in the previous example, you can set a range of hyper-parameters for the random search instead of using the presets. Let’s see if we can improve the accuracy of our predictions with a horizon of 20 points and a latent space of 3. This time we collect 10 samples.
example3 <- codez(amzn_aapl_fb[, 2:4], n_windows = 5, dates = as.Date(amzn_aapl_fb$Date), n_samp = 10, seq_len = 20, latent = 3, autoencoder_layers_n = 2:3, autoencoder_activ = c("gelu", "relu", "leaky_relu"), autoencoder_layers_size = 64:128, autoencoder_optimizer = c("nadam", "sgd"), forward_net_layers_n = 1, forward_net_activ = c("swish", "selu", "relu"), forward_net_layers_size = 32:64, forward_net_drop = c(0.3, 0.5, 0.7), forward_net_reg_L1 = 0:100, forward_net_reg_L2 = 0:100, forward_net_optimizer = c("adam", "adagrad"))
time: 548.78 sec elapsed
Peeking at the history table, we can see that improving on the previous result is not so easy with only 10 samples: you have to run a larger random search (a sketch follows the table).
knitr::kable(example3$history, align = "ccc", caption = "Random search over custom hyper-parameter ranges")
sample | seq_len | latent | autoencoder_layers | autoencoder_activations | autoencoder_optimizer | forward_net_layers | forward_net_activations | forward_net_reg_L1 | forward_net_reg_L2 | forward_net_dropout | forward_net_optimizer | me | mae | mse | rmsse | mpe | mape | rmae | rrmse | rame | mase | smse | sce | gmrae
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
6 | 20 | 3 | 66, 121, 105 | gelu , relu , leaky_relu | sgd | 59 | swish | 31 | 28 | 0.7 | adam | -3.3819 | 17.5356 | 722.7776 | 10.3271 | -0.0150 | 0.0398 | 1.0709 | 1.0539 | 1.3068 | 4.9968 | 156.0633 | -22.2168 | 1.1047 |
3 | 20 | 3 | 89, 66 | relu, relu | nadam | 40 | swish | 52 | 42 | 0.3 | adam | -3.7153 | 17.3611 | 713.8214 | 10.2400 | -0.0164 | 0.0399 | 1.0651 | 1.0497 | 1.2045 | 4.9490 | 154.0029 | -26.1311 | 1.1235 |
8 | 20 | 3 | 78, 85, 121 | gelu, gelu, gelu | sgd | 59 | relu | 59 | 37 | 0.5 | adam | -3.3775 | 17.5427 | 723.3525 | 10.3307 | -0.0150 | 0.0398 | 1.0717 | 1.0547 | 1.3144 | 4.9997 | 156.1595 | -22.2491 | 1.1116 |
9 | 20 | 3 | 71, 99 | leaky_relu, gelu | sgd | 33 | selu | 28 | 0 | 0.3 | adagrad | -3.4874 | 17.4959 | 720.8613 | 10.3108 | -0.0150 | 0.0399 | 1.0716 | 1.0565 | 1.3853 | 4.9796 | 155.6377 | -23.2523 | 1.0992 |
4 | 20 | 3 | 104, 90 | relu, relu | nadam | 60 | selu | 53 | 57 | 0.7 | adam | -3.6871 | 17.4627 | 712.7081 | 10.3446 | -0.0167 | 0.0403 | 1.0825 | 1.0676 | 1.3362 | 5.0100 | 154.8929 | -25.7569 | 1.1239 |
2 | 20 | 3 | 100, 83 | gelu, gelu | nadam | 51 | selu | 12 | 84 | 0.3 | adagrad | -4.8590 | 16.7736 | 644.8797 | 10.1287 | -0.0201 | 0.0407 | 1.1071 | 1.0773 | 1.0993 | 5.0662 | 145.9539 | -40.4989 | 1.2145 |
5 | 20 | 3 | 99, 68, 97 | relu , leaky_relu, relu | nadam | 47 | relu | 82 | 71 | 0.5 | adagrad | -2.9513 | 17.9651 | 743.1797 | 10.5637 | -0.0123 | 0.0415 | 1.1159 | 1.1129 | 1.3762 | 5.1852 | 159.7197 | -17.5879 | 1.1353 |
10 | 20 | 3 | 81, 67, 113 | relu, gelu, gelu | nadam | 49 | selu | 80 | 12 | 0.7 | adagrad | -3.7495 | 17.3979 | 719.3010 | 10.2531 | -0.0183 | 0.0399 | 1.1045 | 1.0969 | 2.2890 | 4.8835 | 152.5588 | -31.3684 | 1.1474 |
7 | 20 | 3 | 87, 93, 106 | relu, relu, relu | nadam | 36 | swish | 79 | 54 | 0.5 | adagrad | -5.2575 | 17.5125 | 720.5752 | 10.5403 | -0.0220 | 0.0449 | 1.1811 | 1.1371 | 1.1357 | 5.4475 | 166.9612 | -41.3945 | 1.3471 |
1 | 20 | 3 | 110, 87 | leaky_relu, leaky_relu | nadam | 43 | swish | 90 | 72 | 0.5 | adagrad | -2.0398 | 19.5391 | 850.9036 | 11.8973 | -0.0104 | 0.0460 | 1.2091 | 1.2137 | 1.8679 | 5.9249 | 206.2210 | -6.1713 | 1.2410 |
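A natural next step, not run here for brevity, is simply to raise `n_samp` while keeping the same ranges; the runtime should grow roughly linearly with the number of sampled models:

example4 <- codez(amzn_aapl_fb[, 2:4], n_windows = 5, dates = as.Date(amzn_aapl_fb$Date), n_samp = 50, seq_len = 20, latent = 3, autoencoder_layers_n = 2:3, autoencoder_activ = c("gelu", "relu", "leaky_relu"), autoencoder_layers_size = 64:128, autoencoder_optimizer = c("nadam", "sgd"), forward_net_layers_n = 1, forward_net_activ = c("swish", "selu", "relu"), forward_net_layers_size = 32:64, forward_net_drop = c(0.3, 0.5, 0.7), forward_net_reg_L1 = 0:100, forward_net_reg_L2 = 0:100, forward_net_optimizer = c("adam", "adagrad"))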
Footnotes
¹ On arXiv you can find some interesting papers on the subject. Just to cite a few: Neural Networks as Kernel Learners: The Silent Alignment Effect; Every Model Learned by Gradient Descent Is Approximately a Kernel Machine; On the Equivalence between Neural Network and Support Vector Machine.↩︎
² The missing-value imputation is managed through the imputeTS package. For more information: https://cran.r-project.org/web/packages/imputeTS/index.html.↩︎
³ In some cases, you may want to operate on smoothed time features. In this case, codez calls on the fANCOVA package. Here you can find all the latest: https://cran.r-project.org/web/packages/fANCOVA/index.html↩︎

⁴ The metrics are calculated using the greybox package. For any reference, please take a look here: https://cran.r-project.org/web/packages/greybox/index.html↩︎
⁵ Other packages focused on time feature analysis that could be of interest:
AUDREX, https://cran.r-project.org/web/packages/audrex/index.html
PROTEUS, https://cran.r-project.org/web/packages/proteus/index.html
JENGA, https://cran.r-project.org/web/packages/jenga/index.html
TETRAGON, https://cran.r-project.org/web/packages/tetragon/index.html
SPOOKY, https://cran.r-project.org/web/packages/spooky/index.html
DYMO, https://cran.r-project.org/web/packages/dymo/index.html
SEGEN, https://cran.r-project.org/web/packages/segen/index.html