Intro to codez

Author

Giancarlo Vercellino

“Geometry is not true, it is convenient” (Henri Poincaré)

“Kernel methods owe their name to the use of kernel functions, which enable them to operate in a high-dimensional, implicit feature space without ever computing the coordinates of the data in that space, but rather by simply computing the inner products between the images of all pairs of data in the feature space. This operation is often computationally cheaper than the explicit computation of the coordinates. This approach is called the ‘kernel trick’” (Wikipedia)

“Geometry in the humming of the strings, music in the spacing of the spheres” (Pythagoras)

Moving prediction to the right space

Many academics1 have recently observed that, for the most part, deep neural networks operate as a kind of kernel machine, projecting data into different geometric spaces to solve their tasks (not exactly the “kernel trick”, because here the projection is usually computed in a lower-dimensional space and the coordinates are actually calculated). In codez we use the same “geometric reasoning” to simplify the solution of Seq2Seq models: codez uses two deep neural networks to predict sequences in time. The first network is an autoencoder that encodes the tensor of sequences into a lower-dimensional subspace; the second is a forward neural net that predicts the next sequence directly in the latent space, and the prediction is then decoded back into the original space. A brief explanation follows:

  • Some basic transformations are managed directly in the background. Differencing and integration are handled automatically by codez, which uses the maximal p-value of a recursive F-test to de-trend each time feature: this makes it easy to determine the different dynamic characteristics of each time feature (random walk, trend, exponential) and is somewhat simpler and more practical than formal approaches such as the Augmented Dickey-Fuller or Ljung-Box tests. If your time features have a limited number of missing values, codez automatically imputes them using the Kalman filter method2. If you prefer to project the smoothed version into the future, you can set smoother = TRUE to use the loess3 function.

  • The test errors are cross-validated through an expanding validation scheme controlled by n_windows: the default value is 10, meaning that the time features are divided into 10 + 1 segments, guaranteeing at least ten validation sets for measuring the error on unforeseen data. For each point in the prediction sequence, the empirical error distribution is sampled a thousand times to calculate quantiles, mean, mode, standard deviation, skewness, kurtosis and other less common measures, which are provided for each time step (more details in the following paragraphs). A minimal sketch of the preprocessing and splitting described in these two points follows this list.
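The sketch below, in plain R, illustrates what the two points above describe: Kalman imputation via imputeTS, a recursive F-test on a linear trend to decide how many times a feature should be differenced, and expanding validation windows. The helpers prepare_feature and expanding_windows are hypothetical stand-ins reflecting one plausible reading of the description, not the internal codez implementation.

library(imputeTS)

prepare_feature <- function(x, p_threshold = 0.05, max_diff = 2) {
  x <- na_kalman(x)                          # impute limited missing values (Kalman smoothing)
  n_diff <- 0
  while (n_diff < max_diff) {
    fit <- lm(x ~ seq_along(x))              # linear trend fit on the (possibly differenced) series
    p <- anova(fit)[["Pr(>F)"]][1]           # F-test p-value for the trend term
    if (is.na(p) || p >= p_threshold) break  # no significant trend left: stop differencing
    x <- diff(x)
    n_diff <- n_diff + 1
  }
  list(series = x, n_diff = n_diff)          # n_diff differences are re-integrated after prediction
}

expanding_windows <- function(n, n_windows = 10) {
  # split 1:n into n_windows + 1 segments: the training slice expands,
  # the segment right after it is used for validation
  cuts <- floor(seq(0, n, length.out = n_windows + 2))
  lapply(seq_len(n_windows), function(i)
    list(train = 1:cuts[i + 1], validation = (cuts[i + 1] + 1):cuts[i + 2]))
}

# e.g. expanding_windows(nrow(amzn_aapl_fb), n_windows = 5) yields five train/validation splits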

Figure 1: The process flow of codez
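To make the process flow of Figure 1 concrete, here is a minimal sketch of the encode-predict-decode idea using the keras R package. The layer sizes, activations and the purely dense layout are arbitrary toy choices for illustration, not the networks actually sampled by codez (the history tables below show the hyper-parameters it explores).

library(keras)

win_len <- 32   # length of each windowed sequence (the seq_len of codez)
latent  <- 8    # size of the latent space

# shared decoder layers, so the decoder can be reused on the forward net output
dec_dense <- layer_dense(units = 64, activation = "selu")
dec_out   <- layer_dense(units = win_len)

inp   <- layer_input(shape = win_len)
code  <- inp %>% layer_dense(units = 64, activation = "selu") %>% layer_dense(units = latent)
recon <- code %>% dec_dense() %>% dec_out()

autoencoder <- keras_model(inp, recon)   # trained to reconstruct the windows
encoder     <- keras_model(inp, code)    # maps a window to its latent coordinates

latent_in <- layer_input(shape = latent)
decoder   <- keras_model(latent_in, latent_in %>% dec_dense() %>% dec_out())

# forward net: from the latent code of a window, predict the latent code of the next window
forward_net <- keras_model_sequential() %>%
  layer_dense(units = 32, activation = "relu", input_shape = latent) %>%
  layer_dense(units = latent)

# Training sketch (X = matrix of past windows, Y = matrix of the following windows):
# autoencoder %>% compile(optimizer = "adam", loss = "mse"); autoencoder %>% fit(X, X, epochs = 30)
# Z_x <- predict(encoder, X); Z_y <- predict(encoder, Y)
# forward_net %>% compile(optimizer = "adam", loss = "mse"); forward_net %>% fit(Z_x, Z_y, epochs = 30)
# pred <- predict(decoder, predict(forward_net, Z_x))   # decoded back into the original space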

A showcase with stock price prediction

The dataset amzn_aapl_fb included with codez is a recent take on some Big Tech stock prices (source: Yahoo Finance). The data is expected in a data frame format, where each column represents a different time series (the date information is not mandatory and can be provided separately).

library(codez)
#| echo: false
#| include: false

knitr::kable(head(amzn_aapl_fb, 10), align = "ccc", caption = "Examples of stock prices from Tech Companies")
Examples of stock prices from Tech Companies
Date AMZN GOOGL FB
3779 2012-05-18 213.85 300.5005 38.23
3780 2012-05-21 218.11 307.3624 34.03
3781 2012-05-22 215.33 300.7007 31.00
3782 2012-05-23 217.28 305.0350 32.00
3783 2012-05-24 215.24 302.1321 33.03
3784 2012-05-25 212.89 296.0611 31.91
3785 2012-05-29 214.75 297.4675 28.84
3786 2012-05-30 209.23 294.4094 28.19
3787 2012-05-31 212.91 290.7207 29.60
3788 2012-06-01 208.22 285.7758 27.72

In the first example, we predict the close price for Amazon, setting the sequence length seq_len to 75 and using a validation scheme of 5 windows for error measurement. Let’s try to sample 3 random models using the standard parameters and the presets.

example1 <- codez(amzn_aapl_fb[, 2, drop = FALSE], seq_len = 75, n_windows = 5, dates = as.Date(amzn_aapl_fb$Date), n_samp = 3)
time: 219.22 sec elapsed

The result is a list of different components, as you can see below.

names(example1)
[1] "history"    "best_model" "time_log"  
names(example1$best_model)
[1] "predictions" "errors"      "plot"       
knitr::kable(example1$history, align = "ccc", caption = "History with 3 random samples")
History with 3 random samples
seq_len latent autoencoder_layers autoencoder_activations autoencoder_optimizer forward_net_layers forward_net_activations forward_net_reg_L1 forward_net_reg_L2 forward_net_dropout forward_net_optimizer me mae mse rmsse mpe mape rmae rrmse rame mase smse sce gmrae
3 75 26 363, 608 selu, elu rmsprop 918, 413, 321, 655 swish , tanh , linear, elu 869.870, 697.698, 649.650, 39.040 776.777, 490.491, 546.547, 732.733 0.8047, 0.8311, 0.3595, 0.5973 adamax 14.7248 63.0482 8465.934 32.6550 -0.0096 0.0836 0.9368 0.9408 0.7212 12.1302 1499.677 94.2746 0.9870
2 75 66 135, 310, 31, 846 relu , leaky_relu, gelu , gelu sgd 488, 305, 31, 165, 306 selu , relu , swish , relu , leaky_relu 371.372, 2.003, 373.374, 404.405, 769.770 108.109, 328.329, 668.669, 75.076, 979.980 0.8904, 0.6037, 0.6133, 0.1649, 0.8359 rmsprop 14.7252 63.0478 8466.597 32.6614 -0.0098 0.0836 0.9368 0.9412 0.7212 12.1300 1499.885 94.2800 0.9756
1 75 50 641, 56 elu , selu adamax 937, 867, 879, 266, 321 selu , selu , sigmoid, relu , elu 921.922, 625.626, 261.262, 389.390, 129.130 516.517, 32.033, 944.945, 102.103, 227.228 0.1120, 0.3851, 0.6854, 0.2978, 0.3595 adam 14.7262 63.1206 8472.620 32.7196 -0.0098 0.0838 0.9402 0.9448 0.7212 12.1528 1501.775 94.3036 0.9900

history includes the hyper-parameters and error metrics for the models sampled during the random search (me, mae, mse, rmsse, mpe, mape, rmae, rrmse, rame, mase, smse, sce, gmrae4). best_model collects the best combination of hyper-parameters, together with the error metrics for each time feature (errors), the prediction intervals (predictions) and the visualizations (plot).

predictions is a list including the predicted results for each time feature (quantiles, min, max, mean/proportion, mode, sd, skewness, kurtosis, IQR to range, above to below range, upside/upgrade probability and divergence for each time point in the sequence). The IQR to range is the ratio of the interquartile range to the min-max range; the above to below range is the ratio of the range above the median to the range below it; the upside/upgrade probability is the probability of growth (or of scaling up, in the case of a categorical variable) compared to the former point in the time sequence; the divergence is the maximum distance between the cumulative normal curve of each point and that of the former point in the sequence.

#| echo: false

knitr::kable(head(example1$best_model$predictions$AMZN, 10), align = "ccc", caption = "Examples of prediction for Amazon Close Prices (first 10 points)")
Examples of prediction for Amazon Close Prices (first 10 points)
min 10% 25% 50% 75% 90% max mean sd mode kurtosis skewness iqr_to_range above_to_below_range upside_prob divergence pred_scores
2019-07-12 1907.04 2003.70 2009.79 2011.28 2020.07 2021.78 2030.10 2005.088 31.5678 2011.951 8.4411 -2.6260 0.0835 0.1805 0.626 0.3483 0.4376
2019-07-13 1937.83 2005.79 2007.57 2017.04 2027.19 2029.44 2066.30 2015.195 27.2575 2011.912 6.0219 -1.1534 0.1527 0.6219 0.567 0.1940 0.2880
2019-07-14 1867.77 1996.24 2003.81 2014.60 2026.24 2059.76 2059.76 2008.402 43.1106 2016.677 8.6311 -2.3682 0.1168 0.3076 0.460 0.1860 0.3300
2019-07-16 1879.67 1980.93 2004.81 2011.69 2027.68 2033.05 2050.51 2005.165 43.1076 2017.546 6.6956 -2.1076 0.1339 0.2940 0.482 0.2580 0.3172
2019-07-17 1881.88 1968.33 1979.74 2014.26 2027.27 2028.61 2050.57 2001.786 43.0450 2021.639 5.5307 -1.7584 0.2818 0.2743 0.479 0.1790 0.2900
2019-07-19 1902.18 1981.83 1992.01 2006.96 2032.72 2042.98 2042.98 2003.801 36.3766 1998.559 5.4061 -1.5846 0.2891 0.3438 0.481 0.2830 0.4088
2019-07-20 1897.02 1969.38 1979.96 2011.76 2039.15 2044.07 2063.90 2004.300 43.5385 2024.587 4.1129 -1.1916 0.3547 0.4544 0.497 0.1950 0.3904
2019-07-22 1830.55 1957.57 1995.84 2010.45 2035.73 2044.24 2058.67 2002.828 55.1986 2020.108 7.2123 -2.1403 0.1749 0.2680 0.536 0.1010 0.4116
2019-07-23 1759.55 1959.04 1991.52 2010.14 2040.86 2042.39 2063.21 1991.567 80.0152 2010.310 6.8237 -2.2013 0.1625 0.2118 0.452 0.2040 0.2692
2019-07-25 1790.12 1947.06 1979.22 2010.89 2039.88 2040.30 2061.59 1993.765 72.1616 2027.311 6.0451 -1.9643 0.2235 0.2297 0.521 0.1660 0.6280
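As a minimal illustration of the less common measures defined above, here is how they could be computed from vectors of sampled values for two consecutive points in the predicted sequence. This is a plausible reading of the definitions, not necessarily the internal codez code; point_stats is a hypothetical helper.

point_stats <- function(samples, prev_samples) {
  qs   <- quantile(samples, c(0.25, 0.5, 0.75))
  rng  <- range(samples)
  grid <- seq(min(samples, prev_samples), max(samples, prev_samples), length.out = 200)
  list(
    iqr_to_range         = (qs[3] - qs[1]) / (rng[2] - rng[1]),  # interquartile range over min-max range
    above_to_below_range = (rng[2] - qs[2]) / (qs[2] - rng[1]),  # range above the median over range below it
    upside_prob          = mean(samples > prev_samples),         # probability of growth vs the former point
    divergence           = max(abs(pnorm(grid, mean(samples), sd(samples)) -
                                   pnorm(grid, mean(prev_samples), sd(prev_samples))))  # max distance between the two normal CDFs
  )
}

# e.g. with a thousand sampled values for two consecutive time points:
# set.seed(42); point_stats(rnorm(1000, 102, 6), rnorm(1000, 100, 5))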

For each time feature included in the model, you get a plot of the median with the chosen confidence interval (the ci default is 0.8). As in other packages5, we provide different stats to give a better hint of the different dynamics related to aleatoric and epistemic uncertainty.

#| echo: false
#| include: true
#| fig-dpi: 300

example1$best_model$plot
$AMZN

Random explorations of the hyper-parameter space

You can use codez in different ways. If you have a clear idea about the right hyper-parameters but want to understand the best sequence length to explore, you can enter the desired parameters and leave seq_len set to NULL:

example2 <- codez(amzn_aapl_fb[, 2:4], n_windows = 5, dates = as.Date(amzn_aapl_fb$Date), n_samp = 5, seq_len = NULL, latent = 3, autoencoder_layers_n = 3, autoencoder_activ = "gelu", autoencoder_layers_size = 64, autoencoder_optimizer = "nadam", forward_net_layers_n = 2, forward_net_activ = "relu", forward_net_layers_size = 32, forward_net_drop = 0.5, forward_net_reg_L1 = 0, forward_net_reg_L2 = 10, forward_net_optimizer = "adam")
time: 278.5 sec elapsed

As we can see from history, with these hyper-parameters the best prediction horizon is 20 points.

#| echo: false

knitr::kable(example2$history, align = "ccc", caption = "The best horizon of prediction for a given set of hyper-parameters")
The best horizon of prediction for a given set of hyper-parameters
seq_len latent autoencoder_layers autoencoder_activations autoencoder_optimizer forward_net_layers forward_net_activations forward_net_reg_L1 forward_net_reg_L2 forward_net_dropout forward_net_optimizer me mae mse rmsse mpe mape rmae rrmse rame mase smse sce gmrae
4 20 3 64, 64, 64 gelu, gelu, gelu nadam 32, 32 relu, relu 0, 0 0, 0 0.5, 0.5 adam -3.8131 17.2822 710.0303 10.1980 -0.0163 0.0394 1.0500 1.0417 1.3007 4.8694 153.2963 -26.7311 1.0938
3 27 3 64, 64, 64 gelu, gelu, gelu nadam 32, 32 relu, relu 0, 0 0, 0 0.5, 0.5 adam -5.2803 19.4685 964.5117 11.5791 -0.0131 0.0444 1.0163 1.0029 0.9586 5.4499 202.7442 -33.7634 1.0865
1 51 3 64, 64, 64 gelu, gelu, gelu nadam 32, 32 relu, relu 0, 0 0, 0 0.5, 0.5 adam -5.8973 31.2829 2296.6398 18.1277 -0.0081 0.0694 1.1315 1.1163 1.7449 8.5629 482.1925 -32.5311 1.2239
5 51 3 64, 64, 64 gelu, gelu, gelu nadam 32, 32 relu, relu 0, 0 0, 0 0.5, 0.5 adam -5.8973 31.2829 2296.6398 18.1277 -0.0081 0.0694 1.1315 1.1163 1.7449 8.5629 482.1925 -32.5311 1.2239
2 67 3 64, 64, 64 gelu, gelu, gelu nadam 32, 32 relu, relu 0, 0 0, 0 0.5, 0.5 adam 3.9278 44.0684 4497.5283 24.7915 -0.0002 0.0875 1.1874 1.1431 6.0500 11.4571 872.4610 66.2902 1.3136

As anticipated in the previous example, you can set a range of hyper-parameters for the random search instead of using the presets. Let’s see if we can improve the accuracy of our predictions with a horizon of 20 points and a latent space of size 3. This time we collect 10 samples.

example3 <- codez(amzn_aapl_fb[, 2:4], n_windows = 5, dates = as.Date(amzn_aapl_fb$Date), n_samp = 10, seq_len = 20, latent = 3, autoencoder_layers_n = 2:3, autoencoder_activ = c("gelu", "relu", "leaky_relu"), autoencoder_layers_size = 64:128, autoencoder_optimizer = c("nadam", "sgd"), forward_net_layers_n = 1, forward_net_activ = c("swish", "selu", "relu"), forward_net_layers_size = 32:64, forward_net_drop = c(0.3, 0.5, 0.7), forward_net_reg_L1 = 0:100, forward_net_reg_L2 = 0:100, forward_net_optimizer = c("adam", "adagrad"))
time: 548.78 sec elapsed

Peeking at the history table, we can see that improving on the previous result is not so easy with only 10 samples: you would have to run a larger random search.

knitr::kable(example3$history, align = "ccc", caption = "Random search over hyper-parameter ranges with a fixed horizon of 20 points")
Random search over hyper-parameter ranges with a fixed horizon of 20 points
seq_len latent autoencoder_layers autoencoder_activations autoencoder_optimizer forward_net_layers forward_net_activations forward_net_reg_L1 forward_net_reg_L2 forward_net_dropout forward_net_optimizer me mae mse rmsse mpe mape rmae rrmse rame mase smse sce gmrae
6 20 3 66, 121, 105 gelu , relu , leaky_relu sgd 59 swish 31 28 0.7 adam -3.3819 17.5356 722.7776 10.3271 -0.0150 0.0398 1.0709 1.0539 1.3068 4.9968 156.0633 -22.2168 1.1047
3 20 3 89, 66 relu, relu nadam 40 swish 52 42 0.3 adam -3.7153 17.3611 713.8214 10.2400 -0.0164 0.0399 1.0651 1.0497 1.2045 4.9490 154.0029 -26.1311 1.1235
8 20 3 78, 85, 121 gelu, gelu, gelu sgd 59 relu 59 37 0.5 adam -3.3775 17.5427 723.3525 10.3307 -0.0150 0.0398 1.0717 1.0547 1.3144 4.9997 156.1595 -22.2491 1.1116
9 20 3 71, 99 leaky_relu, gelu sgd 33 selu 28 0 0.3 adagrad -3.4874 17.4959 720.8613 10.3108 -0.0150 0.0399 1.0716 1.0565 1.3853 4.9796 155.6377 -23.2523 1.0992
4 20 3 104, 90 relu, relu nadam 60 selu 53 57 0.7 adam -3.6871 17.4627 712.7081 10.3446 -0.0167 0.0403 1.0825 1.0676 1.3362 5.0100 154.8929 -25.7569 1.1239
2 20 3 100, 83 gelu, gelu nadam 51 selu 12 84 0.3 adagrad -4.8590 16.7736 644.8797 10.1287 -0.0201 0.0407 1.1071 1.0773 1.0993 5.0662 145.9539 -40.4989 1.2145
5 20 3 99, 68, 97 relu , leaky_relu, relu nadam 47 relu 82 71 0.5 adagrad -2.9513 17.9651 743.1797 10.5637 -0.0123 0.0415 1.1159 1.1129 1.3762 5.1852 159.7197 -17.5879 1.1353
10 20 3 81, 67, 113 relu, gelu, gelu nadam 49 selu 80 12 0.7 adagrad -3.7495 17.3979 719.3010 10.2531 -0.0183 0.0399 1.1045 1.0969 2.2890 4.8835 152.5588 -31.3684 1.1474
7 20 3 87, 93, 106 relu, relu, relu nadam 36 swish 79 54 0.5 adagrad -5.2575 17.5125 720.5752 10.5403 -0.0220 0.0449 1.1811 1.1371 1.1357 5.4475 166.9612 -41.3945 1.3471
1 20 3 110, 87 leaky_relu, leaky_relu nadam 43 swish 90 72 0.5 adagrad -2.0398 19.5391 850.9036 11.8973 -0.0104 0.0460 1.2091 1.2137 1.8679 5.9249 206.2210 -6.1713 1.2410

Footnotes

  1. On arXiv you can find some interesting papers on the subject. Just to cite a few: Neural Networks as Kernel Learners: The Silent Alignment Effect, Every Model Learned by Gradient Descent Is Approximately a Kernel Machine, On the Equivalence between Neural Network and Support Vector Machine.↩︎

  2. Missing value imputation is managed through the imputeTS package. For more information: https://cran.r-project.org/web/packages/imputeTS/index.html.↩︎

  3. In some cases you may want to operate on smoothed time features. In this case, codez relies on the fANCOVA package. Here you can find the latest: https://cran.r-project.org/web/packages/fANCOVA/index.html↩︎

  4. The metrics are calculated using the greybox package. For reference, please take a look here: https://cran.r-project.org/web/packages/greybox/index.html↩︎

  5. Other packages focused on time feature analysis that could be of interest:

    ↩︎