Intro to organik

Author

Giancarlo Vercellino

Published

November 25, 2025

“The whole is greater than the sum of its parts.” - Aristotle

“Invisible threads are the strongest ties.” - Friedrich Nietzsche

“Confidence is a feeling.” - Daniel Kahneman

Mixing rpart, glmnet and knn in an organik way

organik trains a probabilistic ensemble per horizon on cumulative growth from a univariate series, backtests variants (CRPS + PIT calibration), and softmax-weights them into robust marginal predictors. A cleaned Kendall-based correlation feeds a Gaussian/t copula to couple horizons, enabling fast simulation of joint growth and level paths with per-horizon quantiles and means.

The organik package works this way:

  • organik starts by converting a single time series into a supervised learning problem for multi-step forecasting. It builds a lagged feature matrix from past observations and, for each forecast horizon, defines a target representing cumulative growth over that horizon. In practice, horizon one captures the change from today to tomorrow, horizon two the total growth over two steps, and so on. This yields a consistent feature–target table across horizons and keeps the link to the original levels simple. (A minimal sketch of this construction appears after this list.)

  • For each horizon, organik fits an ensemble of probabilistic component models that vary in three dimensions: mean model, residual distribution, and treatment of predictive uncertainty. The mean model can be a regression tree, an elastic-net regression, or k-nearest neighbours, covering flexible non-parametric and regularized linear options. Residual distributions can be light- or heavy-tailed, symmetric or skewed, to match observed error behaviour. Predictive variance can be constant or feature-dependent via ridge or tree models fitted to a transformed residual scale. Each component delivers a full predictive distribution, not just a point forecast. (A stripped-down single component, together with its CRPS and PIT evaluation, is sketched after this list.)

  • The ensemble is then weighted according to out-of-sample performance using a rolling backtest. At selected time points, each component is refit on past data and evaluated one step ahead. Two criteria are computed: a simulation-based proper scoring rule for sharpness and a calibration measure based on how realised values behave under the predictive distribution. These are combined into a single score with user-controlled trade-offs. A softmax transformation with a temperature parameter converts scores into model weights, favouring better-calibrated, sharper components while preserving diversity.

  • Once marginal predictive distributions for each horizon are in place, organik estimates how horizons move together by analysing historical cumulative growth across horizons. It uses rank-based correlations that are robust to monotonic transformations and converts them into a correlation structure suitable for multivariate modelling. Because empirical correlation matrices can be noisy or ill-conditioned, the method applies numerical corrections to enforce positive-definiteness, typically via a near–positive-definite adjustment or a small ridge. The resulting matrix encodes how shocks at different horizons co-move in a stable way.

  • Finally, organik uses a copula-based mechanism to generate joint forecast paths consistent with both marginals and dependence. It simulates latent multivariate variables with the estimated correlation, transforms them into uniforms, and, for each horizon, maps these through the marginal quantile functions to obtain simulated paths. The function returns simulated path matrices, horizon-wise means and quantiles, optional incremental returns, and the correlation matrix used, providing coherent multi-step probabilistic forecasts. (The correlation cleaning and copula simulation are also sketched after this list.)
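
To make the feature–target construction concrete, here is a minimal sketch in plain R. It illustrates the idea rather than the organik internals: the helper name make_supervised, the number of lags, and the growth definition y[t+h]/y[t] - 1 are assumptions made for the example.

# Illustrative sketch (not organik's internals): lagged features plus
# a cumulative-growth target for a single horizon h
make_supervised <- function(y, n_lags, h) {
  n <- length(y)
  idx <- seq(n_lags, n - h)  # anchor points with full lags and a future value
  X <- t(sapply(idx, function(i) y[(i - n_lags + 1):i]))  # lagged feature rows
  target <- y[idx + h] / y[idx] - 1  # cumulative growth over h steps
  list(X = X, y = target)
}

sup <- make_supervised(y, n_lags = 10, h = 3)
dim(sup$X)  # one row per usable time point, one column per lag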
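
In the same spirit, the sketch below builds one stripped-down component (an rpart mean with a constant Gaussian scale) and evaluates it with the simulation-based energy form of the CRPS and a PIT-based calibration statistic. The distribution choice and the scoring details are deliberately simplified stand-ins for the variants organik actually trains.

library(rpart)

# One toy component: tree-based mean, constant Gaussian residual scale
df <- data.frame(sup$X, target = sup$y)
fit <- rpart(target ~ ., data = df)
mu <- predict(fit)  # conditional means
sigma <- sd(df$target - mu)  # constant scale (organik also allows feature-dependent scale)

# Simulation-based CRPS: E|X - y| - 0.5 * E|X - X'| with X, X' predictive draws
crps_sim <- function(draws, y_obs) {
  mean(abs(draws - y_obs)) - 0.5 * mean(abs(outer(draws, draws, "-")))
}
crps_sim(rnorm(500, mu[1], sigma), df$target[1])

# PIT calibration: probability integral transforms of the realised values
# should look uniform; a KS statistic measures the departure from uniformity
pit <- pnorm(df$target, mu, sigma)
ks.test(pit, "punif")$statistic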
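
The dependence-and-simulation step can be sketched too. Below, Kendall's tau is converted to a Pearson-scale correlation via sin(pi/2 * tau), repaired with Matrix::nearPD, and used to drive a Gaussian copula; empirical quantiles of historical growth stand in for the fitted marginal quantile functions. Whether organik uses exactly this conversion and repair is an assumption of the sketch.

library(Matrix)
library(MASS)

# Historical cumulative growth for the first H horizons (toy construction)
H <- 5
G <- sapply(1:H, function(h) y[(1 + h):(500 + h)] / y[1:500] - 1)

tau <- cor(G, method = "kendall")  # rank-based, robust to monotonic transforms
R <- sin(pi / 2 * tau)  # Kendall-to-Pearson conversion
R <- as.matrix(nearPD(R, corr = TRUE)$mat)  # enforce positive-definiteness

# Gaussian copula: correlated normals -> uniforms -> marginal quantiles
Z <- mvrnorm(1000, mu = rep(0, H), Sigma = R)
U <- pnorm(Z)
paths <- sapply(1:H, function(h) quantile(G[, h], probs = U[, h]))  # 1000 joint paths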

The process behind organik

Don’t panic, it’s organik! Some practical examples

In this example, we use organik on a dummy series to explore how patterns bend, stretch, and evolve over time. Our goal is to forecast 150 future steps by leveraging the full pipeline: per-horizon ensembles, backtested softmax weights, and copula-based path simulation.

# Load required libraries
library(organik)

# Let's create a dummy ts
set.seed(42)

y <- cumsum(rnorm(2000, sd = 0.5))
y <- y + abs(min(y)) + 1  # smooth, positive-ish series

# Minimal example
obj <- organik(
  ts = y, horizon = 150L,
  n_variants = 5L,
  n_testing = 5L,
  seed = 123
)

Here’s what organik returns and how you use it:

  • model_list (list, length = horizon): one fitted ensemble per horizon. Each element (from ensemble) contains the trained variants, their weights, and a leaderboard with scores (CRPS + calibration). You can inspect per-horizon model diversity and weights via obj$model_list[[h]]$leaderboard.

  • growth_pred_funs (list): marginal predictive objects for cumulative growth at each horizon, evaluated at the most recent feature row. Each object exposes dfun, pfun, qfun, and rfun. Example quantile for horizon 1: obj$growth_pred_funs[[1]]$qfun(c(0.1, 0.5, 0.9)).

  • level_pred_funs (list): same interface as above but on the level scale. Example: obj$level_pred_funs[[1]]$qfun(c(0.1, 0.5, 0.9)).

  • cor_mat (matrix H×H): cleaned, positive-definite cross-horizon correlation. Useful if you want to plug a different copula or sanity-check dependence.

  • path_prediction (function): generates joint forecast paths with a Gaussian or t-copula and summarizes them (a short usage sketch follows this list). Call: path <- obj$path_prediction(n_paths = 5000, probs = c(0.05, 0.5, 0.95), copula = "gaussian"). Returns a list with:

    • cum_growth_paths (n_paths × H) and level_paths (n_paths × H).

    • Optional incr_return_paths (n_paths × H) if return_increments = TRUE.

    • Per-horizon summaries: cum_growth_mean, level_mean, cum_growth_quants, level_quants.

    • Metadata: last_level, copula, df (for t-copula), n_paths, and cor_used (the PD correlation actually used).

  • Reproducibility: Passing seed to organik and/or to path_prediction() makes ensembles, weights, and simulated paths deterministic.
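
Putting the returned pieces together, a typical simulation call might look like this; the argument names follow the list above, and str() is used because the exact layout of the summary elements may vary:

# Simulate 5000 joint paths with a Gaussian copula and inspect the output
path <- obj$path_prediction(n_paths = 5000,
                            probs = c(0.05, 0.5, 0.95),
                            copula = "gaussian")

dim(path$level_paths)  # n_paths x H simulated levels
str(path$level_quants)  # per-horizon level quantiles at the requested probs
path$cor_used[1:3, 1:3]  # corner of the PD correlation actually used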

Let’s inspect the components/variants used for h150:

knitr::kable(obj$model_list[[150]]$leaderboard[, -c(1, 7)], caption = "Variants for h150", digits = 3)
Variants for h150
    engine   dist                 hetero   mean_crsp   calibration_error    score       weight
1   rpart    asymmetric_laplace   tree         0.284               0.911    2.673   0.60113627
3   glmnet   student              tree         0.253               1.000    0.720   0.19427082
2   knn      logistic             ridge        0.425               1.000   -0.856   0.07808392
5   knn      skew_t               tree         0.425               1.000   -0.856   0.07808381
4   rpart    skew_normal          ridge        0.516               1.000   -1.681   0.04842518

The leaderboard is a data frame summarizing all trained variants for a given horizon, sorted by descending score (best first), with one row per variant and these key columns:

  • id: row index of the variant.

  • engine: mean model used (e.g., rpart, glmnet, knn).

  • dist: residual distribution family.

  • hetero: scale model for the predictive variance.

  • mean_crsp: average CRPS from the rolling backtest (lower is sharper).

  • calibration_error: KS statistic on the PITs (lower is better calibrated).

  • score: a standardized transform of the weighted blend alpha*mean_crsp + beta*calibration_error, oriented so that sharper, better-calibrated variants score higher.

  • weight: softmax weight derived from the scores; sums to 1 across rows.

  • specific_params: human-readable string capturing the exact hyperparameters randomly selected for engine/dist/hetero.
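
Since weight is a temperature softmax of score, the weights can be recomputed from the leaderboard itself; the temperature value below is illustrative and may differ from the one organik actually used:

lb <- obj$model_list[[150]]$leaderboard

# Temperature softmax: higher scores get exponentially more weight
softmax <- function(s, temperature = 1) {
  e <- exp((s - max(s)) / temperature)  # subtract max for numerical stability
  e / sum(e)
}
cbind(weight = lb$weight, recomputed = softmax(lb$score, temperature = 1.7))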

Now, let’s explore the predictive functions for the last forecast point (h150) for the dummy time series:

# List available predictive functions
names(obj$level_pred_funs$t150)
[1] "dfun" "pfun" "qfun" "rfun"
# Let's have a look at the shape for h150
plot(density(obj$level_pred_funs$t150$rfun(1000)), main = "Density for h150")

Using these functions at each forecasted time point, you can compute prediction intervals, extract quantiles, and generate random samples from the predicted distribution, enabling both uncertainty estimation and full probabilistic forecasting. For example:

# Calculate the 80% prediction interval for h150 (growth)
obj$growth_pred_funs$t150$qfun(c(0.1, 0.9))
[1] -0.1367126  2.1558066
# Calculate the 80% prediction interval for h150 (level)
obj$level_pred_funs$t150$qfun(c(0.1, 0.9))
[1] 10.44826 38.19435
# Generate 1000 random forecast samples for t150 and compute their mean
mean(obj$level_pred_funs$t150$rfun(1000))
[1] 15.60651
# Probability that the h150 level exceeds the last observed level (positive growth)
1 - obj$level_pred_funs$t150$pfun(tail(y, 1))
[1] 0.4261444
# Probability of a value between 20 and 30 at h150
obj$level_pred_funs$t150$pfun(30) - obj$level_pred_funs$t150$pfun(20)
[1] 0.1408203
# Odds for value 40 over value 30 at h150
obj$level_pred_funs$t150$dfun(40)/obj$level_pred_funs$t150$dfun(30)
[1] 0.4384153

Final Thoughts

In the end, organik is basically proof that a good forecast is more than just a bunch of pretty plots: it's the way they hang together when things get weird. The copula is there in the background like a chill coordinator, quietly keeping horizons in sync even when the data is throwing a tantrum. And confidence? That really shouldn't just be a mood; it should come with a full probability distribution attached. Around here, models don't win because they sound clever; they win because they survive the tests. Call it natural selection for forecasts, minus the Jurassic Park budget. (Yeah, it's a joke!)

Enzoi!