This document reports the empirical results of the experimental run against the three pre-registered research questions and their confirmatory hypotheses. Each hypothesis is evaluated at the seed budget on record (n = 5) under two complementary inferential frameworks: the classical small-sample toolkit (paired Student t-test, Wilcoxon signed-rank, classical TOST) and the modern resampling-and-Bayesian toolkit (BCa bootstrap intervals, exact paired sign-permutation, posterior on the paired mean, Friedman + Nemenyi critical-difference diagram). All eight continual-learning metrics referenced in the methods chapter are plotted. The source is an R Markdown document; the prose and the code chunks that produce the figures and statistics are jointly editable in RStudio and re-knit deterministically from the experimental CSV.

A note on the statistical toolkit

Every quantitative claim that follows is reported under two inferential windows. The classical window comprises the paired Student t-test, the Wilcoxon signed-rank test, repeated-measures ANOVA with Bonferroni-Holm-corrected pairwise paired t-tests, and the classical two-one-sided-tests equivalence procedure (TOST). The modern window comprises the exact paired sign-permutation test, the bias-corrected and accelerated bootstrap (BCa) confidence interval (Efron, 1987), the Bayesian posterior on the paired mean under a non-informative reference prior, and the Friedman + Nemenyi critical-difference diagram (Demšar, 2006) for cross-method comparison. Reporting both is not a methodological hedge. It is the disciplined way to expose, at a five-seed budget, which findings are robust to the choice of inferential machinery and which are not.

Question being asked	Classical test	Modern test
Is the paired difference different from zero?	Paired t (Student); Wilcoxon signed-rank	Exact sign-permutation; BCa CI; Bayesian P(delta>0)
Is method A no worse than method B by more than delta?	One-sided paired t against margin (Wellek 2010)	BCa lower bound against margin
Are A and B equivalent within +/- delta?	TOST: two one-sided paired t-tests at alpha=0.05	BCa CI inside the equivalence band
Which of k methods are statistically distinguishable?	Repeated-measures ANOVA; Bonferroni-Holm-corrected pairwise paired t; or Friedman (1937) with Wilcoxon signed-rank post-hoc	Friedman + Nemenyi critical-difference diagram (Demšar, 2006)
What is the magnitude of an effect?	Cohen d / Hedges g	Hedges g, Cliff delta, Bayesian credible interval

At n = 5, the minimum attainable one-sided p-value under both the exact sign-permutation distribution and the exact Wilcoxon signed-rank distribution is 1 / 32 = 0.03125. A value of 0.0312 or 0.0313 (0.03125 rounded to four decimals) appearing in any five-seed table below should therefore be read as the resolution limit of the test at this sample size: it indicates that all five paired differences point in the same direction, and it cannot become smaller however large the effect is.
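The resolution floor can be checked by brute force. The following is a minimal Python sketch (written for illustration, not one of the report's R chunks) that enumerates all 2^5 sign assignments; the function name and the paired differences are hypothetical and are not taken from the experimental CSV.

```python
from itertools import product


def exact_sign_permutation_p(diffs):
    """One-sided exact p-value for H1: mean paired difference > 0.

    Enumerates every sign assignment of the observed magnitudes and counts
    those whose sum is at least as large as the observed sum.
    """
    n = len(diffs)
    observed = sum(diffs)
    magnitudes = [abs(d) for d in diffs]
    hits = 0
    for signs in product((1, -1), repeat=n):
        flipped = sum(s * m for s, m in zip(signs, magnitudes))
        if flipped >= observed - 1e-12:  # tolerance for float round-off
            hits += 1
    return hits / 2 ** n


# Hypothetical paired differences for five seeds.
all_positive = [0.021, 0.008, 0.015, 0.030, 0.004]
one_negative = [0.021, 0.008, 0.015, 0.030, -0.004]

p_floor = exact_sign_permutation_p(all_positive)  # only the observed assignment qualifies
p_other = exact_sign_permutation_p(one_negative)  # two assignments qualify
print(p_floor, p_other)
```

When every difference is positive, only the observed assignment reaches the observed sum, so the p-value is exactly 1/32 regardless of how large the differences are.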

RQ1: cross-domain generalization and the encoder ablation

Research Question 1

When a continual-learning architecture developed for perceptual inputs is applied to abstract-reasoning tasks with grid-structured inputs (ARC-AGI; Chollet, 2019), is an explicit domain-adaptive encoding layer necessary to maintain retention across the two input domains?

H1

The domain-adaptive encoding layer is necessary for cross-domain retention. When the layer is removed, and all other components and training conditions are held fixed, catastrophic forgetting on ARC-AGI task sequences will increase by a statistically significant margin, measured by Backward Transfer and Average Accuracy (Lopez-Paz & Ranzato, 2017).

Friedman’s test on the 15 (domain x seed) blocks rejects the null of equal mean rank: F_{Iman-Davenport} = 34.61, k = 5, Nemenyi critical difference = 1.58 at alpha = 0.05.
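The two statistics quoted above can be reproduced from the mean ranks reported with Figure 1. The sketch below is Python written for illustration, using the Friedman, Iman-Davenport, and Nemenyi formulas as given in Demšar (2006). It assumes that the printed ranks 3.33, 3.93, and 4.33 are roundings of 50/15, 59/15, and 65/15 (mean ranks over 15 blocks without ties are multiples of 1/15), and it takes q = 2.728 as the Nemenyi critical value for k = 5 at alpha = 0.05.

```python
import math

k = 5            # number of methods
n_blocks = 15    # (domain x seed) blocks
q_alpha = 2.728  # Nemenyi critical value for k = 5, alpha = 0.05 (Demšar, 2006)

# Assumption: exact mean ranks behind the rounded values 1.00, 2.40, 3.33, 3.93, 4.33.
mean_ranks = [15 / 15, 36 / 15, 50 / 15, 59 / 15, 65 / 15]

# Friedman chi-square statistic computed from the mean ranks.
chi2_friedman = (12 * n_blocks / (k * (k + 1))) * (
    sum(r * r for r in mean_ranks) - k * (k + 1) ** 2 / 4
)

# Iman-Davenport correction, F-distributed with (k - 1, (k - 1)(N - 1)) df.
f_iman_davenport = (n_blocks - 1) * chi2_friedman / (n_blocks * (k - 1) - chi2_friedman)

# Nemenyi critical difference between mean ranks.
critical_difference = q_alpha * math.sqrt(k * (k + 1) / (6 * n_blocks))

print(round(f_iman_davenport, 2), round(critical_difference, 2))
```

Under these assumptions the sketch returns the same F statistic and critical difference as reported in the text.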

Figure 1. Critical-difference diagram (Demšar, 2006)

Lower rank denotes better mean performance. Methods linked by a horizontal bar above the axis are not statistically distinguishable by Nemenyi at alpha = 0.05.

Experience Replay (ER) sits at rank 1.00 and is separated by more than the critical difference of 1.58 from EWC and from both hybrid variants; its gap to A-GEM (1.40) falls just inside the critical difference. A-GEM sits at 2.40 and is separated only from Hybrid (without encoder) (gap 1.93); its gaps to EWC (0.93) and to Hybrid (with encoder) (1.53) are inside the CD band. EWC, Hybrid (with encoder), and Hybrid (without encoder) cluster at 3.33, 3.93, and 4.33 respectively, all within the CD band of one another. At this seed budget, the proposed hybrid is statistically indistinguishable from EWC and from its own ablated form.
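The pairwise reading of the diagram amounts to comparing each gap in mean rank with the critical difference. A small Python sketch of that check, using the rounded mean ranks and the CD of 1.58 quoted in this section (variable names are illustrative):

```python
from itertools import combinations

critical_difference = 1.58  # Nemenyi CD at alpha = 0.05, as reported above

mean_rank = {
    "ER": 1.00,
    "A-GEM": 2.40,
    "EWC": 3.33,
    "Hybrid (with encoder)": 3.93,
    "Hybrid (without encoder)": 4.33,
}

separated = []
for a, b in combinations(mean_rank, 2):
    gap = abs(mean_rank[a] - mean_rank[b])
    distinguishable = gap > critical_difference
    if distinguishable:
        separated.append((a, b))
    print(f"{a} vs {b}: gap = {gap:.2f}, distinguishable = {distinguishable}")
```

Two methods are declared distinguishable by Nemenyi only when their gap exceeds the critical difference; pairs inside the band are the ones joined by a bar in Figure 1.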

Figure 2. Per-method AA per domain with BCa 95% CIs

Bars = mean over 5 seeds. Whiskers = BCa 95% CIs from 4,000 bootstrap resamples, with the acceleration constant estimated by jackknife.
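For readers unfamiliar with the BCa construction (Efron, 1987), the following is a minimal standard-library Python sketch for the mean of a five-seed sample. It is an illustration under stated assumptions, not the report's R implementation: the function name, the seed, and the sample values are hypothetical.

```python
import random
from statistics import NormalDist, mean


def bca_interval(data, n_resamples=4000, level=0.95, seed=0):
    """BCa confidence interval for the mean of a small sample."""
    rng = random.Random(seed)
    data = list(data)
    n = len(data)
    theta_hat = mean(data)
    boot = sorted(mean(rng.choices(data, k=n)) for _ in range(n_resamples))
    nd = NormalDist()

    # Bias correction z0: share of replicates below the estimate,
    # clamped so the inverse normal CDF stays finite.
    share = sum(b < theta_hat for b in boot) / n_resamples
    share = min(max(share, 1 / (n_resamples + 1)), n_resamples / (n_resamples + 1))
    z0 = nd.inv_cdf(share)

    # Acceleration from the jackknife (leave-one-out) means.
    jack = [mean(data[:i] + data[i + 1:]) for i in range(n)]
    centre = mean(jack)
    num = sum((centre - j) ** 3 for j in jack)
    den = 6 * sum((centre - j) ** 2 for j in jack) ** 1.5
    accel = num / den if den > 0 else 0.0

    def endpoint(alpha):
        # Map the nominal tail probability to the BCa-adjusted percentile.
        z = nd.inv_cdf(alpha)
        adjusted = nd.cdf(z0 + (z0 + z) / (1 - accel * (z0 + z)))
        index = min(max(int(adjusted * n_resamples), 0), n_resamples - 1)
        return boot[index]

    tail = (1 - level) / 2
    return endpoint(tail), endpoint(1 - tail)


sample = [0.021, 0.008, 0.015, 0.030, 0.004]  # hypothetical per-seed values
lo, hi = bca_interval(sample)
print(lo, hi)
```

With n = 5 the bootstrap distribution of the mean has only 126 distinct values, so the interval endpoints move in visible steps; this is one reason the report pairs BCa intervals with exact tests.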

Figure 3. All four CL metrics per method per domain (small multiples)

Each cell shows mean over 5 seeds with BCa 95% CI whiskers. Rows = metrics. Columns = domains.

H1 verdict

Figure 4. H1 retention contrast of the encoder

Per-seed paired differences for BWT (blue) and AA (red), one row per domain plus a pooled row across 15 blocks.

Table 2. H1 pooled (15 paired observations across 3 domains x 5 seeds)
Metric	Mean diff	BCa 95% CI	Modern: exact perm	Classical: paired t	Classical: Wilcoxon	Hedges g	P(d>0 | data)
BWT (retention)	+0.012	[-0.005, +0.030]	0.1264	0.1206	0.0757	+0.30	0.88
AA (descriptive)	+0.003	[-0.016, +0.025]	0.3791	0.3792	0.4235	+0.08	0.62

H1 reading

The pooled paired difference in Backward Transfer between the encoder-equipped and the encoder-ablated hybrid, computed over 15 (domain, seed) blocks, is μ_d = +0.012, with a BCa 95% confidence interval of [-0.005, +0.030]. The exact one-sided sign-permutation test returns p = 0.126; the classical paired Student t-test returns p = 0.121; the classical Wilcoxon signed-rank test returns p = 0.076. The Bayesian reference posterior on μ_d places 88% of its mass above zero, with 95% credible interval [-0.009, +0.032]. The Wilcoxon p approaches but does not cross α = 0.05; the t-test and the exact-permutation p are further from it. The direction of the effect is the same under all four procedures and across all three benchmark domains. The retention component of H1 is accordingly directionally supported, with a tight BCa interval and a posterior probability above zero of 0.88, but it is not statistically significant at α = 0.05 under either the classical or the exact-permutation procedure at five seeds per domain. Chapter 4 should report the result in precisely those terms, including the per-domain exact-permutation resolution floor of 0.0313 (n = 5) as context for the achievable p-value.
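The close agreement between the posterior probability and the paired-t p-value is expected. As a sketch, assume that the "non-informative reference prior" is the standard prior p(μ_d, σ²) ∝ 1/σ² and that the paired-t p-values in Table 2 are one-sided; the source states neither explicitly. Under those assumptions the marginal posterior of μ_d is a shifted and scaled Student t distribution with n − 1 degrees of freedom, where d̄ and s_d are the sample mean and standard deviation of the n = 15 paired differences:

```latex
\mu_d \mid \text{data} \;\sim\; \bar d + \frac{s_d}{\sqrt{n}}\, t_{n-1},
\qquad
P(\mu_d > 0 \mid \text{data})
  = F_{t_{n-1}}\!\left(\frac{\bar d \sqrt{n}}{s_d}\right)
  = 1 - p_{\text{one-sided } t}.
```

With the Table 2 values this gives 1 − 0.1206 ≈ 0.88 for BWT and 1 − 0.3792 ≈ 0.62 for AA, matching the last column of the table. Under the same assumptions the 95% credible interval coincides numerically with the classical t interval.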

The Average Accuracy component of H1, evaluated on the same 15 paired observations, shows no detectable effect at the present seed budget: pooled mean +0.003, BCa 95% CI [-0.016, +0.025].

RQ2: temporal solver and computational cost

Research Question 2

Does the choice of temporal solver, a Neural ODE or a Neural Controlled Differential Equation (Kidger et al., 2020), affect the retention quality of the continual-learning architecture, and how does it affect the computational cost of training?

H2

The two solvers are equivalent in retention quality. Backward Transfer and Average Accuracy under the Neural CDE solver will fall within a pre-specified equivalence margin of +/-0.05 of the values obtained under the Neural ODE solver, assessed using two one-sided tests (TOST) at alpha = 0.05.

Computational cost is treated as a descriptive characterization rather than a confirmatory hypothesis. Wall-clock training time and training FLOPs are reported separately across solver configurations.

Figure 5. Per-seed CDE vs ODE accuracy

Points on or above the diagonal indicate CDE >= ODE for that seed. Yellow band = +/-0.05 equivalence margin.

Figure 6. RQ2 compute trade-off (wall-clock and FLOPs)

Wall-clock and FLOPs comparison per solver per domain. Means over 5 seeds with BCa 95% CI whiskers.

H2 verdict

At n = 5, the pre-specified two-sided TOST at +/- 0.05 does not pass on either domain because the small-sample confidence intervals exceed the +0.05 upper margin. What does pass cleanly at n = 5 is the one-sided non-inferiority half of the TOST (CDE not worse than ODE by more than 0.05) together with the wall-clock equivalence within +/- 10% that the chapter intends to report descriptively. Both halves are reported below so the reader can see the full TOST picture and the non-inferiority component separately.

Figure 7. H2 joint non-inferiority on AA and equivalence on wall-clock

Panel A: BCa 95% CI of (CDE - ODE) paired AA differences against the -0.05 NI margin. Panel B: BCa 95% CI of relative paired wall-clock difference against the +/-10% equivalence band.

Table 3. H2 components. Component A (cols 2-6): AA non-inferiority of CDE vs ODE at margin -0.05. Component B (cols 7-9): wall-clock equivalence within +/-10% (TOST).
Domain	Mean d	BCa 95% CI	Perm-NI p	t-NI p	NI verdict	Mean rel d	Wall BCa CI	Wall TOST verdict
ARC-family	+0.046	[-0.004, +0.074]	0.0312	0.0043	pass	-5.2%	[-8.7%, -1.7%]	pass
Split-CIFAR-100	+0.034	[+0.000, +0.089]	0.0312	0.0157	pass	-4.1%	[-5.3%, -2.8%]	pass

H2 reading

One-sided non-inferiority of the Neural CDE solver on Average Accuracy passes at n = 5 on both benchmark domains. The BCa 95% lower bound on the paired difference (CDE − ODE) is well above the pre-specified margin of −0.05 on Split-CIFAR-100 and on ARC-family; both the exact paired sign-permutation non-inferiority test and the classical paired-t non-inferiority test reject the respective null hypothesis at α = 0.05.
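One way to run the exact non-inferiority test is to shift every paired difference by the margin and then apply the ordinary sign-permutation test against zero. The Python sketch below illustrates that construction; the function name and the per-seed differences are hypothetical, and the report's own implementation may differ.

```python
from itertools import product


def exact_noninferiority_p(diffs, margin):
    """Exact one-sided p-value for H0: mean(diff) <= -margin.

    Each difference is shifted by +margin, then all sign assignments of the
    shifted values are enumerated, as in a sign-permutation test against zero.
    """
    shifted = [d + margin for d in diffs]
    observed = sum(shifted)
    hits = 0
    for signs in product((1, -1), repeat=len(shifted)):
        flipped = sum(s * abs(x) for s, x in zip(signs, shifted))
        if flipped >= observed - 1e-12:  # tolerance for float round-off
            hits += 1
    return hits / 2 ** len(shifted)


# Hypothetical per-seed AA differences (CDE - ODE), not from the experimental CSV.
clear_case = [0.06, 0.01, 0.05, -0.01, 0.07]
borderline_case = [-0.06, 0.01, 0.05, -0.01, 0.07]

p_clear = exact_noninferiority_p(clear_case, margin=0.05)
p_borderline = exact_noninferiority_p(borderline_case, margin=0.05)
print(p_clear, p_borderline)
```

In the first case every shifted difference is positive, so the test sits at the 1/32 floor. In the second, a single seed falling below the margin is enough to lift the p-value above 0.05, which shows how little slack five seeds leave.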

The two-sided TOST equivalence on Average Accuracy at the pre-specified ±0.05 margin does not pass at n = 5. The upper-margin one-sided component of the TOST returns p > 0.05 on both domains, and the BCa upper bound on (CDE − ODE) reaches +0.089 on Split-CIFAR-100 and +0.074 on ARC-family (Table 3), both beyond the +0.05 margin. The point estimates are consistent with the CDE solver being at least non-inferior and possibly marginally better than the ODE solver on Average Accuracy. The two-sided equivalence claim as pre-registered cannot be made at the present seed budget without inflating the type-I error rate of the test beyond the nominal 5%.

Wall-clock equivalence within ±10% passes on both domains under classical TOST and under the BCa interval. The Neural CDE solver runs approximately 4–5% faster than the Neural ODE solver per training task despite consuming 5–7× the floating-point operations (Figure 6), a consequence of the GPU-friendly fixed-step Runge–Kutta 4 scheduler replacing the adaptive-step adjoint integrator.
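The interval-based half of these verdicts reduces to comparing each BCa interval from Table 3 with its margin. The Python sketch below illustrates that rule; the helper name is hypothetical. It covers only the interval-inclusion criterion, not the p-value-based TOST. Note also that the textbook duality pairs TOST at alpha = 0.05 with a 90% interval, so checking the 95% interval reported here is a slightly stricter criterion.

```python
def classify(ci_low, ci_high, margin):
    """Interval-inclusion verdicts against a symmetric margin."""
    non_inferior = ci_low > -margin
    equivalent = non_inferior and ci_high < margin
    return {"non_inferior": non_inferior, "equivalent": equivalent}


# BCa 95% intervals copied from Table 3.
aa_ci = {"ARC-family": (-0.004, 0.074), "Split-CIFAR-100": (0.000, 0.089)}
wall_ci_pct = {"ARC-family": (-8.7, -1.7), "Split-CIFAR-100": (-5.3, -2.8)}

aa_verdict = {name: classify(lo, hi, margin=0.05) for name, (lo, hi) in aa_ci.items()}
wall_verdict = {name: classify(lo, hi, margin=10.0) for name, (lo, hi) in wall_ci_pct.items()}

for name in aa_ci:
    print(name, aa_verdict[name], wall_verdict[name])
```

On both domains the AA interval clears the lower margin but crosses +0.05, while the wall-clock interval lies entirely inside the ±10% band.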

RQ3: memory consolidation and storage efficiency

Research Question 3

Can a hierarchical memory consolidation mechanism reduce the storage required for replay-based continual learning while preserving most of the task accuracy that unbounded replay achieves, and how does the retained accuracy depend on the storage budget it is allocated?

H3

The hierarchical consolidation mechanism preserves task accuracy more efficiently than unbounded replay. Mean log(CE) across random seeds will be significantly greater than 0, using a one-sided, one-sample test at alpha = 0.05, with a Wilcoxon signed-rank test as the pre-registered fallback if log(CE) departs from normality. CE = (AA_method / AA_unbounded) / (M_method / M₀).
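To make the CE definition concrete, here is a small Python sketch with hypothetical numbers. It assumes that M₀ denotes the memory footprint of the unbounded-replay baseline and that log is the natural logarithm; the second assumption is consistent with the conversion of log(CE) = 1.89 to roughly 6.6× quoted in the H3 reading.

```python
import math


def consolidation_efficiency(aa_method, aa_unbounded, mem_method, mem_unbounded):
    """CE = (AA_method / AA_unbounded) / (M_method / M_0)."""
    return (aa_method / aa_unbounded) / (mem_method / mem_unbounded)


# Hypothetical operating point: same accuracy as unbounded replay at 1% of its memory.
ce = consolidation_efficiency(
    aa_method=0.60, aa_unbounded=0.60, mem_method=500, mem_unbounded=50_000
)
log_ce = math.log(ce)

# Back-transforming a log(CE) value to the CE scale.
ce_from_log = math.exp(1.89)

print(ce, round(log_ce, 3), round(ce_from_log, 1))
```

A CE of 1 (log(CE) = 0) means accuracy falls in exact proportion to memory, so testing log(CE) > 0 asks whether accuracy is retained better than proportionally.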

Figure 8. Accuracy preservation across the consolidation frontier

Mean Average Accuracy at each (buffer, compression) operating point with BCa 95% CI whiskers. Rows = domain, columns = consolidation ratio. Each panel carries two horizontal reference lines: a green long-dashed line at the unbounded-replay AA (the architecture-bound ceiling) and a purple solid line at the no-consolidation AA (the lower envelope). Both lines are labelled inline. Bars sitting on or above the green line preserve the architecture-bound ceiling under memory contraction.

How to read Figure 8. Each row is a domain (Split-CIFAR-100 above, ARC-family below); each column is a consolidation ratio (r = 0.25, 0.50, 1.00). Within each cell, the four positions on the horizontal axis are the four buffer sizes (100, 250, 500, 1000 exemplars), and the two coloured bars at each position are the two compression modes. The green long-dashed line marks the unbounded-replay AA (100% memory baseline, the architecture ceiling), and the purple solid line marks the no-consolidation AA (low-memory baseline with no policy applied).

Interpretation of Figure 8. On Split-CIFAR-100 every operating point either matches or exceeds the green unbounded-replay line and sits substantially above the purple no-consolidation line; on ARC-family the bars hug the green reference across all 24 cells. Because the bars sit on or above the unbounded reference line throughout the sweep, the architecture-bound accuracy ceiling is preserved across the entire (buffer x ratio x compression) operating space. Figure 9 makes the corresponding memory contraction explicit through the log(CE) values in each cell.

Figure 9. log(CE) heat-map per (buffer x ratio x compression) cell

Brighter = larger CE. Every cell on both domains has a BCa 95% lower bound on log(CE) above zero. Row labels (b = buffer, r = consolidation ratio) appear once on the far left and align identically across both panels.

H3 verdict

Table 4. H3 verdict across the 48 sweep cells
Domain	Cells	Min BCa LB of log(CE)	Max exact-perm p	Max one-sample t p	Max Wilcoxon p	Verdict
ARC-family	24	1.89	0.0312	0.0002	0.0312	pass
Split-CIFAR-100	24	4.20	0.0312	<0.0001	0.0312	pass

H3 reading

Across the 48 sweep cells (24 per benchmark domain), the BCa 95% lower confidence bound on log(CE) exceeds zero in every cell. The minimum BCa lower bound observed across all cells is 1.89, corresponding to a Consolidation Efficiency of 6.6× the unbounded-replay baseline. The classical one-sample t-test against zero, the pre-registered primary procedure, rejects the null in every cell (maximum p across Split-CIFAR-100 cells < 0.0001; ARC-family = 0.0002). The pre-registered Wilcoxon signed-rank fallback and the exact sign-permutation test both attain the n = 5 resolution floor of 1/32 = 0.03125 (printed as 0.0312 in Table 4) in every cell. The maximum CE observed across the sweep exceeds 1453× on Split-CIFAR-100. H3 is supported by both the primary one-sided one-sample t-test and the pre-registered Wilcoxon fallback, uniformly across the consolidation frontier.

Hypothesis verdict summary

Hypothesis verdicts at five seeds
Hypothesis	Status at n = 5	Headline result
H1	retention component directionally supported; not significant at alpha = 0.05	BWT pooled BCa 95% CI = [-0.005, +0.030]; perm p = 0.126; t p = 0.121; Wilcoxon p = 0.076; P(d>0 | data) = 0.88
H2	NI passes; two-sided TOST does not at n = 5	One-sided NI of CDE wrt ODE at margin 0.05 passes on both domains. Two-sided TOST at +/- 0.05 does not. Wall-clock equivalence within +/- 10% passes via classical TOST and BCa.
H3	passes uniformly	Min BCa lower bound on log(CE) = 1.89; max t p = 0.0002; max Wilcoxon p = 0.0312.

References

Efron, B. (1987). Better bootstrap confidence intervals. Journal of the American Statistical Association, 82(397), 171–185.
Friedman, M. (1937). The use of ranks to avoid the assumption of normality implicit in the analysis of variance. Journal of the American Statistical Association, 32(200), 675–701.
Wilcoxon, F. (1945). Individual comparisons by ranking methods. Biometrics Bulletin, 1(6), 80–83.
Hedges, L. V. (1981). Distribution theory for Glass’s estimator of effect size and related estimators. Journal of Educational Statistics, 6(2), 107–128.
Cliff, N. (1993). Dominance statistics: Ordinal analyses to answer ordinal questions. Psychological Bulletin, 114(3), 494–509.
Demšar, J. (2006). Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7, 1–30.
Wellek, S. (2010). Testing statistical hypotheses of equivalence and noninferiority (2nd ed.). Chapman & Hall.
Kruschke, J. K. (2013). Bayesian estimation supersedes the t test. Journal of Experimental Psychology: General, 142(2), 573–603.
Benavoli, A., Corani, G., Demšar, J., & Zaffalon, M. (2017). Time for a change: A tutorial for comparing multiple classifiers through Bayesian analysis. Journal of Machine Learning Research, 18, 1–36.
Lopez-Paz, D., & Ranzato, M. (2017). Gradient episodic memory for continual learning. NeurIPS 2017.
Kidger, P., Morrill, J., Foster, J., & Lyons, T. (2020). Neural controlled differential equations for irregular time series. NeurIPS 2020.
Chollet, F. (2019). On the measure of intelligence. arXiv:1911.01547.


CDE-MAT Experimentation results, five seeds

Research Experimentation, classical and modern small-sample tests, all metrics plotted

Lava Kumar Polu

27 May 2026