Statistical Analysis of Algorithm Runtime Complexity

Why Statistics Meets Algorithms?

Algorithms have theoretical time complexity: \(O(n)\), \(O(n \log n)\), \(O(n^2)\), …
But in practice, real runtimes vary due to:
- CPU cache effects
- Input distribution (random, sorted, reversed)
- Hardware noise
Statistics lets us model, compare, and predict this behavior rigorously
Goal: use regression and hypothesis testing to validate and estimate algorithmic complexity from empirical data

The Complexity Model

For an algorithm with theoretical complexity \(f(n)\), we model actual runtime \(T(n)\) as:

\[T(n) = c \cdot f(n) + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \sigma^2)\]

To linearize, we apply a log-log transform. For power-law growth \(T(n) = c \cdot n^k\):

\[\log T(n) = \log c + k \cdot \log n\]

This gives us a simple linear regression problem: \(Y = \beta_0 + \beta_1 X + \varepsilon\), where \(Y = \log T(n)\), \(X = \log n\), and the slope \(\hat{\beta}_1\) estimates the complexity exponent \(k\).

Simulating Algorithm Runtimes

set.seed(42)
n_vals <- seq(100, 5000, by = 100)

# Simulate runtimes: T = c * n^k + noise
sim_runtime <- function(n, k, c = 1e-7, noise_sd = 0.05) {
  true_time <- c * n^k
  true_time * exp(rnorm(length(n), 0, noise_sd))
}

df <- data.frame(
  n    = rep(n_vals, 3),
  time = c(sim_runtime(n_vals, 1),
           sim_runtime(n_vals, 1.585),   # ~ n log n
           sim_runtime(n_vals, 2)),
  algo = rep(c("O(n)", "O(n log n)", "O(n²)"), each = length(n_vals))
)

Visualizing Runtime Growth

Log-Log Linearization

Simple Linear Regression — Estimating the Exponent

Fit \(\log T = \beta_0 + \beta_1 \log n\). The slope \(\hat{\beta}_1\) estimates the complexity exponent:

\[\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n}(X_i - \bar{X})^2}\]

Regression results — slope ≈ theoretical exponent
Algorithm	β̂₁ (slope)	β̂₀ (intercept)	R²
O(n log n)	1.5810	-16.0829	0.9989
O(n)	0.9844	-16.0016	0.9959
O(n²)	1.9902	-16.0512	0.9993

Hypothesis Testing the Exponent

We test if \(\hat{\beta}_1\) significantly matches the theoretical value \(k_0\):

\[H_0: \beta_1 = k_0 \quad H_1: \beta_1 \neq k_0 \qquad t = \frac{\hat{\beta}_1 - k_0}{SE(\hat{\beta}_1)}, \quad t \sim t_{n-2}\]

Fail to reject H₀ in all cases (p >> 0.05) — exponents confirmed
	Algorithm	β̂₁	k₀	t	p-value
log_n…1	O(n)	0.9844	1.000	-1.7209	0.0917
log_n…2	O(n log n)	1.5810	1.585	-0.5322	0.5970
log_n…3	O(n²)	1.9902	2.000	-1.3291	0.1901

3D Surface: Runtime as a Function of n and Noise

Prediction Interval for Runtime

For a new observation at \(x^*\), the 95% prediction interval is:

\[\hat{y}^* \pm t_{\alpha/2,\, n-2} \cdot \hat{\sigma} \sqrt{1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{\sum(x_i - \bar{x})^2}}\]

Key Takeaways

Log-log regression is a powerful tool for empirically validating algorithmic complexity
Slope \(\hat{\beta}_1 \approx k\) confirms the theoretical exponent with statistical rigor
Hypothesis tests give formal evidence that measured data matches expected complexity
Prediction intervals quantify uncertainty in runtime estimates for new inputs
This framework is used in benchmarking, capacity planning, and performance SLAs in industry

“In God we trust; all others bring data.” — W. Edwards Deming