Why Statistics Meets Algorithms?

  • Algorithms have theoretical time complexity: \(O(n)\), \(O(n \log n)\), \(O(n^2)\), …
  • But in practice, real runtimes vary due to:
    • CPU cache effects
    • Input distribution (random, sorted, reversed)
    • Hardware noise
  • Statistics lets us model, compare, and predict this behavior rigorously
  • Goal: use regression and hypothesis testing to validate and estimate algorithmic complexity from empirical data

The Complexity Model

For an algorithm with theoretical complexity \(f(n)\), we model actual runtime \(T(n)\) as:

\[T(n) = c \cdot f(n) + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \sigma^2)\]

To linearize, we apply a log-log transform. For power-law growth \(T(n) = c \cdot n^k\):

\[\log T(n) = \log c + k \cdot \log n\]

This gives us a simple linear regression problem: \(Y = \beta_0 + \beta_1 X + \varepsilon\), where \(Y = \log T(n)\), \(X = \log n\), and the slope \(\hat{\beta}_1\) estimates the complexity exponent \(k\).

Simulating Algorithm Runtimes

set.seed(42)
n_vals <- seq(100, 5000, by = 100)

# Simulate runtimes: T = c * n^k + noise
sim_runtime <- function(n, k, c = 1e-7, noise_sd = 0.05) {
  true_time <- c * n^k
  true_time * exp(rnorm(length(n), 0, noise_sd))
}

df <- data.frame(
  n    = rep(n_vals, 3),
  time = c(sim_runtime(n_vals, 1),
           sim_runtime(n_vals, 1.585),   # ~ n log n
           sim_runtime(n_vals, 2)),
  algo = rep(c("O(n)", "O(n log n)", "O(n²)"), each = length(n_vals))
)

Visualizing Runtime Growth

Log-Log Linearization

Simple Linear Regression — Estimating the Exponent

Fit \(\log T = \beta_0 + \beta_1 \log n\). The slope \(\hat{\beta}_1\) estimates the complexity exponent:

\[\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n}(X_i - \bar{X})^2}\]

Regression results — slope ≈ theoretical exponent
Algorithm β̂₁ (slope) β̂₀ (intercept)
O(n log n) 1.5810 -16.0829 0.9989
O(n) 0.9844 -16.0016 0.9959
O(n²) 1.9902 -16.0512 0.9993

Hypothesis Testing the Exponent

We test if \(\hat{\beta}_1\) significantly matches the theoretical value \(k_0\):

\[H_0: \beta_1 = k_0 \quad H_1: \beta_1 \neq k_0 \qquad t = \frac{\hat{\beta}_1 - k_0}{SE(\hat{\beta}_1)}, \quad t \sim t_{n-2}\]

Fail to reject H₀ in all cases (p >> 0.05) — exponents confirmed
Algorithm β̂₁ k₀ t p-value
log_n…1 O(n) 0.9844 1.000 -1.7209 0.0917
log_n…2 O(n log n) 1.5810 1.585 -0.5322 0.5970
log_n…3 O(n²) 1.9902 2.000 -1.3291 0.1901

3D Surface: Runtime as a Function of n and Noise

Prediction Interval for Runtime

For a new observation at \(x^*\), the 95% prediction interval is:

\[\hat{y}^* \pm t_{\alpha/2,\, n-2} \cdot \hat{\sigma} \sqrt{1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{\sum(x_i - \bar{x})^2}}\]

Key Takeaways

  • Log-log regression is a powerful tool for empirically validating algorithmic complexity
  • Slope \(\hat{\beta}_1 \approx k\) confirms the theoretical exponent with statistical rigor
  • Hypothesis tests give formal evidence that measured data matches expected complexity
  • Prediction intervals quantify uncertainty in runtime estimates for new inputs
  • This framework is used in benchmarking, capacity planning, and performance SLAs in industry

“In God we trust; all others bring data.” — W. Edwards Deming