
1 Introduction

Let \(F = \{b_1, \cdots, b_d \}\) be a padded, doubly erasure-coded file made of \(d\) blocks such that:

  1. the slot-level code ratio is \(0.5\);
  2. the network-level code ratio is \(m/k\), such that \(k + m = n\) and \((k + m) \times s = n \times s = d\) for some \(s\).

Codex partitions \(F\) into \(n\) equally-sized subsets \(S_i\) (our slots), such that \(\left|S_i\right| = d/n = s\). The sets \(S_i\) are then distributed to \(n\) different nodes – one set per node – and any \(k\) out of \(n\) of these sets can be used to reconstruct \(F\).
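As a quick illustration of this arithmetic (the concrete values of \(k\), \(m\) and \(s\) below are made up):

k <- 10; m <- 10; s <- 4   # illustrative parameters

n <- k + m                 # number of slots (and nodes)
d <- n * s                 # total number of blocks in F
network_code_ratio <- m / k
stopifnot(d / n == s)      # each slot S_i holds s = d/n blocks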

We refer to the set of nodes which store the partitions \(S_i\) of \(F\) as the storage set of \(F\). We can associate a state function \(o_i(t)\) with each node \(i\) such that, at any given time \(t\):

\[ \begin{equation} o_i(t) = \begin{cases} 1, & \text{if}\ i\ \text{is holding the data} \\ 0, & \text{otherwise} \end{cases} \end{equation} \] We can then define the number of active storage nodes for \(F\) at time \(t\), \(a(F,t)\), as:

\[ a(F, t) = \sum_{1 \leq i \leq n} o_i(t) \] It is clear then that data loss occurs when \(a(F, t) < k\). In particular, we are interested in the instant \(t_l\) at which it occurs:

\[ t_l = \min \left\{t \mid a(F,t) < k\right\} \]
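To make these definitions concrete, here is a minimal simulation sketch; the loss process (geometric, with made-up rate and parameters) is purely illustrative:

set.seed(42)

n <- 20; k <- 10   # illustrative storage-set parameters
t_max <- 1000      # simulation horizon, in time periods

# Each node drops the data at a random instant; o[i, t] = 1 while
# node i still holds it.
loss_instants <- rgeom(n, prob = 0.005) + 1
o <- sapply(1:t_max, function(t) as.integer(loss_instants > t))

a <- colSums(o)          # a(F, t) for t = 1, ..., t_max
t_l <- which(a < k)[1]   # first instant at which F is lost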

We do not, in general, know the value of \(a(F, t)\) for an arbitrary \(t\). Instead, we observe its value only at specific instants, and those instants are governed by a sampling process.

2 Time-to-Detection

Detection relies on proof deadlines being missed on-chain. In that sense, the blockchain acts as an observer which eventually gets to see proof samples as they are submitted. The order in which these samples are observed depends both on node/network conditions and, more importantly, on the nature of the underlying random process that generates samples for the various \(o_i\). Observed samples can be used to build an approximation of \(a(F, t)\), which we denote \(\hat{a}(F, t)\). We define the drift \(d(t)\) as:

\[ d(t) = \left|a(F, t) - \hat{a}(F, t)\right| \]
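One way to make the drift concrete is to assume that the observer only learns of each node's loss after a random delay; continuing the sketch above (the delay model is, again, purely an illustrative assumption):

# The report of each loss reaches the observer after a random delay,
# so the observer keeps counting node i as live until
# loss_instants[i] + delays[i].
delays <- rgeom(n, prob = 0.01) + 1
a_hat <- sapply(1:t_max, function(t) sum(loss_instants + delays > t))

drift <- abs(a - a_hat)   # d(t)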

We can make some relatively mild assumptions to draw conclusions about \(d(t)\) and, equivalently, about the time it takes to learn about data loss. Namely, we assume that loss is stable; i.e., once a node loses data, it does not recover it. In other words, we assume that:

Stability of loss. For every \(o_i\), there is a time instant \(t_{l_i}\) such that:

  1. \(o_i(t_{l_i}) = 1\);
  2. \(o_i(t) = 0\) for all \(t > t_{l_i}\).

Although these may look like very strong assumptions, one could envision that losses happen at a given rate until recovery is triggered, at which point we simply start a new, identical process.

This stability assumption means that the ordering in which we observe the samples from individual nodes does not really matter. I won’t be overly formal about this: just notice that \(a(F, t)\) is monotonically decreasing, so our approximation will eventually reflect its true value which, in the worst case, will happen when both \(a\) and \(\hat{a}\) are zero. Were loss not stable, some reorderings could lead us to the incorrect conclusion that \(F\) has been lost when in fact it hasn’t.

2.1 Learning the Value of \(a(F, t)\)

We will look at a simpler problem at first. Let \(t_k\) be an arbitrary time instant, and let \(a(F, t_k) = l_k\) be the number of nodes that are live at instant \(t_k\). How long should we expect to wait before we see \(\hat{a}(F, t) \leq l_k\)?

A useful metaphor would be to imagine we have some object we want to observe at a given instant (i.e., we want to know \(a(F, t_k)\)) but that the time it takes for a ray of light to travel from the object to us is stochastic (Fig. 2.1).

Figure 2.1: Stochastic “rays of light”.

In our case, our “stochastic rays of light” are the samples that we collect from each node, and the “time of departure” for each ray of light is the time instant of the last renewal (the last proof). And here we can make a distinction between memoryless and non-memoryless renewal processes. If the inter-arrival distribution is memoryless, then we can look at the system as if all rays departed at the same time, and our metaphor works perfectly. In that case, if we take \(\{X_1, \cdots, X_n\}\) as the set of random variables denoting the travel time of each ray of light, it should be clear that:

\[ X_{\max} = \max \left\{X_1, \cdots, X_n \right\} \]

should capture the distribution of how long it takes for us to have a full view of what is going on. In other words, it should capture how long it takes for us to learn the value of \(a(F, t)\). Since the \(X_i\) are iid in the memoryless case, we can use:

\[ P(X_{\max} \leq t) = \prod_{i=1}^n P(X_i \leq t) \tag{2.1} \]

Assuming the \(X_i\) are geometrically distributed with identical parameter \(p\), we get:

\[ P(X_{\max} \leq t) = \left(1 - (1 - p)^t\right)^n = F(t) \] If we want to derive the quantile function for \(F\) – which we need to establish things like medians and percentiles – we need to compute \(F^{-1}(y)\):

\[ y = \left(1 - (1 - p)^t\right)^n \iff \\ 1 - y^{1/n} = (1 - p)^t \iff \\ t \ln (1 - p) = \ln (1 - y^{1/n}) \iff \\ t = F^{-1}(y) = \frac{\ln (1 - y^{1/n})}{\ln (1 - p)} \]
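Before moving on, a quick sanity check of these closed forms against simulation (the parameters here are made up; note that R’s rgeom counts failures before the first success, hence the + 1):

set.seed(1)

p <- 0.01; n <- 10

# Empirical distribution of the time to hear from all n nodes.
x_max <- replicate(10000, max(rgeom(n, p) + 1))

# Closed-form CDF vs. simulation at t = 500...
(1 - (1 - p)^500)^n               # ~0.94
mean(x_max <= 500)                # should be close

# ...and the closed-form median vs. the empirical one.
log(1 - 0.5^(1/n)) / log(1 - p)   # F^{-1}(0.5), ~269 periods
median(x_max)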

2.2 Hypothetical Scenarios

library(dplyr)

# Quantile function for the geometric case, curried on the parameter p.
F_inv_y <- function(p) function(y, n) log(1 - y**(1/n)) / log(1 - p)

compute_quantiles <- function(F_inv_y) {
  percentiles <- c(0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95)
  slots <- c(5, 10, 20, 40, 80, 160)

  # One row per (percentile, number of slots) pair, with quantiles in hours.
  lapply(
    slots,
    function(n_slots) tibble(
      percentile = percentiles,
      quantile = periods_to_hours(F_inv_y(percentiles, n_slots)),
      slots = n_slots
    )
  ) |>
    bind_rows() |>
    mutate(slots = as.factor(slots))
}

Let us assume an Ethereum-like block throughput, so that we get one new block hash every \(12\) seconds. This means we do \(7200\) draws a day. If we want nodes to produce an average of one proof per day, then we must set \(p\) to \(1/7200\).

library(ggplot2)

periods_per_day <- (3600 / 12) * 24   # 7200 periods per day
periods_to_hours <- function(periods) (periods * 12) / 3600
quantiles_geom <- compute_quantiles(F_inv_y(1 / periods_per_day))
quantile_plot <- function(quantiles) {
  ggplot(quantiles) +
    geom_line(aes(x = quantile, y = percentile, col = slots)) +
    theme_minimal() +
    ylab('fraction of observations which take at most x hours') +
    xlab('observation time (hours)')
}
plotly::ggplotly(quantile_plot(quantiles_geom))

3 Proof Frequencies

Another way to look at the same problem is to ask ourselves what proof frequency would be required to ensure that we learn about a failure condition within a certain time bound. For instance, if we are erasure coding our file with the usual parameters \(k\) and \(m\), we may be interested in knowing what sort of proof frequency would be required if we wanted to learn, within \(24\) hours, that \(50\%\) of our redundancy is gone, i.e. that we lost \(1/2 \times m\) nodes.

It turns out that with geometric inter-arrivals we can actually answer that question – we just need to introduce another parameter \(\gamma\), \(0 \leq \gamma \leq 1\), which represents the probability with which we want our detection bound to hold. We will now look for a way to derive the average inter-proof time \(\lambda\) such that we can detect, with probability \(\gamma\), that a certain percentage \(c_{\text{loss}}\) of our \(m\) redundant nodes has failed.

Recall that \(F(t)\), with \(n\) replaced by \(d \in \mathbb{N}\), provides us with the probability that we detect \(d\) failures within \(t\) time instants. But that is precisely \(\gamma\). Since the average inter-proof time is just \(1/p\) for a geometric with parameter \(p\), we can parameterize the equation so that \(d = \lfloor c_{\text{loss}} \times m \rfloor\), and then solve for \(p\):

\[ \gamma = \left(1 - (1 - p)^t\right)^{\lfloor c_{\text{loss}} \times m \rfloor} \iff \\ 1 - \gamma^{1/\lfloor c_{\text{loss}} \times m \rfloor} = (1 - p)^t \iff \\ \left(1 - \gamma^{1/\lfloor c_{\text{loss}} \times m \rfloor} \right)^{1/t} = 1 - p \iff \\ p = 1 - \left(1 - \gamma^{1/\lfloor c_{\text{loss}} \times m \rfloor} \right)^{1/t} \]

which yields:

\[ \lambda = \frac{1}{1 - \left(1 - \gamma^{1/\lfloor c_{\text{loss}} \times m \rfloor } \right)^{1/t}} \]
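As a worked example (values picked for illustration): with \(\gamma = 0.99\), \(c_{\text{loss}} = 0.5\), \(m = 20\) and \(t = 7200\) periods (one day at \(12\)-second blocks), nodes would need to prove roughly once every \(3.5\) hours:

gamma <- 0.99; c_loss <- 0.5; m <- 20; t <- 7200  # illustrative values
d <- floor(c_loss * m)
p <- 1 - (1 - gamma^(1/d))^(1/t)
1 / p                # lambda: ~1043 periods between proofs
(1 / p) * 12 / 3600  # ~3.5 hours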

One interesting thing to note is that \(\lim_{m \rightarrow \infty} 1 - \left(1 - \gamma^{1/\lfloor c_{\text{loss}} \times m \rfloor } \right)^{1/t} = 1\) (as \(m\) grows, \(\gamma^{1/\lfloor c_{\text{loss}} \times m \rfloor} \rightarrow 1\), so \(p \rightarrow 1\)), which means that performance will degrade to \(1\) proof at every time period once the network gets very large.
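We can see the trend numerically (same illustrative parameters as above, with growing \(m\)):

lambda_of_m <- function(m, gamma = 0.99, c_loss = 0.5, t = 7200) {
  1 / (1 - (1 - gamma^(1/floor(c_loss * m)))^(1/t))
}
# The required average inter-proof time shrinks as the redundancy set
# grows, approaching 1 period in the limit.
sapply(c(10, 100, 1000, 10000), lambda_of_m)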

Now let us look at a range of parameters and see how that works.

library(tidyr)

# Expected inter-proof time (in periods) required to detect
# floor(c_loss * m) failures within t periods with probability gamma.
lambda <- function(c_loss, gamma, m, t) {
  1 / (1 - (1 - gamma**(1/floor(c_loss * m)))**(1/t))
}

results <- expand_grid(
  c_loss = 0.5,
  t = c(periods_per_day, 2 * periods_per_day),
  m = c(3, 5, 10, 20, 30, 40, 80, 160),
  gamma = c(0.7, 0.8, 0.9, 0.95, 0.99)
) |>
  rowwise() |>
  mutate(lambda = periods_to_hours(lambda(c_loss, gamma, m, t)))
plotly::ggplotly(
  ggplot(results) +
    geom_line(
      aes(y = lambda, x = m, colour = as.factor(gamma))
    ) +
    facet_wrap(
      . ~ t, 
      labeller = labeller(
        t = function(x) paste("bound (t): ", as.integer(x) * 12 / 3600, 'h')
      )
    ) + 
    ylab("average inter-proof time to detect half of losses (hours)") +
    xlab("size of redundancy set") +
    labs(colour = "detection probability") +
    theme_minimal()
)

3.1 Efficiency of the Geometric Distribution

One of my suspicions is that the long-tailed geometric distribution is inefficient because it allows inter-proof times that are too large, forcing us to use small averages to squeeze the tail when trying to guarantee bounds on observation times.

Let us start with a simple exercise using a set of uniform distributions. Those are not memoryless, but we can get a sense of whether or not we are on the right track by looking at the restricted case in which our “rays of light” depart at the same time1, so that we can reuse Eq. (2.1). Assuming a discrete uniform distribution with support on \([a, b]\), we have that the quantile function is (from Eq. (2.1)):

\[ y = \left(\frac{t - a + 1}{b - a + 1}\right)^n \iff \\ y^{1/n}\left(b - a + 1\right) = t - a + 1 \iff \\ t = F^{-1}(y) = y^{1/n} \left(b - a + 1\right) + a - 1 \] Let’s do as before, and plot the quantiles first. To be fair with the geometric distribution, we keep the average at \(7200\) time periods, i.e., with support on \([0, 14400]\):

# Quantile function for the max of n iid discrete uniforms on [a, b].
F_inv_unif <- function(a, b) function(y, n) (y**(1/n))*(b - a + 1) + a - 1
quantiles_unif <- compute_quantiles(F_inv_unif(0, 2 * periods_per_day))
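As a quick spot check (simulation parameters made up), the closed-form quantile agrees with the empirical median of the maximum of \(n\) simulated uniforms:

set.seed(1)

# Empirical median of the max of n discrete uniforms on [0, 14400]
# vs. the closed-form quantile.
n <- 10
max_samples <- replicate(
  10000, max(sample(0:(2 * periods_per_day), n, replace = TRUE))
)
median(max_samples)
F_inv_unif(0, 2 * periods_per_day)(0.5, n)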

Let’s plot them now, comparing with the geometric distribution:

plotly::ggplotly(quantile_plot(
  quantiles_unif |> 
    mutate(distribution = 'uniform') |>
    rbind(
      quantiles_geom |>
        mutate(distribution = 'geometric')
    )
) + facet_wrap(distribution ~ ., ncol = 1, nrow = 2))

What this shows is what we expected: for distributions with the same average, using a uniform distribution brings the time required to observe a given state down significantly. Of course, this is not the whole story, and we need to look into more numbers.
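To put a single concrete number on this, here is the \(95\)th-percentile observation time for \(n = 80\) slots under each distribution, reusing the quantile functions defined above:

# 95th percentile of the observation time for n = 80 slots, in hours.
periods_to_hours(F_inv_y(1 / periods_per_day)(0.95, 80))        # geometric: ~176h
periods_to_hours(F_inv_unif(0, 2 * periods_per_day)(0.95, 80))  # uniform: ~48h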


  1. Which, unlike the memoryless case, is not a perfect representation of the state of the system at a random point in time.