Let $$\mathbf p = (p_1, p_2, \dots, p_K)$$ be a vector of probabilities and define $$p_0 = 1 - \sum_{k=1}^K p_k$$, supposed to be positive. Now we can consider the index labels $$\{0, 1, \dots, K\}$$ as names of species and take a sample of size $$n$$ from this set according to the probability distribution $$(p_0, \mathbf p)$$.

The label “0” corresponds to very many species all of extremely small probabilty, but altogether adding up to positive probability $$p_0$$. The experimenter who collects this data does not know the ordering of the species by probability. In fact, all he or she can do is determine whether or not two elements of the sample belong to the same species. The data determines a partition, in the number-theoretic sense, of the integer $$n$$.

The following R code generates such a sample. The sample size, and the probabilities $$p_k = 1/ 200 \sqrt k$$, were chosen to mimic a real data-set from a forensic DNA application.

Note that according to this formula, $$K$$ can not be more than about $$10150$$ and at that value $$p_0$$ has become zero. This follows because $$\sum_{k=1}^K 1/ 200 \sqrt k \approx \sqrt K / 100$$.

options(scipen = 8)
set.seed(1234)
n <- 2085
K <- 2500
power <- 1/2
probs <- 1 / (200*((1:K)^power))
pblob <- 1 - sum(probs)
data <- sample(0:K, n, replace = TRUE, prob = c(pblob, probs))
Nblob <- sum(data == 0)
if (Nblob > 0 ) data[data == 0] <- -(1 : Nblob)
tab <- table(table(data))
tab
##
##    1    2    3    4    5    6    7    9   11
## 1618  135   23   12    3    4    3    1    1

One species was observed 11 times, one species was observed 9 times, three species were each observed 7 times … 1618 species were each observed once.

From this tabulation, we can reconstruct the list of sample relative species of each observed species, in decreasing order. I plot it on double logarithmic paper, and also plot the true frequency distribution (which is a straight line).

freqs <- as.integer(names(tab))
reps <- as.vector(tab)
Ltab <- length(tab)
naive <- rep(freqs[Ltab:1], reps[Ltab:1]) / n
plot(naive, log = "xy",
main = "Empirical mass function, true mass function",
sub = "(Sampled species ordered by sample frequency)",
xlab = "Index", ylab = "Relative frequency")

lines(probs) We see that the naive estimator of the probability mass function (the ordered empirical) over-estimates the true mass function. At high values of the index variable, it consists of long horizontal segments, of increasing length. The slope of the left hand endpoints of these segements seems extremely close to the slope of the true mass function. It seems that that distribution could be estimated by anchoring a straight line with the just mentioned slope in the point corresponding to index 1.

Here are nine further simulations which support that idea.

par(mfrow = c(3, 3))
for (i in 1:9) {
data <- sample(0:K, n, replace = TRUE, prob = c(pblob, probs))
Nblob <- sum(data == 0)
if (Nblob > 0 ) data[data == 0] <- -(1 : Nblob)
tab <- table(table(data))
freqs <- as.integer(names(tab))
reps <- as.vector(tab)
Ltab <- length(tab)
naive <- rep(freqs[Ltab:1], reps[Ltab:1]) / n
plot(naive, log = "xy",
xlab = "", ylab = "",
ylim = c(5e-04, 5e-03))
lines(probs)
} I have now seen two data sets from genomics, from completely unrelated fields, but which both fit this picture perfectly, even down to them both having the same power “one half” in what seems to be a power law. The first example was of tomatoe genes, ordered by probability of being expressed (only one gene being expressed at a time). The second example was human male Y-str haplotype DNA profiles.

Simulations with other powers do not spoil the picture. The “spectrum” of ordered relative frequencies fits in a kind of cone, between two sloping lines which meet at the species with index 1. The slope of the lower line is moreover close to the true slope.

Is this a coincidence, or is there a theorem lurking here?