Number of Distinct Samples in Bootstrap

A proof using dot-bar system

Riaz Khan

2021-07-17

Background

Bootstrap is a method to estimate the distribution of a statistic by resampling (Horowitz (2001Horowitz, Joel L. 2001. “The Bootstrap.” In Handbook of Econometrics, 5:3159–3228. Elsevier.)). Suppose \(\theta\) is a statistic that is a function of the \(X\), where \(X\) is a sample of size \(n\). \(X_1,\ X_2,\ X_3,\dots,X_n\) are the data points of \(X\).

So,

\[ \theta = f(X_1,\ X_2,\ X_3,\dots,X_n) \]

The main idea of bootstrap is to make copies of \(X\) and calculate \(\theta\) from each copy of \(X\). Each copy should have the same size as original \(X\). If we have \(b\) copies of \(X\), then we will have \(b\) different estimates of \(\theta\). These \(b\) copies of \(\theta\) will form a distribution and we can make inference from that distribution.

When performing bootstrap, it is imporatnt that when we make copies of \(X\) from the initial data by random sampling, the sampling has to be with replacement. The process is called resampling. The copies are also known as bootstrap samples (Hesterberg (2011Hesterberg, Tim. 2011. “Bootstrap.” Wiley Interdisciplinary Reviews: Computational Statistics 3 (6): 497–526.)). Now an obvious question is, how many distinct bootstrap samples can be formed using the initial sample of size \(n\)? The answer is \(2n-1\choose{n}\) (Davison and Hinkley (1997Davison, Anthony Christopher, and David Victor Hinkley. 1997. Bootstrap Methods and Their Application. 1. Cambridge university press.)). In the following section, we will prove it using a dot-bar method.

Proof

Suppose we have the initial samples \(X_1,\ X_2,\ X_3,\dots,X_n\), from a univariate distribution. As mentioned above, each bootstrap sample is created by taking random draws from the initial sample with replacement. So, each data point of the initial sample can be chosen any number of times in between \(0\) and \(n\) (inclusive). If \(k_i\) represents the count that \(X_i\) is chosen, the following must satisfy:

\[\sum_{i=1}^n{k_i}=n\] Thus, sample \(\{X_1,\ X_2,\ X_3,\dots,X_n\}\) can be written in the frequency domain as \(\{k_1,\ k_2,\ k_3,\dots,k_n\}\), conditioned on \(\sum_{i=1}^n{k_i}=n\).

With that, our original problem takes a simpler form- in how many possible ways \(n\) dots can be partitioned, since number of dots remains constant for each bootstrap sample.

With the data represented in frequency domain, the system can be represented by a dot-bar system. An illustration of dot-bar representation for a simple case \((n=3)\) is shown below1 We have to remember that bootstrap sample \(X_j^*=\{X_1, X_1, X_2\}\) and \(X_k^*=\{X_2, X_1, X_1\}\) are the same. Order does not matter.

An illustration of dot-bar representation.

An illustration of dot-bar representation.

In the dot-bar representation, there are \(n\) dots. The system is simply represented by a sequence of dots and bars. Each bar represents a partitioner. Number of dots between \((j-1)^{th}\) and \(j^{th}\) bar represents the frequency that data point \(X_j\) is chosen. Number of dots after the last bar represents how many times the last data point is chosen.

From above, it is clear that for size \(n\), \(n-1\) bars can completely describe the system. So, the problem takes even a more simpler form- in how many possible ways \(n\) dots and \(n-1\) bars can be rearranged. So, we have total \(n-1 + n=2n-1\) elements, \(n\) of them are dots and \(n-1\) of them are bars. So, the desired number is:

\[ \frac{(2n-1)!}{(n-1)!\ n!} = \frac{(2n-1)!}{((2n-1)-n)!\ n!} ={2n-1 \choose n} \]



Riaz Khan, GStat
MS, Statistics