Attention conservation notice: I state and briefly discuss a couple important Laws of Large Numbers and Central Limit Theorems and the conditions under which they hold. I borrow freely from Jacod and Protter’s excellent text Probability Essentials and White’s more advanced Asymptotic Theory for Econometricians, and take some liberties with notation. You can and should ignore this unless you feel compelled to read it. It has no relevance to any assessments in this course.

Notation

I use \(\{X_i\}_{i=1}^n\) to refer to the set of measurements \(X_1, X_2, \dots, X_n\) of a random variable \(X\). \(\mu\) refers to an expected value, and \(\sigma^2\) refers to a variance.

Moments of a distribution

A probability distribution is, under suitable conditions, fully characterized by its moments. The moments of a random variable are expected values of powers of the random variable (or of related functions of it); for example, the \(m\)-th moment of \(X\) is \[ E[X^m] = \int_{-\infty}^{\infty} x^m f(x) \, dx,\]

where \(m>0\) and need not be an integer. We are already familiar with the first two moments, the mean and the variance. The mean is the case when \(m=1\), and the second raw moment \(E[X^2]\) is the case when \(m=2\). The variance (the one we’ve used) is the second moment of the centered RV \(X-E[X]\), also called the second central moment. The Normal distribution is completely characterized by its first two moments.
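To make this concrete, here is a minimal simulation sketch (mine, not part of the original notes) that estimates the first two raw moments and the second central moment of a Normal sample; the particular parameters are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 2.0, 3.0                       # arbitrary example parameters
x = rng.normal(mu, sigma, size=1_000_000)  # draws of X ~ N(mu, sigma^2)

m1 = np.mean(x)                      # first raw moment: estimate of E[X] = mu
m2_raw = np.mean(x**2)               # second raw moment: estimate of E[X^2] = mu^2 + sigma^2
m2_centered = np.mean((x - m1)**2)   # second central moment: estimate of Var(X) = sigma^2

print(m1, m2_raw, m2_centered)       # roughly 2.0, 13.0, 9.0
```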

Moments are funny things. The existence of the \(m\)-th moment of a distribution ensures the existence of all lower moments, but it says nothing about the existence of higher moments. Some distributions, such as the Cauchy distribution, have no moments at all. While the Cauchy looks superficially like a Normal distribution, its tails are so thick that even the integral for the first moment fails to converge (in just about any meaningful sense of the term “converge”). Where necessary, we will work with absolute moments such as \(E[|X|^m]\), which keeps us from doing strange things with negative values raised to non-integer powers.
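To see how badly things can go when even the first moment fails to exist, here is a small sketch (my own construction) comparing running sample means of Normal and Cauchy draws; the Normal averages settle down, while the Cauchy averages keep getting thrown around by extreme draws no matter how much data we collect.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

normal_draws = rng.normal(0, 1, size=n)
cauchy_draws = rng.standard_cauchy(size=n)

# Running means after each additional observation.
running_mean_normal = np.cumsum(normal_draws) / np.arange(1, n + 1)
running_mean_cauchy = np.cumsum(cauchy_draws) / np.arange(1, n + 1)

# The Normal running mean hugs 0; the Cauchy one is repeatedly knocked
# far off course by single draws from its heavy tails.
for k in (100, 1_000, 10_000, 100_000):
    print(k, running_mean_normal[k - 1], running_mean_cauchy[k - 1])
```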

“Almost surely”

We will say that a statement is true “almost surely” if the probability it is true is 1. Thus, a fair coin (\(Pr(H) = Pr(T) = 0.5\)) will almost surely come up heads or tails. There is a third possibility—the coin may land and balance on its side. But we’ve assumed that the probability of this event is zero, so it will almost surely never happen. We abbreviate “almost surely” with “a.s.”.

The big picture

All LLNs and CLTs involve trading off between two types of conditions: conditions on the sampling process for \(\{X_i\}_{i=1}^n\), and conditions on the moments of the distribution(s) of \(\{X_i\}_{i=1}^n\). The game is that stronger conditions on the sampling process buy weaker/simpler conditions on the moments, and vice versa. Thus, if I want to prove a LLN/CLT for a very general sampling process, I’ll need stronger/more complicated conditions on the moments of the \(\{X_i\}_{i=1}^n\). If I want to prove something for a very restricted sampling process, very simple conditions on the moments will suffice.

The classical case

The original LLN and CLT were proven for the case when the measurements \(\{X_i\}_{i=1}^n\) were taken to be sampled such that each measurement \(X_i\) was independent and identically distributed (iid). This means that, for example, if \(X_1 \sim N(\mu,\sigma)\), then \(X_i \sim N(\mu,\sigma)\) for every \(i\). This rules out cases where some measurements have higher or lower means or variances than other measurements, e.g. cases where we are sampling from two different populations. It also rules out cases where our sampling procedure is such that the second observation depends on the first, e.g. cases where we sample from processes which evolve over time like inventories (since the inventory level in period 2 depends on the inventory level in period 1).

iid sampling is the nicest, most friendly kind of sampling to work with. Such processes require the weakest possible condition on the moments: typically, just that the second moment exists. You really can’t ask for a weaker condition on the moments if you want a CLT. But the price of such a weak condition on moments is high: these LLN/CLTs basically don’t apply to most interesting things in life, e.g. things that change over time or things where different observations are meaningfully different.

The (Kolmogorov) Law of Large Numbers

Statement of the LLN: If \(\{X_i\}_{i=1}^n\) are iid with mean \(\mu\) and variance \(\sigma^2\), then as long as \(\sigma^2 < \infty\), we have \[ \lim_{n \to \infty} \frac{1}{n}\sum_{i=1}^n X_i = \mu ~ a.s.\]

Remarks: This theorem was originally proven by a mathematician named Kolmogorov, and hence bears his name. The conditions given here are not the minimal ones; one could prove this theorem without assuming \(\sigma^2<\infty\) by using the weaker condition \(E[|X|] < \infty\). The condition we used is stronger and implies this one, and is convenient for connection to the following CLT.
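Here is a quick simulation sketch of the statement above, using iid Exponential draws (my choice; any distribution with finite variance would do): the gap between the sample mean and \(\mu\) shrinks as \(n\) grows.

```python
import numpy as np

rng = np.random.default_rng(2)
mu = 5.0  # true mean of the Exponential distribution used below

for n in (10, 1_000, 100_000):
    x = rng.exponential(scale=mu, size=n)  # iid draws with E[X_i] = mu
    print(n, abs(x.mean() - mu))           # gap shrinks as n grows
```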

The (Lindeberg-Levy) Central Limit Theorem

Statement of the CLT: If \(\{X_i\}_{i=1}^n\) are iid with mean \(\mu\) and variance \(\sigma^2\), then as long as \(0 < \sigma^2 < \infty\), we have \[ \frac{ \sum_{i=1}^n X_i - n \mu}{\sigma \sqrt{n}} \xrightarrow{d} Y \sim N(0,1) \quad \text{as } n \to \infty, \] where \(\xrightarrow{d}\) denotes convergence in distribution.

Remarks: Notice that this CLT adds an additional restriction over the associated LLN: the variance must not be degenerate (\(0 < \sigma^2\)). Of course, if it were degenerate there would be no need for a theorem like this. Since a crucial piece of the result (and not just the proof) involves the variance, the finite-variance requirement is essential. Note also that, unlike the LLN, the convergence here is in distribution: the distribution of the standardized sum approaches the standard Normal, but the standardized sum itself does not settle down to any fixed value.
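A hedged simulation sketch of the CLT (mine; the Exponential is an arbitrary skewed choice): repeated standardized sums of iid Exponential draws end up with roughly standard Normal behavior.

```python
import numpy as np

rng = np.random.default_rng(3)
mu, sigma = 1.0, 1.0        # mean and sd of Exponential(1)
n, reps = 500, 20_000       # sample size per sum, number of replications

x = rng.exponential(scale=1.0, size=(reps, n))
z = (x.sum(axis=1) - n * mu) / (sigma * np.sqrt(n))  # standardized sums

# If the CLT is doing its job, z behaves like N(0,1):
print(z.mean(), z.std())            # roughly 0 and 1
print(np.mean(np.abs(z) > 1.96))    # roughly 0.05
```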

A more complicated case

The classical LLN and CLT are nice and all, but we’re fancy people these days and we want more. Many interesting problems and datasets involve some kind of stratified structure, with multiple groups within a population of interest. Just because you know about them doesn’t mean you have the sample size to study them separately, or maybe you want to start by studying the whole sample as a unit and then disaggregating. In these cases the classical LLN and CLT won’t do the job even if we sample independently, since we have heterogeneously-distributed observations.1

The (Markov) Law of Large Numbers

Statement of the LLN: Let \(\{X_i\}_{i=1}^n\) be independent RVs, all with finite means (\(|\mu_i| < \infty\)). If there exists some \(\delta>0\) such that \[\sum_{i=1}^\infty \frac{E[|X_i - \mu_i|^{1+\delta}]}{i^{1+\delta}} < \infty,\] then \[ \lim_{n \to \infty} \frac{1}{n}\sum_{i=1}^n (X_i - \mu_i) = 0 ~ a.s.,\] that is, the sample average tracks the average of the means.
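A small sketch of this heterogeneous case (my construction, chosen so the summability condition holds with bounded means and variances): independent Normal draws whose means and variances differ across \(i\), yet whose centered average still collapses to zero.

```python
import numpy as np

rng = np.random.default_rng(4)

for n in (100, 10_000, 1_000_000):
    i = np.arange(1, n + 1)
    mu_i = np.sin(i)                  # heterogeneous, bounded means
    sigma_i = 1.0 + (i % 5)           # heterogeneous, bounded standard deviations
    x = rng.normal(mu_i, sigma_i)     # independent but not identically distributed
    print(n, abs(np.mean(x - mu_i)))  # (1/n) * sum(X_i - mu_i) -> 0
```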

The (Lindeberg-Feller) Central Limit Theorem

Statement of the CLT: Let \(\{X_i\}_{i=1}^n\) be independent RVs, all with finite means (\(|\mu_i| < \infty\)), finite and non-degenerate variances (\(0 < \sigma_i^2 < \infty\)), and density functions \(f_i(x)\). Then

\[ \frac{ \sum_{i=1}^n X_i - \sum_{i=1}^n \mu_i}{\sqrt{\sum_{i=1}^n\sigma_i^2}} \xrightarrow{d} Y \sim N(0,1) \quad \text{as } n \to \infty \] and \[ \lim_{n \to \infty} \max_{1\leq i \leq n} \frac{\sigma^2_i}{\sum_{j=1}^n\sigma_j^2} = 0\]

if and only if, for every \(\epsilon > 0\), \[ \lim_{n \to \infty} \left(\sum_{j=1}^n\sigma_j^{2}\right)^{-1} \sum_{i=1}^n \int_{(x-\mu_i)^2 > \epsilon \sum_{j=1}^n\sigma_j^2} (x-\mu_i)^2 f_i(x)\,dx = 0.\]

Remarks: The ihd (independent and heterogeneously distributed) case reveals a new concern: what if the variance of one distribution dominates the sum of the variances? In (slightly) plainer words, what if outcomes for one of the groups we’re studying vary so wildly as to swamp the variation in outcomes of the whole population? The second result of this theorem reassures us that this won’t be the case; it follows from the “if and only if” condition above.

That “if and only if” condition is called the Lindeberg condition. It states that, as we collect more and more data, the extreme tails of the \(X_i\) we’re drawing make a negligible contribution to the total variance of the sum. What kind of behavior does it rule out? Imagine we kept drawing from groups which, though each has a finite variance, have means that creep ever upward, so that new observations come from progressively “farther out” in the tails of the distribution of our whole sample. This sort of behavior would prevent the distribution of our standardized sum from converging to anything, so the Lindeberg condition rules it out.
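As a rough numerical check (again my own sketch, not from the texts cited above): with independent Normal draws whose means and variances vary across groups, but with no single variance dominating, the standardized sum from the theorem above looks close to \(N(0,1)\).

```python
import numpy as np

rng = np.random.default_rng(5)
n, reps = 2_000, 20_000

i = np.arange(1, n + 1)
mu_i = 0.1 * (i % 7)          # heterogeneous means
sigma_i = 1.0 + (i % 3)       # heterogeneous sds; no single term dominates
s_n = np.sqrt(np.sum(sigma_i**2))

# Check the "no dominant variance" condition from the theorem:
print(np.max(sigma_i**2) / s_n**2)   # tiny, and shrinks as n grows

x = rng.normal(mu_i, sigma_i, size=(reps, n))  # independent, heterogeneous draws
z = (x.sum(axis=1) - mu_i.sum()) / s_n         # standardized as in the CLT
print(z.mean(), z.std(), np.mean(np.abs(z) > 1.96))  # ~0, ~1, ~0.05
```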


  1. What if you don’t sample independently? Lots of interesting things happen. Fortunately or unfortunately, dependent processes are well beyond our scope here.