James Scott (UT-Austin)
Reference: Bertsekas Chapter 7
We'll study three important inequalities in probability theory: Markov's inequality, Chebyshev's inequality, and Hoeffding's inequality.
These are useful for bounding quantities that might otherwise be hard to compute. They are also used in the study of convergence of random variables.
Markov's inequality bounds the upper tail probability for a nonnegative random variable.
Theorem: Let \( X \) be a nonnegative random variable and suppose that \( E(X) < \infty \). Then for any \( t > 0 \),
\[ P(X \geq t) \leq \frac{E(X)}{t} \]
Let's fix any positive number \( t \) and consider the random variable \( Y_t \) defined as
\[ Y_t = \begin{cases} 0, & \mbox{if } X < t \\ t, & \mbox{if } X \geq t \end{cases} \]
Since it is always the case that \( Y_t \leq X \), it follows that \( E(Y_t) \leq E(X) \). Therefore
\[ E(X) \geq E(Y_t) = t \cdot P(Y_t = t) = t \cdot P(X \geq t) \]
Dividing both sides by \( t \) gives the result.
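As a quick sanity check (my own, not from the original notes), here is a small R snippet comparing Markov's bound to the exact tail probability for an Exponential(1) random variable, where \( E(X) = 1 \) and \( P(X \geq t) = e^{-t} \):

```r
# Markov's bound E(X)/t versus the exact tail P(X >= t) for X ~ Exponential(1)
t_vals <- c(1, 2, 5, 10)
rbind(markov_bound = 1 / t_vals,      # E(X)/t with E(X) = 1
      exact_tail   = exp(-t_vals))    # true P(X >= t); always below the bound
```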
Chebyshev's inequality bounds the probability that a random variable will deviate very far from its mean.
Theorem: Let \( X \) be a random variable and suppose that \( E(X) = \mu < \infty \) and \( \mbox{var}(X) = \sigma^2 < \infty \). Let \( t > 0 \). Then
\[ P(|X - \mu| \geq t) \leq \frac{\sigma^2}{t^2} \]
In words: if the variance of a random variable is small, then the probability that it takes a value far from its mean is also small. In particular, the probability that \( X \) is more than \( t \) units away from its mean falls at least as fast as \( 1/t^2 \).
Clearly \( P(|X-\mu| \geq t) = P(|X-\mu|^2 \geq t^2) \).
Now since \( |X-\mu|^2 \) is a nonnegative random variable, we can use Markov's inequality to conclude that
\[ P(|X-\mu|^2 \geq t^2) \leq \frac{E(|X-\mu|^2)}{t^2} = \frac{\sigma^2}{t^2} \]
An alternative form of Chebyshev's inequality is obtained by setting \( t = k \sigma \). Then
\[ P(|X - \mu| \geq k \sigma) \leq \frac{1}{k^2} \]
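For intuition (a small illustration of my own, not from the original notes), here is how the worst-case bound \( 1/k^2 \) compares with the exact tail probability of a normal random variable:

```r
# Chebyshev's bound 1/k^2 versus the exact normal tail P(|X - mu| >= k*sigma)
k <- c(1, 2, 3)
rbind(chebyshev_bound = 1 / k^2,
      normal_exact    = 2 * (1 - pnorm(k)))  # far smaller: Chebyshev must cover every distribution
```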
Suppose we randomly poll \( N \) voters in order to estimate \( p \), the true fraction of voters in the population who intend to vote for the incumbent.
Let \( X \) be the number of polled voters who say they intend to vote for the incumbent, and estimate \( p \) by the sample proportion \( \hat{p} = X/N \). Note that \( X \) is a binomial random variable with \( E(X) = Np \) and \( \mbox{var}(X) = Np(1-p) \). So \( E(\hat{p}) = p \) and \( \mbox{var}(\hat{p}) = p(1-p)/N \).
Now using Chebyshev's inequality,
\[ \begin{aligned} P(|\hat{p} - p| > \epsilon) \leq \frac{\mathrm{var}(\hat{p})}{\epsilon^2} &= \frac{p(1-p)}{N \epsilon^2} \\ &\leq \frac{1}{4 N \epsilon^2} \end{aligned} \]
since \( p(1-p) \leq 1/4 \) for all \( p \).
So if we poll \( N=1000 \) people, then
\[ P(|\hat{p} - p| > 0.05) \leq \frac{1}{4 \cdot 1000 \cdot 0.05^2} = 0.1 \]
At most a 10% chance of a polling error of 5% or more. So if I report the interval \( \hat{p} \pm 0.05 \), I can be at least 90% confident that my interval contains the truth.
This is a very conservative bound: the real probability is much smaller. (Markov and Chebyshev usually give conservative bounds.)
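To see just how conservative, here is a quick R calculation of the exact polling-error probability, assuming (hypothetically) that the true \( p \) is 0.5:

```r
# Exact P(|p_hat - p| > 0.05) for N = 1000, assuming p = 0.5 for concreteness
N <- 1000; p <- 0.5; eps <- 0.05
pbinom(N * (p - eps) - 1, N, p) +     # P(X < 450)
  (1 - pbinom(N * (p + eps), N, p))   # P(X > 550); the total is far below the 0.1 bound
```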
Theorem (Hoeffding's inequality): Let \( X_1, \ldots, X_N \sim \) Bernoulli(p) be independent random variables. Define
\[ \bar{X}_N = \frac{1}{N} \cdot \sum_{i=1}^N X_i \]
Then for any \( \epsilon > 0 \),
\[ P(|\bar{X}_N - p| \geq \epsilon) \leq 2e^{-2N\epsilon^2} \]
Hoeffding's inequality is similar to Chebyshev's inequality, but it is sharper. (This is actually a special case of a more general version of the inequality, which we don't really need.)
Exercise: go back to our political poll example and again suppose \( N=1000 \).
Use Hoeffding's inequality to bound \( P(|\hat{p} - p| \geq \epsilon) \) for \( \epsilon = 0.05 \) and \( \epsilon = 0.01 \) (that is, polling errors of 5% and 1%, respectively).
Compare these to the bounds you get from the Chebyshev inequality.
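If you want to check your numbers, here is a short R sketch computing both bounds for \( N = 1000 \):

```r
# Chebyshev (worst-case) and Hoeffding bounds on P(|p_hat - p| >= eps) for N = 1000
N   <- 1000
eps <- c(0.05, 0.01)
rbind(eps       = eps,
      chebyshev = 1 / (4 * N * eps^2),     # bounds above 1 are vacuous
      hoeffding = 2 * exp(-2 * N * eps^2))
```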
We now turn to a topic called “large sample theory” or “limit theory” or “asymptotic theory.”
In calculus, we say that a sequence \( x_n \) converges to a limit \( a \) if, for every \( \epsilon > 0 \), there exists an \( n_0 \) such that \( |x_n - a| \leq \epsilon \) for every \( n \geq n_0 \).
But in probability theory, convergence is more subtle.
For example, consider the trivial sequence \( x_n = 0 \) for all \( n \). Then clearly \( x_n \) converges to 0.
But now consider the sequence of random variables \( X_1, X_2, \ldots \), where each \( X_N \sim N(0,1) \). In what sense could this sequence “converge”? Every term has exactly the same distribution, so the sequence never settles down to any single number.
Or consider the sequence of random variables \( X_1, X_2, \ldots \), where each \( X_N \sim N(0, 1/N^2) \). Here the variance shrinks to zero, so intuitively the sequence “should” converge to 0, yet no term is ever exactly equal to 0.
We need to develop some appropriate tools for describing the convergence of random variables.
Definition. Let \( X_1, X_2, \ldots \) be a sequence of random variables. We say that the sequence \( X_N \) converges in probability to some random variable \( X \) if, for every \( \epsilon> 0 \),
\[ P(|X_N - X| \geq \epsilon) \to 0 \]
as \( N \to \infty \). We write this as \( X_N \overset{P}{\to} X \).
Key point: we have turned a claim about random variables into a claim about real numbers (i.e. probabilities) so that we can describe those real numbers using ordinary calculus tools.
Example: consider the sequence \( X_N \sim N(0, 1/N^2) \). Show that \( X_N \overset{P}{\to} 0 \) (i.e. to the trivial random variable that always takes the value 0).
Proof: We need to show that \( P(|X_N| \geq \epsilon) \to 0 \) for every \( \epsilon > 0 \). From Chebyshev's inequality, we know that, since \( E(X_N) = 0 \) and \( \mbox{var}(X_N) = 1/N^2 \),
\[ P(|X_N| \geq \epsilon) \leq \frac{\mbox{var}(X_N)}{\epsilon^2} = \frac{1}{N^2 \epsilon^2} \]
And for fixed \( \epsilon \), \( 1/(N^2 \epsilon^2) \to 0 \) as \( N \to \infty \).
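Since \( X_N \) is exactly normal here, we can also compute the tail probability directly; a quick R check (my own, not from the notes) shows it shrinking to zero:

```r
# P(|X_N| >= eps) for X_N ~ N(0, 1/N^2), computed exactly for eps = 0.1
eps <- 0.1
N   <- c(1, 5, 10, 50, 100)
2 * (1 - pnorm(eps, mean = 0, sd = 1 / N))   # equals 2*(1 - pnorm(eps * N)); tends to 0
```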
The law of large numbers is one of the most important results in all of probability theory.
It says that the mean of a large sample is close to the mean of the underlying distribution.
We now make this idea precise.
Theorem. Suppose that \( X_1, X_2, \ldots, X_N \) are independent, identically distributed random variables. (Recall that we refer to this as an IID sample.) Suppose each \( X_i \) has mean \( \mu = E(X_i) < \infty \) and variance \( \sigma^2 = \mbox{var}(X_i) < \infty \).
Let \( \bar{X}_N = N^{-1} \sum_{i=1}^N X_i \) be the sample mean. Then \( \bar{X}_N \overset{P}{\to} \mu \).
As an example, consider flipping a biased coin where the probability of heads is \( p \). If we let \( X_i = 1 \) when flip \( i \) comes up heads and \( X_i = 0 \) otherwise, then \( \bar{X}_N \) is the observed frequency of heads, and the law of large numbers says that \( \bar{X}_N \overset{P}{\to} p \).
This doesn't mean that \( \bar{X}_N = p \) for any \( N \): you never get the answer exactly right. It does mean that for large \( N \), the distribution of \( \bar{X}_N \) is very tightly concentrated around \( p \).
A comment on the proof: with Chebyshev's inequality, it's a two-liner! First, note that
\[ \mbox{var}(\bar{X}_N) = \mbox{var} \left( \frac{1}{N} \sum_{i=1}^N X_i \right) = \frac{1}{N^2} \cdot N \cdot \mbox{var}(X_i) = \frac{\sigma^2}{N} \]
Therefore,
\[ P(|\bar{X}_N - \mu| \geq \epsilon) \leq \frac{\sigma^2}{N \epsilon^2} \, , \]
which converges to 0 for fixed \( \epsilon \) as \( N \to \infty \). QED!
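Here is a small simulation (my own sketch, in the spirit of the course's R examples) showing the law of large numbers in action for coin flips:

```r
# Running sample mean of Bernoulli(p) flips settles down near p
set.seed(1)
p <- 0.3
x <- rbinom(10000, size = 1, prob = p)
running_mean <- cumsum(x) / seq_along(x)
plot(running_mean, type = "l", xlab = "number of flips", ylab = "running sample mean")
abline(h = p, col = "red")   # the true mean
```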
A whole city built by the Law of Large Numbers…
A whole industry, too!
Our second type of convergence is convergence in distribution. Let \( X_1, X_2, \ldots \) be a sequence of random variables, and suppose that \( X_N \) has CDF \( F_N \). Let \( X \) be some other random variable with CDF \( F \). We say that the sequence \( X_N \) converges in distribution to \( X \), written \( X_N \rightsquigarrow X \), if
\[ \lim_{N \to \infty} F_N(t) = F(t) \]
at all \( t \) where \( F(t) \) is continuous.
Intuitively: probability statements about \( X_N \) can be approximated using probability statements about \( X \), and the approximation gets arbitrarily good as \( N \) diverges.
Best math symbol ever!
\[ \huge X_n \rightsquigarrow X \]
OK, you can also write this as \( X_N \stackrel{D}{\to} X \).
Suppose that \( X_1, \ldots, X_n \) are IID random variables with mean \( \mu \) and variance \( \sigma^2 \). Let \( \bar{X}_n \) be the sample mean:
\[ \bar{X}_n = \frac{1}{n} \sum_{i=1}^n X_i \]
Now define the \( z \)-score of the sample mean as
\[ \begin{aligned} Z_n &= \frac{\bar{X}_n - \mu}{\mbox{se}(\bar{X}_n)} \\ &= \frac{\sqrt{n} (\bar{X}_n - \mu)}{\sigma} \end{aligned} \]
Note: the second line uses de Moivre's equation, \( \mbox{se}(\bar{X}_n) = \sigma / \sqrt{n} \).
The central limit theorem says that \( Z_n \rightsquigarrow Z \), where \( Z \) is a standard normal random variable. That is,
\[ \lim_{n \to \infty} P(Z_n \leq z) = \Phi(z) \]
In words: we can approximate statements about \( Z_n \) using statements about a standard normal random variable. Informally, we write this as \( Z_n \approx N(0, 1) \).
Here's an equivalent way of expressing the central limit theorem. Since
\[ Z_n = \frac{\sqrt{n} (\bar{X}_n - \mu)}{\sigma} \approx N(0, 1) \]
then
\[ \bar{X}_n \approx N \left(\mu, \frac{\sigma}{\sqrt{n}} \right) \]
Informally, this means that the CDF of \( \bar{X}_n \) is close to that of a normal distribution with mean \( \mu \) and standard deviation \( \sigma/\sqrt{n} \).
The central limit theorem holds basically no matter what the distribution of the \( X_i \)'s is. The only conditions, beyond the IID assumption, are that the mean and variance are finite!
Some terminology: the CLT implies that \( \bar{X}_n \) is asymptotically normal.
The bigger the sample size, the closer the approximation gets. By \( n=30 \), the normal approximation is pretty good, unless the distribution of \( X_i \) is super heavy tailed and/or skewed.
Let's see some examples in CLT.R.
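If you don't have the script handy, here is a minimal sketch of the kind of simulation in CLT.R (the actual file may differ):

```r
# Sampling distribution of the mean of n = 30 draws from a skewed distribution
n_sims <- 10000
n      <- 30
xbar   <- replicate(n_sims, mean(rexp(n, rate = 1)))   # Exponential(1): mu = 1, sigma = 1
hist(xbar, breaks = 50, prob = TRUE, main = "Sample means, n = 30")
curve(dnorm(x, mean = 1, sd = 1 / sqrt(n)), add = TRUE, col = "red")   # CLT approximation
```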
Your turn: each package's weight has mean \( \mu = 6.1 \) pounds and standard deviation \( \sigma = 5.6 \) pounds, and the plane to Austin can carry at most 11,422 pounds of packages. What is the probability that FedEx will need more than one flight to Austin to get all 1810 packages there?
Let \( X_i \) be the weight of package \( i = 1, \ldots, 1810 \). FedEx will need only one flight to Austin if
\[ \sum_{i=1}^{1810} X_i \leq 11422 \]
or equivalently, if
\[ \frac{1}{1810} \sum_{i=1}^{1810} X_i \leq \frac{11422}{1810} = 6.31 \]
That is, average package weight \( \bar{X}_n \) cannot exceed 6.31 pounds (recall \( \mu = 6.1 \) and \( \sigma = 5.6 \)).
So what is \( P(\bar{X}_n \leq 6.31) \)?
The sample size is \( n=1810 \): pretty large! So use the central limit theorem: make approximate statements about \( \bar{X} \) using statements about the normal distribution.
This gives us:
\[ \begin{aligned} P(\bar{X}_n \leq 6.31) &= P\left( \frac{\sqrt{n} (\bar{X}_n - \mu) }{\sigma} \leq \frac{\sqrt{n} (6.31 - \mu)}{\sigma} \right) \\ &\approx P\left( Z \leq \frac{\sqrt{n} (6.31 - \mu)}{\sigma} \right) \quad \mbox{where $Z \sim N(0,1)$} \\ &\approx P\left( Z \leq \frac{\sqrt{1810} (6.31 - 6.1)}{5.6} \right) \\ &= P\left( Z \leq 1.595 \right) \\ &= \Phi(1.595) \approx 0.945 \end{aligned} \]
So the probability that FedEx needs more than one flight is approximately \( 1 - 0.945 = 0.055 \), or about 5.5%.
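A one-line check of this calculation in R:

```r
# Normal approximation for P(average package weight <= 6.31)
n <- 1810; mu <- 6.1; sigma <- 5.6
z <- sqrt(n) * (6.31 - mu) / sigma   # about 1.595
pnorm(z)                             # P(one flight suffices), about 0.945
1 - pnorm(z)                         # P(more than one flight is needed), about 0.055
```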
The central limit theorem assumes that we know \( \sigma \), the true standard deviation of the data points, so that we can \( z \)-score using the true standard error, \( \sigma/\sqrt{n} \).
What if we don't know \( \sigma \)? A natural thing to do is to plug-in the sample standard deviation \( \hat\sigma \) to give us an approximation:
\[ \hat{se}(\bar{X}_n) = \frac{\hat\sigma}{\sqrt{n}}\quad \mbox{where} \quad \hat{\sigma} = \sqrt{\frac{1}{n-1} \sum_{i=1}^n (X_i - \bar{X}_n)^2 } \]
Remember: we call this the plug-in estimate of the standard error.
It turns out that the central limit theorem still holds even if we use the plug-in estimate of the standard error.
That is, let
\[ T_n = \frac{\bar{X}_n - \mu}{\hat{se}(\bar{X}_n)} = \frac{\sqrt{n} (\bar{X}_n - \mu)}{\hat\sigma} \]
Then \( T_n \rightsquigarrow Z \), where \( Z \) is a standard normal random variable. (We call \( T_n \) a \( t \)-statistic rather than a \( z \)-score.)
If you've taken a statistics class before, you may remember spending a lot of time on \( t \) statistics and the \( t \) distribution.
Your experience might have involved a horrible, soul-crushing table of critical values for the \( t \) distribution.
The \( Z \) statistic and the \( T \) statistic are really similar; in particular, both are asymptotically normal!
The whole point of that awful table was to correct for the small errors in the normal approximation that emerge when \( n \) is modest (say, less than 30).
So we'll basically ignore this horrible table and treat the \( T \) statistic as if it were normal (which it is, asymptotically!).
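To see how little the table matters once \( n \) is moderately large, compare the \( t \) and normal critical values (a quick check of my own, not from the original notes):

```r
# Upper-2.5% critical values of the t distribution versus the standard normal
df <- c(5, 10, 30, 100, 1000)
round(qt(0.975, df), 3)   # 2.571, 2.228, 2.042, 1.984, 1.962
round(qnorm(0.975), 3)    # 1.960
```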
I cannot promise that everyone you encounter will be so sensible.
Some of these people may make you do calculations with the \( t \) distribution, in which case you will have to learn/remind yourself about it. Sorry :-(
Suppose we're trying to estimate \( \theta \) using an estimator \( \hat{\theta}_n \). Let \( \hat{se}(\hat{\theta}_n) \) be the (estimated) standard error of \( \hat{\theta}_n \). Define
\[ Z_n = \frac{\hat{\theta}_n - \theta}{\hat{se}(\hat{\theta}_n)} \]
If \( Z_n \rightsquigarrow Z \) where \( Z \sim N(0,1) \), we say that \( \hat{\theta}_n \) is asymptotically normal.
There are variations on the central limit theorem showing that lots of common estimators are asymptotically normal: sample means, sample proportions, differences of means and proportions, and many others.
One consequence of this fact is that we can use the normal distribution to produce approximate confidence intervals for lots of common statistics problems.
How does that work?
Suppose that we have some \( \hat{\theta}_n \) that we know to be asymptotically normal. Then we have the following approximation:
\[ \frac{\hat{\theta}_n - \theta}{\hat{se}(\hat{\theta}_n)} \sim N(0,1) \]
That is, our estimation error is (approximately) normally distributed:
\[ \hat{\theta}_n - \theta \sim N \left( 0, \hat{se}(\hat{\theta}_n) \right) \]
So, for example, if we go out one standard error:
\[ P \left[ \hat{\theta}_n - \theta \in \left( -\hat{se}, \hat{se} \right) \right] \approx 0.68 \]
And if we go out two standard errors:
\[ P \left[ \hat{\theta}_n - \theta \in \left( -2 \hat{se}, 2 \hat{se} \right) \right] \approx 0.95 \]
Probabilistic bounds on estimation error = confidence intervals!
In general, if \( \hat{\theta}_n \) is asymptotically normal, and if we let \( z_\alpha \) denote the \( (1 - \alpha/2) \) quantile of the normal distribution, then the interval estimate
\[ I_n = [\hat{L}_n, \hat{U}_n] = \hat{\theta}_n \pm z_\alpha \cdot \hat{se}(\hat{\theta}_n) \]
is an approximate confidence interval at level \( 1-\alpha \). That is, it covers the true value \( \theta \) with probability approximately equal to \( 1 - \alpha \).
Remember the frequentist coverage principle?
\[ P_{\theta} \left( \theta \in [\hat{L}_n, \hat{U}_n] \right) \geq 1 - \alpha \]
Remember our three questions? (What is fixed? What is random? What is the source of this randomness?)
It's the same statement again. Here, as with bootstrapped confidence intervals, the probability claim is approximately true, and the approximation gets better with larger \( n \).
Let's see a couple of examples in normalCI_examples.R.
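If you don't have the script handy, here is a minimal sketch in the spirit of normalCI_examples.R (the actual file may differ):

```r
# Approximate 95% confidence interval for a mean, based on asymptotic normality
set.seed(1)
x         <- rexp(100, rate = 1)       # simulated data; the true mean is 1
theta_hat <- mean(x)
se_hat    <- sd(x) / sqrt(length(x))   # plug-in standard error
z         <- qnorm(0.975)              # z_alpha for a 95% interval
theta_hat + c(-1, 1) * z * se_hat      # should cover 1 for about 95% of samples
```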
A really similar line of thinking allows us to construct hypothesis tests.
Suppose we have a null hypothesis \( H_0: \theta = \theta_0 \) and a test statistic \( T_n \) whose sampling distribution we can approximate under the null.
Because of asymptotic normality, we know that if the null hypothesis is true (i.e. \( \theta = \theta_0 \)), then
\[ T_n \sim N(\bar{T}_0, \mbox{se}_0 (T_n)) \]
where \( \bar{T}_0 = E(T \mid \theta = \theta_0) \) and \( \mbox{se}_0 (T_n) \) are the mean and standard error of \( T_n \), assuming that \( \theta = \theta_0 \).
We can then do any kind of test we want: one-sided or two-sided, for example by computing a \( p \)-value.
Ikea produces 148.9 million of those tiny little Allen wrenches each year (5 wrenches per second, all day every day). The wrenches should be 5.0 mm in diameter, on average. If they're not, something has gone awry in the manufacturing process.
The last 30 wrenches that were sampled off the assembly line have measured 5.03 mm in average diameter, with a standard deviation of 0.07 mm.
Should we stop making wrenches and figure out what's wrong? Or is this just a chance statistical fluctuation?
Let \( X_i \) be the diameter of wrench \( i \), and let \( \mu = E(X_i) \). The null hypothesis is that \( \mu = 5.0 \). Let's compute a \( p \)-value under this null hypothesis.
The estimated standard error of \( \bar{X}_n \) is \( \hat\sigma / \sqrt{n} = 0.07/\sqrt{30} \approx 0.0128 \). So under the null hypothesis, we can approximate the sampling distribution of \( \bar{X}_n \) as
\[ \bar{X}_n \sim N(5.0, 0.0128) \]
Under this null hypothesis, we have \( p = P(\bar{X}_n \geq 5.03) = 0.0095 \).
Note: in this case we might want to calculate a two-sided \( p \)-value as:
\[ p = P(\bar{X}_n \leq 4.97) + P(\bar{X}_n \geq 5.03 ) = 0.019 \]
The thinking here is that deviations of 0.03 or more in either direction are equally important, and both should count as “at least as extreme as” the observed value of 5.03.
Remember: in general the \( p \) value is \( P(T \in \Gamma(t_{ob}) \mid H_0) \), where \( \Gamma(t) \) is the set of test statistics that are judged “more extreme than” some specific value \( t \).
Any time we're doing a test under the assumption of asymptotic normality, we can always standardize our test statistic to compare it to a standard \( N(0,1) \) distribution. If \( T_n \sim N(\bar{T}_0, \mbox{se}_0 (T_n)) \) under the null, then
\[ Z_n = \frac{(T_n - \bar{T}_0)}{\mbox{se}_0 (T_n)} \sim N(0,1) \]
So the general recipe for testing under asymptotic normality is: standardize the observed test statistic under the null, compare the result to a standard normal distribution, and compute a \( p \)-value (as in the sketch below).
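Here is that recipe applied to the wrench example in R (a sketch using the numbers quoted above):

```r
# Two-sided test of H0: mu = 5.0 for the wrench example
xbar  <- 5.03                # observed mean diameter (mm)
mu0   <- 5.0                 # null-hypothesis mean (mm)
sigma <- 0.07                # sample standard deviation (mm)
n     <- 30                  # sample size, so that se matches the 0.0128 above
se0   <- sigma / sqrt(n)     # plug-in standard error, about 0.0128
z     <- (xbar - mu0) / se0  # standardized test statistic, about 2.34
2 * (1 - pnorm(abs(z)))      # two-sided p-value, about 0.019
```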
Note: compare our test with the normal-theory confidence interval for \( \mu \) in light of the data.
\[ \begin{aligned} \mu &\in \bar{X}_n \pm 2 \cdot \hat{se}(\bar{X}_n) \\ &= 5.03 \pm 2 \cdot 0.0128 \\ &= (5.0044, 5.0556) \end{aligned} \]
Isn't that way more informative than quoting \( p = 0.019 \)?
I'll show you a few examples of normal-based testing and confidence intervals in action.
In each case, the confidence interval is more scientifically meaningful and interpretable, compared to the \( p \)-value.