James Scott (UT-Austin)
I have something to confess.
I am already scared of my baby son growing up.
One minute he's so tiny and helpless.
And then before you know it, he'll be making terrible choices.
But more than anything else, I'm dreading one moment in particular…
The first time he asks me about the facts of life.
It's inevitable: every young person must eventually learn about the facts of life.
They start out so innocent, and yet soon they begin to wonder…
They eventually become dissatisfied with all the lame explanations they've ever received.
And finally they turn to you and ask the question that every parent dreads…
“Daddy, where do estimators come from?”
This is where you face a very stark choice as a parent.
Do you fob him off with some comforting story? Or do you tell him the truth: that every beautiful new estimator that enters the world is actually created by people?
Of course, there is no choice: you have to prepare your child for life as a grown-up.
Today he is young. But some day, he may want to create a special, beautiful estimator of his very own.
So you sit down and try to explain the messy reality of how people create new estimators.
Today, in case you ever have children of your own, I will try to prepare you for that discussion.
We'll talk about two ways of creating sensible estimators: the method of moments, and the principle of maximum likelihood.
There are tons of others we won't cover: generalized method of moments, Bayes estimators, MAP estimators, shrinkage estimators, penalized likelihood, minimum-variance unbiased estimators, generalized estimating equations, maximum entropy, minimum description length…
This whole discussion today assumes that we observe IID data \( X_1, \ldots, X_n \) arising from a parametric probability model with parameter \( \theta \):
\[ X_i \stackrel{iid}{\sim} f(x \mid \theta) \]
And that our goal is to estimate either \( \theta \) itself or some function of the parameter \( g(\theta) \). Whatever we're trying to estimate is called the estimand.
Note: \( \theta \) might be a vector in multi-parameter models. For example, \( \theta = (\mu, \sigma^2) \) in a normal model.
The principle behind the method of moments is very simple to state: choose the parameter \( \theta \) so that the theoretical moments and sample moments are identical. (Remember, moments are means, variances, etc.)
Example: suppose we observe \( X_i \stackrel{iid}{\sim} \mbox{Poisson}(\lambda) \), and \( \lambda \) is unknown. Since \( E(X \mid \lambda) = \lambda \), matching the theoretical mean to the sample mean gives \( \hat{\lambda} = \bar{X}_n \).
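As a preview of the recipe we're about to formalize, here's a minimal R sketch of this Poisson example on simulated data (the true \( \lambda \) below is made up purely for illustration):

```r
# Simulate Poisson data with a known rate, then recover it by
# matching the theoretical mean E(X | lambda) = lambda
# to the sample mean.
set.seed(42)
x <- rpois(1000, lambda = 3.7)  # pretend lambda = 3.7 is unknown
lambda_hat <- mean(x)           # method of moments estimate
lambda_hat                      # should be close to 3.7
```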
Here's the general principle. Let \( \theta \) be a \( K \)-dimensional parameter. Define the theoretical moments as the following function of \( \theta \):
\[ \alpha^{(k)}(\theta) = E(X^k \mid \theta) \]
and the sample moments as the following function of the data:
\[ \hat{\alpha}_n^{(k)} = \frac{1}{n} \sum_{i=1}^n X_i^k \]
The law of large numbers says that \( \hat{\alpha}_n^{(k)} \) converges in probability to \( \alpha^{(k)}(\theta_0) \) as \( n \to \infty \), where \( \theta_0 \) is the true parameter.
The method of moments estimator \( \hat{\theta}_{MM} \) solves the following system of \( K \) equations:
\[ \begin{align} \alpha^{(1)}(\hat\theta) &= \hat{\alpha}_n^{(1)} \\ \alpha^{(2)}(\hat\theta) &= \hat{\alpha}_n^{(2)} \\ &\vdots \\ \alpha^{(K)}(\hat\theta) &= \hat{\alpha}_n^{(K)} \\ \end{align} \]
This is a system of \( K \) equations in \( K \) unknowns and should therefore (usually!) have a unique solution.
So the general recipe for calculating the method of moments estimator of a \( K \)-dimensional parameter is:
Step 1: use probability theory to write down expressions for the theoretical moments as a function of \( \theta \):
\[ \alpha^{(k)}(\theta) = E(X^k \mid \theta) \]
Step 2: calculate the sample moments:
\[ \hat{\alpha}_n^{(k)} = \frac{1}{n} \sum_{i=1}^n X_i^k \]
Step 3: set up the system of \( K \) equations, \( \alpha^{(k)}(\theta) = \hat{\alpha}_n^{(k)} \) for \( k = 1, \ldots, K \).
Step 4: solve the system of equations for \( \hat\theta \).
Suppose \( X_i \sim \mbox{Bern}(p) \) for \( i = 1, \ldots, n \). What is \( \hat{p} \) under the method of moments? Note: here \( K=1 \).
Step 1: use probability theory to write down expressions for the theoretical moments as a function of \( p \).
We've done this before:
\[ \alpha^{(1)}(p) = E(X^1 \mid p) = p \]
Step 2: Calculate the sample moments:
\[ \hat{\alpha}_n^{(1)} = \frac{1}{n} \sum_{i=1}^n X_i^1 = \bar{X}_n \]
Step 3: Set up the system of \( K \) equations, \( \alpha^{(k)}(\theta) = \hat{\alpha}_n^{(k)} \) for \( k = 1, \ldots, K \).
Here \( K=1 \), so it's a system of one equation:
\[ \alpha^{(1)}(\theta) = \hat{\alpha}_n^{(1)} \]
So
\[ \bar{X}_n = p \]
Step 4 (solve the system for \( p \)) is easy: this equation is already solved! The MoM estimator is \( \hat{p} = \bar{X}_n \).
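Here's the same calculation as a minimal R sketch on simulated coin flips (the true \( p \) below is made up for illustration):

```r
# Method of moments for Bernoulli(p): match E(X | p) = p
# to the sample mean of the 0/1 outcomes.
set.seed(1)
x <- rbinom(500, size = 1, prob = 0.3)  # pretend p = 0.3 is unknown
p_hat <- mean(x)                        # MoM estimate: p-hat = X-bar
p_hat                                   # should be close to 0.3
```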
Suppose we assume that our data \( X_1, \ldots, X_n \) comes from a normal distribution: \( X_i \stackrel{iid}{\sim} N(\mu, \sigma^2) \).
The unknown parameter vector is \( \theta = (\mu, \sigma^2) \). Here \( K=2 \).
Step 1: use probability theory to write down expressions for the theoretical moments as a function of \( \theta = (\mu, \sigma^2) \).
Here we need two moments, since \( K=2 \):
\[ \begin{align} \alpha^{(1)}(\theta) &= E(X^1 \mid \theta) = \mu \\ \alpha^{(2)}(\theta) &= E(X^2 \mid \theta) = \sigma^2 + \mu^2 \\ \end{align} \]
The second equation follows from the fact that \( \mbox{var}(X) = E(X^2) - E(X)^2 \).
Step 2: Calculate the sample moments
\[ \begin{align} \hat{\alpha}_n^{(1)} &= \frac{1}{n} \sum_{i=1}^n X_i^1 = \bar{X}_n \\ \hat{\alpha}_n^{(2)} &= \frac{1}{n} \sum_{i=1}^n X_i^2 = S_X^2 \end{align} \]
(Careful: here \( S_X^2 \) denotes the second raw sample moment, not the sample variance.)
Step 3: Set up the system of \( K \) equations, \( \alpha^{(k)}(\theta) = \hat{\alpha}_n^{(k)} \) for \( k = 1, \ldots, K \).
Here we have a system of two equations in two unknowns:
\[ \begin{align} \mu &= \bar{X}_n \\ \sigma^2 + \mu^2 &= S_X^2 = \frac{1}{n} \sum_{i=1}^n X_i^2 \end{align} \]
Step 4: Solve the system of equations. Clearly the first equation is solved at \( \mu = \bar{X}_n \). So the second equation is solved at
\[ \sigma^2 + \bar{X}_n^2 = S_X^2 \]
Or equivalently, \[ \sigma^2 = S_X^2 - \bar{X}_n^2 \]
Thus the method of moments estimator is \( \hat{\theta} = (\hat{\mu}, \hat{\sigma}^2) = (\bar{X}_n, S_X^2 - \bar{X}_n^2) \).
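Before turning to real data, here's a minimal R sketch of this two-parameter recipe on simulated data (the true \( \mu \) and \( \sigma \) below are made up for illustration):

```r
# Method of moments for N(mu, sigma^2): match the first two
# theoretical moments to the first two sample moments.
set.seed(7)
x <- rnorm(1000, mean = 10, sd = 2)  # pretend (mu, sigma^2) unknown
a1 <- mean(x)             # first sample moment, X-bar
a2 <- mean(x^2)           # second (raw) sample moment, S_X^2
mu_hat     <- a1
sigma2_hat <- a2 - a1^2   # = S_X^2 - X-bar^2
c(mu_hat, sigma2_hat)     # should be close to (10, 4)
```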
See gas_method_moments.R.
In most non-crazy situations, the method of moments estimator converges in probability to the right answer:
\[ \hat{\theta}_{MM} \stackrel{P}{\longrightarrow} \theta_0 \]
Note: we say that the estimator is consistent. Consistency means “converging in probability to the right answer with more data.”
Similarly, in most non-crazy situations, the method of moments estimator is asymptotically normal:
\[ Z_n = \frac{\hat{\theta}_{MM} - \theta_0}{\mbox{se}(\hat{\theta}_{MM})} \rightsquigarrow N(0,1) \, . \]
Thus we can make approximate probability statements about the estimator using the normal distribution.
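Here's a small simulation sketch of both facts, reusing the Poisson example from earlier (the sample size, number of replications, and true \( \lambda \) are made up for illustration):

```r
# Simulation sketch: the sampling distribution of the MoM estimator
# lambda-hat = X-bar for Poisson data looks approximately normal.
set.seed(123)
n <- 200
lambda0 <- 3.7                                   # true parameter
lambda_hat <- replicate(5000, mean(rpois(n, lambda0)))
hist(lambda_hat, breaks = 40,
     main = "Sampling distribution of lambda-hat")
# The standardized version should be close to N(0, 1):
z <- (lambda_hat - lambda0) / sqrt(lambda0 / n)  # se of X-bar for Poisson
c(mean(z), sd(z))                                # approximately 0 and 1
```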
Note: there is a more general version of the method of moments, called the “generalized method of moments” (GMM). This is wildly popular in econometrics. To understand GMM (not covered here), you have to understand MM.
Outside of econometrics, the most popular way to construct estimators is the principle of maximum likelihood. Suppose, as before, we observe IID data \( X_1, \ldots, X_n \) from a parametric model with unknown parameter \( \theta \): \( X_i \stackrel{iid}{\sim} f(x \mid \theta) \).
The likelihood function is defined as follows:
\[ L(\theta) = \prod_{i=1}^n f(X_i \mid \theta) \, . \]
Note: if \( X \) is discrete, then \( f \) refers to the PMF; if \( X \) is continuous, then \( f \) refers to the PDF.
The likelihood function is a function of the parameter \( \theta \): the data \( X_1, \ldots, X_n \) are held fixed at their observed values, and we ask how probable those values are under each candidate \( \theta \).
Suppose we observe Bernoulli trials, \( X_i \sim \mbox{Bern}(p) \).
\[ \begin{align} L(p) &= \prod_{i=1}^n f(X_i \mid p) \\ &= \prod_{i=1}^n p^{X_i} (1-p)^{1-X_i} \\ &= p^Y (1-p)^{n-Y} \end{align} \]
where \( Y = \sum_{i=1}^n X_i \).
Suppose we observe \( Y = 12 \) successes (1's) out of \( n=20 \) Bernoulli trials.
Let's try calculating the likelihood function at two different values:
At \( p = 0.4 \), the likelihood function is
\[ L(0.4) = 0.4^{12} (1-0.4)^{(20-12)} = 2.82 \times 10^{-7} \]
Interpretation: if \( p = 0.4 \), the probability of observing this data set \( X \) with \( 12 \) successes and \( 8 \) failures is \( 2.82 \times 10^{-7} \).
At \( p = 0.7 \), the likelihood function is
\[ L(0.7) = 0.7^{12} (1-0.7)^{(20-12)} = 9.08 \times 10^{-7} \]
Interpretation: if \( p = 0.7 \), the probability of observing this data set \( X \) with \( 12 \) successes and \( 8 \) failures is \( 9.08 \times 10^{-7} \).
So it looks like our data would have been more likely to arise if \( p=0.7 \) than if \( p=0.4 \).
Conclusion: \( p=0.7 \) is a better (higher likelihood) guess for the parameter. If these were your only two choices for \( p \), you'd probably choose \( p=0.7 \).
But of course, those aren't the only two choices!
You can guess any probability between 0 and 1.
So let's plot the likelihood as a function of all possible guesses \( p \in (0, 1) \).
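Here's a minimal R sketch that reproduces the two likelihood values above and draws this curve:

```r
# Bernoulli likelihood L(p) = p^Y (1-p)^(n-Y) for Y = 12, n = 20.
Y <- 12; n <- 20
L <- function(p) p^Y * (1 - p)^(n - Y)
L(0.4)  # about 2.82e-07, matching the calculation above
L(0.7)  # about 9.08e-07
curve(L(x), from = 0, to = 1, n = 500,
      xlab = "p", ylab = "L(p)", main = "Likelihood for Y = 12, n = 20")
abline(v = Y / n, lty = 2)  # the curve peaks at p = 12/20 = 0.6
```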
It looks like \( p=0.6 \) is the choice of \( p \) that makes the data look most likely, which makes sense: \( 12/20 = 0.6 \). It is the maximum likelihood estimate, or MLE.
The maximum likelihood estimate (MLE) is the value of \( \theta \) that maximizes \( L(\theta) \), the likelihood function.
Equivalently, the MLE is the value of \( \theta \) that maximizes the logarithm of the likelihood function,
\[ l(\theta) = \log L(\theta) = \sum_{i=1}^n \log f(X_i \mid \theta) \]
Taking the log doesn't change the answer (since log is a monotonic transformation). But it does avoid the problem of the likelihood becoming so small that it can't be represented using the floating-point numerical system on a computer.
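A quick R illustration of that floating-point problem (simulated coin flips; the sample size and \( p \) are made up for illustration):

```r
# With lots of data, the raw likelihood underflows to zero in
# double-precision arithmetic, but the log-likelihood is fine.
set.seed(99)
x <- rbinom(2000, size = 1, prob = 0.5)  # 2000 coin flips
p <- 0.5
prod(p^x * (1 - p)^(1 - x))              # underflows: returns 0
sum(x * log(p) + (1 - x) * log(1 - p))   # fine: about -1386.3
```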
Fact: if we multiply \( L(\theta) \) by any positive constant \( c \), we will not change the MLE. This allows us to be sloppy about ignoring multiplicative constants in the likelihood function.
How do we actually calculate the MLE?
Answer: calculus. Take the derivative of the log likelihood function with respect to \( \theta \), and set it equal to 0.
Suppose \( X_i \sim N(\mu, \sigma^2) \) for \( i=1, \ldots, n \). Let's derive the MLE for \( \theta = (\mu, \sigma^2) \) together on the board.
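If you want to check whatever we derive at the board against the computer, here's a minimal R sketch that maximizes the normal log-likelihood numerically with optim (the simulated data and starting values are made up for illustration):

```r
# Numerical sanity check: maximize the normal log-likelihood
# directly, then compare against the closed-form MLE.
set.seed(2024)
x <- rnorm(500, mean = 5, sd = 3)  # pretend (mu, sigma) unknown

negloglik <- function(theta) {
  mu <- theta[1]; sigma <- theta[2]
  if (sigma <= 0) return(Inf)      # keep sigma positive
  -sum(dnorm(x, mean = mu, sd = sigma, log = TRUE))
}

fit <- optim(c(0, 1), negloglik)   # Nelder-Mead from a naive start
fit$par                            # numerical MLE of (mu, sigma)
c(mean(x), sqrt(mean((x - mean(x))^2)))  # closed-form MLE (1/n scaling)
```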