MLE (Maximum Likelihood Estimation)

Some distributions:

  • Binomial: How many of the next 12 people will be wearing Skyhawk apparel if the probability of any one person wearing Skyhawk apparel is \( p=0.15? \) (a discrete counting distribution)
  • Geometric: How many people will I need to pass before I see someone wearing Skyhawk apparel if the probability of any one person wearing Skyhawk apparel is \( p=0.15? \) (a discrete waiting distribution)
  • Exponential: How long will I have to wait until the next earthquake? (a continuous waiting distribution)
  • Gamma: a generalization of the exponential; e.g., how long will I have to wait until the \( k \)th earthquake? (a continuous waiting distribution)
  • Uniform: It could happen anytime in the next 3 hours with equal probability. (a continuous distribution)
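
If you want to play with these distributions concretely, here is a minimal sketch using SciPy (the specific probabilities queried are illustrative choices, not part of the examples above):

```python
from scipy import stats

p = 0.15  # probability that any one person wears Skyhawk apparel

# Binomial: P(exactly 3 of the next 12 people wear Skyhawk apparel)
print(stats.binom.pmf(3, n=12, p=p))

# Geometric: P(the first Skyhawk sighting is the 5th person I pass)
print(stats.geom.pmf(5, p=p))

# Exponential: P(waiting more than 2 time units for the next earthquake),
# with the scale chosen arbitrarily for illustration
print(stats.expon.sf(2, scale=1))

# Uniform on [0, 3]: P(it happens within the first hour) = 1/3
print(stats.uniform.cdf(1, loc=0, scale=3))
```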

A problem:

Katie has \( N \) keys in her purse, of which exactly one opens the door to her house. When she arrives home at the end of a long day, she puts her hand into her purse, randomly grabs a key, and tries to unlock the door. If she fails, she puts the key back into her purse and then randomly draws another key. Suppose that over the course of 5 days, she manages to unlock her front door on the 8th, 12th, 7th, 6th, and 12th attempts. (Assume the keys feel similar, so each has the same chance of being picked.)
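
Before trying to name the distribution, it can help to simulate the scenario. Here is a minimal sketch; the value \( N=10 \) is purely hypothetical, since \( N \) is exactly what we don't know:

```python
import random

def attempts_to_unlock(N):
    """Draw keys uniformly at random *with replacement* until the right
    one comes up; return the number of attempts taken."""
    attempts = 0
    while True:
        attempts += 1
        if random.randrange(N) == 0:  # key 0 plays the role of the right key
            return attempts

random.seed(1)  # for reproducibility
N = 10  # hypothetical; N is the unknown in the problem
print([attempts_to_unlock(N) for _ in range(5)])  # five simulated days
```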

Which distribution models this scenario?

  • For any single attempt, what is the probability of picking the right key? (Your answer isn’t a number, but an expression.)
  • If \( X \) is the random variable that gives the number of attempts to grab the right key, what is \( P(X=8) \)?
  • What is the probability of obtaining the values of \( X \) listed above: 8, 12, 7, 6, 12? (Give your best answer; this really requires a course in probability, but I bet you get it.)
  • \( p=\frac{1}{N} \)
  • \( P(X=8)=(1-p)^7\cdot p \).
  • That is, seven failures followed by one success.
  • \[ P(X=8)\cdot P(X=12) \cdot P(X=7)\cdot P(X=6) \cdot P(X=12) \]
  • That is, we multiply the probabilities: this works as long as each trial is independent of the others.
  • \( p \) is the parameter in this distribution
  • We don’t know the value of \( p \) because we don’t know \( N \).
  • We’d like to find the value of \( p \) that maximizes the probability of obtaining the observed result.
  • How? Calculus.
  • Define the function: \[ L(\theta | x_1, x_2, \dots, x_n)=P(X_1=x_1) \cdot P(X_2=x_2) \cdots P(X_n=x_n) \]
  • Think of this as a function of the parameter \( \theta \) (\( p \) for this particular example) where the \( x_i \) s are the observed values of the random variable.
  • Write down \( L(\theta|x_1, \dots, x_n) \) for this example. Your answer should be a function of \( p \) in which the observed values appear.

\[ \begin{align} &L(p | 8, 12, 7, 6, 12)\\ &=(1-p)^7 \cdot p \cdot (1-p)^{11} \cdot p \cdot (1-p)^6 \cdot p\\ &\cdot (1-p)^5 \cdot p \cdot (1-p)^{11} \cdot p \end{align} \]
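
As a sanity check, here is a sketch that evaluates this \( L(p) \) numerically at a few candidate values of \( p \) (NumPy assumed):

```python
import numpy as np

attempts = np.array([8, 12, 7, 6, 12])  # Katie's observed attempt counts

def L(p):
    # Product over the observations of (1 - p)^(x_i - 1) * p
    return np.prod((1 - p) ** (attempts - 1) * p)

for p in [0.05, 0.10, 0.15, 0.25]:
    print(f"L({p}) = {L(p):.3e}")
```

The values are tiny (they are products of many probabilities), which is one more reason to work with \( \ln(L(p)) \) below.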

  • We’d like to maximize this function!
  • How does that go?
  • So, taking the derivative of \( L(p) \) doesn't look pleasant, primarily because it's a product: we'd either have to expand \( L(p) \) or apply the product rule… a lot.
  • However, \( \ln \) turns products into sums: \( \ln(A\cdot B) = \ln(A) +\ln(B) \)
  • … and since \( \ln \) is an increasing function, the value of \( p \) that maximizes \( \ln(L(p)) \) is the same value that maximizes \( L(p) \).
  • Take \( \ln \) of: \[ \begin{align} &L(p)\\ &=(1-p)^7 \cdot p \cdot (1-p)^{11} \cdot p \cdot (1-p)^6 \cdot p\\ &\cdot (1-p)^5 \cdot p \cdot (1-p)^{11} \cdot p \end{align} \]
  • If you’re comfortable with the notation, you can write \( L(p) \) with product notation: \[ L(p)=\prod_{i=1}^n (1-p)^{x_i-1}\cdot p \]

\[ \begin{align} \ln(L(p))&=\sum_{i=1}^n \ln\left((1-p)^{x_i-1}\right) + \sum_{i=1}^n \ln(p)\\ &=\sum_{i=1}^n (x_i-1) \cdot \ln(1-p) + \sum_{i=1}^n \ln(p)\\ &= \ln(1-p) \sum_{i=1}^n (x_i-1) + \ln(p) \cdot \sum_{i=1}^n 1\\ &= \ln(1-p) \sum_{i=1}^n (x_i-1) + \ln(p)\cdot n \end{align} \]
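
A quick numerical check that the simplification above is right, comparing \( \ln \) of the raw product against the closed form (a sketch, assuming NumPy):

```python
import numpy as np

attempts = np.array([8, 12, 7, 6, 12])
p = 0.2  # any value in (0, 1) works for this check

raw = np.log(np.prod((1 - p) ** (attempts - 1) * p))
simplified = np.log(1 - p) * np.sum(attempts - 1) + len(attempts) * np.log(p)
print(raw, simplified)  # the two numbers should agree up to rounding
```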

  • What is \( n \)?
  • What is \( \sum_{i=1}^n (x_i-1) \)?
  • \( n=5 \)
  • \( \sum_{i=1}^n (x_i-1)=7+11+6+5+11=40 \)
  • So we have \[ \ln(L(p))= 40\ln(1-p) + 5\ln(p) \]
  • Now maximize \( \ln(L(p)) \).

\[ \frac{d}{dp} \ln(L(p))=-40\cdot \frac{1}{1-p}+\frac{5}{p} \] Set equal to zero: \[ \begin{align} 0&=-40\cdot \frac{1}{1-p}+\frac{5}{p}\\ \frac{40}{1-p}&=\frac{5}{p}\\ \frac{1-p}{40}&=\frac{p}{5}\\ 1-p&=8p\\ 1&=9p\\ p&=\frac{1}{9} \quad \text{so } N=9 \end{align} \] The second derivative, \( -\frac{40}{(1-p)^2}-\frac{5}{p^2} \), is negative for every \( p\in(0,1) \), so this critical point really is a maximum.
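
The calculus can also be confirmed numerically, for example with a simple grid search over \( p \) (a sketch, assuming NumPy):

```python
import numpy as np

def log_L(p):
    # The simplified log-likelihood derived above
    return 40 * np.log(1 - p) + 5 * np.log(p)

grid = np.linspace(0.001, 0.999, 9999)
p_hat = grid[np.argmax(log_L(grid))]
print(p_hat, 1 / 9)  # both are about 0.1111
```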

  • This process is called maximum likelihood estimation; the resulting value \( \hat{p}=\frac{1}{9} \) is the maximum likelihood estimate of \( p \).
  • We maximized \[ L(\theta | x_1, x_2, \dots, x_n)=P(X_1=x_1) \cdot P(X_2=x_2) \cdots P(X_n=x_n) \]
  • We replaced \( P(X_i=x_i) \) with \( (1-p)^{x_i-1}\cdot p \) which is the pmf for the geometric distribution.
  • So, \[ L(\theta | x_1, x_2, \dots, x_n)=f(x_1, \theta) \cdot f(x_2, \theta) \cdots f(x_n, \theta) \] where \( f \) is either the pmf or the pdf depending on whether the random variable is discrete or continuous.
  • We find the value of the parameter, \( \theta \), that maximizes \( L(\theta) \).
  • Typically we do this by maximizing \( \ln(L(\theta)) \) instead, since that tends to be easier (see the sketch below).
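
When the calculus is messy, the same recipe can be carried out numerically: minimize \( -\ln(L(\theta)) \) with an off-the-shelf optimizer. Here is a sketch for the geometric example using scipy.optimize (the setup is illustrative, not the only way to do it):

```python
import numpy as np
from scipy.optimize import minimize_scalar

x = np.array([8, 12, 7, 6, 12])  # observed attempt counts

def neg_log_likelihood(p):
    # -ln L(p) for the geometric pmf (1 - p)^(x_i - 1) * p
    return -(np.log(1 - p) * np.sum(x - 1) + len(x) * np.log(p))

result = minimize_scalar(neg_log_likelihood,
                         bounds=(1e-6, 1 - 1e-6), method="bounded")
print(result.x)  # about 0.1111, i.e. the hand-computed 1/9
```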