In this section we develop approximations to the sampling distribution of maximum likelihood estimates by using limiting arguments as the sample size increases.
Recall that for an i.i.d. sample of size \(n\) the log-likelihood function is given by \[l(\theta) = \sum_{i = 1}^n\log f(x_i|\theta)\]
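As a concrete added illustration, here is a minimal Python sketch of evaluating this log-likelihood for an i.i.d. sample; the Exponential(\(\theta\)) density \(f(x|\theta) = \theta e^{-\theta x}\) and the function names are our own choices, not part of the original development.

```python
import numpy as np

def log_likelihood(theta, x, logpdf):
    """l(theta) = sum_i log f(x_i | theta) for an i.i.d. sample x."""
    return np.sum(logpdf(x, theta))

# Hypothetical example: Exponential(rate = theta), f(x|theta) = theta * exp(-theta * x)
exp_logpdf = lambda x, theta: np.log(theta) - theta * x

rng = np.random.default_rng(0)
x = rng.exponential(scale=1 / 2.0, size=100)   # sample drawn with true rate theta_0 = 2
print(log_likelihood(2.0, x, exp_logpdf))      # log-likelihood evaluated at theta = 2
```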
From now on we will denote the true value of \(\theta\) by \(\theta_0\). We now state and sketch a proof of the consistency of the mle.
Theorem: Under appropriate smoothness conditions on \(f\), the mle of an i.i.d. sample is consistent. In other words, the estimate \(\hat{\theta}\) converges to \(\theta_0\) in probability as \(n\) tends to infinity.
Proof (Sketch)
The goal is to show that \(\hat{\theta}\), the value of \(\theta\) that maximizes the log-likelihood, converges to the true value \(\theta_0\).
We know the log-likelihood is given by \[l(\theta) = \sum_{i = 1}^n\log f(x_i|\theta)\] Equivalently, we can consider maximizing \[\frac{1}{n}l(\theta) =\frac{1}{n}\sum_{i = 1}^n\log f(x_i|\theta)\] Now by the law of large numbers, as \(n\rightarrow \infty\), \[\frac{1}{n}l(\theta) \rightarrow E(\log f(X|\theta))\;\; (**)\] It is important to note here that the RHS of \((**)\) is no longer a log-likelihood but rather the expectation of the log density. Moreover, we have that \[E(\log f(X|\theta)) = \int_{-\infty}^\infty \log(f(x|\theta))f(x|\theta_0)dx\]

Here we begin to hand-wave a bit, and simply say that it is plausible that for large \(n\) the \(\theta\) that maximizes \(l(\theta)\) is close to the \(\theta\) that maximizes \(E(\log f(X|\theta))\) (a more sophisticated argument is needed for a rigorous proof). Let us now consider maximizing the RHS of \((**)\). As usual, we take the derivative with respect to \(\theta\): \[\frac{\partial}{\partial\theta}\int \log(f(x|\theta))f(x|\theta_0)dx\] Since we made some assumptions about the smoothness of \(f\), we are free to interchange the integral and the derivative: \[ = \int\frac{\partial}{\partial\theta} \log(f(x|\theta))f(x|\theta_0)dx = \int \frac{\frac{\partial}{\partial\theta} f(x|\theta)}{f(x|\theta)}f(x|\theta_0)dx\] Now if \(\theta = \theta_0\) we have the following: \[ = \int\frac{\partial}{\partial\theta}f(x|\theta_0)dx = \frac{\partial}{\partial\theta}\int f(x|\theta_0)dx = \frac{\partial}{\partial\theta}(1) = 0\] This shows that \(\theta_0\) is a stationary point and, we hope, a maximum.
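The hand-waving step above can at least be checked by simulation. The sketch below is our own addition, assuming an Exponential(\(\theta_0\)) model with density \(f(x|\theta) = \theta e^{-\theta x}\), for which the mle has the closed form \(\hat{\theta} = 1/\bar{x}\); the printed estimates settle near \(\theta_0 = 2\) as \(n\) grows.

```python
import numpy as np

rng = np.random.default_rng(1)
theta_0 = 2.0                                  # true rate of the Exponential(theta_0) model

for n in [10, 100, 1_000, 10_000, 100_000]:
    x = rng.exponential(scale=1 / theta_0, size=n)
    theta_hat = 1 / x.mean()                   # closed-form mle for the exponential rate
    print(f"n = {n:>7}:  theta_hat = {theta_hat:.4f}")
```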
We now define the Fisher information in the form of a lemma.
Lemma: Define \(I(\theta)\) by \[I(\theta) = E\Big[\Big( \frac{\partial}{\partial\theta} \log f(X|\theta) \Big)^2\Big]\] Under appropriate smoothness conditions on \(f\), \(I(\theta)\) may also be expressed as \[I(\theta) = -E\Big[ \frac{\partial^2}{\partial\theta^2} \log f(X|\theta) \Big]\]
Proof
First note that \(\int f(x|\theta)dx = 1\), and therefore, as we saw above, \[\frac{\partial}{\partial\theta}\int f(x|\theta)dx = 0\] Moreover, we also have the following identity: \[\frac{\partial}{\partial\theta} f(x|\theta) = \Big[\frac{\partial}{\partial\theta}\log f(x|\theta)\Big]f(x|\theta)\;\;\;\; (1)\] To see this, simply differentiate the logarithm inside the square brackets using the chain rule.
Now combining these two results we have the following: \[0 = \frac{\partial}{\partial\theta}\int f(x|\theta)dx = \int\frac{\partial}{\partial\theta} f(x|\theta)dx\] \[ = \int \Big[\frac{\partial}{\partial\theta}\log f(x|\theta)\Big]f(x|\theta)dx\;\;\; (2)\] Differentiating both sides of \((2)\) once more with respect to \(\theta\) yields \[0 = \frac{\partial}{\partial\theta}\int \Big[\frac{\partial}{\partial\theta}\log f(x|\theta)\Big]f(x|\theta)dx\] Once again we can interchange the integral and the derivative by the assumptions made on \(f\); applying the product rule then gives \[ = \int \Big(\Big[ \frac{\partial^2}{\partial\theta^2} \log f(x|\theta) \Big]f(x|\theta) + \Big[ \frac{\partial}{\partial\theta} \log f(x|\theta) \Big]\frac{\partial}{\partial\theta}f(x|\theta) \Big)dx\] Applying the identity in equation \((1)\) above, we get \[ = \int \Big(\Big[ \frac{\partial^2}{\partial\theta^2} \log f(x|\theta) \Big]f(x|\theta) + \Big[ \frac{\partial}{\partial\theta} \log f(x|\theta) \Big]^2 f(x|\theta) \Big)dx\] Lastly, we can split the integral over the sum to get \[ = \int\Big[ \frac{\partial^2}{\partial\theta^2} \log f(x|\theta) \Big]f(x|\theta)dx + \int\Big[ \frac{\partial}{\partial\theta} \log f(x|\theta) \Big]^2 f(x|\theta)dx\] \[= E\Big[ \frac{\partial^2}{\partial\theta^2} \log f(X|\theta) \Big] + E\Big[\Big( \frac{\partial}{\partial\theta}\log f(X|\theta)\Big)^2\Big] = 0\] \[\therefore \;\;I(\theta) = -E\Big[ \frac{\partial^2}{\partial\theta^2} \log f(X|\theta) \Big]\]
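As an added sanity check of the lemma (not part of the original text), take the Exponential density \(f(x|\theta) = \theta e^{-\theta x}\) for \(x > 0\). Then \[\frac{\partial}{\partial\theta}\log f(x|\theta) = \frac{1}{\theta} - x, \qquad \frac{\partial^2}{\partial\theta^2}\log f(x|\theta) = -\frac{1}{\theta^2}\] Since \(E(X) = 1/\theta\), the first expression gives \(E\big[(1/\theta - X)^2\big] = Var(X) = 1/\theta^2\), and the second gives \(-E\big[-1/\theta^2\big] = 1/\theta^2\), so both forms agree that \(I(\theta) = 1/\theta^2\).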
In proving the above we obtain some immediate results. Namely, writing \(l'(x, \theta) = \frac{\partial}{\partial\theta}\log f(x|\theta)\) for the score of a single observation, the result of equation \((2)\) says that \[E\big[ l'(X, \theta) \big] = 0\] which implies that \[Var(l'(X, \theta)) = E\big(l'(X, \theta)^2\big) - \big(E(l'(X, \theta))\big)^2 = I(\theta) - 0 = I(\theta)\]
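These identities are also easy to check numerically. The sketch below is our own addition, again assuming the Exponential(\(\theta\)) model, whose score for a single observation is \(1/\theta - x\) and whose Fisher information is \(I(\theta) = 1/\theta^2\); the Monte Carlo estimates of the mean and variance of the score should land near \(0\) and \(0.25\) respectively for \(\theta = 2\).

```python
import numpy as np

rng = np.random.default_rng(2)
theta = 2.0
x = rng.exponential(scale=1 / theta, size=1_000_000)

score = 1 / theta - x                        # d/dtheta log f(x|theta) for the Exponential model

print("mean of score    :", score.mean())   # expected to be near 0
print("variance of score:", score.var())    # expected to be near I(theta) = 1/theta^2 = 0.25
```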
The large-sample distribution of the mle is approximately normal with mean \(\theta_0\) and variance \(1/(nI(\theta_0))\). Because this is a limiting result, we say that the mle is asymptotically unbiased and refer to the variance of this normal distribution as the asymptotic variance. We state this as a theorem and give a sketch of the proof.
Theorem: Under smoothness conditions on \(f\), the probability distribution of \(\sqrt{nI(\theta_0)}(\hat{\theta} - \theta_0)\) tends to a standard normal distribution.
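A simulation sketch of this theorem (an addition of ours, once more assuming the Exponential(\(\theta_0\)) model with \(I(\theta_0) = 1/\theta_0^2\) and mle \(\hat{\theta} = 1/\bar{x}\)): across many replications the standardized quantity \(\sqrt{nI(\theta_0)}(\hat{\theta} - \theta_0)\) should have mean near \(0\) and standard deviation near \(1\).

```python
import numpy as np

rng = np.random.default_rng(3)
theta_0, n, reps = 2.0, 500, 10_000
fisher = 1 / theta_0**2                          # I(theta_0) for the Exponential model

z = np.empty(reps)
for r in range(reps):
    x = rng.exponential(scale=1 / theta_0, size=n)
    theta_hat = 1 / x.mean()                     # mle from one replication
    z[r] = np.sqrt(n * fisher) * (theta_hat - theta_0)

print("mean (should be near 0):", z.mean())
print("std  (should be near 1):", z.std())
```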