The standard proof of the consistency of the MLE is quite abstract and hard to follow. Here are two proofs of the consistency of the MLE that I think are much easier to understand, although they may not be as rigorous.
Method 1.
Let \(\{X_1,...,X_n\}\) be a sequence of observations and let \(\hat{\theta_n}\) be the MLE computed from \(\{X_1,...,X_n\}\). We say that \(\hat{\theta_n}\) is consistent if \(\hat{\theta_n}\overset{P}{\rightarrow}\theta\).
To prove consistency, suppose the following condition holds:
\[E\left[(\hat{\theta_n}-\theta)^2\right]\to 0 \text { as } n \to \infty\]
From Chebyshev's inequality
\[P(|X-E(X)|\ge k)\le \frac{Var(X)}{k^2}\]
or, more precisely, from the same argument (Markov's inequality) applied to the random variable \((\hat{\theta_n}-\theta)^2\), we get
\[P(|\hat{\theta_n}-\theta|\ge \epsilon)=P\left((\hat{\theta_n}-\theta)^2\ge \epsilon^2\right)\le \frac{E(\hat{\theta_n}-\theta)^2}{\epsilon^2}\] Note that \(E(\hat{\theta_n}-\theta)^2\) is the mean squared error of the random variable \(\hat{\theta_n}\); it equals its variance only when \(\hat{\theta_n}\) is unbiased.
By the condition \(E\left[(\hat{\theta_n}-\theta)^2\right]\to 0 \text { as } n \to \infty\)
We have
\[0\le P(|\hat{\theta_n}-\theta|\ge \epsilon)\le \frac{E(\hat{\theta_n}-\theta)^2}{\epsilon^2} \to0 \] Therefore,
\[P(|\hat{\theta_n}-\theta|\ge \epsilon)\to 0 \text{ as } n\to\infty\]
so \(\hat{\theta_n}\) is consistent.
This method is very easy to understand, but it relies on a strong assumption (the mean squared error tending to zero) that we did not prove.
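To illustrate the claim numerically (this is only a sanity check, not part of the proof), here is a small Monte Carlo sketch using the exponential family with an arbitrary true rate \(\theta_0=2\): the MLE of the rate is \(1/\bar{X}\), and both the empirical mean squared error and the empirical value of \(P(|\hat{\theta_n}-\theta|\ge \epsilon)\) shrink as \(n\) grows.

```python
import numpy as np

rng = np.random.default_rng(0)
theta0 = 2.0    # true rate of the Exponential(theta0) distribution (arbitrary choice)
eps = 0.1       # tolerance in the probability statement
reps = 2000     # Monte Carlo replications per sample size

for n in [10, 100, 1000, 10000]:
    samples = rng.exponential(scale=1.0 / theta0, size=(reps, n))
    theta_hat = 1.0 / samples.mean(axis=1)             # MLE of the rate is 1 / sample mean
    mse = np.mean((theta_hat - theta0) ** 2)           # estimates E[(theta_hat - theta0)^2]
    tail = np.mean(np.abs(theta_hat - theta0) >= eps)  # estimates P(|theta_hat - theta0| >= eps)
    print(f"n={n:6d}  MSE~{mse:.5f}  P(|error| >= {eps})~{tail:.3f}")
```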
Method 2.
Method 2 is a little harder; it is the argument most commonly used to prove the consistency of the MLE.
Assume \(f(x;\theta)\) is the pdf of a continuous distribution. Then, for an i.i.d. sample of size \(n\) with true parameter \(\theta_0\), \[L(\theta;x)=f(x_1;\theta)f(x_2;\theta)\cdots f(x_n;\theta)\] and \[l(\theta;x)=\sum_{i=1}^n\log f(x_i;\theta) \tag{1}\]
Multiplying both sides of \((1)\) by \(\frac{1}{n}\) gives
\[\frac{1}{n}l(\theta;x)=\frac{1}{n}\sum_{i=1}^n\log f(x_i;\theta) \tag{2}\] Then, for any \(\theta \in \Omega\) (not necessarily \(\theta_0\)), the law of large numbers gives
\[\frac{1}{n}\sum_{i=1}^n\log f(x_i;\theta)\overset{P}\rightarrow E_{\theta_0}\left[\log f(X;\theta)\right]\] Here we treat each \(\log f(x_i;\theta)\) as one realization of the random variable \(\log f(X;\theta)\), where \(X\sim f(x;\theta_0)\), so the law of large numbers applies.
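For a concrete (and entirely arbitrary) illustration of this convergence, take the exponential family \(f(x;\theta)=\theta e^{-\theta x}\) with true rate \(\theta_0=2\) and a fixed \(\theta=3\); then \(E_{\theta_0}\left[\log f(X;\theta)\right]=\log\theta-\theta/\theta_0\), and the sample average indeed approaches this limit:

```python
import numpy as np

rng = np.random.default_rng(1)
theta0, theta = 2.0, 3.0   # true rate and an arbitrary fixed parameter value

def log_f(x, theta):
    # log density of Exponential(theta): log(theta) - theta * x
    return np.log(theta) - theta * x

# closed form: E_{theta0}[log f(X; theta)] = log(theta) - theta * E[X] = log(theta) - theta / theta0
limit = np.log(theta) - theta / theta0

for n in [10, 1000, 100000]:
    x = rng.exponential(scale=1.0 / theta0, size=n)
    print(f"n={n:7d}  sample average={log_f(x, theta).mean():+.4f}  limit={limit:+.4f}")
```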
Next we can show
\[\begin{align*} E_{\theta_0}\left[\log f(X;\theta)\right]-E_{\theta_0}\left[\log f(X;\theta_0)\right]&=E_{\theta_0}\left[\log \frac{ f(X;\theta)}{ f(X;\theta_0)}\right]\\ &\le \log E_{\theta_0}\left[\frac{f(X;\theta)}{f(X;\theta_0)}\right] \text{ by Jensen's inequality }\\ &=\log \int \frac{f(x;\theta)}{f(x;\theta_0)}f(x;\theta_0)\,dx\\ &=\log \int f(x;\theta)\,dx\\ &=\log 1 = 0 \end{align*}\]
Therefore, for any fixed \(\theta\), \[E_{\theta_0}\left[\log f(X;\theta)\right]\le E_{\theta_0}\left[\log f(X;\theta_0)\right] \] That is, the expected log density \(E_{\theta_0}\left[\log f(X;\theta)\right]\), viewed as a function of \(\theta\), achieves its maximum value at \(\theta=\theta_0\).
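As a quick worked example of this fact (my own check, using the \(N(\theta,1)\) family, which is not part of the original argument): if \(f(x;\theta)\) is the \(N(\theta,1)\) density and the true mean is \(\theta_0\), then
\[E_{\theta_0}\left[\log f(X;\theta)\right]=-\frac{1}{2}\log(2\pi)-\frac{1}{2}E_{\theta_0}\left[(X-\theta)^2\right]=-\frac{1}{2}\log(2\pi)-\frac{1}{2}\left(1+(\theta_0-\theta)^2\right),\]
which, as a function of \(\theta\), is indeed maximized exactly at \(\theta=\theta_0\).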
Since \(\hat{\theta_n}\) maximizes \(\frac{1}{n}l(\theta;x)\) for each \(n\), and this average converges to \(E_{\theta_0}\left[\log f(X;\theta)\right]\), which is maximized at \(\theta_0\), we may (informally) say \[\frac{1}{n}l(\hat{\theta_n};x)\overset{P}\rightarrow E_{\theta_0}\left[\log f(X;\theta_0)\right] \text{ as } n \to \infty \tag{3}\]
Next we may treat \((3)\) as \(g(\hat{\theta_n})\overset{P}\rightarrow g(\theta_0)\), where \(g(\theta)=E_{\theta_0}\left[\log f(X;\theta)\right]\) is the limiting expected log-likelihood into which we plug \(\hat{\theta_n}\) and \(\theta_0\). Then, from the lemma in §5.1.4 of this note (http://www.math.caltech.edu/~2016-17/2term/ma003/Notes/MLEConsistency.pdf), we may say \[\hat{\theta_n}\overset{P}\rightarrow \theta_0\]
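To see this last step numerically (again only an illustrative sketch, reusing the exponential example with an arbitrary \(\theta_0=2\) and an arbitrary grid of candidate values), one can maximize the average log-likelihood over a grid and watch the maximizer move toward \(\theta_0\) as \(n\) grows:

```python
import numpy as np

rng = np.random.default_rng(2)
theta0 = 2.0                          # true rate (arbitrary choice)
grid = np.linspace(0.5, 5.0, 2001)    # candidate parameter values

for n in [10, 100, 1000, 100000]:
    x = rng.exponential(scale=1.0 / theta0, size=n)
    # (1/n) * log-likelihood of Exponential(theta), evaluated on the whole grid
    avg_loglik = np.log(grid) - grid * x.mean()
    theta_hat = grid[np.argmax(avg_loglik)]   # maximizer of the average log-likelihood
    print(f"n={n:6d}  theta_hat={theta_hat:.3f}  theta0={theta0}")
```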
The book Introduction to Mathematical Statistics by Robert V. Hogg et al. gives the following proof.
Because \(\theta_0\) is an interior point of \(\Omega\), we have \((\theta_0-a, \theta_0 + a) \subset \Omega\) for some \(a > 0\). (This is just the definition of an interior point of a set.)
Define \(S_n\) to be the event
\[S_n = \left\{ X : l(\theta_0;X) >l(\theta_0-a;X)\right\}\cap \left \{X : l(\theta_0;X) > l(\theta_0 + a;X)\right\} \tag{4}\] Here \(S_n\) is the event that the sample \(X=(X_1,\dots,X_n)\) satisfies both \(l(\theta_0;X) >l(\theta_0-a;X)\) and \(l(\theta_0;X) > l(\theta_0 + a;X)\). By the argument above, \(\frac{1}{n}\left[l(\theta_0;X)-l(\theta;X)\right]\) converges in probability to \(E_{\theta_0}\left[\log f(X;\theta_0)\right]-E_{\theta_0}\left[\log f(X;\theta)\right]>0\) for \(\theta=\theta_0\pm a\) (the inequality is strict when the model is identifiable), so the probability of each of the two events tends to one. Therefore we can write
\[\lim_{n \to \infty}P(S_n)=1 \tag{5}\] Then the authors define a new event (set)
\[\{ X: | \hat{ \theta_{n} } \left( X \right) -\theta_{0} | < a \} \cap \{ X: l^{ \prime} \left( \hat{\theta_n} \left( X \right) \right) =0 \}\] This event contains \(S_n\): on \(S_n\), the log-likelihood \(l(\theta;X)\), restricted to the interval \([\theta_0-a,\theta_0+a]\), attains its maximum at an interior point, so (assuming \(l\) is differentiable in \(\theta\)) there is a solution \(\hat{\theta_n}(X)\) of \(l'(\theta;X)=0\) inside the interval. Hence every \(X\) satisfying \((4)\) also satisfies both \(\{X:|\hat{\theta_{n}}(X)-\theta_{0}|< a\}\) and \(\{ X: l^{ \prime} \left( \hat{\theta_n} \left( X \right) \right) =0 \}\).
Therefore, we can write
\[S_n \subset \{ X: | \hat{ \theta_{n} } \left( X \right) -\theta_{0} | < a \} \cap \{ X: l^{ \prime} \left( \hat{\theta_n} \left( X \right) \right) =0 \}\tag{6}\] Since \(A\subset B \Rightarrow P(A)\le P(B)\), combining \((6)\) and \((5)\) we can write
\[1=\lim_{n \to \infty}P(S_n)\le \overline{\lim_{n\to \infty}}P[\{ X: | \hat{ \theta_{n} } \left( X \right) -\theta_{0} | < a \} \cap \{ X: l^{ \prime} \left( \hat{\theta_n} \left( X \right) \right) =0 \}] \le 1\] and since \(P(A\cap B)\le P(A)\) we can write
\[1=\lim_{n \to \infty}P(S_n)\le \overline{\lim_{n\to \infty}}P[\{ X: | \hat{ \theta_{n} } \left( X \right) -\theta_{0} | < a \} \cap \{ X: l^{ \prime} \left( \hat{\theta_n} \left( X \right) \right) =0 \}]\le \overline{\lim_{n\to \infty}}P[\{ X: | \hat{ \theta_{n} } \left( X \right) -\theta_{0} | < a \}]\le 1\] we can get,
\[1\le \overline{\lim_{n\to \infty}}P[\{ X: | \hat{ \theta_{n} } \left( X \right) -\theta_{0} | < a \}]\le 1\] Finally, we have
\[\lim_{n\to \infty}P[\{ X: | \hat{ \theta_{n} } \left( X \right) -\theta_{0} | < a \}]= 1\] Since \(a\) can be taken arbitrarily small, and the probability is only larger for larger tolerances, this is exactly the statement \[\hat{\theta_n}\overset{P}\rightarrow \theta_0\] \(\square\)
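As a closing sanity check of this argument (a small Monte Carlo sketch, again using the exponential family with arbitrary choices \(\theta_0=2\) and \(a=0.5\)), one can estimate \(P(S_n)\) and \(P(|\hat{\theta_n}-\theta_0|<a)\) directly and watch both approach \(1\):

```python
import numpy as np

rng = np.random.default_rng(3)
theta0, a = 2.0, 0.5   # true rate and interval half-width (arbitrary choices)
reps = 5000            # Monte Carlo replications per sample size

for n in [5, 20, 100, 1000]:
    x = rng.exponential(scale=1.0 / theta0, size=(reps, n))
    s = x.sum(axis=1)

    def loglik(theta):
        # log-likelihood of an Exponential(theta) sample: n*log(theta) - theta*sum(x)
        return n * np.log(theta) - theta * s

    # S_n: the log-likelihood at theta0 beats both endpoints theta0 - a and theta0 + a
    in_Sn = (loglik(theta0) > loglik(theta0 - a)) & (loglik(theta0) > loglik(theta0 + a))
    theta_hat = n / s                                  # MLE of the rate is 1 / sample mean
    close = np.abs(theta_hat - theta0) < a
    print(f"n={n:5d}  P(S_n)~{np.mean(in_Sn):.3f}  P(|theta_hat-theta0|<a)~{np.mean(close):.3f}")
```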
Here are some papers related to the consistency of the MLE:
1. A. Wald. 1949. Note on the consistency of the maximum likelihood estimate. Annals of Mathematical Statistics 20(4):595-601. http://www.jstor.org/stable/2236315
2. J. Wolfowitz. 1949. On Wald's proof of the consistency of the maximum likelihood estimate. Annals of Mathematical Statistics 20(4):601-602. http://www.jstor.org/stable/2236316
3. Harald Cramér. Mathematical Methods of Statistics. Princeton University Press.