Inference 1: Point Estimation

Author

Minerva Mukhopadhyay

Lecture 1: What is inference?

  • So far we have learnt to find summary statistics when a data set is available.

  • The available data set most probably contains a sample from the population of interest, and we might have a rough idea about the shape of the distribution of the variables in the data set, along with the summary statistics obtained from the data.

  • Is this enough? With the help of the summary statistics and the idea of the underlying distribution, can we comment on features of the unobserved samples? Can we say anything about a future observation? Even if we make a statement (or hypothesis) on the future observations, what will be the associated level of confidence?

  • Statistical inference is the process of drawing conclusions about populations or scientific truths from the data.

  • Broadly, the whole procedure of Statistical inference can be split into two parts:

    • Parametric inference: In this case the variables in a data set are modeled using a distribution with some unknown parameters. The main task is to make inference about the unknown parameters. In other words, in parametric inference we make assumptions on the class of distributions from which the variables of a data set arise. The uncertainty is regarding the parameters only, which are inferred from the sample observations.

    Example: Suppose we are interested in the average income of the households in Lucknow. As the income distribution is known to be extremely (right) skewed, one way is to model the distribution of income as \(\mathtt{exponential}\) with parameter \(\lambda\). Now, as the expected value of an \(\mathtt{exponential}(\lambda)\) distribution is \(1/\lambda\), one may put forward the inverse of the sample mean as an estimate of \(\lambda\).

    • Non-parametric inference: In this case the variables of interest are not assumed to follow any particular distribution. The quantity of interest (for example, the location parameter) is estimated directly using sample observations. The estimator used for making inference must satisfy some desirable properties, under some very general assumptions.

      Example: Suppose we have a sample of size \(n\) on a variable \(X\), and we are interested in the population expectation \(\mu\). Without making any particular distributional assumption, we can put forward the sample mean \(\bar{X}\) as an estimate of \(\mu\). This is because we know that under any distribution, if the samples are independent and identically distributed, then \(E(\bar{X})=\mu\), \(\bar{X}\xrightarrow{p}\mu\), etc.

    • In MTH211A, we will mostly focus on parametric inference only.

  • Further, based on the underlying philosophy and interpretation of uncertainty, the statistical methods can be split into two main paradigms:

    • Frequentist: In the frequentist perspective, the probability of an event is the limit of its relative frequency.

    • Bayesian: In the Bayesian perspective, probability is a subjective measure of uncertainty.

    • In MTH211A, we will consider both paradigms. However, the following three modules will focus on inference from a frequentist perspective.

  • Some notations:

    1. The random quantities, including random samples, will be indicated by capital letters, \(X_{1},X_{2},\cdots\). Their realizations will be indicated by small letters, \(x_{1},x_{2}, \ldots\). For example, \(P(X=x)\).

      Note that in descriptive statistics, we focused on summarizing a particular sample realization in hand, say \(x_{1},\ldots,x_{n}\), and therefore we indicated the observations by small letters.

      In inference we try to devise statistical methods based on (random) samples, with some desirable properties. As the realizations of the samples are uncertain, the random sample will be indicated by capital letters.

    2. Boldface will be used to indicate vectors, for example, \({\bf X}\) indicates random vector, \({\bf x}\) indicates a vector realization.

    3. The parameters of a distribution are treated as unknown fixed quantities in frequentist inference, and will be indicated in Greek letters, for example, \(\mu\), \(\sigma\), etc. Here also, boldface will be used to indicate parameter vectors, for example, \(\boldsymbol{\mu}\).

Some Definitions and Terminologies:

We have already come across the terms population and sample. The goal of Statistics is to infer about some particular features of the population, with the help of sample observations. While learning the methods of inference the following terminologies will be used repeatedly.

  • \(X_{1},\ldots, X_{n}\) is a random sample from a distribution, say \(F\):

    • Suppose we are interested in the feature (variable) \(X\) of a population, for example, the CPI of all IITK students at the time of graduation in past 10 years. The population can be finite or infinite.

    • If values of the variable were known for all the members of the population, then (theoretically) it would be possible to calculate the relative frequency of event ‘\(X\) takes a value in \([a,b]\)’. Let the relative frequency be \(r\).

    • Now, if we take a random sample from this population, and denote the value of the variable of interest for that sample by \(X_{1}\), then (prior to observing the sample) we would expect that \(P(X_{1}\in[a,b])=r\). Further, this is true for any \(a,b\in\mathbb{R}\).

    • Therefore the probability distribution of the random sample \(X_{1}\) is expected to be the same as the relative frequency distribution of the population. Suppose we denote the distribution by \(F\). Then we say that \(X_{1}\) is a random sample from \(F\).

    • Now, suppose instead of considering a random sample of size \(1\), we consider a random sample of size \(n\). For each of these samples, we expect the same underlying probability distribution (marginally). Thus \(X_{i}\) is identically distributed as the distribution \(F\), for each \(i=1,\ldots,n\).

    • Finally, unless otherwise mentioned, we assume the random samples are mutually independent.

    • Combining the above facts, by the term “\(X_{1},\cdots, X_{n}\) is a random sample from \(F\)”, we mean \(X_{i}\stackrel{i.i.d.}{\sim} F\).

  • Estimator: In parametric inference, usually a form of \(F\) is assumed. For example, one can consider \(F\) to be a normal distribution. Although the form is assumed, the exact distribution remains unspecified, as the parameters of the distribution remain unspecified. The main task of parametric inference is to estimate the parameters with the help of sample observations. For example, suppose \(F\) is assumed to be normal, but the parameters of the distribution \((\mu,\sigma^{2})\) are not specified. One can estimate the first parameter \(\mu\) by the sample mean, and the second parameter \(\sigma^{2}\) by the sample variance. In that case, we say that the sample mean, \(\bar{X}\), is an estimator of \(\mu\), and the sample variance, \(S^{2}\), is an estimator of \(\sigma^{2}\).

  • Sample space of \((X_{1}, \ldots, X_{n})\): The collection of all possible values of \(\{X_{1}, \ldots, X_{n}\}\) is called the sample space of \((X_{1}, \ldots, X_{n})\). As random variables are Borel measurable functions taking values in \(\mathbb{R}\) (or \(\mathbb{C}\)), the sample space of \((X_{1}, \ldots, X_{n})\) is a subset of \(\mathbb{R}^{n}\) (or \(\mathbb{C}^{n}\)).

  • Statistic: Let \(X_{1}, \ldots, X_{n}\) be a random sample of size \(n\) from a population, and \(T(\cdot)\) be a real (or vector) valued function whose domain includes the sample space of \((X_{1}, \ldots, X_{n})\). Then the random variable (or vector) \(Y= T(X_{1}, \cdots, X_{n})\) is called a statistic.

  • Sampling distribution of a statistic: The probability distribution of a statistic is called the sampling distribution of the statistic.

  • \(X\) is modeled as \(\{F_{\theta}: \theta \in \Theta\}\): As discussed above, in parametric inference the distribution of the samples is assumed. By the above-mentioned term we mean that the distribution of \(X\) is assumed to belong to the class \(\{F_{\theta}\}\), which is parametrized by \(\theta\), and \(\theta\) takes values in the set \(\Theta\).

Principles of Data Reduction

  • Suppose the class of distributions of the random sample \(\{X_{1}, \ldots, X_{n}\}\) is assumed to be \(\{F_{\theta}: \theta \in \Theta\}\), and the goal is to estimate the parameter \(\theta\).

  • Given a large sample of size \(n\), it is natural that the statistician would not store all the data; rather she would only extract the information which is relevant for \(\theta\). Usually the summarization is done through one or more summary statistics. Ideally, the statistic(s) should be chosen in such a way that no information regarding \(\theta\) is lost, and the dimension of the statistics is as small as possible.

  • Here we learn some basic principles of data reduction.

Lecture 2: The Sufficiency Principle

Definition: [Sufficient Statistic]

Let \({\bf X}\) be a random sample from the distribution \(\{F_{\theta}: \theta \in \Theta \}\). A statistic \(T({\bf X})\) is a sufficient statistic for \(\theta\), if the conditional distribution of the random sample \({\bf X}\) given the value of \(T({\bf X})\) does not depend on \(\theta\).

Explanation:

Let us understand sufficiency with an example.

  • Consider the problem of estimating \(\theta\) given a random sample \(X_{1},\ldots, X_{n}\) from \(\mathtt{Uniform}(0,\theta)\). Suppose a statistician observes a sample \(\{x_{1},\ldots,x_{n}\}\) such that \(\max\{x_{1},\ldots,x_{n}\}=t\). Now if she conveys only the value \(t\) of the largest order statistic to a second statistician, then will the second statistician have less information about \(\theta\) compared to the first?

  • The answer is “no”, because the highest order statistic, \(X_{(n)}\), is sufficient for \(\theta\). All the samples with \(X_{(n)}=t\) have the same amount of information about \(\theta\).

  • Formally, the conditional distribution of \(X_{1},\ldots, X_{n}\) given \(X_{(n)}=t\) is free of \(\theta\). So, once \(X_{(n)}\) is known, the distribution of the sample does not convey any further information about \(\theta\). All the samples of size \(n\) with \(X_{(n)}=t\) have the same information about \(\theta\).

Example 1: Let \(X_{1},\ldots, X_{n}\) be a random sample from \(\mathtt{Bernoulli}(\theta)\) distribution. Then the statistic \(T({\bf X})=\sum_{i=1}^{n} X_{i}\) is sufficient.
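
A sketch of the verification from the definition (the Factorization Theorem below gives a shorter route): for a realization \({\bf x}\) with \(\sum_{i=1}^{n} x_{i}=t\), \[ P_{\theta}({\bf X}={\bf x}\mid T({\bf X})=t) = \frac{P_{\theta}({\bf X}={\bf x})}{P_{\theta}(T({\bf X})=t)} = \frac{\theta^{t}(1-\theta)^{n-t}}{\binom{n}{t}\theta^{t}(1-\theta)^{n-t}} = \binom{n}{t}^{-1},\] which is free of \(\theta\). Hence \(T({\bf X})\) is sufficient by the definition above.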

Example 2: Let \(X_{1},\ldots, X_{n}\) be a random sample from \(\mathtt{Gamma}(2,\theta)\) distribution. Then the statistic \(T({\bf X})=\sum_{i=1}^{n} X_{i}\) is sufficient.

Theorem 1 (Neyman’s Factorization Theorem)

Let \(X_{1},\cdots, X_{n}\) denote a random sample from a discrete or absolutely continuous distribution that has a joint pmf or joint pdf \(f_{\bf X}(\cdot ;\boldsymbol{\theta}), ~ \boldsymbol{\theta} \in \boldsymbol{\Theta}\). The statistic \(T=T({\bf X})\) is sufficient for \(\boldsymbol{\theta}\) if and only if (iff) there exist functions \(g(t; \boldsymbol{\theta})\) and \(h({\bf x})\) such that, for all sample points \({\bf x}\) and all parameter points \(\boldsymbol{\theta}\), \[ f_{\bf X}({\bf x} ;\boldsymbol{\theta}) = g\left(T({\bf x}) ; \boldsymbol{\theta} \right) h({\bf x}).\]

[Proof]

Example 3: Let \(X_{1},\ldots, X_{n}\) be a random sample from \(\mathtt{Normal}(\mu,1)\) distribution. Then the statistic \(T({\bf X})=\sum_{i=1}^{n} X_{i}\) is sufficient.
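
A sketch of how Theorem 1 applies here: the joint pdf factorizes as \[ f_{\bf X}({\bf x};\mu) = (2\pi)^{-n/2}\exp\left\{-\frac{1}{2}\sum_{i=1}^{n}(x_{i}-\mu)^{2}\right\} = \underbrace{\exp\left\{\mu\sum_{i=1}^{n}x_{i}-\frac{n\mu^{2}}{2}\right\}}_{g(T({\bf x});\,\mu)} \; \underbrace{(2\pi)^{-n/2}\exp\left\{-\frac{1}{2}\sum_{i=1}^{n}x_{i}^{2}\right\}}_{h({\bf x})}, \] so \(T({\bf X})=\sum_{i=1}^{n}X_{i}\) is sufficient by the Factorization Theorem.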

Example 4: Let \(X_{1},\ldots, X_{n}\) be a random sample from \(\mathtt{Uniform}(0,\theta)\) distribution. Then the statistic \(T({\bf X})= X_{(n)}\) is sufficient.

Example 5(Exponential family): Let \(X_{1},\ldots, X_{n}\) be a random sample from a distribution with pmf or pdf \(f_{X}(\cdot ;\boldsymbol{\theta}), ~ \boldsymbol{\theta} \in \boldsymbol{\Theta}\) which belongs to an exponential family given by \[f_{X}(x ;\boldsymbol{\theta}) = h(x) c(\boldsymbol{\theta}) \exp \left\{ \sum_{i=1}^{k} w_{i} (\boldsymbol{\theta}) t_{i}(x)\right\} \] where \(\boldsymbol{\theta} = (\theta_{1}, \cdots, \theta_{d})\), \(d\leq k\). Then \[ T({\bf X}) = \left(\sum_{i=1}^{n} t_{1} (X_{i}), \cdots, \sum_{i=1}^{n} t_{k} (X_{i}) \right)\] is a sufficient statistic for \(\boldsymbol{\theta}\).
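
For instance, the \(\mathtt{Poisson}(\theta)\) pmf is of this form with \(k=1\): \[ f_{X}(x;\theta)=\frac{e^{-\theta}\theta^{x}}{x!} = \underbrace{\frac{1}{x!}}_{h(x)}\,\underbrace{e^{-\theta}}_{c(\theta)}\,\exp\left\{ x\log\theta \right\}, \] i.e., \(w_{1}(\theta)=\log\theta\) and \(t_{1}(x)=x\), so Example 5 gives \(T({\bf X})=\sum_{i=1}^{n}X_{i}\) as a sufficient statistic for \(\theta\).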

Note: Let \(X_1,\ldots, X_{n}\) be a random sample from some distribution with pmf or pdf \(f_{X}(\cdot,\boldsymbol{\theta})\). Then \({\bf T} ({\bf X})=\{X_{(1)}, \ldots, X_{(n)}\}\) is sufficient for \(\boldsymbol{\theta}\).

Lecture 3:

Minimal Sufficiency

  • For any family of distributions \(\{F_{\theta}: \theta \in \Theta\}\), a sufficient statistic for \(\theta\) is usually not unique. For example, in the \(\mathtt{Uniform}(0,\theta)\) class of distributions, both \(\{X_{(1)}, \ldots, X_{(n)}\}\) and \(X_{(n)}\) are sufficient for \(\theta\).

  • Naturally, one would like to find the simplest (or, most precise) statistic among the pool of sufficient statistics.

  • This leads to the concept of minimal sufficiency.

Definition: [Minimal Sufficient Statistic]

A sufficient statistic \(T({\bf X})\) is called minimal sufficient if, for any other sufficient statistic \(T^{\prime}({\bf X})\), \(T({\bf x})\) is a function of \(T^{\prime}({\bf x})\).

How do we know if a sufficient statistic is minimal sufficient?

Theorem 2

Let \(f({\bf x} ; \theta)\) be the pmf or pdf of a sample \({\bf X}\). Suppose there exists a function \(T({\bf x})\) such that, for every two points \({\bf x}\) and \({\bf y}\), the ratio \(f({\bf x}; \theta)/f({\bf y}; \theta)\) is constant as a function of \(\theta\) iff \(T({\bf x})=T({\bf y})\). Then \(T({\bf X})\) is a minimal sufficient statistic for \(\theta\).

[Proof]

Example 6: Let \(X_{1},\ldots, X_{n}\) be a random sample from \(\mathtt{Normal}(\mu,\sigma^{2})\) distribution. Then the statistic \((\bar{X},S^{2})\) is minimal sufficient, where \(S^{2}\) is the sample variance.
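
A sketch of the argument via Theorem 2: for two sample points \({\bf x}\) and \({\bf y}\), \[ \frac{f({\bf x};\mu,\sigma^{2})}{f({\bf y};\mu,\sigma^{2})} = \exp\left\{ -\frac{1}{2\sigma^{2}}\left(\sum_{i=1}^{n}x_{i}^{2}-\sum_{i=1}^{n}y_{i}^{2}\right) + \frac{\mu}{\sigma^{2}}\left(\sum_{i=1}^{n}x_{i}-\sum_{i=1}^{n}y_{i}\right) \right\}, \] which is constant in \((\mu,\sigma^{2})\) if and only if \(\sum_{i}x_{i}=\sum_{i}y_{i}\) and \(\sum_{i}x_{i}^{2}=\sum_{i}y_{i}^{2}\), i.e., if and only if the two samples have the same value of \((\bar{X},S^{2})\).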

Example 7: Let \(X_{1},\ldots, X_{n}\) be a random sample from \(\mathtt{Uniform}(\theta,\theta+1)\) distribution. Then the statistic \(T({\bf X})=(X_{(1)}, X_{(n)})\) is minimal sufficient.

Note: A minimal sufficient statistic is not unique. Any one-one function of a minimal sufficient statistic is also minimal sufficient.

Ancillary Statistics

An ancillary statistic serves a purpose complementary to that of a sufficient statistic. That is, while a sufficient statistic contains all the information about \(\theta\) that could be obtained from the sample \(\{X_{1}, \ldots, X_{n}\}\), an ancillary statistic contains no information about \(\theta\).

Definition: [Ancillary Statistic]

A statistic \(S({\bf X})\) whose distribution does not depend on the parameter \(\theta\) is called an ancillary statistic.

Example 8: Let \(X_{1},\ldots, X_{n}\) be a random sample from \(\mathtt{Normal}(\mu,1)\) distribution. Then the statistic \(S^{2}\) is ancillary for \(\mu\).

Example 9: Let \(X_{1},\ldots, X_{n}\) be a random sample from \(\mathtt{Uniform}(\theta,\theta+1)\) distribution. Then the statistic \(S({\bf X})=(X_{(n)}- X_{(1)})\) is ancillary for \(\theta\).

Some Special Families of Distributions

  1. Location family of distributions:

Let \(X_{1}, \ldots, X_{n}\) be a random sample, where \[X_{i}=\theta+ W_{i}, \quad i=1,\ldots,n,\] where \(W_{i}\) are i.i.d. from some distribution with cumulative distribution function (cdf) \(F\) (does not depend on \(\theta\)). This type of distribution of \(X\) is called location family of distributions, and \(\theta\) is called the location parameter.

Let \(S({\bf X})\) be a statistic such that \[S({\bf x}+d{\bf 1})=S(x_{1}+d,\ldots, x_{n}+d)=S({\bf x}),\] for all real \(d\). Then \(S({\bf X})\) is ancillary for \(\theta\), and is called location-invariant statistic.

Example: The range, the mean deviation about the mean, and the standard deviation are location-invariant statistics.

  2. Scale family of distributions:

Let \(X_{1}, \ldots, X_{n}\) be a random sample, where \[X_{i}=\theta W_{i}, \quad i=1,\ldots,n,\] where \(W_{i}\) are i.i.d. from some distribution with cumulative distribution function (cdf) \(F\) (does not depend on \(\theta\)). This type of distribution of \(X\) is called scale family of distributions, and \(\theta\) is called the scale parameter.

Let \(S({\bf X})\) be a statistic such that \[S(c{\bf x})=S(cx_{1},\ldots, cx_{n})=S({\bf x}),\] for all \(c>0\). Then \(S({\bf X})\) is ancillary for \(\theta\), and is called scale-invariant statistic.

Example: The statistics \(X_{1}^{2} /\sum_{i=1}^{n}X_{i}^{2}\) and \(\min_{i}{X_{i}}/\max_{i}{X_{i}}\) are scale-invariant statistics.

Lecture 4:

  • So far we came across the concepts of minimal sufficient and ancillary statistics. Intuitively, it seems interesting to know if these two statistics are unrelated (independent).

  • In fact, if two statistics \(S({\bf X})\) and \(T({\bf X})\) are independently distributed, and \(T({\bf X})\) is sufficient for \(\theta\), then \(S({\bf X})\) must be ancillary for \(\theta\). (Why?)

  • The converse, however, is not true in general. One special property which ensures independence of a sufficient statistic and ancillary statistics is completeness.

Definition: [Complete Family of Distributions]

Let \(T({\bf X})\) be a statistic with pdf or pmf \(f_{T}(\cdot; \theta)\). The family of distributions \(\{f_{T}(\cdot;\theta): \theta \in \Theta\}\) is called complete if \(E_{\theta}(g(T))=0\) for all \(\theta\in \Theta\) implies \(P_{\theta}(g(T)=0)=1\) for all \(\theta\in \Theta\).

Remarks:

  1. If the family of distributions of a statistic \(T({\bf X})\) is complete, then \(T({\bf X})\) is called a complete statistic.

  2. \(E_{\theta}(g(T))=0\) for all \(\theta\in \Theta\) does not in general imply that \(P_{\theta}(g(T)=0)=1\) for all \(\theta\in \Theta\). For example, let \(\{X_{1}, X_{2}\}\) be a random sample from \(N(\theta,1)\), \(T=X_{1}-X_{2}\) and \(g(T)=T\). Then for any \(\theta\), \(E_{\theta}(g(T))=0\). However, \(P(g(T)=0)=P(X_{1}=X_{2})=0\).

  3. Completeness is a property of the family of distribution of a statistic \(T({\bf X})\). For example, the \(N(\theta,1), \theta\in \mathbb{R}\) family is complete (without proof).

Example 10: The \(\mathtt{Poisson}(\theta), \theta>0\) family is complete.
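
A sketch of the argument: if \(T\sim \mathtt{Poisson}(\theta)\) and \(E_{\theta}\{g(T)\}=0\) for all \(\theta>0\), then \[ 0=E_{\theta}\{g(T)\} = \sum_{t=0}^{\infty} g(t)\,\frac{e^{-\theta}\theta^{t}}{t!} \quad \Longrightarrow \quad \sum_{t=0}^{\infty} \frac{g(t)}{t!}\,\theta^{t}=0 \quad \text{for all } \theta>0. \] A power series that vanishes on an open interval must have all coefficients equal to zero, so \(g(t)=0\) for every \(t\), i.e., \(P_{\theta}(g(T)=0)=1\) for all \(\theta\).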

Example 11: The \(\mathtt{Exponential}(\theta), \theta>0\) family is complete. (Without proof)

Example 12: The \(\mathtt{Binomial}(n,\theta), 0<\theta<1\) family is complete.

Example 13: Let \(\{X_{1}, \ldots, X_{n}\}\) be a random sample from \(\mathtt{Uniform}(0,\theta), ~\theta>0\). Then the family of distributions of \(X_{(n)}\) is complete.

Example 14 (Without proof): Let \(\{X_{1}, \ldots, X_{n}\}\) be a random sample from an exponential family of distributions with pmf or pdf of the form given in Example 5. Then the statistic \(T({\bf X})\) defined there is complete if \(\{(w_{1}(\boldsymbol{\theta}), \cdots, w_{k}(\boldsymbol{\theta})); \boldsymbol{\theta}\in \Theta \}\) contains an open set in \(\mathbb{R}^{k}\).

Note: The condition that \(\{(w_{1}(\boldsymbol{\theta}), \cdots, w_{k}(\boldsymbol{\theta})); \boldsymbol{\theta}\in \Theta \}\) contains an open set in \(\mathbb{R}^{k}\) is crucial. For example, it is easy to see that the family \(\mathtt{Normal}(\theta,\theta^{2})\), \(\theta\in\mathbb{R}\), is not complete.

Note: Example 12 implies that \(T({\bf X})=\sum_{i} X_{i}\) is a complete sufficient statistic for \(\theta\), when \(\{X_{1},\ldots, X_{n}\}\) is a random sample from \(\mathtt{Bernoulli}(\theta)\). Similarly, Example 13 implies that \(T({\bf X})=X_{(n)}\) is a complete sufficient statistic for \(\theta\), when \(\{X_{1},\ldots, X_{n}\}\) is a random sample from \(\mathtt{Uniform}(0,\theta)\).

Theorem 3 (Basu’s Theorem)

Let \(\{X_{1},\ldots, X_{n}\}\) be a sample from the family of distributions with pmf or pdf \(f_{\bf X}(\cdot; \theta)\), \(\theta\in \Theta\). If \(T({\bf X})\) is a complete-sufficient statistic for \(\theta\), then \(T({\bf X})\) is independent of every ancillary statistic.

[Proof]

Note: Basu’s theorem provides us a way to verify independence of two statistics without explicitly deriving the joint/conditional distributions. Let us see an example.

Example 15: Let \(\{X_{1}, \ldots, X_{n}\}\) be a random sample from \(\mathtt{Normal}(\mu,1), ~\mu\in\mathbb{R}\). Then the sample mean and sample variance, \(\bar{X}\) and \(S^{2}\), are independent.
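
A quick numerical illustration of Example 15 (a minimal simulation sketch, not a proof; it assumes numpy is available): across many replications, the empirical correlation between \(\bar{X}\) and \(S^{2}\) should be close to zero.

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps, mu = 10, 100_000, 2.0

# Draw `reps` independent samples of size n from Normal(mu, 1).
samples = rng.normal(loc=mu, scale=1.0, size=(reps, n))

xbar = samples.mean(axis=1)   # sample means
s2 = samples.var(axis=1)      # sample variances (divisor n)

# Basu's theorem gives exact independence, so the empirical correlation
# should be near 0, up to Monte Carlo error.
print(np.corrcoef(xbar, s2)[0, 1])
```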

We conclude this section with a theorem which will be proved later.

Theorem 4

A complete-sufficient statistic is minimal sufficient.

Lecture 5:

Desirable properties of point estimators

  • After reducing data, it would be of interest to find an estimator of the parameter of interest \(\theta\).

  • There could be many possible estimators of \(\theta\). For example, if \(\theta\) is the location parameter of the population, then both the sample mean, \(\bar{X}\), and the sample median, \(\tilde{X}_{me}\), are possible estimators. How do we choose the best one among these?

  • In order to make the task of evaluating the estimators easy, we list down some desirable properties that a good estimator should satisfy.

Definition: [Mean Squared Error]

The mean squared error (MSE) of an estimator \(T({\bf X})\) of the parameter \(\theta\) is defined as \(E_{\theta}(T({\bf X})- \theta)^{2}\).

Remarks:

  1. Here \(E_{\theta}(\cdot)\) indicates that the expectation is computed given that the data generating process (DGP) of the sample \(\{ X_{1}, \ldots, X_{n}\}\) involves \(\theta\). As the expectation is taken over \({\bf X}\), the MSE is a function of \(\theta\) only. For example, let \(\{ X_{1}, \ldots, X_{n}\}\) be a random sample from \(N(\mu,\sigma^{2})\); then the MSE of \(\bar{X}\) is \(E_{\mu,\sigma^{2}}(\bar{X} -\mu)^{2}= \mathrm{var}_{\mu,\sigma^{2}}(\bar{X})=\sigma^{2}/n\).

  2. Instead of the MSE one can use other (expected) measures of difference, for example, the mean absolute error (MAE) \(E_{\theta}|T({\bf X})- \theta|\). However, just as the standard deviation is usually preferred to the mean deviation, the former (MSE) is preferred for its algebraic amenability.

  3. The MSE can be decomposed into two parts: \[ \text{MSE}= E_{\theta}\left[ \{T({\bf X})-E_{\theta}(T({\bf X}))\}+\{E_{\theta}(T({\bf X}))- \theta\}\right]^{2}= \mathrm{var}_{\theta}\left\{T({\bf X})\right\}+\{E_{\theta}(T({\bf X}))- \theta\}^{2}=\text{variance}+\text{bias}^{2}.\] Intuitively, the MSE is the sum of two quantities, one measuring accuracy (bias\(^{2}\)), and the other measuring precision (variance).

  4. Naturally, one would like to use the point estimator having minimum MSE among the pool of all possible estimators. However, in general no single estimator has the smallest MSE uniformly over all \(\theta\) (for instance, the constant estimator \(T\equiv\theta_{0}\) has zero MSE at \(\theta=\theta_{0}\), and no nontrivial estimator can match this at every \(\theta_{0}\)). Therefore, one possible way is to focus on a (reasonable) sub-class of all possible estimators, and try to find the estimator with minimum MSE in that subclass. The subclass we focus on is the class of unbiased estimators.

Definition: [Unbiased estimator]

The expected difference between the estimator \(T({\bf X})\) and the parameter \(\theta\), i.e., \(E_{\theta}\{T({\bf X})\}-\theta\), is called the bias. An estimator whose bias is identically equal to zero is called an unbiased estimator.

Example 16: Let \(\{X_{1}, \ldots, X_{n}\}\) be a random sample from some population with finite mean \(\mu\). Then the sample mean \(\bar{X}\) is an unbiased estimator of \(\mu\). Further, if \(\{X_{1}, \ldots, X_{n}\}\) is a random sample from some population with finite variance \(\sigma^{2}\), then the estimator \(S^{\star2}=nS^{2}/(n-1)\) is an unbiased estimator of \(\sigma^{2}\).

Note: For an unbiased estimator \(T({\bf X})\), the MSE is equal to the variance of the estimator.

Example 17: Let \(\{X_{1}, \ldots, X_{n}\}\) be a random sample from \(\mathtt{Normal}(\mu,\sigma^{2}), ~\mu\in\mathbb{R},~\sigma^{2}>0\). Then \(\bar{X}\) and \(S^{\star2}\) are unbiased estimators of \(\mu\) and \(\sigma^{2}\), respectively. Further, the MSEs of \(\bar{X}\) and \(S^{\star2}\) (as estimators of \(\mu\) and \(\sigma^{2}\)) are \(\sigma^{2}/n\) and \(2\sigma^{4}/(n-1)\), respectively.

Remarks:

  1. A biased estimator may have lower MSE than an unbiased estimator. For instance, in Example 17, it is easy to see that the sample variance \(S^{2}\) has MSE \((2n-1)\sigma^{4}/n^{2}\), which is lower than the MSE of the unbiased estimator \(S^{\star2}\) (see the simulation sketch after this list).

Example 18: Let \(\{X_{1}, \ldots, X_{n}\}\) be a random sample from \(\mathtt{Bernoulli}(p), ~0\leq p\leq 1\). Then \(\bar{X}\) is an unbiased estimator of \(p\), and the MSE of \(\bar{X}\) is \(p(1-p)/n\).

  2. An unbiased estimator of \(\theta\) (or of a function of \(\theta\)) may not exist. For example, based on a single observation \(X_{1}\) from the \(\mathtt{Bernoulli}(p)\) family above, an unbiased estimator of \(p^{2}\) does not exist, since \(E_{p}\{g(X_{1})\}=g(0)+\{g(1)-g(0)\}p\) is linear in \(p\).

  3. As the MSE is equal to the variance for an unbiased estimator, the best estimator (in terms of MSE) in the class of unbiased estimators is the one with minimum variance.
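
As noted in Remark 1 above, the comparison of \(S^{2}\) and \(S^{\star2}\) can also be checked numerically. Below is a minimal simulation sketch (assuming numpy) that compares Monte Carlo estimates of the two MSEs in the setting of Example 17.

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps, mu, sigma2 = 5, 200_000, 0.0, 4.0

x = rng.normal(mu, np.sqrt(sigma2), size=(reps, n))
s2 = x.var(axis=1, ddof=0)        # S^2   (divisor n, biased)
s2_star = x.var(axis=1, ddof=1)   # S*^2  (divisor n-1, unbiased)

mse = lambda est: np.mean((est - sigma2) ** 2)
print("MSE(S^2)  ~", mse(s2))        # theory: (2n-1) * sigma^4 / n^2
print("MSE(S*^2) ~", mse(s2_star))   # theory: 2 * sigma^4 / (n-1)
```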

Definition: [Uniformly minimum variance unbiased estimator]

An estimator of \(\theta\), \(T({\bf X})\), is called a uniformly minimum variance unbiased estimator (UMVUE) if

  1. \(E_{\theta}\{ T({\bf X})\}=\theta\) for all \(\theta\in \Theta\), and

  2. for any other estimator \(T^{\prime}({\bf X})\) with \(E_{\theta}\{ T^{\prime}({\bf X})\}=\theta\), \(\mathrm{var}_{\theta}\{T({\bf X})\}\leq \mathrm{var}_{\theta}\{T^{\prime}({\bf X})\}\) for all \(\theta\in \Theta\).

Remarks:

  1. In the above definition, it is important that (i) and (ii) hold for all \(\theta\in \Theta\). If (i) and (ii) hold only at a particular choice of \(\theta\), say \(\theta_{0}\), then the corresponding estimator \(T({\bf X})\) is called a locally minimum variance unbiased estimator (LMVUE).

Theorem 4 (Properties of UMVUE: 1)

Let \(T({\bf X})\) be the UMVUE of \(\theta\). Then \(T({\bf X})\) is unique.

[Proof of Theorem 4]

Note: As the class of unbiased estimators is quite large (often uncountable), finding an UMVUE is not an easy task. Next, we will see how the sufficiency principle helps in finding an UMVUE of \(\theta\).

Theorem 5 (Rao-Blackwell)

Let \(T_{1}({\bf X})\) be an unbiased estimator of \(\theta\) and \(T({\bf X})\) be a sufficient statistic for \(\theta\). Then the conditional expectation \(\phi(t)= E(T_{1}({\bf X})\mid T({\bf X})=t)\) defines a statistic \(\phi(T)\). This statistic \(\phi(T)\) is

  1. a function of the sufficient statistic \(T({\bf X})\) for \(\theta\),

  2. is an unbiased estimator of \(\theta\), and

  3. satisfies \(\mathrm{var}_{\theta}(\phi(T)) \leq \mathrm{var}_{\theta}(T_{1}({\bf X}))\) for all \(\theta \in \Theta\).

[Proof of Theorem 5]

  • Proof of Theorem 5 requires the following result:

Result

Under the existence and finiteness of all the relevant expectations, the following properties of conditional expectation and variance are satisfied:

  1. \[ E\left[ E\left\{ h(Y)\mid Z \right\}\right]=E\left\{ h(Y)\right\},\]

  2. \[ \text{var}\left[ E\left\{ h(Y)\mid Z \right\}\right] + E\left[ \text{var} \left\{ h(Y)\mid Z \right\}\right] = \text{var} \left\{ h(Y)\right\}.\]

Example 19: Let \(X_{1},\ldots, X_{n}\) be a random sample of size \(n\) from \(\mathtt{Bernoulli}(\theta)\). Then \(X_{1}\) is an unbiased estimator of \(\theta\). However, a better estimator can be constructed by considering \(E(X_{1}\mid \sum_{i=1}^{n} X_{i})\), as \(\sum_{i=1}^{n} X_{i}\) is a sufficient statistic.
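
A sketch of the resulting estimator: since \(X_{1}\) takes only the values \(0\) and \(1\), \[ E\Big(X_{1}\,\Big|\,\sum_{i=1}^{n}X_{i}=t\Big) = P\Big(X_{1}=1\,\Big|\,\sum_{i=1}^{n}X_{i}=t\Big) = \frac{\theta\,\binom{n-1}{t-1}\theta^{t-1}(1-\theta)^{n-t}}{\binom{n}{t}\theta^{t}(1-\theta)^{n-t}} = \frac{t}{n}, \] so the improved (Rao-Blackwellized) estimator is \(\bar{X}\).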

Remarks

  1. If \(T_{1}({\bf X})\) is solely a function of the sufficient statistic \(T({\bf X})\), then \(\mathrm{var}_{\theta}(\phi(T)) = \mathrm{var}_{\theta}(T_{1}({\bf X}))\) for all \(\theta \in \Theta\).

  2. For any statistic \(T({\bf X})\) and an unbiased estimator \(T_{1}({\bf X})\), \(E\left\{E(T_{1}({\bf X})\mid T({\bf X})=t)\right\} =\theta\), and conclusion (iii) of Theorem 5 (Rao-Blackwell) holds. However, if \(T({\bf X})\) is not sufficient, then \(E(T_{1}({\bf X})\mid T({\bf X})=t)\) may not be a statistic. For instance, consider the function \(E( n^{-1}\sum_{i=1}^{n} X_{i} \mid X_{1})\) in the above example. Observe that \[E\left( \frac{1}{n}\sum_{i=1}^{n} X_{i} \mid X_{1}=x_{1} \right) =\frac{x_{1}}{n} + \frac{1}{n} \sum_{i=2}^{n} E(X_{i}) =\frac{x_{1}+(n-1)\theta}{n}, \] which is not an estimator.

  3. The Rao-Blackwell Theorem indicates that the UMVUE must be a function of a sufficient statistic. If not, then a better estimator can be obtained by considering the conditional expectation given a sufficient statistic.

Note: The Rao-Blackwell Theorem provides us a way to improve on an existing estimator. Based on the theorem, we can make a general recommendation of selecting an appropriate (unbiased) function of a sufficient statistic as an estimator of \(\theta\). However, the class of sufficient statistics is also uncountable (as any one-one function of a sufficient statistic is also sufficient). Thus, this theorem does not directly indicate a choice of UMVUE.

Role of unbiased estimator of zero:

An unbiased estimator of zero, say \(S({\bf X})\), is an estimator satisfying \(E_{\theta}\left\{S({\bf X}) \right\}=0\) for all \(\theta\in\Theta\). One can interpret \(S({\bf X})\) as a random noise. For any unbiased estimator \(T({\bf X})\) of \(\theta\), an (uncountable) class of unbiased estimators of \(\theta\) can be obtained by adding a multiple of \(S({\bf X})\) to \(T({\bf X})\), as \(T({\bf X})+\alpha S({\bf X})\), \(\alpha\in \mathbb{R}\).

Ideally, the best unbiased estimator should be uncorrelated with \(S({\bf X})\), as no part of the best unbiased estimator should be explained by a random noise. The following theorem formally states this result.

Theorem 6 (Properties of UMVUE: 2)

\(T({\bf X})\) is the UMVUE of \(\theta\) if and only if \(T({\bf X})\) is uncorrelated with all unbiased estimators of zero.

[Proof of Theorem 6]

Remarks

  1. The above characterization of UMVUE is of limited application, as the class of unbiased estimators of zero is also very large. However, the above theorem is useful in proving that an unbiased estimator of \(\theta\) is not an UMVUE.

Example 20: Let \(X_{1},\ldots, X_{n}\) be a random sample from \(\mathtt{Poisson}(\theta)\) distribution. The estimators \(T_{1}({\bf X})=\bar{X}\), \(T_{2}({\bf X})=S^{\star2}\) and \(T_{3}({\bf X})=X_{1}\) are unbiased for \(\theta\). However, we can discard \(T_{3}\) as a possible candidate for the UMVUE, as it is not a function of the minimal sufficient statistic \(\sum_{i=1}^{n}X_{i}\). Now, to check whether \(T_{2}\) can be the UMVUE, consider the covariance of \(T_{2}\) and the unbiased estimator of zero \(T_{1}-T_{3}\). Verify that the covariance is non-zero. Hence \(T_{2}\) cannot be an UMVUE of \(\theta\).

  2. It is not, in general, easy to characterize the class of unbiased estimators of zero; consequently, verifying whether \(\text{cov}(T({\bf X}),U({\bf X}))=0\) for every unbiased estimator \(U({\bf X})\) of zero is difficult. However, if the family of distributions of \(T({\bf X})\) is complete, then the only unbiased estimator of zero based on \(T\) is zero itself (with probability one). Thus the covariance between an unbiased estimator based on \(T\) and any such unbiased estimator of zero must be zero. The Lehmann-Scheffé theorem formalizes this idea.

Theorem 7 (Lehmann-Scheffé)

If \(T\) is a complete-sufficient statistic, and there exists an unbiased estimator \(T_{1}({\bf X})\) of \(\theta\), then the UMVUE of \(\theta\) is given by \(\phi(T)=E(T_{1}\mid T)\).

[Proof of Theorem 7]

Remark

  1. The above theorem says that if one can obtain an unbiased estimator of \(\theta\) that is a function of a complete-sufficient statistic, then that must be the UMVUE of \(\theta\). For instance, in Example 20, \(T_{1}\) is the UMVUE of \(\theta\).

Methods of Finding UMVUE:

  1. Rao-Blackwellization: If an unbiased estimator of \(\theta\) is available, and the distribution of the complete sufficient statistic is known, then one may find the UMVUE by taking the conditional expectation of the unbiased estimator given the complete sufficient statistic.

Example 21: Let \(X_{1},\ldots, X_{n}\) be a random sample from \(\mathtt{Poisson}(\lambda)\) distribution. Consider the problem of estimating \(P(X=1)=\theta = \lambda \exp\{-\lambda\}\). Observe that \(Y_{1}=I_{\{1\}}(X_{1})\) is an unbiased estimator of \(\theta\), and \(T({\bf X}) = \sum_{i=1}^{n}X_{i}\) is a complete sufficient statistic for \(\lambda\), and hence for \(\theta\). The conditional distribution of \(Y_{1}\) given \(T=t\) is a two-point distribution with probability mass \(\phi(t)=t(n-1)^{t-1}/n^{t}\) at \(Y_{1}=1\) and \(1-\phi(t)\) at \(Y_{1}=0\), so the conditional expectation is \(\phi(t)\). Thus the statistic \(\phi(T)=T(1-1/n)^{T} /(n-1)\) is the UMVUE for \(\theta\).

  2. Method of Solving: Let \(T({\bf X})\) be a complete sufficient statistic. As it is known that any function of \(T\) which is an unbiased estimator of \(\theta\) is the UMVUE of \(\theta\), one can obtain the UMVUE directly by solving for the function \(g(T)\) such that \(E_{\theta}(g(T))=\theta\).

Example 21 (continued): In the same problem of estimating \(P(X=1)=\theta = \lambda \exp\{-\lambda\}\), we can apply the direct solving method as follows: \[ E_{\lambda} \left\{ g(T)\right\} = \theta \quad \Leftrightarrow \quad g(0)+ g(1)n\lambda +\cdots + g(t) \frac{n^{t}\lambda^{t}}{t!} +\cdots = \lambda + (n-1)\lambda^{2} + \cdots + \frac{(n-1)^{t-1}\lambda^{t}}{(t-1)!} + \cdots , \] and equating the coefficients of \(\lambda^{t}\) gives \(g(t)=t(n-1)^{t-1}/n^{t}\), which is the same choice of UMVUE as before.
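
A quick numerical sanity check of Example 21 (a simulation sketch, assuming numpy): the estimator \(\phi(T)=T(1-1/n)^{T}/(n-1)\) should be unbiased for \(\theta=\lambda e^{-\lambda}\), up to Monte Carlo error.

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps, lam = 8, 200_000, 1.3

# T = sum of n i.i.d. Poisson(lam) observations, so T ~ Poisson(n * lam).
T = rng.poisson(n * lam, size=reps)

phi = T * (1 - 1 / n) ** T / (n - 1)   # UMVUE from Example 21

print("mean of phi(T) ~", phi.mean())
print("target theta   =", lam * np.exp(-lam))
```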

Remarks

  1. From the Lehmann-Scheffé theorem, it is intuitive that if \(T\) is a complete sufficient statistic, and \(T^{\prime}\) is any other sufficient statistic, then \(\phi(T^{\prime})\) (\(\phi\) as described in the Rao-Blackwell theorem) must be a function of \(\phi(T)\), which in turn implies that \(T\) must be minimal sufficient. We conclude this section with the proof of Theorem 4, stated in Lecture 4.

Theorem 4

A complete-sufficient statistic is minimal sufficient.

[Proof of Theorem 4]

Lecture 6: Information Inequalities:

  • In many situations it is not possible to obtain a complete sufficient statistic, or to verify whether a reasonable estimator is based on a complete sufficient statistic. Another approach towards finding the UMVUE is to set a tight (achievable) lower bound on the variances of all unbiased estimators. If the variance of an unbiased estimator achieves the lower bound, then it must be the UMVUE.

  • The Cramér-Rao lower bound serves this purpose.

Theorem 8 (Cramér-Rao Lower Bound, CRLB)

Let \(X_{1},\ldots, X_{n}\) be a sample with pdf \(f_{\bf X}({\bf x};\theta)\), \(\theta\in \Theta\), satisfying the following regularity conditions:

  1. \(\Theta\) is an open interval in \(\mathbb{R}\), and the support \(S_{\bf X}\) does not depend on \(\theta\).

  2. For each \({\bf x}\in S_{\bf X}\) and \(\theta\in \Theta\), the derivative \(\partial \log f_{\bf X}({\bf x}; \theta)/\partial \theta\) exists and is finite.

  3. For any statistic \(S({\bf X})\) with \(E(|S({\bf X})|)<\infty\) for all \(\theta\), we have \[\displaystyle\frac{\partial}{\partial\theta} E_{\theta}\left\{S({\bf X})\right\} = \int S({\bf x}) \frac{\partial}{\partial\theta} f_{\bf X} ({\bf x} ; \theta) d{\bf x}.\]

Let \(T({\bf X})\) be any estimator satisfying \(\text{var}_{\theta}\left\{T({\bf X})\right\}<\infty\).

Define \(E_{\theta}\left\{ T({\bf X})\right\}=\psi(\theta)\), \(\psi^{\prime}(\theta)= \displaystyle\frac{\partial}{\partial\theta} \psi(\theta)\), and \(I(\theta) = E_{\theta} \left[ \displaystyle\frac{\partial}{\partial\theta} \log f_{\bf X} ({\bf X} ; \theta) \right]^{2}\).

Then \(T({\bf X})\) satisfies

\[ \text{var}_{\theta}\left\{T({\bf X})\right\} \geq \frac{[\psi^{\prime}(\theta)]^{2}}{I(\theta)}. \tag{*}\]

[Proof of Theorem 8]

Remarks

  1. From the proof it follows that equality holds in (*) if \(T({\bf X})-\psi(\theta)= k(\theta) \frac{\partial}{\partial \theta} \log f_{\bf X}({\bf X};\theta)\) with probability one. Integrating both sides with respect to \(\theta\), we observe that equality holds for one-parameter exponential family with \(T({\bf X})\) being a sufficient statistic.

  2. When \(\psi(\theta)=\theta\), then the CRLB reduces to \(\text{var}_{\theta}\left\{T({\bf X})\right\} \geq [I(\theta)]^{-1}\).

  3. When \(X_{1}, \dots, X_{n}\) are i.i.d., then \(I(\theta)=n I_{1}(\theta)\), where \(I_{1}(\theta) = E_{\theta} \left[ \displaystyle\frac{\partial}{\partial\theta} \log f_{X_{1}} ( x ; \theta) \right]^{2}\).

  4. The quantity \(I(\theta)\) is called the Fisher information of the sample \(X_{1}, \ldots, X_{n}\). As \(n\) increases, \(I(\theta)\) increases and consequently the lower bound on the variance of an unbiased estimator decreases; thus the estimator can become more concentrated around \(\theta\), i.e., a larger sample carries more information about \(\theta\).

  5. If \(f_{\bf X}({\bf x} ; \theta)\) satisfies \(\displaystyle\frac{\partial}{\partial \theta} E_{\theta} \left[ \frac{\partial}{\partial \theta} \log f_{\bf X} ({\bf X} ; \theta)\right] = \int \frac{\partial}{\partial \theta} \left[ \left\{ \frac{\partial}{\partial \theta} \log f_{\bf X} ({\bf x}; \theta) \right\} f_{\bf X} ({\bf x} ; \theta) \right] dx\), then \[ E_{\theta} \left[ \left\{ \frac{\partial}{\partial \theta} \log f_{\bf X} ({\bf X} ; \theta ) \right\}^{2} \right] = - E_{\theta} \left[ \frac{\partial^{2}}{\partial \theta^{2}}\log f_{\bf X} ({\bf X} ; \theta ) \right] \]

  6. As stated before, the CRLB provides another way to verify whether an unbiased estimator of \(\theta\) is the UMVUE or not. We can simply compute the variance, and check whether it matches the CRLB. However, one must be cautious about applying this method. The regularity conditions (i)-(iii) must be satisfied by the underlying class of distributions and the estimators under consideration. For example, the \(\mathtt{Uniform}(0,\theta)\) and the location-exponential families do not satisfy condition (i), as their supports depend on \(\theta\).

  7. Even when the family of distributions satisfies the regularity conditions, CRLB may not be achieved by the UMVUE of \(\theta\). The following is one such example.

Example 22: Let \(X_{1}, \ldots, X_{n}\) be a random sample from \(\mathtt{Normal}(\mu,\sigma^{2})\). It is not difficult to see that the CRLB for unbiased estimators of \(\sigma^{2}\) is \(2\sigma^{4}/n\), while the variance of the UMVUE \(S^{\star 2}\) (being an unbiased estimator based on a complete sufficient statistic) is \(2\sigma^{4}/(n-1)\), which is strictly larger.

Example 23: Let \(X_{1}, \ldots, X_{n}\) be a random sample from a location family of distributions, i.e., \(X_{i}=\theta +W_{i}\), \(i=1,\ldots,n\), where \(W_{i}\)s are i.i.d. from a distribution with p.d.f. \(f_{W}\), free of \(\theta\). Then \(I(\theta)=n\,E\left[ \{ f^{\prime}_{W}(W)/f_{W}(W) \}^{2} \right]\) is free of \(\theta\).

We conclude this section with two definitions. Both definitions assume that the underlying family of distributions satisfies the regularity conditions stated in Theorem 8.

Definition 1 (Efficiency of an Estimator): The ratio of the CRLB to the actual variance of an unbiased estimator \(T({\bf X})\) is called the efficiency of \(T\).

Definition 2 (Efficient Estimator): An unbiased estimator \(T({\bf X})\) of \(\theta\) is said to be efficient if the variance of \(T\) attains the CRLB.
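
A worked illustration for the \(\mathtt{Bernoulli}(p)\) model: \[ I_{1}(p)=E_{p}\left[\frac{\partial}{\partial p}\log\left\{p^{X}(1-p)^{1-X}\right\}\right]^{2} = E_{p}\left[\frac{X-p}{p(1-p)}\right]^{2}=\frac{1}{p(1-p)}, \] so the CRLB for unbiased estimators of \(p\) is \(p(1-p)/n\). Since \(\mathrm{var}_{p}(\bar{X})=p(1-p)/n\), the estimator \(\bar{X}\) attains the bound; its efficiency is \(1\) and it is efficient.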

Lecture 7: Methods of Estimation.

  • So far, we have talked about properties of good estimators. Next, we will discuss some common methods of finding estimators of \(\theta\).

  • The two most popular frequentist methods of finding estimators are (A) the method of moments (MoM), and (B) maximum likelihood (ML). We will now discuss these methods in detail.

(A) Method of Moments (MoM)

  • One of the simplest and oldest methods of finding estimators is the method of moments or substitution principle.

  • Let \(X_{1}, \ldots, X_{n}\) be a random sample from a population with distribution function \(\{F_{\boldsymbol{\theta}}; \boldsymbol{\theta}\in \Theta\}\). Method of moments estimators are found by equating the first \(k\) sample moments to the corresponding \(k\) population moments, and solving the resulting system of equations.

  • Let \(\boldsymbol{\theta}\) be a \(k\)-dimensional parameter. Then usually we require a system of \(k\) equations to get an estimate of \(\boldsymbol{\theta}\). Let \(M_{r}^{\prime}\) be the \(r\)-th sample raw moment, and \(\mu_{r}^{\prime}=E\left( X_{1}^{r} \right)\) be the \(r\)-th raw moment of \(F_{\boldsymbol{\theta}}\), \(r=1,\ldots, k\). Of course, for each \(r\), \(\mu_{r}^{\prime}\) will be a function of \(\boldsymbol{\theta}\). Then we obtain estimates of \(\boldsymbol{\theta}\) by solving the following equations: \[ M_{r}^{\prime} = \mu_{r}^{\prime},\qquad r=1,\ldots,k. \]

Example 24: Let \(X_{1}, \ldots, X_{n}\) be a random sample from \(\mathtt{Normal}(\mu,\sigma^{2})\). The MoM estimators of \(\mu\) and \(\sigma^{2}\) are \(\bar{X}\) and \(S^{2}\), respectively.

Remarks

  1. From the Weak (Strong) Law of Large Numbers, \(M_{r}^{\prime} \xrightarrow{P ~(a.s)} \mu_{r}^{\prime}\). Therefore, if one is interested in the population moments, then MoM provides consistent (strongly consistent) estimators.

  2. However, MoM may lead to estimators having sub-optimal sampling properties, and may in some cases lead to absurd estimators.

Example 25: Let \(X_{1}, \ldots, X_{n}\) be a random sample from \(\mathtt{Uniform}(\alpha,\beta)\). The MoM estimators of \(\alpha\) and \(\beta\) are \(T_{1}({\bf X})\) and \(T_{2}({\bf X})\), respectively, where \[ T_{1}({\bf X}) =\bar{X}- \sqrt{\frac{3\sum_{i=1}^{n} (X_{i}-\bar{X})^{2}}{n}}, \quad \text{and} \quad T_{2}({\bf X}) =\bar{X} + \sqrt{\frac{3\sum_{i=1}^{n} (X_{i}-\bar{X})^{2}}{n}}.\] Observe that neither of the estimators is a function of the minimal sufficient statistic \((X_{(1)}, X_{(n)})\).
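
A minimal numerical sketch (assuming numpy) of the MoM estimators above, using a small hypothetical data set chosen to illustrate the point: the MoM estimate of \(\beta\) can fall below the largest observation, i.e., an observation may lie outside the estimated support.

```python
import numpy as np

# Hypothetical data chosen to make the point visible.
x = np.array([0.0, 0.0, 0.0, 0.0, 10.0])

xbar = x.mean()
s2 = x.var(ddof=0)              # divisor-n sample variance
half_width = np.sqrt(3 * s2)

alpha_hat = xbar - half_width   # MoM estimate of alpha
beta_hat = xbar + half_width    # MoM estimate of beta

print("MoM estimates :", alpha_hat, beta_hat)
print("order stats   :", x.min(), x.max())
# Here beta_hat < X_(n), so the fitted Uniform(alpha_hat, beta_hat)
# assigns zero density to an observed data point.
```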

(B) Maximum Likelihood (ML) Estimation

Definition 1 (Likelihood Function): Let \({\bf X}=(X_{1}, \ldots, X_{n})^{\prime}\) be a random sample from a population with distribution function \(\{F_{\boldsymbol{\theta}}; \boldsymbol{\theta}\in \Theta\}\). Suppose the distribution \(\{F_{\boldsymbol{\theta}}; \boldsymbol{\theta}\in \Theta\}\) possesses a pdf (or, pmf) \(f_{\bf X}(\cdot ; \theta)\). Further, suppose \({\bf x}\) is a realization of \({\bf X}\). Then the function of \(\theta\) defined as \(L(\theta \mid {\bf x}) = f({\bf x} ; \theta)\) is called the likelihood function.

Definition 2 (Maximum Likelihood Estimate, MLE): Given a realization \({\bf x}\), let \(\widehat{\theta}\) be the value in \(\Theta\) that maximizes the likelihood function \(L(\theta \mid {\bf x})\) with respect to \(\theta\). Then \(\widehat{\theta}\) is called the MLE of the parameter \(\theta\).

Note that, the maximizer \(\widehat{\theta}\) is nothing but a function of the realization \({\bf x}\). Thus we can treat the maximizer of the likelihood function as a statistic or estimator of \(\theta\). This estimator is called Maximum Likelihood Estimator (MLE). Notationally we write \(\widehat{\theta}=\widehat{\theta}_{ML}({\bf X})\).

Example 26: Suppose there are \(n\) tosses of a coin, and we do not know the value of \(n\), or the probability of head (\(p\)). However, we know that \(n\) is between \(3\) and \(5\), and that one of the sides of the coin is twice as heavy as the other (i.e., either \(p=2(1-p)\) or \((1-p)=2p\)). Then what is the MLE of \(\theta=(n,p)\)?

Remarks

  1. If the likelihood function is differentiable with respect to \(\theta\), then one may take the differentiation approach for finding the MLE. In the case of a multi-parameter \(\boldsymbol{\theta}\), if the function is twice continuously differentiable with respect to each \(\theta_{j}\), then a critical point of \(L(\boldsymbol{\theta}\mid {\bf x})\) can be obtained by solving \(\displaystyle\frac{\partial }{\partial \boldsymbol{\theta}}L(\boldsymbol{\theta}\mid {\bf x})={\bf 0}\). Then, to verify that the critical point is a maximizer, one can check whether the Hessian matrix \(\displaystyle\frac{\partial^{2} }{\partial \boldsymbol{\theta}\partial \boldsymbol{\theta}^{\prime}} L(\boldsymbol{\theta}\mid {\bf x})\) is negative definite at the critical point.

  2. Often it is convenient to work with the log likelihood function instead of the likelihood function. As the logarithm is a monotone function, the maximizers of the likelihood and the log likelihood are the same. The log likelihood is generally denoted by \(l(\boldsymbol{\theta}; {\bf x})\).
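
As an illustration of the two remarks above, here is a minimal sketch (assuming numpy and scipy are available) that maximizes a normal log likelihood numerically; the output should agree, up to numerical error, with the closed-form MLEs \(\bar{X}\) and \(S^{2}\) derived in the next example.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(3)
x = rng.normal(loc=2.0, scale=1.5, size=200)       # simulated data

def neg_log_lik(par):
    mu, log_sigma = par                            # parametrize sigma on the log scale
    return -np.sum(norm.logpdf(x, loc=mu, scale=np.exp(log_sigma)))

res = minimize(neg_log_lik, x0=np.array([0.0, 0.0]))
mu_hat, sigma2_hat = res.x[0], np.exp(2 * res.x[1])

print("numerical MLE :", mu_hat, sigma2_hat)
print("closed form   :", x.mean(), x.var(ddof=0))  # xbar and S^2
```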

Example 27: Let \(X_{1}, \ldots, X_{n}\) be a random sample from \(\mathtt{Normal}(\mu,\sigma^{2})\). Then the MLEs of \(\mu\) and \(\sigma^{2}\) are \(\bar{X}\) and \(S^{2}\), respectively.

Example 28: Let \(X_{1}, \ldots, X_{n}\) be a random sample from \(\mathtt{Uniform}(\alpha,\beta)\). Then the MLEs of \(\alpha\) and \(\beta\) are \(X_{(1)}\) and \(X_{(n)}\), respectively.

Remarks

  1. In the above two examples, we have seen that the MLE is a function of a sufficient statistic. This is true in general.

Theorem 9 (Properties of MLE: 1)

Let \(X_{1}, \ldots, X_{n}\) be a random sample from some distribution with pdf (or pmf) \(f_{\theta}; \theta \in \Theta\), and let \(T({\bf X})\) be a sufficient statistic for \(\theta\). Then an MLE, if it exists and is unique, is a function of \(T\). If an MLE exists but is not unique, then one can find an MLE which is a function of \(T\).

[Proof of Theorem 9]

Remark

  1. A maximum likelihood estimate may not exist. See the following example.

Example 29: Let \(X_{1},X_{2}\) be a random sample from \(\mathtt{Bernoulli}(\theta)\), \(\theta\in (0,1)\). Suppose the realization \((0,0)\) is observed. Then the MLE does not exist.

Remark

  1. Even if an MLE exists, it may not be unique. See the following example.

Example 30: Let \(X_{1},\ldots, X_{n}\) be a random sample from \(\mathtt{Double~Exponential}(\theta,\sigma)\) distribution, \(\theta\in \mathbb{R}\). Then \(\widehat{\theta}_{ML}\) is the median of \(X_{1},\ldots, X_{n}\), which (when \(n\) is even) is not unique.

  • In spite of all the above drawbacks, MLE is by far the most popular and reasonable frequentist method of estimation. The reason is that MLE possesses a list of desirable properties. We will discuss some of them below.

Theorem 10 (Properties of MLE: 2)

Suppose the regularity conditions of the CRLB (see Theorem 8) are satisfied and the log-likelihood is twice differentiable. If there exists an unbiased estimator \(\widehat{\theta}\) of \(\theta\) whose variance attains the CRLB, then \(\widehat{\theta}=\widehat{\theta}_{ML}({\bf X})\).

[Proof of Theorem 10]

Corollary:

Theorem 10 implies that if the CRLB is attained by an unbiased estimator, then that estimator must be an MLE. However, the converse is not true, i.e., the variance of an MLE may not attain the CRLB.

Theorem 11 (Properties of MLE: 3, Invariance Property)

Let \(\{f_{\bf X}(\cdot;{\theta}):\theta\in \Theta\}\) be a family of PDFs (PMFs), and let \(L(\theta\mid {\bf x})\) be the likelihood function. Suppose \(\Theta\subseteq \mathbb{R}^{k}, k\geq 1\). Let \(h:\Theta \to \Lambda\) be a mapping of \(\Theta\) onto \(\Lambda\), where \(\Lambda\) is a subset of \(\mathbb{R}^{p} ~(1\leq p\leq k)\). If \(\widehat{\theta}_{ML}({\bf X})\) is an MLE of \(\theta\), then \(h\left(\widehat{\theta}_{ML}({\bf X})\right)\) is an MLE of \(h(\theta)\).

[Proof of Theorem 11]

Example 31: Let \(X_{1},\ldots, X_{n}\) be a random sample from \(\mathtt{Gamma}(1,\theta)\) distribution, \(\theta>0\). Find an MLE of \(\theta\).

Example 32: Let \(X_{1},\ldots, X_{n}\) be a random sample from \(\mathtt{Poisson}(\theta)\) distribution, \(\theta>0\). Find an MLE of \(P(X=0)=\exp\{-\theta\}\).

Theorem 12 (Properties of MLE: 4, Large Sample Property)

Under some regularity conditions, the following holds: \[ \sigma_{\theta}^{-1} \sqrt{n} \left\{ \widehat{\theta}_{ML}({\bf X}) -\theta \right\} \xrightarrow{D} \mathrm{N}(0,1), \quad \text{where} \quad \sigma_{\theta}^{-2} = E_{\theta} \left[ \frac{\partial \log f_{X_{1}} (X_{1}; \theta)}{\partial \theta}\right]^{2}. \]

[Without Proof]

Corollary:

Theorem 12 implies that under appropriate regularity conditions, the MLE is a consistent estimator of \(\theta\), i.e., \(\widehat{\theta}_{ML}({\bf X})\xrightarrow{p} \theta\). Further, it can be shown that the MLE is an asymptotically efficient estimator of \(\theta\), i.e., the asymptotic variance of the MLE attains the CRLB as \(n\rightarrow \infty\).
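
A minimal simulation sketch (assuming numpy) illustrating Theorem 12 for the \(\mathtt{Poisson}(\theta)\) model, where \(\widehat{\theta}_{ML}=\bar{X}\) and \(\sigma_{\theta}^{-2}=I_{1}(\theta)=1/\theta\): the standardized MLE should have mean close to \(0\) and standard deviation close to \(1\).

```python
import numpy as np

rng = np.random.default_rng(4)
theta, n, reps = 2.5, 200, 50_000

# MLE of theta in the Poisson model is the sample mean.
theta_hat = rng.poisson(theta, size=(reps, n)).mean(axis=1)

# Standardize: sqrt(n * I_1(theta)) * (theta_hat - theta), with I_1(theta) = 1/theta.
z = np.sqrt(n / theta) * (theta_hat - theta)

print("mean ~", z.mean(), "  sd ~", z.std())   # should be close to 0 and 1
```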