9.4 - Fisher’s information number and the Cramér-Rao lower bound

Introduction

One of the reasons Fisher was so interested in likelihoods was the idea that they could provide a way to quantify the amount of information about a parameter \(\theta\) available in the sample.
The amount of information available can be measured by the “sharpness” of the log-likelihood function

Visual motivation

Consider the two concave log-likelihoods below:

The steepness of the log-likelihood at any \(\theta\) is determined by the score function:

\[\dot \ell(\theta) = \frac{d}{d\theta} \ln L(\theta)\]

Note that for concave log-likelihoods, we find MLEs by setting these score functions to 0 and solving:

\[\hat\theta_{MLE} = \{\theta: \dot {\ell} (\theta)=0\}\]

Visual motivation

Consider the two concave log-likelihoods below:

We measure “sharpness” of the log-likelihood by finding \(Var(\dot \ell(\theta))\)
Under the assumption of a concave log-likelihood:

\[E(\dot \ell(\theta)) = 0 \Rightarrow Var(\dot \ell(\theta)) = E(\dot \ell(\theta)^2)\]

This quantity is known as Fisher’s information number:

\[I(\theta) = E(\dot \ell(\theta)^2)\]

Intuitively: “variability in the steepness of the log-likelihood”

Using the 2nd derivative

As you know from calc, we can also measure “concavity” using the second derivative
Another (often easier) way to compute \(I(\theta)\) is:

\[I(\theta) = E\left(-\ddot \ell (\theta)\right),\] where:

\[\ddot\ell(\theta) = \frac{d^2}{d\theta^2} \ln L(\theta)\]

Finding 2nd derivatives is usually easier than finding 2nd moments!

Exponential scale example

Let \(Y_1,...,Y_n \stackrel{i.i.d.}{\sim} EXP(\beta)\), with pdf:

\[f_Y(y) = \frac{1}{\beta}e^{-y/\beta}, y >0 \]

Find \(I(\beta)\), the amount of information this sample carries about \(\beta\).

\[\ln L(\beta) = -n\ln(\beta) - \frac{\sum_{i=1}^n Y_i}{\beta}\]

\[\dot \ell(\beta) = -\frac{n}{\beta} + \frac{\sum_{i=1}^n Y_i}{\beta^2}\]

\[\ddot \ell(\beta) = \frac{n}{\beta^2}-\frac{2\sum_{i=1}^n Y_i}{\beta^3}\]

\[\Rightarrow I(\beta) = E(-\ddot \ell(\beta)) = E\left(-\frac{n}{\beta^2}+\frac{2\sum_{i=1}^n Y_i}{\beta^3}\right) = -\frac{n}{\beta^2} + \frac{2n\beta}{\beta^3} = \frac{n}{\beta^2}\]

The Cramér-Rao lower bound

Harald Cramér and C.R. Rao (of the Rao-Blackwell theorem) used Fisher’s information number to derive a lower bound on the variance of any unbiased estimator of \(\theta\) (or a function thereof), known as the Cramér-Rao lower bound (CRLB).

C.R. Rao

Published independently in 1945

Image source: Penn State University

Harald Cramér

Published similar results in 1946

Image source: Wikipedia

Statement of CRLB

Let \(\hat\theta\) be an unbiased estimator of \(\theta\) based on an i.i.d. sample of size \(n\).
Then:

\[Var(\hat\theta) \ge \frac{1}{I(\theta)},\]

where \(I(\theta)\) is Fisher’s information number.

Implication: more information \(\Rightarrow\) smaller lower-bound on the variance.
If an estimator \(\hat\theta\) is unbiased and achieves the CRLB, then \(\hat\theta\) is said to be efficient.
Efficient estimators are by definition MVUEs (but not all MVUEs are efficient!)

CRLB for estimates of functions

Let \(\hat\tau(\theta)\) be an unbiased estimator of \(\tau(\theta)\) based on an i.i.d. sample of size \(n\).
Then:

\[Var(\hat\tau(\theta)) \ge \frac{\tau'(\theta)^2}{I(\theta)}\]

Exponential scale revisited

Let \(Y_1,...,Y_n \stackrel{i.i.d.}{\sim} EXP(\beta)\)
Goal:
1. Find the CRLB on any unbiased estimator of \(\beta\)
2. Find the variance of \(\bar Y\) and compare it to the CRLB
3. Discuss implications

We already found that \(I(\beta) = \frac{n}{\beta^2} \Rightarrow CRLB = \frac{1}{I(\beta)} = \frac{\beta^2}{n}\)

\(Var(\bar Y) = \frac{\sigma^2}{n} = \frac{\beta^2}{n}\)

Note that \(\bar Y\) achieves the CRLB \(\Rightarrow \bar Y\) is an efficient estimator, and is therefore the MVUE.

Exponential variance

The variance of the exponential distribution is \(\beta^2\)
What is the CRLB on the variance of any unbiased estimator of \(\beta^2\)?

\[\tau(\beta) = \beta^2\] \[\tau'(\beta) = 2\beta\] \[CRLB = \frac{\tau'(\beta)^2}{I(\beta)} = \frac{4\beta^2}{n/\beta^2} = \frac{4\beta^4}{n}\]

Note that:

\[E(\bar Y^2) = E(\bar Y)^2 + Var(\bar Y) = \beta^2 + \frac{\beta^2}{n},\]

\[\Rightarrow \hat \tau(\beta) = \frac{n}{n+1}\bar Y^2\]

is an unbiased estimator of \(\tau(\beta) = \beta^2.\)

Does \(\hat\tau(\beta)\) achieve the CRLB for unbiased estimators of \(\tau(\beta)=\beta^2\)?

Studying variance of \(\hat\tau(\beta)\)

Using facts that:
1. \(\bar Y \sim GAM(n, \beta/n)\)
2. \(E(\bar Y^k) = \frac{(\beta/n)^k \Gamma(n+k)}{\Gamma(n)}\)

\[Var(\hat\tau(\beta)) = Var\left(\frac{n}{n+1}\bar Y^2\right) = \frac{n^2}{(n+1)^2}(E(\bar Y^4) - E(\bar Y^2)^2)\]

\[= \frac{n^2}{(n+1)^2}\left[\frac{(\beta/n)^4\Gamma(n+4)}{\Gamma(n)}-\left(\frac{(\beta/n)^2\Gamma(n+2)}{\Gamma(n)}\right)^2\right]\] \[= \frac{(\beta/n)^4n^2}{(n+1)^2}\left[\frac{\Gamma(n+4)}{\Gamma(n)}-\left(\frac{\Gamma(n+2)}{\Gamma(n)}\right)^2\right]\]

Studying variance of \(\hat\tau(\beta)\)

library(tidyverse)

b <- 3

ggplot()+ 
  geom_function(fun=\(n) 4*b^4/n, aes(color='CRLB'))+
  geom_function(fun=\(n) ((b/n)^4*n^2/(n+1)^2) * (gamma(n+4)/gamma(n)-(gamma(n+2)/gamma(n))^2),
                aes(color='Variance of unbiased estimator'))+
  labs(color='',x='n',y='Variance')+
  theme_classic(base_size=18)+
  xlim(c(1,20))

\(\therefore \hat \tau(\beta)\) is not efficient, however we can still use Rao-Blackwell to prove that it is the MVUE of \(\beta^2\), since it is an unbiased function of the sufficient statistic \(\sum_i Y_i\).

Asymptotic normality and efficiency of MLEs

A powerful point in favor of Fisher’s MLEs is their asymptotic normality and efficiency.
Let \(\hat\tau(\theta) = \tau(\hat\theta_{MLE})\) be the MLE of \(\tau(\theta)\).
Then, under regularity conditions:

\[\frac{\tau(\hat\theta_{MLE})-\tau(\theta)}{\sqrt{CRLB}} \rightarrow_d N(0,1)\]

or, in other words:

\[\tau(\hat\theta_{MLE}) \sim AN\left(\tau(\theta),\frac{\tau'(\theta)^2}{I(\theta)}\right)\]

Implications

Under regularity conditions:

MLEs are asymptotically unbiased;
MLEs are asymptotically efficient;
MLEs are asymptotically normal, thus:

\[\tau(\hat\theta_{MLE}) \pm 1.96 \sqrt{\widehat{CRLB}}\] is an asymptotically-justified 95% confidence interval for \(\tau(\theta)\)

A very powerful combination of properties!

Studying asymptotic dist of \(\bar Y^2\)

Let \(Y_1,...,Y_n \stackrel{i.i.d.}{\sim} EXP(\beta)\)
Consider estimating \(\tau(\beta) = \beta^2\)
We know that \(\bar Y\) is the MLE of \(\beta\), thus by invariance \(\bar Y^2\) is the MLE of \(\beta^2\).
Simulating sampling distribution of \(\bar Y^2\):

library(purrrfect)

(many_mles <- parameters(~n, ~beta,
                         seq(10,500,by=50), c(0.2,0.5,1))
  
        %>% add_trials(10000)
        %>% mutate(Ysample = pmap(list(n,beta), .f= \(n,b) rexp(n, rate=1/b)))
        %>% mutate(ybar = map_dbl(Ysample, mean),
                   mle = ybar^2
                  )
) %>% head

# A tibble: 6 × 6
      n  beta .trial Ysample      ybar     mle
  <dbl> <dbl>  <dbl> <list>      <dbl>   <dbl>
1    10   0.2      1 <dbl [10]> 0.0987 0.00974
2    10   0.2      2 <dbl [10]> 0.225  0.0505 
3    10   0.2      3 <dbl [10]> 0.175  0.0307 
4    10   0.2      4 <dbl [10]> 0.257  0.0661 
5    10   0.2      5 <dbl [10]> 0.132  0.0175 
6    10   0.2      6 <dbl [10]> 0.199  0.0394

Sampling distributions of MLEs

ggplot(data = many_mles) + 
  geom_histogram(aes(x=mle), bins=100) + 
  geom_vline(aes(xintercept = beta^2))+ 
  theme_classic()+ 
  xlab('MLE') +
  facet_grid(n~beta, scales = 'free', labeller = label_both)

Bias of MLEs

mle_bias_variance <- many_mles %>% 
  summarize(var_mle = var(mle), 
            bias_mle = mean(mle-beta^2),
            .by = c(n,beta)
            )
ggplot(data = mle_bias_variance) + 
  geom_line(aes(x=n, y =bias_mle)) + 
  theme_classic()+ 
  ylab('Bias')+ 
  geom_hline(aes(yintercept=0), linetype=2)+
  facet_wrap(~beta, labeller = label_both)

Variance of MLEs

ggplot(data = mle_bias_variance) + 
  geom_line(aes(x=n, y =var_mle, col='Variance of MLE')) + 
  geom_line(aes(x=n, y= 4*beta^4/n, col='CRLB'))+
  theme_classic()+ 
  labs(color='', y='Variance')+ 
  facet_wrap(~beta, labeller = label_both, 
             scales='free_y')

Deriving asymptotic CI

Recall:

\[\tau(\hat\theta_{MLE}) \pm 1.96 \sqrt{\widehat{CRLB}}\]

is an asymptotically-justified 95% confidence interval for \(\tau(\theta)\).

We know CRLB for unbiased estimators of \(\tau(\beta) = \beta^2\) is \(CRLB = \frac{4\beta^4}{n}\)
Since \(\beta\) is unknown, we can plug in the MLE, thus:

\[\widehat{CRLB} = \frac{4\bar Y^4}{n}\]

Thus the asymptotic 95% CI is:

\[\bar Y^2 \pm 1.96 \sqrt{\frac{4\bar Y^4}{n}}\]

Simulating 95% CI coverage

We can simply add the 95% CIs and coverage status to our original simulation study:

 head(many_mles)

# A tibble: 6 × 6
      n  beta .trial Ysample      ybar     mle
  <dbl> <dbl>  <dbl> <list>      <dbl>   <dbl>
1    10   0.2      1 <dbl [10]> 0.0987 0.00974
2    10   0.2      2 <dbl [10]> 0.225  0.0505 
3    10   0.2      3 <dbl [10]> 0.175  0.0307 
4    10   0.2      4 <dbl [10]> 0.257  0.0661 
5    10   0.2      5 <dbl [10]> 0.132  0.0175 
6    10   0.2      6 <dbl [10]> 0.199  0.0394

(many_mles_with_ci <- many_mles 
  %>% mutate(lcl  = mle - 1.96*sqrt(4*ybar^4/n),
             ucl  = mle + 1.96*sqrt(4*ybar^4/n)
             )
  %>% mutate(covers = ifelse(lcl < beta^2 & ucl > beta^2, 1, 0))
) %>% head

# A tibble: 6 × 9
      n  beta .trial Ysample      ybar     mle      lcl    ucl covers
  <dbl> <dbl>  <dbl> <list>      <dbl>   <dbl>    <dbl>  <dbl>  <dbl>
1    10   0.2      1 <dbl [10]> 0.0987 0.00974 -0.00233 0.0218      0
2    10   0.2      2 <dbl [10]> 0.225  0.0505  -0.0121  0.113       1
3    10   0.2      3 <dbl [10]> 0.175  0.0307  -0.00736 0.0687      1
4    10   0.2      4 <dbl [10]> 0.257  0.0661  -0.0158  0.148       1
5    10   0.2      5 <dbl [10]> 0.132  0.0175  -0.00420 0.0393      0
6    10   0.2      6 <dbl [10]> 0.199  0.0394  -0.00945 0.0883      1

(coverage <- many_mles_with_ci 
  %>% summarize(coverage = mean(covers),
            .by = c(n,beta)
            )
) %>% head

# A tibble: 6 × 3
      n  beta coverage
  <dbl> <dbl>    <dbl>
1    10   0.2    0.858
2    10   0.5    0.865
3    10   1      0.857
4    60   0.2    0.928
5    60   0.5    0.928
6    60   1      0.928

Plotting coverage

ggplot(data = coverage) + 
  geom_line(aes(x=n, y =coverage)) + 
  theme_classic(base_size =18)+ 
  labs(y='Coverage')+ 
  geom_hline(aes(yintercept = 0.95), linetype=2)+
  facet_wrap(~beta, labeller = label_both)