9.4 - Fisher’s information number and the Cramér-Rao lower bound
Introduction
One of the reasons Fisher was so interested in likelihoods was the idea that they could provide a way to quantify the amount of information about a parameter \(\theta\) available in the sample.
The amount of information available can be measured by the “sharpness” of the log-likelihood function
Visual motivation
Consider the two concave log-likelihoods below:
The steepness of the log-likelihood at any \(\theta\) is determined by the score function:
\[\dot \ell(\theta) = \frac{d}{d\theta} \ln L(\theta)\]
Note that for concave log-likelihoods, we find MLEs by setting these score functions to 0 and solving:
\[\hat\theta_{MLE} = \{\theta: \dot {\ell} (\theta)=0\}\]
Visual motivation
Consider the two concave log-likelihoods below:
We measure “sharpness” of the log-likelihood by finding \(Var(\dot \ell(\theta))\)
Under the assumption of a concave log-likelihood:
\[E(\dot \ell(\theta)) = 0 \Rightarrow Var(\dot \ell(\theta)) = E(\dot \ell(\theta)^2)\]
This quantity is known as Fisher’s information number :
\[I(\theta) = E(\dot \ell(\theta)^2)\]
Intuitively: “variability in the steepness of the log-likelihood”
Using the 2nd derivative
As you know from calc, we can also measure “concavity” using the second derivative
Another (often easier) way to compute \(I(\theta)\) is:
\[I(\theta) = E\left(-\ddot \ell (\theta)\right),\] where:
\[\ddot\ell(\theta) = \frac{d^2}{d\theta^2} \ln L(\theta)\]
Finding 2nd derivatives is usually easier than finding 2nd moments!
Exponential scale example
Let \(Y_1,...,Y_n \stackrel{i.i.d.}{\sim} EXP(\beta)\) , with pdf:
\[f_Y(y) = \frac{1}{\beta}e^{-y/\beta}, y >0 \]
Find \(I(\beta)\) , the amount of information this sample carries about \(\beta\) .
\[\ln L(\beta) = -n\ln(\beta) - \frac{\sum_{i=1}^n Y_i}{\beta}\]
\[\dot \ell(\beta) = -\frac{n}{\beta} + \frac{\sum_{i=1}^n Y_i}{\beta^2}\]
\[\ddot \ell(\beta) = \frac{n}{\beta^2}-\frac{2\sum_{i=1}^n Y_i}{\beta^3}\]
\[\Rightarrow I(\beta) = E(-\ddot \ell(\beta)) = E\left(-\frac{n}{\beta^2}+\frac{2\sum_{i=1}^n Y_i}{\beta^3}\right) = -\frac{n}{\beta^2} + \frac{2n\beta}{\beta^3} = \frac{n}{\beta^2}\]
The Cramér-Rao lower bound
Harald Cramér and C.R. Rao (of the Rao-Blackwell theorem) used Fisher’s information number to derive a lower bound on the variance of any unbiased estimator of \(\theta\) (or a function thereof), known as the Cramér-Rao lower bound (CRLB).
Harald Cramér
Published similar results in 1946
Image source: Wikipedia
Statement of CRLB
Let \(\hat\theta\) be an unbiased estimator of \(\theta\) based on an i.i.d. sample of size \(n\) .
Then:
\[Var(\hat\theta) \ge \frac{1}{I(\theta)},\]
where \(I(\theta)\) is Fisher’s information number.
Implication: more information \(\Rightarrow\) smaller lower-bound on the variance.
If an estimator \(\hat\theta\) is unbiased and achieves the CRLB, then \(\hat\theta\) is said to be efficient.
Efficient estimators are by definition MVUEs (but not all MVUEs are efficient!)
CRLB for estimates of functions
Let \(\hat\tau(\theta)\) be an unbiased estimator of \(\tau(\theta)\) based on an i.i.d. sample of size \(n\) .
Then:
\[Var(\hat\tau(\theta)) \ge \frac{\tau'(\theta)^2}{I(\theta)}\]
Exponential scale revisited
Let \(Y_1,...,Y_n \stackrel{i.i.d.}{\sim} EXP(\beta)\)
Goal:
Find the CRLB on any unbiased estimator of \(\beta\)
Find the variance of \(\bar Y\) and compare it to the CRLB
Discuss implications
We already found that \(I(\beta) = \frac{n}{\beta^2} \Rightarrow CRLB = \frac{1}{I(\beta)} = \frac{\beta^2}{n}\)
\(Var(\bar Y) = \frac{\sigma^2}{n} = \frac{\beta^2}{n}\)
Note that \(\bar Y\) achieves the CRLB \(\Rightarrow \bar Y\) is an efficient estimator , and is therefore the MVUE.
Exponential variance
The variance of the exponential distribution is \(\beta^2\)
What is the CRLB on the variance of any unbiased estimator of \(\beta^2\) ?
\[\tau(\beta) = \beta^2\] \[\tau'(\beta) = 2\beta\] \[CRLB = \frac{\tau'(\beta)^2}{I(\beta)} = \frac{4\beta^2}{n/\beta^2} = \frac{4\beta^4}{n}\]
\[E(\bar Y^2) = E(\bar Y)^2 + Var(\bar Y) = \beta^2 + \frac{\beta^2}{n},\]
\[\Rightarrow \hat \tau(\beta) = \frac{n}{n+1}\bar Y^2\]
is an unbiased estimator of \(\tau(\beta) = \beta^2.\)
Does \(\hat\tau(\beta)\) achieve the CRLB for unbiased estimators of \(\tau(\beta)=\beta^2\) ?
Studying variance of \(\hat\tau(\beta)\)
Using facts that:
\(\bar Y \sim GAM(n, \beta/n)\)
\(E(\bar Y^k) = \frac{(\beta/n)^k \Gamma(n+k)}{\Gamma(n)}\)
\[Var(\hat\tau(\beta)) = Var\left(\frac{n}{n+1}\bar Y^2\right) = \frac{n^2}{(n+1)^2}(E(\bar Y^4) - E(\bar Y^2)^2)\]
\[= \frac{n^2}{(n+1)^2}\left[\frac{(\beta/n)^4\Gamma(n+4)}{\Gamma(n)}-\left(\frac{(\beta/n)^2\Gamma(n+2)}{\Gamma(n)}\right)^2\right]\] \[= \frac{(\beta/n)^4n^2}{(n+1)^2}\left[\frac{\Gamma(n+4)}{\Gamma(n)}-\left(\frac{\Gamma(n+2)}{\Gamma(n)}\right)^2\right]\]
Studying variance of \(\hat\tau(\beta)\)
library (tidyverse)
b <- 3
ggplot ()+
geom_function (fun= \(n) 4 * b^ 4 / n, aes (color= 'CRLB' ))+
geom_function (fun= \(n) ((b/ n)^ 4 * n^ 2 / (n+ 1 )^ 2 ) * (gamma (n+ 4 )/ gamma (n)- (gamma (n+ 2 )/ gamma (n))^ 2 ),
aes (color= 'Variance of unbiased estimator' ))+
labs (color= '' ,x= 'n' ,y= 'Variance' )+
theme_classic (base_size= 18 )+
xlim (c (1 ,20 ))
\(\therefore \hat \tau(\beta)\) is not efficient, however we can still use Rao-Blackwell to prove that it is the MVUE of \(\beta^2\) , since it is an unbiased function of the sufficient statistic \(\sum_i Y_i\) .
Asymptotic normality and efficiency of MLEs
A powerful point in favor of Fisher’s MLEs is their asymptotic normality and efficiency.
Let \(\hat\tau(\theta) = \tau(\hat\theta_{MLE})\) be the MLE of \(\tau(\theta)\) .
Then, under regularity conditions:
\[\frac{\tau(\hat\theta_{MLE})-\tau(\theta)}{\sqrt{CRLB}} \rightarrow_d N(0,1)\]
or, in other words:
\[\tau(\hat\theta_{MLE}) \sim AN\left(\tau(\theta),\frac{\tau'(\theta)^2}{I(\theta)}\right)\]
Implications
Under regularity conditions:
MLEs are asymptotically unbiased;
MLEs are asymptotically efficient;
MLEs are asymptotically normal, thus:
\[\tau(\hat\theta_{MLE}) \pm 1.96 \sqrt{\widehat{CRLB}}\] is an asymptotically-justified 95% confidence interval for \(\tau(\theta)\)
A very powerful combination of properties!
Studying asymptotic dist of \(\bar Y^2\)
Let \(Y_1,...,Y_n \stackrel{i.i.d.}{\sim} EXP(\beta)\)
Consider estimating \(\tau(\beta) = \beta^2\)
We know that \(\bar Y\) is the MLE of \(\beta\) , thus by invariance \(\bar Y^2\) is the MLE of \(\beta^2\) .
Simulating sampling distribution of \(\bar Y^2\) :
library (purrrfect)
(many_mles <- parameters (~ n, ~ beta,
seq (10 ,500 ,by= 50 ), c (0.2 ,0.5 ,1 ))
%>% add_trials (10000 )
%>% mutate (Ysample = pmap (list (n,beta), .f= \(n,b) rexp (n, rate= 1 / b)))
%>% mutate (ybar = map_dbl (Ysample, mean),
mle = ybar^ 2
)
) %>% head
# A tibble: 6 × 6
n beta .trial Ysample ybar mle
<dbl> <dbl> <dbl> <list> <dbl> <dbl>
1 10 0.2 1 <dbl [10]> 0.0987 0.00974
2 10 0.2 2 <dbl [10]> 0.225 0.0505
3 10 0.2 3 <dbl [10]> 0.175 0.0307
4 10 0.2 4 <dbl [10]> 0.257 0.0661
5 10 0.2 5 <dbl [10]> 0.132 0.0175
6 10 0.2 6 <dbl [10]> 0.199 0.0394
Sampling distributions of MLEs
ggplot (data = many_mles) +
geom_histogram (aes (x= mle), bins= 100 ) +
geom_vline (aes (xintercept = beta^ 2 ))+
theme_classic ()+
xlab ('MLE' ) +
facet_grid (n~ beta, scales = 'free' , labeller = label_both)
Bias of MLEs
mle_bias_variance <- many_mles %>%
summarize (var_mle = var (mle),
bias_mle = mean (mle- beta^ 2 ),
.by = c (n,beta)
)
ggplot (data = mle_bias_variance) +
geom_line (aes (x= n, y = bias_mle)) +
theme_classic ()+
ylab ('Bias' )+
geom_hline (aes (yintercept= 0 ), linetype= 2 )+
facet_wrap (~ beta, labeller = label_both)
Variance of MLEs
ggplot (data = mle_bias_variance) +
geom_line (aes (x= n, y = var_mle, col= 'Variance of MLE' )) +
geom_line (aes (x= n, y= 4 * beta^ 4 / n, col= 'CRLB' ))+
theme_classic ()+
labs (color= '' , y= 'Variance' )+
facet_wrap (~ beta, labeller = label_both,
scales= 'free_y' )
Deriving asymptotic CI
\[\tau(\hat\theta_{MLE}) \pm 1.96 \sqrt{\widehat{CRLB}}\]
is an asymptotically-justified 95% confidence interval for \(\tau(\theta)\) .
We know CRLB for unbiased estimators of \(\tau(\beta) = \beta^2\) is \(CRLB = \frac{4\beta^4}{n}\)
Since \(\beta\) is unknown, we can plug in the MLE, thus:
\[\widehat{CRLB} = \frac{4\bar Y^4}{n}\]
Thus the asymptotic 95% CI is:
\[\bar Y^2 \pm 1.96 \sqrt{\frac{4\bar Y^4}{n}}\]
Simulating 95% CI coverage
We can simply add the 95% CIs and coverage status to our original simulation study:
# A tibble: 6 × 6
n beta .trial Ysample ybar mle
<dbl> <dbl> <dbl> <list> <dbl> <dbl>
1 10 0.2 1 <dbl [10]> 0.0987 0.00974
2 10 0.2 2 <dbl [10]> 0.225 0.0505
3 10 0.2 3 <dbl [10]> 0.175 0.0307
4 10 0.2 4 <dbl [10]> 0.257 0.0661
5 10 0.2 5 <dbl [10]> 0.132 0.0175
6 10 0.2 6 <dbl [10]> 0.199 0.0394
(many_mles_with_ci <- many_mles
%>% mutate (lcl = mle - 1.96 * sqrt (4 * ybar^ 4 / n),
ucl = mle + 1.96 * sqrt (4 * ybar^ 4 / n)
)
%>% mutate (covers = ifelse (lcl < beta^ 2 & ucl > beta^ 2 , 1 , 0 ))
) %>% head
# A tibble: 6 × 9
n beta .trial Ysample ybar mle lcl ucl covers
<dbl> <dbl> <dbl> <list> <dbl> <dbl> <dbl> <dbl> <dbl>
1 10 0.2 1 <dbl [10]> 0.0987 0.00974 -0.00233 0.0218 0
2 10 0.2 2 <dbl [10]> 0.225 0.0505 -0.0121 0.113 1
3 10 0.2 3 <dbl [10]> 0.175 0.0307 -0.00736 0.0687 1
4 10 0.2 4 <dbl [10]> 0.257 0.0661 -0.0158 0.148 1
5 10 0.2 5 <dbl [10]> 0.132 0.0175 -0.00420 0.0393 0
6 10 0.2 6 <dbl [10]> 0.199 0.0394 -0.00945 0.0883 1
(coverage <- many_mles_with_ci
%>% summarize (coverage = mean (covers),
.by = c (n,beta)
)
) %>% head
# A tibble: 6 × 3
n beta coverage
<dbl> <dbl> <dbl>
1 10 0.2 0.858
2 10 0.5 0.865
3 10 1 0.857
4 60 0.2 0.928
5 60 0.5 0.928
6 60 1 0.928
Plotting coverage
ggplot (data = coverage) +
geom_line (aes (x= n, y = coverage)) +
theme_classic (base_size = 18 )+
labs (y= 'Coverage' )+
geom_hline (aes (yintercept = 0.95 ), linetype= 2 )+
facet_wrap (~ beta, labeller = label_both)