8.1 - Bias, precision, and MSE of point estimators

Definition of point estimator

  • Let \(\theta\) represent an unknown parameter governing properties of a parent population.
  • A point estimator \(\hat\theta\) is a statistic that produces a single estimated value of the unknown parameter from an i.i.d. sample drawn from the population.

Parameter and estimator examples

Example 1:

  • \(\theta = \mu = E(Y)\), the population mean.
  • A good point estimator might be \(\hat\theta = \bar Y = \frac{\sum_{i=1}^n Y_i }{n}\).

Example 2:

  • \(\theta = \mu^2\), the squared mean of the population.
  • Possible estimator: \(\hat\theta_1 = \bar Y^2\).
  • Another possible estimator: \(\hat\theta_2 = \frac{\sum_{i=1}^n Y_i^2}{n}\).
  • Which one is better?

Example 3:

  • Let \(\theta = \sigma^2 = Var(Y)\), the population variance.
  • Point estimator #1: \(\hat\theta_1 = \hat\sigma^2_1 = \frac{\sum_{i} (Y_i-\bar Y)^2}{n}\)
  • Point estimator #2: \(\hat\theta_2 = \hat\sigma^2_2 = \frac{\sum_{i} (Y_i-\bar Y)^2}{n-1}\)
  • Most statistical software calculates \(\hat\sigma^2_2\) (R's var(), for instance; see the quick check below): WHY?
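A quick check that R's built-in var() indeed uses the \(n-1\) denominator (the tiny sample below is made up purely for illustration):

# R's var() divides the sum of squared deviations by n - 1, not n
y <- c(3, 7, 1, 9, 5)                     # made-up sample, n = 5
var(y)                                    # 10
sum((y - mean(y))^2) / (length(y) - 1)    # same thing: 10
sum((y - mean(y))^2) / length(y)          # the n-denominator version: 8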

Properties of point estimators

  • So what makes an estimator “good” or “bad”?
  • To get at this question, we need to define some properties.
  • Comparing these properties for different estimators allows statisticians to determine which are best for estimating a given \(\theta\).
  • All of these properties consider behaviors of an estimator \(\hat\theta\) across repeated samples, that is, with respect to its sampling distribution.

Bias

  • The bias of a point estimator \(\hat\theta\) of \(\theta\) is

\[B(\hat\theta) = E(\hat\theta - \theta) = E(\hat\theta) - \theta\]

  • If \(B(\hat\theta) = 0\), which is the same thing as saying \(E(\hat\theta) = \theta\), then \(\hat\theta\) is said to be an unbiased estimator of \(\theta\).

  • Unbiasedness is obviously a desirable property: on average across repeated sampling, \(\hat\theta\) will equal the unknown parameter \(\theta\).

Variance

  • An estimator can be unbiased yet highly variable.
  • Precision is also a desirable property of an estimator, one that is measured by its variance: the smaller the variance of \(\hat\theta\), the greater the precision.
  • The variance of an estimator is defined to be \(Var(\hat\theta)\).
  • The standard error of an estimator is simply the standard deviation of the estimator:

\[SE(\hat\theta) = \sqrt{Var(\hat\theta)}\]

  • Note that both \(Var(\hat\theta)\) and \(SE(\hat\theta)\) might also depend on unknown parameters.

Which marksperson wins silver?

Image source: ResearchGate

The bias-variance trade-off

All of the plots below are sampling distributions of \(\hat\theta\):

illustration of the bias-variance trade-off

Mean-square error

  • The mean-square error (MSE) incorporates both the precision and the bias of an estimator.
  • Defined:

\[MSE(\hat\theta) = E\left[(\hat\theta - \theta)^2\right]\]

  • Can show: \(MSE(\hat\theta) = B(\hat\theta)^2 + Var(\hat\theta)\) (practice!)
  • Thus, the MSE rewards estimators that are both unbiased and precise, but it may trade a bit of bias for a gain in precision (or vice versa), as the small numeric example below illustrates.
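For example (made-up numbers): an estimator with bias \(0.5\) and variance \(2\) has \(MSE = 0.5^2 + 2 = 2.25\), while an unbiased estimator with variance \(2.5\) has \(MSE = 2.5\). The slightly biased but more precise estimator wins on MSE.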

Example: estimating a population mean

  • Let \(Y_1,Y_2,...,Y_n\) be an i.i.d. sample drawn from a population with mean \(\mu\) and variance \(\sigma^2\).
  • Population is NOT necessarily normal!

Fact 1: \(\bar Y\) is an unbiased estimator of \(\mu\).

\[\scriptsize E(\bar Y) = E\left(\frac{1}{n}\sum_{i=1}^n Y_i \right) =\frac{1}{n}\sum_{i=1}^n E(Y_i) = \frac{1}{n} \cdot n\mu = \mu\]

Fact 2: \(SE(\bar Y) =\sigma/\sqrt{n}\).

\[\scriptsize Var(\bar Y) = Var\left(\frac{1}{n}\sum_{i=1}^n Y_i \right) =\frac{1}{n^2}\sum_{i=1}^n Var(Y_i) = \frac{1}{n^2} \cdot n\sigma^2 = \frac{\sigma^2}{n}\]

\[\scriptsize SE(\bar Y) = \sqrt{Var(\bar Y)} = \frac{\sigma}{\sqrt{n}}\]

  • Moral:
    • The sample mean is an unbiased estimator of the population mean.
    • The sample mean becomes more precise as \(n\) increases.
    • These are distribution-agnostic facts! (A quick simulation check follows.)
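A minimal base-R sketch of that check, using an exponential population with rate 0.5 (chosen purely for illustration, since it is decidedly non-normal):

# check Fact 1 and Fact 2 by simulation with a non-normal population
set.seed(1)
n <- 10                                                 # sample size
ybars <- replicate(10000, mean(rexp(n, rate = 0.5)))    # EXP(0.5): mu = 2, sigma = 2
mean(ybars)                                             # close to mu = 2
sd(ybars)                                               # close to sigma/sqrt(n) = 2/sqrt(10), about 0.63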

Example: comparing bias of two variance estimators

  • Let \(Y_1,Y_2,...,Y_n\) be an i.i.d. sample drawn from a population with mean \(\mu\) and variance \(\sigma^2\).

  • Compare the bias of these two estimators of \(\sigma^2\):

    \[\hat\sigma^2_1 = \frac{\sum_{i=1}^n (Y_i-\bar Y)^2}{n}\,\,\,\,\,\, \hat\sigma^2_2 = \frac{\sum_{i=1}^n (Y_i-\bar Y)^2}{n-1}\]

\[\scriptsize E\left(\sum_{i=1}^n (Y_i-\bar Y)^2\right) =E\left(\sum_{i=1}^n \left(Y_i^2-2Y_i\bar Y+\bar Y^2\right)\right) =E\left(\sum_{i=1}^n Y_i^2-2\bar Y \underbrace{\sum_{i=1}^nY_i}_{n\bar Y}+\sum_{i=1}^n\bar Y^2\right)\]

\[\scriptsize =E\left(\sum_{i=1}^n Y_i^2-n\bar Y^2\right) =\sum_{i=1}^n E(Y_i^2)-nE(\bar Y^2)=\sum_{i=1}^n (\mu^2 +\sigma^2)-n\left(\mu^2 + \frac{\sigma^2}{n}\right) = \sigma^2(n-1)\]

\[E(\hat\sigma^2_1) = \sigma^2\frac{n-1}{n}\] \[\Rightarrow B(\hat\sigma^2_1) = \sigma^2\frac{n-1}{n}-\sigma^2 = -\frac{\sigma^2}{n}\]

\[E(\hat\sigma^2_2) = \sigma^2\frac{n-1}{n-1} = \sigma^2\] \[\Rightarrow B(\hat\sigma^2_2) = \sigma^2-\sigma^2 = 0\]

Simulating two variance estimators

  • Let's compare the two estimators of \(\sigma^2\) via simulation. If \(Y_i \sim EXP(\lambda)\), then \(\sigma^2 = 1/\lambda^2\).
library(purrrfect)
library(tidyverse)

set.seed(12225)
N <- 10000   # number of simulated samples (trials) per parameter combination
(two_variance_estimators <- parameters(~lambda, ~n, 
                                       c(0.2, 0.5, 1), c(2, 5, 10, 25, 75, 100)
                                       )
  %>% add_trials(N) 
  %>% mutate(true_sigma2 = 1/lambda^2)                                          # true variance of EXP(lambda)
  %>% mutate(Ysample = pmap(list(lambda, n), .f = \(l, n) rexp(n, rate = l)))   # draw each sample
  %>% mutate(SSE = map_dbl(Ysample, \(y) sum((y-mean(y))^2)))                   # sum of squared deviations
  %>% mutate(estimator1 = SSE/n, estimator2 = SSE/(n-1))                        # the two competing estimators
) %>% head
# A tibble: 6 × 8
  lambda     n .trial true_sigma2 Ysample     SSE estimator1 estimator2
   <dbl> <dbl>  <dbl>       <dbl> <list>    <dbl>      <dbl>      <dbl>
1    0.2     2      1          25 <dbl [2]>  2.93       1.47       2.93
2    0.2     2      2          25 <dbl [2]>  2.02       1.01       2.02
3    0.2     2      3          25 <dbl [2]> 39.5       19.7       39.5 
4    0.2     2      4          25 <dbl [2]>  5.80       2.90       5.80
5    0.2     2      5          25 <dbl [2]>  4.48       2.24       4.48
6    0.2     2      6          25 <dbl [2]> 16.2        8.11      16.2 

Aggregating to find bias

  • To find the bias (and, similarly, the variance or MSE), we summarize() over the trials for each combination in the parameter grid:
(sample_variance_biases <- two_variance_estimators
         %>% summarize(bias_estimator1 = mean(estimator1-true_sigma2), 
                       bias_estimator2 = mean(estimator2-true_sigma2), 
                       .by = c(n, lambda, true_sigma2)
                       )
)
# A tibble: 18 × 5
       n lambda true_sigma2 bias_estimator1 bias_estimator2
   <dbl>  <dbl>       <dbl>           <dbl>           <dbl>
 1     2    0.2          25       -12.9            -0.737  
 2     5    0.2          25        -5.08           -0.0952 
 3    10    0.2          25        -2.57           -0.0800 
 4    25    0.2          25        -0.745           0.265  
 5    75    0.2          25        -0.282           0.0520 
 6   100    0.2          25        -0.251          -0.00117
 7     2    0.5           4        -1.92            0.168  
 8     5    0.5           4        -0.767           0.0410 
 9    10    0.5           4        -0.395           0.00590
10    25    0.5           4        -0.135           0.0264 
11    75    0.5           4        -0.0432          0.0103 
12   100    0.5           4        -0.0596         -0.0198 
13     2    1             1        -0.504          -0.00874
14     5    1             1        -0.209          -0.0113 
15    10    1             1        -0.0869          0.0145 
16    25    1             1        -0.0365          0.00363
17    75    1             1        -0.0107          0.00269
18   100    1             1        -0.00441         0.00565

Plotting vs \(n\)

The simulated biases above can be put into perspective by plotting them as a function of \(n\) for each value of \(\sigma^2\):

ggplot(data = sample_variance_biases,aes(x=n))+ 
  geom_point(aes(y = bias_estimator1, color='n denominator')) + 
  geom_line(aes(y = bias_estimator1, color='n denominator')) + 
  geom_point(aes(y = bias_estimator2, color='n-1 denominator')) + 
  geom_line(aes(y = bias_estimator2, color='n-1 denominator')) + 
  geom_hline(yintercept=0) + 
  facet_wrap(~true_sigma2, labeller = label_bquote(sigma^2 == .(true_sigma2)))+ 
  labs(y='Bias',
       x= 'Sample size (n)',
       color='',
       title=expression(paste('Simulated bias of two different estimators of ', sigma^2))) + 
  theme_classic() 

Bias vs n for 2 different estimators

  • The \(n-1\) estimator appears unbiased for all \(n\), and for all \(\sigma^2\).
  • The \(n\) estimator is negatively biased (too small) for small \(n\), but this bias goes away as \(n\) gets large.
  • The bias of the \(n\) estimator for small \(n\) is more severe for larger values of \(\sigma^2\).

General notes on simulating estimator properties

  • Finding the bias/variance/MSE (and, later, CI coverage) of estimators requires aggregating (a.k.a. “summarizing”) the simulation studies, often by each parameter combination in the grid. This is a new step for us; a sketch appears after this list.
  • Because it’s a simulation study, theoretically unbiased estimators may often have non-zero simulated biases. This is due to the random nature of simulation studies. One way to tell if a simulated bias is “real” or “random”:
    • Up the number of replications (say from \(N=10,000\) to \(N=100,000\) - what happens if we do that in the previous code?)
    • Consider whether/how the simulated bias changes as \(n\) (sample size) increases - is it changing systematically (evidence of real bias) or “jumping around” from one small number to another, especially from small negative to small positive (evidence of random simulation error)?
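As a sketch of that aggregation step (reusing the two_variance_estimators tibble from earlier), the same summarize() call can report simulated bias, variance, and MSE together in one pass:

# sketch: simulated bias, variance, and MSE of both estimators, per parameter combination
(two_variance_estimators
  %>% summarize(bias_estimator1 = mean(estimator1 - true_sigma2),
                bias_estimator2 = mean(estimator2 - true_sigma2),
                var_estimator1  = var(estimator1),
                var_estimator2  = var(estimator2),
                mse_estimator1  = mean((estimator1 - true_sigma2)^2),
                mse_estimator2  = mean((estimator2 - true_sigma2)^2),
                .by = c(n, lambda, true_sigma2)
                )
)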

Relative vs absolute bias

  • In the previous example, it’s tempting to look at the plot and think “the bias of the \(n\) estimator isn’t that bad if \(\sigma^2\) is small.” This would be misleading.

  • Often the bias of an estimator depends on the size of the parameter it’s trying to estimate.

  • This motivates an alternative way to measure bias.

  • Consider which is worse for an arbitrary example trying to estimate a parameter \(\theta\) using an estimator \(\hat\theta\):

    \[\theta = 4, B(\hat\theta) = 2\] \[\theta = 100, B(\hat\theta)= 10\]

  • While the absolute bias is lower when \(\theta = 4\), the bias is half as big as the parameter!

  • When \(\theta = 100\), the bias is only 10% of the size. This motivates an alternative way to measure bias: the relative bias.

Definition of relative bias

The relative bias of an estimator is:

\[RelBias(\hat\theta) = \frac{B(\hat\theta)}{\theta} = \frac{E(\hat\theta - \theta)}{\theta}\]

The relative bias can be interpreted as a percent, that is “what percent of \(\theta\) is this bias?”

Relative bias in simulation study

update_biases  <- (sample_variance_biases
                   %>% mutate(relbias_estimator1 = bias_estimator1/true_sigma2,
                              relbias_estimator2 = bias_estimator2/true_sigma2
                              )
                   )
ggplot(data = update_biases,aes(x=n))+ 
  geom_point(aes(y = relbias_estimator1, color='n denominator')) + 
  geom_line(aes(y = relbias_estimator1, color='n denominator')) + 
  geom_point(aes(y = relbias_estimator2, color='n-1 denominator')) + 
  geom_line(aes(y = relbias_estimator2, color='n-1 denominator')) + 
  geom_hline(yintercept=0) + 
  facet_wrap(~true_sigma2, labeller = label_bquote(sigma^2 == .(true_sigma2)))+ 
  labs(y='Relative bias',
       x= 'Sample size (n)',
       color='',
       title=expression(paste('Simulated relative bias of two different estimators of ', sigma^2))) + 
  theme_classic() 
Relative bias vs n for 2 different estimators

Now we can clearly see that the \(n\)-denominator estimator has the same relative bias for small \(n\) regardless of \(\sigma^2\): relative to the parameter, its bias is just as severe when \(\sigma^2\) is small as when it is large.

Example: estimating Bernoulli \(p\)

  • Let \(Y_1,Y_2,...,Y_n\) be i.i.d. \(\sim BERN(p)\) random variables
  • Let \(X = \sum_{i=1}^n Y_i\) (note \(X\sim BIN(n,p)\))
  • Consider two estimators of \(p\):
  1. \(\hat p_1 = \frac{X}{n}\), the typical sample proportion estimator
  2. \(\hat p_2 = \frac{X+1}{n + 2}\): the “add one success and one failure” estimator
  • Compare the bias, variance, and MSE of these estimators.

\[\scriptsize E(\hat p_1) = E\left(\frac{X}{n}\right) = \frac{np}{n} = p \Rightarrow B(\hat p_1) = 0\]

\[\scriptsize Var(\hat p_1) = Var\left(\frac{X}{n}\right) = \frac{np(1-p)}{n^2} = \frac{p(1-p)}{n}\]

\[\scriptsize MSE(\hat p_1) = B(\hat p_1)^2 + Var(\hat p_1) = 0 +\frac{p(1-p)}{n}=\frac{p(1-p)}{n}\]

\[\scriptsize E(\hat p_2) = E\left(\frac{X+1}{n+2}\right) = \frac{np+1}{n+2} \Rightarrow B(\hat p_2) = \frac{np+1}{n+2}-p=\frac{1-2p}{n+2}\]

\[\scriptsize Var(\hat p_2) = Var\left(\frac{X+1}{n+2}\right) = \frac{np(1-p)}{(n+2)^2} \]

\[\scriptsize MSE(\hat p_2) = B(\hat p_2)^2 + Var(\hat p_2) = \left(\frac{1-2p}{n+2}\right)^2 +\frac{np(1-p)}{(n+2)^2}\]

  • \(\hat p_1\) is always unbiased; \(\hat p_2\) is biased unless \(p = 0.5\).
  • The MSEs are difficult to compare analytically, so we plot them.

Plotting the MSEs

  • Although \(\hat p_2\) is biased when \(p\ne 0.5\), it appears to have the smaller MSE except for extreme values of \(p\) near 0 or 1.
  • As \(n\) increases, the discrepancies between the estimators lessen. (A sketch of the computation follows the figure below.)
MSE of two bernoulli estimators
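A sketch of how MSE curves like those in the figure could be produced, plugging the formulas above into a grid of \(p\) values (the sample sizes 5, 25, and 100 are arbitrary choices for illustration):

# sketch: evaluate the analytical MSEs of the two estimators of p over a grid
mse_grid <- expand_grid(p = seq(0.01, 0.99, by = 0.01), n = c(5, 25, 100)) %>%
  mutate(mse_p1 = p*(1 - p)/n,                                    # MSE of X/n
         mse_p2 = ((1 - 2*p)/(n + 2))^2 + n*p*(1 - p)/(n + 2)^2)  # MSE of (X+1)/(n+2)

ggplot(mse_grid, aes(x = p)) +
  geom_line(aes(y = mse_p1, color = 'X/n')) +
  geom_line(aes(y = mse_p2, color = '(X+1)/(n+2)')) +
  facet_wrap(~n, scales = 'free_y', labeller = label_bquote(n == .(n))) +
  labs(y = 'MSE', x = 'p', color = 'Estimator') +
  theme_classic()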

Relative efficiency

  • If two estimators are both unbiased, then we can compare how “efficiently” they use the information in the sample by taking the ratio of their variances.
  • Definition: The efficiency of \(\hat\theta_1\) relative to \(\hat\theta_2\), denoted \(RE(\hat\theta_1,\hat\theta_2)\), is:

\[RE(\hat\theta_1,\hat\theta_2) = \frac{Var(\hat\theta_1)}{Var(\hat\theta_2)}.\]

  • \(RE(\hat\theta_1,\hat\theta_2) > 1\) implies we prefer \(\hat\theta_2\), whereas if \(RE(\hat\theta_1,\hat\theta_2) < 1\) we prefer \(\hat\theta_1\).

Example: estimating uniform upper bound

  • Suppose \(Y_1,Y_2,...,Y_n\) are i.i.d. \(UNIF(0,\theta)\).
  • Consider two estimators of \(\theta\):
    • \(\hat\theta_1=2\bar Y\)
    • \(\hat\theta_2 = \left(\frac{n+1}{n}\right)Y_{(n)}\)
  • Compare bias of estimators and find \(RE(\hat\theta_1,\hat\theta_2)\).

Studying \(\hat\theta_1\)

\[E(\hat\theta_1) = E(2\bar Y) = 2E(\bar Y) = 2 \mu = 2\frac{\theta}{2} = \theta \Rightarrow B(\hat\theta_1) = 0\] \[Var(\hat\theta_1) = Var(2\bar Y) = 4Var(\bar Y) = 4\frac{\sigma^2}{n} = 4\frac{\theta^2/12}{n} =\frac{\theta^2}{3n}\]

Studying \(\hat\theta_2\)

Need sampling distribution of \(Y_{(n)}\). For \(y\in (0,\theta)\):

\[F_Y(y) = \frac{y}{\theta},\,\,f_Y(y) = \frac{1}{\theta}\]

\[f_{Y_{(n)}}(y) = n \frac{1}{\theta}\cdot\left(\frac{y}{\theta}\right)^{n-1}=n\frac{y^{n-1}}{\theta^n}, 0 < y <\theta\]

\[E(Y_{(n)}) = \int_0^\theta y\cdot n\frac{y^{n-1}}{\theta^n} dy = \frac{n}{n+1}\theta \Rightarrow E(\hat\theta_2)= E\left(\frac{n+1}{n} Y_{(n)}\right) = \theta \Rightarrow B(\hat\theta_2) = 0\]

\[E(Y_{(n)}^2) = \int_0^\theta y^2\cdot n\frac{y^{n-1}}{\theta^n} dy = \frac{n}{n+2}\theta^2 \Rightarrow Var(Y_{(n)}) = \theta^2 \frac{n}{(n+2)(n+1)^2}\] \[\Rightarrow Var(\hat\theta_2) = Var\left(\frac{n+1}{n} Y_{(n)}\right)=\frac{\theta^2}{n(n+2)}\]

Finding relative efficiency

  • Since both \(\hat\theta_1\) and \(\hat\theta_2\) are unbiased, it makes sense to find their relative efficiency:

\[RE(\hat\theta_1,\hat\theta_2) = \frac{\theta^2/3n}{\theta^2/n(n+2)} = \frac{n+2}{3} > 1\, \forall\, n>1\]

  • Thus, the unbiased estimator \(\hat\theta_2\) is more precise than the unbiased estimator \(\hat\theta_1\).
  • Another perspective:
    • When \(n=1\), both estimators use the sample information equally well (why might this be?).
    • When \(n=10\), \(\hat\theta_2\) uses the sample information 4 times more efficiently than \(\hat\theta_1\) (it is 4 times as precise). A quick simulation check is sketched below.
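Here is that sketch, with arbitrary choices \(\theta = 10\) and \(n = 10\), confirming both unbiasedness and the relative efficiency of 4:

# sketch: simulate both estimators of the UNIF(0, theta) upper bound
set.seed(1)
theta <- 10; n <- 10
sims <- replicate(100000, {
  y <- runif(n, 0, theta)
  c(est1 = 2*mean(y),              # method-of-moments estimator
    est2 = (n + 1)/n * max(y))     # rescaled-maximum estimator
})
rowMeans(sims)                               # both close to theta = 10
var(sims['est1', ]) / var(sims['est2', ])    # close to (n + 2)/3 = 4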