Bias-variance tradeoff

Author

Labor Econ - Hunan U

1 Introduction

In Class Module 1, we talked about the bias-variance tradeoff and briefly mentioned the James-Stein estimator. The main takeaway is that a biased estimator can outperform an unbiased one by achieving a lower mean squared error (MSE). In statistics, the most notable example is the James-Stein estimator.

We use R to demonstrate the superiority of the James-Stein estimator for a particular model, which is also a good exercise in R programming and statistical computing.

2 Set-up

The hiring problem. Alice is a hiring manager. She looks at \(N\) different aspects of each candidate’s characteristics (say, communication skills, analytical skills, etc.). A candidate’s true characteristic in aspect \(i\) is \(μ_i\). Alice cannot observe \(μ_i\), but she can obtain an imperfect measure \(z_i\) of each \(μ_i\). Alice wishes to have a good estimate of a candidate’s characteristics \(μ=(μ_1, \dots, μ_N)\) based on the measures \(z = (z_1, \dots, z_N)\).

We build a simple model for the hiring problem:

  • Data are generated from

\[ z_i | \mu_i \sim N (\mu_i, 1), \quad i = 1, \dots, N. \]

  • In plain words, the first observation \(z_1\) is drawn from the normal distribution with mean \(μ_1\) and variance 1, the second observation \(z_2\) is drawn from the normal distribution with mean \(μ_2\) and variance 1, and so on.

  • Note that for each candidate (with true characteristics \(\mu\)), we observe only one measurement \(z\), which is composed of \(N\) numbers, \((z_1,\dots, z_N)\); a quick R simulation of this data-generating process is shown right after this list.
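
To make the model concrete, here is a minimal sketch of one simulated draw; the values of N and mu below are made up purely for illustration.

N = 4                        # number of aspects (illustrative choice)
mu = c(0.5, -1.2, 0.3, 2.0)  # hypothetical true characteristics (made up)
z = mu + rnorm(N, 0, 1)      # one noisy measurement: z_i | mu_i ~ N(mu_i, 1)
z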

Question: How do we obtain a good estimate of \(μ = (μ_1, \dots, μ_N)\) based on the single observation \(z = (z_1, \dots, z_N)\)?

  1. The maximum-likelihood estimator (MLE) is \(\hat{\mu}^{(MLE)} = z\). We can verify that this estimator is unbiased:

\[ \mathbb{E} \hat{\mu}^{(MLE)} = \mu. \]

  2. The James-Stein Estimator is

\[ \hat{\mu}^{(JS)} = \left(1 - \frac{N-2}{||z||^2}\right) z. \]

Clearly, the James-Stein Estimator is biased. Indeed, it simply “shrinks” the unbiased MLE by the factor \(1- \frac{N-2}{||z||^2}\). When \(N\ge3\), this factor is strictly less than 1, so every coordinate of the MLE is pulled toward zero.
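
As a small illustration of the shrinkage, here is a minimal R sketch of the estimator; the helper name js_estimate and the example vector are our own choices, not part of the model above.

# James-Stein estimate of mu from a single measurement vector z (N >= 3)
js_estimate = function(z) {
    N = length(z)
    shrink = 1 - (N - 2) / sum(z^2)   # shrinkage factor 1 - (N-2)/||z||^2
    shrink * z
}
js_estimate(c(1.5, -0.7, 2.2))        # each coordinate is pulled toward zero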

James-Stein Theorem (1961)

For \(N \ge 3\), the James-Stein Estimator dominates the MLE in terms of achieving a lower MSE:

\[ \mathbb{E} ||\hat{μ}^{(JS)} - μ||^2 < \mathbb{E} ||\hat{μ}^{(MLE)} - μ||^2, \]

where \(||\hat{μ} - μ||^2 = \sum_{i=1}^N (\hat{μ}_i - μ_i) ^ 2\).

Proving this theorem can be thorny if you have had limited exposure to mathematical statistics. However, we can use R to run simulation experiments, and the data will tell us whether the theorem holds in practice.

3 Experiments

The function run_experiment() performs the following operations:

  1. Take an input N and generate a true vector \(μ\) of length N

  2. Repeat the following steps nrep(=100) times:

    • generate a measurement z based on μ

    • compute the estimates mu_MLE and mu_JSE based on z

    • compute the errors err_MLE and err_JSE from the estimates and μ

  3. Return a data frame containing all the err_MLE and err_JSE

run_experiment = function(N) {
    nrep = 100                      # number of Monte Carlo replications
    err_MLE = rep(0, nrep)
    err_JSE = rep(0, nrep)
    mu = rnorm(N)                   # true characteristics, drawn from the standard normal dist

    for (i in 1:nrep) {
        e = rnorm(N, 0, 1)          # measurement noise
        z = mu + e                  # one measurement of the candidate
        mu_MLE = z                  # maximum-likelihood estimate
        mu_JSE = (1 - (N - 2) / sum(z^2)) * z   # James-Stein estimate
        err_MLE[i] = sum((mu_MLE - mu)^2) / N   # squared error per aspect
        err_JSE[i] = sum((mu_JSE - mu)^2) / N
    }
    err_both = data.frame(err_MLE = err_MLE, err_JSE = err_JSE)
    return(err_both)
}
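
As a quick numerical check (a sketch; the exact numbers depend on the random draws), we can average the per-replication errors to obtain Monte Carlo estimates of each estimator's MSE:

set.seed(1)                  # fix the random seed so the sketch is reproducible
err = run_experiment(N = 10)
colMeans(err)                # Monte Carlo estimates of the per-aspect MSE for MLE and JSE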

We now have the realized errors of the James-Stein estimator and the maximum-likelihood estimator across 100 simulated samples; averaging them estimates each estimator’s MSE. We use box plots to compare them.

get_box = function(N) {
    err = run_experiment(N)
    boxplot(err, ylab = "Error: ||hat(mu) - mu||^2 / N")
    title(paste("mu_i is generated from Normal(0,1), dimension N=", N, sep = ""))
}
get_box(N=3)

get_box(N=5)

get_box(N=2)
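
In addition to the box plots, one can tabulate the average errors for several values of N (a rough sketch that reuses run_experiment; exact numbers will vary across runs):

# Average error of each estimator for several choices of N
for (N in c(2, 3, 5, 10, 20)) {
    err = run_experiment(N)
    cat("N =", N,
        " mean err_MLE =", round(mean(err$err_MLE), 3),
        " mean err_JSE =", round(mean(err$err_JSE), 3), "\n")
}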

The statistical experiments confirm the superiority of the James-Stein estimator when \(N \ge 3\). When \(N = 2\), the shrinkage factor equals 1, so the James-Stein estimator coincides with the MLE and the two box plots are identical.