run_experiment = function(N) {
  nrep = 100                                # number of simulation replications
  err_MLE = rep(0, nrep)                    # per-coordinate MSE of the MLE, one entry per replication
  err_JSE = rep(0, nrep)                    # per-coordinate MSE of the James-Stein estimator
  μ = rnorm(N)                              # true means, generated from the standard normal distribution
  for (i in 1:nrep) {
    e = rnorm(N, 0, 1)                      # measurement noise
    z = μ + e                               # one noisy measurement, z_i | μ_i ~ N(μ_i, 1)
    μ_MLE = z                               # maximum-likelihood estimate
    μ_JSE = (1 - (N - 2) / sum(z^2)) * z    # James-Stein estimate
    err_MLE[i] = sum((μ_MLE - μ)^2) / N     # squared error averaged over the N coordinates
    err_JSE[i] = sum((μ_JSE - μ)^2) / N
  }
  err_both = as.data.frame(cbind(err_MLE, err_JSE))
  names(err_both) = c("err_MLE", "err_JSE")
  return(err_both)
}

Bias-variance tradeoff
1 Introduction
In Class Module 1, we talked about the bias-variance tradeoff and briefly mentioned the James-Stein estimator. The main takeaway is that a biased estimator can outperform an unbiased estimator by achieving a lower mean squared error (MSE). In statistics, the most notable example is the James-Stein estimator.
We use R to demonstrate the superiority of the James-Stein estimator for a particular model, which is also a good exercise in R programming and statistical computing.
2 Set-up
The hiring problem. Alice is a hiring manager. She looks at \(N\) different aspects of each candidate’s characteristics (say, communication skills, analytical skills, etc.). A candidate’s true characteristic in aspect \(i\) is \(μ_i\). Alice cannot observe \(μ_i\), but she can obtain an imperfect measure \(z_i\) of each \(μ_i\). Alice wishes to have a good estimate of a candidate’s characteristics \(μ = (μ_1, \dots, μ_N)\) based on the measures \(z = (z_1, \dots, z_N)\).
We build a simple model for the hiring problem:
- Data are generated from
\[ z_i | \mu_i \sim N (\mu_i, 1), \quad i = 1, \dots, N. \]
In plain words, the first observation \(z_1\) is drawn from the normal distribution with mean \(μ_1\) and variance 1, the second observation \(z_2\) is drawn from the normal distribution with mean \(μ_2\) and variance 1, and so on.
Note that for each candidate (with true characteristics \(\mu\)), we have only one measurement \(z\), which is composed of the \(N\) numbers \((z_1, \dots, z_N)\).
Question: How can we obtain a good estimate of \(μ = (μ_1, \dots, μ_N)\) based on the single observation \(z = (z_1, \dots, z_N)\)?
- The maximum-likelihood estimator (MLE) is \(\hat{\mu}^{(MLE)} = z\). We can verify that this estimator is unbiased:
\[ \mathbb{E} \hat{\mu}^{(MLE)} = \mu. \]
- The James-Stein Estimator is
\[ \hat{\mu}^{(JS)} = (1- \frac{N-2}{||z||^2} ) z. \]
Clearly, the James-Stein estimator is biased. Indeed, it simply “shrinks” the unbiased MLE by the factor \(1- \frac{N-2}{||z||^2}\). When \(N \ge 3\), we have \(1- \frac{N-2}{||z||^2} < 1\).
For \(N \ge 3\), the James-Stein Estimator dominates the MLE in terms of achieving a lower MSE:
\[ \mathbb{E} ||\hat{μ}^{(JS)} - μ||^2 < \mathbb{E} ||\hat{μ}^{(MLE)} - μ||^2, \]
where \(||\hat{μ} - μ||^2 = \sum_{i=1}^N (\hat{μ}_i - μ_i) ^ 2\).
Proving this theorem can be thorny if you have had limited exposure to mathematical statistics. However, we can use R to run some experiments/simulations, and the data will tell us whether the theorem holds in practice.
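Before running the full simulation, here is a minimal sketch that computes the two estimators for a single simulated measurement; the dimension N = 5, the random seed, and the simulated μ below are illustrative assumptions, not part of the original model.

# Minimal sketch (illustrative values): one candidate, one noisy measurement,
# and the two estimates of μ.
set.seed(1)                              # arbitrary seed, for reproducibility
N = 5                                    # illustrative dimension, N >= 3
μ = rnorm(N)                             # true (unobserved) characteristics
z = μ + rnorm(N)                         # one measurement, z_i | μ_i ~ N(μ_i, 1)
μ_MLE = z                                # maximum-likelihood estimate
μ_JSE = (1 - (N - 2) / sum(z^2)) * z     # James-Stein estimate
c(MLE = sum((μ_MLE - μ)^2), JSE = sum((μ_JSE - μ)^2))  # squared errors for this single draw

A single draw is noisy, of course; the simulation below averages the errors over many replications to compare the two estimators' MSEs.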
3 Experiments
The function run_experiment() performs the following operations:
- Take an input N and generate a vector \(μ\) of length N.
- Repeat the following steps nrep (= 100) times:
  - generate a measurement z based on μ;
  - compute μ_MLE and μ_JSE based on z;
  - compute err_MLE and err_JSE based on z and μ.
- Return a data frame containing all the err_MLE and err_JSE values.
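A quick usage sketch (the choice N = 10 below is an arbitrary illustration, not from the original write-up): call run_experiment() and look at the average error of each estimator.

# Usage sketch: run the experiment once and compare average per-coordinate errors.
err = run_experiment(N = 10)   # N = 10 is an arbitrary illustrative dimension
colMeans(err)                  # mean err_MLE vs. mean err_JSE over the 100 replications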
Each call to run_experiment() gives the MSEs of the James-Stein estimator and the maximum-likelihood estimator over 100 simulated samples. We use box plots to compare them.
get_box = function(N) {
  err = run_experiment(N)
  boxplot(err, ylab = "Error: ||hat{μ}-μ||^2/N")
  title(paste("μ_i is generated from Normal(0,1), dimension N=", N, sep = ""))
}

get_box(N=3)
get_box(N=5)
get_box(N=2)

The statistical experiments confirm the superiority of the James-Stein estimator when \(N \ge 3\). (For \(N = 2\), the shrinkage factor equals one, so the James-Stein estimator coincides with the MLE and the two box plots are identical.)
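Beyond the box plots, the same comparison can be made numerically. A minimal sketch (the dimensions 2, 3, 5, and 20 are illustrative choices) that tabulates the average error of each estimator:

# Sketch: average per-coordinate error of each estimator for a few illustrative dimensions.
Ns = c(2, 3, 5, 20)
summary_table = sapply(Ns, function(N) colMeans(run_experiment(N)))
colnames(summary_table) = paste("N =", Ns)
round(summary_table, 3)   # rows: err_MLE, err_JSE; columns: dimension N

If the theorem holds, the err_JSE row should be smaller on average for N ≥ 3, while for N = 2 the two rows coincide exactly.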