Model Selection: Bias-Variance Tradeoff

Rasim Muzaffer Musal

Today’s lecture draws from:

Chapter 2.2 of James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning (Vol. 112, p. 18). New York: Springer.
Chapter 2.9 of Hastie, T., Tibshirani, R., & Friedman, J. H. (2009). The elements of statistical learning: data mining, inference, and prediction (Vol. 2, pp. 1-758). New York: Springer.
Chapter 4.7 of Murphy, K. P. (2022). Probabilistic machine learning: an introduction. MIT Press.

Frequentist vs Bayesian Framework

The frequentist framework treats the data as varying (random) and the parameters as fixed but unknown quantities that we try to estimate.
The Bayesian framework treats the data as fixed (since it is observed) and the parameters as varying (random).
In frequentist statistics we estimate the parameters, which gives the fitted model \(p(D|\hat{\theta})\).
However, there is also \(p(D|\theta^{*})\), the true data-generating distribution.

Bias-Variance Tradeoff

An important idea in data science is the bias-variance tradeoff. It can be summarized as the tension between underfitting and overfitting, but we need to dig a little deeper.
We can make a model more complex to reduce the difference between its predictions and the observed values.
This reduces the error on the data we already have, but our goal is never to predict what we have already observed; it is to predict new data.
Measures such as AIC and BIC are helpful, but we need some further discussion to understand the fundamental issue better.
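
As a rough illustration of the point above, the sketch below fits polynomials of increasing degree to simulated data and compares the error on the observed (training) data with the error on new (test) data. The true function `true_f`, the noise level `sigma`, the sample sizes, and the degrees are all assumptions chosen only for this example.

```python
import numpy as np
from numpy.polynomial import polynomial as P

rng = np.random.default_rng(0)

def true_f(x):
    # Hypothetical "true" function, chosen only for illustration.
    return np.sin(2 * np.pi * x)

sigma = 0.3  # assumed noise standard deviation

# Observed (training) data and fresh (test) data from the same process.
x_train = rng.uniform(0, 1, 30)
y_train = true_f(x_train) + rng.normal(0, sigma, 30)
x_test = rng.uniform(0, 1, 1000)
y_test = true_f(x_test) + rng.normal(0, sigma, 1000)

def mse(y, y_hat):
    return float(np.mean((y - y_hat) ** 2))

results = {}  # degree -> (training MSE, test MSE)
for degree in (1, 3, 9):
    coefs = P.polyfit(x_train, y_train, degree)  # least-squares fit
    train_mse = mse(y_train, P.polyval(x_train, coefs))
    test_mse = mse(y_test, P.polyval(x_test, coefs))
    results[degree] = (train_mse, test_mse)
    print(f"degree={degree}: train MSE={train_mse:.3f}, test MSE={test_mse:.3f}")
```

The training error cannot increase as the degree grows, because each polynomial family contains the smaller ones. The test error carries no such guarantee, which is why it is the quantity we care about.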

Terminology: For discussion

\(f\) is the learning function.
\(\theta\) is the parameter and \(\hat{\theta}()\) is the estimator.
\(\hat{\theta}(D)=\hat{\theta}\) is the estimate. \(\theta^{*}\) is the true parameter value (the estimand).
\(p^{*}\) is the true probability distribution.
\(p^{*}(D)\) is the distribution of the data, which come from the unknown distribution \(p^{*}\).
\(p^{*}(\hat{\theta}(D))\) is the distribution of the estimate as the data vary (the sampling distribution; for instance, its standard deviation is the standard error).
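
To make the notation concrete, here is a minimal simulation sketch of the distribution of an estimate over repeated datasets. It assumes \(p^{*}\) is a Gaussian with known mean and standard deviation and uses the sample mean as the estimator; the values of `mu_star`, `sigma`, `N`, and `n_datasets` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

mu_star = 5.0        # assumed true parameter theta*
sigma = 2.0          # assumed true standard deviation of the data
N = 25               # size of each dataset D
n_datasets = 20000   # number of datasets drawn from p*(D)

# Each row is one dataset D drawn from p*.
datasets = rng.normal(mu_star, sigma, size=(n_datasets, N))

# Apply the estimator (here the sample mean) to every dataset.
theta_hat = datasets.mean(axis=1)

# The spread of theta_hat across datasets is the standard error.
empirical_se = float(theta_hat.std(ddof=1))
theoretical_se = sigma / np.sqrt(N)  # known result for the sample mean

print(f"empirical SE   = {empirical_se:.4f}")
print(f"theoretical SE = {theoretical_se:.4f}")
```

In practice we only observe one dataset, so this distribution is never seen directly; the simulation is possible only because the true distribution is assumed known.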

Terminology: Bias

\[\begin{align} bias(\hat{\theta}(.)) \triangleq E_{p(D|\theta^{*})} \bigg[\hat{\theta}(D) \bigg]-\theta^{*} \end{align}\]

Bias measures how far the expected value of our estimate \(\hat{\theta}\), taken over datasets, is from the true value \(\theta^{*}\).

Terminology: Bias Example

The MLE for the Gaussian mean, the sample mean \(\bar{x}\), is unbiased: \(bias(\bar{x}) = 0\).

\[\begin{align} 0 = E[\bar{x}]-\mu = E \bigg[ \frac{1}{N} \sum_{n=1}^{N} x_{n} \bigg] - \mu = \frac{N \mu}{N} - \mu \end{align}\]
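
The unbiasedness of the sample mean can also be checked by simulation. The sketch below does so, and for contrast also estimates the bias of the Gaussian variance MLE (which divides by \(N\)); that second estimator is not discussed in the lecture and is included here only as an assumed example of a biased estimator. All numerical settings are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

mu_star = 3.0        # assumed true mean
sigma = 1.5          # assumed true standard deviation
N = 10               # observations per dataset
n_datasets = 200000  # number of simulated datasets

data = rng.normal(mu_star, sigma, size=(n_datasets, N))

# Estimator 1: sample mean (the MLE for the Gaussian mean).
mean_hat = data.mean(axis=1)
# Estimator 2 (contrast, not from the lecture): variance MLE, dividing by N.
var_mle = data.var(axis=1, ddof=0)

# Bias = average of the estimates over datasets minus the true value.
bias_mean = float(mean_hat.mean() - mu_star)
bias_var_mle = float(var_mle.mean() - sigma**2)

print(f"bias of sample mean  ~ {bias_mean:.4f}")
print(f"bias of variance MLE ~ {bias_var_mle:.4f}")
print(f"theory for variance MLE: {-sigma**2 / N:.4f}")
```

The first number should be close to zero up to simulation noise, while the second should be close to \(-\sigma^{2}/N\).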

Terminology: Variance of X

\[\begin{align} E \bigg[(X - E[X])^{2} \bigg] = E[X^{2}] - [E(X)]^{2} \end{align}\]

The above is the generic form of the variance of X. We will be thinking about the variance of \(\hat{\theta}\), where the expectations are taken with respect to \(p(D|\theta^{*})\).
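
For reference, the identity for the variance of X follows from expanding the square and using linearity of expectation, treating \(E[X]\) as a constant:

\[\begin{align} E \bigg[(X - E[X])^{2} \bigg] & = E \bigg[ X^{2} - 2XE[X] + (E[X])^{2} \bigg] \\ & = E[X^{2}] - 2E[X]E[X] + (E[X])^{2} \\ & = E[X^{2}] - [E(X)]^{2} \end{align}\]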

Terminology: Variance

To quantify how much \(\hat{\theta}\) changes with data we are going to apply the variance calculation in the form \(E(X^{2}) - (E(X))^{2}\).

\[\begin{align} Var \bigg[ \hat{\theta} \bigg] \triangleq E_{p(D|\theta^{*}) } \bigg[ \hat{\theta}^{2} \bigg] - \bigg( E_{p(D|\theta^{*})} \bigg[\hat{\theta} \bigg] \bigg)^{2} \end{align}\]

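An example of a high-variance but unbiased estimator of the mean is the first observation in a set of data: for i.i.d. data with variance \(\sigma^{2}\), its variance is \(\sigma^{2}\) regardless of \(N\). The sample mean is also unbiased but has lower variance, \(\sigma^{2}/N\).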
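
The comparison between these two estimators can be sketched with a short simulation. It assumes Gaussian data with a known true mean and standard deviation; the values of `mu_star`, `sigma`, `N`, and `n_datasets` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

mu_star = 0.0       # assumed true mean
sigma = 1.0         # assumed true standard deviation
N = 50              # observations per dataset
n_datasets = 50000  # number of simulated datasets

data = rng.normal(mu_star, sigma, size=(n_datasets, N))

first_obs = data[:, 0]           # estimator A: use only the first observation
sample_mean = data.mean(axis=1)  # estimator B: use the sample mean

for name, est in (("first observation", first_obs), ("sample mean", sample_mean)):
    bias = est.mean() - mu_star
    var = est.var(ddof=1)
    print(f"{name:>17}: bias ~ {bias:+.4f}, variance ~ {var:.4f}")
```

Both estimated biases should be near zero, while the estimated variances should differ by roughly a factor of `N`.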

PML book section 4.7.6.4

Discusses the maximum a posteriori (MAP) estimator vs the MLE.
In the Gaussian mean example, the MLE is unbiased but has higher variance than the MAP estimator, which is biased.
To understand this discussion you need to know a bit of Bayesian statistics, which goes beyond the scope of the current course.
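
For those curious, the claim can be explored numerically without the full Bayesian machinery. The sketch below assumes Gaussian data with known standard deviation `sigma` and a Gaussian prior on the mean with hypothetical values `prior_mean` and `prior_sd`; under those assumptions the MAP estimate is a weighted average of the sample mean and the prior mean. The parameterization is chosen for this example and may differ from the one used in the book.

```python
import numpy as np

rng = np.random.default_rng(4)

theta_star = 1.0     # assumed true mean
sigma = 1.0          # assumed known standard deviation of the data
N = 5                # observations per dataset
n_datasets = 100000  # number of simulated datasets

prior_mean = 0.0     # hypothetical prior mean
prior_sd = 1.0       # hypothetical prior standard deviation

data = rng.normal(theta_star, sigma, size=(n_datasets, N))

# MLE: the sample mean.
mle = data.mean(axis=1)

# MAP under a Gaussian prior with known sigma: shrink the MLE toward the prior mean.
w = N * prior_sd**2 / (N * prior_sd**2 + sigma**2)
map_est = w * mle + (1 - w) * prior_mean

for name, est in (("MLE", mle), ("MAP", map_est)):
    bias = est.mean() - theta_star
    var = est.var(ddof=1)
    print(f"{name}: bias ~ {bias:+.4f}, variance ~ {var:.4f}")
```

With these settings the MAP estimate is pulled toward `prior_mean`, which introduces bias but shrinks the variance by a factor of `w**2`.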

Bias-Variance Tradeoff 1

\[\begin{align} MSE & = E \bigg[ (\hat{\theta} - \theta^{*} )^{2} \bigg] \\ & = E\bigg[\big[(\hat{\theta}-\bar{\theta}) +(\bar{\theta}-\theta^{*})\big]^{2} \bigg] \end{align}\]

Here \(\bar{\theta}=E(\hat{\theta})\) is the expected value of the estimator \(\hat{\theta}\), which is a random variable because it changes as we vary the data.

\[\begin{align} =E \bigg[ \big(\hat{\theta} - \bar{\theta} \big)^{2} \bigg] + 2(\bar{\theta}-\theta^{*})E\big[ \hat{\theta} - \bar{\theta} \big]+(\bar{\theta}-\theta^{*})^{2} \end{align}\]

Bias-Variance Tradeoff 2

Note that \(2(\bar{\theta}-\theta^{*})E\big[ \hat{\theta} - \bar{\theta} \big]\) is equal to 0, since \(E[\hat{\theta}] = \bar{\theta}\) and therefore \(E\big[ \hat{\theta} - \bar{\theta} \big] = 0\).

\[\begin{align} & = E \bigg[ (\hat{\theta}-\bar{\theta})^{2} \bigg] + (\bar{\theta}-\theta^{*})^{2} \\ & = variance + bias^{2} \end{align}\]
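
The decomposition can be verified numerically. The sketch below uses a deliberately biased estimator, a shrunken sample mean with a hypothetical factor of 0.8, and computes the three quantities from simulated datasets. The true parameter, noise level, and sample sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)

theta_star = 2.0     # assumed true parameter
sigma = 1.0          # assumed standard deviation of the data
N = 10               # observations per dataset
n_datasets = 100000  # number of simulated datasets

data = rng.normal(theta_star, sigma, size=(n_datasets, N))

# Hypothetical biased estimator: shrink the sample mean toward zero.
theta_hat = 0.8 * data.mean(axis=1)

theta_bar = theta_hat.mean()                       # estimate of E[theta_hat]
mse = float(np.mean((theta_hat - theta_star) ** 2))
variance = float(np.mean((theta_hat - theta_bar) ** 2))
bias_sq = float((theta_bar - theta_star) ** 2)

print(f"MSE               = {mse:.4f}")
print(f"variance + bias^2 = {variance + bias_sq:.4f}")
```

Because `theta_bar` is the average of the simulated estimates, the cross term vanishes exactly in the sample as well, so the two printed numbers agree up to floating-point error.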

Expected Prediction Error (EPE) 1

Also called test error or generalization error.
The variable Y is generated from \[\begin{align} Y & =f(X)+\epsilon \\ E(\epsilon) & =0\\ Var(\epsilon) & =\sigma^{2} \end{align}\]
\(\epsilon\) is the irreducible error: it remains even if you know the true function \(f\). \(\hat{f}_{k}(x_{0})\) is the estimated regression fit at the point \(x_{0}\). There is a philosophical discussion to be had here if we have time.
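
A small simulation can illustrate why the noise is called irreducible. The sketch below assumes a hypothetical linear true function `f` and Gaussian noise with standard deviation `sigma`, then measures the squared prediction error of the true function itself and of a slightly shifted predictor.

```python
import numpy as np

rng = np.random.default_rng(6)

def f(x):
    # Hypothetical true function, chosen only for illustration.
    return 1.0 + 2.0 * x

sigma = 0.5  # assumed noise standard deviation
n = 200000   # number of simulated observations

x = rng.uniform(-1, 1, n)
eps = rng.normal(0, sigma, n)  # E(eps) = 0, Var(eps) = sigma^2
y = f(x) + eps

# Predicting with the true f still leaves an error of about sigma^2.
mse_true_f = float(np.mean((y - f(x)) ** 2))

# A predictor that is off by a constant 0.3 does worse by about 0.3^2.
mse_shifted = float(np.mean((y - (f(x) + 0.3)) ** 2))

print(f"MSE using the true f = {mse_true_f:.4f} (sigma^2 = {sigma**2:.4f})")
print(f"MSE using f + 0.3    = {mse_shifted:.4f}")
```

The first error stays near \(\sigma^{2}\) no matter how much data we collect, since it comes from \(\epsilon\) and not from any mistake in the model.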

Expected Prediction Error (EPE) 2

Expected prediction error (EPE) at \(x_{0}\):

\[\begin{align} EPE_{k}(x_{0}) & =E[(Y-\hat{f}_{k}(x_{0}))^{2}|X=x_{0}]\\ & = \sigma^{2}+ \end{align}\]