Model Selection: Bias-Variance Tradeoff

Rasim Muzaffer Musal

Today’s lecture draws from:

  • Chapter 2.2 of James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning (Vol. 112, p. 18). New York: Springer.

  • Chapter 2.9 of Hastie, T., Tibshirani, R., & Friedman, J. H. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction (Vol. 2, pp. 1-758). New York: Springer.

  • Chapter 4.7 of Murphy, K. P. (2022). Probabilistic Machine Learning: An Introduction. MIT Press.

Frequentist vs Bayesian Framework

  • Frequentist framework has varying data and fixed parameters that we try to estimate.

  • Bayesian framework has data that is fixed (since it is observed) and parameters that vary (they are treated as random variables).

  • In frequentist statistics we estimate the parameters from data, which gives the fitted model \(p(D|\hat{\theta})\).

  • However, the data actually come from \(p(D|\theta^{*})\), the true data-generating distribution.

Bias-Variance Trade off

  • An important idea in data science is the bias-variance tradeoff. It can be explained simply in terms of underfitting and overfitting, but we need to dig a little deeper.

  • We can complicate a model to reduce the difference between predictions and the actual values.

  • This lowers the error on the data we used for fitting, but our goal is to predict new data, not what we have already observed.

  • Measures such as AIC and BIC are helpful, but some further discussion will help us understand the fundamental issue better.
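The point about fitting observed data versus predicting new data can be illustrated with a small simulation. The sketch below is an illustration only, not part of the lecture: it assumes a hypothetical true function sin(x), Gaussian noise, and k-nearest-neighbour regression as a stand-in for "a model whose complexity we can turn up" (smaller k means a more flexible fit). All names and parameter values are made up for the example.

```python
import math
import random


def knn_predict(x0, xs, ys, k):
    """Average the responses of the k training points closest to x0."""
    nearest = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x0))[:k]
    return sum(ys[i] for i in nearest) / k


def mse(xs_eval, ys_eval, xs_train, ys_train, k):
    """Mean squared error of the k-NN fit on the evaluation points."""
    errors = [(y - knn_predict(x, xs_train, ys_train, k)) ** 2
              for x, y in zip(xs_eval, ys_eval)]
    return sum(errors) / len(errors)


def simulate(n, rng, sigma=0.5):
    """Draw n points from Y = sin(X) + noise (a hypothetical true model)."""
    xs = [rng.uniform(0.0, 6.0) for _ in range(n)]
    ys = [math.sin(x) + rng.gauss(0.0, sigma) for x in xs]
    return xs, ys


rng = random.Random(0)
x_train, y_train = simulate(200, rng)
x_test, y_test = simulate(500, rng)

results = {}
for k in (1, 10, 100):  # k = 1 is the most flexible model
    train_mse = mse(x_train, y_train, x_train, y_train, k)
    test_mse = mse(x_test, y_test, x_train, y_train, k)
    results[k] = (train_mse, test_mse)
    print(f"k={k:3d}  train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")
```

With k = 1 every training point is its own nearest neighbour, so the training error is zero, yet the error on fresh data should be noticeably worse than for a moderate k. That gap is the reason training error alone is not a good guide for model selection.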

Terminology: For discussion

  • f is the learning function
  • \(\theta\) is the parameter, \(\hat{\theta}(\cdot)\) is the estimator (a function of the data),
  • \(\hat{\theta}(D)=\hat{\theta}\) is the estimate. \(\theta^{*}\) is the true parameter value (the estimand).
  • \(p^{*}\) is the true probability distribution.
  • \(p^{*}(D)\) is the distribution of the data \(D\) under the unknown true distribution \(p^{*}\).
  • \(p^{*}(\hat{\theta}(D))\) is the sampling distribution of the estimate, induced by drawing \(D\) from \(p^{*}\) (for instance, its standard deviation is the standard error).

Terminology: Bias

\[\begin{align} bias(\hat{\theta}(.)) \triangleq E_{p(D|\theta^{*})} \bigg[\hat{\theta}(D) \bigg]-\theta^{*} \end{align}\]

  • How far is our estimate \(\hat{\theta}\), averaged over datasets drawn from \(p(D|\theta^{*})\), from the true value \(\theta^{*}\)?

Terminology: Bias Example

  • The MLE for the Gaussian mean, the sample mean \(\bar{x}\), is unbiased:

\[\begin{align} bias(\bar{x}) = E[\bar{x}]-\mu = E \bigg[ \frac{1}{N} \sum_{n=1}^{N} x_{n} \bigg] - \mu = \frac{N \mu}{N} - \mu = 0 \end{align}\]

Terminology: Variance of X

\[\begin{align} E \bigg[(X - E[X])^{2} \bigg] = E[X^{2}] - [E(X)]^{2} \end{align}\]

  • The above is the generic form of the variance of a random variable X. We will be thinking about the variance of the estimator \(\hat{\theta}\), where the expectations are taken with respect to \(p(D|\theta^{*})\).

Terminology: Variance

  • To quantify how much \(\hat{\theta}\) changes with data we are going to apply the variance calculation in the form \(E(X^{2}) - (E(X))^{2}\).

\[\begin{align} Var \bigg[ \hat{\theta} \bigg] \triangleq E_{p(D|\theta^{*}) } \bigg[ \hat{\theta}^{2} \bigg] - \bigg( E_{p(D|\theta^{*})} \bigg[\hat{\theta} \bigg] \bigg)^{2} \end{align}\]

  • An example of an unbiased but high-variance estimator is the first observation in a dataset. The sample mean is also unbiased but has lower variance (\(\sigma^{2}/N\) rather than \(\sigma^{2}\) for i.i.d. data).
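The bias and variance definitions above can be checked by simulation: draw many datasets from a known distribution, apply each estimator to every dataset, and summarize the resulting estimates. The following is a minimal sketch under assumed values (a Gaussian with hypothetical mean 5 and standard deviation 2, 25 observations per dataset). It compares the first observation with the sample mean.

```python
import random
import statistics

rng = random.Random(1)
mu_true, sigma = 5.0, 2.0   # hypothetical true mean and noise level
n, n_datasets = 25, 20000   # observations per dataset, number of datasets

first_obs, sample_means = [], []
for _ in range(n_datasets):
    data = [rng.gauss(mu_true, sigma) for _ in range(n)]
    first_obs.append(data[0])                    # estimator 1: first observation
    sample_means.append(statistics.fmean(data))  # estimator 2: sample mean

summary = {}
for name, estimates in (("first observation", first_obs),
                        ("sample mean", sample_means)):
    # Empirical bias: average estimate minus the true parameter.
    bias = statistics.fmean(estimates) - mu_true
    # Empirical variance of the estimator across datasets.
    variance = statistics.pvariance(estimates)
    summary[name] = (bias, variance)
    print(f"{name:18s} bias={bias:+.3f}  variance={variance:.3f}")
```

Both empirical biases should be close to zero, while the variance of the first observation should be near \(\sigma^{2}=4\) and that of the sample mean near \(\sigma^{2}/N=0.16\).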

PML book section 4.7.6.4

  • Discusses the maximum a posteriori (MAP) estimator vs. the MLE.

  • For a Gaussian mean, the MLE is unbiased but has higher variance than the MAP estimator, which is biased toward the prior.

  • To understand this discussion you need to know a bit of Bayesian statistics which goes beyond the scope of the current course.
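Without going into the Bayesian details, the MLE-versus-MAP comparison can be explored numerically for a Gaussian mean. The sketch below assumes a known noise variance and a Gaussian prior, in which case the MAP estimate is a weighted average of the sample mean and the prior mean. The prior mean, the prior strength `kappa0`, and all other values are hypothetical choices for illustration.

```python
import random
import statistics

rng = random.Random(2)
theta_true, sigma = 1.0, 1.0   # hypothetical true mean and known noise level
n, n_datasets = 5, 20000       # small datasets make the prior matter
prior_mean, kappa0 = 0.0, 1.0  # hypothetical prior; kappa0 = sigma^2 / prior variance

# Weight on the data in the MAP estimate of a Gaussian mean.
w = n / (n + kappa0)

mle_estimates, map_estimates = [], []
for _ in range(n_datasets):
    data = [rng.gauss(theta_true, sigma) for _ in range(n)]
    xbar = statistics.fmean(data)
    mle_estimates.append(xbar)                             # MLE: sample mean
    map_estimates.append(w * xbar + (1 - w) * prior_mean)  # MAP: shrunk toward prior

bias_mle = statistics.fmean(mle_estimates) - theta_true
bias_map = statistics.fmean(map_estimates) - theta_true
var_mle = statistics.pvariance(mle_estimates)
var_map = statistics.pvariance(map_estimates)

print(f"MLE: bias={bias_mle:+.3f}  variance={var_mle:.3f}")
print(f"MAP: bias={bias_map:+.3f}  variance={var_map:.3f}")
```

Under these assumptions the MAP estimator trades a nonzero bias (it is pulled toward the prior mean) for a smaller variance, since its variance is \(w^{2}\) times that of the MLE.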

Bias-Variance Trade off 1

\[\begin{align} MSE & = E \bigg[ (\hat{\theta} - \theta^{*} )^{2} \bigg] \\ & = E\bigg[\big[(\hat{\theta}-\bar{\theta}) +(\bar{\theta}-\theta^{*})\big]^{2} \bigg] \end{align}\]

  • \(\bar{\theta}=E(\hat{\theta})\) is the expected value of the estimator \(\hat{\theta}\) as we vary the data. It is a fixed number, not a random variable.

\[\begin{align} =E \bigg[ \big(\hat{\theta} - \bar{\theta} \big)^{2} \bigg] + 2(\bar{\theta}-\theta^{*})E\big[ \hat{\theta} - \bar{\theta} \big]+(\bar{\theta}-\theta^{*})^{2} \end{align}\]

Bias-Variance Trade off 2

  • Note that \(2(\bar{\theta}-\theta^{*})E\big[ \hat{\theta} - \bar{\theta} \big]\) is equal to 0, since the expected value of \(\hat{\theta}\) is \(E[\hat{\theta}] = \bar{\theta}\) and therefore \(E[\hat{\theta} - \bar{\theta}] = 0\).

\[\begin{align} & = E \bigg[ (\hat{\theta}-\bar{\theta})^{2} \bigg] + (\bar{\theta}-\theta^{*})^{2} \\ & = variance + bias^{2} \end{align}\]
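The decomposition MSE = variance + bias² can be verified numerically. The sketch below is illustrative only: it assumes a hypothetical family of shrinkage estimators \(c\,\bar{x}\) for a Gaussian mean, with made-up values for the true mean, noise level, and sample size. For each shrinkage factor `c` it computes the empirical MSE, variance, and squared bias over many simulated datasets.

```python
import random
import statistics

rng = random.Random(3)
theta_true, sigma = 1.0, 1.0   # hypothetical true mean and noise level
n, n_datasets = 5, 20000

# Sample mean of each simulated dataset.
xbars = [statistics.fmean([rng.gauss(theta_true, sigma) for _ in range(n)])
         for _ in range(n_datasets)]

decomposition = {}
for c in (0.0, 0.25, 0.5, 0.75, 1.0):
    estimates = [c * xb for xb in xbars]   # shrinkage estimator c * xbar
    theta_bar = statistics.fmean(estimates)
    variance = statistics.pvariance(estimates, mu=theta_bar)
    bias = theta_bar - theta_true
    mse = statistics.fmean([(e - theta_true) ** 2 for e in estimates])
    decomposition[c] = (mse, variance, bias ** 2)
    print(f"c={c:.2f}  MSE={mse:.4f}  variance={variance:.4f}  "
          f"bias^2={bias ** 2:.4f}  variance+bias^2={variance + bias ** 2:.4f}")
```

The last two columns should agree up to floating-point error for every `c`. As `c` grows from 0 to 1 the squared bias falls while the variance rises, and under these assumptions the smallest MSE is expected at an intermediate value rather than at the unbiased choice `c = 1`.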

Expected Prediction Error (EPE) 1

  • Also called test error or generalization error.
  • The variable Y is generated from \[\begin{align} Y=f(X)+\epsilon \\ E(\epsilon)=0\\ Var(\epsilon)=\sigma^{2} \end{align}\]
  • The noise \(\epsilon\) is the irreducible error: it remains even if you know the true function \(f\). \(\hat{f}_{k}(x_{0})\) is the estimated regression fit at the point \(x_{0}\). There is a philosophical discussion to be had here if we have time.
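A quick simulation shows what "irreducible" means in practice. The sketch below assumes a hypothetical true function sin(x) and Gaussian noise with \(\sigma = 0.5\). It predicts \(Y\) using the true \(f\) itself, so the only remaining error comes from \(\epsilon\).

```python
import math
import random

rng = random.Random(4)
sigma = 0.5    # hypothetical noise standard deviation
n = 50000      # number of simulated (x, y) pairs


def f(x):
    """Hypothetical true regression function."""
    return math.sin(x)


xs = [rng.uniform(0.0, 6.0) for _ in range(n)]
ys = [f(x) + rng.gauss(0.0, sigma) for x in xs]

# Prediction error when we predict with the true function f.
mse_true_f = sum((y - f(x)) ** 2 for x, y in zip(xs, ys)) / n
# Prediction error of a deliberately poor predictor that always returns 0.
mse_zero = sum(y ** 2 for y in ys) / n

print(f"sigma^2            = {sigma ** 2:.4f}")
print(f"MSE using true f   = {mse_true_f:.4f}")
print(f"MSE predicting 0   = {mse_zero:.4f}")
```

Even the true function cannot push the prediction error below roughly \(\sigma^{2}\). Any estimated \(\hat{f}\) can only add to that floor.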

Expected Prediction Error (EPE) 2

  • Expected prediction error (EPE) at \(x_{0}\):

\[\begin{align} EPE_{k}(x_{0}) & =E[(Y-\hat{f}_{k}(x_{0}))^{2}|X=x_{0}]\\ & = \sigma^{2}+ \dots \end{align}\]