Chapter 2.2 of James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning. New York: Springer.
Chapter 2.9 of Hastie, T., Tibshirani, R., & Friedman, J. H. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd ed.). New York: Springer.
Chapter 4.7 of Murphy, K. P. (2022). Probabilistic Machine Learning: An Introduction. MIT Press.
The frequentist framework treats the data as varying (across repeated samples) and the parameters as fixed quantities that we try to estimate.
The Bayesian framework treats the data as fixed (since it has been observed) and the parameters as varying (random).
In frequentist statistics we estimate the parameters, yielding a fitted model \(p(D|\hat{\theta})\).
However, there is also \(p(D|\theta^{*})\), the true data-generating distribution, where \(\theta^{*}\) is the true (unknown) parameter.
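As a minimal sketch of this frequentist picture (the Gaussian setup, sample size, and seed below are assumptions for illustration, not from the text): we fix a true parameter \(\theta^{*}\), draw many datasets from \(p(D|\theta^{*})\), and watch the estimate \(\hat{\theta}(D)\) vary from dataset to dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

theta_star = 2.0          # the true parameter: fixed, unknown in practice
n, n_datasets = 50, 1000  # sample size per dataset, number of simulated datasets

# Draw many datasets D ~ p(D | theta*) and compute theta_hat(D) for each one.
datasets = rng.normal(loc=theta_star, scale=1.0, size=(n_datasets, n))
theta_hat = datasets.mean(axis=1)  # MLE of a Gaussian mean: the sample mean

print("true theta*:               ", theta_star)
print("average of theta_hat(D):   ", theta_hat.mean())
print("spread (std) of theta_hat: ", theta_hat.std())
```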
An important idea in data science is the bias-variance tradeoff. It is often explained simply in terms of under- and overfitting, but we need to dig a little deeper.
We can make a model more complex to reduce the difference between its predictions and the observed values.
This shrinks the error on the data we already have, but our goal is never to predict what we have already observed; we care about new, unseen observations.
Measures such as AIC and BIC are helpful here, but some further discussion will help us understand the fundamental issue better.
The bias of an estimator \(\hat{\theta}(\cdot)\) is defined with the expectation taken over datasets drawn from the true distribution:

\[\begin{align} bias(\hat{\theta}(\cdot)) \triangleq E_{p(D|\theta^{*})} \bigg[\hat{\theta}(D) \bigg]-\theta^{*} \end{align}\]
For example, the sample mean is an unbiased estimator of the true mean \(\mu\):

\[\begin{align} E[\bar{x}]-\mu = E \bigg[ \frac{1}{N} \sum_{n=1}^{N} x_{n} \bigg] - \mu = \frac{N \mu}{N} - \mu = 0 \end{align}\]
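A quick Monte Carlo check of this claim (a sketch; the values of \(\mu\), \(N\), and the noise scale are assumptions): averaging \(\bar{x}\) over many simulated datasets should give a bias near zero.

```python
import numpy as np

rng = np.random.default_rng(1)

mu, N, reps = 5.0, 30, 100_000  # assumed true mean, sample size, replications
x_bar = rng.normal(loc=mu, scale=2.0, size=(reps, N)).mean(axis=1)

print("estimated bias of the sample mean:", x_bar.mean() - mu)  # ~0 up to Monte Carlo error
```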
Recall the shortcut formula for the variance of a random variable \(X\):

\[\begin{align} E \bigg[(X - E[X])^{2} \bigg] = E[X^{2}] - [E(X)]^{2} \end{align}\]

The variance of an estimator, again with the expectation taken over \(p(D|\theta^{*})\), is then

\[\begin{align} Var \bigg[ \hat{\theta} \bigg] \triangleq E_{p(D|\theta^{*}) } \bigg[ \hat{\theta}^{2} \bigg] - \bigg( E_{p(D|\theta^{*})} \bigg[\hat{\theta} \bigg] \bigg)^{2} \end{align}\]
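A small numerical check of this identity, using the sample mean as \(\hat{\theta}\) (the setup mirrors the sketch above and is again an assumption):

```python
import numpy as np

rng = np.random.default_rng(2)

mu, N, reps = 5.0, 30, 100_000
theta_hat = rng.normal(loc=mu, scale=2.0, size=(reps, N)).mean(axis=1)  # sample mean per dataset

lhs = theta_hat.var()                                   # E[(theta_hat - E[theta_hat])^2]
rhs = (theta_hat ** 2).mean() - theta_hat.mean() ** 2   # E[theta_hat^2] - (E[theta_hat])^2
print(lhs, rhs)  # both close to sigma^2 / N = 4 / 30
```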
Murphy (2022, Chapter 4.7) also discusses the maximum a posteriori (MAP) estimator versus the MLE.
The MLE is unbiased but has higher variance compared to MAP estimators, which trade a small amount of bias for a reduction in variance.
To follow this discussion fully, you need some Bayesian statistics, which goes beyond the scope of the current course.
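Still, a simulation can show the tradeoff without the Bayesian machinery. The sketch below compares the MLE of a Gaussian mean with a MAP-style shrinkage estimator under an assumed \(N(0, \tau^{2})\) prior; all numerical values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

theta_star, sigma, tau = 1.0, 2.0, 1.0   # assumed true mean, noise sd, prior sd
N, reps = 10, 100_000

x = rng.normal(loc=theta_star, scale=sigma, size=(reps, N))
mle = x.mean(axis=1)                                   # MLE: the sample mean
shrink = (N / sigma**2) / (N / sigma**2 + 1 / tau**2)  # posterior-mean shrinkage factor
map_est = shrink * mle                                 # MAP estimate under a N(0, tau^2) prior

for name, est in [("MLE", mle), ("MAP", map_est)]:
    bias, var = est.mean() - theta_star, est.var()
    print(f"{name}: bias={bias:.3f}, variance={var:.3f}, bias^2+variance={bias**2 + var:.3f}")
```

With these assumed values, the MAP estimate shows a nonzero bias but a smaller variance (and smaller overall error) than the MLE, which is the tradeoff described above.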
Writing \(\bar{\theta} \triangleq E[\hat{\theta}]\) for the expected value of the estimator (all expectations taken over \(p(D|\theta^{*})\)), the mean squared error decomposes as

\[\begin{align} MSE = E \bigg[ (\hat{\theta} - \theta^{*} )^{2} \bigg] & = E\bigg[\big[(\hat{\theta}-\bar{\theta}) +(\bar{\theta}-\theta^{*})\big]^{2} \bigg] \\ & = E \bigg[ \big(\hat{\theta} - \bar{\theta} \big)^{2} \bigg] + 2(\bar{\theta}-\theta^{*})E\big[ \hat{\theta} - \bar{\theta} \big]+(\bar{\theta}-\theta^{*})^{2} \\ & = E \bigg[ (\hat{\theta}-\bar{\theta})^{2} \bigg] + (\bar{\theta}-\theta^{*})^{2} \\ & = variance + bias^{2} \end{align}\]

where the cross term vanishes because \(E\big[\hat{\theta} - \bar{\theta}\big] = 0\).
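A direct numerical check that the two sides of this decomposition agree, using a deliberately biased (shrunken) estimator as \(\hat{\theta}\) (an assumed example, not from the text):

```python
import numpy as np

rng = np.random.default_rng(4)

theta_star, N, reps = 1.0, 10, 200_000
x = rng.normal(loc=theta_star, scale=2.0, size=(reps, N))
theta_hat = 0.7 * x.mean(axis=1)  # a deliberately biased (shrunken) estimator

mse_direct = ((theta_hat - theta_star) ** 2).mean()                      # E[(theta_hat - theta*)^2]
mse_decomposed = theta_hat.var() + (theta_hat.mean() - theta_star) ** 2  # variance + bias^2
print(mse_direct, mse_decomposed)  # the two agree up to Monte Carlo error
```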
The same decomposition appears in the expected prediction error of a k-nearest-neighbour regression fit at a test point \(x_{0}\) (Hastie et al., 2009, Chapter 2.9):

\[\begin{align} EPE_{k}(x_{0}) & = E\big[(Y-\hat{f}_{k}(x_{0}))^{2}|X=x_{0}\big]\\ & = \sigma^{2} + \Big[Bias^{2}\big(\hat{f}_{k}(x_{0})\big) + Var_{\mathcal{T}}\big(\hat{f}_{k}(x_{0})\big)\Big]\\ & = \sigma^{2} + \bigg[f(x_{0}) - \frac{1}{k}\sum_{\ell=1}^{k} f\big(x_{(\ell)}\big)\bigg]^{2} + \frac{\sigma^{2}}{k} \end{align}\]
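The sketch below estimates \(EPE_{k}(x_{0})\) by simulation for a simple k-nearest-neighbour regression and compares it with \(\sigma^{2} + bias^{2} + variance\); the true function \(f\), noise level, and design are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)

def f(x):
    return np.sin(2 * x)  # assumed true regression function

sigma, n, k, x0, reps = 0.5, 100, 5, 0.3, 20_000

preds = np.empty(reps)
for r in range(reps):
    X = rng.uniform(-1, 1, size=n)            # a fresh training set each replication
    y = f(X) + rng.normal(scale=sigma, size=n)
    nearest = np.argsort(np.abs(X - x0))[:k]  # indices of the k nearest neighbours of x0
    preds[r] = y[nearest].mean()              # the kNN regression fit at x0

y_new = f(x0) + rng.normal(scale=sigma, size=reps)  # fresh responses Y at X = x0
epe = ((y_new - preds) ** 2).mean()                 # Monte Carlo estimate of EPE_k(x0)

bias2 = (preds.mean() - f(x0)) ** 2
var = preds.var()
print("EPE estimate:               ", epe)
print("sigma^2 + bias^2 + variance:", sigma**2 + bias2 + var)
```

Increasing k in this sketch lowers the variance term but raises the squared bias, which is the bias-variance tradeoff this section is about.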