Loss Functions
- For a decision \(d\in\mathcal{D}\) and parameter value \(\theta\), a loss function defines the penalty incurred by choosing \(d\) when the true value is \(\theta\):
\[ L(\theta,d)\]
- We want the decision that minimises the loss, \(d^*=\arg\min_d L(\theta,d)\); however, \(\theta\) is unknown, with \(\theta\sim\pi(\theta|x)\), so instead we minimise the posterior expected loss:
\[ d^*=\arg\min_d\mathbb{E}_\pi[L(\theta,d)] \]
Common Loss Functions
- For a given loss function, we can calculate the posterior expectation and minimise it over \(d\) (a Monte Carlo sketch of this recipe follows below):
\[ \mathbb{E}[L(\theta,d)] = \int L(\theta,d)\pi(\theta|x)d\theta\]
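- A minimal Monte Carlo sketch of this recipe (the `expected_loss` function, the Gamma ‘posterior’ and the decision grid are illustrative assumptions, not from the notes): approximate the integral by an average over posterior draws, then minimise over a grid of candidate decisions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in posterior draws theta^(i) ~ pi(theta | x); a Gamma(3, 1) chosen purely for illustration
theta = rng.gamma(3.0, 1.0, size=50_000)

def expected_loss(d, theta, loss):
    """Monte Carlo estimate of E_pi[ L(theta, d) ] from posterior draws."""
    return loss(theta, d).mean()

# Example: quadratic loss; minimise the estimated expected loss over a grid of decisions
quadratic = lambda t, d: (t - d) ** 2
grid = np.linspace(theta.min(), theta.max(), 201)
d_star = grid[np.argmin([expected_loss(d, theta, quadratic) for d in grid])]
print(d_star, theta.mean())  # d* should sit close to the posterior mean (cf. Quadratic Loss below)
```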
Quadratic Loss
- Loss function:
\[ L(\theta,d) = (\theta-d)^2\]
- The optimal decision is the posterior mean, \(d^*=\mathbb{E}[\theta|x]\).
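- Derivation sketch: expanding \(\mathbb{E}[(\theta-d)^2|x]=\mathbb{E}[\theta^2|x]-2d\,\mathbb{E}[\theta|x]+d^2\) and differentiating with respect to \(d\):
\[ \frac{\partial}{\partial d}\mathbb{E}[(\theta-d)^2|x] = -2\mathbb{E}[\theta|x]+2d=0\;\Rightarrow\;d^*=\mathbb{E}[\theta|x]\]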
Absolute Error Loss
- Loss function:
\[ L(\theta,d) = |\theta-d|\]
- The optimal decision \(d^*\) is the median of the posterior.
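- Derivation sketch: writing \(\mathbb{E}[|\theta-d|\,|x]=\int_{-\infty}^{d}(d-\theta)\pi(\theta|x)d\theta+\int_{d}^{\infty}(\theta-d)\pi(\theta|x)d\theta\) and differentiating with respect to \(d\):
\[ P(\theta<d|x)-P(\theta>d|x)=0\;\Rightarrow\;P(\theta<d^*|x)=\tfrac{1}{2}\]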
Linear Loss
- Loss function:
\[ L(\theta,d) = \begin{cases}g(d-\theta),\text{ if }d>\theta\\h(\theta-d),\text{ if }d<\theta\end{cases}\]
- The optimal decision \(d^*\) is the \(q=\frac{h}{g+h}\) quantile of the posterior.
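- Derivation sketch: differentiating \(\mathbb{E}[L(\theta,d)|x]=g\int_{-\infty}^{d}(d-\theta)\pi(\theta|x)d\theta+h\int_{d}^{\infty}(\theta-d)\pi(\theta|x)d\theta\) with respect to \(d\):
\[ g\,P(\theta<d|x)-h\,P(\theta>d|x)=0\;\Rightarrow\;P(\theta<d^*|x)=\frac{h}{g+h}\]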
0-1 Loss
- Loss function:
\[ L(\theta,d) = \begin{cases}0,\text{ if }|d-\theta|\leq\epsilon\\1,\text{ if }|d-\theta|>\epsilon\end{cases}\]
- The optimal decision \(d^*\) is the posterior mode: minimising the expected loss maximises \(P(|d-\theta|\leq\epsilon\,|\,x)\), which for small \(\epsilon\) is achieved at the mode.
Predictive Inference
- We ‘average’ over the uncertainty in the parameter to obtain the predictive density of a future observation \(y\):
\[ f(y|x)=\int f(y|\theta)\pi(\theta|x)d\theta\]
- We can calculate this through integration or by Monte Carlo simulation (a sketch follows below)
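- A minimal Monte Carlo sketch (the Beta posterior and Binomial likelihood are illustrative choices of mine, not from the notes): draw \(\theta^{(i)}\sim\pi(\theta|x)\) and average \(f(y|\theta^{(i)})\) over the draws.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Illustrative posterior: theta | x ~ Beta(8, 4) for a Binomial success probability
theta = rng.beta(8, 4, size=50_000)

# Predictive pmf of a future y ~ Binomial(m, theta), averaging f(y | theta) over posterior draws
m = 10
y_vals = np.arange(m + 1)
pred = np.array([stats.binom.pmf(y, m, theta).mean() for y in y_vals])
print(pred)  # f(y | x) for y = 0, ..., 10; sums to 1 up to Monte Carlo error
```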
Posterior Asymptotics
- As \(n\rightarrow \infty\), the posterior distribution has two important properties: consistency and asymptotic normality
- Consistency:
- If the true value is \(\theta =\theta_0\), then as \(n\rightarrow \infty\) the posterior probability that \(\theta=\theta_0\) (or, for continuous \(\theta\), that it lies in any neighbourhood of \(\theta_0\)) approaches 1
- Asymptotic Normality:
- As \(n\rightarrow \infty\), the posterior tends to a normal distribution:
\[ \pi(\theta|x) \rightarrow N(\theta_0,I_n(\theta_0)^{-1})\]
- Here \(I_n\) denotes the Fisher information. We often don’t have the true value \(\theta_0\), so in practice we substitute the MLE \(\hat{\theta}\), giving the approximation \(N(\hat{\theta},I_n(\hat{\theta})^{-1})\) (a numerical check follows below)
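- A quick numerical check of the approximation (the Binomial model, uniform prior and data are my own illustrative assumptions): compare the exact Beta posterior with \(N(\hat{\theta},I_n(\hat{\theta})^{-1})\).

```python
import numpy as np
from scipy import stats

# Illustrative data: x successes out of n Bernoulli trials, with a uniform Beta(1, 1) prior
n, x = 200, 130
theta_hat = x / n                         # MLE
info = n / (theta_hat * (1 - theta_hat))  # Fisher information I_n(theta_hat) for the Binomial model

grid = np.linspace(0.5, 0.8, 7)
exact = stats.beta.pdf(grid, x + 1, n - x + 1)        # exact posterior: Beta(x + 1, n - x + 1)
approx = stats.norm.pdf(grid, theta_hat, info**-0.5)  # asymptotic normal approximation
print(np.round(exact, 2))
print(np.round(approx, 2))  # the two densities should be close for large n
```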
Importance Sampling
- Similar to rejection sampling; however, instead of accepting samples with some probability, we weight each sample
- Algorithm:
- Generate \(x^{(i)}\) from \(g(x)\)
- Give \(x^{(i)}\) weight \(w^{(i)}\propto\frac{f(x^{(i)})}{g(x^{(i)})}\)
- Averages of the weighted samples from \(g(x)\) approximate expectations under \(f(x)\) (see the sketch below)
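- A minimal sketch (the target, proposal and \(h\) are toy choices of mine): estimate \(\mathbb{E}_f[h(x)]\) for \(f=N(0,1)\) using draws from a wider proposal \(g=N(0,2^2)\), weighting each draw by \(f(x^{(i)})/g(x^{(i)})\).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Target f = N(0, 1), proposal g = N(0, 2^2); both densities are fully normalised here
x = rng.normal(0, 2, size=100_000)                      # x^(i) ~ g
w = stats.norm.pdf(x, 0, 1) / stats.norm.pdf(x, 0, 2)   # w^(i) = f(x^(i)) / g(x^(i))

h = x**2                # example: h(x) = x^2, so E_f[h(x)] = 1
print((w * h).mean())   # importance sampling estimate, close to 1
```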
Unnormalised Distribution
- Suppose \(f(x) = \frac{\tilde f(x)}{Z}\), where the normalising constant \(Z=\int \tilde f(x)dx\) is unknown, and define the unnormalised weights \(\tilde w(x^{(i)})=\frac{\tilde f(x^{(i)})}{g(x^{(i)})}\). Then
\[ \mathbb{E}_f[h(x)]\approx\sum^N_{i=1}W(x^{(i)})h(x^{(i)})\]
\[ W(x^{(i)})=\frac{\tilde w(x^{(i)})}{\sum^N_{j=1}\tilde w(x^{(j)})}\]
- Therefore, we normalise the weights before taking the expectation (see the sketch below)
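- Continuing the same toy example, now with the target known only up to a constant, \(\tilde f(x)=e^{-x^2/2}\) (again an illustrative assumption): normalise the weights before averaging.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Unnormalised target f~(x) = exp(-x^2 / 2), i.e. a N(0, 1) missing its normalising constant
f_tilde = lambda x: np.exp(-0.5 * x**2)

x = rng.normal(0, 2, size=100_000)              # x^(i) ~ g = N(0, 2^2)
w_tilde = f_tilde(x) / stats.norm.pdf(x, 0, 2)  # unnormalised weights w~(x^(i))
W = w_tilde / w_tilde.sum()                     # self-normalised weights W(x^(i))
print((W * x**2).sum())                         # estimate of E_f[x^2], close to 1
```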
Variability in Weights
- Variability in weights means some samples contribute more than others
- For efficiency, we would like the weights to have low variability, with all samples contributing as equally as possible: ideally \(W^{(i)}=\frac{1}{n}\) for every sample
- Weight variance is often measured through effective sample size (ESS)
- Consider the normalised weights \(W^{(i)}\)
\[ ESS = \left[\sum^n_{i=1}(W^{(i)})^2\right]^{-1}\]
- Note:
- \(1\leq ESS\leq n\)
- ESS \(=n\) when \(W^{(i)}=\frac{1}{n}\) for all \(i\) (optimal)
- Maximise the ESS by choosing \(g(x)\) to closely match \(f(x)\); see the sketch below
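- A short sketch of this effect (the proposals are toy choices of mine): a proposal that matches the target well gives an ESS near \(n\), while a poorly matched one gives a much smaller ESS.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def ess(w_tilde):
    """Effective sample size computed from (unnormalised) importance weights."""
    W = w_tilde / w_tilde.sum()   # normalised weights W^(i)
    return 1.0 / np.sum(W**2)

n = 10_000
target = stats.norm(0, 1)         # f(x) = N(0, 1)

# Proposal close to the target vs. a badly mismatched (too narrow) proposal
for scale in (1.5, 0.3):
    x = rng.normal(0, scale, size=n)
    w = target.pdf(x) / stats.norm.pdf(x, 0, scale)
    print(scale, ess(w))          # near n when g matches f, much smaller otherwise
```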