Loss Functions and Asymptotics

Jake

13/10/2022

Loss Functions

  • For a decision \(d\in\mathcal{D}\), a loss function defines the penalty incurred by choosing \(d\) when the true parameter value is \(\theta\):

\[ L(\theta,d)\]

  • We would like \(d^*=\arg\min_d L(\theta,d)\); however, \(\theta\) is unknown with \(\theta\sim\pi(\theta|x)\), so instead we minimise the expected posterior loss:

\[ d^*=\arg\min_d\mathbb{E}_\pi[L(\theta,d)] \]

Common Loss Functions

  • For a given loss function, we can calculate the posterior expectation and minimise it over \(d\):

\[ \mathbb{E}[L(\theta,d)] = \int L(\theta,d)\pi(\theta|x)d\theta\]
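
  • A minimal sketch of this recipe (my own example, not from the notes): given draws from the posterior, approximate \(\mathbb{E}[L(\theta,d)]\) by Monte Carlo and minimise over a grid of candidate decisions. The Gamma posterior and quadratic loss below are illustrative assumptions.

```python
# Sketch: minimise the expected posterior loss by Monte Carlo + grid search.
# Assumptions (not from the notes): posterior draws come from a Gamma(3, rate=2)
# and the loss is quadratic, so d* should land near the posterior mean (1.5).
import numpy as np

rng = np.random.default_rng(1)
theta = rng.gamma(shape=3.0, scale=0.5, size=100_000)   # draws from pi(theta | x)

def expected_loss(d, theta, loss=lambda t, d: (t - d) ** 2):
    """Monte Carlo estimate of E[L(theta, d)] over the posterior draws."""
    return np.mean(loss(theta, d))

grid = np.linspace(theta.min(), theta.max(), 500)        # candidate decisions d
d_star = grid[np.argmin([expected_loss(d, theta) for d in grid])]
print(d_star, theta.mean())                              # both approximately 1.5
```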

Quadratic Loss

  • Loss function:

\[ L(\theta,d) = (\theta-d)^2\]

  • The optimal decision is \(d^*=\mathbb{E}[\theta|x]\), the posterior mean.
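  • A quick check (standard derivation, not spelled out in the notes): differentiating the expected loss,

\[ \frac{\partial}{\partial d}\mathbb{E}[(\theta-d)^2] = -2\left(\mathbb{E}[\theta|x]-d\right) = 0 \quad\Rightarrow\quad d^*=\mathbb{E}[\theta|x]\]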

Absolute Error Loss

  • Loss function:

\[ L(\theta,d) = |\theta-d|\]

  • The optimal decision \(d^*\) is the median of the posterior.
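  • Sketch of why (standard argument, not spelled out in the notes): differentiating \(\mathbb{E}[|\theta-d|]\) with respect to \(d\) gives \(P(\theta<d|x)-P(\theta>d|x)=0\), so \(d^*\) satisfies \(P(\theta\leq d^*|x)=\frac{1}{2}\), i.e. the posterior median.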

Linear Loss

  • Loss function:

\[ L(\theta,d) = \begin{cases}g(d-\theta),\text{ if }d>\theta\\h(\theta-d),\text{ if }d<\theta\end{cases}\]

  • The optimal decision \(d^*\) is the \(q=\frac{h}{g+h}\) posterior quantile, i.e. \(P(\theta\leq d^*|x)=\frac{h}{g+h}\).
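  • Sketch of why (standard argument, not spelled out in the notes): the expected loss is \(g\int_{-\infty}^{d}(d-\theta)\pi(\theta|x)d\theta + h\int_{d}^{\infty}(\theta-d)\pi(\theta|x)d\theta\), with derivative \(gP(\theta<d|x)-hP(\theta>d|x)\); setting this to zero gives \(P(\theta\leq d^*|x)=\frac{h}{g+h}\).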

0-1 Loss

  • Loss function:

\[ L(\theta,d) = \begin{cases}0,\text{ if }|d-\theta|\leq\epsilon\\1,\text{ if }|d-\theta|>\epsilon\end{cases}\]

  • The optimal decision \(d^*\) is the posterior mode.
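  • Sketch of why (standard argument, not spelled out in the notes): \(\mathbb{E}[L(\theta,d)] = 1-P(|d-\theta|\leq\epsilon\,|\,x)\), so we choose the interval \([d-\epsilon,d+\epsilon]\) with the highest posterior probability; as \(\epsilon\rightarrow 0\) this centres on the posterior mode.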

Predictive Inference

  • We ‘average’ over the posterior uncertainty in the parameters to obtain the predictive density of a future observation \(y\):

\[ f(y|x)=\int f(y|\theta)\pi(\theta|x)d\theta\]

  • We can calculate this through integration or Monte Carlo simulation
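
  • A minimal Monte Carlo sketch (the Bernoulli model and Beta(2, 2) posterior are my own illustrative assumptions, not from the notes):

```python
# Sketch: posterior predictive density by Monte Carlo,
#   f(y | x) ~= (1/N) * sum_i f(y | theta_i) with theta_i ~ pi(theta | x).
# Assumed model (not from the notes): y | theta ~ Bernoulli(theta),
# posterior theta | x ~ Beta(2, 2).
import numpy as np

rng = np.random.default_rng(0)
theta = rng.beta(2.0, 2.0, size=100_000)   # draws from pi(theta | x)

p_y1 = np.mean(theta)          # predictive P(y = 1 | x)
p_y0 = np.mean(1.0 - theta)    # predictive P(y = 0 | x)
print(p_y1, p_y0)              # both close to 0.5 for this posterior
```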

Posterior Asymptotics

  • The posterior distribution has two key limiting properties as \(n\rightarrow \infty\)
  • Consistency:
    • If the true value is \(\theta =\theta_0\), then as \(n\rightarrow \infty\) the posterior probability that \(\theta\) lies in any neighbourhood of \(\theta_0\) approaches 1.
  • Asymptotic Normality:
    • As \(n\rightarrow \infty\):

\[ \pi(\theta|x) \rightarrow N(\theta_0,I_n(\theta_0)^{-1})\]

  • We usually do not know the true value \(\theta_0\), so in practice we substitute the MLE \(\hat{\theta}\), giving \(\pi(\theta|x)\approx N(\hat{\theta},I_n(\hat{\theta})^{-1})\)
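
  • A small numerical illustration (my own Beta-Binomial example, not from the notes) of the normal approximation with \(\hat{\theta}\) plugged in:

```python
# Sketch: compare the exact posterior with N(theta_hat, I_n(theta_hat)^{-1}).
# Assumptions (not from the notes): x successes in n Bernoulli trials with a
# flat Beta(1, 1) prior, so the exact posterior is Beta(x + 1, n - x + 1)
# and the Fisher information is I_n(theta) = n / (theta * (1 - theta)).
import numpy as np
from scipy import stats

n, x = 200, 120
theta_hat = x / n                                   # MLE
info = n / (theta_hat * (1.0 - theta_hat))          # I_n(theta_hat)

grid = np.linspace(0.45, 0.75, 7)
exact = stats.beta.pdf(grid, x + 1, n - x + 1)                   # true posterior density
approx = stats.norm.pdf(grid, theta_hat, np.sqrt(1.0 / info))    # normal approximation
print(np.round(exact, 2))
print(np.round(approx, 2))    # the two rows agree closely for this n
```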

Importance Sampling

  • Similar to rejection sampling; however, instead of accepting each sample with some probability, we weight each sample
  • Algorithm:
    • Generate \(x^{(i)}\) from \(g(x)\)
    • Give \(x^{(i)}\) weight \(w^{(i)}\propto\frac{f(x^{(i)})}{g(x^{(i)})}\)
  • Weighted averages of samples from \(g(x)\) approximate expectations under \(f(x)\)
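
  • A minimal sketch (the toy target and proposal are my own choices, not from the notes): estimating \(\mathbb{E}_f[x^2]\) for a standard normal target \(f\) using a heavier-tailed Student-t proposal \(g\).

```python
# Sketch: importance sampling with fully normalised densities.
# Toy choices (not from the notes): target f = N(0, 1), proposal g = t_3,
# h(x) = x^2, so the true expectation is 1.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
N = 100_000
x = rng.standard_t(df=3, size=N)               # x_i ~ g
w = stats.norm.pdf(x) / stats.t.pdf(x, df=3)   # w_i = f(x_i) / g(x_i)

estimate = np.mean(w * x**2)                   # (1/N) * sum_i w_i * h(x_i)
print(estimate)                                # close to 1
```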

Unnormalised Distribution

  • If \(f(x) = \frac{\tilde f(x)}{Z}\), where the normalising constant \(Z=\int \tilde f(x)dx\) is unknown, we can still use the unnormalised weights \(\tilde w(x)=\frac{\tilde f(x)}{g(x)}\):

\[ \mathbb{E}_f[h(x)]\approx\sum^N_{i=1}W(x^{(i)})h(x^{(i)})\]

\[ W(x^{(i)})=\frac{\tilde w(x^{(i)})}{\sum^N_{j=1}\tilde w(x^{(j)})}\]

  • Therefore, we normalise the weights before taking the expectation
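
  • A minimal sketch of the self-normalised version (same toy proposal as above; the unnormalised target is my own choice, not from the notes):

```python
# Sketch: self-normalised importance sampling with an unnormalised target.
# Toy choices (not from the notes): f_tilde(x) = exp(-x^2 / 2), i.e. a N(0, 1)
# target with its constant dropped, proposal g = t_3, h(x) = x^2.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
N = 100_000
x = rng.standard_t(df=3, size=N)                     # x_i ~ g
w_tilde = np.exp(-x**2 / 2) / stats.t.pdf(x, df=3)   # unnormalised weights
W = w_tilde / w_tilde.sum()                          # normalised weights W(x_i)

estimate = np.sum(W * x**2)                          # E_f[x^2], close to 1
print(estimate)
```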

Variability in Weights

  • Variability in weights means some samples contribute more than others
    • For efficiency, we would like all samples to contribute as equally as possible, i.e. \(W^{(i)}=\frac{1}{n}\)
      • Low weight variability
  • Weight variance is often measured through effective sample size (ESS)
    • Consider the normalised weights \(W^{(i)}\)

\[ ESS = \left[\sum^n_{i=1}(W^{(i)})^2\right]^{-1}\]

  • Note:
    • \(1\leq ESS\leq n\)
    • ESS \(=n\) when \(W^{(i)}=\frac{1}{n}\) for all \(i\) (optimal)
  • Maximise the ESS by choosing \(g(x)\) to match \(f(x)\) as closely as possible
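
  • A tiny self-contained sketch (my own, not from the notes) of the ESS calculation:

```python
# Sketch: effective sample size from (possibly unnormalised) importance weights.
import numpy as np

def effective_sample_size(w):
    """ESS = 1 / sum_i W_i^2, where W_i are the normalised weights."""
    W = np.asarray(w, dtype=float)
    W = W / W.sum()
    return 1.0 / np.sum(W**2)

print(effective_sample_size([1.0, 1.0, 1.0, 1.0]))    # equal weights -> ESS = 4
print(effective_sample_size([100.0, 1.0, 1.0, 1.0]))  # one dominant weight -> ESS ~ 1
```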