Loss Functions and Asymptotics

Jake

13/10/2022

Loss Functions

  • For a decision \(d\in\mathcal{D}\), a loss function defines the penalty incurred by choosing \(d\) when the true parameter value is \(\theta\):

\[ L(\theta,d)\]

  • We would like \(d^*=\arg\min_d L(\theta,d)\); however, \(\theta\) is unknown with \(\theta\sim\pi(\theta|x)\), so instead we minimise the expected posterior loss:

\[ d^*=\arg\min_d\mathbb{E}_\pi[L(\theta,d)] \]

Common Loss Functions

  • For a given loss function, we can calculate the posterior expectation and minimise it over \(d\):

\[ \mathbb{E}[L(\theta,d)] = \int L(\theta,d)\pi(\theta|x)d\theta\]
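
  • A minimal sketch of this recipe (my own example, not from the notes): given draws from the posterior, approximate \(\mathbb{E}[L(\theta,d)]\) by Monte Carlo and minimise over a grid of candidate decisions. The Gamma posterior and quadratic loss below are illustrative assumptions.

```python
# Sketch: minimise the expected posterior loss by Monte Carlo + grid search.
# Assumptions (not from the notes): posterior draws come from a Gamma(3, rate=2)
# and the loss is quadratic, so d* should land near the posterior mean (1.5).
import numpy as np

rng = np.random.default_rng(1)
theta = rng.gamma(shape=3.0, scale=0.5, size=100_000)   # draws from pi(theta | x)

def expected_loss(d, theta, loss=lambda t, d: (t - d) ** 2):
    """Monte Carlo estimate of E[L(theta, d)] over the posterior draws."""
    return np.mean(loss(theta, d))

grid = np.linspace(theta.min(), theta.max(), 500)        # candidate decisions d
d_star = grid[np.argmin([expected_loss(d, theta) for d in grid])]
print(d_star, theta.mean())                              # both approximately 1.5
```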

Quadratic Loss

  • Loss function:

\[ L(\theta,d) = (\theta-d)^2\]

  • The optimal decision is \(d^*=\mathbb{E}[\theta|x]\), the posterior mean.
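  • A quick check (standard derivation, not spelled out in the notes): differentiating the expected loss,

\[ \frac{\partial}{\partial d}\mathbb{E}[(\theta-d)^2] = -2\left(\mathbb{E}[\theta|x]-d\right) = 0 \quad\Rightarrow\quad d^*=\mathbb{E}[\theta|x]\]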

Absolute Error Loss

  • Loss function:

\[ L(\theta,d) = |\theta-d|\]

  • The optimal decision \(d^*\) is the median of the posterior.
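  • Sketch of why (standard argument, not spelled out in the notes): differentiating \(\mathbb{E}[|\theta-d|]\) with respect to \(d\) gives \(P(\theta<d|x)-P(\theta>d|x)=0\), so \(d^*\) satisfies \(P(\theta\leq d^*|x)=\frac{1}{2}\), i.e. the posterior median.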

Linear Loss

  • Loss function:

\[ L(\theta,d) = \begin{cases}g(d-\theta),\text{ if }d>\theta\\h(\theta-d),\text{ if }d<\theta\end{cases}\]

  • The optimal decision \(d^*\) is the \(q=\frac{h}{g+h}\) posterior quantile, i.e. \(P(\theta\leq d^*|x)=\frac{h}{g+h}\).
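  • Sketch of why (standard argument, not spelled out in the notes): the expected loss is \(g\int_{-\infty}^{d}(d-\theta)\pi(\theta|x)d\theta + h\int_{d}^{\infty}(\theta-d)\pi(\theta|x)d\theta\), with derivative \(gP(\theta<d|x)-hP(\theta>d|x)\); setting this to zero gives \(P(\theta\leq d^*|x)=\frac{h}{g+h}\).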

0-1 Loss

  • Loss function:

\[ L(\theta,d) = \begin{cases}0,\text{ if }|d-\theta|\leq\epsilon\\1,\text{ if }|d-\theta|>\epsilon\end{cases}\]

  • The optimal decision \(d^*\) is the posterior mode.
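  • Sketch of why (standard argument, not spelled out in the notes): \(\mathbb{E}[L(\theta,d)] = 1-P(|d-\theta|\leq\epsilon\,|\,x)\), so we choose the interval \([d-\epsilon,d+\epsilon]\) with the highest posterior probability; as \(\epsilon\rightarrow 0\) this centres on the posterior mode.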

Predictive Inference

  • We ‘average’ over the posterior uncertainty in the parameters to obtain the predictive density of a future observation \(y\):

\[ f(y|x)=\int f(y|\theta)\pi(\theta|x)d\theta\]

  • We can calculate this through integration or Monte Carlo simulation
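
  • A minimal Monte Carlo sketch (the Bernoulli model and Beta(2, 2) posterior are my own illustrative assumptions, not from the notes):

```python
# Sketch: posterior predictive density by Monte Carlo,
#   f(y | x) ~= (1/N) * sum_i f(y | theta_i) with theta_i ~ pi(theta | x).
# Assumed model (not from the notes): y | theta ~ Bernoulli(theta),
# posterior theta | x ~ Beta(2, 2).
import numpy as np

rng = np.random.default_rng(0)
theta = rng.beta(2.0, 2.0, size=100_000)   # draws from pi(theta | x)

p_y1 = np.mean(theta)          # predictive P(y = 1 | x)
p_y0 = np.mean(1.0 - theta)    # predictive P(y = 0 | x)
print(p_y1, p_y0)              # both close to 0.5 for this posterior
```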

Posterior Asymptotics

  • The posterior distribution has two key limiting properties as \(n\rightarrow \infty\)
  • Consistency:
    • If the true value is \(\theta =\theta_0\), then as \(n\rightarrow \infty\) the posterior probability that \(\theta\) lies in any neighbourhood of \(\theta_0\) approaches 1.
  • Asymptotic Normality:
    • As \(n\rightarrow \infty\):

\[ \pi(\theta|x) \rightarrow N(\theta_0,I_n(\theta_0)^{-1})\]

  • We usually do not know the true value \(\theta_0\), so in practice we substitute the MLE \(\hat{\theta}\), giving \(\pi(\theta|x)\approx N(\hat{\theta},I_n(\hat{\theta})^{-1})\)
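
  • A small numerical illustration (my own Beta-Binomial example, not from the notes) of the normal approximation with \(\hat{\theta}\) plugged in:

```python
# Sketch: compare the exact posterior with N(theta_hat, I_n(theta_hat)^{-1}).
# Assumptions (not from the notes): x successes in n Bernoulli trials with a
# flat Beta(1, 1) prior, so the exact posterior is Beta(x + 1, n - x + 1)
# and the Fisher information is I_n(theta) = n / (theta * (1 - theta)).
import numpy as np
from scipy import stats

n, x = 200, 120
theta_hat = x / n                                   # MLE
info = n / (theta_hat * (1.0 - theta_hat))          # I_n(theta_hat)

grid = np.linspace(0.45, 0.75, 7)
exact = stats.beta.pdf(grid, x + 1, n - x + 1)                   # true posterior density
approx = stats.norm.pdf(grid, theta_hat, np.sqrt(1.0 / info))    # normal approximation
print(np.round(exact, 2))
print(np.round(approx, 2))    # the two rows agree closely for this n
```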

Importance Sampling

  • Similar to rejection sampling; however, instead of accepting each sample with some probability, we weight each sample
  • Algorithm:
    • Generate \(x^{(i)}\) from \(g(x)\)
    • Give \(x^{(i)}\) weight \(w^{(i)}\propto\frac{f(x^{(i)})}{g(x^{(i)})}\)
  • Weighted averages of samples from \(g(x)\) approximate expectations under \(f(x)\)
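
  • A minimal sketch (the toy target and proposal are my own choices, not from the notes): estimating \(\mathbb{E}_f[x^2]\) for a standard normal target \(f\) using a heavier-tailed Student-t proposal \(g\).

```python
# Sketch: importance sampling with fully normalised densities.
# Toy choices (not from the notes): target f = N(0, 1), proposal g = t_3,
# h(x) = x^2, so the true expectation is 1.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
N = 100_000
x = rng.standard_t(df=3, size=N)               # x_i ~ g
w = stats.norm.pdf(x) / stats.t.pdf(x, df=3)   # w_i = f(x_i) / g(x_i)

estimate = np.mean(w * x**2)                   # (1/N) * sum_i w_i * h(x_i)
print(estimate)                                # close to 1
```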

Unnormalised Distribution

  • If \(f(x) = \frac{\tilde f(x)}{Z}\), where the normalising constant \(Z=\int \tilde f(x)dx\) is unknown, we can still use the unnormalised weights \(\tilde w(x)=\frac{\tilde f(x)}{g(x)}\):

\[ \mathbb{E}_f[h(x)]\approx\sum^N_{i=1}W(x^{(i)})h(x^{(i)})\]

\[ W(x^{(i)})=\frac{\tilde w(x^{(i)})}{\sum^N_{j=1}\tilde w(x^{(j)})}\]

  • Therefore, we normalise the weights before taking the expectation
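
  • A minimal sketch of the self-normalised version (same toy proposal as above; the unnormalised target is my own choice, not from the notes):

```python
# Sketch: self-normalised importance sampling with an unnormalised target.
# Toy choices (not from the notes): f_tilde(x) = exp(-x^2 / 2), i.e. a N(0, 1)
# target with its constant dropped, proposal g = t_3, h(x) = x^2.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
N = 100_000
x = rng.standard_t(df=3, size=N)                     # x_i ~ g
w_tilde = np.exp(-x**2 / 2) / stats.t.pdf(x, df=3)   # unnormalised weights
W = w_tilde / w_tilde.sum()                          # normalised weights W(x_i)

estimate = np.sum(W * x**2)                          # E_f[x^2], close to 1
print(estimate)
```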

Variability in Weights

  • Variability in weights means some samples contribute more than others
    • For efficiency, we would like all samples to contribute as equally as possible, i.e. \(W^{(i)}=\frac{1}{n}\)
      • Low weight variability
  • Weight variance is often measured through effective sample size (ESS)
    • Consider the normalised weights \(W^{(i)}\)

\[ ESS = \left[\sum^n_{i=1}(W^{(i)})^2\right]^{-1}\]

  • Note:
    • \(1\leq ESS\leq n\)
    • ESS \(=n\) when \(W^{(i)}=\frac{1}{n}\) for all \(i\) (optimal)
  • Maximise the ESS by choosing \(g(x)\) to match \(f(x)\) as closely as possible
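
  • A tiny self-contained sketch (my own, not from the notes) of the ESS calculation:

```python
# Sketch: effective sample size from (possibly unnormalised) importance weights.
import numpy as np

def effective_sample_size(w):
    """ESS = 1 / sum_i W_i^2, where W_i are the normalised weights."""
    W = np.asarray(w, dtype=float)
    W = W / W.sum()
    return 1.0 / np.sum(W**2)

print(effective_sample_size([1.0, 1.0, 1.0, 1.0]))    # equal weights -> ESS = 4
print(effective_sample_size([100.0, 1.0, 1.0, 1.0]))  # one dominant weight -> ESS ~ 1
```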