Density Estimation

Jake

19/08/2021

Estimation of Unknown Density

  • Parametric density estimation is one approach to estimating an unknown density; however, it can miss significant structure in the data
    • For example, a normal distribution is unimodal, whereas the data may be multimodal
  • Histograms can be used as a basic density estimation technique
    • Binwidth is used as a smoothing parameter
      • Small bin width (less smoothing) results in small counts in the bins, therefore lots of variation and a noisy estimate.
      • Large bin width (more smoothing) results in larger counts in the bins, which can result in important structures being smoothed out if they are at a scale smaller than the binwidth

Histogram

  • The formal definition of a histogram is as follows:
    • Note that \(f(x) = \frac{d}{dx}F(x)\) where \(F(x) = P(X\leq x)\)

    • The real line is divided into intervals \((b_j,b_{j+1}]\), each of length \(h\), i.e. \(b_{j+1}-b_j=h\)

    • For \(x\in (b_j,b_{j+1}]\) we approximate \(f(x)\) by \(f(b_j)\), and approximate \(f(b_j)\) by:

      \[ f(b_j) \approx \frac{F(b_{j+1})-F(b_j)}{h},\quad\text{where }F\text{ is estimated by }\hat{F}(x)=\frac{\#\lbrace x_j\leq x\rbrace}{n}\]

  • Therefore the final estimate is:

\[ \hat{f}(x)=\frac{\hat{F}(b_{j+1})-\hat{F}(b_j)}{h} = \frac{n_j}{nh},\quad\text{where }n_j\text{ is the number of observations in bin }(b_j,b_{j+1}]\]

  • \(h\) manages the trade-off between bias and variance
    • A small \(h\) results in a highly variable \(\hat{f}(x)\)
    • A large \(h\) results in a less variable \(\hat{f}(x)\), however it will have more bias
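A minimal sketch of this calculation in R (simulated data and an arbitrary binwidth \(h\), purely for illustration):

``` r
set.seed(1)
x <- rnorm(100)                              # n = 100 observations
h <- 0.5                                     # binwidth (smoothing parameter)
breaks <- seq(floor(min(x)) - h, ceiling(max(x)) + h, by = h)
n_j   <- table(cut(x, breaks))               # counts per bin (b_j, b_{j+1}]
f_hat <- as.numeric(n_j) / (length(x) * h)   # density estimate n_j / (n h)
sum(f_hat * h)                               # integrates to 1 by construction
```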

  • For the estimate to converge on the true density, the binwidth must decrease (\(h\rightarrow 0\)) while the number of observations per bin increases, which requires \(nh\rightarrow\infty\)
    • The probability of a single observation lying in a bin is \(\sim f(x)h\), therefore the expected number in a bin is \(\sim f(x)nh\), which increases as \(nh\rightarrow\infty\)
  • The most commonly used measure for the closeness of our approximation \(\hat{f}(x)\) to the true density \(f(x)\) is mean integrated squared error (MISE):

\[ ISE = \int^\infty_{-\infty} (\hat{f}(x)-f(x))^2 dx \]

\[ MISE = \mathbb{E}(ISE) = \int^\infty_{-\infty} MSE(\hat{f}(x)) dx\]

  • For a histogram estimator, the approximate result of MISE is:

\[ MISE\approx\underbrace{\frac{1}{nh}}_{\text{Variance}}+\underbrace{\frac{h^2R(f')}{12}}_{\text{Bias}},\quad\text{ where } R(f') = \int^\infty_{-\infty}f'(x)^2 dx\]

  • The minimiser of this approximation to the MISE is:

\[ h = \left(\frac{6}{R(f')}\right)^{1/3}n^{-1/3}\]

  • This value, however, depends on \(R(f')\), which involves the unknown density. We often assume a Gaussian density for this calculation, resulting in:

\[ h^* = 3.491\hat{\sigma}n^{-1/3} \]

  • We can plot histograms in R
    • hist(x, nclass, plot=T, probability = F)
    • x is a vector of data
    • nclass suggests the number of bins (and therefore the binwidth)
    • plot=T plots the histogram
      • If set to false, a list is returned with the components \(\$\)counts (vector of counts or estimated density values within bins) and \(\$\)breaks (breakpoints between histogram classes)
    • probability = T plots a density estimate rather than raw counts
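A short sketch putting these pieces together (simulated data; the binwidth uses the normal-reference constant 3.491 from above):

``` r
set.seed(42)
x <- rnorm(200)

# Normal-reference binwidth h* = 3.491 * sigma_hat * n^(-1/3)
h_star <- 3.491 * sd(x) * length(x)^(-1/3)

# probability = TRUE plots the density scale; nclass suggests the bin count
hist(x, nclass = ceiling(diff(range(x)) / h_star), probability = TRUE)

# plot = FALSE returns the counts and breakpoints instead of plotting
h_obj <- hist(x, plot = FALSE)
h_obj$counts
h_obj$breaks
```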

Nonparametric Density Estimation

  • Histogram density estimates have an unattractive appearance as they are not continuous
    • Smooth/continuous estimates are visually more appealing and tend to perform better mathematically in the MISE sense
  • Consider the following way to write the empirical distribution function:
    • This cannot be differentiated to obtain an estimate of \(f(x)\), as it is not continuous at the jump points
    • We replace the indicator function with a smooth approximation, allowing for differentiation at jump points

\[ \hat{F}(x)=\frac{1}{n}\sum_j\underbrace{I(x_j\leq x)}_{\text{Replace}}\]

\[ I(x_j\leq x)=\begin{cases}1,\quad \text{if }x_j\leq x\\ 0,\quad\text{otherwise} \end{cases}\]

  • Let \(G(u)\) be the distribution function of a RV \(U\) with density function \(K(u)\), where \(U\) has mean 0 and variance 1.
  • Now consider the RV \(Z=x_j+hU\) with mean \(x_j\) and variance \(h^2\). The distribution of this variable is therefore:
    • As \(h\rightarrow 0\), \(Z\) converges to a constant equal to \(x_j\), so \(G\left(\frac{x-x_j}{h}\right)\) approaches the step (indicator) function \(I(x_j\leq x)\).
    • As \(h\rightarrow\infty\), \(G\left(\frac{x-x_j}{h}\right)\) becomes flat, i.e. over-smoothed.

\[ P(x_j + hU\leq x) = G\left(\frac{x-x_j}{h}\right)\]

  • Therefore, we use this as our smooth approximation in the CDF formula:

\[ \hat{F}(x) = \frac{1}{n}\sum_jG\left(\frac{x-x_j}{h}\right)\]

  • Differentiating this gives the density estimate, noting \(G'(u) = K(u)\):
    • \(h\) is called bandwidth in this case and \(K(u)\) is the kernel density function

\[ \hat{f}(x) = \frac{1}{nh}\sum_jK\left(\frac{x-x_j}{h}\right)\]

  • We note assumptions for this kernel function:
    • \(K(u)\) is differentiable
    • \(\int K(u)du = 1\)
    • \(\int uK(u)du = 0\)
    • \(\int u^2K(u)du = \sigma^2_K<\infty\)
  • There are many common choices for the kernel function:
  • Gaussian Kernel

\[ K(u) = \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{u^2}{2}\right)\]

  • Epanechnikov Kernel

\[ K(u) = \frac{3}{4}(1-u^2), |u|\leq 1\]

  • Biweight Kernel

\[ K(u) = \frac{15}{16}(1-u^2)^2, |u|\leq 1\]

  • Triweight Kernel

\[ K(u) = \frac{35}{32}(1-u^2)^3, |u|\leq 1 \]
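A rough sketch of the kernel estimator \(\hat{f}(x) = \frac{1}{nh}\sum_jK\left(\frac{x-x_j}{h}\right)\) coded directly in R, using the Epanechnikov kernel above (simulated data and an arbitrary bandwidth, for illustration only):

``` r
set.seed(1)
x_obs <- rnorm(100)          # observed sample x_1, ..., x_n
h     <- 0.4                 # bandwidth (illustrative value)

# Epanechnikov kernel K(u) = 3/4 (1 - u^2) for |u| <= 1
K <- function(u) ifelse(abs(u) <= 1, 0.75 * (1 - u^2), 0)

# Kernel density estimate at a single point x
f_hat <- function(x) sum(K((x - x_obs) / h)) / (length(x_obs) * h)

# Evaluate on a grid and compare with the true N(0, 1) density
grid <- seq(-4, 4, length.out = 200)
plot(grid, sapply(grid, f_hat), type = "l", ylab = "density")
curve(dnorm(x), add = TRUE, lty = 2)
```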

  • As \(h\rightarrow 0\) and \(nh\rightarrow\infty\), MISE for the kernel estimator is approximately:

\[ MISE\approx \int^\infty_{-\infty}\underbrace{\frac{h^4f''(x)^2\sigma^4_K}{4}}_{Bias(\hat{f}(x))^2}+\underbrace{\frac{f(x)R(K)}{nh}}_{Var(\hat{f}(x))}dx\]

\[ MISE\approx\underbrace{\frac{R(K)}{nh}}_{\text{Variance}}+\underbrace{\frac{h^4\sigma^4_KR(f'')}{4}}_{\text{Bias}}\]

\[ R(K) = \int^\infty_{-\infty}K(u)^2 du,\quad \sigma^2_K = \int^\infty_{-\infty}u^2K(u) du,\quad R(f'')=\int^\infty_{-\infty}f''(x)^2 dx\]

  • The optimal choice of bandwidth is:

\[ h^* = \left(\frac{R(K)}{\sigma^4_KR(f'')}\right)^{1/5}n^{-1/5}\]
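  • As with the histogram, \(h^*\) depends on the unknown quantity \(R(f'')\). A common reference calculation (assuming a Gaussian kernel and a Gaussian underlying density with standard deviation \(\sigma\), estimated by \(\hat{\sigma}\)) gives \(R(K)=\frac{1}{2\sqrt{\pi}}\), \(\sigma^2_K=1\) and \(R(f'')=\frac{3}{8\sqrt{\pi}\sigma^5}\), so that:

\[ h^* = \left(\frac{4}{3}\right)^{1/5}\hat{\sigma}n^{-1/5}\approx 1.06\,\hat{\sigma}n^{-1/5}\]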

  • We can plot non-parametric density estimates in R:
    • density(x, n=50, window='g', width)
    • x is a vector of data
    • n is the number of equally spaced points to compute the density
    • window is the kernel function to use
    • width specifies the bandwidth, which can be numerical or a string to call an automatic bandwidth selector
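A small usage sketch (simulated bimodal data; the bandwidth choices are purely illustrative):

``` r
set.seed(7)
x <- c(rnorm(150, mean = -1), rnorm(150, mean = 2))   # bimodal sample

# Gaussian kernel with an automatic (normal reference) bandwidth
d1 <- density(x, n = 200, window = "g", width = "nrd")
plot(d1, main = "Kernel density estimate")

# Same kernel with a small numeric bandwidth, for comparison
d2 <- density(x, n = 200, window = "g", width = 0.2)
lines(d2, lty = 2)
```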

Variability Plots

  • To give a graphical representation of the uncertainty of a kernel estimate we can use variability plots

  • These plots are created by drawing \(m\) bootstrapped samples of size \(n\), computing a density estimate for each resample at a common grid of values, and taking the upper/lower 2.5 percentage points at each grid point.

  • An example of how this could be done in R is sketched below
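A minimal sketch under assumed settings (simulated data, \(m = 200\) resamples, a 100-point grid):

``` r
set.seed(123)
x <- rnorm(100)
n <- length(x)
m <- 200                                   # number of bootstrap resamples
grid <- seq(min(x) - 1, max(x) + 1, length.out = 100)

# Density estimate for each resample, evaluated on the common grid
boot_est <- replicate(m, {
  xb <- sample(x, n, replace = TRUE)
  density(xb, from = min(grid), to = max(grid), n = length(grid))$y
})

# Pointwise 2.5% and 97.5% quantiles form the variability band
band <- apply(boot_est, 1, quantile, probs = c(0.025, 0.975))

plot(density(x, from = min(grid), to = max(grid), n = length(grid)),
     main = "Kernel estimate with variability band")
lines(grid, band[1, ], lty = 2)
lines(grid, band[2, ], lty = 2)
```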

Bivariate Kernel Density Estimation

  • We may wish to estimate bivariate or multivariate distributions, such as \(f(x,y)\)
    • \(F(x,y) = P(X\leq x, Y\leq y)\) for the bivariate distribution function
    • \(f(x,y)=\frac{\partial^2}{\partial x\partial y}F(x,y)\)
  • Estimator of \(F(x,y)\):
    • Note that \(I(x_j\leq x,y_j\leq y) = I(x_j\leq x)I(y_j\leq y)\), since both conditions must hold simultaneously

\[ \hat{F}(x,y) = \frac{1}{n}\sum_jI(x_j\leq x,y_j\leq y)\]

  • We approximate the EDF similarly as before, using a product of univariate smooth approximations, one in each coordinate:

\[ \hat{F}(x,y) = \frac{1}{n}\sum_jG\left(\frac{x-x_j}{h_1}\right)G\left(\frac{y-y_j}{h_2}\right)\]

\[ \hat{f}(x,y) = \frac{1}{nh_1h_2}\sum_jK\left(\frac{x-x_j}{h_1}\right)K\left(\frac{y-y_j}{h_2}\right)\]
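As an illustration, a bivariate estimate of this form can be computed with kde2d() from the MASS package, which uses Gaussian kernels and a product bandwidth \((h_1,h_2)\) (simulated data; the bandwidths here come from the normal reference rule bw.nrd(), an assumption for this sketch):

``` r
library(MASS)                      # provides kde2d()

set.seed(5)
x <- rnorm(200)
y <- 0.5 * x + rnorm(200, sd = 0.8)

# Product-kernel bivariate density estimate on a 50 x 50 grid;
# h = (h_1, h_2) are the bandwidths for the x and y directions
est <- kde2d(x, y, h = c(bw.nrd(x), bw.nrd(y)), n = 50)

# Contour plot of the estimated joint density f_hat(x, y)
contour(est, xlab = "x", ylab = "y")
points(x, y, pch = 20, cex = 0.4)
```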