Density Estimation

Jake

19/08/2021

Estimation of Unknown Density

  • Parametric density estimation is one approach to estimating an unknown density; however, it can miss significant structure in the data
    • For example, a normal distribution is unimodal, whereas the data may be multimodal
  • Histograms can be used as a basic density estimation technique
    • Binwidth is used as a smoothing parameter
      • Small bin width (less smoothing) results in small counts in the bins, therefore lots of variation and a noisy estimate.
      • Large bin width (more smoothing) results in larger counts in the bins, which can result in important structures being smoothed out if they are at a scale smaller than the binwidth

Histogram

  • The formal definition of a histogram is as follows:
    • Note that \(f(x) = \frac{d}{dx}F(x)\) where \(F(x) = P(X\leq x)\)

    • The real line is divided into intervals \((b_j,b_{j+1}]\), each of length \(h\), i.e. \(b_{j+1}-b_j=h\)

    • For \(x\in (b_j,b_{j+1}]\) we approximate \(f(x)\) by \(f(b_j)\), and approximate \(f(b_j)\) by:

      \[ f(b_j) \approx \frac{F(b_{j+1})-F(b_j)}{h},\quad\text{where }F\text{ is estimated by }\hat{F}(x)=\frac{\#\lbrace x_j\leq x\rbrace}{n}\]

  • Therefore the final estimate is:

\[ \hat{f}(x)=\frac{\hat{F}(b_{j+1})-\hat{F}(b_j)}{h} = \frac{n_j}{nh},\quad\text{where }n_j\text{ is the number of observations in bin }(b_j,b_{j+1}]\]

  • \(h\) manages the trade-off between bias and variance
    • A small \(h\) results in a highly variable \(\hat{f}(x)\)
    • A large \(h\) results in a less variable \(\hat{f}(x)\), however it will have more bias
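A minimal sketch of this calculation in R (simulated data and an arbitrary binwidth \(h\), purely for illustration):

``` r
set.seed(1)
x <- rnorm(100)                              # n = 100 observations
h <- 0.5                                     # binwidth (smoothing parameter)
breaks <- seq(floor(min(x)) - h, ceiling(max(x)) + h, by = h)
n_j   <- table(cut(x, breaks))               # counts per bin (b_j, b_{j+1}]
f_hat <- as.numeric(n_j) / (length(x) * h)   # density estimate n_j / (n h)
sum(f_hat * h)                               # integrates to 1 by construction
```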

  • For the estimate to converge on the true density, the binwidth must decrease (\(h\rightarrow 0\)) while the number of observations per bin increases, which requires \(nh\rightarrow\infty\)
    • The probability of a single observation lying in a bin is \(\sim f(x)h\), therefore the expected number in a bin is \(\sim f(x)nh\), which increases as \(nh\rightarrow\infty\)
  • The most commonly used measure for the closeness of our approximation \(\hat{f}(x)\) to the true density \(f(x)\) is mean integrated squared error (MISE):

\[ ISE = \int^\infty_{-\infty} (\hat{f}(x)-f(x))^2 dx \]

\[ MISE = \mathbb{E}(ISE) = \int^\infty_{-\infty} MSE(\hat{f}(x)) dx\]

  • For a histogram estimator, the approximate result of MISE is:

\[ MISE\approx\underbrace{\frac{1}{nh}}_{\text{Variance}}+\underbrace{\frac{h^2R(f')}{12}}_{\text{Bias}},\quad\text{ where } R(f') = \int^\infty_{-\infty}f'(x)^2 dx\]

  • The minimiser of this approximation to the MISE is:

\[ h = \left(\frac{6}{R(f')}\right)^{1/3}n^{-1/3}\]

  • This value, however, depends on \(R(f')\), which involves the unknown density. We often assume a Gaussian density for this calculation, resulting in:

\[ h^* = 3.491\hat{\sigma}n^{-1/3} \]

  • We can plot histograms in R
    • hist(x, nclass, plot=T, probability = F)
    • x is a vector of data
    • nclass suggests the number of bins (and therefore the binwidth)
    • plot=T plots the histogram
      • If set to false, a list is returned with the components \(\$\)counts (vector of counts or estimated density values within bins) and \(\$\)breaks (breakpoints between histogram classes)
    • probability = T plots a density estimate rather than raw counts
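A short sketch putting these pieces together (simulated data; the binwidth uses the normal-reference constant 3.491 from above):

``` r
set.seed(42)
x <- rnorm(200)

# Normal-reference binwidth h* = 3.491 * sigma_hat * n^(-1/3)
h_star <- 3.491 * sd(x) * length(x)^(-1/3)

# probability = TRUE plots the density scale; nclass suggests the bin count
hist(x, nclass = ceiling(diff(range(x)) / h_star), probability = TRUE)

# plot = FALSE returns the counts and breakpoints instead of plotting
h_obj <- hist(x, plot = FALSE)
h_obj$counts
h_obj$breaks
```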

Nonparametric Density Estimation

  • Histogram density estimates have an unattractive appearance as they are not continuous
    • Smooth/continuous estimates are visually more appealing and tend to perform better mathematically in the MISE sense
  • Consider the following way to write the empirical distribution function:
    • This cannot be differentiated to obtain an estimate of \(f(x)\), as it is not continuous at the jump points
    • We replace the indicator function with a smooth approximation, allowing for differentiation at jump points

\[ \hat{F}(x)=\frac{1}{n}\sum_j\underbrace{I(x_j\leq x)}_{\text{Replace}}\]

\[ I(x_j\leq x)=\begin{cases}1,\quad \text{if }x_j\leq x\\ 0,\quad\text{otherwise} \end{cases}\]

  • Let \(G(u)\) be the distribution function of a RV \(U\) with density function \(K(u)\), where \(U\) has mean 0 and variance 1.
  • Now consider the RV \(Z=x_j+hU\) with mean \(x_j\) and variance \(h^2\). The distribution of this variable is therefore:
    • As \(h\rightarrow 0\), \(Z\) converges to a constant equal to \(x_j\), so \(G\left(\frac{x-x_j}{h}\right)\) approaches the step (indicator) function \(I(x_j\leq x)\).
    • As \(h\rightarrow\infty\), \(G\left(\frac{x-x_j}{h}\right)\) becomes flat, i.e. over-smoothed.

\[ P(x_j + hU\leq x) = G\left(\frac{x-x_j}{h}\right)\]

  • Therefore, we use this as our smooth approximation in the CDF formula:

\[ \hat{F}(x) = \frac{1}{n}\sum_jG\left(\frac{x-x_j}{h}\right)\]

  • Differentiating this gives the density estimate, noting \(G'(u) = K(u)\):
    • \(h\) is called bandwidth in this case and \(K(u)\) is the kernel density function

\[ \hat{f}(x) = \frac{1}{nh}\sum_jK\left(\frac{x-x_j}{h}\right)\]

  • We note assumptions for this kernel function:
    • \(K(u)\) is differentiable
    • \(\int K(u)du = 1\)
    • \(\int uK(u)du = 0\)
    • \(\int u^2K(u)du = \sigma^2_K<\infty\)
  • There are many common choices for the kernel function:
  • Gaussian Kernel

\[ K(u) = \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{u^2}{2}\right)\]

  • Epanechnikov Kernel

\[ K(u) = \frac{3}{4}(1-u^2), |u|\leq 1\]

  • Biweight Kernel

\[ K(u) = \frac{15}{16}(1-u^2)^2, |u|\leq 1\]

  • Triweight Kernel

\[ K(u) = \frac{35}{32}(1-u^2)^3, |u|\leq 1 \]
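A rough sketch of the kernel estimator \(\hat{f}(x) = \frac{1}{nh}\sum_jK\left(\frac{x-x_j}{h}\right)\) coded directly in R, using the Epanechnikov kernel above (simulated data and an arbitrary bandwidth, for illustration only):

``` r
set.seed(1)
x_obs <- rnorm(100)          # observed sample x_1, ..., x_n
h     <- 0.4                 # bandwidth (illustrative value)

# Epanechnikov kernel K(u) = 3/4 (1 - u^2) for |u| <= 1
K <- function(u) ifelse(abs(u) <= 1, 0.75 * (1 - u^2), 0)

# Kernel density estimate at a single point x
f_hat <- function(x) sum(K((x - x_obs) / h)) / (length(x_obs) * h)

# Evaluate on a grid and compare with the true N(0, 1) density
grid <- seq(-4, 4, length.out = 200)
plot(grid, sapply(grid, f_hat), type = "l", ylab = "density")
curve(dnorm(x), add = TRUE, lty = 2)
```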

  • As \(h\rightarrow 0\) and \(nh\rightarrow\infty\), MISE for the kernel estimator is approximately:

\[ MISE\approx \int^\infty_{-\infty}\underbrace{\frac{h^4f''(x)^2\sigma^4_K}{4}}_{Bias(\hat{f}(x))^2}+\underbrace{\frac{f(x)R(K)}{nh}}_{Var(\hat{f}(x))}dx\]

\[ MISE\approx\underbrace{\frac{R(K)}{nh}}_{\text{Variance}}+\underbrace{\frac{h^4\sigma^4_KR(f'')}{4}}_{\text{Bias}}\]

\[ R(K) = \int^\infty_{-\infty}K(u)^2 du,\quad \sigma^2_K = \int^\infty_{-\infty}u^2K(u) du,\quad R(f'')=\int^\infty_{-\infty}f''(x)^2 dx\]

  • The optimal choice of bandwidth is:

\[ h^* = \left(\frac{R(K)}{\sigma^4_KR(f'')}\right)^{1/5}n^{-1/5}\]
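  • As with the histogram, \(h^*\) depends on the unknown quantity \(R(f'')\). A common reference calculation (assuming a Gaussian kernel and a Gaussian underlying density with standard deviation \(\sigma\), estimated by \(\hat{\sigma}\)) gives \(R(K)=\frac{1}{2\sqrt{\pi}}\), \(\sigma^2_K=1\) and \(R(f'')=\frac{3}{8\sqrt{\pi}\sigma^5}\), so that:

\[ h^* = \left(\frac{4}{3}\right)^{1/5}\hat{\sigma}n^{-1/5}\approx 1.06\,\hat{\sigma}n^{-1/5}\]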

  • We can plot non-parametric density estimates in R:
    • density(x, n=50, window='g', width)
    • x is a vector of data
    • n is the number of equally spaced points to compute the density
    • window is the kernel function to use
    • width specifies the bandwidth, which can be numerical or a string to call an automatic bandwidth selector
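A small usage sketch (simulated bimodal data; the bandwidth choices are purely illustrative):

``` r
set.seed(7)
x <- c(rnorm(150, mean = -1), rnorm(150, mean = 2))   # bimodal sample

# Gaussian kernel with an automatic (normal reference) bandwidth
d1 <- density(x, n = 200, window = "g", width = "nrd")
plot(d1, main = "Kernel density estimate")

# Same kernel with a small numeric bandwidth, for comparison
d2 <- density(x, n = 200, window = "g", width = 0.2)
lines(d2, lty = 2)
```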

Variability Plots

  • To give a graphical representation of the uncertainty of a kernel estimate we can use variability plots

  • These plots are created by drawing \(m\) bootstrapped samples of size \(n\), computing a density estimate for each resample at a common grid of values, and taking the upper/lower 2.5 percentage points at each grid point.

  • An example of how this could be done in R is sketched below
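A minimal sketch under assumed settings (simulated data, \(m = 200\) resamples, a 100-point grid):

``` r
set.seed(123)
x <- rnorm(100)
n <- length(x)
m <- 200                                   # number of bootstrap resamples
grid <- seq(min(x) - 1, max(x) + 1, length.out = 100)

# Density estimate for each resample, evaluated on the common grid
boot_est <- replicate(m, {
  xb <- sample(x, n, replace = TRUE)
  density(xb, from = min(grid), to = max(grid), n = length(grid))$y
})

# Pointwise 2.5% and 97.5% quantiles form the variability band
band <- apply(boot_est, 1, quantile, probs = c(0.025, 0.975))

plot(density(x, from = min(grid), to = max(grid), n = length(grid)),
     main = "Kernel estimate with variability band")
lines(grid, band[1, ], lty = 2)
lines(grid, band[2, ], lty = 2)
```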

Bivariate Kernel Density Estimation

  • We may wish to estimate bivariate or multivariate distributions, such as \(f(x,y)\)
    • \(F(x,y) = P(X\leq x, Y\leq y)\) for the bivariate distribution function
    • \(f(x,y)=\frac{\partial^2}{\partial x\partial y}F(x,y)\)
  • Estimator of \(F(x,y)\):
    • Note that \(I(x_j\leq x,y_j\leq y) = I(x_j\leq x)I(y_j\leq y)\), since both conditions must hold simultaneously

\[ \hat{F}(x,y) = \frac{1}{n}\sum_jI(x_j\leq x,y_j\leq y)\]

  • We approximate the EDF similarly as before, using a product of univariate smooth approximations, one in each coordinate:

\[ \hat{F}(x,y) = \frac{1}{n}\sum_jG\left(\frac{x-x_j}{h_1}\right)G\left(\frac{y-y_j}{h_2}\right)\]

\[ \hat{f}(x,y) = \frac{1}{nh_1h_2}\sum_jK\left(\frac{x-x_j}{h_1}\right)K\left(\frac{y-y_j}{h_2}\right)\]
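As an illustration, a bivariate estimate of this form can be computed with kde2d() from the MASS package, which uses Gaussian kernels and a product bandwidth \((h_1,h_2)\) (simulated data; the bandwidths here come from the normal reference rule bw.nrd(), an assumption for this sketch):

``` r
library(MASS)                      # provides kde2d()

set.seed(5)
x <- rnorm(200)
y <- 0.5 * x + rnorm(200, sd = 0.8)

# Product-kernel bivariate density estimate on a 50 x 50 grid;
# h = (h_1, h_2) are the bandwidths for the x and y directions
est <- kde2d(x, y, h = c(bw.nrd(x), bw.nrd(y)), n = 50)

# Contour plot of the estimated joint density f_hat(x, y)
contour(est, xlab = "x", ylab = "y")
points(x, y, pch = 20, cex = 0.4)
```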