- In this article, we will discuss the basic concepts of Kernel Density Estimation (KDE) and Kernel Regression, along with their applications.
Kernel Density Estimation (KDE)
- Let’s start with an example (from the edX course Applied Machine Learning by Microsoft): suppose we have data points representing crime rates along one street. As shown in the following figure, crime 1 happens at location 15, crime 2 happens at location 12, and so on.
- We want to know how likely it is for a crime to occur at a particular place along the street. In other words, we want to estimate the density of crimes along the street.
- So we’re going to assume that future crimes happen in similar places to where they’ve happened in the past: we place a bump on each past crime and add all the bumps up.
- We have some flexibility in the size of the bumps: we could choose narrower or wider bumps. If we choose wider bumps, the kernel density estimate will be smoother; if we choose narrower bumps, it will be peakier.
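A minimal sketch of this bump-summing idea, assuming Gaussian bumps and a small made-up set of crime locations (only 15 and 12 come from the example above; the rest are hypothetical):

```python
import numpy as np

def gaussian_kernel(u):
    # Standard normal density: positive and integrates to 1
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def kde(grid, data, h):
    # Place a bump (kernel) on each data point, sum the bumps,
    # and normalize by n*h so the result integrates to 1
    bumps = gaussian_kernel((grid[:, None] - data[None, :]) / h)
    return bumps.sum(axis=1) / (len(data) * h)

# Hypothetical crime locations; 15 and 12 are from the example,
# the others are made up for illustration
crimes = np.array([12.0, 15.0, 16.0, 22.0, 25.0, 26.0])
grid = np.linspace(0, 40, 400)

peaky = kde(grid, crimes, h=0.5)    # narrow bumps -> peakier estimate
smooth = kde(grid, crimes, h=3.0)   # wide bumps   -> smoother estimate
```

Plotting `peaky` and `smooth` against `grid` shows exactly the trade-off described above.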
- The following figure illustrates the formula used for KDE: given data points \(x_1, \dots, x_n\), the density estimate is \(\hat{f}_h(x) = \frac{1}{nh} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right)\), where \(K(\cdot)\) is a kernel, i.e., a positive function that integrates to 1, and \(h\) is a positive number representing the bandwidth of the kernel estimate of the density of the data.
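This estimator is also available off the shelf; for instance, scikit-learn’s `KernelDensity` computes the same quantity. A minimal sketch, reusing the hypothetical crime locations from above (`bandwidth` plays the role of \(h\)):

```python
import numpy as np
from sklearn.neighbors import KernelDensity

crimes = np.array([12.0, 15.0, 16.0, 22.0, 25.0, 26.0])  # hypothetical data
grid = np.linspace(0, 40, 400)

# scikit-learn expects 2-D arrays of shape (n_samples, n_features)
model = KernelDensity(kernel="gaussian", bandwidth=2.0)
model.fit(crimes.reshape(-1, 1))

# score_samples returns the log-density, so exponentiate to get the density
density = np.exp(model.score_samples(grid.reshape(-1, 1)))
```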
- The next animations show how the density function estimated by KDE tends to over-fit the data for low values of the bandwidth parameter and under-fit it for high values. They also show the density functions estimated using different kernel functions.
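A sketch of such an experiment, sweeping both the kernel and the bandwidth (the kernel names below are scikit-learn’s built-in options; the data are the hypothetical locations used earlier):

```python
import numpy as np
from sklearn.neighbors import KernelDensity

crimes = np.array([12.0, 15.0, 16.0, 22.0, 25.0, 26.0]).reshape(-1, 1)
grid = np.linspace(0, 40, 400).reshape(-1, 1)

estimates = {}
for kernel in ["gaussian", "tophat", "epanechnikov"]:
    for h in [0.3, 1.0, 5.0]:  # low h over-fits (peaky), high h under-fits (smooth)
        model = KernelDensity(kernel=kernel, bandwidth=h).fit(crimes)
        estimates[(kernel, h)] = np.exp(model.score_samples(grid))
# each entry of `estimates` can now be plotted against `grid`
```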

- The next animation shows the density function estimated by KDE in 2 dimensions.
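The 2-D case works in the same spirit; for example, `scipy.stats.gaussian_kde` handles multivariate data directly. A sketch with made-up 2-D points (e.g., crime locations on a map):

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
# Made-up 2-D sample of shape (n_dims, n_samples), as gaussian_kde expects
points = rng.normal(loc=[[10.0], [20.0]], scale=[[2.0], [3.0]], size=(2, 200))

kde2d = gaussian_kde(points)

# Evaluate the estimated density on a grid for plotting (e.g., as a contour map)
xs, ys = np.mgrid[0:20:100j, 10:30:100j]
density = kde2d(np.vstack([xs.ravel(), ys.ravel()])).reshape(xs.shape)
```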
