- In this article, we will discuss the basic concepts of Kernel Density Estimation (KDE) and Kernel Regression, along with their applications.
Kernel Density Estimation (KDE)
- Let’s start with an example (from the edX course Applied Machine Learning by Microsoft): suppose we have data points representing crime rates along one street. As shown in the following figure, crime 1 happens at location 15, crime 2 happens at location 12, and so on.
- We want to know how likely it is for a crime to occur at a particular place along the street. In other words, we want to estimate the density of crimes along the street.
- So we’re going to assume that future crimes happen in similar places to where they’ve happened in the past: we place a bump on each past crime and add all the bumps up.
- We have some flexibility in the size of the bumps: we could choose narrower or wider bumps. If we choose wider bumps, the kernel density estimate will be smoother; if we choose narrower bumps, it will be peakier.
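A minimal sketch of this bump-summing idea, assuming Gaussian bumps and a small made-up set of crime locations (only 15 and 12 come from the example above; the rest are hypothetical):

```python
import numpy as np

def gaussian_kernel(u):
    # Standard normal density: positive and integrates to 1
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def kde(grid, data, h):
    # Place a bump (kernel) on each data point, sum the bumps,
    # and normalize by n*h so the result integrates to 1
    bumps = gaussian_kernel((grid[:, None] - data[None, :]) / h)
    return bumps.sum(axis=1) / (len(data) * h)

# Hypothetical crime locations; 15 and 12 are from the example,
# the others are made up for illustration
crimes = np.array([12.0, 15.0, 16.0, 22.0, 25.0, 26.0])
grid = np.linspace(0, 40, 400)

peaky = kde(grid, crimes, h=0.5)    # narrow bumps -> peakier estimate
smooth = kde(grid, crimes, h=3.0)   # wide bumps   -> smoother estimate
```

Plotting `peaky` and `smooth` against `grid` shows exactly the trade-off described above.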
- The following figure illustrates the formula used for KDE: given data points \(x_1, \dots, x_n\), the density estimate is \(\hat{f}_h(x) = \frac{1}{nh} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right)\), where \(K(\cdot)\) is a kernel, i.e., a positive function that integrates to 1, and \(h\) is a positive number representing the bandwidth of the kernel estimate of the density of the data.
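This estimator is also available off the shelf; for instance, scikit-learn’s `KernelDensity` computes the same quantity. A minimal sketch, reusing the hypothetical crime locations from above (`bandwidth` plays the role of \(h\)):

```python
import numpy as np
from sklearn.neighbors import KernelDensity

crimes = np.array([12.0, 15.0, 16.0, 22.0, 25.0, 26.0])  # hypothetical data
grid = np.linspace(0, 40, 400)

# scikit-learn expects 2-D arrays of shape (n_samples, n_features)
model = KernelDensity(kernel="gaussian", bandwidth=2.0)
model.fit(crimes.reshape(-1, 1))

# score_samples returns the log-density, so exponentiate to get the density
density = np.exp(model.score_samples(grid.reshape(-1, 1)))
```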
- The next animations show how the density function estimated by KDE tends to over-fit the data for low values of the bandwidth parameter and under-fit it for high values. They also show the density functions estimated using different kernel functions.
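A sketch of such an experiment, sweeping both the kernel and the bandwidth (the kernel names below are scikit-learn’s built-in options; the data are the hypothetical locations used earlier):

```python
import numpy as np
from sklearn.neighbors import KernelDensity

crimes = np.array([12.0, 15.0, 16.0, 22.0, 25.0, 26.0]).reshape(-1, 1)
grid = np.linspace(0, 40, 400).reshape(-1, 1)

estimates = {}
for kernel in ["gaussian", "tophat", "epanechnikov"]:
    for h in [0.3, 1.0, 5.0]:  # low h over-fits (peaky), high h under-fits (smooth)
        model = KernelDensity(kernel=kernel, bandwidth=h).fit(crimes)
        estimates[(kernel, h)] = np.exp(model.score_samples(grid))
# each entry of `estimates` can now be plotted against `grid`
```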

- The next animation shows the density function estimated by KDE in 2 dimensions.
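The 2-D case works in the same spirit; for example, `scipy.stats.gaussian_kde` handles multivariate data directly. A sketch with made-up 2-D points (e.g., crime locations on a map):

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
# Made-up 2-D sample of shape (n_dims, n_samples), as gaussian_kde expects
points = rng.normal(loc=[[10.0], [20.0]], scale=[[2.0], [3.0]], size=(2, 200))

kde2d = gaussian_kde(points)

# Evaluate the estimated density on a grid for plotting (e.g., as a contour map)
xs, ys = np.mgrid[0:20:100j, 10:30:100j]
density = kde2d(np.vstack([xs.ravel(), ys.ravel()])).reshape(xs.shape)
```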
