Understanding Gaussian Kernel Density:

A ‘by (R)Hand’ Introduction

Marc Coca Moreno

2019-07-26

Abstract

Kernel Density Estimations (KDE) are beautiful and, sometimes, they come in very handy when your data is continuous or does not follow a parametric distribution (e.g. a normal distribution). However, unlike histograms, KDE may be a little harder to explain.

The following document aims to introduce and explain in an intuitive and pleasant way the raison d’être and the calculations behind one of the most famous Kernel Estimations: the Gaussian Kernel.

As the title suggests, most of the calculations will be computed, more or less, ‘by hand’; and yes, it says ‘more or less’ because we are going to use R for the simple calculations. However, because of the introductory nature of the article, more complex topics like integration are set aside. The visualization power of kernel densities is the main interest of this publication.

The article is structured into two parts: (1) the first part, the introduction, reviews some of the basic concepts of the normal distribution and raises the question of Kernel Densities; (2) the second part explains how to implement the Gaussian Kernel Density in R without using predefined functions. You can skip the first part if you don’t need it.

INTRODUCTION:
in Statistics, there is always something more that you don’t know


M. E. Tarter & R. A. Kronmal (1976), “An Introduction to the Implementation and Theory of Nonparametric Density Estimation”, The American Statistician, 30 (3), 105.
https://www.tandfonline.com/doi/abs/10.1080/00031305.1976.10479153?journalCode=utas20

There is a tendency to take for granted and leave unchanged the things we use most often. Also, the more we use a technique, the less critical we frequently become of its use. Perhaps this is why the histogram has so often been used and is still used for purposes for which it is quite poorly suited and for which it was never intended.


Distributions are important, or so we were told. For this reason, you will probably remember your very first steps in Statistics with one simple word: Histogram.

__Figure 1. Histogram__

Histograms are a very quick and simple way to get a glance at your data distribution. As a result, you establish a link: “If data distribution is important and histograms let me plot it, it is probably better that I fully understand how to draw them”.

__Johann Carl Friedrich Gauß (1777 - 1855)__

However, as the days go by, you start asking yourself: “Why are data distributions so important?”. And then the Ghost of Christmas Past appears on the scene: Johann Carl Friedrich Gauß —or Gauss.

Carl Friedrich Gauss was a German mathematician; he was born on 30 April 1777 in Brunswick and died on 23 February 1855 in Göttingen. He was therefore, in some way, an heir to the flourishing of Science and Mathematics known as the Scientific Revolution and the Age of Enlightenment.

The capacity for abstraction was on the rise in Mathematics, and with it the capacity to experiment with more and more complex events, even with those that are terribly uncertain.

Yes, Gauss made a large number of contributions in many fields, but we are mainly interested in a particular one: the Normal, Gaussian or Laplace-Gauss Distribution.

It is said that the Normal Distribution is behind many real situations: the height of a population, the results of tossing a coin, students’ averages, blood pressure, etc. In a very simplified form and assuming an ideal situation, the Normal Distribution states the following: given a mean, you can expect that it is more probable to observe values closer to the mean. In other words: the mean is representative of the data.

Equation 1. Gaussian Function: \[f(x) = \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}\] Where \(\mu\) is the mean, \(x\) is the observed value and \(\sigma\) is the standard deviation.

Equation 1.1 Gaussian Function: \[f(x) = \frac{e^{\frac{-(x-\mu)^2}{2\sigma^2}}}{\sigma\sqrt{2\pi}}\]

The possible range of values is also distributed symmetrically. For example, assume you have an array of data, you compute the mean and, surprise surprise, the value is 0 and the standard deviation is 1. You don’t have the entire population, you just have a sample; don’t worry, you can expect that the probabilities of the possible values you can collect will follow a Normal Distribution that is defined by the Gaussian function (equation 1).1 You can find some variations of the formula; a normalized one that is also very common has been chosen here.

We can simplify the equation a bit and, obviously, we can plot it using R!2 When you talk about probabilities, you expect them to sum to 1. Don’t try to sum up the results, though: 1 is the expected area under the curve, not the sum of the point values. This topic involves integrals and they are not covered here.

__Figure 2. An ideal situation: the Gaussian Bell__

library(ggplot2)

#Values
Mean = 0
SD = 1
Values = seq(-3,3, 0.1)

#Function
Gaussian_F = function(Mean, Standard_deviation, Observed_values){
  y = 
    exp(-(Observed_values-Mean)^2/(2*Standard_deviation^2))/
    (Standard_deviation*sqrt(2*pi))
  
  return(y)
}
curve = Gaussian_F(Mean, SD, Values)

#Data
GBell = data.frame(Y = curve, X = Values)

#Plot
ggplot(GBell, aes(x = X, y = Y))+
  geom_line()+
  labs(x="Observed values", y="f(x)")+
  geom_vline(xintercept = 0)+
  annotate(
    "text", x=0.35, y=0.2, label="bold(Mean)", parse=T
    )+
  annotate(
    "text", x=0.35, y=0.15, label="bold(Mode)", parse=T
    )+
  annotate(
    "text", x=0.43, y=0.1, label="bold(Median)", parse=T
    )+
  theme_classic()+
  theme(axis.title.x = element_text(face = "bold"), 
        axis.title.y = element_text(face = "bold"),
        text = element_text(
          family="Times New Roman")
        )
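
A quick numerical check of the footnote above, reusing the curve object we just computed: the point values themselves do not sum to 1, but value times grid spacing (a crude Riemann sum) does approximate the area under the bell.

#Quick check: the values alone don't sum to 1...
sum(curve)        #roughly 10, not 1
#...but value * grid spacing approximates the area under the curve
sum(curve) * 0.1  #close to 1 over the range [-3, 3]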

In summary, you can expect that the more samples you collect, the more bell-shaped the histogram of their means will be; that, roughly, is what the Central Limit Theorem (CLT) says. Obviously, assuming that the entire population of your variable behaves under the Normal Distribution, you can also deduce more things: confidence intervals, hypothesis tests, etc.
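
If you want to see the CLT in action, here is a minimal base-R sketch (the exponential distribution is just a deliberately non-normal choice for the example): the raw data is skewed, but the means of many small samples pile up in a bell shape.

#A minimal CLT sketch: means of 5000 samples of size 30 from a skewed distribution
set.seed(1)
sample_means = replicate(5000, mean(rexp(30)))
hist(sample_means, breaks = 40,
     main = "Means of 5000 exponential samples (n = 30)", xlab = "Sample mean")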

Ok, you understand the principles of the normal distribution, the ideas of CLT and you obviously know how to draw histograms. You are prepared to collect data and tell some interesting things about it.

So, you decide to collect data about iris flowers, for example, and you draw a histogram of petal length split by species (figure 3).

__Figure 3. Main Distribution of Petal Length__

ggplot(iris, aes(Petal.Length, fill=Species))+
  geom_histogram(binwidth = 0.05)+
  labs(x="Petal Length", y = "Frequency", fill="Species")+
  theme_classic()+
  theme(
    axis.title.x = element_text(face = "bold"), 
    axis.title.y = element_text(face = "bold"),
    legend.title = element_text(face = "bold"),
    text = element_text(
          family="Times New Roman")
        )

You detect that the three species behave differently and, to prove it, you take another step forward and decide to compare their averages and confidence intervals (95%), because you understand the Gaussian thing so well (figure 4).

__Figure 4. A different mean__

library(tidyverse)

#Data
means=iris%>%
  group_by(Species)%>%
  summarise(M = mean(Petal.Length), SD = sd(Petal.Length), N = n())

#Sample size of 50. T Distribution intervals
means$error=qt(0.975, df=50-1)*means$SD/sqrt(means$N)
means$upper=means$M+means$error
means$lower=means$M-means$error

ggplot(data=means)+
  geom_crossbar(
    aes(x=Species, y=M, ymin=lower, ymax=upper, fill=Species), 
    show.legend = F)+
  scale_color_manual(values=c("blue", "green", "yellow"))+
  labs(x="Species", y="Mean and CI 95%")+
  theme_classic()+
  theme(
    axis.title.x = element_text(face = "bold"), 
    axis.title.y = element_text(face = "bold"),
    text = element_text(
          family="Times New Roman")
        )

Congratulations, your hypotheses are correct and so are your plots. However, life can be dark and full of terrors sometimes, because a friend of yours makes the following remark: “Have you ever heard about density plots? Given your continuous data, the size of the sample and its behavior, I think a density estimation would be more appropriate than a traditional histogram.”

At first, you don’t think much about it because, in your case, this is just a matter of beauty and does not affect your experiment at all. However, as the days go by, your curiosity begins to grow and you decide to give it a try (figure 5).

ggplot(iris, aes(Petal.Length, fill=Species, colour = Species))+
  geom_density(alpha=0.1, size=0.8)+
  scale_fill_manual(values=c("red", "green", "blue"))+
  scale_color_manual(values=c("red", "green", "blue"))+
  geom_vline(xintercept = c(means$M))+
  annotate("rect",xmin=means$lower, xmax=means$upper, ymin=0, ymax=Inf, alpha=0.2)+
  annotate("text", x = 1.8, y=2.7, label="bold(Mean)", parse=T)+
  annotate("text", x = 1.87, y=2.5, label="CI 95%")+
  labs(x="Petal Length", y = "Density", fill="Species")+
  theme_classic()+
  theme(
    axis.title.x = element_text(face = "bold"), 
    axis.title.y = element_text(face = "bold"),
    legend.title = element_text(face = "bold"),
    text = element_text(
          family="Times New Roman")
        )

__Figure 5. Distribution of Petal Length based on a Gaussian Kernel Density Estimation__

“Ok, it’s true, kernel densities are beautiful!”, you think. Not just beautiful: the density estimation lets you see the flow of the data better, because the values are visualized as a continuum and not as discrete bins. That’s the proper way to work with continuous data! “Maybe, using histograms everywhere, I have just been a fascist all my life”. Obviously not, fascists don’t know what a histogram is.

And, of course, you begin to think about more applications: is it possible to generate new data based on these more ad hoc distributions? What confidence intervals would I get with these probabilities? No, we are not going to talk about these topics here because they are too complex.

The density estimation also lets you see that, even though the means are significantly different, there is a region of confusion between versicolor and virginica. Yes, you can tell that with a histogram, but it is not as clear as with a smoothed density, is it?

Something has changed, though, and you have noticed it. Yes, the shapes are closely related to the histogram, but the y-axis is showing strange values. Well, at least the x-axis seems to be the “same”.

No, the y-axis values are not weird: they are just the sum and normalization of the Gaussian bells computed around every point. That is essentially a kernel estimation: a sum of parametric distributions, one produced by each point, given some parameters.3 Note that this is not the full inferred area: the limits of the x-axis are the min and max values of the data and, of course, there is some probability that new values exceed those limits. Ggplot just cuts those probabilities off by default.

In any case, you know that in statistics, when you use some funny technique that involves estimations, transformations and additional parameters, it is just a matter of time before somebody asks you tricky things like: Why a Gaussian Kernel and not Epanechnikov? What value did you use as the smoothing parameter (bandwidth)? Etc.

No, you don’t need to give a scientific and extremely formal answer to those questions: remember that a kernel estimation is a non-parametric method and, as is usual with this kind of technique, the choice of one value or another sometimes depends more on experimentation (based on similar works or general ideas, for example) than on a mandatory formula. This is what is called a “rule of thumb”.

Either way, you need to understand the fundamentals behind kernel estimations and, of course, you should feel comfortable with some basic concepts. And that is the main point of this article: to give you confidence through a simple explanation and some ‘by hand’ exercises. Let’s start!

SUM THEM ALL!
The Gaussian Kernel

Rosenblatt, M. (1956), “Remarks on some nonparametric estimates of a density function”. Ann. Math. Statist., 27, 832–37

Parzen, E. (1962), “On estimation of a probability density function and mode”. Ann. Math. Statist., 33, 1065–76.

Kernel Density Functions were popularized by Rosenblatt (1956) and Parzen (1962). Given a strange data distribution of a random variable, the main purpose of their investigations was to find a way to estimate a more appropriate probability function for “non-conformist” data shapes/distributions.

Equation 2 Kernel Density Estimator: \[f(x) = \frac{1}{n}\sum_{i=1}^{n}K_h(x-x_i) = \frac{1}{nh}\sum_{i=1}^{n}K(\frac{x-x_{i}}{h})\]

Formally, a kernel density estimator is defined by equation 2, where \(n\) is the number of data points, \(h\) is the bandwidth (smoothing parameter) and \(K\) is the kernel function. Of course, we already know what function we want: the Gaussian Function.4 Wikipedia.
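
If you are impatient, equation 2 with a Gaussian K fits in a couple of lines of R; dnorm() is just the standard normal density that we will write out by hand below, and kde_gauss is only an illustrative name. Treat this as a compact preview of what the rest of the article builds step by step.

#Equation 2 with a Gaussian kernel, in compact vectorized form
kde_gauss = function(x, data, h = 1){
  #For each evaluation point: the mean of the kernels K((x - xi)/h), divided by h
  sapply(x, function(xi) mean(dnorm((xi - data)/h))/h)
}
kde_gauss(c(1.5, 4, 6), iris$Petal.Length, h = 1)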

At first glance, this new definition seems easy: you just have to replace the K part of the equation with the Gaussian Function and voilà. However, you know it cannot be so easy.

Some new parameters have appeared, and mathematical symbols too. First and foremost, let’s talk about bandwidth. Bandwidth is essentially your brand new standard deviation. Yes, it also replaces the sigma parameter in the K part.

It plays a major role in the equation, as it controls the smoothness of the kernel curve. So, basically, this value is going to be responsible for the appropriateness of the estimation and how spread out it is going to be.

No, there is no magic equation that will tell you which value is best. There are just some “rules of thumb”, such as Silverman’s rule of thumb.5 Silverman, B. W. (1986), Density Estimation for Statistics and Data Analysis, London: Chapman & Hall. In this exercise we keep things easy: the bandwidth will be 1 by default.
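
For the record, Silverman’s rule of thumb is simple enough to write by hand: 0.9 times the smaller of the standard deviation and IQR/1.34, times n^(-1/5). The sketch below mirrors what R’s bw.nrd0() computes for Petal Length.

#Silverman's rule of thumb, written out by hand
x = iris$Petal.Length
0.9 * min(sd(x), IQR(x)/1.34) * length(x)^(-1/5)
#It should match R's built-in implementation
bw.nrd0(x)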

Secondly, a summation symbol has appeared, and it seems to suggest that the results of the K part (the Gaussian Function in our case) must be summed. This can be a little confusing: how many Gaussian bells do I have? Where do I find them? What is the result supposed to be?

That is the funny thing about kernel estimations: you compute some “ideal” distribution around each data point (a Gaussian bell, for example) and then you sum them all. The result is your kernel density estimation.

Let’s see one example!6 It is important to emphasize that this part is strongly inspired by the following Homework Help article: Python Tutorial

There is something weird around the point: one-point example

Ok, first of all, let’s remember that the K part of the equation is the chosen probability function and that sigma (the standard deviation) is now the bandwidth.

Another doubt appears on the scene: if we are going to compute the Gaussian distribution of every point individually, where are the mean and the observed values that we have seen before? Easy: the mean is now the point (the observation) and the observed values become just a desired range of numbers. Think of it as the range of the horizontal axis of a plot.

Whatever the case, let’s start with the one-point example. We choose a random point of Petal Length and a sequence of values (the range), and then we plot the Gaussian area around it.

You may be thinking: what is the correct range given a point? Don’t worry about that; there is just one obvious and logical rule: the value of the point must be in the range. Another question: if you want to plot the point as well, what is its y-value?

All these questions are answered by the same element: the equation is normalized. Because of that, the range you choose can be as wide as you want; the density will approach 0 at some point. The y-value of the point is the peak of the bell; for example, when the bandwidth or standard deviation is 1, the peak is going to be ~0.4.
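
That ~0.4 is not magic: it is simply the value of the Gaussian function at its own mean, 1/(b*sqrt(2*pi)), which for b = 1 you can check in one line.

#Peak of the bell when the bandwidth (or standard deviation) is 1
1/(1*sqrt(2*pi))
## [1] 0.3989423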

#A point
point = iris$Petal.Length[7]
point
## [1] 1.4
#Some range
range = seq(-3, 6, 0.01)

#A simple bandwidth
b = 1

#The estimation
GaussianPoint = data.frame(
  X = range,
  Y = 
    exp(-(range-point)^2/(2*b^2))/(b*sqrt(2*pi))/
    length(point)
  )

#A song of Plots and Points
ggplot()+
  geom_area(data = GaussianPoint, aes(x=X, y=Y), fill = "blue", alpha=0.5)+
  geom_point(aes(x=1.4, y=max(GaussianPoint$Y)))+
    theme_classic()+
  theme(
    axis.title.x = element_text(face = "bold"), 
    axis.title.y = element_text(face = "bold"),
    legend.title = element_text(face = "bold"),
    text = element_text(
          family="Times New Roman", size=15)
        )

__Figure 6. Gaussian Density of a point__

Two-point example

With two points, the fun begins. Now we have to use the whole formula: we are going to compute the Gaussian bells of the two points, sum them up and then normalize them with the first part of equation 2. It is that simple (figure 7).

Just one little difference in the code: the first part of the equation, the one that ensures the integral is 1, is now delayed until the sum is done.

#Two points
point1 = iris$Petal.Length[7]
point2 = iris$Petal.Length[65]
point1
## [1] 1.4
point2
## [1] 3.6
#An exaggerated range
range = seq(-10, 10, 0.01)

#A simple bandwidth
b = 1

#Number of points
n = 2

#The estimation of points
GaussianPoints = data.frame(
  X = range,
  Y1 = exp(-(range-point1)^2/(2*b^2))/(b*sqrt(2*pi)),
  Y2 = exp(-(range-point2)^2/(2*b^2))/(b*sqrt(2*pi))
  )

#Kernel density
GaussianPoints = GaussianPoints%>%
  mutate(Kernel = (Y1+Y2)/n)


#Two-point plot
ggplot()+
  geom_area(data = GaussianPoints, aes(x=X, y=Y1, fill = "Point 1"), alpha=0.5)+
  geom_area(data = GaussianPoints, aes(x=X, y=Y2, fill = "Point 2"), alpha=0.5)+
  geom_area(data = GaussianPoints, aes(x=X, y=Kernel, fill = "Kernel"), alpha=0.7)+
  scale_fill_manual(values = c("red", "yellow", "blue"))+
  geom_point(aes(x=1.4, y=max(GaussianPoints$Y1)))+
  geom_point(aes(x=3.6, y=max(GaussianPoints$Y2)))+
  labs(x = "Range", y="Densities", fill="Estimations")+
    theme_classic()+
  theme(
    axis.title.x = element_text(face = "bold"), 
    axis.title.y = element_text(face = "bold"),
    legend.title = element_text(face = "bold"),
    text = element_text(
          family="Times New Roman", size=15)
        )

__Figure 7. Gaussian Density of two points__

Of course, you have noticed a problem: like ggplot does, I don’t care about the area beyond the values of the points. As we are not going to consider integration and probabilities, we can afford the luxury of cutting the x-axis (figure 7.1). However, with just 2 data points, the plot looks weird.

#Two-point plot
ggplot()+
  geom_area(data = GaussianPoints, aes(x=X, y=Y1, fill = "Point 1"), alpha=0.5)+
  geom_area(data = GaussianPoints, aes(x=X, y=Y2, fill = "Point 2"), alpha=0.5)+
  geom_area(data = GaussianPoints, aes(x=X, y=Kernel, fill = "Kernel"), alpha=0.7)+
  scale_fill_manual(values = c("red", "yellow", "blue"))+
  geom_point(aes(x=1.4, y=max(GaussianPoints$Y1)))+
  geom_point(aes(x=3.6, y=max(GaussianPoints$Y2)))+
  xlim(1.4,3.6)+
  labs(x = "Range", y="Densities", fill="Estimations")+
    theme_classic()+
  theme(
    axis.title.x = element_text(face = "bold"), 
    axis.title.y = element_text(face = "bold"),
    legend.title = element_text(face = "bold"),
    text = element_text(
          family="Times New Roman", size=15)
        )

__Figure 7.1. Gaussian Density of two points__

All in all, you’re just another point in the wall

Finally, we are at the most interesting part of all: we already know how to compute the Kernel Density of two points, and for three points it is the same. So let’s code a function that can do these calculations for any number of points, and let’s compare the results with the ggplot ones.

As we want to visualize both the Gaussian bells of the points and the final Kernel Density, the function will return 3 data frames in a list: the first one with the individual bells, the second one with the Kernel Density and, finally, the last one with the points and their max values. The first one is structured in long format, as it is easier to plot and transform.

#The function
Gauss_K = function(Values, Range, h=1){
  library(dplyr) 

  #Result
  densities = data.frame()
  #Temporal variable
  Temp = data.frame()
  #One by one variable
  V = vector()

  for (i in 1:length(Values))
    {
      #The value
      V = Values[i]
      #Gaussian Function
      Temp =
        data.frame(
          Density = exp(-(Range-V)^2/(2*h^2))/(h*sqrt(2*pi)),
          Range = Range,
          Value = as.factor(paste0("X",i))
    )
    densities = rbind(densities, Temp)
  }
  Densities1 = densities
  Densities2 = densities%>%
    #Sum K by range value!
    group_by(Range)%>%
    summarise(Bell_sum = sum(Density))%>%
    #Normalization
    mutate(Kernel_Density = Bell_sum /length(Values))
  #Data Frame with points
  Points = Densities1%>%
    group_by(Value)%>%
    summarise(Y = max(Density))%>%
    mutate(Value = Values)

  
return(
  list(Densities1, Densities2, Points)
  )
}

Now that we have the function, let’s try 5 random points of the iris data set (figure 8).

set.seed(1234)
S = sample(iris$Petal.Length, 5)
S
## [1] 1.5 3.5 6.0 5.1 5.6
R = seq(min(S)-5, max(S)+5, 0.01)
Points5 = Gauss_K(S, R, h = 1)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
#Just for fun...
library(png)
pic1 = readPNG("chairs.png")
pic2 = readPNG("vexels.png")
pic1 = grid::rasterGrob(pic1, interpolate = T)
pic2 = grid::rasterGrob(pic2, interpolate = T)
ggplot(Points5[[2]], aes(x=Range, y = Kernel_Density))+
  geom_area(data = Points5[[1]], aes(x= Range, y=Density, fill = Value), 
  alpha = 0.3, show.legend = F, position = "identity")+
  geom_area(fill = "blue")+
  geom_point(data = Points5[[3]], aes(x= Value, y= Y))+
  labs(x="Range", y="Density")+
    theme_classic()+
  theme(
    axis.title.x = element_text(face = "bold"), 
    axis.title.y = element_text(face = "bold"),
    legend.title = element_text(face = "bold"),
    text = element_text(
          family="Times New Roman", size=15)
        )+
  annotation_custom(pic1, xmin=1.1, xmax=3.5, ymin=0.001, ymax=0.3)+
  annotation_custom(pic2, xmin=4.7, xmax=6.1, ymin=0.13, ymax=0.4)

__Figure 8. The Kernel Jazz__

So beautiful, isn’t it?

Our happy ending: and what about ggplot?

To conclude, it may be of some interest to compare the ggplot estimation with our function.

We must keep in mind just one simple thing: by default, ggplot picks the bandwidth for you using Silverman’s rule of thumb, which is implemented by the function bw.nrd0(). If you want to change it, geom_density() lets you rescale that default with the adjust multiplier, or set it directly with the bw argument.
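
A minimal sketch of those two knobs (assuming a reasonably recent ggplot2): the bandwidth geom_density() would pick for Petal Length by default, and the adjust multiplier rescaling it.

#The bandwidth geom_density() picks by default for this variable
bw.nrd0(iris$Petal.Length)

#Half and double the default smoothing via the adjust multiplier
ggplot(iris, aes(Petal.Length))+
  geom_density(adjust = 0.5, colour = "red")+
  geom_density(adjust = 2, colour = "blue")+
  theme_classic()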

In any case, let’s plot the density of all 150 points of petal length with our function and with geom_density. We are not considering the species now, nor Silverman’s rule (figure 9).

data = iris$Petal.Length
Range = seq(-5, 10, 0.1)
OurKernel = Gauss_K(data, Range, h = 1)
ggplot()+
  #Our estimation
  
  geom_line(data=OurKernel[[1]], aes(x=Range, y=Density, col=Value), 
            show.legend = F,alpha=0.3, position = "identity")+
  
  geom_area(data = OurKernel[[2]], aes(x=Range, y=Kernel_Density, 
                                       fill = "Our Density"), 
            alpha=0.8)+
  
  geom_point(data = OurKernel[[3]], aes(x = Value, y = Y))+
  
  #Ggplot function
  geom_density(data = data.frame(data), aes(x = data, 
                                            fill="GGplot Density"), 
               alpha = 0.5)+
  
  scale_fill_manual(values = c("blue", "red"))+

  labs(x = "Range", y="Densities", fill="Estimations")+
    theme_classic()+
  theme(
    axis.title.x = element_text(face = "bold"), 
    axis.title.y = element_text(face = "bold"),
    legend.title = element_text(face = "bold"),
    text = element_text(
          family="Times New Roman", size=15)
        )

__Figure 9. Comparison__

Now, if you want to be 100% sure that the calculations are correct, you can use the same bandwidth as ggplot. The results should be practically identical (figure 10).

data = iris$Petal.Length
Range = seq(-5, 10, 0.1)
OurKernel = Gauss_K(data, Range, h = bw.nrd0(data))
ggplot()+
  #Our estimation
  
  geom_line(data=OurKernel[[1]], aes(x=Range, y=Density, col=Value), 
            show.legend = F,alpha=0.3, position = "identity")+
  
  geom_area(data = OurKernel[[2]], aes(x=Range, y=Kernel_Density, 
                                       fill = "Our Density"), 
            alpha=0.8)+
  
  geom_point(data = OurKernel[[3]], aes(x = Value, y = Y))+
  
  #Ggplot function
  geom_density(data = data.frame(data), aes(x = data, 
                                            fill="GGplot Density"), 
               alpha = 0.5)+
  scale_fill_manual(values = c("blue", "red"))+
  
  xlim(-3,10)+
  labs(x = "Range", y="Densities", fill="Estimations")+
    theme_classic()+
  theme(
    axis.title.x = element_text(face = "bold"), 
    axis.title.y = element_text(face = "bold"),
    legend.title = element_text(face = "bold"),
    text = element_text(
          family="Times New Roman", size=15)
        )

__Figure 10. Like two peas in a pod__
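
If you prefer numbers to pictures, base R’s density() (the function geom_density() relies on) can be evaluated on the same grid; its FFT-based implementation introduces a tiny binning error, so expect the values to agree with our function up to a very small difference rather than exactly.

#Numerical comparison against base R's density() on the same grid
d = density(data, bw = bw.nrd0(data), from = -5, to = 10, n = length(Range))
max(abs(d$y - OurKernel[[2]]$Kernel_Density))  #should be very small
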
AND THAT’S ALL FOLKS!