2025-11-16

Overview

Kernel Density Estimation (KDE) is a non-parametric method for estimating the probability density function of a continuous variable from a finite sample.

Key Advantages:

  • Creates smooth, continuous density curves
  • No parametric form assumed for the underlying distribution
  • Flexible: adapts to skewed and multimodal data

Applications:

  • Machine learning & pattern recognition
  • Signal processing & anomaly detection
  • Data visualization & exploratory analysis

The Mathematics: KDE Formula

The kernel density estimate at a point \(x\) is given by:

\[ \hat{f}_h(x) = \frac{1}{nh} \sum_{i=1}^n K\!\left(\frac{x - X_i}{h}\right) \]

Where:

  • \(K(\cdot)\) is the kernel function (typically Gaussian)
  • \(h\) is the bandwidth parameter (controls smoothness)
  • \(n\) is the number of observations
  • \(X_i\) are the data points
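As a rough illustration of the formula, the estimator can be written out directly in base R. This is a minimal sketch, not part of the original slides: the helper name kde_manual, the hand-picked bandwidth h = 4, and the 512-point grid are assumptions made for the example, and the Gaussian kernel is taken to be dnorm().

```r
# Gaussian kernel K(u), taken here to be the standard normal density
K <- function(u) dnorm(u)

# Direct evaluation of f_hat_h(x) = 1 / (n * h) * sum_i K((x - X_i) / h)
# kde_manual is a hypothetical helper name used only for this sketch
kde_manual <- function(x, data, h) {
  n <- length(data)
  vapply(x, function(x0) sum(K((x0 - data) / h)) / (n * h), numeric(1))
}

X <- faithful$waiting   # built-in Old Faithful waiting times
h <- 4                  # bandwidth in minutes, picked by hand for illustration

# Evaluation grid that extends 3 bandwidths beyond the data range
grid <- seq(min(X) - 3 * h, max(X) + 3 * h, length.out = 512)
f_hat <- kde_manual(grid, X, h)

# Cross-check against stats::density() with the same kernel, bandwidth, and grid
ref <- density(X, bw = h, kernel = "gaussian",
               from = min(grid), to = max(grid), n = 512)
max_abs_diff <- max(abs(f_hat - ref$y))

# Riemann-sum check that the estimate integrates to roughly 1
area <- sum(f_hat) * (grid[2] - grid[1])
```

The difference from density() should be small but not exactly zero, since density() bins the data and uses an FFT-based approximation rather than the direct sum.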

Kernel Function & Bandwidth Selection

Gaussian Kernel (most common):

\[ K(u) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{u^2}{2}\right) \]
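The kernel formula can be checked numerically. The sketch below (an illustration added here, with K as an assumed name) confirms that the expression matches R's dnorm(), integrates to one, and has unit second moment, which is why \(h\) acts as the standard deviation of each Gaussian bump.

```r
# Gaussian kernel written out exactly as in the formula
K <- function(u) exp(-u^2 / 2) / sqrt(2 * pi)

# It coincides with the standard normal density on a few test points
u <- seq(-4, 4, by = 0.5)
same_as_dnorm <- isTRUE(all.equal(K(u), dnorm(u)))

# A valid kernel integrates to 1
total_mass <- integrate(K, -Inf, Inf)$value

# Unit second moment: the rescaled kernel K(u / h) / h has standard deviation h
second_moment <- integrate(function(u) u^2 * K(u), -Inf, Inf)$value
```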

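Silverman’s Rule of Thumb for Bandwidth (normal reference form):

\[ h = 1.06 \, \sigma \, n^{-1/5} \]

Here \(\sigma\) is the sample standard deviation of the data. The more robust variant \(h = 0.9 \, \min(\sigma, \mathrm{IQR}/1.34) \, n^{-1/5}\) is the default used by density() in R (bw.nrd0).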

Critical Parameter: The choice of bandwidth \(h\) affects the estimate far more than the choice of kernel

  • Too small → noisy, overfitted estimates
  • Too large → oversmoothed, misses features
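The effect of the bandwidth can be seen by computing the rule of thumb by hand and then deliberately shrinking and inflating it. This is a sketch under assumptions: the factor of 5 in each direction and the helper count_modes are arbitrary choices for illustration, while bw.nrd() and bw.nrd0() are the selectors shipped with base R.

```r
x <- faithful$waiting   # built-in Old Faithful waiting times
n <- length(x)

# Rule of thumb written out: h = 1.06 * sigma * n^(-1/5)
h_rule <- 1.06 * sd(x) * n^(-1/5)

# Built-in selectors from the stats package
h_nrd  <- bw.nrd(x)    # 1.06 * min(sd, IQR / 1.34) * n^(-1/5)
h_nrd0 <- bw.nrd0(x)   # 0.9  * min(sd, IQR / 1.34) * n^(-1/5), default of density()

# Same data, three bandwidths (the factor 5 is an arbitrary illustration)
d_small <- density(x, bw = h_rule / 5)
d_rule  <- density(x, bw = h_rule)
d_large <- density(x, bw = h_rule * 5)

# Count local maxima of each curve as a crude measure of wiggliness
# count_modes is a hypothetical helper for this sketch
count_modes <- function(d) sum(diff(sign(diff(d$y))) == -2)
modes <- c(small = count_modes(d_small),
           rule  = count_modes(d_rule),
           large = count_modes(d_large))
```

Printing modes shows how the number of bumps drops as the bandwidth grows; plotting the three density objects with plot() and lines() makes the same point visually.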

R Setup & Configuration

# Load required packages
library(ggplot2)   # Grammar of graphics plotting
library(plotly)    # Interactive 3D visualizations
library(MASS)      # Statistical functions (kde2d)
library(viridis)   # Perceptually uniform color scales

# Set seed for reproducibility
set.seed(2025)

Together these packages cover density estimation (MASS), static plotting (ggplot2), and interactive visualization (plotly).

Dataset Preparation

1D Dataset: Old Faithful geyser waiting times (built-in faithful data frame, 272 observations)

2D Dataset: Synthetic mixture of two Gaussian clusters (480 + 320 = 800 points)

# 1D: Geyser waiting times
x1 <- faithful$waiting

# 2D: First cluster (lower left)
x2 <- MASS::mvrnorm(480, mu = c(1.5, 1.0),
                    Sigma = matrix(c(0.08, 0.02, 0.02, 0.05), 2))

# 2D: Second cluster (upper right)  
x3 <- MASS::mvrnorm(320, mu = c(2.8, 2.5),
                    Sigma = matrix(c(0.06, -0.01, -0.01, 0.04), 2))

# Combine into dataframe
xy_df <- as.data.frame(rbind(x2, x3))
colnames(xy_df) <- c("x", "y")

1D KDE: Code

ggplot(data.frame(x1), aes(x = x1)) +
  geom_histogram(aes(y = after_stat(density)), 
                 bins = 20, fill = "#8C1D40", alpha = 0.3) +
  geom_density(linewidth = 1.3, color = "#2C3E50") +
  labs(title = "1D Kernel Density Estimation",
       subtitle = "Old Faithful Waiting Times",
       x = "Waiting Time (minutes)", 
       y = "Density") +
  theme_minimal(base_size = 11) +
  theme(plot.title = element_text(face = "bold", size = 14))

1D KDE: Plot

2D KDE: Code

ggplot(xy_df, aes(x = x, y = y)) +
  stat_density_2d(aes(fill = after_stat(level)), 
                  geom = "polygon", color = "white", linewidth = 0.3) +
  geom_point(alpha = 0.4, size = 1.2, color = "grey20") +
  scale_fill_viridis_c(option = "plasma", name = "Density") +
  labs(title = "2D Kernel Density Estimation",
       subtitle = "Two-Cluster Gaussian Mixture") +
  theme_minimal(base_size = 11) +
  theme(plot.title = element_text(face = "bold", size = 14),
        legend.position = "right")

2D KDE: Plot

3D KDE: Code

# Compute 2D kernel density on grid
k <- kde2d(xy_df$x, xy_df$y, n = 100)

# Create interactive 3D surface
plot_ly(
  x = k$x,
  y = k$y,
  z = t(k$z),   # kde2d stores z as [x, y]; plotly surfaces read rows as y
  type = "surface",
  colorscale = "Viridis"
) %>%
  layout(
    title = "3D KDE Surface Plot",
    scene = list(
      xaxis = list(title = "X"),
      yaxis = list(title = "Y"),
      zaxis = list(title = "Density")
    )
  )
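Before plotting a kde2d() result it can help to sanity-check the grid it returns. The sketch below is an illustration under assumptions: it regenerates the same two-cluster mixture, pads the grid limits by 1 unit on each side (an arbitrary choice, as are the names xy, k_pad, and mass), and checks the matrix shape and total probability mass.

```r
library(MASS)   # provides mvrnorm() and kde2d()
set.seed(2025)

# Same two-cluster mixture as in the dataset slide
a <- mvrnorm(480, mu = c(1.5, 1.0),
             Sigma = matrix(c(0.08, 0.02, 0.02, 0.05), 2))
b <- mvrnorm(320, mu = c(2.8, 2.5),
             Sigma = matrix(c(0.06, -0.01, -0.01, 0.04), 2))
xy <- rbind(a, b)

# Pad the grid limits so the tails of the estimate are not cut off
lims <- c(range(xy[, 1]) + c(-1, 1), range(xy[, 2]) + c(-1, 1))
k_pad <- kde2d(xy[, 1], xy[, 2], n = 100, lims = lims)

# z has one row per x grid value and one column per y grid value
z_dim <- dim(k_pad$z)

# Riemann-sum approximation of the total probability mass on the grid
cell_area <- diff(k_pad$x)[1] * diff(k_pad$y)[1]
mass <- sum(k_pad$z) * cell_area
```

With the default lims (the data range), the same sum comes out somewhat below 1, because kernel mass that falls outside the bounding box of the data is dropped.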

3D KDE: Interactive Plot

Key Takeaways

Conceptual:

  • Non-parametric density estimation
  • Smooth alternative to histograms
  • Bandwidth is the crucial parameter
  • Generalizes to multiple dimensions

Practical:

  • ggplot2 for publication-quality 2D plots
  • plotly for interactive 3D exploration
  • MASS::kde2d() for efficient computation
  • Widely used in ML and data science

Applications: Anomaly detection • Feature engineering • Probabilistic modeling • Data exploration

Thank You!

Syna Malhan • 2025-11-16