Lazy Learning: Classification Using Nearest Neighbors (k-NN)

1. Introduction

Lazy learning is a form of instance-based learning in which:

  • The model does not build a generalized representation of the data at training time.
  • Computation is deferred until a prediction is needed.
  • The model memorizes the training data and searches it for similar observations when classifying new instances.

A common lazy learning algorithm is the k-Nearest Neighbors (k-NN) classifier.


2. Understanding k-Nearest Neighbors (k-NN)

  • k-NN is a non-parametric, instance-based learning algorithm.
  • It classifies new data points based on the majority class of their k nearest neighbors in the training data.

2.1 Mathematical Formulation

Given a training dataset:

\[ D = \{(x_1, y_1), (x_2, y_2), ..., (x_n, y_n) \} \]

where:

  • \(x_i\) is a feature vector.
  • \(y_i\) is the class label.

For a new data point \(x^*\), the predicted class \(\hat{y}^*\) is determined as:

\[ \hat{y}^* = \text{mode} \{ y_i \mid x_i \in N_k(x^*) \} \]

where:

  • \(N_k(x^*)\) represents the k nearest neighbors of \(x^*\).
  • The distance metric (e.g., Euclidean, Manhattan) determines the nearest neighbors.
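
To make the formulation concrete, the following base-R sketch classifies a single query point: it computes the distance from \(x^*\) to every training observation, selects the k closest, and returns the most frequent label. The function name predict_knn, the default k = 5, and the use of Euclidean distance are illustrative assumptions, not a fixed API.

# Minimal k-NN prediction for a single query point (illustrative sketch)
predict_knn <- function(X_train, y_train, x_star, k = 5) {
  # Euclidean distance from x_star to every training row
  dists <- sqrt(rowSums(sweep(as.matrix(X_train), 2, x_star)^2))
  # Labels of the k nearest neighbors
  neighbors <- y_train[order(dists)[1:k]]
  # Majority vote (the mode); which.max breaks ties by taking the first level
  names(which.max(table(neighbors)))
}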

3. Choosing a Distance Metric

The choice of distance metric significantly affects the model’s performance.

3.1 Common Distance Metrics

1. Euclidean Distance (Default)

\[ d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \]

2. Manhattan Distance

\[ d(x, y) = \sum_{i=1}^{n} |x_i - y_i| \]

3. Minkowski Distance (Generalized form)

\[ d(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p} \]

where \(p\) determines the norm:

  • If \(p = 1\), it reduces to Manhattan distance.
  • If \(p = 2\), it reduces to Euclidean distance.

4. Cosine Similarity

\[ \text{sim}(x, y) = \frac{x \cdot y}{\|x\| \|y\|} \]

This measures the angle between two vectors rather than their magnitudes. Because it is a similarity rather than a distance, it is typically converted to a dissimilarity, \(d(x, y) = 1 - \text{sim}(x, y)\), before being used with k-NN.
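
As a quick check of these definitions, the sketch below computes each measure for a pair of small example vectors in base R; the vectors x and y are arbitrary illustrative values. Note how the Minkowski distance with p = 1 and p = 2 reproduces the Manhattan and Euclidean results.

# Example vectors (arbitrary illustrative values)
x <- c(1, 2, 3)
y <- c(4, 6, 3)

# Euclidean distance: sqrt(9 + 16 + 0) = 5
sqrt(sum((x - y)^2))

# Manhattan distance: 3 + 4 + 0 = 7
sum(abs(x - y))

# Minkowski distance; p = 1 and p = 2 match the two results above
minkowski <- function(x, y, p) (sum(abs(x - y)^p))^(1 / p)
minkowski(x, y, p = 1)
minkowski(x, y, p = 2)

# Cosine similarity, and the dissimilarity 1 - sim used with k-NN
cos_sim <- sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
1 - cos_sim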


4. Dataset 1: Classification of Flower Species (Iris Dataset)

The Iris dataset contains 150 observations of flowers with four features (all measured in centimeters):

  • Sepal length
  • Sepal width
  • Petal length
  • Petal width

It classifies the flowers into three species: Setosa, Versicolor, and Virginica.

library(ggplot2)
library(caret)
library(class)
library(dplyr)
library(gridExtra)
# Load the dataset and fix the random seed for reproducibility
data("iris")
set.seed(123)

# Shuffle the rows so the classes are randomly ordered
iris <- iris[sample(nrow(iris)), ]

# Min-max normalization: rescale each feature to the [0, 1] range
normalize <- function(x) { (x - min(x)) / (max(x) - min(x)) }
iris_norm <- as.data.frame(lapply(iris[1:4], normalize))
iris_norm$Species <- iris$Species

# Train-test split (80-20)
train_index <- sample(1:nrow(iris_norm), floor(0.8 * nrow(iris_norm)))
train_data <- iris_norm[train_index, ]
test_data <- iris_norm[-train_index, ]
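
With the split prepared, the classifier can be fit and evaluated directly; the sketch below uses knn() from the class package loaded above. The choice k = 5 is an illustrative assumption, and the column indices rely on the four normalized features occupying columns 1 through 4 of the data frames built above.

# Predict test-set labels with class::knn (k = 5 is an illustrative choice)
predictions <- knn(train = train_data[, 1:4],
                   test  = test_data[, 1:4],
                   cl    = train_data$Species,
                   k     = 5)

# Proportion of correct predictions on the held-out 20%
mean(predictions == test_data$Species)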