Lazy Learning: Classification Using Nearest Neighbors (k-NN)

1. Introduction

Lazy learning is a form of instance-based learning in which:

  • The model does not build a generalized representation of the data at training time.
  • Computation is deferred until a prediction is needed.
  • The model memorizes the training data and searches it for similar observations when classifying new instances.

A common lazy learning algorithm is the k-Nearest Neighbors (k-NN) classifier.


2. Understanding k-Nearest Neighbors (k-NN)

  • k-NN is a non-parametric, instance-based learning algorithm.
  • It classifies new data points based on the majority class of their k nearest neighbors in the training data.

2.1 Mathematical Formulation

Given a training dataset:

\[ D = \{(x_1, y_1), (x_2, y_2), ..., (x_n, y_n) \} \]

where:

  • \(x_i\) is a feature vector.
  • \(y_i\) is the class label.

For a new data point \(x^*\), the predicted class \(\hat{y}^*\) is determined as:

\[ \hat{y}^* = \text{mode} \{ y_i \mid x_i \in N_k(x^*) \} \]

where:

  • \(N_k(x^*)\) represents the k nearest neighbors of \(x^*\).
  • The distance metric (e.g., Euclidean, Manhattan) determines the nearest neighbors.
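
To make the formulation concrete, the following base-R sketch classifies a single query point: it computes the distance from \(x^*\) to every training observation, selects the k closest, and returns the most frequent label. The function name predict_knn, the default k = 5, and the use of Euclidean distance are illustrative assumptions, not a fixed API.

# Minimal k-NN prediction for a single query point (illustrative sketch)
predict_knn <- function(X_train, y_train, x_star, k = 5) {
  # Euclidean distance from x_star to every training row
  dists <- sqrt(rowSums(sweep(as.matrix(X_train), 2, x_star)^2))
  # Labels of the k nearest neighbors
  neighbors <- y_train[order(dists)[1:k]]
  # Majority vote (the mode); which.max breaks ties by taking the first level
  names(which.max(table(neighbors)))
}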

3. Choosing a Distance Metric

The choice of distance metric significantly affects the model’s performance.

3.1 Common Distance Metrics

1. Euclidean Distance (Default)

\[ d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \]

2. Manhattan Distance

\[ d(x, y) = \sum_{i=1}^{n} |x_i - y_i| \]

3. Minkowski Distance (Generalized form)

\[ d(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p} \]

where \(p\) determines the norm:

  • If \(p = 1\), it reduces to Manhattan distance.
  • If \(p = 2\), it reduces to Euclidean distance.

4. Cosine Similarity

\[ \text{sim}(x, y) = \frac{x \cdot y}{\|x\| \|y\|} \]

This measures the angle between two vectors rather than their magnitudes. Because it is a similarity rather than a distance, it is typically converted to a dissimilarity, \(d(x, y) = 1 - \text{sim}(x, y)\), before being used with k-NN.
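
As a quick check of these definitions, the sketch below computes each measure for a pair of small example vectors in base R; the vectors x and y are arbitrary illustrative values. Note how the Minkowski distance with p = 1 and p = 2 reproduces the Manhattan and Euclidean results.

# Example vectors (arbitrary illustrative values)
x <- c(1, 2, 3)
y <- c(4, 6, 3)

# Euclidean distance: sqrt(9 + 16 + 0) = 5
sqrt(sum((x - y)^2))

# Manhattan distance: 3 + 4 + 0 = 7
sum(abs(x - y))

# Minkowski distance; p = 1 and p = 2 match the two results above
minkowski <- function(x, y, p) (sum(abs(x - y)^p))^(1 / p)
minkowski(x, y, p = 1)
minkowski(x, y, p = 2)

# Cosine similarity, and the dissimilarity 1 - sim used with k-NN
cos_sim <- sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
1 - cos_sim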


4. Dataset 1: Classification of Flower Species (Iris Dataset)

The Iris dataset contains 150 observations of flowers with four features (all measured in centimeters):

  • Sepal length
  • Sepal width
  • Petal length
  • Petal width

It classifies the flowers into three species: Setosa, Versicolor, and Virginica.

library(ggplot2)
library(caret)
library(class)
library(dplyr)
library(gridExtra)
# Load the dataset and fix the random seed for reproducibility
data("iris")
set.seed(123)

# Shuffle the rows so the classes are randomly ordered
iris <- iris[sample(nrow(iris)), ]

# Min-max normalization: rescale each feature to the [0, 1] range
normalize <- function(x) { (x - min(x)) / (max(x) - min(x)) }
iris_norm <- as.data.frame(lapply(iris[1:4], normalize))
iris_norm$Species <- iris$Species

# Train-test split (80-20)
train_index <- sample(1:nrow(iris_norm), floor(0.8 * nrow(iris_norm)))
train_data <- iris_norm[train_index, ]
test_data <- iris_norm[-train_index, ]
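
With the split prepared, the classifier can be fit and evaluated directly; the sketch below uses knn() from the class package loaded above. The choice k = 5 is an illustrative assumption, and the column indices rely on the four normalized features occupying columns 1 through 4 of the data frames built above.

# Predict test-set labels with class::knn (k = 5 is an illustrative choice)
predictions <- knn(train = train_data[, 1:4],
                   test  = test_data[, 1:4],
                   cl    = train_data$Species,
                   k     = 5)

# Proportion of correct predictions on the held-out 20%
mean(predictions == test_data$Species)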