k-Nearest Neighbours (kNN) is a simple, non-parametric method used for classification (and sometimes regression). In classification, it assigns a new observation to the majority class among its k closest observations in the training data.
The idea behind kNN is that observations that are close to each other in the feature space are likely to belong to the same class. The “distance” between observations is typically measured using Euclidean distance, although other distance metrics can also be used.
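To make the idea concrete, the following is a minimal base R sketch of the voting rule on a tiny made-up data set. The helper knn_by_hand() and the toy values are assumptions introduced purely for illustration; this is not how the class package implements the method internally.

```r
# Toy training data: two features, two classes (values are made up)
train_x <- matrix(c(1.0, 1.2,
                    0.8, 1.0,
                    1.1, 0.9,
                    3.0, 3.2,
                    3.1, 2.9,
                    2.9, 3.0),
                  ncol = 2, byrow = TRUE)
train_y <- factor(c("A", "A", "A", "B", "B", "B"))

# Hypothetical helper: classify one new point by majority vote
# among its k nearest training observations
knn_by_hand <- function(new_x, train_x, train_y, k = 3) {
  # Euclidean distance from new_x to every training row
  d <- sqrt(rowSums(sweep(train_x, 2, new_x)^2))
  # Labels of the k closest training observations
  nearest <- train_y[order(d)[seq_len(k)]]
  # Majority class (ties go to the first class with the maximum count)
  names(which.max(table(nearest)))
}

knn_by_hand(c(1.0, 1.0), train_x, train_y, k = 3)
#> [1] "A"
knn_by_hand(c(3.0, 3.0), train_x, train_y, k = 3)
#> [1] "B"
```

Each new point is assigned to the class of the cluster it sits in, because all three of its nearest neighbours belong to that class.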
In the following section, we will focus on kNN as a classification method and demonstrate how it can be implemented in R. We will also discuss how to choose an appropriate value of k and how to evaluate the performance of the model.
In R, kNN classification is commonly implemented using the knn()
function from the class package.
Syntax:
knn(train, test, cl, k = 1)
Where:
- train: The matrix or data frame of training predictors
- test: The matrix or data frame of test predictors
- cl: A vector of class labels for the training data
- k: Specifies the number of nearest neighbours to consider
The output (stored as pred in the examples below) is a factor of
predicted class labels for the test observations, based on majority
voting among the k nearest neighbours.
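As a quick illustration of the call, here is a minimal sketch on a small made-up data set. The data values and object names are assumptions chosen for the example; only knn() and its arguments train, test, cl and k come from the class package.

```r
library(class)

# Made-up training predictors and their class labels
train <- data.frame(x1 = c(1.0, 0.8, 1.1, 3.0, 3.1, 2.9),
                    x2 = c(1.2, 1.0, 0.9, 3.2, 2.9, 3.0))
cl <- factor(c("A", "A", "A", "B", "B", "B"))

# Two new observations to classify
test <- data.frame(x1 = c(1.0, 3.0),
                   x2 = c(1.0, 3.0))

# Predicted labels from a majority vote among the 3 nearest neighbours
pred <- knn(train, test, cl, k = 3)
pred
#> [1] A B
#> Levels: A B
```

Note that train and test contain only predictor columns here; the labels are supplied separately through cl.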
First we need to read our data into R. Throughout this example we
will use the wine data. These data are the results of a
chemical analysis of wines grown in the same region in Italy but derived
from three different cultivars. The analysis determined the quantities
of 13 constituents found in each of the three types of wines.
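The attributes are: alcohol, malic acid, ash, alcalinity of ash, magnesium, total phenols, flavanoids, nonflavanoid phenols, proanthocyanins, color intensity, hue, OD280/OD315 of diluted wines, and proline.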
The wine data is stored as a comma-separated text file, so to read in
the data we can use the read.table() function in R with sep = ",".
wine <- read.table("http://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data", sep=",")
colnames(wine) <- c("Cultivar","Alcohol","Malic acid","Ash","Alcalinity of ash","Magnesium","Total phenols","Flavanoids","Nonflavanoid phenols","Proanthocyanins","Color intensity","Hue","OD280/OD315 of diluted wines","Proline")
dim(wine)
#> [1] 178 14
head(wine, 5)
#>   Cultivar Alcohol Malic acid  Ash Alcalinity of ash Magnesium Total phenols
#> 1        1   14.23       1.71 2.43              15.6       127          2.80
#> 2        1   13.20       1.78 2.14              11.2       100          2.65
#> 3        1   13.16       2.36 2.67              18.6       101          2.80
#> 4        1   14.37       1.95 2.50              16.8       113          3.85
#> 5        1   13.24       2.59 2.87              21.0       118          2.80
#>   Flavanoids Nonflavanoid phenols Proanthocyanins Color intensity  Hue
#> 1       3.06                 0.28            2.29            5.64 1.04
#> 2       2.76                 0.26            1.28            4.38 1.05
#> 3       3.24                 0.30            2.81            5.68 1.03
#> 4       3.49                 0.24            2.18            7.80 0.86
#> 5       2.69                 0.39            1.82            4.32 1.04
#>   OD280/OD315 of diluted wines Proline
#> 1                         3.92    1065
#> 2                         3.40    1050
#> 3                         3.17    1185
#> 4                         3.45    1480
#> 5                         2.93     735
The wine dataset contains 178 observations of 14
variables, including the 13 measured quantities of chemicals and the
variable Cultivar, which indicates the type of grape from which the wine
was produced.
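The measured attributes have very different ranges (compare Proline, with values around 1000, and Hue, with values around 1), so the data should be standardized before applying kNN. Otherwise the Euclidean distance would be dominated by the variables with the largest ranges, and the variables would not contribute equally to the analysis. In the code that follows, the standardized data frame is called wine_stand.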
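The standardization step can be sketched as follows, assuming it is done with base R's scale(), which centres each column to mean 0 and rescales it to standard deviation 1. So that the example runs on its own, it uses only the first five rows of two attributes, copied from the head(wine, 5) output; the object names wine_small and wine_small_stand are hypothetical.

```r
# First five rows of two wine attributes, copied from head(wine, 5)
wine_small <- data.frame(
  Cultivar = c(1, 1, 1, 1, 1),
  Alcohol  = c(14.23, 13.20, 13.16, 14.37, 13.24),
  Proline  = c(1065, 1050, 1185, 1480, 735)
)

# Standardize every predictor (all columns except Cultivar)
wine_small_stand <- as.data.frame(scale(wine_small[, -1]))

# Check: column means are 0 (up to rounding) and standard deviations are 1
round(colMeans(wine_small_stand), 10)
apply(wine_small_stand, 2, sd)
```

For the full data set, the analogous call would presumably be as.data.frame(scale(wine[, -1])), with the Cultivar column left out of the scaling because it is a label rather than a measurement.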
To evaluate the performance of a kNN classifier, we first need to split the data into a training set and a test set. The training set is used to build the model, while the test set is used to assess how well the model performs on unseen data.
In R, this can be done using the createDataPartition()
function from the caret package. This function ensures that
the split is done in a way that preserves the class distribution in both
sets (i.e., a stratified sample).
A common choice is an 80–20 split, where 80% of the data is used for training and 20% for testing.
library(caret)
#> Loading required package: ggplot2
#> Warning: package 'ggplot2' was built under R version 4.5.3
#> Loading required package: lattice
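set.seed(255)
# Attach the class labels as a factor so the split can be stratified on them
wine_stand$Cultivar <- factor(wine$Cultivar)
# Row indices of a stratified 80% training sample
trainIndex <- createDataPartition(wine_stand$Cultivar,
                                  times = 1,
                                  p = .8,
                                  list = FALSE)
train <- wine_stand[trainIndex, ]
test <- wine_stand[-trainIndex, ]
Choosing an appropriate value of k is crucial because it controls the bias–variance trade-off: a small k gives a flexible classifier with low bias but high variance (it is sensitive to noise in individual observations), whereas a large k gives a smoother classifier with lower variance but higher bias.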
In practice, the best value of k is often chosen by evaluating model performance on a test set or using cross-validation. A simple approach is to try several values of k and select the one that gives the highest classification accuracy (or lowest error rate).
In R, this can be done by looping over different values of k and comparing predictions:
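library(class)
k_values <- seq(1, 10, by = 1)
# Test-set accuracy for each candidate k.
# Note: train and test are passed whole here, so the Cultivar column
# is still among the columns passed to knn().
accuracy <- sapply(k_values, function(k) {
  pred <- knn(train, test, cl = train$Cultivar, k = k)
  mean(pred == test$Cultivar)
})
accuracy
#> [1] 0.9117647 0.9117647 0.9705882 0.9705882 0.9705882 1.0000000 0.9705882
#> [8] 0.9705882 0.9705882 0.9705882
On this particular split, the highest accuracy (1.0) is achieved with k = 6 neighbours.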
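One point to watch in the loop above is that train and test are passed to knn() as full data frames, so the Cultivar column is included alongside the predictors and the label can enter the distance computation. A minimal sketch of a variant that keeps only the predictor columns is given below. It uses the built-in iris data purely as a stand-in so that it runs on its own, with a simple unstratified split; the object names are assumptions, and no output is shown because the numbers would not be comparable to the wine results.

```r
library(class)

set.seed(255)
# iris is only a stand-in for the standardized wine data
dat <- data.frame(scale(iris[, 1:4]), Species = iris$Species)
idx <- sample(nrow(dat), size = round(0.8 * nrow(dat)))
train <- dat[idx, ]
test  <- dat[-idx, ]

# Logical mask selecting the predictor columns only
is_pred <- names(train) != "Species"

k_values <- 1:10
accuracy <- sapply(k_values, function(k) {
  # Distances are computed on predictors only; labels go in through cl
  pred <- knn(train[, is_pred], test[, is_pred], cl = train$Species, k = k)
  mean(pred == test$Species)
})
names(accuracy) <- k_values
accuracy
```

With the wine objects, the same idea would amount to dropping the Cultivar column from train and test before calling knn().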
Instead of manually trying different values of k, we can use
cross-validation to automatically select an optimal value for the kNN
model. The caret package provides a convenient way to do
this using the train() function, which evaluates multiple
models and compares their performance.
In the code below, we fit a kNN model while tuning the number of neighbours (k) from 1 to 10:
set.seed(89)
knnModel <- train(
  Cultivar ~ .,
  data = wine_stand,
  method = "knn",
  trControl = trainControl(method = "cv"),
  tuneGrid = data.frame(k = 1:10)
)
knnModel
#> k-Nearest Neighbors
#>
#> 178 samples
#> 13 predictor
#> 3 classes: '1', '2', '3'
#>
#> No pre-processing
#> Resampling: Cross-Validated (10 fold)
#> Summary of sample sizes: 160, 160, 161, 160, 161, 160, ...
#> Resampling results across tuning parameters:
#>
#>   k   Accuracy   Kappa
#>    1  0.9549020  0.9321199
#>    2  0.9493464  0.9240232
#>    3  0.9549020  0.9322007
#>    4  0.9437908  0.9155658
#>    5  0.9607843  0.9411451
#>    6  0.9437908  0.9155211
#>    7  0.9663399  0.9493629
#>    8  0.9718954  0.9577350
#>    9  0.9718954  0.9577350
#>   10  0.9722222  0.9584087
#>
#> Accuracy was used to select the optimal model using the largest value.
#> The final value used for the model was k = 10.
plot(knnModel)
The resulting object (knnModel) contains the optimal value of k, the performance results for all tested values, and the final fitted model. This makes the procedure both efficient and more reliable than selecting k manually based on a single train–test split.
The output shows that the cross-validated accuracy is highest for k = 10 (0.9722), only marginally ahead of k = 8 and k = 9 (0.9719 each).
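To show what the cross-validation is doing, the following is a rough base R sketch of 10-fold cross-validation over k = 1 to 10 using knn() from the class package. It is not caret's implementation (for instance, the folds here are not stratified), and it uses iris as stand-in data so that it runs on its own; all object names are assumptions.

```r
library(class)

set.seed(89)
x <- scale(iris[, 1:4])   # stand-in predictors, standardized
y <- iris$Species         # stand-in class labels

n_folds <- 10
# Randomly assign each observation to one of the folds
fold <- sample(rep(seq_len(n_folds), length.out = nrow(x)))

cv_accuracy <- sapply(1:10, function(k) {
  fold_acc <- sapply(seq_len(n_folds), function(f) {
    held_out <- fold == f
    # Fit on the other nine folds, predict the held-out fold
    pred <- knn(x[!held_out, , drop = FALSE],
                x[held_out, , drop = FALSE],
                cl = y[!held_out], k = k)
    mean(pred == y[held_out])
  })
  mean(fold_acc)   # average accuracy over the folds
})

# Position of the largest value equals k, since k runs from 1 to 10
best_k <- which.max(cv_accuracy)
```

Averaging over ten held-out folds gives a less noisy estimate for each k than a single train–test split, which is why small differences between neighbouring values of k should not be over-interpreted.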
After selecting the optimal number of neighbours (k) using cross-validation, we can fit the final kNN model using this chosen value. This step uses the full training dataset and fixes k at its optimal value to make predictions on new, unseen observations.
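A minimal sketch of this step is given below, assuming the predictions are obtained with knn() from the class package. The helper predict_knn() is hypothetical and drops the label column before computing distances; iris is again used only as stand-in data so that the example runs on its own.

```r
library(class)

# Hypothetical helper: predict test labels with a fixed k,
# using every column except the label as a predictor
predict_knn <- function(train, test, label, k) {
  predictors <- setdiff(names(train), label)
  knn(train[, predictors, drop = FALSE],
      test[, predictors, drop = FALSE],
      cl = train[[label]], k = k)
}

# Stand-in data and a simple 120/30 split
set.seed(255)
dat <- data.frame(scale(iris[, 1:4]), Species = iris$Species)
idx <- sample(nrow(dat), size = 120)

pred <- predict_knn(dat[idx, ], dat[-idx, ], label = "Species", k = 10)
length(pred)
#> [1] 30
```

With the objects from this section, the analogous call would presumably be wine.knn <- predict_knn(train, test, label = "Cultivar", k = 10), although the value of k behind the wine.knn predictions evaluated below is an assumption here.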
After fitting a kNN model, it is important to formally evaluate how well it performs on unseen data. The main goal is to assess how accurately the model can classify new observations into the correct groups.
A standard approach is to use a confusion matrix, which compares the predicted class labels with the true class labels from the test set. This allows us to see not only the overall accuracy, but also which classes are most often confused with each other:
table(Predicted = wine.knn, Actual = test$Cultivar)
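#>          Actual
#> Predicted  1  2  3
#>         1 11  0  0
#>         2  0 13  0
#>         3  0  1  9
From the confusion matrix, we can compute the classification accuracy, which is the proportion of correctly classified observations. Here 11 + 13 + 9 = 33 of the 34 test observations lie on the diagonal, so the accuracy is 33/34 ≈ 0.971.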
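The accuracy calculation can be sketched in R as follows. The counts are copied from the table() output above into a matrix named conf_mat (a hypothetical name), so the example runs without the fitted model.

```r
# Confusion matrix copied from the table() output
# (rows = predicted class, columns = actual class)
conf_mat <- matrix(c(11,  0, 0,
                      0, 13, 0,
                      0,  1, 9),
                   nrow = 3, byrow = TRUE,
                   dimnames = list(Predicted = c("1", "2", "3"),
                                   Actual    = c("1", "2", "3")))

# Correct predictions sit on the diagonal
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
round(accuracy, 4)
#> [1] 0.9706
```

Applied directly to the predictions, the same quantity is mean(wine.knn == test$Cultivar).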
A more comprehensive way to evaluate the performance of a kNN
classifier is by using the confusionMatrix() function from
the caret package. This function provides a detailed
summary of classification performance by comparing the predicted class
labels with the true class labels.
library(caret)
confusionMatrix(wine.knn,test$Cultivar)
#> Confusion Matrix and Statistics
#>
#>           Reference
#> Prediction  1  2  3
#>          1 11  0  0
#>          2  0 13  0
#>          3  0  1  9
#>
#> Overall Statistics
#>
#>                Accuracy : 0.9706
#>                  95% CI : (0.8467, 0.9993)
#>     No Information Rate : 0.4118
#>     P-Value [Acc > NIR] : 3.92e-12
#>
#>                   Kappa : 0.9554
#>
#>  Mcnemar's Test P-Value : NA
#>
#> Statistics by Class:
#>
#>                      Class: 1 Class: 2 Class: 3
#> Sensitivity            1.0000   0.9286   1.0000
#> Specificity            1.0000   1.0000   0.9600
#> Pos Pred Value         1.0000   1.0000   0.9000
#> Neg Pred Value         1.0000   0.9524   1.0000
#> Prevalence             0.3235   0.4118   0.2647
#> Detection Rate         0.3235   0.3824   0.2647
#> Detection Prevalence   0.3235   0.3824   0.2941
#> Balanced Accuracy      1.0000   0.9643   0.9800
The output includes the confusion matrix, which shows how many observations are correctly and incorrectly classified for each class. It also provides overall performance measures such as accuracy, which represents the proportion of correctly classified observations, and the Kappa statistic, which adjusts accuracy for agreement occurring by chance.
In addition, class-specific metrics such as sensitivity and specificity are reported, giving further insight into how well the model performs for each individual class. For example, the sensitivity of 0.9286 for class 2 reflects the single class 2 wine that was misclassified as class 3. Together, these results provide a detailed evaluation of the predictive performance of the kNN model.
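To see where these numbers come from, the sketch below recomputes the per-class sensitivity and specificity and the Kappa statistic by hand from the confusion matrix counts shown above. The object names are assumptions; the formulas are the standard one-versus-rest definitions and Cohen's kappa.

```r
# Confusion matrix copied from the output above
# (rows = predicted class, columns = actual class)
conf_mat <- matrix(c(11,  0, 0,
                      0, 13, 0,
                      0,  1, 9),
                   nrow = 3, byrow = TRUE,
                   dimnames = list(Predicted = c("1", "2", "3"),
                                   Actual    = c("1", "2", "3")))

n  <- sum(conf_mat)
tp <- diag(conf_mat)             # correctly predicted, per class
fn <- colSums(conf_mat) - tp     # class members predicted as another class
fp <- rowSums(conf_mat) - tp     # other classes predicted as this class
tn <- n - tp - fn - fp

sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)

# Cohen's kappa: observed agreement corrected for chance agreement
p_obs <- sum(tp) / n
p_exp <- sum(rowSums(conf_mat) * colSums(conf_mat)) / n^2
kappa_hat <- (p_obs - p_exp) / (1 - p_exp)

round(rbind(sensitivity, specificity), 4)
#>             1      2    3
#> sensitivity 1 0.9286 1.00
#> specificity 1 1.0000 0.96
round(kappa_hat, 4)
#> [1] 0.9554
```

The values agree with the confusionMatrix() output: the one misclassified class 2 wine lowers the sensitivity of class 2 and the specificity of class 3, and nothing else.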