K-Nearest Neighbours (KNN) is a simple, non-parametric method used for classification (and sometimes regression). It assigns a new observation to a class based on the majority class among its k closest observations in the training data.

The idea behind kNN is that observations that are close to each other in the feature space are likely to belong to the same class. The “distance” between observations is typically measured using Euclidean distance, although other distance metrics can also be used.
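The idea can be sketched by hand in a few lines of base R. The toy data below are hypothetical (they are not the wine data), and the variable names are chosen only for illustration: the code computes the Euclidean distance from a new observation to every training observation, keeps the k closest, and takes a majority vote.

```r
# Hypothetical toy data: two features, two classes (not from the wine data)
train_x <- matrix(c(1.0, 1.2,
                    0.8, 1.0,
                    3.0, 3.1,
                    3.2, 2.9,
                    2.9, 3.3),
                  ncol = 2, byrow = TRUE)
train_y <- factor(c("A", "A", "B", "B", "B"))
new_x   <- c(1.1, 0.9)
k       <- 3

# Euclidean distance from the new observation to every training observation
d <- sqrt(rowSums(sweep(train_x, 2, new_x)^2))

# Indices of the k closest training observations
nearest <- order(d)[1:k]

# Majority vote among the class labels of those neighbours
votes <- table(train_y[nearest])
predicted <- names(votes)[which.max(votes)]
predicted
#> [1] "A"
```

Two of the three nearest neighbours belong to class A, so the new observation is assigned to A.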

In the following section, we will focus on kNN as a classification method and demonstrate how it can be implemented in R. We will also discuss how to choose an appropriate value of k and how to evaluate the performance of the model.

KNN in R

In R, K-Nearest Neighbours (KNN) classification is commonly implemented using the knn() function from the class package.

Syntax:
knn(train, test, cl, k = 1)

Where:

train: The matrix or data frame of training predictors

test: The matrix or data frame of test predictors

cl: A vector of class labels for the training data

k: Specifies the number of nearest neighbours to consider

The output (stored as pred in the code further on) is a factor of predicted class labels for the test observations, based on majority voting among the k nearest neighbours, with ties broken at random.
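As a minimal usage sketch, the call below applies knn() to a small hypothetical data set (not the wine data; all object names are illustrative). Note that train and test hold predictor columns only, while the class labels are passed separately through cl.

```r
library(class)

# Hypothetical toy data: two predictors, two classes
train_x <- data.frame(x1 = c(1.0, 0.8, 1.1, 3.0, 3.2, 2.9),
                      x2 = c(1.2, 1.0, 0.9, 3.1, 2.9, 3.3))
train_y <- factor(c("A", "A", "A", "B", "B", "B"))
test_x  <- data.frame(x1 = c(1.0, 3.1),
                      x2 = c(1.0, 3.0))

# Only predictor columns go into train and test; labels go into cl
pred <- knn(train = train_x, test = test_x, cl = train_y, k = 3)
pred
#> [1] A B
#> Levels: A B
```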

Read data

First we need to read our data into R. Throughout this example we will use the wine data. These data are the results of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determined the quantities of 13 constituents found in each of the three types of wines.

The attributes are:

Alcohol
Malic acid
Ash
Alcalinity of ash
Magnesium
Total phenols
Flavanoids
Nonflavanoid phenols
Proanthocyanins
Color intensity
Hue
OD280/OD315 of diluted wines
Proline

The wine data are stored as a comma-separated text file, so we can read them into R with the read.table() function, setting sep = ",".

wine <- read.table("http://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data", sep=",")

colnames(wine) <- c("Cultivar","Alcohol","Malic acid","Ash","Alcalinity of ash","Magnesium","Total phenols","Flavanoids","Nonflavanoid phenols","Proanthocyanins","Color intensity","Hue","OD280/OD315 of diluted wines","Proline")

dim(wine)
#> [1] 178  14

head(wine, 5)
#>   Cultivar Alcohol Malic acid  Ash Alcalinity of ash Magnesium Total phenols
#> 1        1   14.23       1.71 2.43              15.6       127          2.80
#> 2        1   13.20       1.78 2.14              11.2       100          2.65
#> 3        1   13.16       2.36 2.67              18.6       101          2.80
#> 4        1   14.37       1.95 2.50              16.8       113          3.85
#> 5        1   13.24       2.59 2.87              21.0       118          2.80
#>   Flavanoids Nonflavanoid phenols Proanthocyanins Color intensity  Hue
#> 1       3.06                 0.28            2.29            5.64 1.04
#> 2       2.76                 0.26            1.28            4.38 1.05
#> 3       3.24                 0.30            2.81            5.68 1.03
#> 4       3.49                 0.24            2.18            7.80 0.86
#> 5       2.69                 0.39            1.82            4.32 1.04
#>   OD280/OD315 of diluted wines Proline
#> 1                         3.92    1065
#> 2                         3.40    1050
#> 3                         3.17    1185
#> 4                         3.45    1480
#> 5                         2.93     735

The wine dataset contains 178 observations of 14 variables, including the 13 measured quantities of chemicals and the variable Cultivar, which indicates the type of grape from which the wine was produced.

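The measured attributes have very different ranges, so the data should be standardized before applying kNN. Because kNN relies on distances between observations, variables measured on a larger scale would otherwise dominate the distance calculation; standardizing ensures that all variables contribute equally to the analysis.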

wine_stand <- as.data.frame(scale(wine[, 2:14])) # standardize data by subtracting the mean and dividing by the sd
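To make explicit what scale() computes, the sketch below reproduces it by hand on a small hypothetical matrix (not the wine data). It also shows that scale() stores the centring and scaling values as attributes, which is one way to apply the same transformation to new observations, for example when a test set should be standardized with the statistics of the training set.

```r
# Hypothetical toy matrix with two variables on very different scales
x <- cbind(a = c(1, 2, 3, 4), b = c(100, 300, 500, 700))

z <- scale(x)

# The same result by hand: subtract column means, divide by column sds
z_manual <- sweep(sweep(x, 2, colMeans(x)), 2, apply(x, 2, sd), "/")
all.equal(z, z_manual, check.attributes = FALSE)
#> [1] TRUE

# scale() keeps the values it used, so they can be reused on new data
centre <- attr(z, "scaled:center")
spread <- attr(z, "scaled:scale")
new_obs <- cbind(a = 2.5, b = 400)   # equal to the column means of x
new_z <- scale(new_obs, center = centre, scale = spread)
```

Here new_obs equals the column means of x, so both of its standardized values are 0.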

Split data into training and test set

To evaluate the performance of a kNN classifier, we first need to split the data into a training set and a test set. The training set is used to build the model, while the test set is used to assess how well the model performs on unseen data.

In R, this can be done using the createDataPartition() function from the caret package. This function ensures that the split is done in a way that preserves the class distribution in both sets (i.e., a stratified sample).

A common choice is an 80–20 split, where 80% of the data is used for training and 20% for testing.

library(caret)
#> Loading required package: ggplot2
#> Warning: package 'ggplot2' was built under R version 4.5.3
#> Loading required package: lattice

set.seed(255)

wine_stand$Cultivar <- factor(wine$Cultivar)

trainIndex <- createDataPartition(wine_stand$Cultivar, 
                                  times=1, 
                                  p = .8, 
                                  list = FALSE)

train <- wine_stand[trainIndex, ]
test <- wine_stand[-trainIndex, ]
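To illustrate what a stratified split means, the following base R sketch draws 80% of the rows within each class separately. It is only an illustration of the idea, not a description of how createDataPartition() works internally; the built-in iris data serve as a stand-in for the wine data, and the object names are hypothetical.

```r
set.seed(1)  # arbitrary seed, used for this sketch only

labels <- iris$Species  # built-in data as a stand-in for the wine data

# Within each class, draw 80% of the row indices for training
train_idx <- unlist(lapply(split(seq_along(labels), labels), function(idx) {
  sample(idx, size = round(0.8 * length(idx)))
}))

train_toy <- iris[train_idx, ]
test_toy  <- iris[-train_idx, ]

# The class proportions are the same in both sets
prop.table(table(train_toy$Species))
prop.table(table(test_toy$Species))
```

The same check, prop.table(table(...)), can be applied to train$Cultivar and test$Cultivar to confirm that the split above preserved the class distribution.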

Deciding the number of neighbours

Choosing an appropriate value of k is crucial because it controls the bias–variance trade-off:

Small values of k can lead to a very flexible model that is sensitive to noise (high variance).
Large values of k produce smoother decision boundaries but may oversimplify the structure (high bias).

In practice, the best value of k is often chosen by evaluating model performance on a test set or using cross-validation. A simple approach is to try several values of k and select the one that gives the highest classification accuracy (or lowest error rate).

In R, this can be done by looping over different values of k and comparing predictions:

library(class)
k_values <- seq(1, 10, by = 1)

accuracy <- sapply(k_values, function(k) {
  pred <- knn(train, test, cl = train$Cultivar, k = k)
  mean(pred == test$Cultivar)
})

accuracy
#>  [1] 0.9117647 0.9117647 0.9705882 0.9705882 0.9705882 1.0000000 0.9705882
#>  [8] 0.9705882 0.9705882 0.9705882

On this particular split, the highest test accuracy (1.00) is achieved with k = 6 neighbours.
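One point to keep in mind is that knn() treats every column of train and test as a predictor. Because Cultivar was added to wine_stand before the split, the train and test data frames here carry 14 columns, including the label itself. A variant that passes only the 13 chemical measurements may therefore be preferable; the accuracies it gives can differ from those printed above. The sketch below shows the pattern on the built-in iris data, used as a stand-in, with Species playing the role of Cultivar (all object names are hypothetical).

```r
library(class)

set.seed(1)  # arbitrary seed, used for this sketch only

# Standardized predictors plus the class label in one data frame
dat <- data.frame(scale(iris[, 1:4]), Species = iris$Species)
idx <- sample(nrow(dat), 120)
train_toy <- dat[idx, ]
test_toy  <- dat[-idx, ]

# Names of the predictor columns, i.e. everything except the label
predictors <- setdiff(names(dat), "Species")

accuracy_toy <- sapply(1:10, function(k) {
  pred <- knn(train_toy[, predictors], test_toy[, predictors],
              cl = train_toy$Species, k = k)
  mean(pred == test_toy$Species)
})
accuracy_toy
```

For the wine data the analogous selection would be setdiff(names(train), "Cultivar").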

Instead of manually trying different values of k, we can use cross-validation to automatically select an optimal value for the kNN model. The caret package provides a convenient way to do this using the train() function, which evaluates multiple models and compares their performance.

In the code below, we fit a kNN model while tuning the number of neighbours (k) from 1 to 10:


set.seed(89)

knnModel <- train(
  Cultivar ~ ., 
  data = wine_stand, 
  method = "knn", 
  trControl = trainControl(method = "cv"), 
  tuneGrid = data.frame(k = 1:10)
)

knnModel
#> k-Nearest Neighbors 
#> 
#> 178 samples
#>  13 predictor
#>   3 classes: '1', '2', '3' 
#> 
#> No pre-processing
#> Resampling: Cross-Validated (10 fold) 
#> Summary of sample sizes: 160, 160, 161, 160, 161, 160, ... 
#> Resampling results across tuning parameters:
#> 
#>   k   Accuracy   Kappa    
#>    1  0.9549020  0.9321199
#>    2  0.9493464  0.9240232
#>    3  0.9549020  0.9322007
#>    4  0.9437908  0.9155658
#>    5  0.9607843  0.9411451
#>    6  0.9437908  0.9155211
#>    7  0.9663399  0.9493629
#>    8  0.9718954  0.9577350
#>    9  0.9718954  0.9577350
#>   10  0.9722222  0.9584087
#> 
#> Accuracy was used to select the optimal model using the largest value.
#> The final value used for the model was k = 10.

plot(knnModel)

The call to train() proceeds as follows:

The data are split internally using 10-fold cross-validation (method = "cv").
A separate kNN model is fitted for each value of k in the grid.
The performance of each model is evaluated on the held-out folds and compared.
The value of k that gives the best cross-validated performance is selected.

The resulting object (knnModel) contains the optimal value of k, the performance results for all tested values, and the final fitted model. This makes the procedure both efficient and more reliable than selecting k manually based on a single train–test split.

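The output shows that the cross-validated accuracy is highest for k = 10 (0.9722), although the values for k = 8 and k = 9 (both 0.9719) are nearly identical.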
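The cross-validation steps can also be written out by hand, which may help to see what is being averaged. The sketch below is a simplified illustration using knn() from the class package on the built-in iris data (a stand-in for the wine data); caret's own fold construction and bookkeeping differ in detail, and the object names are hypothetical.

```r
library(class)

set.seed(1)  # arbitrary seed, used for this sketch only

x <- scale(iris[, 1:4])  # standardized predictors
y <- iris$Species        # class labels

n_folds <- 10
# Randomly assign each observation to one of the folds
fold <- sample(rep(1:n_folds, length.out = nrow(x)))

k_grid <- 1:10
cv_accuracy <- sapply(k_grid, function(k) {
  fold_acc <- sapply(1:n_folds, function(f) {
    held_out <- fold == f
    # Fit on the other folds, predict the held-out fold
    pred <- knn(x[!held_out, ], x[held_out, ], cl = y[!held_out], k = k)
    mean(pred == y[held_out])
  })
  mean(fold_acc)  # average accuracy over the folds
})

best_k <- k_grid[which.max(cv_accuracy)]
```

Each value in cv_accuracy is an average over ten held-out folds, which is why it is less dependent on a single lucky or unlucky split than the test accuracy from one train-test split.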

Perform KNN

After selecting the optimal number of neighbours (k) using cross-validation, we can fit the final kNN model using this chosen value. This step uses the full training dataset and fixes k at its optimal value to make predictions on new, unseen observations.


set.seed(89)

wine.knn <- knn(train,test,cl=train$Cultivar, k = 10)   

wine.knn
#>  [1] 1 1 1 1 1 1 1 1 1 1 1 2 2 3 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3
#> Levels: 1 2 3

Model evaluation

After fitting a kNN model, it is important to formally evaluate how well it performs on unseen data. The main goal is to assess how accurately the model can classify new observations into the correct groups.

A standard approach is to use a confusion matrix, which compares the predicted class labels with the true class labels from the test set. This allows us to see not only the overall accuracy, but also which classes are most often confused with each other:

table(Predicted = wine.knn, Actual = test$Cultivar)
#>          Actual
#> Predicted  1  2  3
#>         1 11  0  0
#>         2  0 13  0
#>         3  0  1  9

From the confusion matrix, we can compute the classification accuracy, which is the proportion of correctly classified observations:

mean(wine.knn == test$Cultivar)
#> [1] 0.9705882
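The same number can be read off the confusion matrix directly: the correctly classified observations lie on the diagonal. The sketch below enters the table above by hand (the object name cm is illustrative) and divides the diagonal sum by the total.

```r
# Confusion matrix entered by hand from the table above
cm <- matrix(c(11,  0, 0,
                0, 13, 0,
                0,  1, 9),
             nrow = 3, byrow = TRUE,
             dimnames = list(Predicted = c("1", "2", "3"),
                             Actual    = c("1", "2", "3")))

correct <- sum(diag(cm))  # 33 correctly classified test wines
total   <- sum(cm)        # 34 test wines in total
correct / total
#> [1] 0.9705882
```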

A more comprehensive way to evaluate the performance of a kNN classifier is by using the confusionMatrix() function from the caret package. This function provides a detailed summary of classification performance by comparing the predicted class labels with the true class labels.

library(caret)

confusionMatrix(wine.knn,test$Cultivar)
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction  1  2  3
#>          1 11  0  0
#>          2  0 13  0
#>          3  0  1  9
#> 
#> Overall Statistics
#>                                           
#>                Accuracy : 0.9706          
#>                  95% CI : (0.8467, 0.9993)
#>     No Information Rate : 0.4118          
#>     P-Value [Acc > NIR] : 3.92e-12        
#>                                           
#>                   Kappa : 0.9554          
#>                                           
#>  Mcnemar's Test P-Value : NA              
#> 
#> Statistics by Class:
#> 
#>                      Class: 1 Class: 2 Class: 3
#> Sensitivity            1.0000   0.9286   1.0000
#> Specificity            1.0000   1.0000   0.9600
#> Pos Pred Value         1.0000   1.0000   0.9000
#> Neg Pred Value         1.0000   0.9524   1.0000
#> Prevalence             0.3235   0.4118   0.2647
#> Detection Rate         0.3235   0.3824   0.2647
#> Detection Prevalence   0.3235   0.3824   0.2941
#> Balanced Accuracy      1.0000   0.9643   0.9800

The output includes the confusion matrix, which shows how many observations are correctly and incorrectly classified for each class. It also provides overall performance measures such as accuracy, which represents the proportion of correctly classified observations, and the Kappa statistic, which adjusts accuracy for agreement occurring by chance.

In addition, class-specific metrics such as sensitivity and specificity are reported, giving further insight into how well the model performs for each individual class. Together, these results provide a detailed evaluation of the predictive performance of the kNN model.
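As a check on how these statistics relate to the confusion matrix, the sketch below recomputes Kappa, sensitivity and specificity by hand from the table above, using the standard definitions (object names are illustrative). Kappa compares the observed agreement with the agreement expected by chance from the row and column totals.

```r
# Confusion matrix entered by hand (rows: predicted, columns: actual)
cm <- matrix(c(11,  0, 0,
                0, 13, 0,
                0,  1, 9),
             nrow = 3, byrow = TRUE,
             dimnames = list(Predicted = c("1", "2", "3"),
                             Actual    = c("1", "2", "3")))
n <- sum(cm)

observed  <- sum(diag(cm)) / n                     # observed agreement (accuracy)
expected  <- sum(rowSums(cm) * colSums(cm)) / n^2  # agreement expected by chance
kappa_hat <- (observed - expected) / (1 - expected)
round(kappa_hat, 4)
#> [1] 0.9554

# Sensitivity: correctly predicted members of a class / all actual members
sensitivity <- diag(cm) / colSums(cm)

# Specificity: correctly rejected non-members / all actual non-members
specificity <- sapply(1:3, function(i) sum(cm[-i, -i]) / sum(cm[, -i]))

round(sensitivity, 4)
#>      1      2      3 
#> 1.0000 0.9286 1.0000
round(specificity, 4)
#> [1] 1.00 1.00 0.96
```

The single misclassified wine (a class 2 wine predicted as class 3) lowers the sensitivity of class 2 to 13/14 and the specificity of class 3 to 24/25.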

K Nearest Neighbour Analysis

dr. Annelies Agten

2026-04-18