Multivariate Statistics - StatUa

KNN - Exercise solutions

Questions:

Look at the range of body_mass_g (3000–6000) vs. bill_depth_mm (13–21).

If you don’t scale the data, which variable will dominate the “distance” calculation?

Scale your numeric predictors before running the algorithm.

Run the k-NN algorithm using different values for k (e.g., k=1, k=5, k=50).

What happens to the “smoothness” of your classification as k increases?

Use a loop or a plot to find the k that results in the lowest error rate on the test set.

k-NN is non-linear and non-parametric. Does it perform better than the linear LDA in areas where species “overlap” in the data?

If you choose k=1, you are sensitive to every single outlier in the data. If you choose k=344 (the whole dataset), every penguin will be classified as the most frequent species (Adelie).

Justify your “best possible solution” for k.

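library(palmerpenguins)  # assumed source of penguins (columns such as body_mass_g)

data(penguins)

# Drop rows with missing values
df <- na.omit(penguins)

# Keep species (column 1) and the four numeric measurements (columns 3 to 6)
df <- df[, c(1, 3:6)]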

KNN relies on distance calculations. If variables are on different scales, the variable with the largest range dominates the distance. In this dataset, body mass (measured in grams, spanning thousands of units) would overpower the bill and flipper measurements (measured in millimetres) unless the predictors are scaled, so KNN would effectively ignore the smaller-scale variables. Scaling is therefore essential, not optional.
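To make the dominance concrete, the following minimal sketch compares two hypothetical penguins (the values are made up for illustration and are not rows from the dataset). It rescales by the approximate ranges quoted in the questions, which is a simplifying assumption; the solution itself uses scale(), which standardizes by mean and standard deviation instead.

```r
# Two hypothetical penguins (illustrative values, not rows from the dataset)
a <- c(bill_depth_mm = 14, body_mass_g = 3500)
b <- c(bill_depth_mm = 20, body_mass_g = 3600)

# Share of the squared Euclidean distance contributed by each variable (raw units)
sq_raw <- (a - b)^2
share_raw <- sq_raw / sum(sq_raw)

# Crude rescaling by the approximate ranges from the exercise text (assumption)
ranges <- c(bill_depth_mm = 21 - 13, body_mass_g = 6000 - 3000)
sq_scaled <- ((a - b) / ranges)^2
share_scaled <- sq_scaled / sum(sq_scaled)

round(rbind(raw = share_raw, scaled = share_scaled), 3)
```

In raw units more than 99% of the squared distance comes from body mass, even though the two birds differ by most of the bill depth range and by only 100 g in mass. After rescaling, the picture reverses and bill depth carries more than 99% of the distance.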


library(class)  # provides knn(), which caret does not export
library(caret)
#> Loading required package: lattice
#> 
#> Attaching package: 'caret'
#> The following object is masked from 'package:purrr':
#> 
#>     lift

set.seed(345) 

df_scaled <- scale(df[,2:5])

df_scaled <- data.frame(df$species, df_scaled)

trainIndex <- createDataPartition(df_scaled$df.species, p = 0.8, list = FALSE) 

train <- df_scaled[trainIndex, ] 

test <- df_scaled[-trainIndex, ]
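The chunk above standardizes the full dataset before splitting. A common alternative, sketched here on simulated stand-in data (the data frame x, its values, and the object names are hypothetical), is to estimate the centering and scaling constants on the training rows only and reuse them for the test rows, so that no information from the test set enters the preprocessing.

```r
set.seed(345)

# Simulated stand-in for two numeric predictors (hypothetical values)
x <- data.frame(bill_depth_mm = runif(100, 13, 21),
                body_mass_g   = runif(100, 3000, 6000))

# Simple random 80/20 split (the solution uses createDataPartition() instead)
idx <- sample(nrow(x), size = 80)
train_raw <- x[idx, ]
test_raw  <- x[-idx, ]

# Standardize the training rows and keep the constants that were used
train_sc <- scale(train_raw)
ctr <- attr(train_sc, "scaled:center")
scl <- attr(train_sc, "scaled:scale")

# Apply the training means and standard deviations to the test rows
test_sc <- scale(test_raw, center = ctr, scale = scl)

round(colMeans(test_sc), 2)
```

The test columns will generally not have mean exactly 0 and standard deviation exactly 1, which is expected: they are expressed in the units learned from the training data.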

Effect of k on model behavior:

Small k (e.g., 1):

Very flexible
Captures noise
High variance

Large k:

Smooth decision boundaries
Ignores local structure
High bias
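The high-variance end of this trade-off can be illustrated with a small sketch on simulated two-class data (the data and object names are hypothetical, not part of the exercise). It classifies the training points themselves: with k = 1 every point is its own nearest neighbour, so the model reproduces the training labels perfectly, noise included.

```r
library(class)  # knn()

set.seed(1)
n <- 100

# Simulated two-class data with overlapping clouds (hypothetical, for illustration)
x <- rbind(matrix(rnorm(2 * n, mean = 0), ncol = 2),
           matrix(rnorm(2 * n, mean = 1.5), ncol = 2))
y <- factor(rep(c("A", "B"), each = n))

# Accuracy when the training points are classified by their own neighbours
train_acc <- sapply(c(1, 25), function(k) mean(knn(x, x, y, k = k) == y))
names(train_acc) <- c("k=1", "k=25")
train_acc
```

A perfect training accuracy at k = 1 says nothing about performance on new data, which is why the comparison that follows uses a held-out test set.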

# Try different k values

k_values <- c(1, 5, 50)
results <- list()

for (k in k_values) {
  # Classify the test set using the k nearest training observations
  pred <- knn(train = train[, 2:5], test = test[, 2:5],
              cl = train$df.species, k = k)

  # Store the test-set accuracy for this k
  acc <- confusionMatrix(pred, test$df.species)$overall["Accuracy"]
  results[[paste0("k=", k)]] <- acc
}

results
#> $`k=1`
#>  Accuracy 
#> 0.9384615 
#> 
#> $`k=5`
#>  Accuracy 
#> 0.9846154 
#> 
#> $`k=50`
#>  Accuracy 
#> 0.9538462

In the code below we use a loop to find the k that results in the lowest error rate on the test set. As k increases, the classification boundary becomes smoother and less sensitive to individual observations.

Here the optimal k is chosen by minimizing the test error (equivalently, maximizing test accuracy). For this particular train/test split, it reflects the best trade-off between overfitting and underfitting.

However, the “best” k is not absolute. It depends on the dataset, the noise level, and the goals of the analysis, among other things.

k_seq <- 1:50

accuracy <- numeric(length(k_seq))

for (i in seq_along(k_seq)) {
  pred <- knn(train[, 2:5], test[, 2:5], train$df.species,
              k = k_seq[i])

  # Test-set accuracy for the i-th value of k
  accuracy[i] <- confusionMatrix(pred, test$df.species)$overall["Accuracy"]
}

# Plot 
plot(k_seq, accuracy, type = "b", pch = 19, xlab = "k", ylab = "Accuracy", main = "KNN Performance")


# which.max() returns the first maximum, so ties go to the smallest k
best_k <- k_seq[which.max(accuracy)]

best_k
#> [1] 3
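The loop above selects k on the test set, so the reported accuracy for that k is likely to be somewhat optimistic. One alternative, sketched here on simulated stand-in data (the objects x_train, y_train, cv_acc and best_k_cv are hypothetical names), is to tune k by leave-one-out cross-validation on the training data only, using knn.cv() from the class package, and to keep the test set for a final check.

```r
library(class)  # knn.cv()

set.seed(345)
n <- 60

# Simulated stand-in for the scaled training predictors (hypothetical values)
x_train <- rbind(matrix(rnorm(2 * n, mean = 0), ncol = 2),
                 matrix(rnorm(2 * n, mean = 2), ncol = 2))
y_train <- factor(rep(c("A", "B"), each = n))

k_seq <- 1:30

# Leave-one-out accuracy: each training point is predicted from all the others
cv_acc <- sapply(k_seq, function(k) {
  mean(knn.cv(x_train, y_train, k = k) == y_train)
})

# First k that reaches the highest cross-validated accuracy
best_k_cv <- k_seq[which.max(cv_acc)]
best_k_cv
```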

If we choose a large k, the results become very smooth. However, this might oversimplify the problem and create high bias. In the extreme cases, we see:

k = 1: overfitting
k = N: trivial model (majority class)

The optimal k balances these two extremes (bias–variance trade-off). There is no “correct” k, only the best choice given the data, assumptions, and goals.
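The k = N extreme can be checked with a short sketch on simulated, imbalanced data (all values and object names are hypothetical). When k equals the number of training observations, every training point votes, so each new point receives the majority class regardless of where it lies.

```r
library(class)  # knn()

set.seed(2)

# Simulated imbalanced training data: 60 "A" and 40 "B" (hypothetical values)
x_train <- rbind(matrix(rnorm(120, mean = 0), ncol = 2),
                 matrix(rnorm(80, mean = 3), ncol = 2))
y_train <- factor(rep(c("A", "B"), times = c(60, 40)))

# New points placed deep inside the "B" cloud
x_new <- matrix(rnorm(20, mean = 3), ncol = 2)

# With k equal to the training size, every training point votes
pred_all <- knn(x_train, x_new, y_train, k = nrow(x_train))
table(pred_all)
```

All ten new points are labelled "A", the majority class, even though they were generated from the "B" cloud.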

pred_best <- knn(train[,2:5], test[,2:5], train$df.species, k = best_k)

confusionMatrix(pred_best, test$df.species)
#> Confusion Matrix and Statistics
#> 
#>            Reference
#> Prediction  Adelie Chinstrap Gentoo
#>   Adelie        29         1      0
#>   Chinstrap      0        12      0
#>   Gentoo         0         0     23
#> 
#> Overall Statistics
#>                                           
#>                Accuracy : 0.9846          
#>                  95% CI : (0.9172, 0.9996)
#>     No Information Rate : 0.4462          
#>     P-Value [Acc > NIR] : < 2.2e-16       
#>                                           
#>                   Kappa : 0.9757          
#>                                           
#>  Mcnemar's Test P-Value : NA              
#> 
#> Statistics by Class:
#> 
#>                      Class: Adelie Class: Chinstrap Class: Gentoo
#> Sensitivity                 1.0000           0.9231        1.0000
#> Specificity                 0.9722           1.0000        1.0000
#> Pos Pred Value              0.9667           1.0000        1.0000
#> Neg Pred Value              1.0000           0.9811        1.0000
#> Prevalence                  0.4462           0.2000        0.3538
#> Detection Rate              0.4462           0.1846        0.3538
#> Detection Prevalence        0.4615           0.1846        0.3538
#> Balanced Accuracy           0.9861           0.9615        1.0000

If we compare KNN to LDA, we see that KNN may perform better in regions where species overlap, but it is also more sensitive to noise and requires tuning. In general:

KNN:

non-linear
flexible
data-driven

LDA:

linear
assumption-based
more stable
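The linear versus non-linear contrast can be explored with a minimal sketch on simulated data with a circular class boundary (the data and object names are hypothetical; this is not a result for the penguins). It fits lda() from MASS and knn() from class on the same split and compares test accuracy. On a boundary like this one, a single linear rule is expected to struggle, while a local vote can follow the curve.

```r
library(class)  # knn()
library(MASS)   # lda()

set.seed(3)
n <- 400

# Simulated data with a circular class boundary (hypothetical, for illustration)
dat <- data.frame(x1 = runif(n, -1, 1), x2 = runif(n, -1, 1))
dat$y <- factor(ifelse(dat$x1^2 + dat$x2^2 < 0.5, "inside", "outside"))

idx <- sample(n, size = 300)
train_sim <- dat[idx, ]
test_sim  <- dat[-idx, ]

# LDA: one linear boundary based on class means and a pooled covariance matrix
lda_fit <- lda(y ~ x1 + x2, data = train_sim)
lda_acc <- mean(predict(lda_fit, test_sim)$class == test_sim$y)

# KNN: local majority vote among the 5 nearest training points
knn_pred <- knn(train_sim[, c("x1", "x2")], test_sim[, c("x1", "x2")],
                cl = train_sim$y, k = 5)
knn_acc <- mean(knn_pred == test_sim$y)

c(LDA = lda_acc, KNN = knn_acc)
```

When the classes are roughly Gaussian with similar covariance, the ranking can reverse: LDA's assumptions then hold and its lower variance tends to pay off.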

dr. Annelies Agten

2026-04-27
