Questions:
Look at the range of body_mass_g (3000–6000) vs. bill_depth_mm (13–21).
If you don’t scale the data, which variable will dominate the “distance” calculation?
Scale your numeric predictors before running the algorithm.
Run the k-NN algorithm using different values for k (e.g., k=1, k=5, k=50).
What happens to the “smoothness” of your classification as k increases?
Use a loop or a plot to find the k that results in the lowest error rate on the test set.
k-NN is non-linear and non-parametric. Does it perform better than the linear LDA in areas where species “overlap” in the data?
If you choose k=1, you are sensitive to every single outlier in the data. If you choose k=344 (the whole dataset), every penguin will be classified as the most frequent species (Adelie).
Justify your “best possible solution” for k.
KNN relies on distance calculations. If variables are on different scales, the variable with the largest range dominates. In this dataset, body mass would overpower other variables unless scaling is applied. This would make KNN effectively ignore smaller-scale variables. Thus, scaling is not optional—it is essential.
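To make the scale effect concrete, here is a minimal sketch with two made-up penguins. The values are hypothetical, not rows from the dataset, and the rescaling divides by the approximate ranges quoted in the question rather than using the z-scores that scale() produces.

```r
# Two hypothetical penguins (illustrative values, not taken from the dataset)
a <- c(body_mass_g = 3500, bill_depth_mm = 14)
b <- c(body_mass_g = 3600, bill_depth_mm = 20)

# Share of the squared Euclidean distance contributed by each raw variable
raw_contrib <- (a - b)^2
raw_share <- raw_contrib / sum(raw_contrib)

# Crude rescaling (assumption): divide by the approximate ranges from the question
ranges <- c(body_mass_g = 6000 - 3000, bill_depth_mm = 21 - 13)
scaled_contrib <- ((a - b) / ranges)^2
scaled_share <- scaled_contrib / sum(scaled_contrib)

round(raw_share, 3)
round(scaled_share, 3)
```

On the raw scale, the 100 g difference in body mass accounts for more than 99% of the squared distance, even though it is small relative to the range of that variable. After rescaling, the 6 mm difference in bill depth dominates instead.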
library(class)  # provides knn()
library(caret)  # provides createDataPartition() and confusionMatrix()
#> Loading required package: lattice
#>
#> Attaching package: 'caret'
#> The following object is masked from 'package:purrr':
#>
#> lift
set.seed(345)
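# Standardize the numeric predictors in columns 2-5 (mean 0, sd 1)
df_scaled <- scale(df[, 2:5])
# Re-attach the species labels; data.frame() names this column df.species
df_scaled <- data.frame(df$species, df_scaled)
# 80/20 train/test split, stratified by species
trainIndex <- createDataPartition(df_scaled$df.species, p = 0.8, list = FALSE)
train <- df_scaled[trainIndex, ]
test <- df_scaled[-trainIndex, ]
Effect of k on model behavior: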
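Small k (e.g., 1): the prediction follows individual observations closely, so it is sensitive to every outlier (high variance).
Large k: the prediction is smoother, but it may oversimplify the problem (high bias).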
# Try different k values
k_values <- c(1, 5, 50)
results <- list()
for (k in k_values) {
  pred <- knn(train = train[, 2:5], test = test[, 2:5],
              cl = train$df.species, k = k)
  acc <- confusionMatrix(pred, test$df.species)$overall["Accuracy"]
  results[[paste0("k=", k)]] <- acc
}
results
#> $`k=1`
#> Accuracy
#> 0.9384615
#>
#> $`k=5`
#> Accuracy
#> 0.9846154
#>
#> $`k=50`
#> Accuracy
#> 0.9538462
Test accuracy rises from 0.938 at k = 1 to 0.985 at k = 5, then falls back to 0.954 at k = 50. As k increases, the classification boundary becomes smoother and less sensitive to individual observations.
In the code below we use a loop to find the k that results in the lowest error rate on the test set. Choosing k this way, by minimizing test error (equivalently, maximizing test accuracy), reflects the best trade-off between overfitting (k too small) and underfitting (k too large).
However, the “best” k is not absolute. It depends on the dataset, the noise level, and the goals of the analysis.
k_seq <- 1:50
accuracy <- numeric(length(k_seq))
for (i in seq_along(k_seq)) {
  pred <- knn(train[, 2:5], test[, 2:5], train$df.species,
              k = k_seq[i])
  accuracy[i] <- confusionMatrix(pred, test$df.species)$overall["Accuracy"]
}
# Plot
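plot(k_seq, accuracy, type = "b", pch = 19, xlab = "k", ylab = "Accuracy", main = "KNN Performance")
If we choose a large k, the results become very smooth; however, this might oversimplify the problem and create high bias. In the extreme cases, we see:
k = 1: the classifier is sensitive to every single outlier in the data (high variance).
k = 344 (the whole dataset): every penguin is classified as the most frequent species, Adelie (high bias).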
The optimal k balances these two extremes (bias–variance trade-off). There is no “correct” k—only the best choice given the data, assumptions, and goals.
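The two extremes can be reproduced in a small sketch. It assumes the built-in iris data as a stand-in for the penguins, subset so that one class is clearly the most frequent, and it evaluates on the training points themselves purely to expose the behaviour.

```r
library(class)  # knn()

# Assumption: iris as a stand-in, subset to an imbalanced 50/30/20 class mix
idx <- c(1:50, 51:80, 101:120)
x <- scale(iris[idx, 1:4])
y <- iris$Species[idx]

# k equal to the number of training points: every vote includes all points,
# so every prediction is the most frequent class (setosa in this subset)
pred_all <- knn(train = x, test = x, cl = y, k = nrow(x))
table(pred_all)

# k = 1 on the training points: each point is its own nearest neighbour, so the
# labels are reproduced (barring exact duplicates that carry different labels)
pred_one <- knn(train = x, test = x, cl = y, k = 1)
mean(pred_one == y)
```

With k set to the training size the classifier ignores the predictors entirely, while k = 1 memorizes the training data. Neither says anything useful about performance on new observations, which is why an intermediate k is chosen using held-out data.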
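# Pick the k with the highest test accuracy (the first one if several tie)
best_k <- k_seq[which.max(accuracy)]
pred_best <- knn(train[, 2:5], test[, 2:5], train$df.species, k = best_k)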
confusionMatrix(pred_best, test$df.species)
#> Confusion Matrix and Statistics
#>
#>            Reference
#> Prediction  Adelie Chinstrap Gentoo
#>   Adelie        29         1      0
#>   Chinstrap      0        12      0
#>   Gentoo         0         0     23
#>
#> Overall Statistics
#>
#> Accuracy : 0.9846
#> 95% CI : (0.9172, 0.9996)
#> No Information Rate : 0.4462
#> P-Value [Acc > NIR] : < 2.2e-16
#>
#> Kappa : 0.9757
#>
#> Mcnemar's Test P-Value : NA
#>
#> Statistics by Class:
#>
#>                      Class: Adelie Class: Chinstrap Class: Gentoo
#> Sensitivity                 1.0000           0.9231        1.0000
#> Specificity                 0.9722           1.0000        1.0000
#> Pos Pred Value              0.9667           1.0000        1.0000
#> Neg Pred Value              1.0000           0.9811        1.0000
#> Prevalence                  0.4462           0.2000        0.3538
#> Detection Rate              0.4462           0.1846        0.3538
#> Detection Prevalence        0.4615           0.1846        0.3538
#> Balanced Accuracy           0.9861           0.9615        1.0000
If we compare KNN to LDA, we see that KNN may perform better in regions where species overlap, but it is also more sensitive to noise and requires tuning. In general:
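KNN: non-linear and non-parametric; it can adapt to regions where species overlap, but it is sensitive to noise and requires tuning k.
LDA: linear; its decision boundaries are less flexible, but it has no k to tune.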
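A side-by-side comparison could be sketched as follows. This is not the analysis from the exercise: it assumes the built-in iris data as a stand-in, a simple random 80/20 split, an arbitrary seed, and k = 5 as an illustrative value.

```r
library(class)  # knn()
library(MASS)   # lda()

set.seed(1)  # arbitrary seed, only for a reproducible split

# Assumption: iris as a stand-in dataset, simple random 80/20 split
n <- nrow(iris)
train_id <- sample(n, size = round(0.8 * n))
train_df <- iris[train_id, ]
test_df  <- iris[-train_id, ]

# Scale with training-set statistics only, then apply them to the test set
ctr <- colMeans(train_df[, 1:4])
sds <- apply(train_df[, 1:4], 2, sd)
x_train <- scale(train_df[, 1:4], center = ctr, scale = sds)
x_test  <- scale(test_df[, 1:4],  center = ctr, scale = sds)

# k-NN with an illustrative k
pred_knn <- knn(x_train, x_test, cl = train_df$Species, k = 5)

# LDA on the same predictors
fit_lda  <- lda(x_train, grouping = train_df$Species)
pred_lda <- predict(fit_lda, x_test)$class

# Test accuracy of both classifiers on the same held-out rows
acc <- c(knn = mean(pred_knn == test_df$Species),
         lda = mean(pred_lda == test_df$Species))
acc
```

Comparing both models on the same held-out rows keeps the comparison fair. A single split is noisy, though, so small differences in accuracy should not be over-interpreted.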