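Predict a new sample from its K nearest training samples, where nearness is measured by a distance metric; for regression, the prediction is typically the mean of the neighbors' responses.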
Choice of metric depends on predictor characteristics.
Centering and scaling the predictors beforehand balances the contribution of predictors measured on different scales to the distance calculation.
Small K leads to over-fitting; large K leads to under-fitting.
Numerical instability: as K increases, probability of ties also increases.
Use cross-validation or another resampling method to locate the optimal K.
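The points above can be sketched in a few lines of base R. This is a minimal illustration under stated assumptions, not how caret implements `method = "knn"`: `knn_predict` is a hypothetical helper, the toy data are made up for the example, and Euclidean distance is assumed as the metric.

```r
# Minimal KNN regression sketch (base R only).
# knn_predict is a hypothetical helper, not part of caret.
knn_predict <- function(train_x, train_y, new_x, k = 3) {
  # Center and scale using training-set statistics only
  mu  <- colMeans(train_x)
  sdv <- apply(train_x, 2, sd)
  z_train <- scale(train_x, center = mu, scale = sdv)
  z_new   <- (new_x - mu) / sdv
  # Euclidean distance from the new sample to every training sample
  d <- sqrt(rowSums(sweep(z_train, 2, z_new)^2))
  # Average the responses of the k closest training samples
  nearest <- order(d)[seq_len(k)]
  mean(train_y[nearest])
}

# Toy data (illustrative only): two predictors on very different scales
train_x <- matrix(c(1, 2, 3, 10, 11,
                    100, 210, 290, 1000, 1100), ncol = 2)
train_y <- c(1.0, 1.2, 1.4, 5.0, 5.2)

# The new sample sits among the first three rows, so the
# prediction is the mean of their responses (1.0, 1.2, 1.4)
pred <- knn_predict(train_x, train_y, new_x = c(2.5, 250), k = 3)
print(pred)
```

With `k = 1` the prediction is just the single nearest response, which is why small K tends to over-fit.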
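Minkowski distance between samples a and b over P predictors (q = 1 gives Manhattan distance, q = 2 gives Euclidean distance):
\[ \left(\sum_{j=1}^{P} |x_{aj}-x_{bj}|^q\right)^{\frac{1}{q}} \]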
Predictors with larger scales contribute more to the distance, so the predictors should be centered and scaled first.
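A small worked example may make the scale effect concrete. The sketch below assumes a hypothetical `minkowski` helper and three made-up samples whose second predictor is measured in the thousands; it compares which sample is nearest to `a` before and after centering and scaling.

```r
# Minkowski distance between two numeric vectors.
# minkowski is a hypothetical helper written for this illustration.
minkowski <- function(a, b, q = 2) sum(abs(a - b)^q)^(1 / q)

# Toy samples (illustrative only): predictor 1 is on a 0-1 scale,
# predictor 2 is in the thousands
x <- rbind(a = c(0.1, 1000),
           b = c(0.9, 1100),
           c = c(0.1, 1150))

# Raw distances: predictor 2 dominates, so b looks closer to a than c does
raw_ab <- minkowski(x["a", ], x["b", ])
raw_ac <- minkowski(x["a", ], x["c", ])

# After centering and scaling, both predictors contribute comparably
# and c becomes the nearer neighbor of a
z <- scale(x)
scaled_ab <- minkowski(z["a", ], z["b", ])
scaled_ac <- minkowski(z["a", ], z["c", ])

print(c(raw_ab = raw_ab, raw_ac = raw_ac,
        scaled_ab = scaled_ab, scaled_ac = scaled_ac))
```

The nearest neighbor of `a` changes once the predictors are put on a common scale, which is the reason for the `preProc = c("center","scale")` step in the tuning code.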
Visualisation: RMSE cross-validation profile
library(AppliedPredictiveModeling)
data(solubility)
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
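# Drop near-zero-variance predictors before computing distances
knnDescr <- solTrainXtrans[, -nearZeroVar(solTrainXtrans)]
# Tune K over 1..20 with 10-fold cross-validation on centered, scaled predictors
knnTune <- train(knnDescr,
                 solTrainY,
                 method = "knn",
                 preProc = c("center", "scale"),
                 tuneGrid = data.frame(.k = 1:20),
                 trControl = trainControl(method = "cv"))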
knnTune
## k-Nearest Neighbors
##
## 951 samples
## 225 predictors
##
## Pre-processing: centered (225), scaled (225)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 857, 855, 856, 856, 855, 856, ...
## Resampling results across tuning parameters:
##
##   k   RMSE      Rsquared   RMSE SD     Rsquared SD
##    1  1.234888  0.6751049  0.07727235  0.03698818
##    2  1.099533  0.7250253  0.08528516  0.05109635
##    3  1.061845  0.7413050  0.07553589  0.03927394
##    4  1.041760  0.7503342  0.07436228  0.03749301
##    5  1.044258  0.7474510  0.08648688  0.04660717
##    6  1.050852  0.7422064  0.08887342  0.04637823
##    7  1.046381  0.7433005  0.08994355  0.04716710
##    8  1.041637  0.7462864  0.09067192  0.04216248
##    9  1.060943  0.7371048  0.08653929  0.03717906
##   10  1.066790  0.7340358  0.08780004  0.03359751
##   11  1.067829  0.7332611  0.08361811  0.03156330
##   12  1.079111  0.7274550  0.08517134  0.03280445
##   13  1.089884  0.7218891  0.08342094  0.03109404
##   14  1.095974  0.7186049  0.08794008  0.03266009
##   15  1.097977  0.7185932  0.09165360  0.03543528
##   16  1.111528  0.7120296  0.09417452  0.03593776
##   17  1.116417  0.7098383  0.10197045  0.03956210
##   18  1.120359  0.7081447  0.09545899  0.03590309
##   19  1.124029  0.7063487  0.09588382  0.03629311
##   20  1.128590  0.7049276  0.09543241  0.03753892
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 8.
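The cross-validation profile itself can be drawn from the resampling table. The sketch below re-enters the RMSE column printed above into a plain vector so that it runs without refitting the model; the variable names `k`, `rmse`, and `best_k` are introduced here for illustration. With the fitted object still in the session, `plot(knnTune)` should give an equivalent profile.

```r
# Cross-validated RMSE for k = 1..20, copied from the printed results table
k <- 1:20
rmse <- c(1.234888, 1.099533, 1.061845, 1.041760, 1.044258,
          1.050852, 1.046381, 1.041637, 1.060943, 1.066790,
          1.067829, 1.079111, 1.089884, 1.095974, 1.097977,
          1.111528, 1.116417, 1.120359, 1.124029, 1.128590)

# K with the smallest cross-validated RMSE
best_k <- k[which.min(rmse)]

# RMSE profile across K, with the selected K marked
plot(k, rmse, type = "b",
     xlab = "#Neighbors (k)", ylab = "RMSE (cross-validation)")
abline(v = best_k, lty = 2)
```

The profile is fairly flat between k = 4 and k = 8, so the difference between the selected value and its neighbors is small relative to the reported RMSE SD.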