For classification

K-Nearest Neighbors

Classify a new sample based on its K closest neighbors in the predictor space, where closeness is measured by a distance metric.

The choice of distance metric depends on the characteristics of the predictors (e.g. Minkowski-type distances for continuous predictors; Hamming or Tanimoto distances for binary ones).

Centering and scaling the predictors beforehand balances the contribution of predictors measured on different scales to the distance calculation.
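
A minimal sketch of the idea on the iris data, using knn() from the class package (the data set and K = 5 are illustrative choices, not from the text):

library(class)

set.seed(100)
inTrain <- sample(nrow(iris), 100)
trainX  <- scale(iris[inTrain, 1:4])                      # center and scale the training predictors
testX   <- scale(iris[-inTrain, 1:4],
                 center = attr(trainX, "scaled:center"),  # reuse the training statistics
                 scale  = attr(trainX, "scaled:scale"))
pred <- knn(train = trainX, test = testX,
            cl = iris$Species[inTrain], k = 5)            # majority vote among the 5 nearest neighbors
table(pred, iris$Species[-inTrain])                       # confusion matrix on the held-out samples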


Tuning parameter

A small K leads to over-fitting, while a large K leads to under-fitting.

Numerical instability: as K increases, the probability of ties also increases.

Use cross-validation or resampling to locate the optimal value of K.
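
A minimal sketch of tuning K for a classifier by 10-fold cross-validation, mirroring the caret call used for regression below (the iris data and the 1:20 grid are illustrative choices):

library(caret)

set.seed(100)
knnFit <- train(Species ~ ., data = iris,
                method = "knn",
                preProc = c("center", "scale"),
                tuneGrid = data.frame(.k = 1:20),
                trControl = trainControl(method = "cv"))
knnFit$bestTune    # the K with the best resampled accuracy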


For regression

Minkowski distance

\[ \left( \sum^P_{j=1} |x_{aj}-x_{bj}|^q \right)^{\frac{1}{q}} \]

where q = 1 gives the Manhattan (city-block) distance and q = 2 the Euclidean distance.

Predictors measured on larger scales contribute more to the distance, so centering and scaling beforehand is needed.
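
A small helper that evaluates the formula for a chosen q (the function name is an illustrative choice):

# Minkowski distance between two samples xa and xb
minkowski <- function(xa, xb, q = 2) {
  sum(abs(xa - xb)^q)^(1 / q)
}

xa <- c(1, 2, 3)
xb <- c(2, 0, 3)
minkowski(xa, xb, q = 1)   # Manhattan distance: 3
minkowski(xa, xb, q = 2)   # Euclidean distance: sqrt(5)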

Drawbacks:

  1. Computation time: a naive implementation must compute the distance to every training sample for each prediction.
  2. The disconnect between local structure and predictive ability: kNN performs poorly in the presence of noise-laden predictors that are irrelevant to the response. Solutions: (1) remove irrelevant or noisy predictors beforehand; (2) weight neighbors by their distance to the new sample (see the sketch after this list).
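
A minimal sketch of the second solution for regression, using inverse-distance weights (the function name and the exact weighting scheme are illustrative assumptions; packages such as kknn offer kernel-weighted variants):

# Distance-weighted kNN regression: closer neighbors get larger weights
knnWeighted <- function(trainX, trainY, newX, k = 5) {
  d  <- sqrt(colSums((t(trainX) - newX)^2))   # Euclidean distance to every training sample
  nn <- order(d)[1:k]                         # indices of the k nearest neighbors
  w  <- 1 / (d[nn] + 1e-6)                    # inverse-distance weights; the offset avoids division by zero
  sum(w * trainY[nn]) / sum(w)                # weighted average of the neighbors' responses
}

set.seed(100)
X <- scale(matrix(rnorm(300), ncol = 3))      # centered and scaled predictors
y <- X[, 1] + rnorm(100, sd = 0.1)
knnWeighted(X, y, newX = c(0, 0, 0), k = 5)   # prediction at the origin of the scaled space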

Visualisation: RMSE cross-validation profile

library(AppliedPredictiveModeling)
data(solubility)
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
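# drop predictors with near-zero variance before modelling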
knnDescr <- solTrainXtrans[,-nearZeroVar(solTrainXtrans)]
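# tune K from 1 to 20 with 10-fold cross-validation,
# centering and scaling the predictors within each resample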
knnTune <- train(knnDescr,
                 solTrainY,
                 method = "knn",
                 preProc = c("center","scale"),
                 tuneGrid = data.frame(.k=1:20),
                 trControl = trainControl(method = "cv"))
knnTune
## k-Nearest Neighbors 
## 
## 951 samples
## 225 predictors
## 
## Pre-processing: centered (225), scaled (225) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 857, 855, 856, 856, 855, 856, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE      Rsquared   RMSE SD     Rsquared SD
##    1  1.234888  0.6751049  0.07727235  0.03698818 
##    2  1.099533  0.7250253  0.08528516  0.05109635 
##    3  1.061845  0.7413050  0.07553589  0.03927394 
##    4  1.041760  0.7503342  0.07436228  0.03749301 
##    5  1.044258  0.7474510  0.08648688  0.04660717 
##    6  1.050852  0.7422064  0.08887342  0.04637823 
##    7  1.046381  0.7433005  0.08994355  0.04716710 
##    8  1.041637  0.7462864  0.09067192  0.04216248 
##    9  1.060943  0.7371048  0.08653929  0.03717906 
##   10  1.066790  0.7340358  0.08780004  0.03359751 
##   11  1.067829  0.7332611  0.08361811  0.03156330 
##   12  1.079111  0.7274550  0.08517134  0.03280445 
##   13  1.089884  0.7218891  0.08342094  0.03109404 
##   14  1.095974  0.7186049  0.08794008  0.03266009 
##   15  1.097977  0.7185932  0.09165360  0.03543528 
##   16  1.111528  0.7120296  0.09417452  0.03593776 
##   17  1.116417  0.7098383  0.10197045  0.03956210 
##   18  1.120359  0.7081447  0.09545899  0.03590309 
##   19  1.124029  0.7063487  0.09588382  0.03629311 
##   20  1.128590  0.7049276  0.09543241  0.03753892 
## 
## RMSE was used to select the optimal model using  the smallest value.
## The final value used for the model was k = 8.
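
To draw the RMSE profile referred to in the heading above, the plot method for caret train objects gives a one-line visualisation:

plot(knnTune)   # cross-validated RMSE versus the number of neighbors K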