K-Nearest Neighbors (A Very Simple Example)

Erik Rodríguez Pacheco

K-nearest neighbors (KNN) is a nonparametric method used for classification and regression. The basic idea is that a new case is assigned to the class that is most common among its K nearest neighbors. Because the concept is simple, intuitive and easy to implement, it is a commonly used method.
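
The idea can be sketched in a few lines of base R. This is only an illustration, not the implementation used later in this example: the helper name knn_predict, the choice of Euclidean distance and the value k = 5 are all assumptions made for the sketch.

```r
# knn_predict is a hypothetical helper written for this illustration.
# It classifies one new observation by majority vote among its k nearest
# training rows, using Euclidean distance.
knn_predict <- function(train_x, train_y, new_x, k = 5) {
  train_x <- as.matrix(train_x)
  new_x <- as.numeric(unlist(new_x))
  # Euclidean distance from new_x to every training row
  d <- sqrt(rowSums(sweep(train_x, 2, new_x)^2))
  # Labels of the k closest training rows
  nearest <- train_y[order(d)[1:k]]
  # Most frequent label among them (on a tie, the first level wins)
  names(which.max(table(nearest)))
}

# Predict the species of the first iris flower from the other 149
first_flower <- knn_predict(iris[-1, 1:4], iris$Species[-1], iris[1, 1:4], k = 5)
first_flower
```

The function stores no fitted parameters: all the work happens at prediction time, by comparing the new case against the stored training rows.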

To illustrate its use, we will work with a data set that ships with R and is well suited to demonstrating classification models in general: the iris dataset.

1- We define the data set to work with. Here we could instead import a CSV or Excel file, or connect to a database via ODBC.
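
As a minimal sketch of the CSV route, the lines below write iris to a temporary file and read it back with base R. The names csv_path and Data_csv are made up for the illustration; in practice csv_path would point at your own file. Excel files and ODBC connections need add-on packages and are not shown.

```r
# Write iris to a temporary CSV so the example is self-contained
csv_path <- tempfile(fileext = ".csv")
write.csv(iris, csv_path, row.names = FALSE)

# stringsAsFactors = TRUE turns the Species column into a factor,
# which is what a classification model expects as the response
Data_csv <- read.csv(csv_path, stringsAsFactors = TRUE)
str(Data_csv)
```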

Data <- iris
head(Data, 10)
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1           5.1         3.5          1.4         0.2  setosa
## 2           4.9         3.0          1.4         0.2  setosa
## 3           4.7         3.2          1.3         0.2  setosa
## 4           4.6         3.1          1.5         0.2  setosa
## 5           5.0         3.6          1.4         0.2  setosa
## 6           5.4         3.9          1.7         0.4  setosa
## 7           4.6         3.4          1.4         0.3  setosa
## 8           5.0         3.4          1.5         0.2  setosa
## 9           4.4         2.9          1.4         0.2  setosa
## 10          4.9         3.1          1.5         0.1  setosa

2- Using random sampling, we define a learning table to fit the model and a test table to verify its predictive quality.

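# Draw 50 of the 150 row indices at random (no seed is set, so the split changes on every run)
Sample <- sample(1:150, 50)
testing <- Data[Sample, ]    # the 50 sampled rows form the test table
learning <- Data[-Sample, ]  # the remaining 100 rows form the learning table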
dim(Data)
## [1] 150   5
dim(learning)
## [1] 100   5
dim(testing)
## [1] 50  5
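
Because the split is random, the results change from run to run. A possible variant, shown here only as a sketch, fixes the random seed and checks how the three species are distributed across the two tables. The seed value 123 and the names idx, test_set and learn_set are arbitrary choices for the illustration, and this split will not reproduce the outputs shown in this example.

```r
set.seed(123)                      # arbitrary seed, makes the split repeatable
idx <- sample(nrow(iris), 50)      # 50 random row indices
test_set <- iris[idx, ]
learn_set <- iris[-idx, ]

# Class counts in each table; a purely random split is not guaranteed
# to keep the three species balanced
table(learn_set$Species)
table(test_set$Species)
```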

3- We build the model by feeding it the learning data. We indicate the maximum value of K that the model may use (kmax), and it determines the optimum. (It is important to note that a model should be calibrated to get the best result; here it runs with the default options.)

suppressWarnings(suppressMessages(library(kknn)))
model <- train.kknn(Species ~ ., data = learning, kmax = 9)
model
## 
## Call:
## train.kknn(formula = Species ~ ., data = learning, kmax = 9)
## 
## Type of response variable: nominal
## Minimal misclassification: 0.04
## Best kernel: optimal
## Best k: 9
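
According to the kknn documentation, train.kknn selects k by leave-one-out cross-validation. The sketch below illustrates that selection idea with knn.cv from the class package. This is a substitute technique, not what kknn does internally: class::knn.cv is unweighted KNN, and the sketch runs on the full iris data rather than on the learning table, so its numbers will differ from the output above. The names loo_error and best_k are made up for the illustration.

```r
library(class)  # recommended package distributed with R

set.seed(1)     # knn.cv breaks ties at random
x <- iris[, 1:4]
y <- iris$Species

# Leave-one-out misclassification rate for k = 1..9
loo_error <- sapply(1:9, function(k) {
  pred <- knn.cv(train = x, cl = y, k = k)
  mean(pred != y)
})
names(loo_error) <- 1:9
loo_error

# Smallest k that reaches the lowest leave-one-out error
best_k <- which.min(loo_error)
best_k
```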

4- We use the model we just built to predict the species of the test data, in order to determine how often it predicts correctly.


prediction <- predict(model, testing[, -5])
prediction
##  [1] virginica  setosa     setosa     virginica  setosa     versicolor
##  [7] versicolor virginica  setosa     setosa     setosa     versicolor
## [13] versicolor setosa     setosa     virginica  versicolor versicolor
## [19] setosa     virginica  virginica  setosa     versicolor virginica 
## [25] versicolor setosa     virginica  setosa     virginica  setosa    
## [31] setosa     setosa     setosa     virginica  setosa     versicolor
## [37] versicolor setosa     versicolor virginica  versicolor virginica 
## [43] versicolor virginica  virginica  virginica  virginica  setosa    
## [49] virginica  setosa    
## Levels: setosa versicolor virginica

5- To begin analyzing the quality of the model, we can build a confusion matrix. Each column of the matrix holds the number of predictions for a class, while each row holds the instances of an actual class.


CM <- table(testing[, 5], prediction)
CM
##             prediction
##              setosa versicolor virginica
##   setosa         20          0         0
##   versicolor      0         13         2
##   virginica       0          0        15

As can be seen, the model has done quite well. The diagonal holds the correctly classified cases; the only errors are two versicolor flowers that were predicted as virginica.

accuracy <- (sum(diag(CM)))/sum(CM)
accuracy
## [1] 0.96

The model has a high overall accuracy. However, this indicator alone is not sufficient to validate the usefulness of a model; other indices, such as positive precision and negative precision, should also be calculated.
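
One common way to look beyond overall accuracy is per-class precision and recall. These two names are an assumption of this sketch and are not the exact indices named above. The matrix below is copied by hand from the confusion matrix printed earlier, under the hypothetical name cm_doc, so the block runs on its own.

```r
# Confusion matrix copied from the output above
# (rows = actual class, columns = predicted class)
classes <- c("setosa", "versicolor", "virginica")
cm_doc <- matrix(c(20,  0,  0,
                    0, 13,  2,
                    0,  0, 15),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(actual = classes, predicted = classes))

# Precision: share of the predictions of a class that are correct
precision <- diag(cm_doc) / colSums(cm_doc)
# Recall: share of the actual members of a class that are found
recall <- diag(cm_doc) / rowSums(cm_doc)

round(rbind(precision, recall), 3)
```

For this matrix, the two misclassified versicolor flowers lower the recall of versicolor (13 of 15) and the precision of virginica (15 of 17), which the single accuracy figure does not show.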

Ideally, a cross-validation analysis should also be performed.
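
A minimal sketch of such an analysis is a 5-fold cross-validation written in base R. It uses knn from the class package, which is unweighted KNN, instead of the kknn model built above, so it illustrates the procedure rather than validating that exact model. The number of folds, the seed, the value k = 9 (taken from the best k reported earlier) and the names fold_id and fold_acc are assumptions made for the illustration. The kknn package has its own cv.kknn function, which is not used here.

```r
library(class)  # recommended package distributed with R

set.seed(42)    # arbitrary seed, makes the fold assignment repeatable
n_folds <- 5
# Assign each of the 150 rows to one of the folds at random (30 rows per fold)
fold_id <- sample(rep(1:n_folds, length.out = nrow(iris)))

fold_acc <- sapply(1:n_folds, function(f) {
  train_part <- iris[fold_id != f, ]
  test_part  <- iris[fold_id == f, ]
  pred <- knn(train_part[, 1:4], test_part[, 1:4],
              cl = train_part$Species, k = 9)
  mean(pred == test_part$Species)   # accuracy on the held-out fold
})

fold_acc         # one accuracy value per fold
mean(fold_acc)   # cross-validated accuracy estimate
```

Every row is used for testing exactly once, so the averaged accuracy depends less on one particular random split than the single 100/50 split used in this example.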

Plotting the model gives us information about the quality of the classification as a function of the number of neighbors.

plot(model)

[Figure: output of plot(model)]