K-Nearest Neighbors (KNN) is a nonparametric method used for classification and regression. The basic idea is that a new case is assigned to the class that is most common among its K nearest neighbors. Because the concept is simple, intuitive, and easy to implement, it is a commonly used method.
To illustrate its use, we will use a data set that ships with R and is well suited to demonstrating classification models in general: the Iris dataset.
1- We define the data set to work with. Here we could instead import a CSV or Excel file, or connect to a database via ODBC.
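The neighbor-voting idea can be sketched in a few lines of base R. This is only an illustration under simple assumptions (Euclidean distance, unweighted majority vote, ties broken by factor level order); `knn_vote` is a hypothetical helper written for this example and is not part of any package.

```r
# Classify one new observation by majority vote among its k nearest neighbors.
# knn_vote is a hypothetical helper written for this illustration only.
knn_vote <- function(train_x, train_y, new_x, k = 5) {
  train_x <- as.matrix(train_x)
  # Euclidean distance from the new case to every training row
  diffs <- sweep(train_x, 2, unlist(new_x))
  d <- sqrt(rowSums(diffs^2))
  # Labels of the k closest training rows
  nearest <- train_y[order(d)[1:k]]
  # Most frequent label among those neighbors (first level wins a tie)
  names(which.max(table(nearest)))
}

# Usage: classify the first iris flower using all other rows as training data
new_case <- iris[1, 1:4]
result <- knn_vote(iris[-1, 1:4], iris$Species[-1], new_case, k = 5)
result
```

The first flower is a setosa, and its five closest neighbors are all setosa as well, so the vote returns "setosa".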
Data <- iris
head(Data, 10)
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1           5.1         3.5          1.4         0.2  setosa
## 2           4.9         3.0          1.4         0.2  setosa
## 3           4.7         3.2          1.3         0.2  setosa
## 4           4.6         3.1          1.5         0.2  setosa
## 5           5.0         3.6          1.4         0.2  setosa
## 6           5.4         3.9          1.7         0.4  setosa
## 7           4.6         3.4          1.4         0.3  setosa
## 8           5.0         3.4          1.5         0.2  setosa
## 9           4.4         2.9          1.4         0.2  setosa
## 10          4.9         3.1          1.5         0.1  setosa
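If the data lived in a CSV file instead, the import could look roughly like the sketch below. To keep the example self-contained it first writes iris to a temporary file; `csv_path` and `Data_csv` are names made up for this illustration, and in practice the path would point to your own file.

```r
# Write iris to a temporary CSV so the example is self-contained,
# then read it back the way an external file would be imported.
csv_path <- tempfile(fileext = ".csv")   # stand-in for a real file path
write.csv(iris, csv_path, row.names = FALSE)

# stringsAsFactors = TRUE keeps Species as a factor, which classification needs
Data_csv <- read.csv(csv_path, stringsAsFactors = TRUE)
str(Data_csv)
```

Excel files or ODBC sources would need add-on packages (for example readxl, or DBI together with odbc), which are not shown here.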
2- Using random sampling, we define a learning table to fit the model and a test table to verify its predictive quality.
Sample <- sample(1:150, 50)
testing <- Data[Sample, ]
learning <- Data[-Sample, ]
dim(Data)
## [1] 150 5
dim(learning)
## [1] 100 5
dim(testing)
## [1] 50 5
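Because `sample()` draws different rows on every run, the split above is not reproducible as written. A possible variant, assuming an arbitrary seed of 123 chosen only for illustration, fixes the random number generator first and also checks how the three species are distributed across the two tables. With a seed set, the results will differ from the outputs shown in this post.

```r
set.seed(123)  # arbitrary seed, chosen only for illustration

Data <- iris
Sample <- sample(seq_len(nrow(Data)), 50)  # 50 row indices for the test table
testing <- Data[Sample, ]
learning <- Data[-Sample, ]

# Class balance in each table; a purely random split is only roughly balanced
table(learning$Species)
table(testing$Species)
```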
3- We build the model by feeding it the training data. We specify the maximum value of K that the model may consider, and it determines the optimum. (It is important to note that the model should be tuned to get the best result; here it is run with the default options.)
suppressWarnings(suppressMessages(library(kknn)))
model <- train.kknn(Species ~ ., data = learning, kmax = 9)
model
##
## Call:
## train.kknn(formula = Species ~ ., data = learning, kmax = 9)
##
## Type of response variable: nominal
## Minimal misclassification: 0.04
## Best kernel: optimal
## Best k: 9
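As a rough idea of what tuning beyond the defaults might look like, `train.kknn` also accepts a vector of candidate kernels and a `distance` parameter, and compares the combinations by leave-one-out cross-validation. The kernel list, `kmax = 15`, and the seed below are illustrative assumptions, not recommendations, and the split is recreated here so the block runs on its own.

```r
library(kknn)

set.seed(123)                        # arbitrary seed for a reproducible split
idx <- sample(seq_len(nrow(iris)), 50)
learning <- iris[-idx, ]

# Let train.kknn compare several kernels and every k up to 15
# using leave-one-out cross-validation on the learning table.
candidate_kernels <- c("rectangular", "triangular", "gaussian", "optimal")
model_tuned <- train.kknn(
  Species ~ ., data = learning,
  kmax = 15,
  kernel = candidate_kernels,
  distance = 2                       # Minkowski parameter; 2 = Euclidean
)

model_tuned$best.parameters          # selected kernel and k
```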
4- We run the model we just built on the test data to determine how often it predicts correctly.
prediction <- predict(model, testing[, -5])
prediction
##  [1] virginica  setosa     setosa     virginica  setosa     versicolor
##  [7] versicolor virginica  setosa     setosa     setosa     versicolor
## [13] versicolor setosa     setosa     virginica  versicolor versicolor
## [19] setosa     virginica  virginica  setosa     versicolor virginica
## [25] versicolor setosa     virginica  setosa     virginica  setosa
## [31] setosa     setosa     setosa     virginica  setosa     versicolor
## [37] versicolor setosa     versicolor virginica  versicolor virginica
## [43] versicolor virginica  virginica  virginica  virginica  setosa
## [49] virginica  setosa
## Levels: setosa versicolor virginica
5- To begin analyzing the quality of the model, we can build a confusion matrix. Each column of the matrix counts the predictions for one class, while each row counts the instances of one actual class.
CM <- table(testing[, 5], prediction)
CM
##             prediction
##              setosa versicolor virginica
##   setosa         20          0         0
##   versicolor      0         13         2
##   virginica       0          0        15
As can be seen, the model has done quite well. The diagonal holds the correct predictions, so accuracy is the sum of the diagonal divided by the total number of cases.
accuracy <- (sum(diag(CM)))/sum(CM)
accuracy
## [1] 0.96
The model has a high overall accuracy. However, this indicator alone is not sufficient to validate the usefulness of a model; it is necessary to calculate other indices such as positive precision, negative precision, etc.
Better still, performing a cross-validation analysis is highly recommended.
Plotting the model gives us information about the quality of the classification as a function of the number of neighbors.
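One way to read such indices in a three-class problem is as per-class precision and recall; that reading is an interpretation for this sketch, not something the model output defines. The matrix below is copied by hand from the confusion matrix printed earlier (rows are actual classes, columns are predictions), so the block runs without the model.

```r
# Confusion matrix copied from the printed output (rows = actual, columns = predicted)
classes <- c("setosa", "versicolor", "virginica")
CM <- matrix(c(20,  0,  0,
                0, 13,  2,
                0,  0, 15),
             nrow = 3, byrow = TRUE,
             dimnames = list(actual = classes, prediction = classes))

precision <- diag(CM) / colSums(CM)  # correct predictions / all predictions of that class
recall    <- diag(CM) / rowSums(CM)  # correct predictions / all actual cases of that class
round(cbind(precision, recall), 3)
```

Here the only errors are two versicolor flowers predicted as virginica, so versicolor recall drops to 13/15 (about 0.867) and virginica precision to 15/17 (about 0.882), while every other value is 1.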
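A minimal k-fold cross-validation could be sketched as follows. Note the swap: for simplicity it uses the plain, unweighted `knn()` from the class package (a recommended package that normally ships with R) rather than the kknn model above, and it does not scale the variables. The 5 folds, `k = 9`, and the seed are assumptions made for illustration.

```r
library(class)  # provides knn(), an unweighted k-nearest-neighbor classifier

set.seed(123)                                   # arbitrary seed
n_folds <- 5
# Assign every row of iris to one of n_folds folds at random
fold <- sample(rep(seq_len(n_folds), length.out = nrow(iris)))

fold_accuracy <- sapply(seq_len(n_folds), function(f) {
  train_rows <- fold != f
  pred <- knn(train = iris[train_rows, 1:4],
              test  = iris[!train_rows, 1:4],
              cl    = iris$Species[train_rows],
              k     = 9)
  mean(pred == iris$Species[!train_rows])       # accuracy on the held-out fold
})

fold_accuracy
mean(fold_accuracy)                             # cross-validated accuracy estimate
```

Each flower is used for testing exactly once, so the averaged accuracy depends less on one lucky or unlucky split than the single 100/50 partition does.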
plot(model)