In pattern recognition, the k-nearest neighbors algorithm (k-NN) is a non-parametric method used for classification and regression.[1] In both cases, the input consists of the k closest training examples in the feature space. The output depends on whether k-NN is used for classification or regression:
In k-NN classification, the output is a class membership. An object is classified by a majority vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small). If k = 1, then the object is simply assigned to the class of that single nearest neighbor.
In k-NN regression, the output is the property value for the object. This value is the average of the values of its k nearest neighbors.
k-NN is a type of instance-based learning, or lazy learning, where the function is only approximated locally and all computation is deferred until classification. The k-NN algorithm is among the simplest of all machine learning algorithms.
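The voting and averaging rules described above can be sketched in a few lines of base R. This is a minimal illustration, not the implementation used later in this document; the helper names `nearest_k`, `knn_vote` and `knn_average`, the toy data, and the choice of Euclidean distance are all assumptions made for the example.

```r
# Hypothetical helper: indices of the k training rows closest to `query`
# (Euclidean distance is an assumption; other metrics are possible).
nearest_k <- function(train_x, query, k) {
  diffs <- sweep(as.matrix(train_x), 2, as.numeric(query))  # each row minus query
  order(sqrt(rowSums(diffs^2)))[seq_len(k)]
}

# Classification: majority vote among the k nearest neighbours.
knn_vote <- function(train_x, train_y, query, k = 3) {
  votes <- table(train_y[nearest_k(train_x, query, k)])
  names(votes)[which.max(votes)]  # ties resolve to the first level
}

# Regression: average of the neighbours' values.
knn_average <- function(train_x, train_y, query, k = 3) {
  mean(train_y[nearest_k(train_x, query, k)])
}

# Toy data: two well-separated clusters in two dimensions.
toy_x <- matrix(c(0, 0,  0, 1,  1, 0,
                  5, 5,  5, 6,  6, 5), ncol = 2, byrow = TRUE)
toy_class <- factor(c("a", "a", "a", "b", "b", "b"))
toy_value <- c(1, 2, 3, 10, 11, 12)

knn_vote(toy_x, toy_class, c(0.2, 0.1), k = 3)     # "a"
knn_average(toy_x, toy_value, c(5.5, 5.5), k = 3)  # 11
```

Because nothing is fitted in advance, every prediction scans the whole training set, which is what "lazy learning" refers to.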
data(iris)
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
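The second summary, which follows, contains four extra columns (nsl, nsw, npl, npw), each ranging from 0 to 1, but the code that created them is not shown. A minimal sketch of one way to produce them, assuming min-max normalization (an assumption that is consistent with the reported medians, for example (5.8 - 4.3) / (7.9 - 4.3) = 0.4167 for nsl); the helper name `normalize` is hypothetical:

```r
data(iris)

# Hypothetical helper (not in the original): rescale a numeric vector to [0, 1]
normalize <- function(x) (x - min(x)) / (max(x) - min(x))

iris$nsl <- normalize(iris$Sepal.Length)  # normalized sepal length
iris$nsw <- normalize(iris$Sepal.Width)   # normalized sepal width
iris$npl <- normalize(iris$Petal.Length)  # normalized petal length
iris$npw <- normalize(iris$Petal.Width)   # normalized petal width

summary(iris)
```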
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species nsl nsw npl
## setosa :50 Min. :0.0000 Min. :0.0000 Min. :0.0000
## versicolor:50 1st Qu.:0.2222 1st Qu.:0.3333 1st Qu.:0.1017
## virginica :50 Median :0.4167 Median :0.4167 Median :0.5678
## Mean :0.4287 Mean :0.4406 Mean :0.4675
## 3rd Qu.:0.5833 3rd Qu.:0.5417 3rd Qu.:0.6949
## Max. :1.0000 Max. :1.0000 Max. :1.0000
## npw
## Min. :0.00000
## 1st Qu.:0.08333
## Median :0.50000
## Mean :0.45806
## 3rd Qu.:0.70833
## Max. :1.00000
library(caTools)
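# Stratified 50/50 split: sample.split() keeps the class proportions of Species in both halves.
# No seed is set, so the selected rows differ from run to run.
split <- sample.split(iris$Species, SplitRatio = 0.5)
train <- subset(iris, split == TRUE)
test <- subset(iris, split == FALSE)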
summary(train)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.00 Min. :1.100 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.70 1st Qu.:1.500 1st Qu.:0.250
## Median :5.800 Median :3.00 Median :4.300 Median :1.300
## Mean :5.827 Mean :3.04 Mean :3.753 Mean :1.175
## 3rd Qu.:6.450 3rd Qu.:3.35 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.20 Max. :6.700 Max. :2.500
## Species nsl nsw npl
## setosa :25 Min. :0.0000 Min. :0.0000 Min. :0.01695
## versicolor:25 1st Qu.:0.2222 1st Qu.:0.2917 1st Qu.:0.08475
## virginica :25 Median :0.4167 Median :0.4167 Median :0.55932
## Mean :0.4241 Mean :0.4333 Mean :0.46667
## 3rd Qu.:0.5972 3rd Qu.:0.5625 3rd Qu.:0.69492
## Max. :1.0000 Max. :0.9167 Max. :0.96610
## npw
## Min. :0.0000
## 1st Qu.:0.0625
## Median :0.5000
## Mean :0.4478
## 3rd Qu.:0.7083
## Max. :1.0000
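The split above leaves 25 rows of each species in the training set. To make explicit what such a stratified split does, here is a rough base R equivalent. It is a sketch, not the caTools implementation; the seed value and the names `train_idx`, `train_demo` and `test_demo` are arbitrary choices for the example.

```r
data(iris)
set.seed(42)  # arbitrary seed, used only to make this sketch reproducible

# Sample half of the row indices within each species separately
rows_by_species <- split(seq_len(nrow(iris)), iris$Species)
train_idx <- unlist(lapply(rows_by_species, function(idx) {
  sample(idx, size = length(idx) / 2)
}))

train_demo <- iris[train_idx, ]
test_demo <- iris[-train_idx, ]

table(train_demo$Species)  # 25 rows per species
table(test_demo$Species)   # 25 rows per species
```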
trains <- train[, 6:9] # normalized features (nsl, nsw, npl, npw) of the training rows
train_label <- train[, 5] # Species of the training rows
summary(train_label)
## setosa versicolor virginica
## 25 25 25
tests <- test[, 6:9] # normalized features (nsl, nsw, npl, npw) of the test rows
test_label <- test[, 5] # Species of the test rows, used only to evaluate the predictions
summary(test_label)
## setosa versicolor virginica
## 25 25 25
library(class)
pred <- knn(trains, tests, train_label, k = 5) # arguments: training features, test features to classify, true classes of the training rows
table(pred,test_label)
##             test_label
## pred         setosa versicolor virginica
##   setosa         25          0         0
##   versicolor      0         24         2
##   virginica       0          1        23
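The confusion table above can be condensed into an overall accuracy and a per-class recall. The sketch below re-enters the reported counts by hand so that it runs on its own; the object names `classes` and `cm` are introduced here for illustration only.

```r
# Counts copied from the k = 5 table above (rows: predicted, columns: actual)
classes <- c("setosa", "versicolor", "virginica")
cm <- matrix(c(25,  0,  0,
                0, 24,  2,
                0,  1, 23),
             nrow = 3, byrow = TRUE,
             dimnames = list(pred = classes, test_label = classes))

accuracy <- sum(diag(cm)) / sum(cm)  # correct predictions / all predictions
recall <- diag(cm) / colSums(cm)     # share of each actual class recovered

accuracy  # 0.96
recall    # setosa 1.00, versicolor 0.96, virginica 0.92
```

With the live objects from the session, `mean(pred == test_label)` gives the same accuracy without rebuilding the table.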
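# Plotting one factor against another draws a spine plot of predicted against actual class
plot(pred, test_label, col = c("cyan", "violet", "green"))
legend(0.6, 0.8, legend = c("setosa", "versicolor", "virginica"), fill = c("cyan", "violet", "green")) # 0.6 and 0.8 are the x and y coordinates of the legend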
center: subtracts the mean of an attribute from each of its values.
scale: divides an attribute by its standard deviation.
tuneLength: integer giving the number of candidate values to try for each tuning parameter. With tuneLength = 20, caret evaluates 20 values of k (5, 7, ..., 43 in the output below).
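As a small check of what centering and scaling do, the sketch below standardizes a toy vector by hand and compares the result with base R's `scale()`. The vector `x` is made up for the example.

```r
x <- c(2, 4, 6, 8)  # toy attribute values

centered <- x - mean(x)           # center: subtract the mean
standardized <- centered / sd(x)  # scale: divide by the standard deviation

# scale() applies both steps by default and returns a one-column matrix
all.equal(standardized, as.numeric(scale(x)))  # TRUE

mean(standardized)  # 0
sd(standardized)    # 1
```

After both steps every attribute has mean 0 and standard deviation 1, so no single attribute dominates the distance calculation because of its units.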
library(caret)
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 6) # 10-fold cross-validation, repeated 6 times
knnFit <- train(Species ~ npl + nsl + npw + nsw, data = iris, method = "knn", trControl = ctrl, preProcess = c("center", "scale"), tuneLength = 20) # tries 20 values of k
knnFit
## k-Nearest Neighbors
##
## 150 samples
## 4 predictor
## 3 classes: 'setosa', 'versicolor', 'virginica'
##
## Pre-processing: centered (4), scaled (4)
## Resampling: Cross-Validated (10 fold, repeated 6 times)
## Summary of sample sizes: 135, 135, 135, 135, 135, 135, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.9477778 0.9216667
## 7 0.9577778 0.9366667
## 9 0.9555556 0.9333333
## 11 0.9577778 0.9366667
## 13 0.9611111 0.9416667
## 15 0.9611111 0.9416667
## 17 0.9555556 0.9333333
## 19 0.9544444 0.9316667
## 21 0.9466667 0.9200000
## 23 0.9466667 0.9200000
## 25 0.9444444 0.9166667
## 27 0.9433333 0.9150000
## 29 0.9366667 0.9050000
## 31 0.9277778 0.8916667
## 33 0.9177778 0.8766667
## 35 0.9044444 0.8566667
## 37 0.8966667 0.8450000
## 39 0.8944444 0.8416667
## 41 0.8866667 0.8300000
## 43 0.8833333 0.8250000
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 15.
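To show roughly what the resampling above is doing, here is a hand-rolled 10-fold cross-validation over the same grid of k values, using `knn()` from the class package. It is a simplified sketch and not caret's procedure: the folds are not stratified, there is a single repeat, the data are standardized once up front instead of inside each fold, and the seed and the names `folds`, `cv_accuracy`, `k_grid`, `acc` and `best_k` are arbitrary. Its accuracies will therefore not match the table above exactly.

```r
library(class)  # provides knn()

data(iris)
set.seed(1)  # arbitrary seed, used only to make this sketch reproducible

x <- scale(iris[, 1:4])  # center and scale the four measurements
y <- iris$Species
folds <- sample(rep(1:10, length.out = nrow(x)))  # random fold label per row

# Mean hold-out accuracy over the 10 folds for a given k
cv_accuracy <- function(k) {
  fold_acc <- sapply(1:10, function(f) {
    held_out <- folds == f
    pred <- knn(x[!held_out, ], x[held_out, ], y[!held_out], k = k)
    mean(pred == y[held_out])
  })
  mean(fold_acc)
}

k_grid <- seq(5, 43, by = 2)  # the 20 candidate values of k
acc <- sapply(k_grid, cv_accuracy)
best_k <- k_grid[which.max(acc)]  # first k with the highest mean accuracy

data.frame(k = k_grid, Accuracy = acc)
best_k
```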
plot(knnFit)
pred <- knn(trains, tests, train_label, k = 13) # same arguments as before; k = 13 ties with the selected k = 15 on displayed cross-validated accuracy (0.9611111)
table(pred,test_label)
##             test_label
## pred         setosa versicolor virginica
##   setosa         25          0         0
##   versicolor      0         23         2
##   virginica       0          2        23
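The caret output earlier reports Kappa next to Accuracy. As a sketch of how Cohen's kappa relates to a confusion table, the hypothetical helper `cohen_kappa` below computes it for the two hold-out tables in this document, with the counts re-entered by hand (rows: predicted, columns: actual).

```r
# Hypothetical helper: Cohen's kappa from a square confusion matrix
cohen_kappa <- function(cm) {
  n <- sum(cm)
  observed <- sum(diag(cm)) / n                     # observed agreement (accuracy)
  expected <- sum(rowSums(cm) * colSums(cm)) / n^2  # agreement expected by chance
  (observed - expected) / (1 - expected)
}

cm_k5 <- matrix(c(25, 0, 0,  0, 24, 2,  0, 1, 23), nrow = 3, byrow = TRUE)
cm_k13 <- matrix(c(25, 0, 0,  0, 23, 2,  0, 2, 23), nrow = 3, byrow = TRUE)

sum(diag(cm_k5)) / sum(cm_k5)    # accuracy, k = 5: 0.96
cohen_kappa(cm_k5)               # 0.94
sum(diag(cm_k13)) / sum(cm_k13)  # accuracy, k = 13: 0.9466667
cohen_kappa(cm_k13)              # 0.92
```

On this particular 75-row split, k = 13 misclassifies one more flower than k = 5. A difference of a single observation on a split made without a fixed seed is too small to rank the two settings.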