K_Nearest_Neighbours

R Markdown

In pattern recognition, the k-nearest neighbors algorithm (k-NN) is a non-parametric method used for classification and regression.[1] In both cases, the input consists of the k closest training examples in the feature space. The output depends on whether k-NN is used for classification or regression:

In k-NN classification, the output is a class membership. An object is classified by a majority vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small). If k = 1, then the object is simply assigned to the class of that single nearest neighbor.

In k-NN regression, the output is the property value for the object. This value is the average of the values of its k nearest neighbors.

k-NN is a type of instance-based learning, or lazy learning, where the function is only approximated locally and all computation is deferred until classification. The k-NN algorithm is among the simplest of all machine learning algorithms.
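The two output rules described above can be sketched in a few lines of base R. This is only an illustration under simple assumptions (Euclidean distance, ties resolved in favour of the first class level); `knn_predict` is a hypothetical helper name and is not the implementation used by `class::knn` later in this report.

```r
# Minimal k-NN sketch in base R (illustrative only).
# knn_predict() is a hypothetical helper, not part of any package.
knn_predict <- function(train_x, train_y, query, k = 3) {
  train_x <- as.matrix(train_x)
  # Euclidean distance from the query point to every training row
  dists <- sqrt(rowSums(sweep(train_x, 2, query)^2))
  nearest <- order(dists)[seq_len(k)]
  if (is.factor(train_y)) {
    # Classification: majority vote among the k nearest labels
    # (which.max picks the first level when votes are tied)
    votes <- table(train_y[nearest])
    names(votes)[which.max(votes)]
  } else {
    # Regression: average of the k nearest target values
    mean(train_y[nearest])
  }
}

# Toy data: two well-separated groups of three points each
train_x <- rbind(c(1, 1), c(1, 2), c(2, 1), c(6, 6), c(6, 7), c(7, 6))
labels  <- factor(c("a", "a", "a", "b", "b", "b"))
values  <- c(1, 2, 3, 10, 11, 12)

class_pred <- knn_predict(train_x, labels, query = c(1.5, 1.5), k = 3)
value_pred <- knn_predict(train_x, values, query = c(6.5, 6.5), k = 3)
```

With this toy data the three nearest neighbours of each query point all come from one group, so the classification call returns the label of that group and the regression call returns the mean of that group's values.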

data(iris)
summary(iris)

##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
##

Including Plots

##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species        nsl              nsw              npl        
##  setosa    :50   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  versicolor:50   1st Qu.:0.2222   1st Qu.:0.3333   1st Qu.:0.1017  
##  virginica :50   Median :0.4167   Median :0.4167   Median :0.5678  
##                  Mean   :0.4287   Mean   :0.4406   Mean   :0.4675  
##                  3rd Qu.:0.5833   3rd Qu.:0.5417   3rd Qu.:0.6949  
##                  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##       npw         
##  Min.   :0.00000  
##  1st Qu.:0.08333  
##  Median :0.50000  
##  Mean   :0.45806  
##  3rd Qu.:0.70833  
##  Max.   :1.00000
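The summary above contains four extra columns (nsl, nsw, npl, npw), but the code that created them is not shown. Each one runs from 0 to 1, and their medians and means match a min-max rescaling of the four original measurements. The sketch below assumes that this is how they were derived; `min_max` is a hypothetical helper name.

```r
data(iris)

# Hypothetical helper: rescale a numeric vector to the [0, 1] range
min_max <- function(x) (x - min(x)) / (max(x) - min(x))

# Assumed mapping: n + initials of the original column name
iris$nsl <- min_max(iris$Sepal.Length)
iris$nsw <- min_max(iris$Sepal.Width)
iris$npl <- min_max(iris$Petal.Length)
iris$npw <- min_max(iris$Petal.Width)

summary(iris[, c("nsl", "nsw", "npl", "npw")])
```

Rescaling matters for k-NN because the neighbours are found by distance, so a feature with a wider numeric range would otherwise dominate the result.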

library(caTools)

split<-sample.split(iris$Species, SplitRatio = 0.5)
train<-subset(iris, split==TRUE)
test<-subset(iris, split==FALSE)
summary(train)

##   Sepal.Length    Sepal.Width    Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.00   Min.   :1.100   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.70   1st Qu.:1.500   1st Qu.:0.250  
##  Median :5.800   Median :3.00   Median :4.300   Median :1.300  
##  Mean   :5.827   Mean   :3.04   Mean   :3.753   Mean   :1.175  
##  3rd Qu.:6.450   3rd Qu.:3.35   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.20   Max.   :6.700   Max.   :2.500  
##        Species        nsl              nsw              npl         
##  setosa    :25   Min.   :0.0000   Min.   :0.0000   Min.   :0.01695  
##  versicolor:25   1st Qu.:0.2222   1st Qu.:0.2917   1st Qu.:0.08475  
##  virginica :25   Median :0.4167   Median :0.4167   Median :0.55932  
##                  Mean   :0.4241   Mean   :0.4333   Mean   :0.46667  
##                  3rd Qu.:0.5972   3rd Qu.:0.5625   3rd Qu.:0.69492  
##                  Max.   :1.0000   Max.   :0.9167   Max.   :0.96610  
##       npw        
##  Min.   :0.0000  
##  1st Qu.:0.0625  
##  Median :0.5000  
##  Mean   :0.4478  
##  3rd Qu.:0.7083  
##  Max.   :1.0000

trains <- train[, 6:9]     # columns 6-9: normalized features (nsl, nsw, npl, npw) of the training set
train_label <- train[, 5]  # column 5: Species, the class labels of the training set
summary(train_label)

##     setosa versicolor  virginica 
##         25         25         25

tests <- test[, 6:9]       # columns 6-9: normalized features (nsl, nsw, npl, npw) of the test set
test_label <- test[, 5]    # column 5: Species, the true class labels of the test set
summary(test_label)

##     setosa versicolor  virginica 
##         25         25         25
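Both label summaries show 25 flowers per species, because `sample.split` preserves the class ratios of the vector it is given. No seed is set before the split, so the selected rows (and the numbers that follow) will differ from run to run. The sketch below is a base R approximation of the same idea, not the caTools implementation; the seed value and the `train_demo` / `test_demo` names are arbitrary choices for illustration.

```r
data(iris)
set.seed(42)  # arbitrary seed, used only so the split can be repeated

# For each species, pick half of its row indices at random
rows_by_class <- split(seq_len(nrow(iris)), iris$Species)
train_idx <- unlist(lapply(rows_by_class,
                           function(rows) sample(rows, size = length(rows) / 2)))

train_demo <- iris[train_idx, ]
test_demo  <- iris[-train_idx, ]

table(train_demo$Species)
table(test_demo$Species)
```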

library(class)
pred <- knn(trains, tests, train_label, k = 5)  # arguments: training features, test features to classify, training class labels
table(pred,test_label)

##             test_label
## pred         setosa versicolor virginica
##   setosa         25          0         0
##   versicolor      0         24         2
##   virginica       0          1        23
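The confusion table above can be reduced to a few summary numbers. As a small worked example, the sketch below re-enters the table by hand (rows are predictions, columns are true labels) and computes overall accuracy plus per-class recall and precision; the variable names are illustrative.

```r
# Confusion matrix copied from the table above (rows: predicted, columns: actual)
classes <- c("setosa", "versicolor", "virginica")
conf <- matrix(c(25,  0,  0,
                  0, 24,  2,
                  0,  1, 23),
               nrow = 3, byrow = TRUE,
               dimnames = list(pred = classes, test_label = classes))

accuracy  <- sum(diag(conf)) / sum(conf)  # correct predictions / all predictions = 72 / 75
recall    <- diag(conf) / colSums(conf)   # share of each actual class that was found
precision <- diag(conf) / rowSums(conf)   # share of each predicted class that was right
```

Here 72 of the 75 test flowers sit on the diagonal, so the accuracy is 0.96; all three errors are confusions between versicolor and virginica.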

plot(pred,test_label, col=c("cyan","violet","green"))
legend(0.6,0.8, legend=c("setosa","versicolor","virginica"), fill=c("cyan","violet","green")) #0.6 and 0.8 are x and y coordinates for legend

center: subtracts the attribute's mean from each value.

scale: divides each value by the attribute's standard deviation.

tuneLength: integer giving the number of candidate values tried for each tuning parameter; here, 20 values of k.
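To make the two preprocessing steps concrete, the sketch below applies them by hand to one iris column and checks the result against base R's `scale()`. It also builds a grid of 20 odd values of k starting at 5, which matches the k column in the caret output shown further down; the exact grid caret generates is taken from that output rather than from the caret source, so treat it as an observation.

```r
data(iris)
x <- iris$Petal.Length

centered     <- x - mean(x)        # "center": subtract the mean
standardized <- centered / sd(x)   # "scale": divide by the standard deviation

# Base R's scale() performs both steps by default
same <- all.equal(standardized, as.numeric(scale(x)))

# 20 candidate values of k: 5, 7, ..., 43 (as in the tuning results)
k_grid <- seq(5, by = 2, length.out = 20)
```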

library(caret)

ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 6)  # 10-fold cross-validation, repeated 6 times
knnFit <- train(Species ~ npl + nsl + npw + nsw, data = iris, method = "knn",
                trControl = ctrl, preProcess = c("center", "scale"), tuneLength = 20)
knnFit

## k-Nearest Neighbors 
## 
## 150 samples
##   4 predictor
##   3 classes: 'setosa', 'versicolor', 'virginica' 
## 
## Pre-processing: centered (4), scaled (4) 
## Resampling: Cross-Validated (10 fold, repeated 6 times) 
## Summary of sample sizes: 135, 135, 135, 135, 135, 135, ... 
## Resampling results across tuning parameters:
## 
##   k   Accuracy   Kappa    
##    5  0.9477778  0.9216667
##    7  0.9577778  0.9366667
##    9  0.9555556  0.9333333
##   11  0.9577778  0.9366667
##   13  0.9611111  0.9416667
##   15  0.9611111  0.9416667
##   17  0.9555556  0.9333333
##   19  0.9544444  0.9316667
##   21  0.9466667  0.9200000
##   23  0.9466667  0.9200000
##   25  0.9444444  0.9166667
##   27  0.9433333  0.9150000
##   29  0.9366667  0.9050000
##   31  0.9277778  0.8916667
##   33  0.9177778  0.8766667
##   35  0.9044444  0.8566667
##   37  0.8966667  0.8450000
##   39  0.8944444  0.8416667
##   41  0.8866667  0.8300000
##   43  0.8833333  0.8250000
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 15.
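The Kappa column in the output above is Cohen's kappa, which rescales accuracy by the agreement expected from chance: kappa = (accuracy - chance) / (1 - chance). Assuming the three species stay equally frequent in every resample, chance agreement is 1/3, and the reported values can be checked from the Accuracy column alone. The numbers below are copied from the first three rows of the table.

```r
# Accuracy and Kappa copied from the k = 5, 7, 9 rows of the output above
accuracy       <- c(0.9477778, 0.9577778, 0.9555556)
kappa_reported <- c(0.9216667, 0.9366667, 0.9333333)

# Assumed chance agreement for three equally frequent classes
p_chance <- 1 / 3

kappa_check <- (accuracy - p_chance) / (1 - p_chance)
max_diff <- max(abs(kappa_check - kappa_reported))
```

The recomputed values agree with the reported ones up to the rounding of the printed table. The same table also shows that k = 13 and k = 15 have identical accuracy (0.9611111).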

plot(knnFit)

pred <- knn(trains, tests, train_label, k = 13)  # k = 13 ties with k = 15 for the highest cross-validated accuracy above (0.9611111)
table(pred,test_label)

##             test_label
## pred         setosa versicolor virginica
##   setosa         25          0         0
##   versicolor      0         23         2
##   virginica       0          2        23

Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.

Arnav Bhardwaj

27 February 2018
