This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

The k-nearest neighbor (KNN) classifier is one of the simplest classifiers to use, and because it is a lazy learner with no model to retrain, it is also convenient for datasets that change frequently. I downloaded the German Credit data from http://archive.ics.uci.edu/ml/datasets/Statlog+%28German+Credit+Data%29 to demonstrate an example of KNN in R.
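Before turning to the data, here is what KNN actually does, as a minimal from-scratch sketch on made-up numbers: to classify a new point, measure its distance to every training point, take the k closest, and let them vote. The knn() function from the class package used later does the same thing, only vectorized over many test points.

# Minimal KNN by hand on toy data: distance, k nearest, majority vote
train.x <- matrix(rnorm(20), ncol = 2)     # 10 toy points with 2 features
train.y <- factor(rep(c(0, 1), each = 5))  # their class labels
new.x   <- c(0, 0)                         # the point to classify
k       <- 3

d <- sqrt(rowSums(sweep(train.x, 2, new.x)^2))  # Euclidean distances
nearest <- order(d)[1:k]                        # indices of the k closest
names(which.max(table(train.y[nearest])))       # majority class among them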

setwd("C:/Users/cool/Desktop")  # adjust to the folder holding germancredit.csv
gc <- read.csv("germancredit.csv")
head(gc)
##   Default checkingstatus1 duration history purpose amount savings employ
## 1       0             A11        6     A34     A43   1169     A65    A75
## 2       1             A12       48     A32     A43   5951     A61    A73
## 3       0             A14       12     A34     A46   2096     A61    A74
## 4       0             A11       42     A32     A42   7882     A61    A74
## 5       1             A11       24     A33     A40   4870     A61    A73
## 6       0             A14       36     A32     A46   9055     A65    A73
##   installment status others residence property age otherplans housing
## 1           4    A93   A101         4     A121  67       A143    A152
## 2           2    A92   A101         2     A121  22       A143    A152
## 3           2    A93   A101         3     A121  49       A143    A152
## 4           2    A93   A103         4     A122  45       A143    A153
## 5           3    A93   A101         4     A124  53       A143    A153
## 6           2    A93   A101         4     A124  35       A143    A153
##   cards  job liable tele foreign
## 1     2 A173      1 A192    A201
## 2     1 A173      1 A191    A201
## 3     1 A172      2 A191    A201
## 4     1 A173      2 A191    A201
## 5     2 A173      2 A191    A201
## 6     1 A172      2 A192    A201
## Take a backup of the input data; the unscaled values will be needed later for plotting

gc.bkup <- gc

## Convert the dependent variable to a factor and standardize the numeric variables

gc$Default <- factor(gc$Default)

num.vars <- sapply(gc, is.numeric)
gc[num.vars] <- lapply(gc[num.vars], scale)
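## The scale() call above centers each numeric column on its mean and divides by its standard deviation, so variables measured in very different units (amount in DM, duration in months) contribute comparably to the distance calculation. An equivalent hand-rolled version, shown only to make the transformation explicit:

# Hand-rolled standardization, equivalent to scale() column by column
standardize <- function(x) (x - mean(x)) / sd(x)
all.equal(as.numeric(scale(gc.bkup$amount)), standardize(gc.bkup$amount))
## [1] TRUE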

## Select only 3 numeric variables for this demonstration, just to keep things simple

myvars <- c("duration", "amount", "installment")
gc.subset <- gc[myvars]

summary(gc.subset)
##      duration.V1          amount.V1         installment.V1   
##  Min.   :-1.401713   Min.   :-1.070329   Min.   :-1.7636311  
##  1st Qu.:-0.738298   1st Qu.:-0.675145   1st Qu.:-0.8697481  
##  Median :-0.240737   Median :-0.337176   Median : 0.0241348  
##  Mean   : 0.000000   Mean   : 0.000000   Mean   : 0.0000000  
##  3rd Qu.: 0.256825   3rd Qu.: 0.248338   3rd Qu.: 0.9180178  
##  Max.   : 4.237315   Max.   : 5.368103   Max.   : 0.9180178
## (The ".V1" suffixes above appear because scale() returns one-column matrices.)
## Let's predict on a test set of the first 100 observations; the rest serve as the training set.

set.seed(123)  # for reproducibility: knn() breaks voting ties at random
test <- 1:100
train.gc <- gc.subset[-test,]
test.gc <- gc.subset[test,]

train.def <- gc$Default[-test]
test.def <- gc$Default[test]
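## Here the first 100 rows serve as the test set, which is fine as long as the file is not ordered by outcome. A random split, sketched below as an alternative (not used in the rest of this document), is generally the safer habit:

# Alternative: draw a random 100-row test set instead of the first 100 rows
test.random <- sample(nrow(gc.subset), 100)  # random row indices
# To use it, rebuild the splits, e.g. train.gc <- gc.subset[-test.random, ]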

## Let's try K values (the number of nearest neighbors) of 1, 5 and 20, and compare them on the proportion of correct classifications and on the success rate. The optimum K can then be chosen from the outcomes below.

library(class)

knn.1 <-  knn(train.gc, test.gc, train.def, k=1)
knn.5 <-  knn(train.gc, test.gc, train.def, k=5)
knn.20 <- knn(train.gc, test.gc, train.def, k=20)

## Let's calculate the proportion of correct classifications for K = 1, 5 and 20

100 * sum(test.def == knn.1)/100  # accuracy (%) for K = 1
## [1] 68
100 * sum(test.def == knn.5)/100  # accuracy (%) for K = 5
## [1] 74
100 * sum(test.def == knn.20)/100 # accuracy (%) for K = 20
## [1] 81
## From the above proportions, K = 1 correctly classifies 68% of the outcomes, K = 5 classifies 74% and K = 20 classifies 81%.
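## Rather than trying a handful of K values by hand, we can sweep over a range and plot the accuracy; the results will vary slightly from run to run because knn() breaks voting ties at random:

# Test-set accuracy for K = 1 through 30
k.values <- 1:30
accuracy <- sapply(k.values, function(k)
  mean(test.def == knn(train.gc, test.gc, train.def, k = k)))
plot(k.values, accuracy, type = "b", xlab = "K", ylab = "proportion correct")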

## We should also look at the success rate -- here, the proportion of customers predicted as non-defaulters (class 0) who actually are non-defaulters -- as K increases.

table(knn.1 ,test.def)
##      test.def
## knn.1  0  1
##     0 54 11
##     1 21 14
## For K = 1, 54 of the 65 customers predicted as non-defaulters actually are, a success rate of about 83%. Let's look at K = 5 now.

table(knn.5 ,test.def)
##      test.def
## knn.5  0  1
##     0 62 13
##     1 13 12
## For K = 5, 62 of the 75 customers predicted as non-defaulters actually are, a success rate of about 83%. Let's look at K = 20 now.

table(knn.20 ,test.def)
##       test.def
## knn.20  0  1
##      0 69 13
##      1  6 12
## For K = 20, 69 of the 82 customers predicted as non-defaulters actually are, a success rate of about 84%.

## So increasing K improves the overall classification accuracy, but the number of defaulters misclassified as non-defaulters also rises, from 11 at K = 1 to 13 at K = 5 and K = 20. It is worse to class a customer as good when it is bad than to class a customer as bad when it is good.
## Weighing the success rates against that cost, K = 1 or K = 5 can be taken as the optimum K.
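## The success rates quoted above can be read straight off the confusion matrices; a small helper function (hypothetical, not from any package) makes the calculation explicit:

# Success rate: proportion of predicted class-0 customers who truly are class 0
success.rate <- function(pred, actual) {
  tab <- table(pred, actual)
  tab["0", "0"] / sum(tab["0", ])
}
success.rate(knn.1, test.def)   # 54/65, about 0.83
success.rate(knn.5, test.def)   # 62/75, about 0.83
success.rate(knn.20, test.def)  # 69/82, about 0.84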
## We can plot the data with the training set in hollow shapes and the predicted test points filled in.
## The plot for K = 1 can be created as follows (the point colors come from the unscaled installment values in gc.bkup):

plot(train.gc[, c("amount", "duration")],
     col = c(4, 3, 6, 2)[gc.bkup[-test, "installment"]],
     pch = c(1, 2)[as.numeric(train.def)],
     main = "Predicted Default, by 1 Nearest Neighbor", cex.main = .95)

# Overlay the test points; note the index is test (not -test), so the
# installment colors line up with the rows actually being plotted
points(test.gc[, c("amount", "duration")],
       bg = c(4, 3, 6, 2)[gc.bkup[test, "installment"]],
       pch = c(21, 24)[as.numeric(knn.1)], cex = 1.2, col = grey(.7))

legend("bottomright", pch = c(1, 16, 2, 17),
       legend = c("data 0", "pred 0", "data 1", "pred 1"),
       title = "default", bty = "n", cex = .8)

legend("topleft", fill = c(4, 3, 6, 2), legend = c(1, 2, 3, 4),
       title = "installment %", horiz = TRUE, bty = "n", cex = .8)

## Plots are a good way to represent data visually, but here it is arguably overkill, as there are too many data points on the plot.

Note that the above model is just a demonstration of KNN in R. It can be improved further by including the rest of the significant variables, the categorical ones included. The package ‘knncat’ can classify using both categorical and continuous variables.
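If ‘knncat’ is not available, one workaround (a rough sketch, not the package's own method) is to dummy-code the categorical variables with model.matrix() so that plain knn() can consume them alongside the scaled numeric ones:

# Expand every categorical column into 0/1 indicator columns,
# then run knn() on the full feature matrix
gc.coded <- model.matrix(Default ~ . - 1, data = gc)
knn.mixed <- knn(gc.coded[-test, ], gc.coded[test, ], train.def, k = 5)
mean(test.def == knn.mixed)  # overall accuracy with all variables included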