K Nearest Neighbors (KNN) is a non-parametric, supervised machine learning algorithm used for classification and regression. It measures similarity between observations with a distance function (usually Euclidean) and is best suited to continuous features.
A new point X is classified on the basis of its K nearest neighbors by distance. The assumption is that new observations will behave like their closest neighbors. Let's learn through an example: we will use the publicly available Presidential Debates dataset and predict whether the candidate wins or loses the speech.
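Before loading the real dataset, here is a minimal sketch of the idea on made-up numbers (the toy matrix, labels, and the new point below are invented purely for illustration): compute the Euclidean distance from the new point to every training point, then take a majority vote among the K closest.
#- toy training data: four points with known labels, and one new point to classify
train_x = matrix(c(1, 1,  1, 2,  4, 4,  5, 4), ncol = 2, byrow = TRUE)
train_y = c("lose", "lose", "win", "win")
new_x   = c(4.5, 4.2)
k       = 3
#- Euclidean distance from the new point to each training point
dists = sqrt(rowSums((train_x - matrix(new_x, nrow(train_x), 2, byrow = TRUE))^2))
#- labels of the k nearest neighbours, then the majority vote
nearest = train_y[order(dists)[1:k]]
names(which.max(table(nearest)))  # predicted class: "win"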
#- set working directory
setwd("C:/Users/awani/Desktop")
options(scipen = 999)
#- Load required libraries
library(caret)
library(knitr)
library(ggplot2)
library(tidyr)
library(dplyr)
library(ROCR)
#- Import Data
data = read.csv("US Presidential Data.csv", stringsAsFactors = F)
# - Univariate Analysis
# Win and loss in data set - Fair representation of both outcomes
kable(table(data$Win.Loss),
col.names = c("Win.Loss", "Frequency"), align = 'l')
| Win.Loss | Frequency |
|---|---|
| 0 | 595 |
| 1 | 929 |
# independent variables
ggplot(gather(data[,2:ncol(data)]), aes(value)) +
geom_histogram(bins = 5, fill = "blue", alpha = 0.6) +
facet_wrap(~key, scales = 'free_x')
Both outcomes are fairly represented in the dataset, which removes any need for oversampling. The distribution plots of the independent variables also look roughly normal.
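To put numbers on the class balance, a quick check of the outcome proportions (using the same Win.Loss column as above):
#- share of each outcome: roughly 39% losses and 61% wins
prop.table(table(data$Win.Loss))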
The essential next step is to split the data into training and test sets. I am using a 60-40 split for this purpose.
str(data)
## 'data.frame': 1524 obs. of 14 variables:
## $ Win.Loss : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Optimism : num 0.105 0.115 0.113 0.107 0.106 ...
## $ Pessimism : num 0.0505 0.0592 0.0493 0.0463 0.0517 ...
## $ PastUsed : num 0.438 0.291 0.416 0.463 0.334 ...
## $ FutureUsed : num 0.495 0.621 0.517 0.467 0.582 ...
## $ PresentUsed : num 0.067 0.0874 0.0672 0.0698 0.0836 ...
## $ OwnPartyCount : int 2 1 1 1 3 0 6 2 2 1 ...
## $ OppPartyCount : int 2 4 1 3 4 0 4 4 5 2 ...
## $ NumericContent: num 0.00188 0.00142 0.00213 0.00187 0.00223 ...
## $ Extra : num 4.04 3.45 3.46 4.2 4.66 ...
## $ Emoti : num 4.05 3.63 4.04 4.66 4.02 ...
## $ Agree : num 3.47 3.53 3.28 4.01 3.28 ...
## $ Consc : num 2.45 2.4 2.16 2.8 2.42 ...
## $ Openn : num 2.55 2.83 2.46 3.07 2.84 ...
# keep a numeric copy of the outcome for ROCR later, then convert Win.Loss to a factor
data$Actual = data$Win.Loss
data$Win.Loss = as.factor(data$Win.Loss)
levels(data$Win.Loss) = make.names(levels(factor(data$Win.Loss)))
# split data in training and test set.
Index = sample(1:nrow(data), size = round(0.6*nrow(data)), replace=FALSE)
train = data[Index ,]
test = data[-Index ,]
rm(Index)
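A side note: sample() above draws a purely random 60% of rows, so the win/loss ratio in the two sets can drift slightly. If you want a stratified split instead, caret's createDataPartition() preserves the outcome proportions; a sketch of that alternative (stratIndex, trainStrat and testStrat are names made up here, and this split is not used further in the post):
#- stratified 60-40 split, kept in separate objects so it does not overwrite the split above
stratIndex = createDataPartition(data$Win.Loss, p = 0.6, list = FALSE)
trainStrat = data[stratIndex, ]
testStrat  = data[-stratIndex, ]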
Now that the data is prepared, we train the KNN model on the training dataset and then use it to predict outcomes in the test dataset.
#- set seed
set.seed(123)
#- Define controls
x = trainControl(method = "repeatedcv",
number = 10,
repeats = 3,
classProbs = TRUE,
summaryFunction = twoClassSummary)
#- train model (columns 1:14 exclude the numeric Actual copy added above)
knn = train(Win.Loss~. , data = train[,1:14], method = "knn",
preProcess = c("center","scale"),
trControl = x,
metric = "ROC",
tuneLength = 10)
# print model results
knn
## k-Nearest Neighbors
##
## 914 samples
## 13 predictor
## 2 classes: 'X0', 'X1'
##
## Pre-processing: centered (13), scaled (13)
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 823, 822, 823, 823, 822, 822, ...
## Resampling results across tuning parameters:
##
## k ROC Sens Spec
## 5 0.8477925 0.6712963 0.8423377
## 7 0.8475974 0.6583333 0.8351190
## 9 0.8460884 0.6500000 0.8453247
## 11 0.8497216 0.6564815 0.8459632
## 13 0.8472992 0.6500000 0.8417749
## 15 0.8459002 0.6416667 0.8430087
## 17 0.8468114 0.6370370 0.8519913
## 19 0.8453774 0.6287037 0.8549784
## 21 0.8436296 0.6240741 0.8561797
## 23 0.8407947 0.6231481 0.8483658
##
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was k = 11.
plot(knn)
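As the printout says, ROC picked k = 11. If you want that value programmatically rather than reading it off the output, caret stores the chosen tuning parameters in the fitted object; a quick check on the knn object from above:
#- tuning parameter selected by repeated cross-validation (k = 11 here)
knn$bestTune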
The confusion matrix below shows that the model is doing a good job of classification: overall accuracy is about 77%, while sensitivity is about 62% and specificity is about 86%.
#- predicted probability of a win (second column of the class probabilities)
test$Predicted = predict(knn, test, "prob")[,2]
#- Area Under Curve
plot(performance(prediction(test$Predicted, test$Actual),
"tpr", "fpr"))
# use probability cut off 0.5 for classification
test$Predicted = ifelse(test$Predicted > 0.5, 1,0)
#- confusion matrix
confusionMatrix(factor(test$Predicted),
factor(test$Actual))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 145 53
## 1 90 322
##
## Accuracy : 0.7656
## 95% CI : (0.7299, 0.7987)
## No Information Rate : 0.6148
## P-Value [Acc > NIR] : 0.000000000000001656
##
## Kappa : 0.4901
## Mcnemar's Test P-Value : 0.002608
##
## Sensitivity : 0.6170
## Specificity : 0.8587
## Pos Pred Value : 0.7323
## Neg Pred Value : 0.7816
## Prevalence : 0.3852
## Detection Rate : 0.2377
## Detection Prevalence : 0.3246
## Balanced Accuracy : 0.7378
##
## 'Positive' Class : 0
##
Reference: https://www.r-bloggers.com/k-nearest-neighbor-step-by-step-tutorial/