K Nearest Neighbors (KNN) is a non-parametric, supervised machine learning algorithm used for classification and regression. It measures similarity between observations with a distance function (usually Euclidean) and is best suited to continuous features.
A new point X is classified on the basis of its K nearest neighbors by distance. The assumption is that new observations will behave like their closest neighbors. Let's learn through an example: we will use the publicly available Presidential Debates dataset and predict whether the candidate wins or loses the speech.
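Before loading the real dataset, here is a minimal sketch of the idea on made-up numbers (the toy matrix, labels, and the new point below are invented purely for illustration): compute the Euclidean distance from the new point to every training point, then take a majority vote among the K closest.
#- toy training data: four points with known labels, and one new point to classify
train_x = matrix(c(1, 1,  1, 2,  4, 4,  5, 4), ncol = 2, byrow = TRUE)
train_y = c("lose", "lose", "win", "win")
new_x   = c(4.5, 4.2)
k       = 3
#- Euclidean distance from the new point to each training point
dists = sqrt(rowSums((train_x - matrix(new_x, nrow(train_x), 2, byrow = TRUE))^2))
#- labels of the k nearest neighbours, then the majority vote
nearest = train_y[order(dists)[1:k]]
names(which.max(table(nearest)))  # predicted class: "win"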
#- set working directory
setwd("C:/Users/awani/Desktop")
options(scipen = 999)
#- Load required libraries
library(caret)
library(knitr)
library(ggplot2)
library(tidyr)
library(dplyr)
library(ROCR)
#- Import Data
data = read.csv("US Presidential Data.csv", stringsAsFactors = F)
# - Univariate Analysis
# Win and loss in data set - Fair representation of both outcomes
kable(table(data$Win.Loss),
col.names = c("Win.Loss", "Frequency"), align = 'l')
| Win.Loss | Frequency |
|---|---|
| 0 | 595 |
| 1 | 929 |
# independent variables
ggplot(gather(data[,2:ncol(data)]), aes(value)) +
geom_histogram(bins = 5, fill = "blue", alpha = 0.6) +
facet_wrap(~key, scales = 'free_x')
Both outcomes are fairly represented in the dataset, which removes any need for oversampling. The distribution plots of the independent variables also look roughly normal.
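To put numbers on the class balance, a quick check of the outcome proportions (using the same Win.Loss column as above):
#- share of each outcome: roughly 39% losses and 61% wins
prop.table(table(data$Win.Loss))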
The essential next step is to split the data into training and test sets. I am using a 60-40 split for this purpose.
str(data)
## 'data.frame': 1524 obs. of 14 variables:
## $ Win.Loss : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Optimism : num 0.105 0.115 0.113 0.107 0.106 ...
## $ Pessimism : num 0.0505 0.0592 0.0493 0.0463 0.0517 ...
## $ PastUsed : num 0.438 0.291 0.416 0.463 0.334 ...
## $ FutureUsed : num 0.495 0.621 0.517 0.467 0.582 ...
## $ PresentUsed : num 0.067 0.0874 0.0672 0.0698 0.0836 ...
## $ OwnPartyCount : int 2 1 1 1 3 0 6 2 2 1 ...
## $ OppPartyCount : int 2 4 1 3 4 0 4 4 5 2 ...
## $ NumericContent: num 0.00188 0.00142 0.00213 0.00187 0.00223 ...
## $ Extra : num 4.04 3.45 3.46 4.2 4.66 ...
## $ Emoti : num 4.05 3.63 4.04 4.66 4.02 ...
## $ Agree : num 3.47 3.53 3.28 4.01 3.28 ...
## $ Consc : num 2.45 2.4 2.16 2.8 2.42 ...
## $ Openn : num 2.55 2.83 2.46 3.07 2.84 ...
# keep a numeric copy of the outcome for ROCR later, then convert Win.Loss to a factor
data$Actual = data$Win.Loss
data$Win.Loss = as.factor(data$Win.Loss)
levels(data$Win.Loss) = make.names(levels(factor(data$Win.Loss)))
# split data in training and test set.
Index = sample(1:nrow(data), size = round(0.6*nrow(data)), replace=FALSE)
train = data[Index ,]
test = data[-Index ,]
rm(Index)
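A side note: sample() above draws a purely random 60% of rows, so the win/loss ratio in the two sets can drift slightly. If you want a stratified split instead, caret's createDataPartition() preserves the outcome proportions; a sketch of that alternative (stratIndex, trainStrat and testStrat are names made up here, and this split is not used further in the post):
#- stratified 60-40 split, kept in separate objects so it does not overwrite the split above
stratIndex = createDataPartition(data$Win.Loss, p = 0.6, list = FALSE)
trainStrat = data[stratIndex, ]
testStrat  = data[-stratIndex, ]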
Now that the data is prepared, we train the KNN model on the training dataset and then use it to predict outcomes in the test dataset.
#- set seed
set.seed(123)
#- Define controls
x = trainControl(method = "repeatedcv",
number = 10,
repeats = 3,
classProbs = TRUE,
summaryFunction = twoClassSummary)
#- train model (columns 1:14 exclude the numeric Actual copy added above)
knn = train(Win.Loss~. , data = train[,1:14], method = "knn",
preProcess = c("center","scale"),
trControl = x,
metric = "ROC",
tuneLength = 10)
# print model results
knn
## k-Nearest Neighbors
##
## 914 samples
## 13 predictor
## 2 classes: 'X0', 'X1'
##
## Pre-processing: centered (13), scaled (13)
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 823, 822, 823, 823, 822, 822, ...
## Resampling results across tuning parameters:
##
## k ROC Sens Spec
## 5 0.8477925 0.6712963 0.8423377
## 7 0.8475974 0.6583333 0.8351190
## 9 0.8460884 0.6500000 0.8453247
## 11 0.8497216 0.6564815 0.8459632
## 13 0.8472992 0.6500000 0.8417749
## 15 0.8459002 0.6416667 0.8430087
## 17 0.8468114 0.6370370 0.8519913
## 19 0.8453774 0.6287037 0.8549784
## 21 0.8436296 0.6240741 0.8561797
## 23 0.8407947 0.6231481 0.8483658
##
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was k = 11.
plot(knn)
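As the printout says, ROC picked k = 11. If you want that value programmatically rather than reading it off the output, caret stores the chosen tuning parameters in the fitted object; a quick check on the knn object from above:
#- tuning parameter selected by repeated cross-validation (k = 11 here)
knn$bestTune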
The confusion matrix below shows that the model is doing a good job of classification: overall accuracy is about 77%, while sensitivity is about 62% and specificity is about 86%.
#- predicted probability of a win (second column of the class probabilities)
test$Predicted = predict(knn, test, "prob")[,2]
#- Area Under Curve
plot(performance(prediction(test$Predicted, test$Actual),
"tpr", "fpr"))
# use probability cut off 0.5 for classification
test$Predicted = ifelse(test$Predicted > 0.5, 1,0)
#- confusion matrix
confusionMatrix(factor(test$Predicted),
factor(test$Actual))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 145 53
## 1 90 322
##
## Accuracy : 0.7656
## 95% CI : (0.7299, 0.7987)
## No Information Rate : 0.6148
## P-Value [Acc > NIR] : 0.000000000000001656
##
## Kappa : 0.4901
## Mcnemar's Test P-Value : 0.002608
##
## Sensitivity : 0.6170
## Specificity : 0.8587
## Pos Pred Value : 0.7323
## Neg Pred Value : 0.7816
## Prevalence : 0.3852
## Detection Rate : 0.2377
## Detection Prevalence : 0.3246
## Balanced Accuracy : 0.7378
##
## 'Positive' Class : 0
##
Reference: https://www.r-bloggers.com/k-nearest-neighbor-step-by-step-tutorial/