Non-parametric means making no assumptions about the underlying data distribution.
Non-parametric methods do not have a fixed number of parameters in the model.
Similarly, in KNN the number of model parameters effectively grows with the training data set:
you can think of each training case as a "parameter" in the model.
Yes, K-nearest neighbor can be used for regression.
In other words, the K-nearest neighbor algorithm can also be applied when the dependent variable is continuous.
In this case, the predicted value is the average of the values of its k nearest neighbors.
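For illustration only (the data below is simulated and is not part of the tutorial), a one-dimensional KNN regression prediction can be written in a few lines of base R:
# Minimal KNN regression sketch: predict y at x0 as the average response
# of the k training points whose x values are closest to x0 (simulated data)
knn_regress <- function(x0, x, y, k = 5) {
  nn <- order(abs(x - x0))[1:k]  # indices of the k training points closest to x0
  mean(y[nn])                    # predicted value = average of their responses
}

set.seed(1)
x <- runif(100, 0, 10)
y <- sin(x) + rnorm(100, sd = 0.2)
knn_regress(5, x, y, k = 5)      # close to sin(5) for this simulated data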
Pros
- Easy to understand
- No assumptions about data
- Can be applied to both classification and regression
- Works easily on multi-class problems
Cons
- Memory intensive / computationally expensive
- Sensitive to the scale of the data
- Does not work well on rare-event (skewed) target variables
- Struggles when the number of independent variables is high
For any given problem, a small value of k will lead to a large variance in predictions.
Conversely, setting k to a large value may lead to a large model bias.
To use a categorical variable, create dummy variables from it and include them instead of the original categorical variable. Unlike in regression, create k dummies instead of (k-1). For example, if a categorical variable named “Department” has 5 unique levels / categories, we create 5 dummy variables, each of which is 1 for its own department and 0 otherwise.
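A quick sketch of this in R (the data frame below is made up for illustration): model.matrix() with the intercept dropped creates one 0/1 dummy per level.
# Hypothetical data frame with a 5-level categorical "Department" variable
df <- data.frame(Department = c("HR", "IT", "Sales", "Finance", "Ops", "HR"))
# Dropping the intercept (- 1) yields one 0/1 dummy column per level (5 columns here)
dummies <- model.matrix(~ Department - 1, data = df)
dummies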
Cross-validation is a smart way to find the optimal value of K. It estimates the validation error rate by holding out a subset of the training set from the model-building process.
Cross-validation (say, 10-fold cross-validation) involves randomly dividing the training set into 10 groups, or folds, of approximately equal size. 90% of the data is used to train the model and the remaining 10% to validate it. The misclassification rate is then computed on the 10% validation data. This procedure is repeated 10 times, with a different group of observations treated as the validation set each time. This yields 10 estimates of the validation error, which are then averaged.
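As a hand-rolled sketch of this idea (simulated data; the tutorial itself uses caret::train below, which automates the whole procedure), the folds can be built with caret::createFolds() and the classifier fit with class::knn():
# Manual 10-fold cross-validation to compare a few values of k (simulated data)
library(caret)   # for createFolds()
library(class)   # for knn()

set.seed(1)
X <- as.data.frame(matrix(rnorm(200 * 4), ncol = 4))              # simulated predictors
y <- factor(ifelse(X$V1 + rnorm(200, sd = 0.5) > 0, "win", "loss"))
folds <- createFolds(y, k = 10)                                   # row indices of each held-out fold

cv_error <- function(k) {
  errs <- sapply(folds, function(idx) {
    pred <- knn(train = X[-idx, ], test = X[idx, ], cl = y[-idx], k = k)
    mean(pred != y[idx])                                          # misclassification rate on the fold
  })
  mean(errs)                                                      # average over the 10 folds
}

sapply(c(3, 5, 7, 9, 11), cv_error)                               # CV error estimate for each k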
We are going to use historical data of past win/loss statistics and the corresponding speeches. This dataset comprises 1524 observations on 14 variables. The dependent variable is Win/Loss, where 1 indicates a win and 0 indicates a loss. The independent variables are:
1. Proportion of words in the speech showing
a. Optimism
b. Pessimism
c. the use of Past
d. the use of Present
e. the use of Future
2. Number of times the speaker mentions his/her own party
3. Number of times the speaker mentions the opposing parties
# Read the data
data1 = read.csv("US Presidential Data.csv")
head(data1)
## Win.Loss Optimism Pessimism PastUsed FutureUsed PresentUsed
## 1 1 0.10450450 0.05045045 0.4381443 0.4948454 0.06701031
## 2 1 0.11457521 0.05923617 0.2912621 0.6213592 0.08737864
## 3 1 0.11257190 0.04930156 0.4159664 0.5168067 0.06722689
## 4 1 0.10723350 0.04631980 0.4634921 0.4666667 0.06984127
## 5 1 0.10582640 0.05172414 0.3342618 0.5821727 0.08356546
## 6 1 0.07586207 0.03448276 0.2800000 0.5200000 0.20000000
## OwnPartyCount OppPartyCount NumericContent Extra Emoti Agree Consc Openn
## 1 2 2 0.001877543 4.041 4.049 3.469 2.450 2.548
## 2 1 4 0.001418909 3.446 3.633 3.528 2.402 2.831
## 3 1 1 0.002131163 3.463 4.039 3.284 2.159 2.465
## 4 1 3 0.001871715 4.195 4.661 4.007 2.801 3.067
## 5 3 4 0.002229220 4.658 4.023 3.283 2.415 2.836
## 6 0 0 0.003290827 2.843 3.563 3.075 1.769 1.479
library(caret)
## Warning: package 'caret' was built under R version 3.4.4
## Loading required package: lattice
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 3.4.4
library(e1071)
## Warning: package 'e1071' was built under R version 3.4.4
# Transforming the dependent variable to a factor
data1$Win.Loss = as.factor(data1$Win.Loss)
In order to partition the data into training and validation sets, we use the createDataPartition() function in caret.
First we set the seed to 101 so that the same results can be reproduced. In createDataPartition(), the first argument is the dependent variable, p denotes the proportion of data we want in the training set (here 70%, with the rest going to the validation set), and list = F specifies that the indices should be returned as a vector rather than a list.
#Partitioning the data into training and validation data
set.seed(101)
index = createDataPartition(data1$Win.Loss, p = 0.7, list = F )
train = data1[index,]
validation = data1[-index,]
levels(train$Win.Loss) <- make.names(levels(factor(train$Win.Loss)))
levels(validation$Win.Loss) <- make.names(levels(factor(validation$Win.Loss)))
Here we set up repeated cross-validation using trainControl(). 'number' denotes the number of folds and 'repeats' denotes how many times the entire k-fold cross-validation is repeated. In this case, 3 separate 10-fold cross-validations are run.
# Setting up train controls
repeats = 3
numbers = 10
tunel = 10
set.seed(1234)
x = trainControl(method = "repeatedcv",
                 number = numbers,
                 repeats = repeats,
                 classProbs = TRUE,
                 summaryFunction = twoClassSummary)
Using the train() function we run our KNN model. Win.Loss is the dependent variable, and the full stop after the tilde denotes that all the remaining variables are used as independent variables. In 'data =' we pass our training set, 'method =' denotes which technique we want to deploy, and setting preProcess to center and scale standardizes the independent variables:
center: subtract the mean from each value.
scale: divide each value by its standard deviation.
trControl takes our 'x', which was obtained via trainControl(), and tuneLength is an integer that sets how many candidate values of the tuning parameter (here, k) are evaluated.
model1 <- train(Win.Loss ~ ., data = train, method = "knn",
                preProcess = c("center", "scale"),
                trControl = x,
                metric = "ROC",
                tuneLength = tunel)
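Purely for intuition (not part of the tutorial's pipeline), centering and scaling correspond to base R's scale(), which subtracts each column mean and divides by the column standard deviation:
# Standardizing the 13 numeric predictors by hand, for comparison with
# preProcess = c("center", "scale"); column 1 (Win.Loss) is dropped
train_std <- scale(train[, -1])
round(colMeans(train_std), 10)      # column means are ~0 after centering
round(apply(train_std, 2, sd), 10)  # column standard deviations are 1 after scaling
Note that caret additionally remembers the training-set means and standard deviations and applies them to new data at prediction time.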
# Summary of model
model1
## k-Nearest Neighbors
##
## 1068 samples
## 13 predictor
## 2 classes: 'X0', 'X1'
##
## Pre-processing: centered (13), scaled (13)
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 962, 961, 961, 961, 961, 961, ...
## Resampling results across tuning parameters:
##
## k ROC Sens Spec
## 5 0.8364872 0.6900890 0.8412665
## 7 0.8471507 0.6684475 0.8494250
## 9 0.8534144 0.6587689 0.8525019
## 11 0.8532324 0.6540457 0.8602020
## 13 0.8526851 0.6531940 0.8683994
## 15 0.8509041 0.6491096 0.8607071
## 17 0.8494333 0.6411537 0.8560995
## 19 0.8470142 0.6267325 0.8612432
## 21 0.8436754 0.6146922 0.8637529
## 23 0.8423458 0.6042973 0.8714375
##
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was k = 9.
plot(model1)
Finally, to make predictions on our validation set, we use the predict() function: the first argument is the fitted model and the second is the new data on which we want predictions. Setting type = "prob" returns the predicted class probabilities.
# Validation
valid_pred <- predict(model1,validation, type = "prob")
#Storing Model Performance Scores
library(ROCR)
## Warning: package 'ROCR' was built under R version 3.4.4
## Loading required package: gplots
## Warning: package 'gplots' was built under R version 3.4.4
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
## lowess
# ROCR prediction object built from the predicted probability of class X1
pred_val <- prediction(valid_pred[, 2], validation$Win.Loss)
# Calculating Area under Curve (AUC)
perf_val <- performance(pred_val,"auc")
perf_val
## An object of class "performance"
## Slot "x.name":
## [1] "None"
##
## Slot "y.name":
## [1] "Area under the ROC curve"
##
## Slot "alpha.name":
## [1] "none"
##
## Slot "x.values":
## list()
##
## Slot "y.values":
## [[1]]
## [1] 0.8670378
##
##
## Slot "alpha.values":
## list()
# Plotting the ROC curve (TPR vs FPR)
perf_val <- performance(pred_val, "tpr", "fpr")
plot(perf_val, col = "green", lwd = 1.5)
# Calculating the KS statistic: the maximum separation between TPR and FPR across thresholds
ks <- max(attr(perf_val, "y.values")[[1]] - (attr(perf_val, "x.values")[[1]]))
ks
## [1] 0.5516126
The area under the ROC curve (AUC) on the validation dataset is 0.867, and the KS statistic is 0.5516.
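As an optional follow-up (not part of the original write-up), the same fitted model can also produce hard class predictions, which caret's confusionMatrix() summarizes:
# Hard class predictions (X0 / X1) on the validation set and a confusion matrix
class_pred <- predict(model1, validation)
confusionMatrix(class_pred, validation$Win.Loss)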