This article is divided into 2 parts:
- Part 1: an interesting numerical example of KNN (http://people.revoledu.com/kardi/tutorial/KNN/KNN_Numerical-example.html)
- Part 2: an example of building a KNN classifier in R (https://www.r-bloggers.com/k-nearest-neighbor-step-by-step-tutorial/)
Using the Euclidean method, calculate the squared distance from each of the 4 samples to the query instance. We then obtain the squared distances from the query instance: point A (sample 1) = 16, point B (sample 2) = 25, point C (sample 3) = 9, point D (sample 4) = 13.
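As a quick check, here is a minimal R sketch of this computation, assuming the two-feature sample values from the linked tutorial: samples at (7, 7), (7, 4), (3, 4), (1, 4) and query instance (3, 7).

# Training samples as rows (values assumed from the linked tutorial) and the query instance
samples <- rbind(A = c(7, 7), B = c(7, 4), C = c(3, 4), D = c(1, 4))
query <- c(3, 7)
# Squared Euclidean distance of each sample to the query
sq_dist <- rowSums(sweep(samples, 2, query)^2)
sq_dist # A = 16, B = 25, C = 9, D = 13, matching the values above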
#### Standardization
- When the independent variables in the training data are measured in different units, we have to standardize the variables before calculating distances.
- Example of one standardization method: (X - mean)/sd (a small sketch follows this section).

#### KNN is non-parametric
- It does not make any assumptions about the data distribution.
- It does not have a fixed number of parameters.

#### KNN & k-means are different
- k-means: an unsupervised learning technique (the data are not labelled, i.e. there is no dependent variable). k-means is a clustering technique that tries to split the data points into K clusters.
- KNN: a supervised learning algorithm that tries to determine the classification of a point.

#### Find the best K value
- Use cross-validation.
- Divide the training set into 10 folds of equal size; 90% of the data is used to train the model and the remaining 10% to validate it.
- The misclassification rate is then computed on the 10% validation data.
- This procedure repeats 10 times, so we get 10 validation errors, which are then averaged out.
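For the standardization step, here is a minimal sketch in base R; scale() applies exactly this (X - mean)/sd transformation, and the preProcess = c('center', 'scale') option passed to caret's train() below does the same thing.

# Standardize a numeric vector: subtract the mean, then divide by the standard deviation
x <- c(2, 4, 6, 8, 10)
x_std <- (x - mean(x)) / sd(x)
# scale() performs the same centering and scaling, column-wise
all.equal(as.numeric(scale(x)), x_std) # TRUE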
#install.packages("caret")
library (caret)
#install.packages("e1071")
library(e1071)
#trainControl, createDataPartition of caret package
data1 <- read.csv("/Users/lytran/Desktop/R_cheetsheet/Data for practice/US Presidential Data.csv") # Read the data
class(data1$Win.Loss) # The dependent variable is stored as an integer, so it needs to be converted to a factor
## [1] "integer"
data1$Win.Loss <- as.factor(data1$Win.Loss)
head(data1)
## Win.Loss Optimism Pessimism PastUsed FutureUsed PresentUsed
## 1 1 0.10450450 0.05045045 0.4381443 0.4948454 0.06701031
## 2 1 0.11457521 0.05923617 0.2912621 0.6213592 0.08737864
## 3 1 0.11257190 0.04930156 0.4159664 0.5168067 0.06722689
## 4 1 0.10723350 0.04631980 0.4634921 0.4666667 0.06984127
## 5 1 0.10582640 0.05172414 0.3342618 0.5821727 0.08356546
## 6 1 0.07586207 0.03448276 0.2800000 0.5200000 0.20000000
## OwnPartyCount OppPartyCount NumericContent Extra Emoti Agree Consc Openn
## 1 2 2 0.001877543 4.041 4.049 3.469 2.450 2.548
## 2 1 4 0.001418909 3.446 3.633 3.528 2.402 2.831
## 3 1 1 0.002131163 3.463 4.039 3.284 2.159 2.465
## 4 1 3 0.001871715 4.195 4.661 4.007 2.801 3.067
## 5 3 4 0.002229220 4.658 4.023 3.283 2.415 2.836
## 6 0 0 0.003290827 2.843 3.563 3.075 1.769 1.479
levels(data1$Win.Loss) <- make.names(levels(data1$Win.Loss)) # Win.Loss is coded 1/0. When we later predict class probabilities, these levels become column names, so they must be valid R names.
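make.names() prepends "X" to any name that does not start with a letter or a dot, which is why the classes appear as X0 and X1 in the model output below:

make.names(c("0", "1")) # returns "X0" "X1"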
# Partition the data into training (70%) and test (30%) sets
set.seed(101)
index <- createDataPartition(data1$Win.Loss, p = 0.7, list = F)
train <- data1[index,]
test <- data1[-index,]
# Use cross-validation to find the best K and apply it to the model automatically.
set.seed(1234)
# number: number of folds
# repeats: number of times the entire cross-validation procedure is repeated
x <- trainControl(method = "repeatedcv", number = 10, repeats = 10,
classProbs = TRUE, summaryFunction = twoClassSummary)
# Train the model
# preProcess: standardize the independent variables
# 'center': subtract the mean from each value (m = X - mean)
# 'scale': divide the centered values by the standard deviation (m / sd)
model1 <- train(Win.Loss~., data = train, method = "knn",
preProcess = c('center', 'scale'),
trControl = x, metric = 'ROC', tuneLength = 10)
model1
## k-Nearest Neighbors
##
## 1068 samples
## 13 predictor
## 2 classes: 'X0', 'X1'
##
## Pre-processing: centered (13), scaled (13)
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 962, 961, 961, 961, 961, 961, ...
## Resampling results across tuning parameters:
##
## k ROC Sens Spec
## 5 0.8389224 0.6872822 0.8383939
## 7 0.8490032 0.6742973 0.8479371
## 9 0.8539545 0.6628107 0.8499394
## 11 0.8545280 0.6551916 0.8570023
## 13 0.8542918 0.6489489 0.8646830
## 15 0.8527016 0.6445819 0.8615967
## 17 0.8509577 0.6392973 0.8614452
## 19 0.8492015 0.6275494 0.8623846
## 21 0.8457687 0.6133449 0.8657436
## 23 0.8431844 0.6015970 0.8703566
##
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was k = 11.
plot(model1) # ROC is highest at K = 11
#Validate model on test data
valid <- predict(model1, test, type = 'prob')
# Check model performance based on AUC (area under the ROC curve)
#install.packages("ROCR")
library(ROCR)
pred_val <- prediction(valid[,2], test$Win.Loss)
perf_val <- performance(pred_val, "auc")
perf_val # The AUC of the model is 0.8642288
## An object of class "performance"
## Slot "x.name":
## [1] "None"
##
## Slot "y.name":
## [1] "Area under the ROC curve"
##
## Slot "alpha.name":
## [1] "none"
##
## Slot "x.values":
## list()
##
## Slot "y.values":
## [[1]]
## [1] 0.8642288
##
##
## Slot "alpha.values":
## list()
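The AUC itself sits in the y.values slot of the performance object (visible in the printout above) and can be extracted as a plain number:

# Pull the AUC out of the S4 performance object
auc <- perf_val@y.values[[1]]
auc # 0.8642288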
# Plot the ROC curve
perf_val <- performance(pred_val, 'tpr', 'fpr')
plot(perf_val, col = 'green')
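Besides AUC, we can also look at the misclassification rate discussed earlier. A minimal sketch using caret's confusionMatrix(), this time predicting hard class labels instead of probabilities:

# Predict class labels on the test set and cross-tabulate them against the truth
pred_class <- predict(model1, test)
confusionMatrix(pred_class, test$Win.Loss)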