Notice

This article is divided into 2 parts.

1st part_Question

  • We gathered the opinions of 4 people on whether a tissue is good or bad, based on 2 independent variables (Strength and Acid Durability).
  • Now we have 1 tissue with a specific strength (Strength = 3) and acid durability (Acid Durability = 7). This tissue is the query instance.
  • Question: how can we know whether that tissue is good or bad without running another survey?

1st part_Solution

  • Determine the parameter K = the number of nearest neighbors. K is the number of nearest neighbors that we believe share similar features with the query instance and can be put in 1 group with it.
  • Because we have 4 samples, suppose we use K = 3 (we believe that 3 samples can be in 1 group, whereas the 1 sample left belongs to another group).
  1. Using the Euclidean method, calculate the squared distance from each of the 4 samples to the query instance. This gives: point A (sample 1) = 16, point B (sample 2) = 25, point C (sample 3) = 9, point D (sample 4) = 13.

  2. Rank the distances from smallest to largest:
  • Point C (squared distance 9): rank 1, its client reviewed “Good”
  • Point D (squared distance 13): rank 2, its client reviewed “Bad”
  • Point A (squared distance 16): rank 3, its client reviewed “Good”
  • Point B (squared distance 25): rank 4, its client reviewed “Bad”
  • Because we chose K = 3, the query instance has 3 nearest neighbors: C, D, A.
  • Look at the categories of these neighbors (C, D, A). We have 2 “Good” and 1 “Bad”, so by majority vote this tissue is categorized as Good.
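
The ranking and voting steps above can be sketched in a few lines of R. This is only an illustration: it starts from the squared distances and reviews stated in the text (the sample coordinates themselves are not given here), and the variable names `sq_dist`, `review`, `nearest` and `predicted` are made up for this sketch.

```r
# Squared Euclidean distances from the query instance (Strength = 3,
# Acid Durability = 7) to each sample, as stated in the text.
sq_dist <- c(A = 16, B = 25, C = 9, D = 13)

# Client review attached to each sample, as stated in the text.
review <- c(A = "Good", B = "Bad", C = "Good", D = "Bad")

k <- 3

# Rank the samples from nearest to farthest and keep the first k.
nearest <- names(sort(sq_dist))[1:k]

# Majority vote among the k nearest neighbors.
votes <- table(review[nearest])
predicted <- names(votes)[which.max(votes)]

print(nearest)
print(votes)
print(predicted)
```

Using an odd K for a two-class problem, as here, avoids tied votes.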

2nd part

Assumptions of KNN

Standardization

  • When the independent variables in the training data are measured in different units, we have to standardize them before calculating distances. One standardization method: (X - mean) / sd

KNN is non-parametric

  • It does not make any assumptions about the data distribution.
  • It does not have a fixed number of parameters.

KNN & K-means are different

  • K-means: an unsupervised learning technique (the data is not labelled, so there is no dependent variable). K-means is a clustering technique that tries to split the data points into K clusters.
  • KNN: a supervised learning algorithm that tries to determine the class of a point.

Find the best K value

  • Use cross-validation.
  • Divide the training set into 10 folds of equal size. 90% of the data is used to train the model and the remaining 10% to validate it.
  • The misclassification rate is then computed on the 10% validation data.
  • This procedure is repeated 10 times, each time holding out a different fold, which gives 10 validation errors. These are then averaged, and the K with the lowest average error is chosen.
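
To make the standardization and cross-validation steps concrete, here is a minimal hand-written sketch. It is not the method used in the rest of this article (which relies on caret): it assumes the built-in `iris` data instead of the article's data, uses `knn()` from the `class` package, and the candidate values in `k_values` are arbitrary choices for illustration.

```r
library(class)  # knn(); 'class' is a recommended package shipped with R

set.seed(42)
x <- as.matrix(iris[, 1:4])   # built-in example data, not the article's data
y <- iris$Species

n_folds <- 10
# Randomly assign each row to one of 10 folds of equal size.
fold_id <- sample(rep(1:n_folds, length.out = nrow(x)))

k_values <- c(1, 3, 5, 7, 9)  # hypothetical candidate values of K

cv_error <- sapply(k_values, function(k) {
  fold_errors <- sapply(1:n_folds, function(f) {
    train_x <- x[fold_id != f, , drop = FALSE]   # 90% of the data
    valid_x <- x[fold_id == f, , drop = FALSE]   # remaining 10%

    # Standardize with (X - mean) / sd, using the mean and sd of the
    # training folds only, then apply the same shift and scale to the
    # validation fold.
    mu      <- colMeans(train_x)
    sigma   <- apply(train_x, 2, sd)
    train_z <- scale(train_x, center = mu, scale = sigma)
    valid_z <- scale(valid_x, center = mu, scale = sigma)

    pred <- knn(train_z, valid_z, cl = y[fold_id != f], k = k)
    mean(pred != y[fold_id == f])   # misclassification rate on this fold
  })
  mean(fold_errors)                 # average of the 10 validation errors
})

names(cv_error) <- k_values
print(cv_error)

# Pick the K with the lowest average misclassification rate.
best_k <- k_values[which.min(cv_error)]
print(best_k)
```

Computing the mean and sd inside each fold keeps information from the validation rows out of the training step.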

Package for KNN

#install.packages("caret")
library(caret) # provides trainControl, createDataPartition and train
#install.packages("e1071")
library(e1071)
data1 <- read.csv("/Users/lytran/Desktop/R_cheetsheet/Data for practice/US Presidential Data.csv") # Read data
class(data1$Win.Loss) # The dependent variable is an integer, so it needs to be converted to a factor
## [1] "integer"
data1$Win.Loss <- as.factor(data1$Win.Loss)
head(data1)
##   Win.Loss   Optimism  Pessimism  PastUsed FutureUsed PresentUsed
## 1        1 0.10450450 0.05045045 0.4381443  0.4948454  0.06701031
## 2        1 0.11457521 0.05923617 0.2912621  0.6213592  0.08737864
## 3        1 0.11257190 0.04930156 0.4159664  0.5168067  0.06722689
## 4        1 0.10723350 0.04631980 0.4634921  0.4666667  0.06984127
## 5        1 0.10582640 0.05172414 0.3342618  0.5821727  0.08356546
## 6        1 0.07586207 0.03448276 0.2800000  0.5200000  0.20000000
##   OwnPartyCount OppPartyCount NumericContent Extra Emoti Agree Consc Openn
## 1             2             2    0.001877543 4.041 4.049 3.469 2.450 2.548
## 2             1             4    0.001418909 3.446 3.633 3.528 2.402 2.831
## 3             1             1    0.002131163 3.463 4.039 3.284 2.159 2.465
## 4             1             3    0.001871715 4.195 4.661 4.007 2.801 3.067
## 5             3             4    0.002229220 4.658 4.023 3.283 2.415 2.836
## 6             0             0    0.003290827 2.843 3.563 3.075 1.769 1.479
# Win.Loss is coded 1/0. With classProbs = TRUE, caret uses the level names as
# column names for the predicted class probabilities, so they must be valid R names.
levels(data1$Win.Loss) <- make.names(levels(data1$Win.Loss)) # levels become X0 and X1

#Partition data into training (70%) and test (30%) data
set.seed(101)
index <- createDataPartition(data1$Win.Loss, p = 0.7, list = FALSE)
train <- data1[index,]
test <- data1[-index,]

#Use cross-validation so that train() finds the best K and applies it to the final model automatically.
set.seed(1234)
#number: number of folds
#repeats: number of times the whole 10-fold cross-validation is repeated
x <- trainControl(method = "repeatedcv", number = 10, repeats = 10,
                classProbs = TRUE, summaryFunction = twoClassSummary) 
#Train model
#preProcess: standardizes the independent variables
#center: subtract the mean from each value (m = value - mean)
#scale: divide the centered value by the standard deviation (m / sd)
model1 <- train(Win.Loss~., data = train, method = "knn",
                preProcess = c('center', 'scale'),
                trControl = x, metric = 'ROC', tuneLength = 10)
model1
## k-Nearest Neighbors 
## 
## 1068 samples
##   13 predictor
##    2 classes: 'X0', 'X1' 
## 
## Pre-processing: centered (13), scaled (13) 
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 962, 961, 961, 961, 961, 961, ... 
## Resampling results across tuning parameters:
## 
##   k   ROC        Sens       Spec     
##    5  0.8389224  0.6872822  0.8383939
##    7  0.8490032  0.6742973  0.8479371
##    9  0.8539545  0.6628107  0.8499394
##   11  0.8545280  0.6551916  0.8570023
##   13  0.8542918  0.6489489  0.8646830
##   15  0.8527016  0.6445819  0.8615967
##   17  0.8509577  0.6392973  0.8614452
##   19  0.8492015  0.6275494  0.8623846
##   21  0.8457687  0.6133449  0.8657436
##   23  0.8431844  0.6015970  0.8703566
## 
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was k = 11.
plot(model1) #ROC shows best K = 11

#Validate model on test data
valid <- predict(model1, test, type = 'prob')

#Evaluate the model by AUC (area under the ROC curve)
#install.packages("ROCR")
library(ROCR)
pred_val <- prediction(valid[,2], test$Win.Loss)
perf_val <- performance(pred_val, "auc")
perf_val # AUC on the test data is 0.8642288 (see Slot "y.values" below)
## An object of class "performance"
## Slot "x.name":
## [1] "None"
## 
## Slot "y.name":
## [1] "Area under the ROC curve"
## 
## Slot "alpha.name":
## [1] "none"
## 
## Slot "x.values":
## list()
## 
## Slot "y.values":
## [[1]]
## [1] 0.8642288
## 
## 
## Slot "alpha.values":
## list()
#Plot the ROC curve (true positive rate against false positive rate)
perf_val <- performance(pred_val, 'tpr', 'fpr')
plot(perf_val, col = 'green')
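
Printing the "performance" object shows every slot, while the AUC itself sits in the first element of `y.values`. The sketch below shows one way to pull out just that number. It uses small made-up `scores` and `labels` vectors (hypothetical, not the election data) so that it can run on its own, and cross-checks the result against the pairwise definition of AUC.

```r
library(ROCR)

# Hypothetical scores and true labels (1 = positive class), for illustration only.
scores <- c(0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2)
labels <- c(1,   1,   0,   1,   0,    1,   0,   0)

pred <- prediction(scores, labels)

# The AUC is stored as the first element of the y.values slot.
auc <- performance(pred, "auc")@y.values[[1]]
print(auc)

# Cross-check: AUC equals the share of (positive, negative) pairs in which
# the positive case receives the higher score.
pos <- scores[labels == 1]
neg <- scores[labels == 0]
auc_pairs <- mean(outer(pos, neg, ">"))
print(auc_pairs)
```

Applied to the objects in the listing above, the same slot access would be `performance(pred_val, "auc")@y.values[[1]]`.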