We are going to introduce the K-nearest neighbors (KNN) algorithm and show some practical ways of using it in R with the knn function from the class package.
Import libraries
library(class)
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
require(mlbench)
## Loading required package: mlbench
library(e1071)
library(base)
Step 1: Data collection
For this lesson, we will be using the Sonar data set (sonar signals) from the mlbench library. Sonar is a system for detecting objects under water and measuring the water's depth by emitting sound pulses and detecting the echoes that return. The complete description can be found in the mlbench documentation. For our purposes, this is a two-class (class M, metal cylinder/mine, and class R, rock) classification task with numeric data.
Let's look at the first six rows of Sonar:
data(Sonar)
head(Sonar)
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11
## 1 0.0200 0.0371 0.0428 0.0207 0.0954 0.0986 0.1539 0.1601 0.3109 0.2111 0.1609
## 2 0.0453 0.0523 0.0843 0.0689 0.1183 0.2583 0.2156 0.3481 0.3337 0.2872 0.4918
## 3 0.0262 0.0582 0.1099 0.1083 0.0974 0.2280 0.2431 0.3771 0.5598 0.6194 0.6333
## 4 0.0100 0.0171 0.0623 0.0205 0.0205 0.0368 0.1098 0.1276 0.0598 0.1264 0.0881
## 5 0.0762 0.0666 0.0481 0.0394 0.0590 0.0649 0.1209 0.2467 0.3564 0.4459 0.4152
## 6 0.0286 0.0453 0.0277 0.0174 0.0384 0.0990 0.1201 0.1833 0.2105 0.3039 0.2988
## V12 V13 V14 V15 V16 V17 V18 V19 V20 V21 V22
## 1 0.1582 0.2238 0.0645 0.0660 0.2273 0.3100 0.2999 0.5078 0.4797 0.5783 0.5071
## 2 0.6552 0.6919 0.7797 0.7464 0.9444 1.0000 0.8874 0.8024 0.7818 0.5212 0.4052
## 3 0.7060 0.5544 0.5320 0.6479 0.6931 0.6759 0.7551 0.8929 0.8619 0.7974 0.6737
## 4 0.1992 0.0184 0.2261 0.1729 0.2131 0.0693 0.2281 0.4060 0.3973 0.2741 0.3690
## 5 0.3952 0.4256 0.4135 0.4528 0.5326 0.7306 0.6193 0.2032 0.4636 0.4148 0.4292
## 6 0.4250 0.6343 0.8198 1.0000 0.9988 0.9508 0.9025 0.7234 0.5122 0.2074 0.3985
## V23 V24 V25 V26 V27 V28 V29 V30 V31 V32 V33
## 1 0.4328 0.5550 0.6711 0.6415 0.7104 0.8080 0.6791 0.3857 0.1307 0.2604 0.5121
## 2 0.3957 0.3914 0.3250 0.3200 0.3271 0.2767 0.4423 0.2028 0.3788 0.2947 0.1984
## 3 0.4293 0.3648 0.5331 0.2413 0.5070 0.8533 0.6036 0.8514 0.8512 0.5045 0.1862
## 4 0.5556 0.4846 0.3140 0.5334 0.5256 0.2520 0.2090 0.3559 0.6260 0.7340 0.6120
## 5 0.5730 0.5399 0.3161 0.2285 0.6995 1.0000 0.7262 0.4724 0.5103 0.5459 0.2881
## 6 0.5890 0.2872 0.2043 0.5782 0.5389 0.3750 0.3411 0.5067 0.5580 0.4778 0.3299
## V34 V35 V36 V37 V38 V39 V40 V41 V42 V43 V44
## 1 0.7547 0.8537 0.8507 0.6692 0.6097 0.4943 0.2744 0.0510 0.2834 0.2825 0.4256
## 2 0.2341 0.1306 0.4182 0.3835 0.1057 0.1840 0.1970 0.1674 0.0583 0.1401 0.1628
## 3 0.2709 0.4232 0.3043 0.6116 0.6756 0.5375 0.4719 0.4647 0.2587 0.2129 0.2222
## 4 0.3497 0.3953 0.3012 0.5408 0.8814 0.9857 0.9167 0.6121 0.5006 0.3210 0.3202
## 5 0.0981 0.1951 0.4181 0.4604 0.3217 0.2828 0.2430 0.1979 0.2444 0.1847 0.0841
## 6 0.2198 0.1407 0.2856 0.3807 0.4158 0.4054 0.3296 0.2707 0.2650 0.0723 0.1238
## V45 V46 V47 V48 V49 V50 V51 V52 V53 V54 V55
## 1 0.2641 0.1386 0.1051 0.1343 0.0383 0.0324 0.0232 0.0027 0.0065 0.0159 0.0072
## 2 0.0621 0.0203 0.0530 0.0742 0.0409 0.0061 0.0125 0.0084 0.0089 0.0048 0.0094
## 3 0.2111 0.0176 0.1348 0.0744 0.0130 0.0106 0.0033 0.0232 0.0166 0.0095 0.0180
## 4 0.4295 0.3654 0.2655 0.1576 0.0681 0.0294 0.0241 0.0121 0.0036 0.0150 0.0085
## 5 0.0692 0.0528 0.0357 0.0085 0.0230 0.0046 0.0156 0.0031 0.0054 0.0105 0.0110
## 6 0.1192 0.1089 0.0623 0.0494 0.0264 0.0081 0.0104 0.0045 0.0014 0.0038 0.0013
## V56 V57 V58 V59 V60 Class
## 1 0.0167 0.0180 0.0084 0.0090 0.0032 R
## 2 0.0191 0.0140 0.0049 0.0052 0.0044 R
## 3 0.0244 0.0316 0.0164 0.0095 0.0078 R
## 4 0.0073 0.0050 0.0044 0.0040 0.0117 R
## 5 0.0015 0.0072 0.0048 0.0107 0.0094 R
## 6 0.0089 0.0057 0.0027 0.0051 0.0062 R
Step 2: Prepare and explore data
Sonar is a data frame with 208 observations on 61 variables: 60 numeric predictors and one nominal variable (Class).
cat("number of rows and columns are:", nrow(Sonar), ncol(Sonar))
## number of rows and columns are: 208 61
Let's check how many M and R observations the Sonar data contains, and whether any of its columns contain NA values.
table(Sonar$Class)
##
## M R
## 111 97
apply(Sonar, 2, function(x) sum(is.na(x)))
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13
## 0 0 0 0 0 0 0 0 0 0 0 0 0
## V14 V15 V16 V17 V18 V19 V20 V21 V22 V23 V24 V25 V26
## 0 0 0 0 0 0 0 0 0 0 0 0 0
## V27 V28 V29 V30 V31 V32 V33 V34 V35 V36 V37 V38 V39
## 0 0 0 0 0 0 0 0 0 0 0 0 0
## V40 V41 V42 V43 V44 V45 V46 V47 V48 V49 V50 V51 V52
## 0 0 0 0 0 0 0 0 0 0 0 0 0
## V53 V54 V55 V56 V57 V58 V59 V60 Class
## 0 0 0 0 0 0 0 0 0
Here, we want to manually sample from our data to split Sonar into training (70%) and test (30%) sets.
SEED <- 123
set.seed(SEED)
data <- Sonar[sample(nrow(Sonar)), ] # shuffle data first
bound <- floor(0.7 * nrow(data))
df_train <- data[1:bound, ]
df_test <- data[(bound + 1):nrow(data), ]
cat("number of training and test samples are ", nrow(df_train), nrow(df_test))
## number of training and test samples are 145 63
Let's examine whether the train and test samples have been split properly, with roughly the same proportion of Class labels:
cat("number of training classes: \n", table(df_train$Class)/nrow(df_train))
## number of training classes:
## 0.5172414 0.4827586
cat("\n")
cat("number of test classes: \n", table(df_test$Class)/nrow(df_test))
## number of test classes:
## 0.5714286 0.4285714
Let's separate the predictors from the Class labels in the train and test sets to simplify our task:
X_train <- subset(df_train, select=-Class)
y_train <- df_train$Class
X_test <- subset(df_test, select=-Class) # exclude Class for prediction
y_test <- df_test$Class
Step 3: Training a model on the data
model_knn <- knn(train=X_train,
test=X_test,
cl=y_train, # class labels
k=3)
model_knn
## [1] M R M M M M M R M R M R M M M M M R R M M R M R M R M R M M R M R M R M M R
## [39] M M R M M R M M M M M R M M R R M R R R M R R M R
## Levels: M R
Step 4: Evaluate the model performance
As you can see, model_knn with k=3 provides the above predictions for the test set X_test. We can then see how many observations have been correctly or incorrectly classified by comparing the predictions to the true labels as follows:
conf_mat <- table(y_test, model_knn)
conf_mat
## model_knn
## y_test M R
## M 31 5
## R 7 20
To compute the accuracy, we sum up all the correctly classified observations (located on the diagonal) and divide by the total number of observations:
cat("Test accuracy: ", sum(diag(conf_mat))/sum(conf_mat))
## Test accuracy: 0.8095238
To assess whether k=3 is a good choice, and to see whether k=3 leads to overfitting or underfitting, we can use knn.cv, which performs leave-one-out cross-validation on the training set (i.e., it singles out one training sample at a time, treats it as a new example, and checks which class label it is assigned).
Below are the predicted classes for the training set using leave-one-out cross-validation. Now, let's examine its accuracy.
knn_loocv <- knn.cv(train=X_train, cl=y_train, k=3)
knn_loocv
## [1] R M M R M M R M R M M R R R M M M M R R R R M M M R M M M M R M R M M M M
## [38] R R M R M M M M R M R R M R M M R M R M R R M M M R R M M M M R M R M R R
## [75] R R M M R M M M R M M M R R R M M R R R M R M M M M R M M R R M M R M R M
## [112] M M R R R M M M M M M R R R R M M R R R R M M M M M R M R M R M R R
## Levels: M R
Let's create a confusion matrix from the training labels y_train and the cross-validated predictions knn_loocv, and compute the accuracy in the same way as above. What can you conclude from comparing the LOOCV accuracy with the test accuracy above?
conf_mat_cv <- table(y_train, knn_loocv)
conf_mat_cv
## knn_loocv
## y_train M R
## M 64 11
## R 18 52
cat("LOOCV accuracy: ", sum(diag(conf_mat_cv)) / sum(conf_mat_cv))
## LOOCV accuracy: 0.8
The difference between the cross-validated accuracy and the test accuracy suggests that k=3 leads to overfitting. Perhaps we should change k to lessen the overfitting.
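As a quick check (a sketch only), we could loop over a few candidate values of k and compare their LOOCV accuracies on the training set, using the X_train and y_train objects created above:
# sketch: compare LOOCV accuracy on the training set for several values of k
# ties in knn are broken at random, so we fix the seed for reproducibility
set.seed(123)
for (k in c(1, 3, 5, 7, 9)) {
  cv_pred <- knn.cv(train=X_train, cl=y_train, k=k)
  cat("k =", k, " LOOCV accuracy =", round(mean(cv_pred == y_train), 4), "\n")
}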
Step 5: Improve the performance of the model
As noted earlier, we have not standardized our training and test sets as part of preprocessing. In the rest of the tutorial, we will see the effect of choosing a suitable k through repeated cross-validation using the caret library.
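Before that, here is a minimal sketch of how standardization could be done by hand with the objects from Step 2, using only the training-set means and standard deviations so that no information leaks from the test set (caret will do the equivalent for us below via its preProcess argument):
# sketch: manual standardization using training-set statistics only
train_means <- apply(X_train, 2, mean)
train_sds <- apply(X_train, 2, sd)
X_train_std <- scale(X_train, center=train_means, scale=train_sds)
X_test_std <- scale(X_test, center=train_means, scale=train_sds)
model_knn_std <- knn(train=X_train_std, test=X_test_std, cl=y_train, k=3)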
In a cross-validation procedure:
- The data is divided into a finite number of mutually exclusive subsets.
- In each iteration, one subset is set aside and the remaining subsets are used as the training set.
- The subset that was set aside is then used as the test set (for prediction).
This is a method of cross-checking the model against its own data.
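To make this concrete, the following sketch builds the folds by hand with caret's createFolds function and averages the per-fold accuracies for k=3 (the actual tuning below is left to caret's train):
# sketch: manual 5-fold cross-validation of knn (k=3) on the training set
set.seed(123)
folds <- createFolds(y_train, k=5)  # list of held-out row indices, one element per fold
fold_acc <- sapply(folds, function(idx) {
  pred <- knn(train=X_train[-idx, ], test=X_train[idx, ], cl=y_train[-idx], k=3)
  mean(pred == y_train[idx])
})
cat("mean 5-fold CV accuracy:", round(mean(fold_acc), 4), "\n")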
SEED <- 2016
set.seed(SEED)
# create the training data using 70% of the overall Sonar data
in_train <- createDataPartition(Sonar$Class, p=0.7, list=FALSE) # create training indices
ndf_train <- Sonar[in_train, ]
ndf_test <- Sonar[-in_train, ]
Here, we specify the cross-validation method we want to use to find the best k via grid search. Later, we use the built-in plot function to assess how the accuracy changes for different choices of k.
# set up 5-fold cross-validation repeated 2 times
ctrl <- trainControl(method="repeatedcv", number=5, repeats=2)
nn_grid <- expand.grid(k=c(1,3,5,7))
nn_grid
## k
## 1 1
## 2 3
## 3 5
## 4 7
set.seed(SEED)
best_knn <- train(Class~., data=ndf_train,
method="knn",
trControl=ctrl,
preProcess = c("center", "scale"), # standardize
tuneGrid=nn_grid)
best_knn
## k-Nearest Neighbors
##
## 146 samples
## 60 predictor
## 2 classes: 'M', 'R'
##
## Pre-processing: centered (60), scaled (60)
## Resampling: Cross-Validated (5 fold, repeated 2 times)
## Summary of sample sizes: 117, 116, 116, 117, 118, 117, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 1 0.8568637 0.7098282
## 3 0.8147537 0.6214765
## 5 0.7978489 0.5866507
## 7 0.7326437 0.4527131
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 1.
So, seemingly, k=1 gives the highest accuracy under repeated cross-validation.
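As mentioned above, caret provides a plot method for train objects, so we can visualize how the cross-validated accuracy changes with k:
plot(best_knn)  # accuracy from repeated cross-validation versus the number of neighbors k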
Let's try dimensionality reduction as part of preprocessing to achieve a higher test accuracy than above. This may not have a definite solution, and it depends on how hard you try! (A PCA-based sketch appears at the end of this section.)
Use the above best_knn to make predictions on the test set (remember to remove Class before predicting). Then create a much nicer confusion matrix with the confusionMatrix function from caret, and examine the accuracy and its 95% confidence interval.
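A sketch of that exercise might look like the following (pred_best and X_test_best are names introduced here; the code mirrors what we do for the preprocessed model below):
X_test_best <- subset(ndf_test, select=-Class)  # drop Class before predicting
pred_best <- predict(best_knn, newdata=X_test_best)
confusionMatrix(ndf_test$Class, pred_best)  # reports accuracy with its 95% confidence interval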
In fact, comparing the cross-validated accuracy with the test accuracy indicates that k=1 (as could be guessed) is also overfitting, though it might be a better option than k=3. Since the initial dimension of our data is high (60 predictors is considered high!), you might have suspected that the better approach is to perform dimensionality reduction as part of preprocessing.
SEED <- 123
set.seed(SEED)
ctrl <- trainControl(method="repeatedcv", number=5, repeats=5)
nn_grid <- expand.grid(k=c(1, 3, 5, 7))
best_knn_reduced <- train( Class~., data=ndf_train, method="knn",
trControl=ctrl, preProcess=c("center", "scale","YeoJohnson"))
X_test <- subset(ndf_test, select=-Class)
pred_reduced <- predict(best_knn_reduced, newdata=X_test, model="best")
conf_mat_best_reduced <- confusionMatrix(ndf_test$Class, pred_reduced)
conf_mat_best_reduced
## Confusion Matrix and Statistics
##
## Reference
## Prediction M R
## M 29 4
## R 8 21
##
## Accuracy : 0.8065
## 95% CI : (0.6863, 0.8958)
## No Information Rate : 0.5968
## P-Value [Acc > NIR] : 0.0003688
##
## Kappa : 0.608
##
## Mcnemar's Test P-Value : 0.3864762
##
## Sensitivity : 0.7838
## Specificity : 0.8400
## Pos Pred Value : 0.8788
## Neg Pred Value : 0.7241
## Prevalence : 0.5968
## Detection Rate : 0.4677
## Detection Prevalence : 0.5323
## Balanced Accuracy : 0.8119
##
## 'Positive' Class : M
##
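Note that the Yeo-Johnson step above is a power transformation rather than a true dimensionality reduction. To experiment with the dimensionality-reduction idea, one option (a sketch, not a tuned solution) is to add "pca" to caret's preProcess, which projects the centered and scaled predictors onto principal components (by default keeping enough components to explain 95% of the variance):
# sketch: re-tune k with PCA added to the preprocessing pipeline
set.seed(SEED)
best_knn_pca <- train(Class~., data=ndf_train, method="knn",
                      trControl=ctrl,
                      preProcess=c("center", "scale", "pca"),
                      tuneGrid=nn_grid)
pred_pca <- predict(best_knn_pca, newdata=subset(ndf_test, select=-Class))
confusionMatrix(ndf_test$Class, pred_pca)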
END