The K-Nearest Neighbors Classification algorithm classifies observations by allowing the K observations in the training set that are nearest to the new observation to “vote” on the class of the new observation. More specifically, the algorithm works as follows:
K observations in the training set that are nearest to the new observation.K nearest neighbors.We will demonstrate this classification algorithm through examples.
library(ggplot2)
library(gridExtra)
library(caret)
library(class)
We will use ggplot2 and gridExtra for plotting. The carat packages will be used for splitting our data and for feature scaling. The functionality for performing KNN classification is provided by the class package.
For this example, we will be working with the Iris Dataset. This data set is a real-world “toy” dataset that is often used to demonstrate concepts in data science. The iris dataset contains information about several flowers selected from three different species of iris: versicolor, setosa, and virginica.
The dataset contains the following five pieces of information for 150 flowers:
iris_tr <- read.table('data/iris.txt', sep='\t', header=TRUE)
summary(iris_tr)
sepal_length sepal_width petal_length petal_width species
Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100 setosa :50
1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300 versicolor:50
Median :5.800 Median :3.000 Median :4.350 Median :1.300 virginica :50
Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
p1 <- ggplot(iris_tr, aes(x=sepal_length, y=sepal_width, col=species)) +
geom_point(alpha=0.8)
p2 <- ggplot(iris_tr, aes(x=petal_length, y=petal_width, col=species)) +
geom_point(alpha=0.8)
grid.arrange(p1, p2, ncol=2)
We will now use the knn function from the class package to create a 3-Nearest Neighbors model for the iris dataset.
iris_tr_feat <- iris_tr[,1:4]
set.seed(1)
train_pred <- knn(iris_tr_feat, iris_tr_feat, iris_tr$species, k=3)
train_pred[1:10]
[1] setosa versicolor virginica setosa versicolor versicolor versicolor
[8] virginica versicolor setosa
Levels: setosa versicolor virginica
Let’s calculate the model’s training accuracy.
accuracy <- mean(train_pred == iris_tr$species)
cat("Training Accuracy: ", accuracy, sep='')
Training Accuracy: 0.96
The selection of K=3 in the previous model was somewhat arbitrary. In the figure below, we will plot the KNN model’s training accuracy for several values of K.
train_acc <- c()
for (i in 1:150){
set.seed(1)
train_pred <- knn(iris_tr_feat, iris_tr_feat, iris_tr$species, k=i)
train_acc <- c(train_acc, mean(train_pred == iris_tr$species))
}
plot(1:150, train_acc, pch='.', ylim=c(0.3, 1), col='salmon')
lines(1:150, train_acc, lwd=2, col='salmon')
KNN model’s will (almost) always get 100% accuracy on the training set when K=1. For this reason, it is not helpful to use training performance to select the best value of K for our dataset. To select the value of K for which the model is most likely to generalize well, we need to create a validation set.
iris_va <- read.table('data/iris_valid.txt', sep='\t', header=TRUE)
summary(iris_va)
sepal_length sepal_width petal_length petal_width species
Min. :4.600 Min. :2.100 Min. :1.30 Min. :0.200 setosa :10
1st Qu.:5.200 1st Qu.:2.825 1st Qu.:1.55 1st Qu.:0.400 versicolor:11
Median :5.900 Median :3.000 Median :4.30 Median :1.400 virginica : 9
Mean :5.883 Mean :3.033 Mean :3.73 Mean :1.193
3rd Qu.:6.375 3rd Qu.:3.200 3rd Qu.:4.80 3rd Qu.:1.800
Max. :7.800 Max. :3.900 Max. :6.70 Max. :2.300
We will now calculate validation accuracy for each of our models, and will use this information to select our final model.
train_acc <- c()
valid_acc <- c()
iris_va_feat <- iris_va[,1:4]
for (i in 1:100){
set.seed(1)
train_pred <- knn(iris_tr_feat, iris_tr_feat, iris_tr$species, k=i)
train_acc <- c(train_acc, mean(train_pred == iris_tr$species))
set.seed(1)
valid_pred <- knn(iris_tr_feat, iris_va_feat, iris_tr$species, k=i)
valid_acc <- c(valid_acc, mean(valid_pred == iris_va$species))
}
plot(1:100, train_acc, pch='.', ylim=c(0.8, 1), col='salmon')
lines(1:100, train_acc, lwd=2, col='salmon')
lines(1:100, valid_acc, lwd=2, col='cornflowerblue')
legend(38, 1, legend=c("Training Acc", "Validation Acc"),
col=c("salmon", "cornflowerblue"), lty=1, lwd=2, cex=0.8)
The largest validation accuracy obtained by any model was 93.33%.
max(valid_acc)
[1] 0.9333333
The maximum validation accuracy was obtained by the K=11 model.
which.max(valid_acc)
[1] 11
For this example, we will be working with the Pima Diabetes Dataset. This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective is to predict based on diagnostic measurements whether a patient has diabetes. All patients are females at least 21 years old of Pima Indian heritage.
The columns in this dataset are described below.
pima <- read.table("data/diabetes.csv", sep=",", header=TRUE)
pima$Outcome <- factor(pima$Outcome)
summary(pima)
Pregnancies Glucose BloodPressure SkinThickness
Min. : 0.000 Min. : 0.0 Min. : 0.00 Min. : 0.00
1st Qu.: 1.000 1st Qu.: 99.0 1st Qu.: 62.00 1st Qu.: 0.00
Median : 3.000 Median :117.0 Median : 72.00 Median :23.00
Mean : 3.845 Mean :120.9 Mean : 69.11 Mean :20.54
3rd Qu.: 6.000 3rd Qu.:140.2 3rd Qu.: 80.00 3rd Qu.:32.00
Max. :17.000 Max. :199.0 Max. :122.00 Max. :99.00
Insulin BMI DiabetesPedigreeFunction Age
Min. : 0.0 Min. : 0.00 Min. :0.0780 Min. :21.00
1st Qu.: 0.0 1st Qu.:27.30 1st Qu.:0.2437 1st Qu.:24.00
Median : 30.5 Median :32.00 Median :0.3725 Median :29.00
Mean : 79.8 Mean :31.99 Mean :0.4719 Mean :33.24
3rd Qu.:127.2 3rd Qu.:36.60 3rd Qu.:0.6262 3rd Qu.:41.00
Max. :846.0 Max. :67.10 Max. :2.4200 Max. :81.00
Outcome
0:500
1:268
We will use createDataPartition to create a stratified partition of our data into training and validation sets.
set.seed(1)
train.index <- createDataPartition(pima$Outcome, p = .7, list=FALSE)
train <- pima[ train.index,]
valid <- pima[-train.index,]
summary(train$Outcome)
0 1
350 188
summary(valid$Outcome)
0 1
150 80
For practice, we will create and evaluate a 3-Nearest Neighbors model.
train_feat <- train[,1:8]
valid_feat <- valid[,1:8]
set.seed(1)
train_pred <- knn(train_feat, train_feat, train$Outcome, k=3)
train_acc <- mean(train_pred == train$Outcome)
set.seed(1)
valid_pred <- knn(train_feat, valid_feat, train$Outcome, k=3)
valid_acc <- mean(valid_pred == valid$Outcome)
cat('Training Accuracy: ', train_acc, '\n',
'Validation Accuracy: ', valid_acc, sep='')
Training Accuracy: 0.8513011
Validation Accuracy: 0.7130435
For many machine learning algorithms, it is important to make sure that the features are on roughly the same scale before training. This is especially true of distance-based algorithms such as K-Nearest Neighbors.
We will consider two types of scaling in this course: Standardization and Min/Max Scaling.
With standardization, each column in the training set is scaled to have a mean of zero and a standard deviation of 1. If \(x_i\) is a single observation of a particular feature, then its scaled value \(z_i\) is given by:
\[ z_i = \frac{x_i - \bar x}{s_x}\] Note that the values \(\bar x\) and \(s_x\) are calculated from the training set only, but are used to scale any observation, whether it is from the training set, validation set, testing set, or an entirely new observation.
With min max scaling, each column in the training set is scaled linearly to have a minimum of zero and a maximum of 1. If \(x_i\) is a single observation of a particular feature, then its scaled value \(w_i\) is given by:
\[ w_i = \frac{x_i - \textrm{min}_x}{\textrm{max}_x - \textrm{min}_x}\] Note that the values \(\textrm{min}_x\) and \(\textrm{max}_x\) are calculated from the training set only, but are used to scale any observation, whether it is from the training set, validation set, testing set, or an entirely new observation.
We can perform feature scaling using the preProcess function from the carat package.
We will start by performing standard scaling.
standard_scaler <- preProcess(train_feat, method=c('center', 'scale'))
train_s_sc <- predict(standard_scaler, train_feat)
valid_s_sc <- predict(standard_scaler, valid_feat)
summary(train_s_sc)
Pregnancies Glucose BloodPressure SkinThickness
Min. :-1.1147 Min. :-3.8189 Min. :-3.6957 Min. :-1.2738
1st Qu.:-0.8191 1st Qu.:-0.6690 1st Qu.:-0.3764 1st Qu.:-1.2738
Median :-0.2280 Median :-0.1020 Median : 0.1590 Median : 0.1330
Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
3rd Qu.: 0.6587 3rd Qu.: 0.6146 3rd Qu.: 0.5873 3rd Qu.: 0.7447
Max. : 3.9100 Max. : 2.3865 Max. : 2.4076 Max. : 4.7818
Insulin BMI DiabetesPedigreeFunction Age
Min. :-0.6843 Min. :-4.06067 Min. :-1.1888 Min. :-1.0355
1st Qu.:-0.6843 1st Qu.:-0.62397 1st Qu.:-0.6953 1st Qu.:-0.7797
Median :-0.5569 Median :-0.01005 Median :-0.2974 Median :-0.3532
Mean : 0.0000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
3rd Qu.: 0.4141 3rd Qu.: 0.59122 3rd Qu.: 0.4414 3rd Qu.: 0.6703
Max. : 6.7496 Max. : 3.45830 Max. : 6.0353 Max. : 4.0819
summary(valid_s_sc)
Pregnancies Glucose BloodPressure SkinThickness
Min. :-1.11470 Min. :-3.81894 Min. :-3.69566 Min. :-1.27384
1st Qu.:-0.81913 1st Qu.:-0.70049 1st Qu.:-0.26928 1st Qu.:-1.27384
Median :-0.22800 Median :-0.22799 Median : 0.10548 Median : 0.07186
Mean : 0.07272 Mean :-0.03612 Mean : 0.01354 Mean :-0.05899
3rd Qu.: 0.65871 3rd Qu.: 0.55162 3rd Qu.: 0.58732 3rd Qu.: 0.68353
Max. : 3.31884 Max. : 2.44947 Max. : 2.83588 Max. : 2.02922
Insulin BMI DiabetesPedigreeFunction
Min. :-0.68430 Min. :-4.06067 Min. :-1.15799
1st Qu.:-0.68430 1st Qu.:-0.57966 1st Qu.:-0.65443
Median :-0.27131 Median : 0.04059 Median :-0.26654
Mean : 0.05645 Mean :-0.03668 Mean : 0.08721
3rd Qu.: 0.45802 3rd Qu.: 0.52160 3rd Qu.: 0.56245
Max. : 5.85331 Max. : 4.43298 Max. : 5.75460
Age
Min. :-1.03554
1st Qu.:-0.77967
Median :-0.26792
Mean : 0.02837
3rd Qu.: 0.58499
Max. : 3.31429
We will now perform min/max scaling.
minmax_scaler <- preProcess(train_feat, method=c('range'))
train_mm_sc <- predict(minmax_scaler, train_feat)
valid_mm_sc <- predict(minmax_scaler, valid_feat)
summary(train_mm_sc)
Pregnancies Glucose BloodPressure SkinThickness
Min. :0.00000 Min. :0.0000 Min. :0.0000 Min. :0.0000
1st Qu.:0.05882 1st Qu.:0.5076 1st Qu.:0.5439 1st Qu.:0.0000
Median :0.17647 Median :0.5990 Median :0.6316 Median :0.2323
Mean :0.22185 Mean :0.6154 Mean :0.6055 Mean :0.2104
3rd Qu.:0.35294 3rd Qu.:0.7145 3rd Qu.:0.7018 3rd Qu.:0.3333
Max. :1.00000 Max. :1.0000 Max. :1.0000 Max. :1.0000
Insulin BMI DiabetesPedigreeFunction Age
Min. :0.00000 Min. :0.0000 Min. :0.00000 Min. :0.0000
1st Qu.:0.00000 1st Qu.:0.4571 1st Qu.:0.06832 1st Qu.:0.0500
Median :0.01714 Median :0.5387 Median :0.12340 Median :0.1333
Mean :0.09205 Mean :0.5401 Mean :0.16456 Mean :0.2024
3rd Qu.:0.14775 3rd Qu.:0.6187 3rd Qu.:0.22566 3rd Qu.:0.3333
Max. :1.00000 Max. :1.0000 Max. :1.00000 Max. :1.0000
summary(valid_mm_sc)
Pregnancies Glucose BloodPressure SkinThickness
Min. :0.00000 Min. :0.0000 Min. :0.0000 Min. :0.0000
1st Qu.:0.05882 1st Qu.:0.5025 1st Qu.:0.5614 1st Qu.:0.0000
Median :0.17647 Median :0.5787 Median :0.6228 Median :0.2222
Mean :0.23632 Mean :0.6096 Mean :0.6077 Mean :0.2006
3rd Qu.:0.35294 3rd Qu.:0.7043 3rd Qu.:0.7018 3rd Qu.:0.3232
Max. :0.88235 Max. :1.0102 Max. :1.0702 Max. :0.5455
Insulin BMI DiabetesPedigreeFunction Age
Min. :0.00000 Min. :0.0000 Min. :0.00427 Min. :0.0000
1st Qu.:0.00000 1st Qu.:0.4630 1st Qu.:0.07398 1st Qu.:0.0500
Median :0.05556 Median :0.5455 Median :0.12767 Median :0.1500
Mean :0.09965 Mean :0.5352 Mean :0.17664 Mean :0.2079
3rd Qu.:0.15366 3rd Qu.:0.6094 3rd Qu.:0.24242 3rd Qu.:0.3167
Max. :0.87943 Max. :1.1296 Max. :0.96114 Max. :0.8500
We will now create our 3-Nearest Neighbors model on the scaled data for the sake of comparison. We will start with the standardized data.
set.seed(1)
train_pred <- knn(train_s_sc, train_s_sc, train$Outcome, k=3)
train_acc <- mean(train_pred == train$Outcome)
set.seed(1)
valid_pred <- knn(train_s_sc, valid_s_sc, train$Outcome, k=3)
valid_acc <- mean(valid_pred == valid$Outcome)
cat('Training Accuracy: ', train_acc, '\n',
'Validation Accuracy: ', valid_acc, sep='')
Training Accuracy: 0.8531599
Validation Accuracy: 0.6956522
We will now consider the 3-nearest neighbors algorithm on the min-max scaled data.
set.seed(1)
train_pred <- knn(train_mm_sc, train_mm_sc, train$Outcome, k=3)
train_acc <- mean(train_pred == train$Outcome)
set.seed(1)
valid_pred <- knn(train_mm_sc, valid_mm_sc, train$Outcome, k=3)
valid_acc <- mean(valid_pred == valid$Outcome)
cat('Training Accuracy: ', train_acc, '\n',
'Validation Accuracy: ', valid_acc, sep='')
Training Accuracy: 0.8494424
Validation Accuracy: 0.726087
We will now build several KNN models. For each K from 1 to 100, we will calculate training and validation accuracy for the KNN model, using both the scaled and the unscaled data.
set.seed(1)
train_acc <- c()
valid_acc <- c()
train_acc_s_sc <- c()
valid_acc_s_sc <- c()
train_acc_mm_sc <- c()
valid_acc_mm_sc <- c()
k_range <- 1:100
for (i in k_range){
# Unscaled
set.seed(1)
train_pred <- knn(train_feat, train_feat, train$Outcome, k=i)
train_acc <- c(train_acc, mean(train_pred == train$Outcome))
set.seed(1)
valid_pred <- knn(train_feat, valid_feat, train$Outcome, k=i)
valid_acc <- c(valid_acc, mean(valid_pred == valid$Outcome))
# Standard Scaling
set.seed(1)
train_pred <- knn(train_s_sc, train_s_sc, train$Outcome, k=i)
train_acc_s_sc <- c(train_acc_s_sc, mean(train_pred == train$Outcome))
set.seed(1)
valid_pred <- knn(train_s_sc, valid_s_sc, train$Outcome, k=i)
valid_acc_s_sc <- c(valid_acc_s_sc, mean(valid_pred == valid$Outcome))
# MinMax Scaling
set.seed(1)
train_pred <- knn(train_mm_sc, train_mm_sc, train$Outcome, k=i)
train_acc_mm_sc <- c(train_acc_mm_sc, mean(train_pred == train$Outcome))
set.seed(1)
valid_pred <- knn(train_mm_sc, valid_mm_sc, train$Outcome, k=i)
valid_acc_mm_sc <- c(valid_acc_mm_sc, mean(valid_pred == valid$Outcome))
}
max(valid_acc)
[1] 0.7869565
max(valid_acc_s_sc)
[1] 0.773913
max(valid_acc_mm_sc)
[1] 0.7608696
plot(k_range, train_acc, pch='.', ylim=c(0.65, 1), col='salmon')
lines(k_range, train_acc, lwd=2, col='salmon')
lines(k_range, valid_acc, lwd=2, col='cornflowerblue')
legend(75, 1, legend=c("Training Acc", "Validation Acc"),
col=c("salmon", "cornflowerblue"), lty=1, lwd=2, cex=0.8)
plot(k_range, train_acc_s_sc, pch='.', ylim=c(0.65, 1), col='salmon')
lines(k_range, train_acc_s_sc, lwd=2, col='salmon')
lines(k_range, valid_acc_s_sc, lwd=2, col='cornflowerblue')
legend(75, 1, legend=c("Training Acc", "Validation Acc"),
col=c("salmon", "cornflowerblue"), lty=1, lwd=2, cex=0.8)
plot(k_range, train_acc_mm_sc, pch='.', ylim=c(0.65, 1), col='salmon')
lines(k_range, train_acc_mm_sc, lwd=2, col='salmon')
lines(k_range, valid_acc_mm_sc, lwd=2, col='cornflowerblue')
legend(75, 1, legend=c("Training Acc", "Validation Acc"),
col=c("salmon", "cornflowerblue"), lty=1, lwd=2, cex=0.8)
Our best validation performance was obtained using unscaled data. Let’s determine the value of K used to create this particular model.
which.max(valid_acc)
[1] 11
We will now recalculate the training and validation accuracies for this model.
set.seed(1)
train_pred <- knn(train, train, train$Outcome, k=11)
train_acc <- mean(train_pred == train$Outcome)
set.seed(1)
valid_pred <- knn(train, valid, train$Outcome, k=11)
valid_acc <- mean(valid_pred == valid$Outcome)
cat('Training Accuracy: ', train_acc, '\n',
'Validation Accuracy: ', valid_acc, sep='')
Training Accuracy: 0.7788104
Validation Accuracy: 0.7869565
We can use the table function to create a quick confusion matrix.
table(valid$Outcome, valid_pred)
valid_pred
0 1
0 134 16
1 33 47
We will use the confusion matrix to view several classification metrics for our final model.
confusionMatrix(valid_pred, valid$Outcome)
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 134 33
1 16 47
Accuracy : 0.787
95% CI : (0.7283, 0.838)
No Information Rate : 0.6522
P-Value [Acc > NIR] : 5.84e-06
Kappa : 0.5059
Mcnemar's Test P-Value : 0.02227
Sensitivity : 0.8933
Specificity : 0.5875
Pos Pred Value : 0.8024
Neg Pred Value : 0.7460
Prevalence : 0.6522
Detection Rate : 0.5826
Detection Prevalence : 0.7261
Balanced Accuracy : 0.7404
'Positive' Class : 0
If we believe that a certain feature will be more or less important than other features in our model, then we can apply a weight to its scaled values in order to have it emphasized more highly when calculating distances.
set.seed(1)
train_acc_w <- c()
valid_acc_w <- c()
w <- c(1, 1, 0, 0, 0, 1, 1, 2)
train_w <- t(t(train_s_sc)*w)
valid_w <- t(t(valid_s_sc)*w)
k_range <- 1:100
for (i in k_range){
# Scaled
set.seed(1)
train_pred <- knn(train_w, train_w, train$Outcome, k=i)
train_acc_w <- c(train_acc_w, mean(train_pred == train$Outcome))
set.seed(1)
valid_pred <- knn(train_w, valid_w, train$Outcome, k=i)
valid_acc_w <- c(valid_acc_w, mean(valid_pred == valid$Outcome))
}
max(valid_acc_w)
[1] 0.7913043
plot(k_range, train_acc_w, pch='.', ylim=c(0.65, 1), col='salmon')
lines(k_range, train_acc_w, lwd=2, col='salmon')
lines(k_range, valid_acc_w, lwd=2, col='cornflowerblue')
legend(75, 1, legend=c("Training Acc", "Validation Acc"),
col=c("salmon", "cornflowerblue"), lty=1, lwd=2, cex=0.8)
max(valid_acc_w)
[1] 0.7913043
which.max(valid_acc_w)
[1] 24
Pros
Cons