Robbie Beane
The K-Nearest Neighbors Classification algorithm classifies observations by allowing the K observations in the training set that are nearest to the new observation to “vote” on the class of the new observation. More specifically, the algorithm works as follows:
K observations in the training set that are nearest to the new observation.K nearest neighbors.We will demonstrate this classification algorithm through examples.
We will use ggplot2 and gridExtra for plotting. The carat packages will be used for splitting our data and for feature scaling. The functionality for performing KNN classification is provided by the class package.
For this example, we will be working with the Iris Dataset. This data set is a real-world “toy” dataset that is often used to demonstrate concepts in data science. The iris dataset contains information about several flowers selected from three different species of iris: versicolor, setosa, and virginica.
The dataset contains the following five pieces of information for 150 flowers:
## sepal_length sepal_width petal_length petal_width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
p1 <- ggplot(iris_tr, aes(x=sepal_length, y=sepal_width, col=species)) +
geom_point(alpha=0.8)
p2 <- ggplot(iris_tr, aes(x=petal_length, y=petal_width, col=species)) +
geom_point(alpha=0.8)
grid.arrange(p1, p2, ncol=2)We will now use the knn function from the class package to create a 3-Nearest Neighbors model for the iris dataset.
iris_tr_feat <- iris_tr[,1:4]
set.seed(1)
train_pred <- knn(iris_tr_feat, iris_tr_feat, iris_tr$species, k=3)
train_pred[1:10]## [1] setosa versicolor virginica setosa versicolor versicolor
## [7] versicolor virginica versicolor setosa
## Levels: setosa versicolor virginica
Let’s calculate the model’s training accuracy.
## Training Accuracy: 0.96
The selection of K=3 in the previous model was somewhat arbitrary. In the figure below, we will plot the KNN model’s training accuracy for several values of K.
train_acc <- c()
for (i in 1:150){
set.seed(1)
train_pred <- knn(iris_tr_feat, iris_tr_feat, iris_tr$species, k=i)
train_acc <- c(train_acc, mean(train_pred == iris_tr$species))
}
plot(1:150, train_acc, pch='.', ylim=c(0.3, 1), col='salmon')
lines(1:150, train_acc, lwd=2, col='salmon')KNN model’s will (almost) always get 100% accuracy on the training set when K=1. For this reason, it is not helpful to use training performance to select the best value of K for our dataset. To select the value of K for which the model is most likely to generalize well, we need to create a validation set.
## sepal_length sepal_width petal_length petal_width
## Min. :4.600 Min. :2.100 Min. :1.30 Min. :0.200
## 1st Qu.:5.200 1st Qu.:2.825 1st Qu.:1.55 1st Qu.:0.400
## Median :5.900 Median :3.000 Median :4.30 Median :1.400
## Mean :5.883 Mean :3.033 Mean :3.73 Mean :1.193
## 3rd Qu.:6.375 3rd Qu.:3.200 3rd Qu.:4.80 3rd Qu.:1.800
## Max. :7.800 Max. :3.900 Max. :6.70 Max. :2.300
## species
## setosa :10
## versicolor:10
## virginica :10
##
##
##
We will now calculate validation accuracy for each of our models, and will use this information to select our final model.
train_acc <- c()
valid_acc <- c()
iris_va_feat <- iris_va[,1:4]
for (i in 1:100){
set.seed(1)
train_pred <- knn(iris_tr_feat, iris_tr_feat, iris_tr$species, k=i)
train_acc <- c(train_acc, mean(train_pred == iris_tr$species))
set.seed(1)
valid_pred <- knn(iris_tr_feat, iris_va_feat, iris_tr$species, k=i)
valid_acc <- c(valid_acc, mean(valid_pred == iris_va$species))
}
plot(1:100, train_acc, pch='.', ylim=c(0.8, 1), col='salmon')
lines(1:100, train_acc, lwd=2, col='salmon')
lines(1:100, valid_acc, lwd=2, col='cornflowerblue')
legend(38, 1, legend=c("Training Acc", "Validation Acc"),
col=c("salmon", "cornflowerblue"), lty=1, lwd=2, cex=0.8)The largest validation accuracy obtained by any model was 93.33%.
## [1] 0.9666667
The maximum validation accuracy was obtained by the K=11 model.
## [1] 2
For this example, we will be working with the Pima Diabetes Dataset. This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective is to predict based on diagnostic measurements whether a patient has diabetes. All patients are females at least 21 years old of Pima Indian heritage.
The columns in this dataset are described below.
pima <- read.table("data/diabetes.csv", sep=",", header=TRUE)
pima$Outcome <- factor(pima$Outcome)
summary(pima)## Pregnancies Glucose BloodPressure SkinThickness
## Min. : 0.000 Min. : 0.0 Min. : 0.00 Min. : 0.00
## 1st Qu.: 1.000 1st Qu.: 99.0 1st Qu.: 62.00 1st Qu.: 0.00
## Median : 3.000 Median :117.0 Median : 72.00 Median :23.00
## Mean : 3.845 Mean :120.9 Mean : 69.11 Mean :20.54
## 3rd Qu.: 6.000 3rd Qu.:140.2 3rd Qu.: 80.00 3rd Qu.:32.00
## Max. :17.000 Max. :199.0 Max. :122.00 Max. :99.00
## Insulin BMI DiabetesPedigreeFunction Age
## Min. : 0.0 Min. : 0.00 Min. :0.0780 Min. :21.00
## 1st Qu.: 0.0 1st Qu.:27.30 1st Qu.:0.2437 1st Qu.:24.00
## Median : 30.5 Median :32.00 Median :0.3725 Median :29.00
## Mean : 79.8 Mean :31.99 Mean :0.4719 Mean :33.24
## 3rd Qu.:127.2 3rd Qu.:36.60 3rd Qu.:0.6262 3rd Qu.:41.00
## Max. :846.0 Max. :67.10 Max. :2.4200 Max. :81.00
## Outcome
## 0:500
## 1:268
##
##
##
##
We will use createDataPartition to create a stratified partition of our data into training and validation sets.
set.seed(1)
train.index <- createDataPartition(pima$Outcome, p = .7, list=FALSE)
train <- pima[ train.index,]
valid <- pima[-train.index,]
summary(train$Outcome)## 0 1
## 350 188
## 0 1
## 150 80
For practice, we will create and evaluate a 3-Nearest Neighbors model.
train_feat <- train[,1:8]
valid_feat <- valid[,1:8]
set.seed(1)
train_pred <- knn(train_feat, train_feat, train$Outcome, k=3)
train_acc <- mean(train_pred == train$Outcome)
set.seed(1)
valid_pred <- knn(train_feat, valid_feat, train$Outcome, k=3)
valid_acc <- mean(valid_pred == valid$Outcome)
cat('Training Accuracy: ', train_acc, '\n',
'Validation Accuracy: ', valid_acc, sep='')## Training Accuracy: 0.8513011
## Validation Accuracy: 0.7130435
For many machine learning algorithms, it is important to make sure that the features are on roughly the same scale before training. This is especially true of distance-based algorithms such as K-Nearest Neighbors.
We will consider two types of scaling in this course: Standardization and Min/Max Scaling.
With standardization, each column in the training set is scaled to have a mean of zero and a standard deviation of 1. If \(x_i\) is a single observation of a particular feature, then its scaled value \(z_i\) is given by:
\[ z_i = \frac{x_i - \bar x}{s_x}\] Note that the values \(\bar x\) and \(s_x\) are calculated from the training set only, but are used to scale any observation, whether it is from the training set, validation set, testing set, or an entirely new observation.
With min max scaling, each column in the training set is scaled linearly to have a minimum of zero and a maximum of 1. If \(x_i\) is a single observation of a particular feature, then its scaled value \(w_i\) is given by:
\[ w_i = \frac{x_i - \textrm{min}_x}{\textrm{max}_x - \textrm{min}_x}\] Note that the values \(\textrm{min}_x\) and \(\textrm{max}_x\) are calculated from the training set only, but are used to scale any observation, whether it is from the training set, validation set, testing set, or an entirely new observation.
We can perform feature scaling using the preProcess function from the carat package.
We will start by performing standard scaling.
standard_scaler <- preProcess(train_feat, method=c('center', 'scale'))
train_s_sc <- predict(standard_scaler, train_feat)
valid_s_sc <- predict(standard_scaler, valid_feat)
summary(train_s_sc)## Pregnancies Glucose BloodPressure SkinThickness
## Min. :-1.1147 Min. :-3.8189 Min. :-3.6957 Min. :-1.2738
## 1st Qu.:-0.8191 1st Qu.:-0.6690 1st Qu.:-0.3764 1st Qu.:-1.2738
## Median :-0.2280 Median :-0.1020 Median : 0.1590 Median : 0.1330
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6587 3rd Qu.: 0.6146 3rd Qu.: 0.5873 3rd Qu.: 0.7447
## Max. : 3.9100 Max. : 2.3865 Max. : 2.4076 Max. : 4.7818
## Insulin BMI DiabetesPedigreeFunction
## Min. :-0.6843 Min. :-4.06067 Min. :-1.1888
## 1st Qu.:-0.6843 1st Qu.:-0.62397 1st Qu.:-0.6953
## Median :-0.5569 Median :-0.01005 Median :-0.2974
## Mean : 0.0000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.4141 3rd Qu.: 0.59122 3rd Qu.: 0.4414
## Max. : 6.7496 Max. : 3.45830 Max. : 6.0353
## Age
## Min. :-1.0355
## 1st Qu.:-0.7797
## Median :-0.3532
## Mean : 0.0000
## 3rd Qu.: 0.6703
## Max. : 4.0819
## Pregnancies Glucose BloodPressure
## Min. :-1.11470 Min. :-3.81894 Min. :-3.69566
## 1st Qu.:-0.81913 1st Qu.:-0.70049 1st Qu.:-0.26928
## Median :-0.22800 Median :-0.22799 Median : 0.10548
## Mean : 0.07272 Mean :-0.03612 Mean : 0.01354
## 3rd Qu.: 0.65871 3rd Qu.: 0.55162 3rd Qu.: 0.58732
## Max. : 3.31884 Max. : 2.44947 Max. : 2.83588
## SkinThickness Insulin BMI
## Min. :-1.27384 Min. :-0.68430 Min. :-4.06067
## 1st Qu.:-1.27384 1st Qu.:-0.68430 1st Qu.:-0.57966
## Median : 0.07186 Median :-0.27131 Median : 0.04059
## Mean :-0.05899 Mean : 0.05645 Mean :-0.03668
## 3rd Qu.: 0.68353 3rd Qu.: 0.45802 3rd Qu.: 0.52160
## Max. : 2.02922 Max. : 5.85331 Max. : 4.43298
## DiabetesPedigreeFunction Age
## Min. :-1.15799 Min. :-1.03554
## 1st Qu.:-0.65443 1st Qu.:-0.77967
## Median :-0.26654 Median :-0.26792
## Mean : 0.08721 Mean : 0.02837
## 3rd Qu.: 0.56245 3rd Qu.: 0.58499
## Max. : 5.75460 Max. : 3.31429
We will now perform min/max scaling.
minmax_scaler <- preProcess(train_feat, method=c('range'))
train_mm_sc <- predict(minmax_scaler, train_feat)
valid_mm_sc <- predict(minmax_scaler, valid_feat)
summary(train_mm_sc)## Pregnancies Glucose BloodPressure SkinThickness
## Min. :0.00000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.05882 1st Qu.:0.5076 1st Qu.:0.5439 1st Qu.:0.0000
## Median :0.17647 Median :0.5990 Median :0.6316 Median :0.2323
## Mean :0.22185 Mean :0.6154 Mean :0.6055 Mean :0.2104
## 3rd Qu.:0.35294 3rd Qu.:0.7145 3rd Qu.:0.7018 3rd Qu.:0.3333
## Max. :1.00000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## Insulin BMI DiabetesPedigreeFunction
## Min. :0.00000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0.4571 1st Qu.:0.06832
## Median :0.01714 Median :0.5387 Median :0.12340
## Mean :0.09205 Mean :0.5401 Mean :0.16456
## 3rd Qu.:0.14775 3rd Qu.:0.6187 3rd Qu.:0.22566
## Max. :1.00000 Max. :1.0000 Max. :1.00000
## Age
## Min. :0.0000
## 1st Qu.:0.0500
## Median :0.1333
## Mean :0.2024
## 3rd Qu.:0.3333
## Max. :1.0000
## Pregnancies Glucose BloodPressure SkinThickness
## Min. :0.00000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.05882 1st Qu.:0.5025 1st Qu.:0.5614 1st Qu.:0.0000
## Median :0.17647 Median :0.5787 Median :0.6228 Median :0.2222
## Mean :0.23632 Mean :0.6096 Mean :0.6077 Mean :0.2006
## 3rd Qu.:0.35294 3rd Qu.:0.7043 3rd Qu.:0.7018 3rd Qu.:0.3232
## Max. :0.88235 Max. :1.0102 Max. :1.0702 Max. :0.5455
## Insulin BMI DiabetesPedigreeFunction
## Min. :0.00000 Min. :0.0000 Min. :0.00427
## 1st Qu.:0.00000 1st Qu.:0.4630 1st Qu.:0.07398
## Median :0.05556 Median :0.5455 Median :0.12767
## Mean :0.09965 Mean :0.5352 Mean :0.17664
## 3rd Qu.:0.15366 3rd Qu.:0.6094 3rd Qu.:0.24242
## Max. :0.87943 Max. :1.1296 Max. :0.96114
## Age
## Min. :0.0000
## 1st Qu.:0.0500
## Median :0.1500
## Mean :0.2079
## 3rd Qu.:0.3167
## Max. :0.8500
We will now create our 3-Nearest Neighbors model on the scaled data for the sake of comparison. We will start with the standardized data.
set.seed(1)
train_pred <- knn(train_s_sc, train_s_sc, train$Outcome, k=3)
train_acc <- mean(train_pred == train$Outcome)
set.seed(1)
valid_pred <- knn(train_s_sc, valid_s_sc, train$Outcome, k=3)
valid_acc <- mean(valid_pred == valid$Outcome)
cat('Training Accuracy: ', train_acc, '\n',
'Validation Accuracy: ', valid_acc, sep='')## Training Accuracy: 0.8531599
## Validation Accuracy: 0.6956522
We will now consider the 3-nearest neighbors algorithm on the min-max scaled data.
set.seed(1)
train_pred <- knn(train_mm_sc, train_mm_sc, train$Outcome, k=3)
train_acc <- mean(train_pred == train$Outcome)
set.seed(1)
valid_pred <- knn(train_mm_sc, valid_mm_sc, train$Outcome, k=3)
valid_acc <- mean(valid_pred == valid$Outcome)
cat('Training Accuracy: ', train_acc, '\n',
'Validation Accuracy: ', valid_acc, sep='')## Training Accuracy: 0.8494424
## Validation Accuracy: 0.726087
We will now build several KNN models. For each K from 1 to 100, we will calculate training and validation accuracy for the KNN model, using both the scaled and the unscaled data.
set.seed(1)
train_acc <- c()
valid_acc <- c()
train_acc_s_sc <- c()
valid_acc_s_sc <- c()
train_acc_mm_sc <- c()
valid_acc_mm_sc <- c()
k_range <- 1:100
for (i in k_range){
# Unscaled
set.seed(1)
train_pred <- knn(train_feat, train_feat, train$Outcome, k=i)
train_acc <- c(train_acc, mean(train_pred == train$Outcome))
set.seed(1)
valid_pred <- knn(train_feat, valid_feat, train$Outcome, k=i)
valid_acc <- c(valid_acc, mean(valid_pred == valid$Outcome))
# Standard Scaling
set.seed(1)
train_pred <- knn(train_s_sc, train_s_sc, train$Outcome, k=i)
train_acc_s_sc <- c(train_acc_s_sc, mean(train_pred == train$Outcome))
set.seed(1)
valid_pred <- knn(train_s_sc, valid_s_sc, train$Outcome, k=i)
valid_acc_s_sc <- c(valid_acc_s_sc, mean(valid_pred == valid$Outcome))
# MinMax Scaling
set.seed(1)
train_pred <- knn(train_mm_sc, train_mm_sc, train$Outcome, k=i)
train_acc_mm_sc <- c(train_acc_mm_sc, mean(train_pred == train$Outcome))
set.seed(1)
valid_pred <- knn(train_mm_sc, valid_mm_sc, train$Outcome, k=i)
valid_acc_mm_sc <- c(valid_acc_mm_sc, mean(valid_pred == valid$Outcome))
}
max(valid_acc)## [1] 0.7869565
## [1] 0.773913
## [1] 0.7608696
plot(k_range, train_acc, pch='.', ylim=c(0.65, 1), col='salmon')
lines(k_range, train_acc, lwd=2, col='salmon')
lines(k_range, valid_acc, lwd=2, col='cornflowerblue')
legend(75, 1, legend=c("Training Acc", "Validation Acc"),
col=c("salmon", "cornflowerblue"), lty=1, lwd=2, cex=0.8)plot(k_range, train_acc_s_sc, pch='.', ylim=c(0.65, 1), col='salmon')
lines(k_range, train_acc_s_sc, lwd=2, col='salmon')
lines(k_range, valid_acc_s_sc, lwd=2, col='cornflowerblue')
legend(75, 1, legend=c("Training Acc", "Validation Acc"),
col=c("salmon", "cornflowerblue"), lty=1, lwd=2, cex=0.8)plot(k_range, train_acc_mm_sc, pch='.', ylim=c(0.65, 1), col='salmon')
lines(k_range, train_acc_mm_sc, lwd=2, col='salmon')
lines(k_range, valid_acc_mm_sc, lwd=2, col='cornflowerblue')
legend(75, 1, legend=c("Training Acc", "Validation Acc"),
col=c("salmon", "cornflowerblue"), lty=1, lwd=2, cex=0.8)Our best validation performance was obtained using unscaled data. Let’s determine the value of K used to create this particular model.
## [1] 11
We will now recalculate the training and validation accuracies for this model.
set.seed(1)
train_pred <- knn(train, train, train$Outcome, k=11)
train_acc <- mean(train_pred == train$Outcome)
set.seed(1)
valid_pred <- knn(train, valid, train$Outcome, k=11)
valid_acc <- mean(valid_pred == valid$Outcome)
cat('Training Accuracy: ', train_acc, '\n',
'Validation Accuracy: ', valid_acc, sep='')## Training Accuracy: 0.7788104
## Validation Accuracy: 0.7869565
We can use the table function to create a quick confusion matrix.
## valid_pred
## 0 1
## 0 134 16
## 1 33 47
We will use the confusion matrix to view several classification metrics for our final model.
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 134 33
## 1 16 47
##
## Accuracy : 0.787
## 95% CI : (0.7283, 0.838)
## No Information Rate : 0.6522
## P-Value [Acc > NIR] : 5.84e-06
##
## Kappa : 0.5059
##
## Mcnemar's Test P-Value : 0.02227
##
## Sensitivity : 0.8933
## Specificity : 0.5875
## Pos Pred Value : 0.8024
## Neg Pred Value : 0.7460
## Prevalence : 0.6522
## Detection Rate : 0.5826
## Detection Prevalence : 0.7261
## Balanced Accuracy : 0.7404
##
## 'Positive' Class : 0
##
If we believe that a certain feature will be more or less important than other features in our model, then we can apply a weight to its scaled values in order to have it emphasized more highly when calculating distances.
set.seed(1)
train_acc_w <- c()
valid_acc_w <- c()
w <- c(1, 1, 0, 0, 0, 1, 1, 2)
train_w <- t(t(train_s_sc)*w)
valid_w <- t(t(valid_s_sc)*w)
k_range <- 1:100
for (i in k_range){
# Scaled
set.seed(1)
train_pred <- knn(train_w, train_w, train$Outcome, k=i)
train_acc_w <- c(train_acc_w, mean(train_pred == train$Outcome))
set.seed(1)
valid_pred <- knn(train_w, valid_w, train$Outcome, k=i)
valid_acc_w <- c(valid_acc_w, mean(valid_pred == valid$Outcome))
}
max(valid_acc_w)## [1] 0.7913043
plot(k_range, train_acc_w, pch='.', ylim=c(0.65, 1), col='salmon')
lines(k_range, train_acc_w, lwd=2, col='salmon')
lines(k_range, valid_acc_w, lwd=2, col='cornflowerblue')
legend(75, 1, legend=c("Training Acc", "Validation Acc"),
col=c("salmon", "cornflowerblue"), lty=1, lwd=2, cex=0.8)## [1] 0.7913043
## [1] 24
Pros
Cons