K-Nearest Neighbors Classification

The K-Nearest Neighbors Classification algorithm classifies observations by allowing the K observations in the training set that are nearest to the new observation to “vote” on the class of the new observation. More specifically, the algorithm works as follows:

Calculate the distance between the new observation and each observation in the training set. Distances are calculated within the feature space.
Select the K observations in the training set that are nearest to the new observation.
Determine the majority class within the K nearest neighbors.
Classify the new observation as this majority class.
If there is a tie in the vote, resolve it according to some predetermined scheme.
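The steps above can be sketched in a few lines of base R. This is only an illustrative sketch, not the implementation used by the class package: the function name `knn_predict_one`, the use of Euclidean distance, and the tie-breaking rule (the first tied factor level wins) are assumptions made for this example. It uses R's built-in `iris` data frame, whose column names differ from the file loaded later in this lesson.

```r
# Minimal from-scratch KNN classifier for a single new observation.
# train_x: numeric matrix or data frame of training features
# train_y: factor of training labels
# new_x:   numeric vector holding the new observation's features
# k:       number of neighbors that vote
knn_predict_one <- function(train_x, train_y, new_x, k = 3) {
  train_x <- as.matrix(train_x)
  # Step 1: Euclidean distance from new_x to every training row.
  diffs <- sweep(train_x, 2, as.numeric(new_x))
  dists <- sqrt(rowSums(diffs^2))
  # Step 2: indices of the k nearest training observations.
  nearest <- order(dists)[seq_len(k)]
  # Step 3: count the votes for each class.
  votes <- table(train_y[nearest])
  # Steps 4-5: return the majority class; which.max picks the first tied
  # level, which is the (assumed) tie-breaking scheme of this sketch.
  names(votes)[which.max(votes)]
}

# Usage: classify one flower described by its four measurements.
new_flower <- c(5.0, 3.4, 1.5, 0.2)
pred <- knn_predict_one(iris[, 1:4], iris$Species, new_flower, k = 3)
print(pred)
```

The knn function from the class package, used in the rest of this lesson, breaks voting ties at random rather than deterministically, which is why the lesson code calls set.seed before each call.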

We will demonstrate this classification algorithm through examples.

Load Packages

library(ggplot2)
library(gridExtra)
library(caret)
library(class)

We will use ggplot2 and gridExtra for plotting. The caret package will be used for splitting our data and for feature scaling. The functionality for performing KNN classification is provided by the class package.

Example 1: Iris Dataset

For this example, we will be working with the Iris Dataset. This is a real-world “toy” dataset that is often used to demonstrate concepts in data science. It contains information about several flowers selected from three different species of iris: versicolor, setosa, and virginica.

The dataset contains the following five pieces of information for 150 flowers:

The sepal length of the flower.
The sepal width of the flower.
The petal length of the flower.
The petal width of the flower.
The species of the flower.

Load and Explore the Data

iris_tr <- read.table('data/iris.txt', sep='\t', header=TRUE)
summary(iris_tr)

##   sepal_length    sepal_width     petal_length    petal_width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
##

p1 <- ggplot(iris_tr, aes(x=sepal_length, y=sepal_width, col=species)) +
  geom_point(alpha=0.8)

p2 <- ggplot(iris_tr, aes(x=petal_length, y=petal_width, col=species)) +
  geom_point(alpha=0.8)

grid.arrange(p1, p2, ncol=2)

Build a 3-Nearest Neighbors Model

We will now use the knn function from the class package to create a 3-Nearest Neighbors model for the iris dataset.

iris_tr_feat <- iris_tr[,1:4]
set.seed(1)
train_pred <- knn(iris_tr_feat, iris_tr_feat, iris_tr$species, k=3)
train_pred[1:10]

##  [1] setosa     versicolor virginica  setosa     versicolor versicolor
##  [7] versicolor virginica  versicolor setosa    
## Levels: setosa versicolor virginica

Let’s calculate the model’s training accuracy.

accuracy <- mean(train_pred == iris_tr$species)
cat("Training Accuracy: ", accuracy, sep='')

## Training Accuracy: 0.96

Effects of Changing K

The selection of K=3 in the previous model was somewhat arbitrary. In the figure below, we will plot the KNN model’s training accuracy for several values of K.

train_acc <- c()

for (i in 1:150){
  set.seed(1)
  train_pred <- knn(iris_tr_feat, iris_tr_feat, iris_tr$species, k=i)
  train_acc <- c(train_acc, mean(train_pred == iris_tr$species))
}

plot(1:150, train_acc, pch='.', ylim=c(0.3, 1), col='salmon')
lines(1:150, train_acc, lwd=2, col='salmon')

Validation Set

KNN models will (almost) always get 100% accuracy on the training set when K=1, because each training observation is its own nearest neighbor. For this reason, it is not helpful to use training performance to select the best value of K for our dataset. To select the value of K for which the model is most likely to generalize well, we need to create a validation set.

iris_va <- read.table('data/iris_valid.txt', sep='\t', header=TRUE)
summary(iris_va)

##   sepal_length    sepal_width     petal_length   petal_width   
##  Min.   :4.600   Min.   :2.100   Min.   :1.30   Min.   :0.200  
##  1st Qu.:5.200   1st Qu.:2.825   1st Qu.:1.55   1st Qu.:0.400  
##  Median :5.900   Median :3.000   Median :4.30   Median :1.400  
##  Mean   :5.883   Mean   :3.033   Mean   :3.73   Mean   :1.193  
##  3rd Qu.:6.375   3rd Qu.:3.200   3rd Qu.:4.80   3rd Qu.:1.800  
##  Max.   :7.800   Max.   :3.900   Max.   :6.70   Max.   :2.300  
##        species  
##  setosa    :10  
##  versicolor:10  
##  virginica :10  
##                 
##                 
##
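The claim that a K=1 model scores (almost) perfectly on its own training set can be checked directly. The sketch below uses base R and the built-in `iris` data frame rather than the lesson's files, and it implements 1-nearest-neighbor by hand with `dist`, so it is an illustration of the idea rather than a call to class::knn. The variable names are chosen for this example only.

```r
# Built-in iris data; columns 1-4 are the numeric features.
x <- as.matrix(iris[, 1:4])
y <- iris$Species

# Pairwise Euclidean distances between all training observations.
d <- as.matrix(dist(x))

# K = 1, scoring the training set against itself: the nearest neighbor of
# each point is a point at distance 0 (itself or an exact duplicate), so the
# label is recovered unless identical rows carry different labels.
nn_self <- apply(d, 1, which.min)
acc_self <- mean(y[nn_self] == y)

# Excluding each point from its own neighbor search (leave-one-out) removes
# that advantage and gives a more honest estimate.
diag(d) <- Inf
nn_loo <- apply(d, 1, which.min)
acc_loo <- mean(y[nn_loo] == y)

cat("Self-scored accuracy:   ", acc_self, "\n",
    "Leave-one-out accuracy: ", acc_loo, "\n", sep = "")
```

A separate validation set plays the same role as the leave-one-out step here: the observations being scored are never among their own candidate neighbors.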

Selecting K

We will now calculate validation accuracy for each of our models, and will use this information to select our final model.

train_acc <- c()
valid_acc <- c()

iris_va_feat <- iris_va[,1:4]


for (i in 1:100){
  set.seed(1)
  train_pred <- knn(iris_tr_feat, iris_tr_feat, iris_tr$species, k=i)
  train_acc <- c(train_acc, mean(train_pred == iris_tr$species))
  
  set.seed(1)
  valid_pred <- knn(iris_tr_feat, iris_va_feat, iris_tr$species, k=i)
  valid_acc <- c(valid_acc, mean(valid_pred == iris_va$species))
}

plot(1:100, train_acc, pch='.', ylim=c(0.8, 1), col='salmon')
lines(1:100, train_acc, lwd=2, col='salmon')
lines(1:100, valid_acc, lwd=2, col='cornflowerblue')
legend(38, 1, legend=c("Training Acc", "Validation Acc"),
       col=c("salmon", "cornflowerblue"), lty=1, lwd=2, cex=0.8)

Selecting K

The largest validation accuracy obtained by any model was 96.67%.

max(valid_acc)

## [1] 0.9666667

The maximum validation accuracy was first obtained by the K=2 model.

which.max(valid_acc)

## [1] 2
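Note that which.max reports only the first position at which the maximum occurs, so other values of K may share the same validation accuracy. A small sketch with made-up accuracy values (the vector `acc` is hypothetical and is not taken from the lesson's results) shows how to list every tied value:

```r
# Hypothetical validation accuracies for K = 1..6 (made-up numbers).
acc <- c(0.90, 0.95, 0.93, 0.95, 0.95, 0.91)

first_best <- which.max(acc)         # first K that attains the maximum
all_best <- which(acc == max(acc))   # every K that attains the maximum

print(first_best)
print(all_best)
```

When several values of K are tied, it can be worth looking at the validation curve around each candidate before settling on one.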

Example 2: Pima Diabetes Dataset

For this example, we will be working with the Pima Diabetes Dataset. This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective is to predict based on diagnostic measurements whether a patient has diabetes. All patients are females at least 21 years old of Pima Indian heritage.

The columns in this dataset are described below.

Pregnancies: Number of times pregnant
Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
BloodPressure: Diastolic blood pressure (mm Hg)
SkinThickness: Triceps skin fold thickness (mm)
Insulin: 2-Hour serum insulin (mu U/ml)
BMI: Body mass index (weight in kg/(height in m)^2)
DiabetesPedigreeFunction: Diabetes pedigree function
Age: Age (years)
Outcome: Class variable (0 or 1)

Load the Data

pima <- read.table("data/diabetes.csv", sep=",", header=TRUE)
pima$Outcome <- factor(pima$Outcome)
summary(pima)

##   Pregnancies        Glucose      BloodPressure    SkinThickness  
##  Min.   : 0.000   Min.   :  0.0   Min.   :  0.00   Min.   : 0.00  
##  1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 62.00   1st Qu.: 0.00  
##  Median : 3.000   Median :117.0   Median : 72.00   Median :23.00  
##  Mean   : 3.845   Mean   :120.9   Mean   : 69.11   Mean   :20.54  
##  3rd Qu.: 6.000   3rd Qu.:140.2   3rd Qu.: 80.00   3rd Qu.:32.00  
##  Max.   :17.000   Max.   :199.0   Max.   :122.00   Max.   :99.00  
##     Insulin           BMI        DiabetesPedigreeFunction      Age       
##  Min.   :  0.0   Min.   : 0.00   Min.   :0.0780           Min.   :21.00  
##  1st Qu.:  0.0   1st Qu.:27.30   1st Qu.:0.2437           1st Qu.:24.00  
##  Median : 30.5   Median :32.00   Median :0.3725           Median :29.00  
##  Mean   : 79.8   Mean   :31.99   Mean   :0.4719           Mean   :33.24  
##  3rd Qu.:127.2   3rd Qu.:36.60   3rd Qu.:0.6262           3rd Qu.:41.00  
##  Max.   :846.0   Max.   :67.10   Max.   :2.4200           Max.   :81.00  
##  Outcome
##  0:500  
##  1:268  
##         
##         
##         
##

Split the Data

We will use createDataPartition to create a stratified partition of our data into training and validation sets.

set.seed(1)
train.index <- createDataPartition(pima$Outcome, p = .7, list=FALSE)
train <- pima[ train.index,]
valid  <- pima[-train.index,]

summary(train$Outcome)

##   0   1 
## 350 188

summary(valid$Outcome)

##   0   1 
## 150  80
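To see what "stratified" means here, the idea can be sketched in base R: sample the same fraction of rows within each class, so that both subsets keep roughly the original class proportions. This is a conceptual sketch only. The internals of createDataPartition may differ, and the `outcome` vector below is a hypothetical stand-in that merely reuses the class counts shown in the summary above.

```r
set.seed(1)

# Hypothetical outcome vector with the same class counts as the Pima data.
outcome <- factor(c(rep(0, 500), rep(1, 268)))

# Sample 70% of the row indices within each class, then combine them.
idx_by_class <- split(seq_along(outcome), outcome)
train_idx <- unlist(lapply(idx_by_class, function(idx) {
  sample(idx, size = round(0.7 * length(idx)))
}))

# Class counts in each subset.
train_counts <- table(outcome[train_idx])
valid_counts <- table(outcome[-train_idx])
print(train_counts)
print(valid_counts)

# Class proportions are nearly identical in the two subsets.
print(round(prop.table(train_counts), 3))
print(round(prop.table(valid_counts), 3))
```

A purely random split could, by chance, leave one subset with noticeably fewer positive cases, which would make training and validation accuracies harder to compare.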

3-Nearest Neighbors

For practice, we will create and evaluate a 3-Nearest Neighbors model.

train_feat <- train[,1:8] 
valid_feat <- valid[,1:8] 

set.seed(1)
train_pred <- knn(train_feat, train_feat, train$Outcome, k=3)
train_acc <- mean(train_pred == train$Outcome)

set.seed(1)
valid_pred <- knn(train_feat, valid_feat, train$Outcome, k=3)
valid_acc <- mean(valid_pred == valid$Outcome)

cat('Training Accuracy:   ', train_acc, '\n',
    'Validation Accuracy: ', valid_acc, sep='')

## Training Accuracy:   0.8513011
## Validation Accuracy: 0.7130435

Feature Scaling

For many machine learning algorithms, it is important to make sure that the features are on roughly the same scale before training. This is especially true of distance-based algorithms such as K-Nearest Neighbors.
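A small numeric sketch shows why scale matters for distance-based methods. The three patients below are hypothetical, and only two features are used to keep the arithmetic visible. The minimum and maximum values used for rescaling are taken from the summary(pima) output shown earlier.

```r
# Hypothetical patients described by two features on very different scales.
a  <- c(Insulin = 100, DiabetesPedigreeFunction = 0.10)
b  <- c(Insulin = 105, DiabetesPedigreeFunction = 2.40)  # pedigree differs a lot
c3 <- c(Insulin = 130, DiabetesPedigreeFunction = 0.10)  # insulin differs a little

euclid <- function(u, v) sqrt(sum((u - v)^2))

# Raw distances: the insulin axis dominates, so b looks closer to a than c3 does.
d_ab <- euclid(a, b)
d_ac <- euclid(a, c3)

# Rescale each feature to [0, 1] using the ranges from summary(pima).
rng_min <- c(Insulin = 0, DiabetesPedigreeFunction = 0.078)
rng_max <- c(Insulin = 846, DiabetesPedigreeFunction = 2.42)
rescale <- function(v) (v - rng_min) / (rng_max - rng_min)

# Scaled distances: the ordering reverses, and c3 is now far closer to a.
d_ab_s <- euclid(rescale(a), rescale(b))
d_ac_s <- euclid(rescale(a), rescale(c3))

print(c(raw_ab = d_ab, raw_ac = d_ac))
print(c(scaled_ab = d_ab_s, scaled_ac = d_ac_s))
```

Before scaling, a 30-unit insulin difference outweighs a pedigree difference that spans nearly the whole observed range of that feature. After scaling, each feature contributes in proportion to its own range.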

We will consider two types of scaling in this course: Standardization and Min/Max Scaling.

Standardization

With standardization, each column in the training set is scaled to have a mean of zero and a standard deviation of 1. If \(x_i\) is a single observation of a particular feature, then its scaled value \(z_i\) is given by:

\[ z_i = \frac{x_i - \bar x}{s_x} \]

Note that the values \(\bar x\) and \(s_x\) are calculated from the training set only, but are used to scale any observation, whether it is from the training set, validation set, testing set, or an entirely new observation.
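The formula can be applied by hand in base R. The values below are hypothetical and stand in for a single feature. The point of the sketch is that the validation values are scaled with the training mean and standard deviation, not with their own.

```r
# Hypothetical training and validation values for one feature.
train_x <- c(2, 4, 4, 4, 5, 5, 7, 9)
valid_x <- c(3, 10)

x_bar <- mean(train_x)  # training mean
s_x <- sd(train_x)      # training (sample) standard deviation

train_z <- (train_x - x_bar) / s_x
valid_z <- (valid_x - x_bar) / s_x  # reuses the training statistics

print(round(train_z, 3))
print(round(valid_z, 3))
```

The scaled training values have mean 0 and standard deviation 1 by construction. The scaled validation values generally do not, which is expected.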

Min/Max Scaling

With min/max scaling, each column in the training set is scaled linearly to have a minimum of zero and a maximum of 1. If \(x_i\) is a single observation of a particular feature, then its scaled value \(w_i\) is given by:

\[ w_i = \frac{x_i - \textrm{min}_x}{\textrm{max}_x - \textrm{min}_x} \]

Note that the values \(\textrm{min}_x\) and \(\textrm{max}_x\) are calculated from the training set only, but are used to scale any observation, whether it is from the training set, validation set, testing set, or an entirely new observation.
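As with standardization, the formula is easy to apply by hand. The values below are hypothetical. Because the minimum and maximum come from the training set, validation values that fall outside the training range map outside the interval from 0 to 1.

```r
# Hypothetical training and validation values for one feature.
train_x <- c(21, 25, 33, 47, 60)
valid_x <- c(19, 40, 66)

min_x <- min(train_x)  # training minimum
max_x <- max(train_x)  # training maximum

train_w <- (train_x - min_x) / (max_x - min_x)
valid_w <- (valid_x - min_x) / (max_x - min_x)  # reuses the training range

print(round(train_w, 3))
print(round(valid_w, 3))
```

The same effect is visible in the lesson's own output: in the summary of the min/max scaled validation set, a few columns have a maximum slightly above 1.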

Performing Feature Scaling

We can perform feature scaling using the preProcess function from the caret package.

We will start by performing standard scaling.

standard_scaler <- preProcess(train_feat, method=c('center', 'scale'))
train_s_sc <- predict(standard_scaler, train_feat)
valid_s_sc <- predict(standard_scaler, valid_feat)

summary(train_s_sc)

##   Pregnancies         Glucose        BloodPressure     SkinThickness    
##  Min.   :-1.1147   Min.   :-3.8189   Min.   :-3.6957   Min.   :-1.2738  
##  1st Qu.:-0.8191   1st Qu.:-0.6690   1st Qu.:-0.3764   1st Qu.:-1.2738  
##  Median :-0.2280   Median :-0.1020   Median : 0.1590   Median : 0.1330  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6587   3rd Qu.: 0.6146   3rd Qu.: 0.5873   3rd Qu.: 0.7447  
##  Max.   : 3.9100   Max.   : 2.3865   Max.   : 2.4076   Max.   : 4.7818  
##     Insulin             BMI           DiabetesPedigreeFunction
##  Min.   :-0.6843   Min.   :-4.06067   Min.   :-1.1888         
##  1st Qu.:-0.6843   1st Qu.:-0.62397   1st Qu.:-0.6953         
##  Median :-0.5569   Median :-0.01005   Median :-0.2974         
##  Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.0000         
##  3rd Qu.: 0.4141   3rd Qu.: 0.59122   3rd Qu.: 0.4414         
##  Max.   : 6.7496   Max.   : 3.45830   Max.   : 6.0353         
##       Age         
##  Min.   :-1.0355  
##  1st Qu.:-0.7797  
##  Median :-0.3532  
##  Mean   : 0.0000  
##  3rd Qu.: 0.6703  
##  Max.   : 4.0819

summary(valid_s_sc)

##   Pregnancies          Glucose         BloodPressure     
##  Min.   :-1.11470   Min.   :-3.81894   Min.   :-3.69566  
##  1st Qu.:-0.81913   1st Qu.:-0.70049   1st Qu.:-0.26928  
##  Median :-0.22800   Median :-0.22799   Median : 0.10548  
##  Mean   : 0.07272   Mean   :-0.03612   Mean   : 0.01354  
##  3rd Qu.: 0.65871   3rd Qu.: 0.55162   3rd Qu.: 0.58732  
##  Max.   : 3.31884   Max.   : 2.44947   Max.   : 2.83588  
##  SkinThickness         Insulin              BMI          
##  Min.   :-1.27384   Min.   :-0.68430   Min.   :-4.06067  
##  1st Qu.:-1.27384   1st Qu.:-0.68430   1st Qu.:-0.57966  
##  Median : 0.07186   Median :-0.27131   Median : 0.04059  
##  Mean   :-0.05899   Mean   : 0.05645   Mean   :-0.03668  
##  3rd Qu.: 0.68353   3rd Qu.: 0.45802   3rd Qu.: 0.52160  
##  Max.   : 2.02922   Max.   : 5.85331   Max.   : 4.43298  
##  DiabetesPedigreeFunction      Age          
##  Min.   :-1.15799         Min.   :-1.03554  
##  1st Qu.:-0.65443         1st Qu.:-0.77967  
##  Median :-0.26654         Median :-0.26792  
##  Mean   : 0.08721         Mean   : 0.02837  
##  3rd Qu.: 0.56245         3rd Qu.: 0.58499  
##  Max.   : 5.75460         Max.   : 3.31429

Performing Feature Scaling

We will now perform min/max scaling.

minmax_scaler <- preProcess(train_feat, method=c('range'))
train_mm_sc <- predict(minmax_scaler, train_feat)
valid_mm_sc <- predict(minmax_scaler, valid_feat)

summary(train_mm_sc)

##   Pregnancies         Glucose       BloodPressure    SkinThickness   
##  Min.   :0.00000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.05882   1st Qu.:0.5076   1st Qu.:0.5439   1st Qu.:0.0000  
##  Median :0.17647   Median :0.5990   Median :0.6316   Median :0.2323  
##  Mean   :0.22185   Mean   :0.6154   Mean   :0.6055   Mean   :0.2104  
##  3rd Qu.:0.35294   3rd Qu.:0.7145   3rd Qu.:0.7018   3rd Qu.:0.3333  
##  Max.   :1.00000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##     Insulin             BMI         DiabetesPedigreeFunction
##  Min.   :0.00000   Min.   :0.0000   Min.   :0.00000         
##  1st Qu.:0.00000   1st Qu.:0.4571   1st Qu.:0.06832         
##  Median :0.01714   Median :0.5387   Median :0.12340         
##  Mean   :0.09205   Mean   :0.5401   Mean   :0.16456         
##  3rd Qu.:0.14775   3rd Qu.:0.6187   3rd Qu.:0.22566         
##  Max.   :1.00000   Max.   :1.0000   Max.   :1.00000         
##       Age        
##  Min.   :0.0000  
##  1st Qu.:0.0500  
##  Median :0.1333  
##  Mean   :0.2024  
##  3rd Qu.:0.3333  
##  Max.   :1.0000

summary(valid_mm_sc)

##   Pregnancies         Glucose       BloodPressure    SkinThickness   
##  Min.   :0.00000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.05882   1st Qu.:0.5025   1st Qu.:0.5614   1st Qu.:0.0000  
##  Median :0.17647   Median :0.5787   Median :0.6228   Median :0.2222  
##  Mean   :0.23632   Mean   :0.6096   Mean   :0.6077   Mean   :0.2006  
##  3rd Qu.:0.35294   3rd Qu.:0.7043   3rd Qu.:0.7018   3rd Qu.:0.3232  
##  Max.   :0.88235   Max.   :1.0102   Max.   :1.0702   Max.   :0.5455  
##     Insulin             BMI         DiabetesPedigreeFunction
##  Min.   :0.00000   Min.   :0.0000   Min.   :0.00427         
##  1st Qu.:0.00000   1st Qu.:0.4630   1st Qu.:0.07398         
##  Median :0.05556   Median :0.5455   Median :0.12767         
##  Mean   :0.09965   Mean   :0.5352   Mean   :0.17664         
##  3rd Qu.:0.15366   3rd Qu.:0.6094   3rd Qu.:0.24242         
##  Max.   :0.87943   Max.   :1.1296   Max.   :0.96114         
##       Age        
##  Min.   :0.0000  
##  1st Qu.:0.0500  
##  Median :0.1500  
##  Mean   :0.2079  
##  3rd Qu.:0.3167  
##  Max.   :0.8500

3-Nearest Neighbors on Scaled Data

We will now create our 3-Nearest Neighbors model on the scaled data for the sake of comparison. We will start with the standardized data.

set.seed(1)
train_pred <- knn(train_s_sc, train_s_sc, train$Outcome, k=3)
train_acc <- mean(train_pred == train$Outcome)

set.seed(1)
valid_pred <- knn(train_s_sc, valid_s_sc, train$Outcome, k=3)
valid_acc <- mean(valid_pred == valid$Outcome)

cat('Training Accuracy:   ', train_acc, '\n',
    'Validation Accuracy: ', valid_acc, sep='')

## Training Accuracy:   0.8531599
## Validation Accuracy: 0.6956522

We will now consider the 3-nearest neighbors algorithm on the min-max scaled data.

set.seed(1)
train_pred <- knn(train_mm_sc, train_mm_sc, train$Outcome, k=3)
train_acc <- mean(train_pred == train$Outcome)

set.seed(1)
valid_pred <- knn(train_mm_sc, valid_mm_sc, train$Outcome, k=3)
valid_acc <- mean(valid_pred == valid$Outcome)

cat('Training Accuracy:   ', train_acc, '\n',
    'Validation Accuracy: ', valid_acc, sep='')

## Training Accuracy:   0.8494424
## Validation Accuracy: 0.726087

Selecting K

We will now build several KNN models. For each K from 1 to 100, we will calculate training and validation accuracy for the KNN model, using both the scaled and the unscaled data.

set.seed(1)

train_acc <- c()
valid_acc <- c()
train_acc_s_sc <- c()
valid_acc_s_sc <- c()
train_acc_mm_sc <- c()
valid_acc_mm_sc <- c()

k_range <- 1:100

for (i in k_range){
  # Unscaled
  set.seed(1)
  train_pred <- knn(train_feat, train_feat, train$Outcome, k=i)
  train_acc <- c(train_acc, mean(train_pred == train$Outcome))
  
  set.seed(1)
  valid_pred <- knn(train_feat, valid_feat, train$Outcome, k=i)
  valid_acc <- c(valid_acc, mean(valid_pred == valid$Outcome))
  
  # Standard Scaling
  set.seed(1)
  train_pred <- knn(train_s_sc, train_s_sc, train$Outcome, k=i)
  train_acc_s_sc <- c(train_acc_s_sc, mean(train_pred == train$Outcome))
  
  set.seed(1)
  valid_pred <- knn(train_s_sc, valid_s_sc, train$Outcome, k=i)
  valid_acc_s_sc <- c(valid_acc_s_sc, mean(valid_pred == valid$Outcome))
  
  # MinMax Scaling
  set.seed(1)
  train_pred <- knn(train_mm_sc, train_mm_sc, train$Outcome, k=i)
  train_acc_mm_sc <- c(train_acc_mm_sc, mean(train_pred == train$Outcome))
  
  set.seed(1)
  valid_pred <- knn(train_mm_sc, valid_mm_sc, train$Outcome, k=i)
  valid_acc_mm_sc <- c(valid_acc_mm_sc, mean(valid_pred == valid$Outcome))
  
}

max(valid_acc)

## [1] 0.7869565

max(valid_acc_s_sc)

## [1] 0.773913

max(valid_acc_mm_sc)

## [1] 0.7608696

Training and Validation Curves for Unscaled Data

plot(k_range, train_acc, pch='.', ylim=c(0.65, 1), col='salmon')
lines(k_range, train_acc, lwd=2, col='salmon')
lines(k_range, valid_acc, lwd=2, col='cornflowerblue')
legend(75, 1, legend=c("Training Acc", "Validation Acc"),
       col=c("salmon", "cornflowerblue"), lty=1, lwd=2, cex=0.8)

Training and Validation Curves for Standard Scaled Data

plot(k_range, train_acc_s_sc, pch='.', ylim=c(0.65, 1), col='salmon')
lines(k_range, train_acc_s_sc, lwd=2, col='salmon')
lines(k_range, valid_acc_s_sc, lwd=2, col='cornflowerblue')
legend(75, 1, legend=c("Training Acc", "Validation Acc"),
       col=c("salmon", "cornflowerblue"), lty=1, lwd=2, cex=0.8)

Training and Validation Curves for Min-Max Scaled Data

plot(k_range, train_acc_mm_sc, pch='.', ylim=c(0.65, 1), col='salmon')
lines(k_range, train_acc_mm_sc, lwd=2, col='salmon')
lines(k_range, valid_acc_mm_sc, lwd=2, col='cornflowerblue')
legend(75, 1, legend=c("Training Acc", "Validation Acc"),
       col=c("salmon", "cornflowerblue"), lty=1, lwd=2, cex=0.8)

Selecting the Final Model

Our best validation performance was obtained using unscaled data. Let’s determine the value of K used to create this particular model.

which.max(valid_acc)

## [1] 11

We will now recalculate the training and validation accuracies for this model.

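# Use only the feature columns; passing the full data frames would include Outcome as a predictor.
set.seed(1)
train_pred <- knn(train_feat, train_feat, train$Outcome, k=11)
train_acc <- mean(train_pred == train$Outcome)

set.seed(1)
valid_pred <- knn(train_feat, valid_feat, train$Outcome, k=11)
valid_acc <- mean(valid_pred == valid$Outcome)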

cat('Training Accuracy:   ', train_acc, '\n',
    'Validation Accuracy: ', valid_acc, sep='')

## Training Accuracy:   0.7788104
## Validation Accuracy: 0.7869565

We can use the table function to create a quick confusion matrix.

table(valid$Outcome, valid_pred)

##    valid_pred
##       0   1
##   0 134  16
##   1  33  47

Evaluating the Final Model

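We will use the confusionMatrix function from the caret package to view several classification metrics for our final model. By default, the first factor level (here, class 0) is treated as the 'Positive' class, so the reported sensitivity describes how well class 0 is identified.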

confusionMatrix(valid_pred, valid$Outcome)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 134  33
##          1  16  47
##                                          
##                Accuracy : 0.787          
##                  95% CI : (0.7283, 0.838)
##     No Information Rate : 0.6522         
##     P-Value [Acc > NIR] : 5.84e-06       
##                                          
##                   Kappa : 0.5059         
##                                          
##  Mcnemar's Test P-Value : 0.02227        
##                                          
##             Sensitivity : 0.8933         
##             Specificity : 0.5875         
##          Pos Pred Value : 0.8024         
##          Neg Pred Value : 0.7460         
##              Prevalence : 0.6522         
##          Detection Rate : 0.5826         
##    Detection Prevalence : 0.7261         
##       Balanced Accuracy : 0.7404         
##                                          
##        'Positive' Class : 0              
##
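Several of the statistics reported above can be reproduced by hand from the four counts in the confusion matrix, which helps clarify what each one measures. The sketch below copies the counts from the output above. The variable names are chosen for this example, and class 0 is treated as the positive class to match that output.

```r
# Counts copied from the confusion matrix (rows = prediction, columns = reference).
tp <- 134  # predicted 0, reference 0 (class 0 is the 'Positive' class here)
fn <- 16   # predicted 1, reference 0
fp <- 33   # predicted 0, reference 1
tn <- 47   # predicted 1, reference 1

accuracy    <- (tp + tn) / (tp + tn + fp + fn)
sensitivity <- tp / (tp + fn)   # share of reference-0 cases predicted as 0
specificity <- tn / (tn + fp)   # share of reference-1 cases predicted as 1
ppv         <- tp / (tp + fp)   # share of predicted-0 cases that are truly 0
npv         <- tn / (tn + fn)   # share of predicted-1 cases that are truly 1
balanced    <- (sensitivity + specificity) / 2

metrics <- round(c(accuracy = accuracy, sensitivity = sensitivity,
                   specificity = specificity, ppv = ppv, npv = npv,
                   balanced = balanced), 4)
print(metrics)
```

Because class 0 is the positive class, the specificity of 0.5875 is the figure that describes class 1: the model correctly identifies 47 of the 80 class 1 patients in the validation set.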

Lesson 4.3 - K-Nearest Neighbors Classification
