K-Nearest Neighbour Classification

Mayuri Ingle
01/21/2019

Introduction

In this project, we will build a kNN (k-Nearest Neighbour) classifier to predict wine quality using the red wine quality dataset from the UCI Machine Learning Repository.

Before we start, a few details about the kNN algorithm…
kNN is a non-parametric supervised learning technique in which we try to classify a data point into a given category with the help of a training set. In simple words, it stores all the training cases and classifies new cases based on their similarity to them.

Predictions are made for a new instance (x) by searching the entire training set for the K most similar cases (neighbours) and summarizing the output variable of those K cases. In classification this is the mode (most common) class value.
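
To make the idea concrete, here is a minimal sketch of a single kNN prediction in base R, using hypothetical toy data (the real classifier below uses the class and caret packages):

# Toy illustration of the kNN idea: Euclidean distance plus the mode
set.seed(1)
train_x <- matrix(rnorm(20), ncol = 2)        # 10 training points, 2 features
train_y <- factor(rep(c("A", "B"), each = 5)) # their class labels
new_x   <- c(0.5, -0.2)                       # a new instance to classify
k       <- 3

# Distance from the new instance to every training point
dists <- sqrt(rowSums((train_x - matrix(new_x, nrow(train_x), 2, byrow = TRUE))^2))

# Take the k nearest neighbours and return their most common (mode) class
nearest <- order(dists)[1:k]
names(which.max(table(train_y[nearest])))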

Pros:
- Easy to understand (it is one of the simplest algorithms in ML).
- Makes no assumptions about the data (e.g., about its distribution).
- Can be applied to both classification and regression.
- Works easily on multi-class problems.

Cons:
- Memory-intensive / computationally expensive.
- Sensitive to the scale of the data (standardization is very important here).
- Does not work well with rare-event (skewed) target variables.
- Struggles when the number of independent variables is high.
- For any given problem, a small value of k leads to a large variance in predictions, while a large value of k may lead to a large model bias; hence finding the optimal value of k is important (see the short illustration below).
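
As a quick illustration of this trade-off, the sketch below compares test accuracy for a very small, a moderate and a very large k on simulated data (not the wine data; the values of k are illustrative):

# Effect of k on simulated two-class data, using knn from the class package
library(class)
set.seed(42)
n  <- 400
x  <- matrix(rnorm(2 * n), ncol = 2)
y  <- factor(ifelse(x[, 1] + x[, 2] + rnorm(n) > 0, "Pos", "Neg"))
tr <- sample(n, 300)  # 300 training rows, 100 test rows

for (k in c(1, 15, 151)) {
  pred <- knn(train = x[tr, ], test = x[-tr, ], cl = y[tr], k = k)
  cat("k =", k, " accuracy =", mean(pred == y[-tr]), "\n")
}

With k = 1 the predictions track the noise in the training data (high variance); with k = 151 more than half of the training set votes on every prediction (high bias).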

Let’s get started with our classifier….

Initial data analysis and visualization

#Import all required libraries
library(class)
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(e1071)
library(tidyverse)
## -- Attaching packages ----------- tidyverse 1.2.1 --
## v tibble  2.0.1     v purrr   0.2.5
## v tidyr   0.8.2     v dplyr   0.7.8
## v readr   1.3.1     v stringr 1.3.1
## v ggplot2 3.1.0     v forcats 0.3.0
## -- Conflicts -------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
## x purrr::lift()   masks caret::lift()
# Load red wine quality dataset

redWine <- read_csv("F:\\RFiles\\Datasets\\wineQualityReds.csv")
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
##   X1 = col_double(),
##   fixed.acidity = col_double(),
##   volatile.acidity = col_double(),
##   citric.acid = col_double(),
##   residual.sugar = col_double(),
##   chlorides = col_double(),
##   free.sulfur.dioxide = col_double(),
##   total.sulfur.dioxide = col_double(),
##   density = col_double(),
##   pH = col_double(),
##   sulphates = col_double(),
##   alcohol = col_double(),
##   quality = col_double()
## )
# View structure of dataset
str(redWine)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 1599 obs. of  13 variables:
##  $ X1                  : num  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : num  5 5 5 6 5 5 5 7 7 5 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   X1 = col_double(),
##   ..   fixed.acidity = col_double(),
##   ..   volatile.acidity = col_double(),
##   ..   citric.acid = col_double(),
##   ..   residual.sugar = col_double(),
##   ..   chlorides = col_double(),
##   ..   free.sulfur.dioxide = col_double(),
##   ..   total.sulfur.dioxide = col_double(),
##   ..   density = col_double(),
##   ..   pH = col_double(),
##   ..   sulphates = col_double(),
##   ..   alcohol = col_double(),
##   ..   quality = col_double()
##   .. )

The dataset has 13 columns: the first is an index, the last is the wine quality, and the rest describe the wine's composition.

# Removing first column 
redWine <- redWine[,-1]

# Summary of dataset

summary(redWine)
##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00      
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00      
##  Median :0.07900   Median :14.00       Median : 38.00      
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47      
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00      
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00      
##     density             pH          sulphates         alcohol     
##  Min.   :0.9901   Min.   :2.740   Min.   :0.3300   Min.   : 8.40  
##  1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50  
##  Median :0.9968   Median :3.310   Median :0.6200   Median :10.20  
##  Mean   :0.9967   Mean   :3.311   Mean   :0.6581   Mean   :10.42  
##  3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10  
##  Max.   :1.0037   Max.   :4.010   Max.   :2.0000   Max.   :14.90  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.636  
##  3rd Qu.:6.000  
##  Max.   :8.000

The summary shows that wine quality is available in the range of 3 to 8.

# Visualise the quality variable

redWine %>% ggplot(aes(x = quality ))+ geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The dataset is quite imbalanced: there are far fewer observations for wine qualities 3, 4 and 8 than for 5 and 6, so the model would be trained mainly on the majority (5 and 6) quality observations. We will convert the quality variable into two categories: “Low” if wine quality is 5 or less, otherwise “High”.

# Recode quality as Low or High
redWine1 <- redWine %>% 
          mutate(quality = ifelse(quality <= 5 ,"Low" , "High"))

redWine1$quality <- as.factor(redWine1$quality)

# Plot new variable
redWine1 %>% ggplot(aes(x = quality))+ geom_bar(width = 0.2)

Now we have a reasonably balanced dataset, with a much better ratio between the response categories.
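
We can also confirm this numerically with a quick check of the class counts and their proportions:

# Class counts and proportions after recoding
table(redWine1$quality)
prop.table(table(redWine1$quality))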

Data Preprocessing

We will split the data into training and test sets, and then standardise the predictor variables.

# Create data partition
set.seed(123)
 wine_split <- createDataPartition(redWine1$quality, p = 0.7 , list= FALSE)

wine_train <- redWine1[wine_split ,]
wine_test <- redWine1[-wine_split,]

#  Standardise data 

preProcValues <- preProcess(wine_train, method = c("center", "scale"))

wine_trainsc <- predict(preProcValues, wine_train )
wine_testsc <- predict(preProcValues, wine_test )

# Check processed data
summary(wine_trainsc$alcohol)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -1.8926 -0.8666 -0.2136  0.0000  0.6258  4.1702
sd(wine_trainsc$alcohol)
## [1] 1

The variables are now standardised to mean zero and standard deviation one, so we can train our model on this data.
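
To confirm that the same holds for every predictor, not just alcohol, we can run a quick check (quality, the factor in column 12, is dropped):

# Means should be ~0 and standard deviations ~1 for all predictors
round(colMeans(wine_trainsc[, -12]), 3)
round(sapply(wine_trainsc[, -12], sd), 3)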

kNN classifier using the knn function from the class library

Here we build a simple model using the knn function from the class library, where we have to explicitly define the value of k. A common rule of thumb is to set k to the square root of the number of observations, which is roughly 40 here (sqrt of 1599); we use the odd value 39 to avoid tied votes.

# Separate predictors and class labels

wineTrain <- wine_trainsc[,-12]
wineTest <- wine_testsc[,-12]

wineTrain_label <- wine_trainsc[,12 , drop = TRUE]
wineTest_label <- wine_testsc[,12 , drop = TRUE]

# Basic knn model

set.seed(123)

wineknns<- knn(train = wineTrain, test=wineTest, cl=wineTrain_label, k=39)

# Evaluating model performance

confusionMatrix(wineTest_label, wineknns)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction High Low
##       High  196  60
##       Low    72 151
##                                          
##                Accuracy : 0.7244         
##                  95% CI : (0.6821, 0.764)
##     No Information Rate : 0.5595         
##     P-Value [Acc > NIR] : 7.08e-14       
##                                          
##                   Kappa : 0.4443         
##  Mcnemar's Test P-Value : 0.3384         
##                                          
##             Sensitivity : 0.7313         
##             Specificity : 0.7156         
##          Pos Pred Value : 0.7656         
##          Neg Pred Value : 0.6771         
##              Prevalence : 0.5595         
##          Detection Rate : 0.4092         
##    Detection Prevalence : 0.5344         
##       Balanced Accuracy : 0.7235         
##                                          
##        'Positive' Class : High           
## 

The accuracy of our model is ~72%, which means there is scope for improvement in our predictions.
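
One caveat worth noting: caret’s confusionMatrix() expects the predictions as its first argument (data) and the true labels as the second (reference). The call above passes them the other way round, which transposes the table and swaps the roles of sensitivity and specificity (accuracy is unaffected). The conventional call would be:

# Predictions first, true labels second
confusionMatrix(data = wineknns, reference = wineTest_label)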

kNN classifier using caret

We will now train our model using the train function from the caret library; this method automatically finds a good value of k using cross-validation. Here we set up repeated cross-validation via trainControl: ‘number’ denotes the number of folds and ‘repeats’ the number of times the cross-validation is repeated. In this case, 3 separate 10-fold cross-validations are used.

# Setting up train controls
repeats = 3
numbers = 10
tunel = 20



trnCntrl = trainControl(method = "repeatedcv",
                 number = numbers,
                 repeats = repeats,
                 classProbs = TRUE,
                 summaryFunction = twoClassSummary)

# kNN using the train function from caret
set.seed(123)
wineKnn <- train(quality~. , data = wine_trainsc, method = "knn",
               trControl = trnCntrl,
               metric = "ROC",
               tuneLength = tunel)

# Summary of model
wineKnn
## k-Nearest Neighbors 
## 
## 1120 samples
##   11 predictor
##    2 classes: 'High', 'Low' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 1008, 1008, 1009, 1008, 1007, 1008, ... 
## Resampling results across tuning parameters:
## 
##   k   ROC        Sens       Spec     
##    5  0.7994567  0.7846704  0.6737421
##    7  0.8006876  0.7712806  0.6679487
##    9  0.8111298  0.7829473  0.6903120
##   11  0.8109276  0.7696045  0.6979923
##   13  0.8112747  0.7729379  0.6999032
##   15  0.8123058  0.7667797  0.6973149
##   17  0.8131111  0.7740113  0.6998670
##   19  0.8161605  0.7690113  0.7069424
##   21  0.8131647  0.7701036  0.7037978
##   23  0.8139441  0.7701130  0.7069787
##   25  0.8145111  0.7712524  0.6998791
##   27  0.8135681  0.7701412  0.6922109
##   29  0.8129872  0.7696045  0.6979560
##   31  0.8114864  0.7745763  0.6896710
##   33  0.8120940  0.7746045  0.6934930
##   35  0.8145213  0.7706968  0.6992501
##   37  0.8147393  0.7679190  0.6947750
##   39  0.8140626  0.7779567  0.6941461
##   41  0.8151652  0.7751412  0.6966860
##   43  0.8151473  0.7729190  0.6934930
## 
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was k = 19.
# Plot to visualize optimal k selection
plot(wineKnn)

The plot above shows the optimal value of k as 19, and the final model was built using k = 19.

Finally, to make predictions on our test set we use the predict function, where the first argument is the trained model and the second argument is the new data for which we want predictions.

# Predict class probabilities on the test set
test_pred <- predict(wineKnn ,wine_testsc, type = "prob")

# Recode probabilities into the two classes using a 0.5 threshold
test_pred$final <- as.factor(ifelse(test_pred$High > 0.5 ,"High","Low"))

#Evaluating model performance
confusionMatrix(wineTest_label, test_pred$final)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction High Low
##       High  200  56
##       Low    78 145
##                                         
##                Accuracy : 0.7203        
##                  95% CI : (0.6777, 0.76)
##     No Information Rate : 0.5804        
##     P-Value [Acc > NIR] : 1.513e-10     
##                                         
##                   Kappa : 0.4342        
##  Mcnemar's Test P-Value : 0.06966       
##                                         
##             Sensitivity : 0.7194        
##             Specificity : 0.7214        
##          Pos Pred Value : 0.7813        
##          Neg Pred Value : 0.6502        
##              Prevalence : 0.5804        
##          Detection Rate : 0.4175        
##    Detection Prevalence : 0.5344        
##       Balanced Accuracy : 0.7204        
##                                         
##        'Positive' Class : High          
## 
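
As an aside, thresholding the probabilities at 0.5 effectively reproduces caret’s default behaviour here: calling predict() without type = "prob" returns the predicted classes directly.

# Equivalent shortcut: predict() returns class labels by default
test_classes <- predict(wineKnn, wine_testsc)
confusionMatrix(data = test_classes, reference = wineTest_label)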

Again we get an accuracy of ~72%. Our classifier could be tuned further to increase accuracy; one option is sketched below.
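
For example (a sketch under the same setup; the grid values are illustrative, not tuned results), we could search an explicit grid of odd k values and let train handle the centring and scaling within each resample:

# Possible further tuning: explicit grid of odd k values, with
# standardisation folded into train via preProcess
set.seed(123)
knnGrid <- expand.grid(k = seq(1, 101, by = 2))
wineKnn2 <- train(quality ~ ., data = wine_train, method = "knn",
                  preProcess = c("center", "scale"),
                  trControl = trnCntrl,
                  metric = "ROC",
                  tuneGrid = knnGrid)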

This concludes our exercise of building a kNN classifier.