Introduction
Knn Classification
Dataset
iris.csv
Problem Definition
Create a Knn model with the given dataset and predict the Outcome Attribute (Species).
Steps
The steps to create the model are:
* EDA
* Clean 0s, NAs, Outliers
* Correlation
* VDA
* Find value of K
* Split Dataset into Train / Test
* Create Knn Model with Train
* Confirm value of K
* Check Accuracy
* Predict OutcomeVariable in Test using Knn Model
* Confusion Matrix (PredictedOutcomeVariable v/s ActualOutcomeVariable)
* Create Knn Final Model with Full Dataset
* Predict OutcomeVariable Using Knn Final Model
Data Location
The Iris dataset was used in R.A. Fisher’s classic 1936 paper. The dataset can be found at
https://www.kaggle.com/uciml/iris
Data Description
The Iris dataset contains the following columns:
Id
SepalLengthCm
SepalWidthCm
PetalLengthCm
PetalWidthCm
Species (Outcome Variable)
knitr Global Options
# for development
knitr::opts_chunk$set(echo=TRUE, eval=TRUE, error=TRUE, warning=FALSE, message=FALSE, cache=FALSE, tidy=FALSE, fig.path='figures/')
# for production
#knitr::opts_chunk$set(echo=TRUE, eval=TRUE, error=FALSE, warning=FALSE, message=FALSE, cache=FALSE, tidy=FALSE, fig.path='figures/')
Load Libs
library(tidyr)
library(dplyr)
library(ggplot2)
#install.packages("corrgram")
#install.packages("fpc")
library(corrgram)
library(gridExtra)
#install.packages("caret")
#install.packages('TMB', type = 'source')
#install.packages("sjPlot")
library(caret)
library(sjPlot)
#install.packages("doMC")
# Parallel Processing
# The caret package supports parallel processing in order to decrease the compute
# time for a given experiment. It is supported automatically as long as it is
# configured.
# In this example we can load the doMC package and set the number of cores to 4,
# making available 4 worker threads to caret when tuning the model. This is used
# for the loops for the repeats of cross validation for each parameter combination.
# configure multicore
#library(doMC)
#registerDoMC(cores=4)
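# Note (sketch): doMC is not available on Windows; the doParallel package is a
# drop-in alternative, assuming it is installed.
#library(doParallel)
#cl <- makePSOCKcluster(4)   # start 4 worker processes
#registerDoParallel(cl)      # caret uses any registered foreach backend
#stopCluster(cl)             # release the workers when done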
Functions
# count NAs in a vector
detect_na <- function(inp) {
  sum(is.na(inp))
}

# return values lying beyond 1.5 * IQR of the quartiles (Tukey fences)
detect_outliers <- function(inp, na.rm=TRUE) {
  i.qnt <- quantile(inp, probs=c(.25, .75), na.rm=na.rm)
  i.max <- 1.5 * IQR(inp, na.rm=na.rm)
  otp <- inp
  otp[inp < (i.qnt[1] - i.max)] <- NA
  otp[inp > (i.qnt[2] + i.max)] <- NA
  inp[is.na(otp)]
}
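A quick sanity check of the helpers on a toy vector (illustrative values only):
x <- c(1, 2, 3, 4, 100)   # one extreme value
detect_na(x)              # 0 -- no NAs
detect_outliers(x)        # 100 -- beyond the upper 1.5*IQR fence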
Load Dataset
df <- read.csv("C:/Users/PC/Downloads/iris.csv", header=T, stringsAsFactors=T)
head(df)
## Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species
## 1 1 5.1 3.5 1.4 0.2 Iris-setosa
## 2 2 4.9 3.0 1.4 0.2 Iris-setosa
## 3 3 4.7 3.2 1.3 0.2 Iris-setosa
## 4 4 4.6 3.1 1.5 0.2 Iris-setosa
## 5 5 5.0 3.6 1.4 0.2 Iris-setosa
## 6 6 5.4 3.9 1.7 0.4 Iris-setosa
Dataframe Structure
str(df)
## 'data.frame': 150 obs. of 6 variables:
## $ Id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ SepalLengthCm: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ SepalWidthCm : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ PetalLengthCm: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ PetalWidthCm : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "Iris-setosa",..: 1 1 1 1 1 1 1 1 1 1 ...
Dataframe Summary
lapply(df, FUN=summary)
## $Id
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 38.25 75.50 75.50 112.75 150.00
##
## $SepalLengthCm
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.300 5.100 5.800 5.843 6.400 7.900
##
## $SepalWidthCm
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 2.800 3.000 3.054 3.300 4.400
##
## $PetalLengthCm
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.600 4.350 3.759 5.100 6.900
##
## $PetalWidthCm
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.100 0.300 1.300 1.199 1.800 2.500
##
## $Species
## Iris-setosa Iris-versicolor Iris-virginica
## 50 50 50
Missing Data
lapply(df, FUN=detect_na)
## $Id
## [1] 0
##
## $SepalLengthCm
## [1] 0
##
## $SepalWidthCm
## [1] 0
##
## $PetalLengthCm
## [1] 0
##
## $PetalWidthCm
## [1] 0
##
## $Species
## [1] 0
Observation
No missing values
Predictor Data - Only Predictor Cols
dfT <- select(df, -Id, -Species)
head(dfT)
## SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm
## 1 5.1 3.5 1.4 0.2
## 2 4.9 3.0 1.4 0.2
## 3 4.7 3.2 1.3 0.2
## 4 4.6 3.1 1.5 0.2
## 5 5.0 3.6 1.4 0.2
## 6 5.4 3.9 1.7 0.4
Detect Outliers
lapply(dfT, FUN=detect_outliers)
## $SepalLengthCm
## numeric(0)
##
## $SepalWidthCm
## [1] 4.4 4.1 4.2 2.0
##
## $PetalLengthCm
## numeric(0)
##
## $PetalWidthCm
## numeric(0)
Observation
Outliers are present in SepalWidthCm, but the count is low (4), so for this model we will keep them.
# boxplot of all columns
pltBoxPlot <- ggplot(stack(dfT), aes(x=ind, y=values)) +
  geom_boxplot(fill=rainbow(ncol(dfT)), color=rainbow(ncol(dfT))) +
  labs(title="Iris Data - Outliers") +
  labs(x="Features") +
  labs(y="Value")
print(pltBoxPlot)
Observation
The boxplot confirms the four SepalWidthCm outliers found above; the other features show none.
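If we did want to treat them, one simple option (a sketch only, not applied in this analysis) is to cap values at the same 1.5 * IQR fences used by detect_outliers:
# Sketch: winsorize, i.e. cap values at the Tukey fences instead of dropping them
cap_outliers <- function(inp, na.rm=TRUE) {
  i.qnt <- quantile(inp, probs=c(.25, .75), na.rm=na.rm)
  i.max <- 1.5 * IQR(inp, na.rm=na.rm)
  pmin(pmax(inp, i.qnt[1] - i.max), i.qnt[2] + i.max)
}
#dfT$SepalWidthCm <- cap_outliers(dfT$SepalWidthCm)   # not run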
Correlation
#https://www.statmethods.net/stats/correlations.html
cor(dfT)
## SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm
## SepalLengthCm 1.0000000 -0.1093692 0.8717542 0.8179536
## SepalWidthCm -0.1093692 1.0000000 -0.4205161 -0.3565441
## PetalLengthCm 0.8717542 -0.4205161 1.0000000 0.9627571
## PetalWidthCm 0.8179536 -0.3565441 0.9627571 1.0000000
Visualize Correlation
# http://www.statmethods.net/advgraphs/correlograms.html
corrgram(dfT)
Scatter Plots
plot(dfT)
Scatter Plot - Petals
ggplot(df, aes(x=PetalLengthCm, y=PetalWidthCm)) +
  geom_point(size=0.9, colour="blue", fill="blue") +
  labs(title="Scatter Plot - Petals") +
  labs(x="PetalLengthCm") +
  labs(y="PetalWidthCm")
Scatter Plot - Sepals
ggplot(df, aes(x=SepalLengthCm, y=SepalWidthCm)) +
  geom_point(size=0.9, colour="blue", fill="blue") +
  labs(title="Scatter Plot - Sepals") +
  labs(x="SepalLengthCm") +
  labs(y="SepalWidthCm")
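Knn relies on the classes occupying distinct regions of feature space; colouring the petal scatter by Species (a small sketch extending the plot above) makes that separability visible:
ggplot(df, aes(x=PetalLengthCm, y=PetalWidthCm, colour=Species)) +
  geom_point(size=0.9) +
  labs(title="Scatter Plot - Petals by Species") +
  labs(x="PetalLengthCm") +
  labs(y="PetalWidthCm")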
Note:
As an exploratory aside, we calculate the "Within-Sum-Of-Squares (WSS)" iteratively.
WSS is a measure of the homogeneity within a cluster.
The elbow function plots WSS against the number of clusters.
wss Function
sjc.elbow(dfT)
Note:
We have plotted WSS against the number of clusters. There is not much decrease in WSS once the number of clusters goes beyond 8 - 10.
This graph is known as the "Elbow Curve", and the bending point (nc = 8 - 10 in our case) is the "Elbow Point": with 8 - 10 clusters we would get good homogeneity within clusters.
Note, however, that this elbow heuristic chooses the number of clusters for clustering algorithms such as kmeans; it does not choose the K of a Knn classifier. The K for our Knn model is selected below by cross-validated accuracy.
Dataset Split
set.seed(707)
idTrn <- createDataPartition(y=df$Species, p=0.7, list=FALSE)
dfTrn <- df[idTrn,]
dfTrn <- select(dfTrn, -Id)
dfTst <- df[-idTrn,]
dfTst <- select(dfTst, -Id)
Training Dataset RowCount & ColCount
dim(dfTrn)
## [1] 105 5
Training Dataset Head
head(dfTrn)
## SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species
## 3 4.7 3.2 1.3 0.2 Iris-setosa
## 5 5.0 3.6 1.4 0.2 Iris-setosa
## 6 5.4 3.9 1.7 0.4 Iris-setosa
## 9 4.4 2.9 1.4 0.2 Iris-setosa
## 11 5.4 3.7 1.5 0.2 Iris-setosa
## 13 4.8 3.0 1.4 0.1 Iris-setosa
Training Dataset Summary
lapply(dfTrn, FUN=summary)
## $SepalLengthCm
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.30 5.10 5.80 5.85 6.40 7.90
##
## $SepalWidthCm
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 2.800 3.000 3.072 3.300 4.400
##
## $PetalLengthCm
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.600 4.400 3.771 5.100 6.900
##
## $PetalWidthCm
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.100 0.300 1.300 1.214 1.900 2.500
##
## $Species
## Iris-setosa Iris-versicolor Iris-virginica
## 35 35 35
Testing Dataset RowCount & ColCount
dim(dfTst)
## [1] 45 5
Testing Dataset Head
head(dfTst)
## SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species
## 1 5.1 3.5 1.4 0.2 Iris-setosa
## 2 4.9 3.0 1.4 0.2 Iris-setosa
## 4 4.6 3.1 1.5 0.2 Iris-setosa
## 7 4.6 3.4 1.4 0.3 Iris-setosa
## 8 5.0 3.4 1.5 0.2 Iris-setosa
## 10 4.9 3.1 1.5 0.1 Iris-setosa
Testing Dataset Summary
lapply(dfTst, FUN=summary)
## $SepalLengthCm
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.500 5.100 5.700 5.827 6.400 7.400
##
## $SepalWidthCm
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.300 2.800 3.000 3.011 3.300 3.800
##
## $PetalLengthCm
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.300 1.600 4.200 3.729 5.100 6.100
##
## $PetalWidthCm
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.100 0.300 1.300 1.162 1.800 2.400
##
## $Species
## Iris-setosa Iris-versicolor Iris-virginica
## 15 15 15
**TRAINING CLASSIFIER - NOTES**
##https://topepo.github.io/caret/index.html
##https://www.machinelearningplus.com/machine-learning/caret-package/
The caret package provides the train() function for fitting many different algorithms; we just pass different parameter values for each algorithm.
Before calling train(), we first call trainControl(), which controls the computational nuances of train().
##https://www.rdocumentation.org/packages/caret/versions/6.0-82/topics/trainControl
In trainControl(), we set three parameters:
* The "method" parameter selects the resampling method. It can take many values, such as "boot", "boot632", "cv", "repeatedcv", "LOOCV", "LGOCV", etc. https://stats.stackexchange.com/questions/17602/caret-re-sampling-methods
http://appliedpredictivemodeling.com/blog/2014/11/27/vpuig01pqbklmi72b8lcl3ij5hj2qm
For this exercise we will use "repeatedcv", i.e. repeated cross-validation, which is a sensible default for most classification models.
* The "number" parameter sets the number of cross-validation folds.
* The "repeats" parameter sets how many times the cross-validation is repeated.
**We will use number=10 and repeats=10.**
trainControl() returns a list, which we then pass to train().
##https://www.rdocumentation.org/packages/caret/versions/4.47/topics/train
Before training the classifier, call set.seed() so that the random fold assignments in cross-validation are reproducible (Knn itself has no random initialization; the randomness comes from the resampling).
To train the Knn classifier, pass train() the following parameters:
* Outcome~. is a formula: it uses all other attributes as predictors, with "Outcome" as the target variable (Species~. in our case).
* data=data-frame-for-creating-model
* "method" parameter set to "knn".
* "trControl" parameter set to the result of our trainControl() call.
##https://www.rdocumentation.org/packages/caret/versions/6.0-83/topics/preProcess
##https://www.machinelearningplus.com/machine-learning/caret-package/#35howtopreprocesstotransformthedata
* preProcess - used to transform or impute the predictors before training (see the preProcess sketch after this list)
##https://topepo.github.io/caret/model-training-and-tuning.html#preproc
* tuneLength=20
There are two main ways to do hyperparameter tuning with train():
- Set the tuneLength
- Define and set the tuneGrid
tuneLength is the number of unique values caret will consider for each tuning parameter when forming the hyperparameter combinations; caret determines the candidate values automatically.
Alternatively, to explicitly control which values are considered for each parameter, define a tuneGrid and pass it to train(), as sketched below.
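Two minimal sketches of the points above, assuming the dfTrn split created earlier and the knnCntrl object defined in the next section; ppIris, dfTrnPP, knnGrid and knnModelGrid are illustrative names, and the grid of odd k values is just one reasonable choice (odd k avoids voting ties):
# Sketch 1: what preProcess=c("center","scale") does, stand-alone
ppIris <- preProcess(dfTrn[, 1:4], method=c("center", "scale"))
dfTrnPP <- predict(ppIris, dfTrn[, 1:4])   # every predictor now has mean 0, sd 1
# Sketch 2: explicit control of candidate k values with tuneGrid instead of tuneLength
knnGrid <- expand.grid(k=seq(3, 43, by=2))
knnModelGrid <- train(Species~., data=dfTrn,
                      method="knn",
                      trControl=knnCntrl,
                      preProcess=c("center","scale"),
                      tuneGrid=knnGrid)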
Make KNN Model
set.seed(707)
knnCntrl <- trainControl(method="repeatedcv", number=10, repeats=10)
knnModel <- train(Species~.,
                  data=dfTrn,
                  method="knn",
                  trControl=knnCntrl,              # the cross-validation scheme from trainControl()
                  preProcess=c("center","scale"),  # standardize: subtract each column's mean, divide by its sd
                  tuneLength=20)                   # evaluate 20 candidate values of k (tuneGrid is the explicit alternative)
print(knnModel)
## k-Nearest Neighbors
##
## 105 samples
## 4 predictor
## 3 classes: 'Iris-setosa', 'Iris-versicolor', 'Iris-virginica'
##
## Pre-processing: centered (4), scaled (4)
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 96, 95, 95, 94, 93, 94, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.9400909 0.9093837
## 7 0.9583434 0.9370746
## 9 0.9615707 0.9418685
## 11 0.9530455 0.9290628
## 13 0.9365404 0.9042755
## 15 0.9355960 0.9029681
## 17 0.9376717 0.9064070
## 19 0.9320202 0.8978608
## 21 0.9226768 0.8837904
## 23 0.9063687 0.8592828
## 25 0.8967323 0.8447674
## 27 0.8870000 0.8301669
## 29 0.8808283 0.8206314
## 31 0.8828889 0.8237449
## 33 0.8823535 0.8228818
## 35 0.8731818 0.8091399
## 37 0.8652778 0.7972660
## 39 0.8589798 0.7877060
## 41 0.8660556 0.7986333
## 43 0.8596667 0.7889175
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 9.
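The selected k can also be read directly off the fitted object (standard caret accessors):
knnModel$bestTune                                   # data frame holding the winning k (k = 9 here)
subset(knnModel$results, k == knnModel$bestTune$k)  # its cross-validated Accuracy and Kappa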
Plot Knn Model
# plot the effect of parameters on accuracy
plot(knnModel)
Predict All Records In TrainData
vTrn <- predict(knnModel, newdata=dfTrn)
head(vTrn)
## [1] Iris-setosa Iris-setosa Iris-setosa Iris-setosa Iris-setosa Iris-setosa
## Levels: Iris-setosa Iris-versicolor Iris-virginica
Confusion Matrix
confusionMatrix(vTrn, dfTrn$Species)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Iris-setosa Iris-versicolor Iris-virginica
## Iris-setosa 35 0 0
## Iris-versicolor 0 34 2
## Iris-virginica 0 1 33
##
## Overall Statistics
##
## Accuracy : 0.9714
## 95% CI : (0.9188, 0.9941)
## No Information Rate : 0.3333
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9571
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Iris-setosa Class: Iris-versicolor
## Sensitivity 1.0000 0.9714
## Specificity 1.0000 0.9714
## Pos Pred Value 1.0000 0.9444
## Neg Pred Value 1.0000 0.9855
## Prevalence 0.3333 0.3333
## Detection Rate 0.3333 0.3238
## Detection Prevalence 0.3333 0.3429
## Balanced Accuracy 1.0000 0.9714
## Class: Iris-virginica
## Sensitivity 0.9429
## Specificity 0.9857
## Pos Pred Value 0.9706
## Neg Pred Value 0.9718
## Prevalence 0.3333
## Detection Rate 0.3143
## Detection Prevalence 0.3238
## Balanced Accuracy 0.9643
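The step list at the top also calls for predicting the Outcome Variable in the Test data; a sketch following exactly the same pattern (output not shown here):
# score the held-out test set with the tuned model
vTst <- predict(knnModel, newdata=dfTst)
confusionMatrix(vTst, dfTst$Species)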
Make Final Knn Model
dfFinal <- select(df, -Id)
set.seed(707)
knnCntrl <- trainControl(method="repeatedcv", number=10, repeats=10)
knnFinal <- train(Species~.,
                  data=dfFinal,
                  method="knn",
                  trControl=knnCntrl,
                  preProcess=c("center","scale"),
                  tuneLength=20)
print(knnFinal)
## k-Nearest Neighbors
##
## 150 samples
## 4 predictor
## 3 classes: 'Iris-setosa', 'Iris-versicolor', 'Iris-virginica'
##
## Pre-processing: centered (4), scaled (4)
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 135, 135, 135, 135, 135, 135, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.9466667 0.920
## 7 0.9560000 0.934
## 9 0.9520000 0.928
## 11 0.9553333 0.933
## 13 0.9626667 0.944
## 15 0.9640000 0.946
## 17 0.9546667 0.932
## 19 0.9493333 0.924
## 21 0.9473333 0.921
## 23 0.9460000 0.919
## 25 0.9500000 0.925
## 27 0.9460000 0.919
## 29 0.9406667 0.911
## 31 0.9380000 0.907
## 33 0.9293333 0.894
## 35 0.9126667 0.869
## 37 0.8953333 0.843
## 39 0.8900000 0.835
## 41 0.8900000 0.835
## 43 0.8833333 0.825
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 15.
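To reuse knnFinal outside this session, base R serialization works; a sketch (the .rds filename is illustrative):
saveRDS(knnFinal, "knnFinal.rds")      # persist the fitted model
#knnFinal <- readRDS("knnFinal.rds")   # reload later for scoring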
Load Prd Dataset
dfPrd <- read.csv("C:/Users/PC/Downloads/iris.prd", header=T, stringsAsFactors=T)
dfPrd <- select(dfPrd, -Id)
head(dfPrd)
## SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species
## 1 5.0 3.4 1.6 0.4 Iris-setosa
## 2 5.0 2.0 3.5 1.0 Iris-versicolor
## 3 6.3 2.5 5.0 1.9 Iris-virginica
Predict All Records In Prd Data
vPrd <- predict(knnFinal, newdata=dfPrd)
head(vPrd)
## [1] Iris-setosa Iris-versicolor Iris-virginica
## Levels: Iris-setosa Iris-versicolor Iris-virginica
Confusion Matrix Prd Data
confusionMatrix(vPrd, dfPrd$Species)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Iris-setosa Iris-versicolor Iris-virginica
## Iris-setosa 1 0 0
## Iris-versicolor 0 1 0
## Iris-virginica 0 0 1
##
## Overall Statistics
##
## Accuracy : 1
## 95% CI : (0.2924, 1)
## No Information Rate : 0.3333
## P-Value [Acc > NIR] : 0.03704
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Iris-setosa Class: Iris-versicolor
## Sensitivity 1.0000 1.0000
## Specificity 1.0000 1.0000
## Pos Pred Value 1.0000 1.0000
## Neg Pred Value 1.0000 1.0000
## Prevalence 0.3333 0.3333
## Detection Rate 0.3333 0.3333
## Detection Prevalence 0.3333 0.3333
## Balanced Accuracy 1.0000 1.0000
## Class: Iris-virginica
## Sensitivity 1.0000
## Specificity 1.0000
## Pos Pred Value 1.0000
## Neg Pred Value 1.0000
## Prevalence 0.3333
## Detection Rate 0.3333
## Detection Prevalence 0.3333
## Balanced Accuracy 1.0000