Introduction
Knn Classification
Dataset
iris.csv
Problem Definition
Create a Knn model with the given dataset and predict the Outcome Attribute (Species).
Steps
The steps to create the model are:
* EDA
* Clean 0s, NAs, Outliers
* Correlation
* VDA
* Find value of K
* Split Dataset into Train / Test
* Create Knn Model with Train
* Confirm value of K
* Check Accuracy
* Predict OutcomeVariable in Test using Knn Model
* Confusion Matrix (PredictedOutcomeVariable v/s ActualOutcomeVariable)
* Create Knn Final Model with Full Dataset
* Predict OutcomeVariable Using Knn Final Model
Data Location
The Iris dataset was used in R.A. Fisher’s classic 1936 paper. The dataset can be found at
https://www.kaggle.com/uciml/iris
Data Description
The Iris dataset contains the following columns:
Id
SepalLengthCm
SepalWidthCm
PetalLengthCm
PetalWidthCm
Species (Outcome Variable)
knitr Global Options
# for development
knitr::opts_chunk$set(echo=TRUE, eval=TRUE, error=TRUE, warning=FALSE, message=FALSE, cache=FALSE, tidy=FALSE, fig.path='figures/')
# for production
#knitr::opts_chunk$set(echo=TRUE, eval=TRUE, error=FALSE, warning=FALSE, message=FALSE, cache=FALSE, tidy=FALSE, fig.path='figures/')
Load Libs
library(tidyr)
library(dplyr)
library(ggplot2)
#install.packages("corrgram")
#install.packages("fpc")
library(corrgram)
library(gridExtra)
#install.packages("caret")
#install.packages('TMB', type = 'source')
#install.packages("sjPlot")
library(caret)
library(sjPlot)
#install.packages("doMC")
# Parallel Processing
# The caret package supports parallel processing in order to decrease the compute
# time for a given experiment. It is supported automatically as long as it is
# configured.
# In this example we can load the doMC package and set the number of cores to 4,
# making available 4 worker threads to caret when tuning the model. This is used
# for the loops for the repeats of cross validation for each parameter combination.
# configure multicore
#library(doMC)
#registerDoMC(cores=4)
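# Note (sketch): doMC is not available on Windows; the doParallel package is a
# drop-in alternative, assuming it is installed.
#library(doParallel)
#cl <- makePSOCKcluster(4)   # start 4 worker processes
#registerDoParallel(cl)      # caret uses any registered foreach backend
#stopCluster(cl)             # release the workers when done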
Functions
# count NAs in a vector
detect_na <- function(inp) {
  sum(is.na(inp))
}

# return values lying beyond 1.5 * IQR of the quartiles (Tukey fences)
detect_outliers <- function(inp, na.rm=TRUE) {
  i.qnt <- quantile(inp, probs=c(.25, .75), na.rm=na.rm)
  i.max <- 1.5 * IQR(inp, na.rm=na.rm)
  otp <- inp
  otp[inp < (i.qnt[1] - i.max)] <- NA
  otp[inp > (i.qnt[2] + i.max)] <- NA
  inp[is.na(otp)]
}
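A quick sanity check of the helpers on a toy vector (illustrative values only):
x <- c(1, 2, 3, 4, 100)   # one extreme value
detect_na(x)              # 0 -- no NAs
detect_outliers(x)        # 100 -- beyond the upper 1.5*IQR fence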
Load Dataset
df <- read.csv("C:/Users/PC/Downloads/iris.csv", header=T, stringsAsFactors=T)
head(df)
## Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species
## 1 1 5.1 3.5 1.4 0.2 Iris-setosa
## 2 2 4.9 3.0 1.4 0.2 Iris-setosa
## 3 3 4.7 3.2 1.3 0.2 Iris-setosa
## 4 4 4.6 3.1 1.5 0.2 Iris-setosa
## 5 5 5.0 3.6 1.4 0.2 Iris-setosa
## 6 6 5.4 3.9 1.7 0.4 Iris-setosa
Dataframe Structure
str(df)
## 'data.frame': 150 obs. of 6 variables:
## $ Id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ SepalLengthCm: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ SepalWidthCm : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ PetalLengthCm: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ PetalWidthCm : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "Iris-setosa",..: 1 1 1 1 1 1 1 1 1 1 ...
Dataframe Summary
lapply(df, FUN=summary)
## $Id
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 38.25 75.50 75.50 112.75 150.00
##
## $SepalLengthCm
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.300 5.100 5.800 5.843 6.400 7.900
##
## $SepalWidthCm
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 2.800 3.000 3.054 3.300 4.400
##
## $PetalLengthCm
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.600 4.350 3.759 5.100 6.900
##
## $PetalWidthCm
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.100 0.300 1.300 1.199 1.800 2.500
##
## $Species
## Iris-setosa Iris-versicolor Iris-virginica
## 50 50 50
Missing Data
lapply(df, FUN=detect_na)
## $Id
## [1] 0
##
## $SepalLengthCm
## [1] 0
##
## $SepalWidthCm
## [1] 0
##
## $PetalLengthCm
## [1] 0
##
## $PetalWidthCm
## [1] 0
##
## $Species
## [1] 0
Observation
No missing values
Predictor Data - Only Predictor Cols
dfT <- select(df, -Id, -Species)
head(dfT)
## SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm
## 1 5.1 3.5 1.4 0.2
## 2 4.9 3.0 1.4 0.2
## 3 4.7 3.2 1.3 0.2
## 4 4.6 3.1 1.5 0.2
## 5 5.0 3.6 1.4 0.2
## 6 5.4 3.9 1.7 0.4
Detect Outliers
lapply(dfT, FUN=detect_outliers)
## $SepalLengthCm
## numeric(0)
##
## $SepalWidthCm
## [1] 4.4 4.1 4.2 2.0
##
## $PetalLengthCm
## numeric(0)
##
## $PetalWidthCm
## numeric(0)
Observation
Outliers are present in SepalWidthCm, but the count is low (4), so for this model we will keep them.
# boxplot of all columns
pltBoxPlot <- ggplot(stack(dfT), aes(x=ind, y=values)) +
  geom_boxplot(fill=rainbow(ncol(dfT)), color=rainbow(ncol(dfT))) +
  labs(title="Iris Data - Outliers") +
  labs(x="Features") +
  labs(y="Value")
print(pltBoxPlot)
Observation
The boxplot confirms the four SepalWidthCm outliers found above; the other features show none.
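If we did want to treat them, one simple option (a sketch only, not applied in this analysis) is to cap values at the same 1.5 * IQR fences used by detect_outliers:
# Sketch: winsorize, i.e. cap values at the Tukey fences instead of dropping them
cap_outliers <- function(inp, na.rm=TRUE) {
  i.qnt <- quantile(inp, probs=c(.25, .75), na.rm=na.rm)
  i.max <- 1.5 * IQR(inp, na.rm=na.rm)
  pmin(pmax(inp, i.qnt[1] - i.max), i.qnt[2] + i.max)
}
#dfT$SepalWidthCm <- cap_outliers(dfT$SepalWidthCm)   # not run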
Correlation
#https://www.statmethods.net/stats/correlations.html
cor(dfT)
## SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm
## SepalLengthCm 1.0000000 -0.1093692 0.8717542 0.8179536
## SepalWidthCm -0.1093692 1.0000000 -0.4205161 -0.3565441
## PetalLengthCm 0.8717542 -0.4205161 1.0000000 0.9627571
## PetalWidthCm 0.8179536 -0.3565441 0.9627571 1.0000000
Visualize Correlation
# http://www.statmethods.net/advgraphs/correlograms.html
corrgram(dfT)
Scatter Plots
plot(dfT)
Scatter Plot - Petals
ggplot(df, aes(x=PetalLengthCm, y=PetalWidthCm)) +
  geom_point(size=0.9, colour="blue", fill="blue") +
  labs(title="Scatter Plot - Petals") +
  labs(x="PetalLengthCm") +
  labs(y="PetalWidthCm")
Scatter Plot - Sepals
ggplot(df, aes(x=SepalLengthCm, y=SepalWidthCm)) +
  geom_point(size=0.9, colour="blue", fill="blue") +
  labs(title="Scatter Plot - Sepals") +
  labs(x="SepalLengthCm") +
  labs(y="SepalWidthCm")
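Knn relies on the classes occupying distinct regions of feature space; colouring the petal scatter by Species (a small sketch extending the plot above) makes that separability visible:
ggplot(df, aes(x=PetalLengthCm, y=PetalWidthCm, colour=Species)) +
  geom_point(size=0.9) +
  labs(title="Scatter Plot - Petals by Species") +
  labs(x="PetalLengthCm") +
  labs(y="PetalWidthCm")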
Note:
As an exploratory aside, we calculate the "Within-Sum-Of-Squares (WSS)" iteratively.
WSS is a measure of the homogeneity within a cluster.
The elbow function plots WSS against the number of clusters.
wss Function
sjc.elbow(dfT)
Note:
We have plotted WSS against the number of clusters. There is not much decrease in WSS once the number of clusters goes beyond 8 - 10.
This graph is known as the "Elbow Curve", and the bending point (nc = 8 - 10 in our case) is the "Elbow Point": with 8 - 10 clusters we would get good homogeneity within clusters.
Note, however, that this elbow heuristic chooses the number of clusters for clustering algorithms such as kmeans; it does not choose the K of a Knn classifier. The K for our Knn model is selected below by cross-validated accuracy.
Dataset Split
set.seed(707)
idTrn <- createDataPartition(y=df$Species, p=0.7, list=FALSE)
dfTrn <- df[idTrn,]
dfTrn <- select(dfTrn, -Id)
dfTst <- df[-idTrn,]
dfTst <- select(dfTst, -Id)
Training Dataset RowCount & ColCount
dim(dfTrn)
## [1] 105 5
Training Dataset Head
head(dfTrn)
## SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species
## 3 4.7 3.2 1.3 0.2 Iris-setosa
## 5 5.0 3.6 1.4 0.2 Iris-setosa
## 6 5.4 3.9 1.7 0.4 Iris-setosa
## 9 4.4 2.9 1.4 0.2 Iris-setosa
## 11 5.4 3.7 1.5 0.2 Iris-setosa
## 13 4.8 3.0 1.4 0.1 Iris-setosa
Training Dataset Summary
lapply(dfTrn, FUN=summary)
## $SepalLengthCm
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.30 5.10 5.80 5.85 6.40 7.90
##
## $SepalWidthCm
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 2.800 3.000 3.072 3.300 4.400
##
## $PetalLengthCm
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.600 4.400 3.771 5.100 6.900
##
## $PetalWidthCm
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.100 0.300 1.300 1.214 1.900 2.500
##
## $Species
## Iris-setosa Iris-versicolor Iris-virginica
## 35 35 35
Testing Dataset RowCount & ColCount
dim(dfTst)
## [1] 45 5
Testing Dataset Head
head(dfTst)
## SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species
## 1 5.1 3.5 1.4 0.2 Iris-setosa
## 2 4.9 3.0 1.4 0.2 Iris-setosa
## 4 4.6 3.1 1.5 0.2 Iris-setosa
## 7 4.6 3.4 1.4 0.3 Iris-setosa
## 8 5.0 3.4 1.5 0.2 Iris-setosa
## 10 4.9 3.1 1.5 0.1 Iris-setosa
Testing Dataset Summary
lapply(dfTst, FUN=summary)
## $SepalLengthCm
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.500 5.100 5.700 5.827 6.400 7.400
##
## $SepalWidthCm
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.300 2.800 3.000 3.011 3.300 3.800
##
## $PetalLengthCm
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.300 1.600 4.200 3.729 5.100 6.100
##
## $PetalWidthCm
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.100 0.300 1.300 1.162 1.800 2.400
##
## $Species
## Iris-setosa Iris-versicolor Iris-virginica
## 15 15 15
**TRAINING CLASSIFIER - NOTES**
##https://topepo.github.io/caret/index.html
##https://www.machinelearningplus.com/machine-learning/caret-package/
The caret package provides the train() function for fitting many different algorithms; we just pass different parameter values for each algorithm.
Before calling train(), we first call trainControl(), which controls the computational nuances of train().
##https://www.rdocumentation.org/packages/caret/versions/6.0-82/topics/trainControl
In trainControl(), we set three parameters:
* The "method" parameter selects the resampling method. It can take many values, such as "boot", "boot632", "cv", "repeatedcv", "LOOCV", "LGOCV", etc. https://stats.stackexchange.com/questions/17602/caret-re-sampling-methods
http://appliedpredictivemodeling.com/blog/2014/11/27/vpuig01pqbklmi72b8lcl3ij5hj2qm
For this exercise we will use "repeatedcv", i.e. repeated cross-validation, which is a sensible default for most classification models.
* The "number" parameter sets the number of cross-validation folds.
* The "repeats" parameter sets how many times the cross-validation is repeated.
**We will use number=10 and repeats=10.**
trainControl() returns a list, which we then pass to train().
##https://www.rdocumentation.org/packages/caret/versions/4.47/topics/train
Before training the classifier, call set.seed() so that the random fold assignments in cross-validation are reproducible (Knn itself has no random initialization; the randomness comes from the resampling).
To train the Knn classifier, pass train() the following parameters:
* Outcome~. is a formula: it uses all other attributes as predictors, with "Outcome" as the target variable (Species~. in our case).
* data=data-frame-for-creating-model
* "method" parameter set to "knn".
* "trControl" parameter set to the result of our trainControl() call.
##https://www.rdocumentation.org/packages/caret/versions/6.0-83/topics/preProcess
##https://www.machinelearningplus.com/machine-learning/caret-package/#35howtopreprocesstotransformthedata
* preProcess - used to transform or impute the predictors before training (see the preProcess sketch after this list)
##https://topepo.github.io/caret/model-training-and-tuning.html#preproc
* tuneLength=20
There are two main ways to do hyperparameter tuning with train():
- Set the tuneLength
- Define and set the tuneGrid
tuneLength is the number of unique values caret will consider for each tuning parameter when forming the hyperparameter combinations; caret determines the candidate values automatically.
Alternatively, to explicitly control which values are considered for each parameter, define a tuneGrid and pass it to train(), as sketched below.
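Two minimal sketches of the points above, assuming the dfTrn split created earlier and the knnCntrl object defined in the next section; ppIris, dfTrnPP, knnGrid and knnModelGrid are illustrative names, and the grid of odd k values is just one reasonable choice (odd k avoids voting ties):
# Sketch 1: what preProcess=c("center","scale") does, stand-alone
ppIris <- preProcess(dfTrn[, 1:4], method=c("center", "scale"))
dfTrnPP <- predict(ppIris, dfTrn[, 1:4])   # every predictor now has mean 0, sd 1
# Sketch 2: explicit control of candidate k values with tuneGrid instead of tuneLength
knnGrid <- expand.grid(k=seq(3, 43, by=2))
knnModelGrid <- train(Species~., data=dfTrn,
                      method="knn",
                      trControl=knnCntrl,
                      preProcess=c("center","scale"),
                      tuneGrid=knnGrid)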
Make KNN Model
set.seed(707)
knnCntrl <- trainControl(method="repeatedcv", number=10, repeats=10)
knnModel <- train(Species~.,
                  data=dfTrn,
                  method="knn",
                  trControl=knnCntrl,              # the cross-validation scheme from trainControl()
                  preProcess=c("center","scale"),  # standardize: subtract each column's mean, divide by its sd
                  tuneLength=20)                   # evaluate 20 candidate values of k (tuneGrid is the explicit alternative)
print(knnModel)
## k-Nearest Neighbors
##
## 105 samples
## 4 predictor
## 3 classes: 'Iris-setosa', 'Iris-versicolor', 'Iris-virginica'
##
## Pre-processing: centered (4), scaled (4)
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 96, 95, 95, 94, 93, 94, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.9400909 0.9093837
## 7 0.9583434 0.9370746
## 9 0.9615707 0.9418685
## 11 0.9530455 0.9290628
## 13 0.9365404 0.9042755
## 15 0.9355960 0.9029681
## 17 0.9376717 0.9064070
## 19 0.9320202 0.8978608
## 21 0.9226768 0.8837904
## 23 0.9063687 0.8592828
## 25 0.8967323 0.8447674
## 27 0.8870000 0.8301669
## 29 0.8808283 0.8206314
## 31 0.8828889 0.8237449
## 33 0.8823535 0.8228818
## 35 0.8731818 0.8091399
## 37 0.8652778 0.7972660
## 39 0.8589798 0.7877060
## 41 0.8660556 0.7986333
## 43 0.8596667 0.7889175
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 9.
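The selected k can also be read directly off the fitted object (standard caret accessors):
knnModel$bestTune                                   # data frame holding the winning k (k = 9 here)
subset(knnModel$results, k == knnModel$bestTune$k)  # its cross-validated Accuracy and Kappa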
Plot Knn Model
# plot the effect of parameters on accuracy
plot(knnModel)
Predict All Records In TrainData
vTrn <- predict(knnModel, newdata=dfTrn)
head(vTrn)
## [1] Iris-setosa Iris-setosa Iris-setosa Iris-setosa Iris-setosa Iris-setosa
## Levels: Iris-setosa Iris-versicolor Iris-virginica
Confusion Matrix
confusionMatrix(vTrn, dfTrn$Species)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Iris-setosa Iris-versicolor Iris-virginica
## Iris-setosa 35 0 0
## Iris-versicolor 0 34 2
## Iris-virginica 0 1 33
##
## Overall Statistics
##
## Accuracy : 0.9714
## 95% CI : (0.9188, 0.9941)
## No Information Rate : 0.3333
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9571
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Iris-setosa Class: Iris-versicolor
## Sensitivity 1.0000 0.9714
## Specificity 1.0000 0.9714
## Pos Pred Value 1.0000 0.9444
## Neg Pred Value 1.0000 0.9855
## Prevalence 0.3333 0.3333
## Detection Rate 0.3333 0.3238
## Detection Prevalence 0.3333 0.3429
## Balanced Accuracy 1.0000 0.9714
## Class: Iris-virginica
## Sensitivity 0.9429
## Specificity 0.9857
## Pos Pred Value 0.9706
## Neg Pred Value 0.9718
## Prevalence 0.3333
## Detection Rate 0.3143
## Detection Prevalence 0.3238
## Balanced Accuracy 0.9643
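The step list at the top also calls for predicting the Outcome Variable in the Test data; a sketch following exactly the same pattern (output not shown here):
# score the held-out test set with the tuned model
vTst <- predict(knnModel, newdata=dfTst)
confusionMatrix(vTst, dfTst$Species)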
Make Final Knn Model
dfFinal <- select(df, -Id)
set.seed(707)
knnCntrl <- trainControl(method="repeatedcv", number=10, repeats=10)
knnFinal <- train(Species~.,
                  data=dfFinal,
                  method="knn",
                  trControl=knnCntrl,
                  preProcess=c("center","scale"),
                  tuneLength=20)
print(knnFinal)
## k-Nearest Neighbors
##
## 150 samples
## 4 predictor
## 3 classes: 'Iris-setosa', 'Iris-versicolor', 'Iris-virginica'
##
## Pre-processing: centered (4), scaled (4)
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 135, 135, 135, 135, 135, 135, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.9466667 0.920
## 7 0.9560000 0.934
## 9 0.9520000 0.928
## 11 0.9553333 0.933
## 13 0.9626667 0.944
## 15 0.9640000 0.946
## 17 0.9546667 0.932
## 19 0.9493333 0.924
## 21 0.9473333 0.921
## 23 0.9460000 0.919
## 25 0.9500000 0.925
## 27 0.9460000 0.919
## 29 0.9406667 0.911
## 31 0.9380000 0.907
## 33 0.9293333 0.894
## 35 0.9126667 0.869
## 37 0.8953333 0.843
## 39 0.8900000 0.835
## 41 0.8900000 0.835
## 43 0.8833333 0.825
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 15.
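To reuse knnFinal outside this session, base R serialization works; a sketch (the .rds filename is illustrative):
saveRDS(knnFinal, "knnFinal.rds")      # persist the fitted model
#knnFinal <- readRDS("knnFinal.rds")   # reload later for scoring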
Load Prd Dataset
dfPrd <- read.csv("C:/Users/PC/Downloads/iris.prd", header=T, stringsAsFactors=T)
dfPrd <- select(dfPrd, -Id)
head(dfPrd)
## SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species
## 1 5.0 3.4 1.6 0.4 Iris-setosa
## 2 5.0 2.0 3.5 1.0 Iris-versicolor
## 3 6.3 2.5 5.0 1.9 Iris-virginica
Predict All Records In Prd Data
vPrd <- predict(knnFinal, newdata=dfPrd)
head(vPrd)
## [1] Iris-setosa Iris-versicolor Iris-virginica
## Levels: Iris-setosa Iris-versicolor Iris-virginica
Confusion Matrix Prd Data
confusionMatrix(vPrd, dfPrd$Species)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Iris-setosa Iris-versicolor Iris-virginica
## Iris-setosa 1 0 0
## Iris-versicolor 0 1 0
## Iris-virginica 0 0 1
##
## Overall Statistics
##
## Accuracy : 1
## 95% CI : (0.2924, 1)
## No Information Rate : 0.3333
## P-Value [Acc > NIR] : 0.03704
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Iris-setosa Class: Iris-versicolor
## Sensitivity 1.0000 1.0000
## Specificity 1.0000 1.0000
## Pos Pred Value 1.0000 1.0000
## Neg Pred Value 1.0000 1.0000
## Prevalence 0.3333 0.3333
## Detection Rate 0.3333 0.3333
## Detection Prevalence 0.3333 0.3333
## Balanced Accuracy 1.0000 1.0000
## Class: Iris-virginica
## Sensitivity 1.0000
## Specificity 1.0000
## Pos Pred Value 1.0000
## Neg Pred Value 1.0000
## Prevalence 0.3333
## Detection Rate 0.3333
## Detection Prevalence 0.3333
## Balanced Accuracy 1.0000