This notebook describes the following:
- Evaluating the model for each class label with a confusion matrix
- Summarizing model performance with measures such as False Negative (FN) and Specificity
- Optimizing K
Packages:
Dataset: the iris dataset
library(class)   # knn()
library(caret)   # confusionMatrix()
library(ggplot2) # accuracy plot
dataset = iris
# inspect the structure of the dataset (str() prints; its return value is NULL)
str(dataset)
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
# define the proportion used to split the data into training and test sets
Train_data_prop = 0.7
# draw the training row indices; sample without replacement so the training
# rows are unique and the remaining rows form the test set
Train_sample = sample(1:nrow(iris), Train_data_prop*nrow(iris), replace = FALSE)
Train_data = iris[Train_sample,1:4]
Train_class = iris[Train_sample,5]
# the remaining rows form the test set
Test_data = iris[-Train_sample,1:4]
Test_class = iris[-Train_sample,5]
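With the without-replacement split above, the two index sets partition the 150 rows; a quick sanity check worth running after any split:
# train and test row counts should sum to nrow(iris)
nrow(Train_data) # 105 rows (70%)
nrow(Test_data)  # 45 rows (30%)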
#Classify the test set with KNN (class::knn has no separate training step)
Kneighbors = 3
Model = knn(
train = Train_data, # Training Predictor variables or features
test = Test_data, # Test Predictor variables or features
cl = Train_class, # Training data labels
k = Kneighbors # Defined k
)
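Note that class::knn returns the predicted test labels directly as a factor rather than a reusable model object, so no separate predict() call is needed:
# Model is a factor of predicted labels, one per test row
head(Model)
length(Model) == nrow(Test_data) # TRUE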
table(data.frame("Actual" = Test_class, "Predicted" = Model))
Predicted
Actual setosa versicolor virginica
setosa 29 0 0
versicolor 0 24 1
virginica 0 3 19
options(scipen = 999) # print small p-values in fixed rather than scientific notation
ConfusionMatrix = confusionMatrix(
data = Model, # predicted labels
reference = Test_class # factor of true class labels
)
ConfusionMatrix
Confusion Matrix and Statistics
Reference
Prediction setosa versicolor virginica
setosa 29 0 0
versicolor 0 24 3
virginica 0 1 19
Overall Statistics
Accuracy : 0.9474
95% CI : (0.8707, 0.9855)
No Information Rate : 0.3816
P-Value [Acc > NIR] : < 0.00000000000000022
Kappa : 0.9204
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: setosa Class: versicolor Class: virginica
Sensitivity 1.0000 0.9600 0.8636
Specificity 1.0000 0.9412 0.9815
Pos Pred Value 1.0000 0.8889 0.9500
Neg Pred Value 1.0000 0.9796 0.9464
Prevalence 0.3816 0.3289 0.2895
Detection Rate 0.3816 0.3158 0.2500
Detection Prevalence 0.3816 0.3553 0.2632
Balanced Accuracy 1.0000 0.9506 0.9226
# in a multi-class table, the diagonal holds the correctly classified counts
CorrectlyClassified = sum(diag(as.matrix(ConfusionMatrix$table)))
TotalObservation = sum(as.matrix(ConfusionMatrix$table))
Misclassified = TotalObservation - CorrectlyClassified
Accuracy = CorrectlyClassified/TotalObservation
Definition: Accuracy measures the classifier's ability to select all cases that should be selected and reject all cases that should be rejected.
Accuracy = 0.9473684
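This matches the confusion matrix above: (29 + 24 + 19)/76 = 72/76 ≈ 0.9474. The outline also names False Negative (FN) and Specificity; a minimal sketch of how to derive them by hand from the caret table (rows are predictions, columns are the reference labels) is below. The same per-class values are also available pre-computed in ConfusionMatrix$byClass.
# per-class error counts and Specificity from the confusion matrix
cm = as.matrix(ConfusionMatrix$table)
TP = diag(cm)                  # correct predictions for each class
FN = colSums(cm) - TP          # actual members of the class that were missed
FP = rowSums(cm) - TP          # other classes wrongly assigned to the class
TN = sum(cm) - TP - FN - FP    # cases correctly left out of the class
Specificity = TN/(TN + FP)     # ability to reject cases that should be rejected
rbind(FN, Specificity)
For versicolor this reproduces the 0.9412 reported above: TN = 48, FP = 3, and 48/51 ≈ 0.9412.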
# search K over a range of values and record the test-set accuracy for each
Kmin = 1
Kmax = length(Test_class) # upper bound for the search; K may not exceed nrow(Train_data)
DataFrame = data.frame(K = Kmin:Kmax, Accuracy = 0)
for (i in Kmin:Kmax)
{
knn.temp = knn(
train = Train_data,
test = Test_data,
cl = Train_class,
k = i
)
ConfusionMatrix.temp = confusionMatrix(data = knn.temp, reference = Test_class)
DataFrame[i, "Accuracy"] = ConfusionMatrix.temp$overall[["Accuracy"]]
}
Accuracy.plot = ggplot(data = DataFrame, aes(x=K,y=Accuracy)) + geom_line()
Accuracy.plot
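To read the best K off the search results rather than the plot, one option (which.max returns the first maximum, so ties break toward the smallest K):
# K value with the highest test-set accuracy
BestK = DataFrame$K[which.max(DataFrame$Accuracy)]
BestK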