In this post I will demonstrate how to use kNN with the caret package in R. caret is a really great R package: at the time of writing it already gives you direct access to 150 machine learning algorithms. caret also provides functions for everything from example data and splitting it into training and testing sets to preprocessing, model evaluation, and more.
library(caret)
Loading required package: lattice
Loading required package: ggplot2
summary(iris)
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100   setosa    :50
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300   versicolor:50
 Median :5.800   Median :3.000   Median :4.350   Median :1.300   virginica :50
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500
set.seed(300)
# Split the data into training and test sets with caret's createDataPartition().
# p is the proportion of the data that goes to training.
indxTrain <- createDataPartition(y = iris$Species, p = 0.75, list = FALSE)
training <- iris[indxTrain,]
testing <- iris[-indxTrain,]
# Check the class distribution in the partitioned data and in the original data
prop.table(table(training$Species)) * 100
prop.table(table(testing$Species)) * 100
setosa versicolor virginica
33.33333 33.33333 33.33333
prop.table(table(iris$Species)) * 100
setosa versicolor virginica
33.33333 33.33333 33.33333
Note: createDataPartition() produces a stratified sample with a single call, so the class proportions of the original data are preserved. We don't need to write a complex sampling function ourselves!
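To make the stratification concrete, here is a minimal base-R sketch of the same idea: sample 75% of the rows within each species separately. The names idx_by_class, train_manual and test_manual are made up for this illustration, and caret's own implementation may differ in its details, so the selected rows will not match indxTrain.

```r
set.seed(300)
p <- 0.75  # proportion of each class that goes to training

# Group the row numbers by species, then sample within each group
idx_by_class <- split(seq_len(nrow(iris)), iris$Species)
train_idx <- unlist(
  lapply(idx_by_class, function(idx) sample(idx, size = ceiling(p * length(idx)))),
  use.names = FALSE
)

train_manual <- iris[train_idx, ]
test_manual  <- iris[-train_idx, ]

# Every species contributes the same number of rows to each set
table(train_manual$Species)
table(test_manual$Species)
```

Because the sampling happens per class, both sets keep the one-third share of each species that the full dataset has.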
kNN is distance-based, so the variables need to be normalized or scaled; otherwise the predictors with the largest numeric range dominate the distance. caret provides facilities to preprocess data. Here I choose centering and scaling.
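# Predictors only (drop the outcome column)
trainX <- training[, names(training) != "Species"]
# preProcValues is only inspected here; train() below applies the same
# steps through its own preProcess argument
preProcValues <- preProcess(x = trainX, method = c("center", "scale"))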
preProcValues
Created from 114 samples and 4 variables
Pre-processing:
- centered (4)
- ignored (0)
- scaled (4)
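Centering and scaling boil down to subtracting each column's training mean and dividing by its training standard deviation, then reusing those same numbers on any new data. The sketch below does this by hand in base R on a simple random split (an assumption for the example, not the caret split above); train_rows, mu and sigma are illustrative names.

```r
set.seed(300)

# Hypothetical split for illustration: 114 random rows for training
train_rows <- sample(nrow(iris), 114)
train_x <- iris[train_rows, 1:4]
test_x  <- iris[-train_rows, 1:4]

# Statistics are estimated on the training predictors only
mu    <- colMeans(train_x)
sigma <- apply(train_x, 2, sd)

# The same mu and sigma are applied to both sets
train_scaled <- scale(train_x, center = mu, scale = sigma)
test_scaled  <- scale(test_x,  center = mu, scale = sigma)

round(colMeans(train_scaled), 10)  # training columns now have mean 0
apply(train_scaled, 2, sd)         # and standard deviation 1
```

The test columns will generally not have mean exactly 0 or standard deviation exactly 1, which is expected: they are transformed with the training statistics so that no information from the test set leaks into the model. With caret, calling predict() on the preProcValues object with new data applies the stored transformation in the same spirit.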
set.seed(400)
# 10-fold cross-validation, repeated 3 times
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
# tuneLength = 20 evaluates 20 candidate values of k
knnFit <- train(Species ~ ., data = training, method = "knn",
                trControl = ctrl, preProcess = c("center", "scale"),
                tuneLength = 20)
# Output of the kNN fit
knnFit
k-Nearest Neighbors
114 samples
4 predictor
3 classes: 'setosa', 'versicolor', 'virginica'
Pre-processing: centered (4), scaled (4)
Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 103, 102, 102, 103, 104, 103, ...
Resampling results across tuning parameters:
k Accuracy Kappa
5 0.9795455 0.9692784
7 0.9765152 0.9648097
9 0.9732323 0.9599186
11 0.9702020 0.9553353
13 0.9616162 0.9425332
15 0.9618687 0.9429499
17 0.9585859 0.9380065
19 0.9403030 0.9102568
21 0.9347475 0.9019235
23 0.9256566 0.8881763
25 0.9223232 0.8826633
27 0.9228788 0.8839530
29 0.9145455 0.8714530
31 0.9173232 0.8756197
33 0.9142929 0.8712033
35 0.8960101 0.8432286
37 0.8874242 0.8302540
39 0.8874242 0.8302540
41 0.8818687 0.8219206
43 0.8788384 0.8172807
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was k = 5.
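For intuition about what the resampling table reports, the sketch below runs a single round of 10-fold cross-validation by hand for three values of k, using knn() from the class package (assumed to be available; it is normally installed alongside R). It is only an approximation of what train() does: it uses all 150 rows, one repeat instead of three, and its own unstratified fold assignment, so the numbers will not match the table above. Names such as folds and cv_accuracy are illustrative.

```r
library(class)  # provides knn()

set.seed(400)
x <- as.matrix(iris[, 1:4])
y <- iris$Species
n_folds  <- 10
k_values <- c(5, 7, 9)

# Randomly assign every row to one of the folds
folds <- sample(rep(seq_len(n_folds), length.out = nrow(x)))

cv_accuracy <- sapply(k_values, function(k) {
  fold_acc <- sapply(seq_len(n_folds), function(f) {
    held_out <- folds == f
    # Center and scale inside the fold, using the training part only
    mu    <- colMeans(x[!held_out, ])
    sigma <- apply(x[!held_out, ], 2, sd)
    train_s <- scale(x[!held_out, ], center = mu, scale = sigma)
    test_s  <- scale(x[held_out, ],  center = mu, scale = sigma)
    pred <- knn(train = train_s, test = test_s, cl = y[!held_out], k = k)
    mean(pred == y[held_out])  # accuracy on the held-out fold
  })
  mean(fold_acc)  # average accuracy over the folds
})
names(cv_accuracy) <- k_values
cv_accuracy
```

Picking the k with the largest average accuracy mirrors the selection rule stated in the output ("Accuracy was used to select the optimal model using the largest value").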
# Plot the number of neighbours against accuracy (based on repeated cross-validation)
plot(knnFit)
knnPredict <- predict(knnFit,newdata = testing )
# Build the confusion matrix to see the accuracy and the other evaluation metrics
confusionMatrix(knnPredict, testing$Species )
Confusion Matrix and Statistics
            Reference
Prediction   setosa versicolor virginica
  setosa         12          0         0
  versicolor      0         11         4
  virginica       0          1         8
Overall Statistics
Accuracy : 0.8611
95% CI : (0.705, 0.9533)
No Information Rate : 0.3333
P-Value [Acc > NIR] : 8.705e-11
Kappa : 0.7917
Mcnemar's Test P-Value : NA
Statistics by Class:
                     Class: setosa Class: versicolor Class: virginica
Sensitivity                 1.0000            0.9167           0.6667
Specificity                 1.0000            0.8333           0.9583
Pos Pred Value              1.0000            0.7333           0.8889
Neg Pred Value              1.0000            0.9524           0.8519
Prevalence                  0.3333            0.3333           0.3333
Detection Rate              0.3333            0.3056           0.2222
Detection Prevalence        0.3333            0.4167           0.2500
Balanced Accuracy           1.0000            0.8750           0.8125
mean(knnPredict == testing$Species)
[1] 0.8611111
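The headline numbers in the confusion matrix output can be reproduced by hand, which helps when reading the per-class statistics. The sketch below rebuilds the 3 x 3 table from the counts printed above (rows are predictions, columns are the reference classes) and derives accuracy, kappa, sensitivity and specificity in base R. Variable names such as cm, tp and cohen_kappa are illustrative.

```r
classes <- c("setosa", "versicolor", "virginica")

# Counts copied from the confusion matrix above
cm <- matrix(c(12,  0, 0,
                0, 11, 4,
                0,  1, 8),
             nrow = 3, byrow = TRUE,
             dimnames = list(Prediction = classes, Reference = classes))

n  <- sum(cm)
tp <- diag(cm)            # correctly predicted, per class
fp <- rowSums(cm) - tp    # predicted as the class but actually another
fn <- colSums(cm) - tp    # actually the class but predicted as another
tn <- n - tp - fp - fn

accuracy    <- sum(tp) / n
sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)

# Cohen's kappa: accuracy corrected for agreement expected by chance
expected    <- sum(rowSums(cm) * colSums(cm)) / n^2
cohen_kappa <- (accuracy - expected) / (1 - expected)

round(accuracy, 4)
round(cohen_kappa, 4)
round(rbind(sensitivity, specificity), 4)
```

Read this way, the 0.8611 accuracy is simply 31 correct predictions out of 36, and the weaker virginica sensitivity comes from the 4 virginica flowers that were predicted as versicolor.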
This work was inspired by this post: http://rstudio-pubs-static.s3.amazonaws.com/16444_caf85a306d564eb490eebdbaf0072df2.html