This is the sample data set used in the “Machine Learning with R” book to teach the kNN algorithm.
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
setwd("/Users/njvijay/big_data/Github/MachineLearning-R/knn/data")
srcData <- read.csv("wisc_bc_data.csv")
str(srcData)
## 'data.frame': 569 obs. of 32 variables:
## $ id : int 87139402 8910251 905520 868871 9012568 906539 925291 87880 862989 89827 ...
## $ diagnosis : Factor w/ 2 levels "B","M": 1 1 1 1 1 1 1 2 1 1 ...
## $ radius_mean : num 12.3 10.6 11 11.3 15.2 ...
## $ texture_mean : num 12.4 18.9 16.8 13.4 13.2 ...
## $ perimeter_mean : num 78.8 69.3 70.9 73 97.7 ...
## $ area_mean : num 464 346 373 385 712 ...
## $ smoothness_mean : num 0.1028 0.0969 0.1077 0.1164 0.0796 ...
## $ compactness_mean : num 0.0698 0.1147 0.078 0.1136 0.0693 ...
## $ concavity_mean : num 0.0399 0.0639 0.0305 0.0464 0.0339 ...
## $ points_mean : num 0.037 0.0264 0.0248 0.048 0.0266 ...
## $ symmetry_mean : num 0.196 0.192 0.171 0.177 0.172 ...
## $ dimension_mean : num 0.0595 0.0649 0.0634 0.0607 0.0554 ...
## $ radius_se : num 0.236 0.451 0.197 0.338 0.178 ...
## $ texture_se : num 0.666 1.197 1.387 1.343 0.412 ...
## $ perimeter_se : num 1.67 3.43 1.34 1.85 1.34 ...
## $ area_se : num 17.4 27.1 13.5 26.3 17.7 ...
## $ smoothness_se : num 0.00805 0.00747 0.00516 0.01127 0.00501 ...
## $ compactness_se : num 0.0118 0.03581 0.00936 0.03498 0.01485 ...
## $ concavity_se : num 0.0168 0.0335 0.0106 0.0219 0.0155 ...
## $ points_se : num 0.01241 0.01365 0.00748 0.01965 0.00915 ...
## $ symmetry_se : num 0.0192 0.035 0.0172 0.0158 0.0165 ...
## $ dimension_se : num 0.00225 0.00332 0.0022 0.00344 0.00177 ...
## $ radius_worst : num 13.5 11.9 12.4 11.9 16.2 ...
## $ texture_worst : num 15.6 22.9 26.4 15.8 15.7 ...
## $ perimeter_worst : num 87 78.3 79.9 76.5 104.5 ...
## $ area_worst : num 549 425 471 434 819 ...
## $ smoothness_worst : num 0.139 0.121 0.137 0.137 0.113 ...
## $ compactness_worst: num 0.127 0.252 0.148 0.182 0.174 ...
## $ concavity_worst : num 0.1242 0.1916 0.1067 0.0867 0.1362 ...
## $ points_worst : num 0.0939 0.0793 0.0743 0.0861 0.0818 ...
## $ symmetry_worst : num 0.283 0.294 0.3 0.21 0.249 ...
## $ dimension_worst : num 0.0677 0.0759 0.0788 0.0678 0.0677 ...
#Removing the id field, which is not required for prediction
wcData <- srcData[,-1]
table(wcData$diagnosis)
##
## B M
## 357 212
prop.table(table(wcData$diagnosis))
##
## B M
## 0.6274 0.3726
The data set has 569 observations (biopsies) with 32 variables. diagnosis is the response variable with 2 levels: B (benign) and M (malignant). All of the predictors are quantitative but measured on different scales, so they need to be standardized before applying the kNN algorithm.
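As a minimal sketch of that standardization step (the train() call later handles it via preProcess, so this scaled copy is only for illustration and is not used further):
#Z-score standardization of all numeric predictors; column 1 of wcData is diagnosis
wcScaled <- as.data.frame(scale(wcData[,-1]))
summary(wcScaled$radius_mean) #centred around 0 after scaling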
Reserve 20% of the data for validation; the remaining 80% is used for cross-validation and model building.
set.seed(456)
idx <- createDataPartition(wcData$diagnosis,p = 0.8,list = FALSE)
train <- wcData[idx,]
test <- wcData[-idx,]
#Checking the class proportions after the data split
prop.table(table(train$diagnosis))
##
## B M
## 0.6272 0.3728
prop.table(table(test$diagnosis))
##
## B M
## 0.6283 0.3717
The class proportions are maintained in the data split.
library(doMC)
## Loading required package: foreach
## Loading required package: iterators
## Loading required package: parallel
registerDoMC(cores = 3)
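doMC works only on Unix-like systems. On Windows, a roughly equivalent setup (assuming the doParallel package is installed) would be:
#library(doParallel)
#registerDoParallel(cores = 3)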
Building the model
set.seed(400)
ctrl <- trainControl(method = "repeatedcv", repeats = 3, classProbs = TRUE, summaryFunction = twoClassSummary)
knnFit <- train(diagnosis ~ ., data = train, method = "knn", trControl = ctrl, preProcess = c("center","scale"), tuneLength = 31)
## Loading required package: pROC
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
##
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
## Warning: The metric "Accuracy" was not in the result set. ROC will be used
## instead.
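The warning appears because train() defaults to the Accuracy metric, while twoClassSummary only returns ROC, Sens and Spec, so caret falls back to ROC. The fallback can be made explicit by naming the metric in the call; a sketch of the same fit (not rerun here):
#knnFit <- train(diagnosis ~ ., data = train, method = "knn", metric = "ROC",
#                trControl = ctrl, preProcess = c("center","scale"), tuneLength = 31)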
knnFit
## k-Nearest Neighbors
##
## 456 samples
## 30 predictors
## 2 classes: 'B', 'M'
##
## Pre-processing: centered, scaled
## Resampling: Cross-Validated (10 fold, repeated 3 times)
##
## Summary of sample sizes: 411, 410, 410, 411, 411, 410, ...
##
## Resampling results across tuning parameters:
##
##   k   ROC  Sens  Spec  ROC SD  Sens SD  Spec SD
##   5   1    1     0.9   0.02    0.02     0.06
##   7   1    1     0.9   0.02    0.02     0.07
##   9   1    1     0.9   0.01    0.01     0.06
##   ... (rows for the remaining 28 of the 31 candidate values of k omitted:
##   ROC and Sens stay close to 1 and Spec stays in the 0.8-0.9 range
##   across the whole tuning grid)
##
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was k = 45.
plot(knnFit)
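The plot shows the resampled ROC against k. The selected value can also be read programmatically from the fitted object:
knnFit$bestTune #data frame holding the chosen k (45 in this run)
#knnFit$finalModel holds the final kNN model refit on the full training set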
Validating the model on the test data
knnPredict <- predict(knnFit,newdata = test)
#Get the confusion matrix to see accuracy and the other performance measures
confusionMatrix(knnPredict, test$diagnosis)
## Confusion Matrix and Statistics
##
## Reference
## Prediction B M
## B 71 4
## M 0 38
##
## Accuracy : 0.965
## 95% CI : (0.912, 0.99)
## No Information Rate : 0.628
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.923
## Mcnemar's Test P-Value : 0.134
##
## Sensitivity : 1.000
## Specificity : 0.905
## Pos Pred Value : 0.947
## Neg Pred Value : 1.000
## Prevalence : 0.628
## Detection Rate : 0.628
## Detection Prevalence : 0.664
## Balanced Accuracy : 0.952
##
## 'Positive' Class : B
##
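confusionMatrix() takes the first factor level (B) as the positive class by default. If the malignant class is the one of interest, the same call can be pointed at it instead, which swaps sensitivity and specificity:
#Treat M (malignant) as the positive class
confusionMatrix(knnPredict, test$diagnosis, positive = "M")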
Measuring performance using the ROCR package
knnperfPredict <- predict(knnFit,newdata = test,type="prob")
knnperfPredictB <- knnperfPredict$B
#Convert the factor response to a 0/1 indicator (1 = benign) for ROCR
testdiagnosis <- ifelse(test$diagnosis == 'B', 1, 0)
library(ROCR)
## Loading required package: gplots
## KernSmooth 2.23 loaded
## Copyright M. P. Wand 1997-2009
##
## Attaching package: 'gplots'
##
## The following object is masked from 'package:stats':
##
## lowess
perfprod <- prediction(knnperfPredictB, testdiagnosis)
perf <- performance(perfprod,"tpr","fpr")
plot(perf)
perf <- performance(perfprod,"sens","spec")
plot(perf)
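The area under the ROC curve can be pulled from the same prediction object:
auc <- performance(perfprod, "auc")
unlist(auc@y.values) #AUC for the benign-class probabilities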
A limitation of ROCR is that it supports only binary classification.