Wisconsin Breast Cancer Diagnostic Data Set

This is the sample data set used in the book “Machine Learning with R” to teach the kNN algorithm.

Data exploration

library(caret)

## Loading required package: lattice
## Loading required package: ggplot2

setwd("/Users/njvijay/big_data/Github/MachineLearning-R/knn/data")
srcData <- read.csv("wisc_bc_data.csv")
str(srcData)

## 'data.frame':    569 obs. of  32 variables:
##  $ id               : int  87139402 8910251 905520 868871 9012568 906539 925291 87880 862989 89827 ...
##  $ diagnosis        : Factor w/ 2 levels "B","M": 1 1 1 1 1 1 1 2 1 1 ...
##  $ radius_mean      : num  12.3 10.6 11 11.3 15.2 ...
##  $ texture_mean     : num  12.4 18.9 16.8 13.4 13.2 ...
##  $ perimeter_mean   : num  78.8 69.3 70.9 73 97.7 ...
##  $ area_mean        : num  464 346 373 385 712 ...
##  $ smoothness_mean  : num  0.1028 0.0969 0.1077 0.1164 0.0796 ...
##  $ compactness_mean : num  0.0698 0.1147 0.078 0.1136 0.0693 ...
##  $ concavity_mean   : num  0.0399 0.0639 0.0305 0.0464 0.0339 ...
##  $ points_mean      : num  0.037 0.0264 0.0248 0.048 0.0266 ...
##  $ symmetry_mean    : num  0.196 0.192 0.171 0.177 0.172 ...
##  $ dimension_mean   : num  0.0595 0.0649 0.0634 0.0607 0.0554 ...
##  $ radius_se        : num  0.236 0.451 0.197 0.338 0.178 ...
##  $ texture_se       : num  0.666 1.197 1.387 1.343 0.412 ...
##  $ perimeter_se     : num  1.67 3.43 1.34 1.85 1.34 ...
##  $ area_se          : num  17.4 27.1 13.5 26.3 17.7 ...
##  $ smoothness_se    : num  0.00805 0.00747 0.00516 0.01127 0.00501 ...
##  $ compactness_se   : num  0.0118 0.03581 0.00936 0.03498 0.01485 ...
##  $ concavity_se     : num  0.0168 0.0335 0.0106 0.0219 0.0155 ...
##  $ points_se        : num  0.01241 0.01365 0.00748 0.01965 0.00915 ...
##  $ symmetry_se      : num  0.0192 0.035 0.0172 0.0158 0.0165 ...
##  $ dimension_se     : num  0.00225 0.00332 0.0022 0.00344 0.00177 ...
##  $ radius_worst     : num  13.5 11.9 12.4 11.9 16.2 ...
##  $ texture_worst    : num  15.6 22.9 26.4 15.8 15.7 ...
##  $ perimeter_worst  : num  87 78.3 79.9 76.5 104.5 ...
##  $ area_worst       : num  549 425 471 434 819 ...
##  $ smoothness_worst : num  0.139 0.121 0.137 0.137 0.113 ...
##  $ compactness_worst: num  0.127 0.252 0.148 0.182 0.174 ...
##  $ concavity_worst  : num  0.1242 0.1916 0.1067 0.0867 0.1362 ...
##  $ points_worst     : num  0.0939 0.0793 0.0743 0.0861 0.0818 ...
##  $ symmetry_worst   : num  0.283 0.294 0.3 0.21 0.249 ...
##  $ dimension_worst  : num  0.0677 0.0759 0.0788 0.0678 0.0677 ...

# Remove the id field, which is not needed for prediction.
wcData <- srcData[,-1]
table(wcData$diagnosis)

## 
##   B   M 
## 357 212

prop.table(table(wcData$diagnosis))

## 
##      B      M 
## 0.6274 0.3726

The data set has 569 observations (biopsies) with 32 variables; dropping id leaves 31. diagnosis is the response variable, with 2 levels: B (benign) and M (malignant). The 30 predictors are all quantitative but are measured on different scales, so they need to be standardized before applying the kNN algorithm, which relies on distances between observations.
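To illustrate why the scale of the predictors matters, here is a minimal sketch in base R. The toy values are made up for illustration and are not rows of wisc_bc_data.csv. In the model below the same effect is obtained through caret's preProcess = c("center","scale"); the sketch only shows the idea.

```r
# Toy data: two predictors on very different scales (made-up values).
toy <- data.frame(
  area_mean       = c(464, 346, 712),
  smoothness_mean = c(0.10, 0.12, 0.08)
)

# Unscaled Euclidean distances are dominated almost entirely by area_mean.
raw_dist <- as.matrix(dist(toy))

# z-score standardization: subtract the column mean, divide by the column sd.
toy_z  <- as.data.frame(scale(toy))
z_dist <- as.matrix(dist(toy_z))

colMeans(toy_z)       # each column is now centered at (numerically) zero
apply(toy_z, 2, sd)   # and has standard deviation one
```

After scaling, both predictors contribute comparably to the distance between two biopsies, so neither one decides the nearest neighbors just because of its units.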

Data Splitting

Reserve 20% of the data for validation. The remaining 80% is used for cross-validation and model building.

set.seed(456)
idx <- createDataPartition(wcData$diagnosis,p = 0.8,list = FALSE)
train <- wcData[idx,]
test <- wcData[-idx,]
# Check the class proportions after splitting
prop.table(table(train$diagnosis))

## 
##      B      M 
## 0.6272 0.3728

prop.table(table(test$diagnosis))

## 
##      B      M 
## 0.6283 0.3717

The class proportions are maintained in both the training and the test set.
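The idea behind a stratified split can be sketched in base R: sample 80% of the rows within each class separately, so that both parts keep roughly the original class mix. This is only an illustration of the principle, not caret's actual implementation of createDataPartition. The label vector is a hypothetical stand-in built from the class counts shown earlier.

```r
set.seed(456)

# Hypothetical label vector with the same class counts as the data set.
diagnosis <- factor(c(rep("B", 357), rep("M", 212)))

# Sample 80% of the row indices within each class separately.
idx_by_class <- lapply(split(seq_along(diagnosis), diagnosis), function(rows) {
  rows[sample.int(length(rows), size = round(0.8 * length(rows)))]
})
train_idx <- sort(unlist(idx_by_class, use.names = FALSE))

# Class proportions in the two parts of the split.
prop.table(table(diagnosis[train_idx]))
prop.table(table(diagnosis[-train_idx]))
```

A plain random sample of rows would usually give similar proportions on a data set of this size, but stratifying removes the chance of an unlucky draw.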

Setting parallelism

library(doMC)

## Loading required package: foreach
## Loading required package: iterators
## Loading required package: parallel

registerDoMC(cores = 3)
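The core count above is hard-coded. As a hedged alternative, the number of workers could be derived from the machine itself with the parallel package. The variable names and the "leave one core free" rule are assumptions made for this sketch.

```r
library(parallel)

# detectCores() can return NA on some platforms, so fall back to 1.
available <- detectCores()
if (is.na(available)) available <- 1L

# Leave one core free for the rest of the system (a convention, not a rule).
workers <- max(1L, available - 1L)
workers
```

The result could then be passed as registerDoMC(cores = workers). Note that doMC relies on process forking, so it is generally not an option on Windows.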

Building the model

set.seed(400)
ctrl <- trainControl(method="repeatedcv",repeats = 3,classProbs=TRUE,summaryFunction = twoClassSummary)
knnFit <- train(diagnosis ~ ., data = train, method = "knn", trControl = ctrl, preProcess = c("center","scale"), tuneLength = 31)

## Loading required package: pROC
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## 
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var

## Warning: The metric "Accuracy" was not in the result set. ROC will be used
## instead.

knnFit

## k-Nearest Neighbors 
## 
## 456 samples
##  30 predictors
##   2 classes: 'B', 'M' 
## 
## Pre-processing: centered, scaled 
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## 
## Summary of sample sizes: 411, 410, 410, 411, 411, 410, ... 
## 
## Resampling results across tuning parameters:
## 
##   k   ROC  Sens  Spec  ROC SD  Sens SD  Spec SD
##   5   1    1     0.9   0.02    0.02     0.06   
##   7   1    1     0.9   0.02    0.02     0.07   
##   9   1    1     0.9   0.01    0.01     0.06   
##   10  1    1     0.9   0.01    0.01     0.06   
##   10  1    1     0.9   0.01    0.007    0.07   
##   20  1    1     0.9   0.01    0.02     0.07   
##   20  1    1     0.9   0.01    0.01     0.07   
##   20  1    1     0.9   0.01    0.009    0.07   
##   20  1    1     0.9   0.01    0.009    0.07   
##   20  1    1     0.9   0.01    0.01     0.07   
##   20  1    1     0.9   0.01    0.01     0.07   
##   30  1    1     0.9   0.01    0.01     0.07   
##   30  1    1     0.9   0.01    0.01     0.07   
##   30  1    1     0.9   0.01    0.01     0.07   
##   30  1    1     0.9   0.01    0.01     0.07   
##   40  1    1     0.9   0.01    0.01     0.07   
##   40  1    1     0.9   0.01    0.01     0.07   
##   40  1    1     0.9   0.01    0.01     0.07   
##   40  1    1     0.9   0.01    0.01     0.07   
##   40  1    1     0.9   0.01    0.01     0.08   
##   40  1    1     0.9   0.008   0.01     0.07   
##   50  1    1     0.9   0.008   0.01     0.07   
##   50  1    1     0.9   0.008   0.01     0.07   
##   50  1    1     0.9   0.008   0.01     0.07   
##   50  1    1     0.9   0.008   0.01     0.07   
##   60  1    1     0.9   0.008   0.01     0.07   
##   60  1    1     0.9   0.008   0.01     0.07   
##   60  1    1     0.8   0.008   0.01     0.08   
##   60  1    1     0.8   0.009   0.01     0.09   
##   60  1    1     0.8   0.009   0.01     0.09   
##   60  1    1     0.8   0.009   0.01     0.09   
## 
## ROC was used to select the optimal model using  the largest value.
## The final value used for the model was k = 45.

plot(knnFit)

[Figure: cross-validated ROC plotted against the number of neighbors k]
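For readers new to kNN, the prediction step itself can be sketched in a few lines of base R. This is a simplified illustration with made-up, already-standardized toy data. The function name knn_predict_one is hypothetical, and the sketch is not the code that caret runs internally.

```r
# Classify one query point by majority vote among its k nearest training rows.
knn_predict_one <- function(train_x, train_y, query, k) {
  # Euclidean distance from the query to every training row.
  diffs <- sweep(as.matrix(train_x), 2, query)
  dists <- sqrt(rowSums(diffs^2))
  # Labels of the k closest rows.
  nearest <- train_y[order(dists)[seq_len(k)]]
  # Most frequent label wins; a tie falls to the first factor level.
  names(which.max(table(nearest)))
}

# Toy, already-standardized data (made-up values).
train_x <- rbind(c(-1.0, -0.8), c(-0.9, -1.1), c(-1.2, -0.7),
                 c( 1.0,  0.9), c( 1.1,  1.2), c( 0.8,  1.0))
train_y <- factor(c("B", "B", "B", "M", "M", "M"))

knn_predict_one(train_x, train_y, query = c(-0.7, -0.9), k = 3)
knn_predict_one(train_x, train_y, query = c( 0.9,  0.8), k = 3)
```

A small k follows the training data closely, while a large k smooths the decision boundary, which is why k is tuned by cross-validation rather than fixed in advance.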

Validate the model on the test data

knnPredict <- predict(knnFit,newdata = test)
# Get the confusion matrix to see the accuracy and other performance measures
confusionMatrix(knnPredict, test$diagnosis )

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  B  M
##          B 71  4
##          M  0 38
##                                        
##                Accuracy : 0.965        
##                  95% CI : (0.912, 0.99)
##     No Information Rate : 0.628        
##     P-Value [Acc > NIR] : <2e-16       
##                                        
##                   Kappa : 0.923        
##  Mcnemar's Test P-Value : 0.134        
##                                        
##             Sensitivity : 1.000        
##             Specificity : 0.905        
##          Pos Pred Value : 0.947        
##          Neg Pred Value : 1.000        
##              Prevalence : 0.628        
##          Detection Rate : 0.628        
##    Detection Prevalence : 0.664        
##       Balanced Accuracy : 0.952        
##                                        
##        'Positive' Class : B            
##
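The statistics in this output follow directly from the four cell counts. The sketch below recomputes the main ones by hand in base R, using the counts copied from the matrix above and treating B as the positive class, as the output does. Variable names are chosen for this illustration.

```r
# Counts copied from the confusion matrix above
# (rows = prediction, columns = reference).
cm <- matrix(c(71, 0, 4, 38), nrow = 2,
             dimnames = list(Prediction = c("B", "M"),
                             Reference  = c("B", "M")))

# With B as the positive class:
tp <- cm["B", "B"]; fn <- cm["M", "B"]
fp <- cm["B", "M"]; tn <- cm["M", "M"]
n  <- sum(cm)

accuracy    <- (tp + tn) / n
sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)
ppv         <- tp / (tp + fp)
npv         <- tn / (tn + fn)
balanced    <- (sensitivity + specificity) / 2

# Cohen's kappa: observed agreement corrected for chance agreement.
expected <- sum(rowSums(cm) * colSums(cm)) / n^2
kappa    <- (accuracy - expected) / (1 - expected)

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, ppv = ppv, npv = npv,
        balanced = balanced, kappa = kappa), 3)
```

Because the positive class is B, the sensitivity of 1.000 says that every benign case was recognized. The 4 errors are malignant cases predicted as benign, so the detection rate for malignant tumors is the reported specificity, 38/42 ≈ 0.905.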

Measuring performance using ROCR package

knnperfPredict <- predict(knnFit,newdata = test,type="prob")
knnperfPredictB <- knnperfPredict$B
# Encode the reference labels as 1 (B) and 0 (M)
testdiagnosis <- ifelse(test$diagnosis == 'B', 1,0)

library(ROCR)

## Loading required package: gplots
## KernSmooth 2.23 loaded
## Copyright M. P. Wand 1997-2009
## 
## Attaching package: 'gplots'
## 
## The following object is masked from 'package:stats':
## 
##     lowess

perfprod <- prediction(knnperfPredictB, testdiagnosis)
perf <- performance(perfprod,"tpr","fpr")
plot(perf)

[Figure: ROC curve, true positive rate against false positive rate]

perf <- performance(perfprod,"sens","spec")
plot(perf)

[Figure: sensitivity against specificity]

ROCR supports only binary classification, which limits its use to two-class problems such as this one.
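A curve is often summarized by the area under it (AUC). As a sketch of what that number measures, the base R function below computes AUC from the rank-sum (Mann-Whitney) identity: the probability that a randomly chosen positive case scores higher than a randomly chosen negative one. The function name auc_rank and the toy scores are assumptions made for this illustration.

```r
# AUC from the rank-sum identity; labels must be coded 1 (positive) / 0 (negative).
auc_rank <- function(scores, labels) {
  stopifnot(length(scores) == length(labels), all(labels %in% c(0, 1)))
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  r <- rank(scores)  # average ranks, so tied scores count as half a win
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Made-up scores: 9 of the 12 positive/negative pairs are ordered correctly.
scores <- c(0.95, 0.90, 0.80, 0.60, 0.40, 0.30, 0.20)
labels <- c(1, 1, 0, 1, 0, 1, 0)
auc_rank(scores, labels)  # 0.75
```

With the objects from this section, auc_rank(knnperfPredictB, testdiagnosis) would apply the same calculation to the test set. ROCR can also report this measure through performance(perfprod, "auc"), with the value stored in the y.values slot of the result.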

Vijayakumar Jawaharlal

May 3, 2014
