Introduction

The purpose of this analysis is to use a KNN model to predict the diagnosis of breast cancer (benign or malignant) given specific features. The dataset was obtained from https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29

loaded libraries

library(ggplot2)
library(class)
library(gmodels)
library(caret)
## Loading required package: lattice

Data Exploration and Preprocessing

str(bc)
## 'data.frame':    569 obs. of  32 variables:
##  $ id               : int  87139402 8910251 905520 868871 9012568 906539 925291 87880 862989 89827 ...
##  $ diagnosis        : chr  "B" "B" "B" "B" ...
##  $ radius_mean      : num  12.3 10.6 11 11.3 15.2 ...
##  $ texture_mean     : num  12.4 18.9 16.8 13.4 13.2 ...
##  $ perimeter_mean   : num  78.8 69.3 70.9 73 97.7 ...
##  $ area_mean        : num  464 346 373 385 712 ...
##  $ smoothness_mean  : num  0.1028 0.0969 0.1077 0.1164 0.0796 ...
##  $ compactness_mean : num  0.0698 0.1147 0.078 0.1136 0.0693 ...
##  $ concavity_mean   : num  0.0399 0.0639 0.0305 0.0464 0.0339 ...
##  $ points_mean      : num  0.037 0.0264 0.0248 0.048 0.0266 ...
##  $ symmetry_mean    : num  0.196 0.192 0.171 0.177 0.172 ...
##  $ dimension_mean   : num  0.0595 0.0649 0.0634 0.0607 0.0554 ...
##  $ radius_se        : num  0.236 0.451 0.197 0.338 0.178 ...
##  $ texture_se       : num  0.666 1.197 1.387 1.343 0.412 ...
##  $ perimeter_se     : num  1.67 3.43 1.34 1.85 1.34 ...
##  $ area_se          : num  17.4 27.1 13.5 26.3 17.7 ...
##  $ smoothness_se    : num  0.00805 0.00747 0.00516 0.01127 0.00501 ...
##  $ compactness_se   : num  0.0118 0.03581 0.00936 0.03498 0.01485 ...
##  $ concavity_se     : num  0.0168 0.0335 0.0106 0.0219 0.0155 ...
##  $ points_se        : num  0.01241 0.01365 0.00748 0.01965 0.00915 ...
##  $ symmetry_se      : num  0.0192 0.035 0.0172 0.0158 0.0165 ...
##  $ dimension_se     : num  0.00225 0.00332 0.0022 0.00344 0.00177 ...
##  $ radius_worst     : num  13.5 11.9 12.4 11.9 16.2 ...
##  $ texture_worst    : num  15.6 22.9 26.4 15.8 15.7 ...
##  $ perimeter_worst  : num  87 78.3 79.9 76.5 104.5 ...
##  $ area_worst       : num  549 425 471 434 819 ...
##  $ smoothness_worst : num  0.139 0.121 0.137 0.137 0.113 ...
##  $ compactness_worst: num  0.127 0.252 0.148 0.182 0.174 ...
##  $ concavity_worst  : num  0.1242 0.1916 0.1067 0.0867 0.1362 ...
##  $ points_worst     : num  0.0939 0.0793 0.0743 0.0861 0.0818 ...
##  $ symmetry_worst   : num  0.283 0.294 0.3 0.21 0.249 ...
##  $ dimension_worst  : num  0.0677 0.0759 0.0788 0.0678 0.0677 ...

The id variable is not relevant for analysis using KNN so we will remove it from the data frame.

bc = bc[-1]

We will also convert our target variable diagnosis to a factor with two variables. “Benign” and “Malignant”.

bc$diagnosis = factor(bc$diagnosis, levels = c("B", "M"), labels = c("Benign", "Malignant"))

The the numeric features must be normalized using min-max normalization since they are on different scales.

##   radius_mean      texture_mean   perimeter_mean     area_mean     
##  Min.   : 6.981   Min.   : 9.71   Min.   : 43.79   Min.   : 143.5  
##  1st Qu.:11.700   1st Qu.:16.17   1st Qu.: 75.17   1st Qu.: 420.3  
##  Median :13.370   Median :18.84   Median : 86.24   Median : 551.1  
##  Mean   :14.127   Mean   :19.29   Mean   : 91.97   Mean   : 654.9  
##  3rd Qu.:15.780   3rd Qu.:21.80   3rd Qu.:104.10   3rd Qu.: 782.7  
##  Max.   :28.110   Max.   :39.28   Max.   :188.50   Max.   :2501.0  
##  smoothness_mean   compactness_mean  concavity_mean     points_mean     
##  Min.   :0.05263   Min.   :0.01938   Min.   :0.00000   Min.   :0.00000  
##  1st Qu.:0.08637   1st Qu.:0.06492   1st Qu.:0.02956   1st Qu.:0.02031  
##  Median :0.09587   Median :0.09263   Median :0.06154   Median :0.03350  
##  Mean   :0.09636   Mean   :0.10434   Mean   :0.08880   Mean   :0.04892  
##  3rd Qu.:0.10530   3rd Qu.:0.13040   3rd Qu.:0.13070   3rd Qu.:0.07400  
##  Max.   :0.16340   Max.   :0.34540   Max.   :0.42680   Max.   :0.20120  
##  symmetry_mean    dimension_mean      radius_se        texture_se    
##  Min.   :0.1060   Min.   :0.04996   Min.   :0.1115   Min.   :0.3602  
##  1st Qu.:0.1619   1st Qu.:0.05770   1st Qu.:0.2324   1st Qu.:0.8339  
##  Median :0.1792   Median :0.06154   Median :0.3242   Median :1.1080  
##  Mean   :0.1812   Mean   :0.06280   Mean   :0.4052   Mean   :1.2169  
##  3rd Qu.:0.1957   3rd Qu.:0.06612   3rd Qu.:0.4789   3rd Qu.:1.4740  
##  Max.   :0.3040   Max.   :0.09744   Max.   :2.8730   Max.   :4.8850  
##   perimeter_se       area_se        smoothness_se      compactness_se    
##  Min.   : 0.757   Min.   :  6.802   Min.   :0.001713   Min.   :0.002252  
##  1st Qu.: 1.606   1st Qu.: 17.850   1st Qu.:0.005169   1st Qu.:0.013080  
##  Median : 2.287   Median : 24.530   Median :0.006380   Median :0.020450  
##  Mean   : 2.866   Mean   : 40.337   Mean   :0.007041   Mean   :0.025478  
##  3rd Qu.: 3.357   3rd Qu.: 45.190   3rd Qu.:0.008146   3rd Qu.:0.032450  
##  Max.   :21.980   Max.   :542.200   Max.   :0.031130   Max.   :0.135400  
##   concavity_se       points_se         symmetry_se        dimension_se      
##  Min.   :0.00000   Min.   :0.000000   Min.   :0.007882   Min.   :0.0008948  
##  1st Qu.:0.01509   1st Qu.:0.007638   1st Qu.:0.015160   1st Qu.:0.0022480  
##  Median :0.02589   Median :0.010930   Median :0.018730   Median :0.0031870  
##  Mean   :0.03189   Mean   :0.011796   Mean   :0.020542   Mean   :0.0037949  
##  3rd Qu.:0.04205   3rd Qu.:0.014710   3rd Qu.:0.023480   3rd Qu.:0.0045580  
##  Max.   :0.39600   Max.   :0.052790   Max.   :0.078950   Max.   :0.0298400  
##   radius_worst   texture_worst   perimeter_worst    area_worst    
##  Min.   : 7.93   Min.   :12.02   Min.   : 50.41   Min.   : 185.2  
##  1st Qu.:13.01   1st Qu.:21.08   1st Qu.: 84.11   1st Qu.: 515.3  
##  Median :14.97   Median :25.41   Median : 97.66   Median : 686.5  
##  Mean   :16.27   Mean   :25.68   Mean   :107.26   Mean   : 880.6  
##  3rd Qu.:18.79   3rd Qu.:29.72   3rd Qu.:125.40   3rd Qu.:1084.0  
##  Max.   :36.04   Max.   :49.54   Max.   :251.20   Max.   :4254.0  
##  smoothness_worst  compactness_worst concavity_worst   points_worst    
##  Min.   :0.07117   Min.   :0.02729   Min.   :0.0000   Min.   :0.00000  
##  1st Qu.:0.11660   1st Qu.:0.14720   1st Qu.:0.1145   1st Qu.:0.06493  
##  Median :0.13130   Median :0.21190   Median :0.2267   Median :0.09993  
##  Mean   :0.13237   Mean   :0.25427   Mean   :0.2722   Mean   :0.11461  
##  3rd Qu.:0.14600   3rd Qu.:0.33910   3rd Qu.:0.3829   3rd Qu.:0.16140  
##  Max.   :0.22260   Max.   :1.05800   Max.   :1.2520   Max.   :0.29100  
##  symmetry_worst   dimension_worst  
##  Min.   :0.1565   Min.   :0.05504  
##  1st Qu.:0.2504   1st Qu.:0.07146  
##  Median :0.2822   Median :0.08004  
##  Mean   :0.2901   Mean   :0.08395  
##  3rd Qu.:0.3179   3rd Qu.:0.09208  
##  Max.   :0.6638   Max.   :0.20750

Defined the normalize function

normalize = function(x){
  return ((x-min(x)) / max(x)-min(x))
}

Applying the normalize function to each numerical feature

bc_n = as.data.frame(lapply(bc[2:31], normalize))

Data Analysis and Experimental Results

We will split the data into training and testing sets. First we will randomly shuffle the data to ensure that there will be no discrepancies in the results.

set.seed(123)
bc = bc[sample(nrow(bc), replace=FALSE),]
bc_train = bc_n[1:469, ]
bc_test = bc_n[470:569, ]

bc_train_labels = bc[1:469, 1]
bc_test_labels = bc[470:569, 1]

Running the KNN function on our data, k=21

bc_test_pred = knn(train = bc_train, test = bc_test, cl = bc_train_labels, k=21)
CrossTable(x=bc_test_labels, y=bc_test_pred, prop.chisq=FALSE)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  100 
## 
##  
##                | bc_test_pred 
## bc_test_labels |    Benign | Malignant | Row Total | 
## ---------------|-----------|-----------|-----------|
##         Benign |        58 |         5 |        63 | 
##                |     0.921 |     0.079 |     0.630 | 
##                |     0.674 |     0.357 |           | 
##                |     0.580 |     0.050 |           | 
## ---------------|-----------|-----------|-----------|
##      Malignant |        28 |         9 |        37 | 
##                |     0.757 |     0.243 |     0.370 | 
##                |     0.326 |     0.643 |           | 
##                |     0.280 |     0.090 |           | 
## ---------------|-----------|-----------|-----------|
##   Column Total |        86 |        14 |       100 | 
##                |     0.860 |     0.140 |           | 
## ---------------|-----------|-----------|-----------|
## 
## 

Using 10-fold cross validation instead of train/test split

folds = createFolds(bc$diagnosis, k=10)

knn_fold function takes a fold and returns the validation error for that fold

knn_fold=function(features,target,fold,k){
train=features[-fold,]
validation=features[fold,]
train_labels=target[-fold]
validation_labels=target[fold]
validation_preds=knn(train,validation,train_labels,k=k)
t= table(validation_labels,validation_preds)
error=(t[1,2]+t[2,1])/(t[1,1]+t[1,2]+t[2,1]+t[2,2])
return(error)
}

crossValidationError function creates the folds and applies the knn_fold function to each fold and returns the average of the validation error over all the folds.

crossValidationError=function(features,target,k){
folds=createFolds(target,k=10)
errors=sapply(folds,knn_fold,features=features,
target=target,k=k)
return(mean(errors))
}

Using the function on our data

crossValidationError(bc_n, bc[,1],21)
## [1] 0.3760803

Tuning K in KNN using a range of values for k

ks=c(1,5,10,15,20,25,30,35,40,45,50)
errors = sapply(ks, crossValidationError, features=bc_n, target=bc[,1])
plot(errors~ks, main="Cross Validation Error vs K", xlab="k", ylab="CVError")
lines(errors~ks)

errors
##  [1] 0.4429511 0.4058217 0.4343596 0.4267879 0.3918157 0.3691935 0.3828310
##  [8] 0.3796539 0.3708258 0.3868248 0.3832307

Using z-score normalization instead of min-max normalization then getting the cross validation score.

bc_z = as.data.frame(scale(bc[-1]))
crossValidationError(bc_z, bc[,1],21)
## [1] 0.04398928
errors=sapply(ks, crossValidationError, features=bc_z, target=bc[,1])
plot(errors~ks, main="Cross Validation Error Vs K After z-score Normalization", xlab="k", ylab="CVError")
lines(errors~ks)

errors
##  [1] 0.04934427 0.02976515 0.03164376 0.03872505 0.04399036 0.04743756
##  [7] 0.04580417 0.04574151 0.04918762 0.05085559 0.04925028

Conclusion

The best model was produced using z-score normalization and 10-fold cross validation with 21-nearest neighbors.