Using k-NN machine learning algorithm to automate benign and malignant breast cancer classification

Detection of breast cancer involves examining breast tissue for abnormal lumps. If a mass is detected, a small sample is extracted with a fine needle. This biopsy is then examined under a microscope by a histopathologist to determine whether the mass is benign or malignant. If we can automate this microscopic examination with machine learning, we could reduce observer bias in the diagnostic process.

In this exercise, I will apply the k-NN algorithm to the Wisconsin Breast Cancer Diagnostic dataset from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml). The dataset contains 569 biopsy samples, each with an identification number and 31 other features. The target (dependent) variable is “diagnosis”: B for a benign and M for a malignant mass. The remaining 30 features are microscopic measurements that I will use to train the k-NN model.

Get, explore and prepare data

setwd("C:/Users/Owner/Desktop/MachineLearningR_sampleData")
breast_cancer <- read.csv("breast_cancer.csv")

str(breast_cancer)

## 'data.frame':    569 obs. of  32 variables:
##  $ id               : int  87139402 8910251 905520 868871 9012568 906539 925291 87880 862989 89827 ...
##  $ diagnosis        : Factor w/ 2 levels "B","M": 1 1 1 1 1 1 1 2 1 1 ...
##  $ radius_mean      : num  12.3 10.6 11 11.3 15.2 ...
##  $ texture_mean     : num  12.4 18.9 16.8 13.4 13.2 ...
##  $ perimeter_mean   : num  78.8 69.3 70.9 73 97.7 ...
##  $ area_mean        : num  464 346 373 385 712 ...
##  $ smoothness_mean  : num  0.1028 0.0969 0.1077 0.1164 0.0796 ...
##  $ compactness_mean : num  0.0698 0.1147 0.078 0.1136 0.0693 ...
##  $ concavity_mean   : num  0.0399 0.0639 0.0305 0.0464 0.0339 ...
##  $ points_mean      : num  0.037 0.0264 0.0248 0.048 0.0266 ...
##  $ symmetry_mean    : num  0.196 0.192 0.171 0.177 0.172 ...
##  $ dimension_mean   : num  0.0595 0.0649 0.0634 0.0607 0.0554 ...
##  $ radius_se        : num  0.236 0.451 0.197 0.338 0.178 ...
##  $ texture_se       : num  0.666 1.197 1.387 1.343 0.412 ...
##  $ perimeter_se     : num  1.67 3.43 1.34 1.85 1.34 ...
##  $ area_se          : num  17.4 27.1 13.5 26.3 17.7 ...
##  $ smoothness_se    : num  0.00805 0.00747 0.00516 0.01127 0.00501 ...
##  $ compactness_se   : num  0.0118 0.03581 0.00936 0.03498 0.01485 ...
##  $ concavity_se     : num  0.0168 0.0335 0.0106 0.0219 0.0155 ...
##  $ points_se        : num  0.01241 0.01365 0.00748 0.01965 0.00915 ...
##  $ symmetry_se      : num  0.0192 0.035 0.0172 0.0158 0.0165 ...
##  $ dimension_se     : num  0.00225 0.00332 0.0022 0.00344 0.00177 ...
##  $ radius_worst     : num  13.5 11.9 12.4 11.9 16.2 ...
##  $ texture_worst    : num  15.6 22.9 26.4 15.8 15.7 ...
##  $ perimeter_worst  : num  87 78.3 79.9 76.5 104.5 ...
##  $ area_worst       : num  549 425 471 434 819 ...
##  $ smoothness_worst : num  0.139 0.121 0.137 0.137 0.113 ...
##  $ compactness_worst: num  0.127 0.252 0.148 0.182 0.174 ...
##  $ concavity_worst  : num  0.1242 0.1916 0.1067 0.0867 0.1362 ...
##  $ points_worst     : num  0.0939 0.0793 0.0743 0.0861 0.0818 ...
##  $ symmetry_worst   : num  0.283 0.294 0.3 0.21 0.249 ...
##  $ dimension_worst  : num  0.0677 0.0759 0.0788 0.0678 0.0677 ...

All the features are numeric except for the diagnosis. The patient identification number carries no diagnostic information, so, as in most machine learning tasks, I exclude it.

breast_cancer<-breast_cancer[-1]
str(breast_cancer)

## 'data.frame':    569 obs. of  31 variables:
##  $ diagnosis        : Factor w/ 2 levels "B","M": 1 1 1 1 1 1 1 2 1 1 ...
##  $ radius_mean      : num  12.3 10.6 11 11.3 15.2 ...
##  $ texture_mean     : num  12.4 18.9 16.8 13.4 13.2 ...
##  $ perimeter_mean   : num  78.8 69.3 70.9 73 97.7 ...
##  $ area_mean        : num  464 346 373 385 712 ...
##  $ smoothness_mean  : num  0.1028 0.0969 0.1077 0.1164 0.0796 ...
##  $ compactness_mean : num  0.0698 0.1147 0.078 0.1136 0.0693 ...
##  $ concavity_mean   : num  0.0399 0.0639 0.0305 0.0464 0.0339 ...
##  $ points_mean      : num  0.037 0.0264 0.0248 0.048 0.0266 ...
##  $ symmetry_mean    : num  0.196 0.192 0.171 0.177 0.172 ...
##  $ dimension_mean   : num  0.0595 0.0649 0.0634 0.0607 0.0554 ...
##  $ radius_se        : num  0.236 0.451 0.197 0.338 0.178 ...
##  $ texture_se       : num  0.666 1.197 1.387 1.343 0.412 ...
##  $ perimeter_se     : num  1.67 3.43 1.34 1.85 1.34 ...
##  $ area_se          : num  17.4 27.1 13.5 26.3 17.7 ...
##  $ smoothness_se    : num  0.00805 0.00747 0.00516 0.01127 0.00501 ...
##  $ compactness_se   : num  0.0118 0.03581 0.00936 0.03498 0.01485 ...
##  $ concavity_se     : num  0.0168 0.0335 0.0106 0.0219 0.0155 ...
##  $ points_se        : num  0.01241 0.01365 0.00748 0.01965 0.00915 ...
##  $ symmetry_se      : num  0.0192 0.035 0.0172 0.0158 0.0165 ...
##  $ dimension_se     : num  0.00225 0.00332 0.0022 0.00344 0.00177 ...
##  $ radius_worst     : num  13.5 11.9 12.4 11.9 16.2 ...
##  $ texture_worst    : num  15.6 22.9 26.4 15.8 15.7 ...
##  $ perimeter_worst  : num  87 78.3 79.9 76.5 104.5 ...
##  $ area_worst       : num  549 425 471 434 819 ...
##  $ smoothness_worst : num  0.139 0.121 0.137 0.137 0.113 ...
##  $ compactness_worst: num  0.127 0.252 0.148 0.182 0.174 ...
##  $ concavity_worst  : num  0.1242 0.1916 0.1067 0.0867 0.1362 ...
##  $ points_worst     : num  0.0939 0.0793 0.0743 0.0861 0.0818 ...
##  $ symmetry_worst   : num  0.283 0.294 0.3 0.21 0.249 ...
##  $ dimension_worst  : num  0.0677 0.0759 0.0788 0.0678 0.0677 ...

In the dataset, “diagnosis” is coded as a factor. I will relabel its levels “B” and “M” as “Benign” and “Malignant”, respectively.

breast_cancer$diagnosis <- factor(breast_cancer$diagnosis, levels = c("B", "M"),
                                  labels = c("Benign", "Malignant"))

I will use ggplot2 to plot the number of biopsies that are benign or malignant, and prop.table() to compute their relative proportions.

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 3.4.2

ggplot(breast_cancer, aes(x = diagnosis)) + 
  theme_bw() +
  geom_bar() +
  theme(text = element_text(size=20))+
  labs(y = "Number of biopsies",
       title = "Breast Cancer Classification")

round(prop.table(table(breast_cancer$diagnosis)) * 100, digits = 1)

## 
##    Benign Malignant 
##      62.7      37.3

About 63% of the masses are benign and about 37% are malignant.

Feature normalization

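I will use z-score normalization to standardize the numeric features, so that features measured on large scales (such as area) do not dominate the distance calculations in k-NN.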

### Feature scaling (z-score)
breast_cancer_Zscore <- as.data.frame(scale(breast_cancer[-1]))

str(breast_cancer_Zscore)

## 'data.frame':    569 obs. of  30 variables:
##  $ radius_mean      : num  -0.513 -1.001 -0.876 -0.808 0.302 ...
##  $ texture_mean     : num  -1.604 -0.079 -0.572 -1.372 -1.414 ...
##  $ perimeter_mean   : num  -0.54 -0.934 -0.866 -0.781 0.234 ...
##  $ area_mean        : num  -0.542 -0.877 -0.8 -0.767 0.162 ...
##  $ smoothness_mean  : num  0.458 0.037 0.806 1.425 -1.19 ...
##  $ compactness_mean : num  -0.654 0.196 -0.498 0.175 -0.663 ...
##  $ concavity_mean   : num  -0.614 -0.313 -0.732 -0.532 -0.688 ...
##  $ points_mean      : num  -0.3072 -0.5798 -0.6216 -0.0247 -0.576 ...
##  $ symmetry_mean    : num  0.538 0.403 -0.356 -0.148 -0.331 ...
##  $ dimension_mean   : num  -0.46 0.2992 0.0853 -0.2943 -1.0421 ...
##  $ radius_se        : num  -0.61 0.163 -0.752 -0.241 -0.818 ...
##  $ texture_se       : num  -0.999 -0.036 0.308 0.229 -1.458 ...
##  $ perimeter_se     : num  -0.592 0.279 -0.754 -0.502 -0.756 ...
##  $ area_se          : num  -0.504 -0.291 -0.589 -0.308 -0.497 ...
##  $ smoothness_se    : num  0.334 0.143 -0.627 1.408 -0.676 ...
##  $ compactness_se   : num  -0.764 0.577 -0.9 0.531 -0.593 ...
##  $ concavity_se     : num  -0.499 0.0545 -0.7067 -0.3321 -0.5428 ...
##  $ points_se        : num  0.0995 0.3005 -0.699 1.2729 -0.428 ...
##  $ symmetry_se      : num  -0.158 1.754 -0.407 -0.574 -0.493 ...
##  $ dimension_se     : num  -0.585 -0.18 -0.604 -0.133 -0.766 ...
##  $ radius_worst     : num  -0.5729 -0.9081 -0.7985 -0.8998 -0.0143 ...
##  $ texture_worst    : num  -1.633 -0.445 0.124 -1.612 -1.618 ...
##  $ perimeter_worst  : num  -0.6039 -0.8625 -0.8134 -0.9146 -0.0822 ...
##  $ area_worst       : num  -0.582 -0.801 -0.719 -0.784 -0.108 ...
##  $ smoothness_worst : num  0.269 -0.485 0.198 0.19 -0.866 ...
##  $ compactness_worst: num  -0.8114 -0.0176 -0.6741 -0.458 -0.5121 ...
##  $ concavity_worst  : num  -0.709 -0.386 -0.793 -0.889 -0.652 ...
##  $ points_worst     : num  -0.315 -0.538 -0.613 -0.434 -0.499 ...
##  $ symmetry_worst   : num  -0.1192 0.0634 0.1572 -1.2911 -0.6688 ...
##  $ dimension_worst  : num  -0.899 -0.447 -0.284 -0.892 -0.902 ...
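To make the transformation concrete, here is a minimal sketch of what scale() computes for each column: subtract the column mean and divide by the sample standard deviation. The data frame `toy` is an illustrative stand-in that reuses the first five area_mean and smoothness_mean values printed earlier; it is not part of the original analysis.

```r
# Small stand-in data frame: two features on very different scales
# (first five area_mean and smoothness_mean values from the str() output).
toy <- data.frame(area = c(464, 346, 373, 385, 712),
                  smoothness = c(0.1028, 0.0969, 0.1077, 0.1164, 0.0796))

# Manual z-score: subtract the column mean, divide by the sample sd.
z_manual <- as.data.frame(lapply(toy, function(x) (x - mean(x)) / sd(x)))

# scale() performs the same computation column by column.
z_scale <- as.data.frame(scale(toy))

# The two results agree up to floating-point error.
all.equal(z_manual, z_scale, check.attributes = FALSE)

# After scaling, every column has mean 0 and standard deviation 1.
round(colMeans(z_scale), 10)
sapply(z_scale, sd)
```

Because both columns end up on the same scale, neither one dominates a Euclidean distance simply because of its units.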

Model training

Before training my model, I will split the data into training and testing sets: the first 450 rows for training and the remaining 119 rows for testing.

breast_cancer_train <-breast_cancer_Zscore[1:450, ]

breast_cancer_test <-breast_cancer_Zscore[451:569, ]

breast_cancer_train_labels <- breast_cancer[1:450, 1]
 
breast_cancer_test_labels <- breast_cancer[451:569, 1]
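Taking the first 450 rows works well only if the rows are in no particular order. If that is in doubt, one alternative is to draw the training rows at random. The sketch below shows the idea on made-up data; `toy_features`, `toy_labels`, and the seed value are hypothetical and only illustrate the indexing.

```r
set.seed(123)  # arbitrary seed, only so the split is reproducible

# Stand-in for the scaled features and the label vector (made-up data).
n <- 569
toy_features <- data.frame(f1 = rnorm(n), f2 = rnorm(n))
toy_labels <- factor(sample(c("Benign", "Malignant"), n, replace = TRUE))

# Draw 450 row indices at random instead of taking the first 450 rows.
train_idx <- sample(seq_len(n), size = 450)

toy_train <- toy_features[train_idx, ]
toy_test <- toy_features[-train_idx, ]
toy_train_labels <- toy_labels[train_idx]
toy_test_labels <- toy_labels[-train_idx]

# Sanity checks: the sizes match the 450 / 119 split.
nrow(toy_train)  # 450
nrow(toy_test)   # 119
```

Using the same index vector for features and labels keeps each row paired with its own diagnosis.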

I will use the k-NN implementation in the “class” package.

library(class)

## Warning: package 'class' was built under R version 3.4.2

breast_cancer_test_pred <- knn(train = breast_cancer_train,
                               test = breast_cancer_test,
                               cl = breast_cancer_train_labels, k = 15)
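Conceptually, knn() labels each test row by finding its k nearest training rows in Euclidean distance and taking a majority vote among their labels. The sketch below spells that out for a single query point. `knn_one` is a hypothetical helper written for illustration, not a function from the “class” package, and it resolves ties by taking the first factor level, whereas knn() breaks ties at random.

```r
# Classify one query point by majority vote among its k nearest
# training points (Euclidean distance). Hypothetical helper for illustration.
knn_one <- function(train, labels, query, k) {
  # Distance from the query to every training row.
  d <- sqrt(rowSums(sweep(as.matrix(train), 2, as.numeric(query))^2))
  # Labels of the k closest rows.
  nearest <- labels[order(d)[seq_len(k)]]
  # Most frequent label among the neighbours (ties go to the first level).
  names(which.max(table(nearest)))
}

# Two well-separated toy clusters (made-up coordinates).
train <- data.frame(x = c(0, 0.1, 0.2, 5, 5.1, 5.2),
                    y = c(0, 0.2, 0.1, 5, 5.2, 5.1))
labels <- factor(c("Benign", "Benign", "Benign",
                   "Malignant", "Malignant", "Malignant"))

knn_one(train, labels, query = c(0.3, 0.3), k = 3)  # "Benign"
knn_one(train, labels, query = c(4.8, 4.9), k = 3)  # "Malignant"
```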

Initially, I used k = 21, which is approximately the square root of 450 (the number of examples in my training dataset). After the first run, I gradually changed the value. The model seemed to generalize best with k = 15.
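Trying different values of k by hand can also be done in a loop. The following is a rough sketch on synthetic data, not the search I actually ran: the data, the seed, the candidate values in `k_values`, and the separate validation split are all assumptions made for illustration.

```r
library(class)

set.seed(42)  # arbitrary seed for reproducibility

# Synthetic two-class data (made up; stands in for the scaled features).
n <- 300
labels <- factor(rep(c("Benign", "Malignant"), each = n / 2))
features <- data.frame(
  f1 = rnorm(n, mean = ifelse(labels == "Benign", 0, 2)),
  f2 = rnorm(n, mean = ifelse(labels == "Benign", 0, 2))
)

# Random train / validation split.
idx <- sample(seq_len(n), size = 200)
train_x <- features[idx, ]
valid_x <- features[-idx, ]
train_y <- labels[idx]
valid_y <- labels[-idx]

# Try several odd values of k and record the validation accuracy for each.
k_values <- c(1, 5, 11, 15, 21, 27)
accuracy <- sapply(k_values, function(k) {
  pred <- knn(train = train_x, test = valid_x, cl = train_y, k = k)
  mean(pred == valid_y)
})

data.frame(k = k_values, accuracy = round(accuracy, 3))
```

Comparing k values on a validation split that is separate from the final test set keeps the test-set accuracy an honest estimate of performance on unseen data.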

Model evaluation

library(gmodels)

## Warning: package 'gmodels' was built under R version 3.4.2

CrossTable(x = breast_cancer_test_labels, y = breast_cancer_test_pred,
           prop.chisq = FALSE)  # prop.chisq = FALSE: the chi-square contributions are not informative for this purpose

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  119 
## 
##  
##                           | breast_cancer_test_pred 
## breast_cancer_test_labels |    Benign | Malignant | Row Total | 
## --------------------------|-----------|-----------|-----------|
##                    Benign |        72 |         1 |        73 | 
##                           |     0.986 |     0.014 |     0.613 | 
##                           |     0.960 |     0.023 |           | 
##                           |     0.605 |     0.008 |           | 
## --------------------------|-----------|-----------|-----------|
##                 Malignant |         3 |        43 |        46 | 
##                           |     0.065 |     0.935 |     0.387 | 
##                           |     0.040 |     0.977 |           | 
##                           |     0.025 |     0.361 |           | 
## --------------------------|-----------|-----------|-----------|
##              Column Total |        75 |        44 |       119 | 
##                           |     0.630 |     0.370 |           | 
## --------------------------|-----------|-----------|-----------|
## 
##

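The k-NN model correctly classified 115 of the 119 test biopsies (96.6%). Of the four errors, three were malignant masses predicted as benign (false negatives) and one was a benign mass predicted as malignant (false positive).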
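The counts in the table can be turned into summary metrics directly. The sketch below copies the four cell counts from the CrossTable output and computes accuracy, sensitivity, and specificity, treating “Malignant” as the positive class (my choice of convention here).

```r
# Confusion matrix copied from the CrossTable output above
# (rows = actual class, columns = predicted class).
cm <- matrix(c(72, 1,
               3, 43),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual = c("Benign", "Malignant"),
                             predicted = c("Benign", "Malignant")))

# Share of all test cases that were classified correctly: 115 / 119.
accuracy <- sum(diag(cm)) / sum(cm)

# Share of malignant cases that were detected: 43 / 46.
sensitivity <- cm["Malignant", "Malignant"] / sum(cm["Malignant", ])

# Share of benign cases that were correctly cleared: 72 / 73.
specificity <- cm["Benign", "Benign"] / sum(cm["Benign", ])

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity), 3)
# accuracy 0.966, sensitivity 0.935, specificity 0.986
```

For a screening task, sensitivity is the number to watch most closely, since a missed malignant mass is the costlier error.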
