Detection of breast cancer involves examination of breast tissue for abnormal lumps. If a mass is detected, a tiny sample is extracted using a fine needle. This biopsy is then examined microscopically by a histopathologist to determine whether the mass is benign or malignant. If we can automate this microscopic examination via machine learning, we could significantly reduce the bias in the diagnostic process.
In this exercise, I will use the k-NN algorithm on the Wisconsin Breast Cancer Diagnostic dataset from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml). This dataset includes 569 biopsy samples, each with an identification number and 31 other variables. The target (dependent) variable is “diagnosis”: B for a benign and M for a malignant tumor. The remaining 30 features are microscopic observations that I will use in training the k-NN model.
setwd("C:/Users/Owner/Desktop/MachineLearningR_sampleData")
breast_cancer <- read.csv("breast_cancer.csv")
str(breast_cancer)
## 'data.frame': 569 obs. of 32 variables:
## $ id : int 87139402 8910251 905520 868871 9012568 906539 925291 87880 862989 89827 ...
## $ diagnosis : Factor w/ 2 levels "B","M": 1 1 1 1 1 1 1 2 1 1 ...
## $ radius_mean : num 12.3 10.6 11 11.3 15.2 ...
## $ texture_mean : num 12.4 18.9 16.8 13.4 13.2 ...
## $ perimeter_mean : num 78.8 69.3 70.9 73 97.7 ...
## $ area_mean : num 464 346 373 385 712 ...
## $ smoothness_mean : num 0.1028 0.0969 0.1077 0.1164 0.0796 ...
## $ compactness_mean : num 0.0698 0.1147 0.078 0.1136 0.0693 ...
## $ concavity_mean : num 0.0399 0.0639 0.0305 0.0464 0.0339 ...
## $ points_mean : num 0.037 0.0264 0.0248 0.048 0.0266 ...
## $ symmetry_mean : num 0.196 0.192 0.171 0.177 0.172 ...
## $ dimension_mean : num 0.0595 0.0649 0.0634 0.0607 0.0554 ...
## $ radius_se : num 0.236 0.451 0.197 0.338 0.178 ...
## $ texture_se : num 0.666 1.197 1.387 1.343 0.412 ...
## $ perimeter_se : num 1.67 3.43 1.34 1.85 1.34 ...
## $ area_se : num 17.4 27.1 13.5 26.3 17.7 ...
## $ smoothness_se : num 0.00805 0.00747 0.00516 0.01127 0.00501 ...
## $ compactness_se : num 0.0118 0.03581 0.00936 0.03498 0.01485 ...
## $ concavity_se : num 0.0168 0.0335 0.0106 0.0219 0.0155 ...
## $ points_se : num 0.01241 0.01365 0.00748 0.01965 0.00915 ...
## $ symmetry_se : num 0.0192 0.035 0.0172 0.0158 0.0165 ...
## $ dimension_se : num 0.00225 0.00332 0.0022 0.00344 0.00177 ...
## $ radius_worst : num 13.5 11.9 12.4 11.9 16.2 ...
## $ texture_worst : num 15.6 22.9 26.4 15.8 15.7 ...
## $ perimeter_worst : num 87 78.3 79.9 76.5 104.5 ...
## $ area_worst : num 549 425 471 434 819 ...
## $ smoothness_worst : num 0.139 0.121 0.137 0.137 0.113 ...
## $ compactness_worst: num 0.127 0.252 0.148 0.182 0.174 ...
## $ concavity_worst : num 0.1242 0.1916 0.1067 0.0867 0.1362 ...
## $ points_worst : num 0.0939 0.0793 0.0743 0.0861 0.0818 ...
## $ symmetry_worst : num 0.283 0.294 0.3 0.21 0.249 ...
## $ dimension_worst : num 0.0677 0.0759 0.0788 0.0678 0.0677 ...
All the features are numeric except for the diagnosis. The patient identification variable carries no predictive information, and for most machine learning tasks it is excluded.
breast_cancer <- breast_cancer[-1] # drop the first column (id)
str(breast_cancer)
## 'data.frame': 569 obs. of 31 variables:
## $ diagnosis : Factor w/ 2 levels "B","M": 1 1 1 1 1 1 1 2 1 1 ...
## $ radius_mean : num 12.3 10.6 11 11.3 15.2 ...
## $ texture_mean : num 12.4 18.9 16.8 13.4 13.2 ...
## $ perimeter_mean : num 78.8 69.3 70.9 73 97.7 ...
## $ area_mean : num 464 346 373 385 712 ...
## $ smoothness_mean : num 0.1028 0.0969 0.1077 0.1164 0.0796 ...
## $ compactness_mean : num 0.0698 0.1147 0.078 0.1136 0.0693 ...
## $ concavity_mean : num 0.0399 0.0639 0.0305 0.0464 0.0339 ...
## $ points_mean : num 0.037 0.0264 0.0248 0.048 0.0266 ...
## $ symmetry_mean : num 0.196 0.192 0.171 0.177 0.172 ...
## $ dimension_mean : num 0.0595 0.0649 0.0634 0.0607 0.0554 ...
## $ radius_se : num 0.236 0.451 0.197 0.338 0.178 ...
## $ texture_se : num 0.666 1.197 1.387 1.343 0.412 ...
## $ perimeter_se : num 1.67 3.43 1.34 1.85 1.34 ...
## $ area_se : num 17.4 27.1 13.5 26.3 17.7 ...
## $ smoothness_se : num 0.00805 0.00747 0.00516 0.01127 0.00501 ...
## $ compactness_se : num 0.0118 0.03581 0.00936 0.03498 0.01485 ...
## $ concavity_se : num 0.0168 0.0335 0.0106 0.0219 0.0155 ...
## $ points_se : num 0.01241 0.01365 0.00748 0.01965 0.00915 ...
## $ symmetry_se : num 0.0192 0.035 0.0172 0.0158 0.0165 ...
## $ dimension_se : num 0.00225 0.00332 0.0022 0.00344 0.00177 ...
## $ radius_worst : num 13.5 11.9 12.4 11.9 16.2 ...
## $ texture_worst : num 15.6 22.9 26.4 15.8 15.7 ...
## $ perimeter_worst : num 87 78.3 79.9 76.5 104.5 ...
## $ area_worst : num 549 425 471 434 819 ...
## $ smoothness_worst : num 0.139 0.121 0.137 0.137 0.113 ...
## $ compactness_worst: num 0.127 0.252 0.148 0.182 0.174 ...
## $ concavity_worst : num 0.1242 0.1916 0.1067 0.0867 0.1362 ...
## $ points_worst : num 0.0939 0.0793 0.0743 0.0861 0.0818 ...
## $ symmetry_worst : num 0.283 0.294 0.3 0.21 0.249 ...
## $ dimension_worst : num 0.0677 0.0759 0.0788 0.0678 0.0677 ...
In the dataset, “diagnosis” is coded as a factor. I will relabel the levels “B” and “M” as “Benign” and “Malignant”, respectively.
breast_cancer$diagnosis<- factor(breast_cancer$diagnosis, levels = c("B", "M"),
labels = c("Benign", "Malignant"))
I will use ggplot2 to plot the number of biopsies that are benign or malignant.
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.4.2
ggplot(breast_cancer, aes(x = diagnosis)) +
theme_bw() +
geom_bar() +
theme(text = element_text(size=20))+
labs(y = "Number of biopsies",
title = "Breast Cancer Classification")
round(prop.table(table(breast_cancer$diagnosis)) * 100, digits = 1)
##
## Benign Malignant
## 62.7 37.3
About 63% of the masses are benign and about 37% are malignant.
I will use z-score normalization to standardize the numeric features, so that each feature has a mean of 0 and a standard deviation of 1.
### Feature scaling (z-score)
breast_cancer_Zscore<- as.data.frame(scale(breast_cancer[-1]))
str(breast_cancer_Zscore)
## 'data.frame': 569 obs. of 30 variables:
## $ radius_mean : num -0.513 -1.001 -0.876 -0.808 0.302 ...
## $ texture_mean : num -1.604 -0.079 -0.572 -1.372 -1.414 ...
## $ perimeter_mean : num -0.54 -0.934 -0.866 -0.781 0.234 ...
## $ area_mean : num -0.542 -0.877 -0.8 -0.767 0.162 ...
## $ smoothness_mean : num 0.458 0.037 0.806 1.425 -1.19 ...
## $ compactness_mean : num -0.654 0.196 -0.498 0.175 -0.663 ...
## $ concavity_mean : num -0.614 -0.313 -0.732 -0.532 -0.688 ...
## $ points_mean : num -0.3072 -0.5798 -0.6216 -0.0247 -0.576 ...
## $ symmetry_mean : num 0.538 0.403 -0.356 -0.148 -0.331 ...
## $ dimension_mean : num -0.46 0.2992 0.0853 -0.2943 -1.0421 ...
## $ radius_se : num -0.61 0.163 -0.752 -0.241 -0.818 ...
## $ texture_se : num -0.999 -0.036 0.308 0.229 -1.458 ...
## $ perimeter_se : num -0.592 0.279 -0.754 -0.502 -0.756 ...
## $ area_se : num -0.504 -0.291 -0.589 -0.308 -0.497 ...
## $ smoothness_se : num 0.334 0.143 -0.627 1.408 -0.676 ...
## $ compactness_se : num -0.764 0.577 -0.9 0.531 -0.593 ...
## $ concavity_se : num -0.499 0.0545 -0.7067 -0.3321 -0.5428 ...
## $ points_se : num 0.0995 0.3005 -0.699 1.2729 -0.428 ...
## $ symmetry_se : num -0.158 1.754 -0.407 -0.574 -0.493 ...
## $ dimension_se : num -0.585 -0.18 -0.604 -0.133 -0.766 ...
## $ radius_worst : num -0.5729 -0.9081 -0.7985 -0.8998 -0.0143 ...
## $ texture_worst : num -1.633 -0.445 0.124 -1.612 -1.618 ...
## $ perimeter_worst : num -0.6039 -0.8625 -0.8134 -0.9146 -0.0822 ...
## $ area_worst : num -0.582 -0.801 -0.719 -0.784 -0.108 ...
## $ smoothness_worst : num 0.269 -0.485 0.198 0.19 -0.866 ...
## $ compactness_worst: num -0.8114 -0.0176 -0.6741 -0.458 -0.5121 ...
## $ concavity_worst : num -0.709 -0.386 -0.793 -0.889 -0.652 ...
## $ points_worst : num -0.315 -0.538 -0.613 -0.434 -0.499 ...
## $ symmetry_worst : num -0.1192 0.0634 0.1572 -1.2911 -0.6688 ...
## $ dimension_worst : num -0.899 -0.447 -0.284 -0.892 -0.902 ...
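To make the scaling step concrete, here is a minimal sketch that standardizes a vector by hand and checks the result against `scale()`. The toy vector reuses the first five `radius_mean` values as displayed (rounded) earlier, so its z-scores will not match the full-dataset values shown above; `z_manual` and `z_scale` are illustrative names only.

```r
# Toy vector: the first five radius_mean values as displayed by str() (rounded).
x <- c(12.3, 10.6, 11.0, 11.3, 15.2)

# z-score: subtract the mean, then divide by the sample standard deviation.
z_manual <- (x - mean(x)) / sd(x)

# scale() applies the same transformation column by column and returns a matrix.
z_scale <- as.numeric(scale(x))

# Both approaches agree; the result has mean 0 and standard deviation 1.
all.equal(z_manual, z_scale)
mean(z_manual)
sd(z_manual)
```

Standardizing matters for k-NN because the algorithm relies on distances: without it, a feature on a large scale such as `area_mean` would dominate features on a small scale such as `smoothness_mean`.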
Before training my model, I will split the data into training and testing sets. The first 450 rows are used for training and the remaining 119 for testing:
breast_cancer_train <-breast_cancer_Zscore[1:450, ]
breast_cancer_test <-breast_cancer_Zscore[451:569, ]
breast_cancer_train_labels <- breast_cancer[1:450, 1]
breast_cancer_test_labels <- breast_cancer[451:569, 1]
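Splitting by row position works as long as the rows are not sorted in any meaningful way. If the row order is in doubt, one alternative is to draw the training rows at random. The sketch below uses a hypothetical stand-in data frame (`toy`) and an arbitrary seed so that it runs on its own; it is not the split used for the results in this post.

```r
set.seed(123)  # arbitrary seed, only so this sketch is reproducible

# Hypothetical stand-in for a 569-row data frame: a label plus one feature.
toy <- data.frame(
  diagnosis = factor(sample(c("Benign", "Malignant"), 569, replace = TRUE)),
  feature   = rnorm(569)
)

# Draw 450 row indices at random for training; the other 119 rows form the test set.
train_idx <- sample(nrow(toy), 450)

toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

nrow(toy_train)
nrow(toy_test)
```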
I will use the k-NN implementation in the “class” package.
library(class)
## Warning: package 'class' was built under R version 3.4.2
breast_cancer_test_pred <- knn(train = breast_cancer_train,
test = breast_cancer_test,
cl = breast_cancer_train_labels, k = 15)
Initially, I used a k value of 21, which is approximately the square root of 450 (the number of examples in my training dataset). After the first run, I gradually changed the value. The model seemed to generalize best with k = 15.
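Trying several k values by hand can also be written as a loop. The following is a rough sketch of that idea, not the exact procedure I followed: it uses two species from R's built-in `iris` data as a stand-in for the biopsy data, and the seed, the 70/30 split, and the list of k values are all arbitrary assumptions.

```r
library(class)  # provides knn()

set.seed(42)  # arbitrary seed for this sketch

# Stand-in data (an assumption): two iris species instead of the biopsy data.
two_class <- droplevels(subset(iris, Species != "setosa"))
features  <- as.data.frame(scale(two_class[, 1:4]))
labels    <- two_class$Species

# Random 70/30 split of the 100 rows.
train_idx <- sample(nrow(features), 70)

# Held-out accuracy for a handful of odd k values.
k_values <- c(1, 5, 9, 15, 21)
accuracy <- sapply(k_values, function(k) {
  pred <- knn(train = features[train_idx, ],
              test  = features[-train_idx, ],
              cl    = labels[train_idx],
              k     = k)
  mean(pred == labels[-train_idx])
})

data.frame(k = k_values, accuracy = round(accuracy, 3))
```

Odd values of k are used here because, with two classes, they avoid tied votes among the neighbors.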
library(gmodels)
## Warning: package 'gmodels' was built under R version 3.4.2
CrossTable(x = breast_cancer_test_labels, y = breast_cancer_test_pred,
prop.chisq = FALSE) # prop.chisq = FALSE drops the chi-square contributions, which are not informative for this purpose.
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 119
##
##
## | breast_cancer_test_pred
## breast_cancer_test_labels | Benign | Malignant | Row Total |
## --------------------------|-----------|-----------|-----------|
## Benign | 72 | 1 | 73 |
## | 0.986 | 0.014 | 0.613 |
## | 0.960 | 0.023 | |
## | 0.605 | 0.008 | |
## --------------------------|-----------|-----------|-----------|
## Malignant | 3 | 43 | 46 |
## | 0.065 | 0.935 | 0.387 |
## | 0.040 | 0.977 | |
## | 0.025 | 0.361 | |
## --------------------------|-----------|-----------|-----------|
## Column Total | 75 | 44 | 119 |
## | 0.630 | 0.370 | |
## --------------------------|-----------|-----------|-----------|
##
##
The k-NN model correctly classified 115 of the 119 test biopsies (about 96.6%). It misclassified one benign mass as malignant and three malignant tumors as benign.
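The counts in the table can be turned into summary metrics. The sketch below copies the four cell counts from the CrossTable output and computes accuracy, sensitivity (the share of malignant tumors that were caught), and specificity (the share of benign masses correctly identified); the variable names are illustrative.

```r
# Counts copied from the CrossTable output (rows = actual, columns = predicted).
conf_mat <- matrix(c(72,  1,
                      3, 43),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(actual    = c("Benign", "Malignant"),
                                   predicted = c("Benign", "Malignant")))

# Accuracy: correct predictions over all predictions, (72 + 43) / 119.
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)

# Sensitivity: malignant cases predicted as malignant, 43 / 46.
sensitivity <- conf_mat["Malignant", "Malignant"] / sum(conf_mat["Malignant", ])

# Specificity: benign cases predicted as benign, 72 / 73.
specificity <- conf_mat["Benign", "Benign"] / sum(conf_mat["Benign", ])

round(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity), 3)
# accuracy 0.966, sensitivity 0.935, specificity 0.986
```

For a screening task, sensitivity is arguably the number to watch, since a malignant tumor labeled as benign is the more costly mistake.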