library(gmodels)   # CrossTable() for model evaluation
library(ggvis)     # visualization
library(qtlcharts) # interactive correlation plot
library(class)     # knn() classifier
The goal is to predict whether cells extracted from a patient’s breast mass are malignant or benign using the k-Nearest Neighbours (kNN) algorithm. The model will learn from measurements of biopsied cells taken from women with abnormal breast masses.
The data is the Wisconsin Breast Cancer Diagnostic dataset from the UCI Machine Learning Repository at http://archive.ics.uci.edu/ml. It was donated by researchers at the University of Wisconsin and contains measurements derived from digitized images of fine-needle aspirates of breast masses.
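Throughout, the data is assumed to be available as a local CSV file; the file name below is illustrative, so adjust it to wherever your copy lives. A minimal loading step might look like:
# Hypothetical loading step; the file name wisc_bc_data.csv is an assumption
wbcd <- read.csv("wisc_bc_data.csv", stringsAsFactors = FALSE)
A first look at the structure of the data: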
str(wbcd)
## 'data.frame': 569 obs. of 32 variables:
## $ id : int 842302 842517 84300903 84348301 84358402 843786 844359 84458202 844981 84501001 ...
## $ diagnosis : chr "M" "M" "M" "M" ...
## $ radius_mean : num 18 20.6 19.7 11.4 20.3 ...
## $ texture_mean : num 10.4 17.8 21.2 20.4 14.3 ...
## $ perimeter_mean : num 122.8 132.9 130 77.6 135.1 ...
## $ area_mean : num 1001 1326 1203 386 1297 ...
## $ smoothness_mean : num 0.1184 0.0847 0.1096 0.1425 0.1003 ...
## $ compactness_mean : num 0.2776 0.0786 0.1599 0.2839 0.1328 ...
## $ concavity_mean : num 0.3001 0.0869 0.1974 0.2414 0.198 ...
## $ concave.points_mean : num 0.1471 0.0702 0.1279 0.1052 0.1043 ...
## $ symmetry_mean : num 0.242 0.181 0.207 0.26 0.181 ...
## $ fractal_dimension_mean : num 0.0787 0.0567 0.06 0.0974 0.0588 ...
## $ radius_se : num 1.095 0.543 0.746 0.496 0.757 ...
## $ texture_se : num 0.905 0.734 0.787 1.156 0.781 ...
## $ perimeter_se : num 8.59 3.4 4.58 3.44 5.44 ...
## $ area_se : num 153.4 74.1 94 27.2 94.4 ...
## $ smoothness_se : num 0.0064 0.00522 0.00615 0.00911 0.01149 ...
## $ compactness_se : num 0.049 0.0131 0.0401 0.0746 0.0246 ...
## $ concavity_se : num 0.0537 0.0186 0.0383 0.0566 0.0569 ...
## $ concave.points_se : num 0.0159 0.0134 0.0206 0.0187 0.0188 ...
## $ symmetry_se : num 0.03 0.0139 0.0225 0.0596 0.0176 ...
## $ fractal_dimension_se : num 0.00619 0.00353 0.00457 0.00921 0.00511 ...
## $ radius_worst : num 25.4 25 23.6 14.9 22.5 ...
## $ texture_worst : num 17.3 23.4 25.5 26.5 16.7 ...
## $ perimeter_worst : num 184.6 158.8 152.5 98.9 152.2 ...
## $ area_worst : num 2019 1956 1709 568 1575 ...
## $ smoothness_worst : num 0.162 0.124 0.144 0.21 0.137 ...
## $ compactness_worst : num 0.666 0.187 0.424 0.866 0.205 ...
## $ concavity_worst : num 0.712 0.242 0.45 0.687 0.4 ...
## $ concave.points_worst : num 0.265 0.186 0.243 0.258 0.163 ...
## $ symmetry_worst : num 0.46 0.275 0.361 0.664 0.236 ...
## $ fractal_dimension_worst: num 0.1189 0.089 0.0876 0.173 0.0768 ...
From the data preview above, the target variable and features can be identified:
Target Variable
The target feature is diagnosis, with levels “M” for malignant and “B” for benign. Across the 569 examples, the classes are distributed as follows:
round(prop.table(table(wbcd$diagnosis)), digits = 2)
##
## B M
## 0.63 0.37
Features
The remaining 30 features comprise the mean, standard error, and worst value of ten cell-nucleus measurements: radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry, and fractal dimension.
Data visualization can help to identify preliminary connections and trends in the dataset. For illustration purposes, only radius_mean, area_mean, and smoothness_mean will be compared in the visualizations.
First, the correlation between radius_mean, area_mean, and smoothness_mean:
x <- wbcd[, c("radius_mean", "area_mean", "smoothness_mean")]
names(x) <- c("rad", "area", "smo") # short names to keep the matrix readable
cor(x)
## rad area smo
## rad 1.0000000 0.9873572 0.1705812
## area 0.9873572 1.0000000 0.1770284
## smo 0.1705812 0.1770284 1.0000000
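Since qtlcharts is loaded at the top, the same correlation matrix can also be inspected interactively. A minimal sketch, assuming x is the three-column subset defined above:
# Interactive correlation plot of the three selected features
iplotCorr(as.matrix(x))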
With a few grammar-of-graphics tricks, the characteristics of malignant and benign cells are easy to tell apart. Inspecting the visualizations together with the correlation matrix above, it can be observed that radius_mean and area_mean are almost perfectly correlated (unsurprising, since both measure size), while smoothness_mean is only weakly related to either.
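The ggvis plots themselves are not reproduced here, but a representative one is easy to sketch; the pairing of radius_mean against smoothness_mean, coloured by diagnosis, is an illustrative choice rather than the original figure:
# Scatter plot of two features, coloured by diagnosis
wbcd %>% ggvis(x = ~radius_mean, y = ~smoothness_mean, fill = ~diagnosis) %>%
  layer_points()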
It is necessary to normalize the dataset when using kNN because of its dependency on distance. The range of smoothness_mean is roughly 0.11 while that of area_mean is roughly 2357.5, so without rescaling, area would dominate the distance calculation.
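The quoted ranges come straight from the data:
# Feature ranges differ by roughly four orders of magnitude
diff(range(wbcd$smoothness_mean)) # ~0.11
diff(range(wbcd$area_mean))       # ~2357.5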
Min-max normalization will be used on the features. A helper function is created and applied to the dataset as below:
normalize <- function(x) {
  return((x - min(x)) / (max(x) - min(x)))
}
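A quick sanity check on an arbitrary vector confirms the rescaling to [0, 1]:
# Should print 0.00 0.25 0.50 0.75 1.00
normalize(c(1, 2, 3, 4, 5))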
Applying the normalization function to our features in the dataset:
wbcd_n <- as.data.frame(lapply(wbcd[3:32], normalize))
wbcd_nfull <- cbind(wbcd[1:2], wbcd_n)
summary(wbcd_n$area_mean)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.1174 0.1729 0.2169 0.2711 1.0000
As seen in the summary above for the area_mean feature, the values now run from 0 to 1, confirming that the features have been normalized.
Data preparation involves creating a training set and a test set from the dataset. Because the dataset is already randomly ordered, there is no need to draw a random sample; the first 469 rows are simply taken for training and the last 100 for testing. The matching training and test labels are prepared in the same step:
wbcd_train <- wbcd_n[1:469,]
wbcd_test <- wbcd_n[470:569,]
#Preparing train and test labels
wbcd_train_labels <- wbcd[1:469, 2]
wbcd_test_labels <- wbcd[470:569, 2]
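Since the split relies on the existing row order rather than on random sampling, it is worth confirming (a check not in the original write-up) that both subsets contain a similar mix of diagnoses:
# Class proportions should be comparable across the split
round(prop.table(table(wbcd_train_labels)), 2)
round(prop.table(table(wbcd_test_labels)), 2)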
The knn() function from the class package is used here. For the parameter k, a value of 21 is chosen since sqrt(469) ≈ 21.66; bear in mind that a nearby odd number is picked to avoid tied votes.
wbcd_pred <- knn(train = wbcd_train, test = wbcd_test, cl = wbcd_train_labels, k = 21)
To evaluate the model’s accuracy, the CrossTable() function is used:
CrossTable(x = wbcd_test_labels, y = wbcd_pred, prop.chisq = FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 100
##
##
## | wbcd_pred
## wbcd_test_labels | B | M | Row Total |
## -----------------|-----------|-----------|-----------|
## B | 77 | 0 | 77 |
## | 1.000 | 0.000 | 0.770 |
## | 0.975 | 0.000 | |
## | 0.770 | 0.000 | |
## -----------------|-----------|-----------|-----------|
## M | 2 | 21 | 23 |
## | 0.087 | 0.913 | 0.230 |
## | 0.025 | 1.000 | |
## | 0.020 | 0.210 | |
## -----------------|-----------|-----------|-----------|
## Column Total | 79 | 21 | 100 |
## | 0.790 | 0.210 | |
## -----------------|-----------|-----------|-----------|
##
##
From the CrossTable, the model’s accuracy is 98%: only 2 of the 100 test cases were predicted wrongly.
Both errors are false negatives: malignant masses predicted as benign. This is arguably the most dangerous kind of error, as patients might believe they are cancer-free when in reality they are not. The full listing of test IDs with actual and predicted diagnoses follows:
data.frame("Test IDs" = wbcd[470:569, 1], "Diagnosis" = wbcd_test_labels, "Predicted" = wbcd_pred)
## Test.IDs Diagnosis Predicted
## 1 911366 B B
## 2 9113778 B B
## 3 9113816 B B
## 4 911384 B B
## 5 9113846 B B
## 6 911391 B B
## 7 911408 B B
## 8 911654 B B
## 9 911673 B B
## 10 911685 B B
## 11 911916 M M
## 12 912193 B B
## 13 91227 B B
## 14 912519 B B
## 15 912558 B B
## 16 912600 B B
## 17 913063 B B
## 18 913102 B B
## 19 913505 M M
## 20 913512 B B
## 21 913535 M B
## 22 91376701 B B
## 23 91376702 B B
## 24 914062 M M
## 25 914101 B B
## 26 914102 B B
## 27 914333 B B
## 28 914366 B B
## 29 914580 B B
## 30 914769 M M
## 31 91485 M M
## 32 914862 B B
## 33 91504 M M
## 34 91505 B B
## 35 915143 M M
## 36 915186 B B
## 37 915276 B B
## 38 91544001 B B
## 39 91544002 B B
## 40 915452 B B
## 41 915460 M M
## 42 91550 B B
## 43 915664 B B
## 44 915691 M M
## 45 915940 B B
## 46 91594602 M B
## 47 916221 B B
## 48 916799 M M
## 49 916838 M M
## 50 917062 B B
## 51 917080 B B
## 52 917092 B B
## 53 91762702 M M
## 54 91789 B B
## 55 917896 B B
## 56 917897 B B
## 57 91805 B B
## 58 91813701 B B
## 59 91813702 B B
## 60 918192 B B
## 61 918465 B B
## 62 91858 B B
## 63 91903901 B B
## 64 91903902 B B
## 65 91930402 M M
## 66 919537 B B
## 67 919555 M M
## 68 91979701 M M
## 69 919812 B B
## 70 921092 B B
## 71 921362 B B
## 72 921385 B B
## 73 921386 B B
## 74 921644 B B
## 75 922296 B B
## 76 922297 B B
## 77 922576 B B
## 78 922577 B B
## 79 922840 B B
## 80 923169 B B
## 81 923465 B B
## 82 923748 B B
## 83 923780 B B
## 84 924084 B B
## 85 924342 B B
## 86 924632 B B
## 87 924934 B B
## 88 924964 B B
## 89 925236 B B
## 90 925277 B B
## 91 925291 B B
## 92 925292 B B
## 93 925311 B B
## 94 925622 M M
## 95 926125 M M
## 96 926424 M M
## 97 926682 M M
## 98 926954 M M
## 99 927241 M M
## 100 92751 B B
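Rather than scanning the full listing, the two false negatives (IDs 913535 and 91594602) can be pulled out directly; the results data frame below is a small helper added for illustration:
# Show only the misclassified test cases
results <- data.frame(id = wbcd[470:569, 1],
                      actual = wbcd_test_labels,
                      predicted = wbcd_pred)
results[results$actual != results$predicted, ]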
To try to improve the model’s accuracy and reduce its false negatives, z-score standardization is tested next. Unlike min-max normalization, z-scores have no fixed minimum and maximum, so extreme values are not compressed towards the boundaries; the hope is that unusually large malignant tumours, which can grow uncontrollably, keep more weight in the distance calculation.
#Z-Score Normalization
wbcd_z <- as.data.frame(scale(wbcd[c(-1,-2)]))
#train and test set
wbcd_z_train <- wbcd_z[1:469, ]
wbcd_z_test <- wbcd_z[470:569, ]
#train and test labels
wbcd_z_train_labels <- wbcd[1:469, 2]
wbcd_z_test_labels <- wbcd[470:569, 2]
#model training
wbcd_z_pred <- knn(train = wbcd_z_train, test = wbcd_z_test, cl = wbcd_z_train_labels, k = 19)
#model evaluation
CrossTable(x = wbcd_z_test_labels, y = wbcd_z_pred, prop.chisq = FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 100
##
##
## | wbcd_z_pred
## wbcd_z_test_labels | B | M | Row Total |
## -------------------|-----------|-----------|-----------|
## B | 77 | 0 | 77 |
## | 1.000 | 0.000 | 0.770 |
## | 0.963 | 0.000 | |
## | 0.770 | 0.000 | |
## -------------------|-----------|-----------|-----------|
## M | 3 | 20 | 23 |
## | 0.130 | 0.870 | 0.230 |
## | 0.037 | 1.000 | |
## | 0.030 | 0.200 | |
## -------------------|-----------|-----------|-----------|
## Column Total | 80 | 20 | 100 |
## | 0.800 | 0.200 | |
## -------------------|-----------|-----------|-----------|
##
##
In this case, z-score standardization does not improve the model: it produces 3 false negatives instead of 2, for an accuracy of 97% against the 98% achieved with min-max normalization.
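Another common tuning step, not performed above, is to vary k and watch how the errors move; a hypothetical sweep over the min-max normalized split might look like:
# Hypothetical sweep over several odd values of k
for (k in c(1, 5, 11, 15, 21, 27)) {
  pred <- knn(train = wbcd_train, test = wbcd_test, cl = wbcd_train_labels, k = k)
  fn <- sum(pred == "B" & wbcd_test_labels == "M") # false negatives
  cat("k =", k, "| errors:", sum(pred != wbcd_test_labels), "| false negatives:", fn, "\n")
}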
Unlike many classification algorithms, kNN performs no explicit learning: it simply stores the training data verbatim. Unlabelled test examples are then matched to the most similar records in the training set using a distance function, and each example is assigned the majority label of its k nearest neighbours.
Although kNN is a very simple algorithm, it is capable of tackling extremely complex tasks, such as identifying cancerous masses.