Libraries Used

library(gmodels)   # CrossTable() for model evaluation
library(ggvis)     # grammar-of-graphics visualization
library(qtlcharts) # interactive correlation plot
library(class)     # knn() for k-nearest neighbours classification


Objective

To predict whether cells extracted from a patient’s breast mass are malignant or benign using the k-Nearest Neighbours (kNN) machine learning algorithm. The model will learn from measurements of biopsied cells taken from women with abnormal breast masses.


Step 1: Data Exploration

This analysis uses the Wisconsin Breast Cancer Diagnostic dataset from the UCI Machine Learning Repository at http://archive.ics.uci.edu/ml. The data was donated by researchers at the University of Wisconsin and includes measurements from digitized images of fine-needle aspirates of breast masses.

Data Preview
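The previews below assume the data has already been read into a data frame named wbcd. A minimal loading sketch (the file name wisc_bc_data.csv is an assumption, not part of the original analysis):

# hypothetical file name; stringsAsFactors = FALSE keeps diagnosis as character, matching str() below
wbcd <- read.csv("wisc_bc_data.csv", stringsAsFactors = FALSE)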

str(wbcd)
## 'data.frame':    569 obs. of  32 variables:
##  $ id                     : int  842302 842517 84300903 84348301 84358402 843786 844359 84458202 844981 84501001 ...
##  $ diagnosis              : chr  "M" "M" "M" "M" ...
##  $ radius_mean            : num  18 20.6 19.7 11.4 20.3 ...
##  $ texture_mean           : num  10.4 17.8 21.2 20.4 14.3 ...
##  $ perimeter_mean         : num  122.8 132.9 130 77.6 135.1 ...
##  $ area_mean              : num  1001 1326 1203 386 1297 ...
##  $ smoothness_mean        : num  0.1184 0.0847 0.1096 0.1425 0.1003 ...
##  $ compactness_mean       : num  0.2776 0.0786 0.1599 0.2839 0.1328 ...
##  $ concavity_mean         : num  0.3001 0.0869 0.1974 0.2414 0.198 ...
##  $ concave.points_mean    : num  0.1471 0.0702 0.1279 0.1052 0.1043 ...
##  $ symmetry_mean          : num  0.242 0.181 0.207 0.26 0.181 ...
##  $ fractal_dimension_mean : num  0.0787 0.0567 0.06 0.0974 0.0588 ...
##  $ radius_se              : num  1.095 0.543 0.746 0.496 0.757 ...
##  $ texture_se             : num  0.905 0.734 0.787 1.156 0.781 ...
##  $ perimeter_se           : num  8.59 3.4 4.58 3.44 5.44 ...
##  $ area_se                : num  153.4 74.1 94 27.2 94.4 ...
##  $ smoothness_se          : num  0.0064 0.00522 0.00615 0.00911 0.01149 ...
##  $ compactness_se         : num  0.049 0.0131 0.0401 0.0746 0.0246 ...
##  $ concavity_se           : num  0.0537 0.0186 0.0383 0.0566 0.0569 ...
##  $ concave.points_se      : num  0.0159 0.0134 0.0206 0.0187 0.0188 ...
##  $ symmetry_se            : num  0.03 0.0139 0.0225 0.0596 0.0176 ...
##  $ fractal_dimension_se   : num  0.00619 0.00353 0.00457 0.00921 0.00511 ...
##  $ radius_worst           : num  25.4 25 23.6 14.9 22.5 ...
##  $ texture_worst          : num  17.3 23.4 25.5 26.5 16.7 ...
##  $ perimeter_worst        : num  184.6 158.8 152.5 98.9 152.2 ...
##  $ area_worst             : num  2019 1956 1709 568 1575 ...
##  $ smoothness_worst       : num  0.162 0.124 0.144 0.21 0.137 ...
##  $ compactness_worst      : num  0.666 0.187 0.424 0.866 0.205 ...
##  $ concavity_worst        : num  0.712 0.242 0.45 0.687 0.4 ...
##  $ concave.points_worst   : num  0.265 0.186 0.243 0.258 0.163 ...
##  $ symmetry_worst         : num  0.46 0.275 0.361 0.664 0.236 ...
##  $ fractal_dimension_worst: num  0.1189 0.089 0.0876 0.173 0.0768 ...

Features

From the data preview above, the target variable and features can be identified:

Target Variable
The target variable is diagnosis, with levels “M” for malignant and “B” for benign. Of the 569 examples, we have the following proportions:

round(prop.table(table(wbcd$diagnosis)), digits = 2)
## 
##    B    M 
## 0.63 0.37

Features
The remaining 30 features comprise the mean, standard error, and worst (largest) value of each of the following ten measurements:

  • Radius
  • Texture
  • Perimeter
  • Area
  • Smoothness
  • Compactness
  • Concavity
  • Concave Points
  • Symmetry
  • Fractal Dimension

Visualization

Data visualization can help identify preliminary connections and trends in the dataset. For illustration purposes, only radius_mean, area_mean, and smoothness_mean will be compared in the visualizations.

Correlation Plot

Comparing the correlations among radius_mean, area_mean, and smoothness_mean.
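The object x below is assumed to be a small data frame holding just these three features, with column names shortened to match the output; a sketch of how it might have been built:

# assumed construction of x: subset the three features and shorten names for display
x <- wbcd[, c("radius_mean", "area_mean", "smoothness_mean")]
names(x) <- c("rad", "area", "smo")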

cor(x)
##            rad      area       smo
## rad  1.0000000 0.9873572 0.1705812
## area 0.9873572 1.0000000 0.1770284
## smo  0.1705812 0.1770284 1.0000000
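Since qtlcharts was loaded for correlation plotting, an interactive version of this matrix can be drawn with iplotCorr(); a minimal sketch, where grouping the points by diagnosis is an assumption about the original figure:

# interactive correlation matrix with linked scatterplots, coloured by diagnosis
iplotCorr(x, group = wbcd$diagnosis)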

Scatterplots

With a few tricks from the grammar of graphics, we can easily tell apart the characteristics of malignant and benign cells.
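A minimal sketch of how the two scatterplots below could be produced with ggvis (colouring the points by diagnosis is an assumption about the original figures):

# radius_mean vs area_mean, points coloured by diagnosis
wbcd %>% ggvis(~radius_mean, ~area_mean, fill = ~diagnosis) %>% layer_points()

# radius_mean vs smoothness_mean, points coloured by diagnosis
wbcd %>% ggvis(~radius_mean, ~smoothness_mean, fill = ~diagnosis) %>% layer_points()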

Radius_Mean vs Area_Mean


Radius_Mean vs Smoothness_Mean

Preliminary Insights

After inspecting the visualizations and getting a glimpse of the dataset, it can be observed that:

  • Malignant cells tend to have a higher radius, area, and smoothness
  • Benign cells tend to have a lower radius, area, and smoothness


Step 2: Data Preprocessing

It is necessary to normalize the dataset when using kNN because of its dependency on distance. As the range of smoothness_mean is about 0.11 while the range of area_mean is about 2357.5, area would have a far larger impact on the distance calculation than smoothness if the features were left unscaled.
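These ranges can be checked directly from the raw features; a quick sketch:

# ranges of the unscaled features; area_mean spans roughly 2357.5 units, smoothness_mean only about 0.11
diff(range(wbcd$smoothness_mean))
diff(range(wbcd$area_mean))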

Normalization

Traditional min-max normalization will be used on the features. A normalization function is created and then applied to the dataset as below:

normalize <- function(x) {
  return((x - min(x)) / (max(x) - min(x)))
}
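A quick sanity check of the function on a toy vector:

normalize(c(1, 2, 3, 4, 5))  # expected: 0.00 0.25 0.50 0.75 1.00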

Applying the normalization function to our features in the dataset:

wbcd_n <- as.data.frame(lapply(wbcd[3:32], normalize))
wbcd_nfull <- cbind(wbcd[1:2], wbcd_n)
summary(wbcd_n$area_mean)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.1174  0.1729  0.2169  0.2711  1.0000

As seen in the summary above for the feature area_mean, it is confirmed that the features have been normalized: the values now range from 0 to 1.

Data Preparation

Data preparation involves creating a training set and a test set from our dataset. Because the records are already in random order, there is no need to draw a random sample; the data can simply be split sequentially into training and test sets.

The training and test labels will also be prepared in this section.

The training and test sets will be divided as follows:

  • First 469 entries used as the training set
  • Last 100 entries used as the test set

wbcd_train <- wbcd_n[1:469, ]
wbcd_test <- wbcd_n[470:569, ]

#Preparing train and test labels
wbcd_train_labels <- wbcd[1:469, 2]
wbcd_test_labels <- wbcd[470:569, 2]


Step 3: Model Training

The knn() function from the class package will be used here. For the parameter k, a value of k = 21 is chosen since sqrt(469) is about 21.66. Bear in mind that the nearest odd number is chosen to avoid ties.
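As a quick check of this heuristic:

# k is set near the square root of the number of training examples
sqrt(nrow(wbcd_train))  # about 21.66, so k = 21 (nearest odd number)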

wbcd_pred <- knn(train = wbcd_train, test = wbcd_test, cl = wbcd_train_labels, k = 21)


Step 4: Model Evaluation

To evaluate the model’s accuracy, the CrossTable() function is used:

CrossTable(x = wbcd_test_labels, y = wbcd_pred, prop.chisq = FALSE)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  100 
## 
##  
##                  | wbcd_pred 
## wbcd_test_labels |         B |         M | Row Total | 
## -----------------|-----------|-----------|-----------|
##                B |        77 |         0 |        77 | 
##                  |     1.000 |     0.000 |     0.770 | 
##                  |     0.975 |     0.000 |           | 
##                  |     0.770 |     0.000 |           | 
## -----------------|-----------|-----------|-----------|
##                M |         2 |        21 |        23 | 
##                  |     0.087 |     0.913 |     0.230 | 
##                  |     0.025 |     1.000 |           | 
##                  |     0.020 |     0.210 |           | 
## -----------------|-----------|-----------|-----------|
##     Column Total |        79 |        21 |       100 | 
##                  |     0.790 |     0.210 |           | 
## -----------------|-----------|-----------|-----------|
## 
## 

From the CrossTable, the model’s accuracy is 98%, as only 2 of the 100 test entries were predicted incorrectly.
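The accuracy can also be computed directly as a quick cross-check:

# proportion of test cases where the prediction matches the true label
mean(wbcd_pred == wbcd_test_labels)  # 0.98, per the cross table above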

The two misclassified entries are false negatives (malignant masses predicted as benign). This is arguably the most dangerous type of error, as patients might think they are cancer free when in reality they are not.

Expanded Predicted Data

data.frame("Test IDs" = wbcd[470:569, 1], "Diagnosis" = wbcd_test_labels, "Predicted" = wbcd_pred)
##     Test.IDs Diagnosis Predicted
## 1     911366         B         B
## 2    9113778         B         B
## 3    9113816         B         B
## 4     911384         B         B
## 5    9113846         B         B
## 6     911391         B         B
## 7     911408         B         B
## 8     911654         B         B
## 9     911673         B         B
## 10    911685         B         B
## 11    911916         M         M
## 12    912193         B         B
## 13     91227         B         B
## 14    912519         B         B
## 15    912558         B         B
## 16    912600         B         B
## 17    913063         B         B
## 18    913102         B         B
## 19    913505         M         M
## 20    913512         B         B
## 21    913535         M         B
## 22  91376701         B         B
## 23  91376702         B         B
## 24    914062         M         M
## 25    914101         B         B
## 26    914102         B         B
## 27    914333         B         B
## 28    914366         B         B
## 29    914580         B         B
## 30    914769         M         M
## 31     91485         M         M
## 32    914862         B         B
## 33     91504         M         M
## 34     91505         B         B
## 35    915143         M         M
## 36    915186         B         B
## 37    915276         B         B
## 38  91544001         B         B
## 39  91544002         B         B
## 40    915452         B         B
## 41    915460         M         M
## 42     91550         B         B
## 43    915664         B         B
## 44    915691         M         M
## 45    915940         B         B
## 46  91594602         M         B
## 47    916221         B         B
## 48    916799         M         M
## 49    916838         M         M
## 50    917062         B         B
## 51    917080         B         B
## 52    917092         B         B
## 53  91762702         M         M
## 54     91789         B         B
## 55    917896         B         B
## 56    917897         B         B
## 57     91805         B         B
## 58  91813701         B         B
## 59  91813702         B         B
## 60    918192         B         B
## 61    918465         B         B
## 62     91858         B         B
## 63  91903901         B         B
## 64  91903902         B         B
## 65  91930402         M         M
## 66    919537         B         B
## 67    919555         M         M
## 68  91979701         M         M
## 69    919812         B         B
## 70    921092         B         B
## 71    921362         B         B
## 72    921385         B         B
## 73    921386         B         B
## 74    921644         B         B
## 75    922296         B         B
## 76    922297         B         B
## 77    922576         B         B
## 78    922577         B         B
## 79    922840         B         B
## 80    923169         B         B
## 81    923465         B         B
## 82    923748         B         B
## 83    923780         B         B
## 84    924084         B         B
## 85    924342         B         B
## 86    924632         B         B
## 87    924934         B         B
## 88    924964         B         B
## 89    925236         B         B
## 90    925277         B         B
## 91    925291         B         B
## 92    925292         B         B
## 93    925311         B         B
## 94    925622         M         M
## 95    926125         M         M
## 96    926424         M         M
## 97    926682         M         M
## 98    926954         M         M
## 99    927241         M         M
## 100    92751         B         B


Step 5: Model Improvement

To improve the model’s accuracy and reduce its false negatives, z-score standardization is tested. Z-scores have no predefined minimum and maximum, so extreme values are not compressed towards the centre. Since malignant tumours can grow uncontrollably and produce extreme outliers, it may be reasonable to let such outliers carry more weight in the distance calculation.

#Z-Score Normalization
wbcd_z <- as.data.frame(scale(wbcd[c(-1,-2)]))

#train and test set
wbcd_z_train <- wbcd_z[1:469, ]
wbcd_z_test <- wbcd_z[470:569, ]

#train and test labels
wbcd_z_train_labels <- wbcd[1:469, 2]
wbcd_z_test_labels <- wbcd[470:569, 2]

#model training
wbcd_z_pred <- knn(train = wbcd_z_train, test = wbcd_z_test, cl = wbcd_z_train_labels, k = 19)

#model evaluation
CrossTable(x = wbcd_z_test_labels, y = wbcd_z_pred, prop.chisq = FALSE)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  100 
## 
##  
##                    | wbcd_z_pred 
## wbcd_z_test_labels |         B |         M | Row Total | 
## -------------------|-----------|-----------|-----------|
##                  B |        77 |         0 |        77 | 
##                    |     1.000 |     0.000 |     0.770 | 
##                    |     0.963 |     0.000 |           | 
##                    |     0.770 |     0.000 |           | 
## -------------------|-----------|-----------|-----------|
##                  M |         3 |        20 |        23 | 
##                    |     0.130 |     0.870 |     0.230 | 
##                    |     0.037 |     1.000 |           | 
##                    |     0.030 |     0.200 |           | 
## -------------------|-----------|-----------|-----------|
##       Column Total |        80 |        20 |       100 | 
##                    |     0.800 |     0.200 |           | 
## -------------------|-----------|-----------|-----------|
## 
## 


Conclusion

In this case, z-score standardization does not improve the model: it yields 97% accuracy with 3 false negatives, slightly worse than the 98% accuracy (2 false negatives) already achieved with min-max normalization.

Unlike many classification algorithms, kNN does not perform any learning. It simply stores the training data verbatim. Unlabelled test examples are then matched to the most similar records in the training set using a distance function, and each unlabelled example is assigned the majority label of its k nearest neighbours.

Although kNN is a very simple algorithm, it is capable of tackling extremely complex tasks, such as the identification of cancerous masses.