Introduction
Routine breast cancer screening allows the disease to be diagnosed and treated prior to it causing noticeable symptoms. The process of early detection involves examining the breast tissue for abnormal lumps or masses. If a lump is found, a fine-needle aspiration biopsy is performed, which utilizes a hollow needle to extract a small portion of cells from the mass. A clinician then examines the cells under a microscope to determine whether the mass is likely to be malignant or benign. We will investigate the utility of machine learning for detecting cancer by applying the kNN algorithm to measurements of biopsied cells from women with abnormal breast masses.
Nearest Neighbour
In general, nearest neighbor classifiers are well-suited for classification tasks where relationships among the features and the target classes are numerous, complicated, or otherwise extremely difficult to understand, yet the items of similar class type tend to be fairly homogeneous. Another way of putting it would be to say that if a concept is difficult to define, but you know it when you see it, then nearest neighbors might be appropriate. In a single sentence, nearest neighbor classifiers are defined by their characteristic of classifying unlabeled examples by assigning them the class of the most similar labeled examples. The nearest neighbors approach to classification is utilized by the kNN algorithm.
kNN algorithm
The kNN algorithm begins with a training dataset made up of examples that are classified into several categories, as labeled by a nominal variable. For each record in the test dataset, kNN identifies k records in the training data that are the “nearest” in similarity, where k is an integer specified in advance. The unlabeled test instance is assigned the class of the majority of the k nearest neighbors.
The kNN algorithm treats the features as coordinates in a multidimensional feature space.
There are many different ways to calculate distance. Traditionally, the kNN algorithm uses Euclidean distance, which is the distance one would measure if you could use a ruler to connect two points.
{d(p,q)={}.}
Data Loading and Exploration
We now proceed by setting up the environment. We first set the seed, even if we are not generating any random data, it is a good practice to follow as it helps in making our research reproducible. We now use the “readr” package in order to import csv file.Read the data using the “read.csv” command and save it into a dataframe which we have named as ‘wdbc’. We rename the variable columns using the tidyverse paclage and rename function. Using the command str(wdbc), we can confirm that the data is structured with 569 examples and 32 features as we expected. Here the variables are not named in the given dataset so we need to do that from the UC Irvine Machine Learning Repository(source of this research data). The first variable is an integer variable named id. As this is simply a unique identifier (ID) for each patient in the data, it does not provide useful information and we will need to exclude it from the model.
## 'data.frame': 569 obs. of 32 variables:
## $ ID : int 842302 842517 84300903 84348301 84358402 843786 844359 84458202 844981 84501001 ...
## $ Diagnosis : chr "M" "M" "M" "M" ...
## $ radius_mean : num 18 20.6 19.7 11.4 20.3 ...
## $ perimeter_mean : num 10.4 17.8 21.2 20.4 14.3 ...
## $ area_mean : num 122.8 132.9 130 77.6 135.1 ...
## $ smoothness : num 1001 1326 1203 386 1297 ...
## $ compactness : num 0.1184 0.0847 0.1096 0.1425 0.1003 ...
## $ concavity : num 0.2776 0.0786 0.1599 0.2839 0.1328 ...
## $ concave_points : num 0.3001 0.0869 0.1974 0.2414 0.198 ...
## $ symmetry : num 0.1471 0.0702 0.1279 0.1052 0.1043 ...
## $ fractal_dimension : num 0.242 0.181 0.207 0.26 0.181 ...
## $ radius_mean2 : num 0.0787 0.0567 0.06 0.0974 0.0588 ...
## $ perimeter_mean2 : num 1.095 0.543 0.746 0.496 0.757 ...
## $ area_mean2 : num 0.905 0.734 0.787 1.156 0.781 ...
## $ smoothness2 : num 8.59 3.4 4.58 3.44 5.44 ...
## $ compactness2 : num 153.4 74.1 94 27.2 94.4 ...
## $ concavity2 : num 0.0064 0.00522 0.00615 0.00911 0.01149 ...
## $ concave_points2 : num 0.049 0.0131 0.0401 0.0746 0.0246 ...
## $ symmetry2 : num 0.0537 0.0186 0.0383 0.0566 0.0569 ...
## $ fractal_dimension2: num 0.0159 0.0134 0.0206 0.0187 0.0188 ...
## $ radius_mean3 : num 0.03 0.0139 0.0225 0.0596 0.0176 ...
## $ perimeter_mean3 : num 0.00619 0.00353 0.00457 0.00921 0.00511 ...
## $ area_mean3 : num 25.4 25 23.6 14.9 22.5 ...
## $ smoothness3 : num 17.3 23.4 25.5 26.5 16.7 ...
## $ compactness3 : num 184.6 158.8 152.5 98.9 152.2 ...
## $ concavity3 : num 2019 1956 1709 568 1575 ...
## $ concave_points3 : num 0.162 0.124 0.144 0.21 0.137 ...
## $ symmetry3 : num 0.666 0.187 0.424 0.866 0.205 ...
## $ fractal_dimension3: num 0.712 0.242 0.45 0.687 0.4 ...
## $ V30 : num 0.265 0.186 0.243 0.258 0.163 ...
## $ V31 : num 0.46 0.275 0.361 0.664 0.236 ...
## $ V32 : num 0.1189 0.089 0.0876 0.173 0.0768 ...
The next variable, diagnosis, is of particular interest, as it is the outcome we hope to predict. This feature indicates whether the example is from a benign or malignant mass. The table() output indicates that 357 masses are benign while 212 are malignant. Many R machine learning classifiers require that the target feature is coded as a factor, so we will need to recode the diagnosis variable. We will also take this opportunity to give the B and M values more informative labels using the labels parameter. Now, when we look at the prop.table() output, we notice that the values have been labeled Benign and Malignant, with 62.7 percent and 37.3 percent of the masses, respectively. The remaining 30 features are all numeric, and as expected, consist of three different measurements of ten characteristics. For illustrative purposes, we will only take a closer look at three of the features.
##
## B M
## 357 212
wdbc$Diagnosis <- factor(wdbc$Diagnosis, levels = c("B", "M"),
labels = c("Benign", "Malignant"))
round(prop.table(table(wdbc$Diagnosis)) * 100, digits = 1)##
## Benign Malignant
## 62.7 37.3
## radius_mean area_mean smoothness
## Min. : 6.981 Min. : 43.79 Min. : 143.5
## 1st Qu.:11.700 1st Qu.: 75.17 1st Qu.: 420.3
## Median :13.370 Median : 86.24 Median : 551.1
## Mean :14.127 Mean : 91.97 Mean : 654.9
## 3rd Qu.:15.780 3rd Qu.:104.10 3rd Qu.: 782.7
## Max. :28.110 Max. :188.50 Max. :2501.0
Preparing the data
Features are typically transformed to a standard range prior to applying the kNN algorithm. The rationale for this step is that the distance formula is dependent on how features are measured. In particular, if certain features have much larger values than others, the distance measurements will be strongly dominated by the larger values. The traditional method of rescaling features for kNN is min-max normalization. This process transforms a feature such that all of its values fall in a range between 0 and 1. The formula for normalizing a feature is as follows. Essentially, the formula subtracts the minimum of feature X from each value and divides by the range of X.
Xnew = (x - min(x)) / (max(x) - min(x))
To normalize these features, we need to create a normalize() function in R.After executing the previous code, the normalize() function is available for use. Let’s test the function on a couple of vectors.
normalize <- function(x) {
return ((x - min(x)) / (max(x) - min(x)))} #creating a normalize() function
normalize(c(1, 2, 3, 4, 5))## [1] 0.00 0.25 0.50 0.75 1.00
## [1] 0.00 0.25 0.50 0.75 1.00
We can now apply the normalize() function to the numeric features in our data frame. Rather than normalizing each of the 10 numeric variables individually, we will use one of R’s functions to automate the process. The lapply() function of R takes a list and applies a function to each element of the list. As a data frame is a list of equal-length vectors, we can use lapply() to apply normalize() to each feature in the data frame. The final step is to convert the list returned by lapply() to a data frame using the as.data.frame() function. The full process looks like this To confirm that the transformation was applied correctly, let’s look at one variable’s summary statistics. As expected, the area_mean variable, which originally ranged from 143.5 to 2501.0, now ranges from 0 to 1.
wdbc_n <- as.data.frame(lapply(wdbc[2:31], normalize)) #Applying normalize function to all the variables
summary(wdbc_n$area_mean) #Checking the effect of normalize on area variable## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.2168 0.2933 0.3329 0.4168 1.0000
Data preparation – creating training and test datasets
Although all 569 biopsies are labeled with a benign or malignant status, it is not very interesting to predict what we already know. Additionally, any performance measures we obtain during training may be misleading, as we do not know the extent to which cases has been overfitted, or how well it will generalize to unseen cases. A more interesting question is how well our learner performs on a dataset of unlabeled data. In the absence of new sample data, we can simulate this scenario by dividing our data into two portions: a training dataset that will be used to build the kNN model and a test dataset that will be used to estimate the predictive accuracy of the model. We will use the first 469 records for the training dataset and the remaining 100 to simulate new patients. Using the data extraction methods, we will split the wcbd_n data frame into the bc_train and bc_test data frames.
wdbc_train <- wdbc_n[1:469, ] #Creating training data
wdbc_test <- wdbc_n[470:569, ] #Creating test dataWhen we constructed our training and test data, we excluded the target variable, diagnosis. For training the kNN model, we will need to store these class labels in factor vectors, divided to the training and test datasets.
Training a model on the data
Equipped with our training data and labels vector, we are now ready to classify our unknown records. For the kNN algorithm, the training phase actually involves no model building—the process of training a lazy learner like kNN simply involves storing the input data in a structured format. To classify our test instances, we will use a kNN implementation from the class package, which provides a set of basic R functions for classification.
The knn() function in the class package provides a standard, classic implementation of the kNN algorithm. For each instance in the test data, the function will identify the k-nearest neighbors, using Euclidean distance, where k is a user-specified number. The test instance is classified by taking a “vote” among the k-Nearest Neighbors—specifically, this involves assigning the class of the majority of the k neighbors. A tie vote is broken at random. The image below shows the syntax of kNN algorithm.
Syntax of kNN Algorithm
As our training data includes 469 instances, we might try k = 21, an odd number roughly equal to the square root of 469. Using an odd number will reduce the chance of ending with a tie vote. Now we can use the knn() function to classify the test data
library(class)
wdbc_test_pred <- knn(train = wdbc_train, test = wdbc_test,
cl = wdbc_train_labels, k=21)The knn() function returns a factor vector of predicted labels for each of the examples in the test dataset, which we have assigned to wbcd_test_pred.
Evaluating model performance
The next step of the process is to evaluate how well the predicted classes in the wbcd_test_pred vector match up with the known values in the wbcd_test_labels vector. To do this, we can use the CrossTable() function in the gmodels package.
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 100
##
##
## | wdbc_test_pred
## wdbc_test_labels | Benign | Malignant | Row Total |
## -----------------|-----------|-----------|-----------|
## Benign | 77 | 0 | 77 |
## | 1.000 | 0.000 | 0.770 |
## | 0.975 | 0.000 | |
## | 0.770 | 0.000 | |
## -----------------|-----------|-----------|-----------|
## Malignant | 2 | 21 | 23 |
## | 0.087 | 0.913 | 0.230 |
## | 0.025 | 1.000 | |
## | 0.020 | 0.210 | |
## -----------------|-----------|-----------|-----------|
## Column Total | 79 | 21 | 100 |
## | 0.790 | 0.210 | |
## -----------------|-----------|-----------|-----------|
##
##
The cell percentages in the table indicate the proportion of values that fall into four categories. In the top-left cell , are the true negative results. These 77 of 100 values indicate cases where the mass was benign, and the kNN algorithm correctly identified it as such. The bottom-right cell , indicates the true positive results, where the classifier and the clinically determined label agree that the mass is malignant. A total of 21 of 100 predictions were true positives.
The cells falling on the other diagonal contain counts of examples where the kNN approach disagreed with the true label. The 2 examples in the lower-left FN cell are false negative results; in this case, the predicted value was benign but the tumor was actually malignant. Errors in this direction could be extremely costly, as they might lead a patient to believe that she is cancer-free, when in reality the disease may continue to spread. The cell labeled FP would contain the false positive results, if there were any. These values occur when the model classifies a mass as malignant when in reality it was benign. Although such errors are less dangerous than a false negative result, they should also be avoided as they could lead to additional financial burden on the health care system, or additional stress for the patient, as additional tests or treatment may have to be provided.
A total of 2 percent, that is, 2 out of 100 masses were incorrectly classified by the kNN approach. While 98 percent accuracy seems impressive for a few lines of R code, we might try another iteration of the model to see if we can improve the performance and reduce the number of values that have been incorrectly classified, particularly, as the errors were dangerous false negatives.
Improving model performance
We will attempt two simple variations on our previous classifier. First, we will employ an alternative method for rescaling our numeric features. Second, we will try several different values for k.
Transformation – z-score standardization
Although normalization is traditionally used for kNN classification, it may not always be the most appropriate way to rescale features. Because z-score standardized values have no predefined minimum and maximum, extreme values are not compressed towards the center. One might suspect that with a malignant tumor, we might see some very extreme outliers, as the tumors grow uncontrollably. It might, therefore, be reasonable to allow the outliers to be weighted more heavily in the distance calculation. Let’s see whether z-score standardization can improve our predictive accuracy.
To standardize a vector, we can use R’s built in scale() function, which by default rescales values using the z-score standardization. The scale() function offers the additional benefit that it can be applied directly to a data frame, so we can avoid use of the lapply() function. To create a z-score standardized version of the wbcd data, we can use the following command, which rescales all features with the exception of diagnosis, and stores the result as a data frame in the wbcd_z variable.
To confirm that the transformation was applied correctly, we can look at the summary statistics.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.9828 -0.6913 -0.2358 0.0000 0.4992 3.9726
The mean of a z-score standardized variable should always be zero, and the range should be fairly compact. A z-score greater than 3 or less than -3 indicates an extremely rare value. The previous summary seems reasonable. As we had done before, we need to divide the data into training and test sets, then classify the test instances using the knn() function. We’ll then compare the predicted labels to the actual labels using CrossTable().
wdbc_train_z <- wdbc_z[1:469, ]
wdbc_test_z <- wdbc_z[470:569, ]
wdbc_train_labels <- wdbc[1:469, 1]
wdbc_test_labels <- wdbc[470:569, 1]
wdbc_test_pred_z <- knn(train = wdbc_train_z, test = wdbc_test_z,
cl = wdbc_train_labels, k=21)
CrossTable(x = wdbc_test_labels, y = wdbc_test_pred_z,
prop.chisq=FALSE)##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 100
##
##
## | wdbc_test_pred_z
## wdbc_test_labels | Benign | Malignant | Row Total |
## -----------------|-----------|-----------|-----------|
## Benign | 77 | 0 | 77 |
## | 1.000 | 0.000 | 0.770 |
## | 0.975 | 0.000 | |
## | 0.770 | 0.000 | |
## -----------------|-----------|-----------|-----------|
## Malignant | 2 | 21 | 23 |
## | 0.087 | 0.913 | 0.230 |
## | 0.025 | 1.000 | |
## | 0.020 | 0.210 | |
## -----------------|-----------|-----------|-----------|
## Column Total | 79 | 21 | 100 |
## | 0.790 | 0.210 | |
## -----------------|-----------|-----------|-----------|
##
##
Surprisingly, in the following table, the results of our new transformation show a same accuracy. This might not be the case with all the datasets.
Testing alternative values of k
We may be able do even better by examining performance across various values of k. Using the normalized training and test datasets, the same 100 records were classified using several different k values. The number of false negatives and false positives are shown for each iteration:
The given table shows how accuracy varies with k
Although the classifier was never perfect, the 1NN approach was able to avoid some of the false negatives at the expense of adding false positives.
Conclusion
In this project we learnt the kNN Algorithm and also in a few simple lines of R code, we were able to correctly identify whether a mass was malignant or benign 98 percent of the time. This shows the power of R tools and Machine Learning algorithms, and their usefullness when combined together.