Euclidean distance is measured “as the crow flies,” implying the shortest direct route. It is specified by the following formula, where p1 refers to the value of the first feature of example p and q1 refers to the value of the first feature of example q:

\[ dist(p, q) = \sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2 + ... + (p_n - q_n)^2} \]

An alternative is Manhattan distance, which is based on the paths a pedestrian would take by walking city blocks. For example, to classify a tomato with sweetness = 6 and crunchiness = 4, we can compute its Euclidean distance to each ingredient:

| ingredient | sweetness | crunchiness | food type | distance to the tomato |
|---|---|---|---|---|
| grape | 8 | 5 | fruit | sqrt((6 - 8)^2 + (4 - 5)^2) = 2.2 |
| green bean | 3 | 7 | vegetable | sqrt((6 - 3)^2 + (4 - 7)^2) = 4.2 |
| nuts | 3 | 6 | protein | sqrt((6 - 3)^2 + (4 - 6)^2) = 3.6 |
| orange | 7 | 3 | fruit | sqrt((6 - 7)^2 + (4 - 3)^2) = 1.4 |
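These distances can be reproduced in R with a small helper function (euclidean_dist() is an illustrative name for this sketch, not a function from any package):

# compute the Euclidean distance between two numeric feature vectors
euclidean_dist <- function(p, q) {
  sqrt(sum((p - q)^2))
}
tomato <- c(6, 4)                  # sweetness = 6, crunchiness = 4
euclidean_dist(tomato, c(8, 5))    # grape:      2.236068
euclidean_dist(tomato, c(3, 7))    # green bean: 4.242641
euclidean_dist(tomato, c(3, 6))    # nuts:       3.605551
euclidean_dist(tomato, c(7, 3))    # orange:     1.414214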
One common practice is to set k equal to the square root of the number of training examples. An alternative approach is to test several k values on a variety of test datasets and choose the one that delivers the best classification performance. Two considerations shape this choice:

- Choosing a large k reduces the impact of variance caused by noisy data, but it can bias the learner such that it runs the risk of ignoring small but important patterns.
- Using a single nearest neighbor allows noisy data or outliers to unduly influence the classification of examples.
The traditional method of rescaling features for kNN is min-max normalization. Normalized feature values can be interpreted as indicating how far, from 0 percent to 100 percent, the original value fell along the range between the original minimum and maximum:

\[X_{new} = \frac{X - Min(X)}{Max(X) - Min(X)} \]
Another common transformation is called z-score standardization:

\[X_{new} = \frac{X - Mean(X)}{StdDev(X)} \]
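A minimal sketch of this transformation as an R function is shown below (the standardize() name is illustrative; R's built-in scale() function, used later in this chapter, performs the same computation):

# z-score standardization: center on the mean, divide by the sample standard deviation
standardize <- function(x) {
  return ((x - mean(x)) / sd(x))
}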
Dummy coding: the Euclidean distance formula is not defined for nominal data. Therefore, to calculate the distance between nominal features, we need to convert them into a numeric format. A typical solution utilizes dummy coding, where a value of 1 indicates one category and 0 indicates the other.
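For example, a two-category nominal feature can be dummy coded with a simple ifelse() call; the short vector below is a made-up toy example, not part of the dataset used later:

# dummy code a two-category nominal feature: 1 for Malignant, 0 for Benign
diagnosis <- c("Malignant", "Benign", "Benign", "Malignant")
malignant <- ifelse(diagnosis == "Malignant", 1, 0)
malignant
## [1] 1 0 0 1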
We will investigate the utility of machine learning for detecting cancer by applying the kNN algorithm to measurements of biopsied cells from women with abnormal breast masses.
wbcd <- read.csv("wisc_bc_data.csv", stringsAsFactors = FALSE)
str(wbcd)
## 'data.frame': 569 obs. of 32 variables:
## $ id : int 87139402 8910251 905520 868871 9012568 906539 925291 87880 862989 89827 ...
## $ diagnosis : chr "B" "B" "B" "B" ...
## $ radius_mean : num 12.3 10.6 11 11.3 15.2 ...
## $ texture_mean : num 12.4 18.9 16.8 13.4 13.2 ...
## $ perimeter_mean : num 78.8 69.3 70.9 73 97.7 ...
## $ area_mean : num 464 346 373 385 712 ...
## $ smoothness_mean : num 0.1028 0.0969 0.1077 0.1164 0.0796 ...
## $ compactness_mean : num 0.0698 0.1147 0.078 0.1136 0.0693 ...
## $ concavity_mean : num 0.0399 0.0639 0.0305 0.0464 0.0339 ...
## $ points_mean : num 0.037 0.0264 0.0248 0.048 0.0266 ...
## $ symmetry_mean : num 0.196 0.192 0.171 0.177 0.172 ...
## $ dimension_mean : num 0.0595 0.0649 0.0634 0.0607 0.0554 ...
## $ radius_se : num 0.236 0.451 0.197 0.338 0.178 ...
## $ texture_se : num 0.666 1.197 1.387 1.343 0.412 ...
## $ perimeter_se : num 1.67 3.43 1.34 1.85 1.34 ...
## $ area_se : num 17.4 27.1 13.5 26.3 17.7 ...
## $ smoothness_se : num 0.00805 0.00747 0.00516 0.01127 0.00501 ...
## $ compactness_se : num 0.0118 0.03581 0.00936 0.03498 0.01485 ...
## $ concavity_se : num 0.0168 0.0335 0.0106 0.0219 0.0155 ...
## $ points_se : num 0.01241 0.01365 0.00748 0.01965 0.00915 ...
## $ symmetry_se : num 0.0192 0.035 0.0172 0.0158 0.0165 ...
## $ dimension_se : num 0.00225 0.00332 0.0022 0.00344 0.00177 ...
## $ radius_worst : num 13.5 11.9 12.4 11.9 16.2 ...
## $ texture_worst : num 15.6 22.9 26.4 15.8 15.7 ...
## $ perimeter_worst : num 87 78.3 79.9 76.5 104.5 ...
## $ area_worst : num 549 425 471 434 819 ...
## $ smoothness_worst : num 0.139 0.121 0.137 0.137 0.113 ...
## $ compactness_worst: num 0.127 0.252 0.148 0.182 0.174 ...
## $ concavity_worst : num 0.1242 0.1916 0.1067 0.0867 0.1362 ...
## $ points_worst : num 0.0939 0.0793 0.0743 0.0861 0.0818 ...
## $ symmetry_worst : num 0.283 0.294 0.3 0.21 0.249 ...
## $ dimension_worst : num 0.0677 0.0759 0.0788 0.0678 0.0677 ...
Regardless of the machine learning method, ID variables should always be excluded. Neglecting to do so can lead to erroneous findings because the ID can be used to uniquely “predict” each example. Therefore, a model that includes an identifier will most likely suffer from overfitting, and is not likely to generalize well to other data.
# drop the id feature altogether
wbcd <- wbcd[-1]
The next variable, diagnosis, is of particular interest, as it is the outcome we hope to predict.
table(wbcd$diagnosis)
##
## B M
## 357 212
Many R machine learning classifiers require that the target feature is coded as a factor, so we will need to recode the diagnosis variable. We will also take this opportunity to give the B and M values more informative labels using the labels parameter:
wbcd$diagnosis <- factor(wbcd$diagnosis, levels = c("B", "M"), labels = c("Benign", "Malignant"))
round(prop.table(table(wbcd$diagnosis)) * 100, digits = 1)
##
## Benign Malignant
## 62.7 37.3
summary(wbcd[c("radius_mean", "area_mean", "smoothness_mean")])
## radius_mean area_mean smoothness_mean
## Min. : 6.981 Min. : 143.5 Min. :0.05263
## 1st Qu.:11.700 1st Qu.: 420.3 1st Qu.:0.08637
## Median :13.370 Median : 551.1 Median :0.09587
## Mean :14.127 Mean : 654.9 Mean :0.09636
## 3rd Qu.:15.780 3rd Qu.: 782.7 3rd Qu.:0.10530
## Max. :28.110 Max. :2501.0 Max. :0.16340
Because smoothness_mean ranges from 0.05 to 0.16 while area_mean ranges from 143.5 to 2501.0, the impact of area is going to be much larger than smoothness in the distance calculation. This could potentially cause problems for our classifier, so let's apply normalization to rescale the features to a standard range of values. To normalize these features, we need to create a normalize() function in R. This function takes a vector x of numeric values and, for each value in x, subtracts the minimum value in x and divides by the range of values in x. Finally, the resulting vector is returned.
# min-max normalization: rescale each value of x to the range [0, 1]
normalize <- function(x) {
  return ((x - min(x)) / (max(x) - min(x)))
}
# test normalize function
normalize(c(1, 2, 3, 4, 5))
## [1] 0.00 0.25 0.50 0.75 1.00
normalize(c(10, 20, 30, 40, 50))
## [1] 0.00 0.25 0.50 0.75 1.00
Rather than normalizing each of the 30 numeric variables individually, we will use R's lapply() function to apply normalize() to each column and then convert the resulting list back to a data frame.
wbcd_n <- as.data.frame(lapply(wbcd[2:31], normalize))
summary(wbcd_n$area_mean)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.1174 0.1729 0.2169 0.2711 1.0000
We can simulate the scenario of making predictions on previously unseen data by dividing our data into two portions: a training dataset that will be used to build the kNN model and a test dataset that will be used to estimate the predictive accuracy of the model.
wbcd_train <- wbcd_n[1:469, ]
wbcd_test <- wbcd_n[470:569, ]
When we constructed our normalized training and test data, we excluded the target variable, diagnosis. For training the kNN model, we will need to store these class labels in factor vectors, split between the training and test datasets:
wbcd_train_labels <- wbcd[1:469, 1]
wbcd_test_labels <- wbcd[470:569, 1]
For the kNN algorithm, the training phase actually involves no model building–the process of training a lazy learner like kNN simply involves storing the input data in a structured format.
library(class)
## Warning: package 'class' was built under R version 3.2.5
# knn() syntax: the function returns a factor vector of predicted classes for each row in the test data frame
p <- knn(train, test, cl, k)
As our training data includes 469 instances, we might try k = 21, an odd number roughly equal to the square root of 469. Using an odd number will reduce the chance of ending with a tie vote.
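This rule of thumb can be checked with a one-line calculation, assuming the wbcd_train data frame created above:

# the square root of the 469 training examples suggests a k of about 21
sqrt(nrow(wbcd_train))
## [1] 21.65641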
wbcd_test_pred <- knn(train = wbcd_train, test = wbcd_test, cl = wbcd_train_labels, k = 21)
wbcd_test_pred
## [1] Benign Benign Benign Benign Malignant Benign Malignant
## [8] Benign Malignant Benign Malignant Benign Malignant Malignant
## [15] Benign Benign Malignant Benign Malignant Benign Malignant
## [22] Malignant Malignant Malignant Benign Benign Benign Benign
## [29] Malignant Malignant Malignant Benign Malignant Malignant Benign
## [36] Benign Benign Benign Benign Malignant Malignant Benign
## [43] Malignant Malignant Benign Malignant Malignant Malignant Malignant
## [50] Malignant Benign Benign Benign Benign Benign Benign
## [57] Benign Benign Malignant Benign Benign Benign Benign
## [64] Benign Malignant Malignant Benign Benign Benign Benign
## [71] Benign Malignant Benign Benign Malignant Malignant Benign
## [78] Benign Benign Benign Benign Benign Benign Malignant
## [85] Benign Benign Malignant Benign Benign Benign Benign
## [92] Malignant Benign Benign Benign Benign Benign Malignant
## [99] Benign Malignant
## Levels: Benign Malignant
The next step of the process is to evaluate how well the predicted classes in the wbcd_test_pred vector match up with the known values in the wbcd_test_labels vector.
library(gmodels)
## Warning: package 'gmodels' was built under R version 3.2.5
CrossTable(x = wbcd_test_labels, y = wbcd_test_pred, prop.chisq = FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 100
##
##
## | wbcd_test_pred
## wbcd_test_labels | Benign | Malignant | Row Total |
## -----------------|-----------|-----------|-----------|
## Benign | 61 | 0 | 61 |
## | 1.000 | 0.000 | 0.610 |
## | 0.968 | 0.000 | |
## | 0.610 | 0.000 | |
## -----------------|-----------|-----------|-----------|
## Malignant | 2 | 37 | 39 |
## | 0.051 | 0.949 | 0.390 |
## | 0.032 | 1.000 | |
## | 0.020 | 0.370 | |
## -----------------|-----------|-----------|-----------|
## Column Total | 63 | 37 | 100 |
## | 0.630 | 0.370 | |
## -----------------|-----------|-----------|-----------|
##
##
The table cells can be interpreted as follows:

- The top-left cell holds the true negative results. These 61 of 100 values indicate cases where the mass was benign and the kNN algorithm correctly identified it as such.
- The bottom-right cell holds the true positive results, where the classifier and the clinically determined label agree that the mass is malignant. A total of 37 of 100 predictions were true positives.
- The lower-left cell holds the false negative results; in these 2 cases, the predicted value was benign but the tumor was actually malignant. Errors in this direction could be extremely costly, as they might lead a patient to believe that she is cancer-free when in reality the disease may continue to spread.
- The top-right cell would contain the false positive results, if there were any. These values occur when the model classifies a mass as malignant when in reality it was benign. Although such errors are less dangerous than a false negative result, they should also be avoided, as they could lead to additional financial burden on the health care system or additional stress for the patient, since additional tests or treatment may have to be provided.

We will attempt to improve the model in two ways. First, we will employ an alternative method for rescaling our numeric features. Second, we will try several different values for k.
To standardize our features with z-scores, we can use R's built-in scale() function, which offers the additional benefit that it can be applied directly to a data frame, so we can avoid use of the lapply() function.
# Rescales all features with the exception of diagnosis, and stores the result as a data frame in the wbcd_z variable
wbcd_z <- as.data.frame(scale(wbcd[-1]))
# Confirm the transformation was applied correctly
summary(wbcd_z$area_mean)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.4530 -0.6666 -0.2949 0.0000 0.3632 5.2460
Next, perform the same steps as before:
# Divide the data into training and test sets
wbcd_train <- wbcd_z[1:469, ]
wbcd_test <- wbcd_z[470:569, ]
# Re-create the training and test class label vectors
wbcd_train_labels <- wbcd[1:469, 1]
wbcd_test_labels <- wbcd[470:569, 1]
# Classify the test instances with knn(), then compare the predicted labels to the actual labels using CrossTable()
wbcd_test_pred <- knn(train = wbcd_train, test = wbcd_test, cl = wbcd_train_labels, k=21)
CrossTable(x = wbcd_test_labels, y = wbcd_test_pred, prop.chisq=FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 100
##
##
## | wbcd_test_pred
## wbcd_test_labels | Benign | Malignant | Row Total |
## -----------------|-----------|-----------|-----------|
## Benign | 61 | 0 | 61 |
## | 1.000 | 0.000 | 0.610 |
## | 0.924 | 0.000 | |
## | 0.610 | 0.000 | |
## -----------------|-----------|-----------|-----------|
## Malignant | 5 | 34 | 39 |
## | 0.128 | 0.872 | 0.390 |
## | 0.076 | 1.000 | |
## | 0.050 | 0.340 | |
## -----------------|-----------|-----------|-----------|
## Column Total | 66 | 34 | 100 |
## | 0.660 | 0.340 | |
## -----------------|-----------|-----------|-----------|
##
##
Using the min-max normalized training and test datasets, the same 100 records were classified using several different k values. The number of false negatives and false positives is shown for each iteration; a code sketch for generating such a comparison follows the table:
| k value | # false negatives | # false positives | % classified incorrectly |
|---|---|---|---|
| 1 | 1 | 3 | 4% |
| 5 | 2 | 0 | 2% |
| 11 | 3 | 0 | 3% |
| 15 | 3 | 0 | 3% |
| 21 | 2 | 0 | 2% |
| 27 | 4 | 0 | 4% |
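A loop such as the following could produce this comparison. It is a sketch rather than the exact code used: the k_values vector is illustrative, and it assumes wbcd_train and wbcd_test hold the min-max normalized split created earlier (not the z-score version):

# re-run knn() for several k values and count the two kinds of errors
k_values <- c(1, 5, 11, 15, 21, 27)
for (k in k_values) {
  pred <- knn(train = wbcd_train, test = wbcd_test, cl = wbcd_train_labels, k = k)
  false_neg <- sum(pred == "Benign" & wbcd_test_labels == "Malignant")
  false_pos <- sum(pred == "Malignant" & wbcd_test_labels == "Benign")
  cat("k =", k, ":", false_neg, "false negatives,", false_pos, "false positives\n")
}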
Unlike many classification algorithms, kNN does not do any learning. It simply stores the training data verbatim. Unlabeled test examples are then matched to the most similar records in the training set using a distance function, and each unlabeled example is assigned the label held by the majority of its k nearest neighbors.