The k-NN algorithm is a type of supervised learning algorithm that can be used for classification and regression tasks. It works by finding the k-nearest neighbors of a test instance in a labeled training set, and then assigning the test instance the label of the majority class among its k-nearest neighbors.
The Euclidean distance formula is a commonly used method for calculating the distance between two points in n-dimensional space. It is defined as:
d(p, q) = sqrt((q1 - p1)^2 + (q2 - p2)^2 + … + (qn - pn)^2)
Where p and q are two n-dimensional points, and pi and qi are their respective coordinates in each dimension.
The distance between two points can also be calculated using other distance metrics, such as Manhattan distance or cosine similarity, depending on the problem at hand.
k-NN is one of the most simple learning algorithms available in a data scientist’s toolkit. k-NN really shines when fast training of large datasets and simplicity are desired. It also does not make assumptions about the underlying data distribution of a dataset. Finally, it is an effective tool in predictive analytics that can quickly and easily be applied to datasets to allow for a quick check on the classification of the data.
K-NN has some drawbacks. For instance, it can overfit data, meaning it matches the data too closely and may not work well with new data. Choosing a large k value can ignore noise in the data, but it may also overlook important patterns. Nominal features need extra processing before using the k-NN algorithm. Additionally, k-NN is fast to train but slow to classify. It’s called “lazy learning” because it doesn’t compute anything on the training data beforehand. Lastly, k-NN doesn’t make a model of the data, so it can be difficult to interpret what the classifications actually mean.
As mentioned above, it can be difficult to select an appropriate value for k. According to Lantz, a good rule of thumb is
\[k=\sqrt(N)\]
Where N is the number of training examples.
One way to find the best k value for k-NN is to use k-fold cross-validation. This method splits the data into k subsets and runs the k-NN algorithm k times. Each time, one of the subsets is used as a training set. This way, all of the data is used as a training set k-1 times. K is usually set to 10, so it’s called 10-fold cross-validation. This method is more accurate, but it takes longer because the algorithm runs k times. It’s important to note that the k in k-fold cross-validation is different from the k in k-NN.
It’s essential to consider that the features in a dataset may have different ranges, making some features more significant than others when using the distance formula. Features with larger ranges could overshadow the smaller ones. Therefore, it’s crucial to use feature scaling to normalize the data before using the distance formula.
Another method to scale features is known as z-score standardization. z-score standardization uses the concept of a z-score, which essentially shows the values distance away from the feature set’s standard deviation, to scale the feature. The equation for scaling is given as
\[X_{new}=X-\mu/\sigma = X-\overline{X}/stDev(X) = z \]
The new value of the feature X is rescaled to the z-score. Unlike in the min-max normalization seen above, these numbers can be positive or negative, and there is no minimum or maximum value that the new X must be between.
Lantz outlines five essential steps to apply the k-NN algorithm to a data set: 1) collecting the data, 2) exploring and preparing the data, 3) training the model, 4) evaluating the model, and 5) improving the model. These steps are crucial to ensure that the k-NN algorithm is applied effectively and accurately to the data set.
The dataset that will be analyzed contains different measurements of cells from cancer biopsies like radius, texture, smoothness, perimeter, area, compactness, concavity, concave points, symmetry, and fractal dimension. This data was obtained by a minimally invasive procedure from 569 different patients (Olvi L. Mangasarian and Wolberg 1995).
First, we will begin by importing the dataset (In this case, the modified dataset provided with the textbook was used)
We can use the read.csv() function to import the
dataset.
library(class)
library(gmodels)
wbcd <- read.csv("wisc_bc_data.csv", stringsAsFactors = FALSE)
Following along with the textbook, we can subset the data, starting with removing the first feature (one that contains identification)
wbcd <- wbcd[-1]
Next week can make a table showing how many benign and malignant
tumors there are. After that, we can show the percentage of each kind of
tumor using prop.table()
table(wbcd$diagnosis)
##
## B M
## 357 212
Lets convert the “diagnosis” column in the “wbcd” data frame to a factor variable with levels “B” and “M”, which correspond to “Benign” and “Malignant”, respectively. This is done to ensure that the “diagnosis” variable is properly classified and labeled.
wbcd$diagnosis <- factor(wbcd$diagnosis, levels = c("B", "M"),
labels = c("Benign", "Malignant"))
round(prop.table(table(wbcd$diagnosis)) * 100, digits = 1)
##
## Benign Malignant
## 62.7 37.3
We can also take a look at a summary of three of the features.
summary(wbcd[c("radius_mean", "area_mean", "smoothness_mean")])
## radius_mean area_mean smoothness_mean
## Min. : 6.981 Min. : 143.5 Min. :0.05263
## 1st Qu.:11.700 1st Qu.: 420.3 1st Qu.:0.08637
## Median :13.370 Median : 551.1 Median :0.09587
## Mean :14.127 Mean : 654.9 Mean :0.09636
## 3rd Qu.:15.780 3rd Qu.: 782.7 3rd Qu.:0.10530
## Max. :28.110 Max. :2501.0 Max. :0.16340
The above shows us that the features are all scaled differently and need to be normalized.
normalize <- function(x) {
return ((x - min(x)) / (max(x) - min(x)))
}
wbcd_n <- as.data.frame(lapply(wbcd[2:31], normalize))
summary(wbcd_n$area_mean)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.1174 0.1729 0.2169 0.2711 1.0000
The data has been normalized. Now we need to split the data into training and test sets.
wbcd_train <- wbcd_n[1:469, ]
wbcd_test <- wbcd_n[470:569, ]
wbcd_train_labels <- wbcd[1:469, 1]
wbcd_test_labels <- wbcd[470:569, 1]
Now we can train the model for the data which was just prepared
wbcd_test_pred <- knn(train = wbcd_train, test = wbcd_test,
cl = wbcd_train_labels, k = 21)
Finally, we can take a look at the model we just created and determine how well it performed.
This will output a table showing the number of true positives, true negatives, false positives, and false negatives for the classification, as well as associated statistics such as sensitivity, specificity, positive predictive value, and negative predictive value.
CrossTable(x = wbcd_test_labels, y = wbcd_test_pred,
prop.chisq = FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 100
##
##
## | wbcd_test_pred
## wbcd_test_labels | Benign | Malignant | Row Total |
## -----------------|-----------|-----------|-----------|
## Benign | 61 | 0 | 61 |
## | 1.000 | 0.000 | 0.610 |
## | 0.968 | 0.000 | |
## | 0.610 | 0.000 | |
## -----------------|-----------|-----------|-----------|
## Malignant | 2 | 37 | 39 |
## | 0.051 | 0.949 | 0.390 |
## | 0.032 | 1.000 | |
## | 0.020 | 0.370 | |
## -----------------|-----------|-----------|-----------|
## Column Total | 63 | 37 | 100 |
## | 0.630 | 0.370 | |
## -----------------|-----------|-----------|-----------|
##
##
Now we can apply z score standardization to the data and see if this makes a difference.
wbcd_z = as.data.frame(scale(wbcd[-1]))
wbcd_train <- wbcd_z[1:469, ]
wbcd_test <- wbcd_z[470:569, ]
wbcd_train_labels <- wbcd[1:469, 1]
wbcd_test_labels <- wbcd[470:569, 1]
After scaling we can perform the same methods that we did in training and evaluation.
wbcd_test_pred <- knn(train = wbcd_train, test = wbcd_test,
cl = wbcd_train_labels, k = 21)
CrossTable(x = wbcd_test_labels, y = wbcd_test_pred,
prop.chisq = FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 100
##
##
## | wbcd_test_pred
## wbcd_test_labels | Benign | Malignant | Row Total |
## -----------------|-----------|-----------|-----------|
## Benign | 61 | 0 | 61 |
## | 1.000 | 0.000 | 0.610 |
## | 0.924 | 0.000 | |
## | 0.610 | 0.000 | |
## -----------------|-----------|-----------|-----------|
## Malignant | 5 | 34 | 39 |
## | 0.128 | 0.872 | 0.390 |
## | 0.076 | 1.000 | |
## | 0.050 | 0.340 | |
## -----------------|-----------|-----------|-----------|
## Column Total | 66 | 34 | 100 |
## | 0.660 | 0.340 | |
## -----------------|-----------|-----------|-----------|
##
##
We could go back to the first model that we created and modify some of the inputs to see if we get a better fit with different k values.
wbcd_train <- wbcd_n[1:469, ]
wbcd_test <- wbcd_n[470:569, ]
wbcd_train_labels <- wbcd[1:469, 1]
wbcd_test_labels <- wbcd[470:569, 1]
First, let’s attempt using an even number
wbcd_test_pred <- knn(train = wbcd_train, test = wbcd_test,
cl = wbcd_train_labels, k = 22)
CrossTable(x = wbcd_test_labels, y = wbcd_test_pred,
prop.chisq = FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 100
##
##
## | wbcd_test_pred
## wbcd_test_labels | Benign | Malignant | Row Total |
## -----------------|-----------|-----------|-----------|
## Benign | 61 | 0 | 61 |
## | 1.000 | 0.000 | 0.610 |
## | 0.968 | 0.000 | |
## | 0.610 | 0.000 | |
## -----------------|-----------|-----------|-----------|
## Malignant | 2 | 37 | 39 |
## | 0.051 | 0.949 | 0.390 |
## | 0.032 | 1.000 | |
## | 0.020 | 0.370 | |
## -----------------|-----------|-----------|-----------|
## Column Total | 63 | 37 | 100 |
## | 0.630 | 0.370 | |
## -----------------|-----------|-----------|-----------|
##
##
Even though the value of k was increased, the second model’s results are still not as good as those obtained from the first model. Lets try picking a new k value.
wbcd_test_pred <- knn(train = wbcd_train, test = wbcd_test,
cl = wbcd_train_labels, k = 11)
CrossTable(x = wbcd_test_labels, y = wbcd_test_pred,
prop.chisq = FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 100
##
##
## | wbcd_test_pred
## wbcd_test_labels | Benign | Malignant | Row Total |
## -----------------|-----------|-----------|-----------|
## Benign | 61 | 0 | 61 |
## | 1.000 | 0.000 | 0.610 |
## | 0.953 | 0.000 | |
## | 0.610 | 0.000 | |
## -----------------|-----------|-----------|-----------|
## Malignant | 3 | 36 | 39 |
## | 0.077 | 0.923 | 0.390 |
## | 0.047 | 1.000 | |
## | 0.030 | 0.360 | |
## -----------------|-----------|-----------|-----------|
## Column Total | 64 | 36 | 100 |
## | 0.640 | 0.360 | |
## -----------------|-----------|-----------|-----------|
##
##
And we obtain the same value as for \(k=22\). Finally, let’s double our initial k and subtract 1 (41).
wbcd_test_pred <- knn(train = wbcd_train, test = wbcd_test,
cl = wbcd_train_labels, k = 41)
CrossTable(x = wbcd_test_labels, y = wbcd_test_pred,
prop.chisq = FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 100
##
##
## | wbcd_test_pred
## wbcd_test_labels | Benign | Malignant | Row Total |
## -----------------|-----------|-----------|-----------|
## Benign | 61 | 0 | 61 |
## | 1.000 | 0.000 | 0.610 |
## | 0.938 | 0.000 | |
## | 0.610 | 0.000 | |
## -----------------|-----------|-----------|-----------|
## Malignant | 4 | 35 | 39 |
## | 0.103 | 0.897 | 0.390 |
## | 0.062 | 1.000 | |
## | 0.040 | 0.350 | |
## -----------------|-----------|-----------|-----------|
## Column Total | 65 | 35 | 100 |
## | 0.650 | 0.350 | |
## -----------------|-----------|-----------|-----------|
##
##
It appears that our value has gone down.Overall, this code applies the k-NN algorithm to the WBCD dataset and evaluates the accuracy of the model predictions using a confusion table. This can help to determine how well the model is likely to perform on new, unseen data and whether it is suitable for use in practice.