According to Lantz, k-NN is an algorithm that “uses information about an example’s k-nearest neighbors to classify unlabeled examples. The letter k is a variable term implying that any number of nearest neighbors could be used. After choosing k, the algorithm requires a training dataset made up of examples that have been classified into several categories, as labeled by a nominal variable. Then, for each unlabeled record in the test dataset, k-NN identifies k records in the training data that are the ‘nearest’ in similarity. The unlabeled test instance is assigned the class of the majority of the k nearest neighbors” (Lantz 2015).
That is to say, the k-NN algorithm classifies previously unlabeled data by comparing the dissimilarity (distance) between points. This dissimilarity can be measured in several ways, the most common being the Euclidean distance from geometry. The Euclidean distance \(d\) between two points \(p\) and \(q\) in \(n\)-dimensional space is given by the following formula:
\[d \left(p,q\right)= \sqrt{\left(p_{1}-q_{1}\right)^2 + \left(p_{2}-q_{2}\right)^2 + ... +\left(p_{n}-q_{n}\right)^2}\] This formula is a generalization of the Pythagorean theorem, which gives the length of the hypotenuse of a right triangle as \(a^2 + b^2 = c^2\).
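As a quick illustration, here is a minimal sketch in R (the language used throughout this report) of computing the Euclidean distance between two points; the vectors p and q are made-up examples:
p <- c(1, 2, 3)
q <- c(4, 6, 3)
sqrt(sum((p - q)^2))  # Euclidean distance = 5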
There are other methods of measuring the dissimilarity between points. For example, one can use the Pearson correlation \(\rho\), the Jackknife correlation \(\varrho\), the Goodman-Kruskal correlation \(\gamma\), and many others (Campello and Jaskowiak 2011).
k-NN is one of the simplest learning algorithms in a data scientist’s toolkit. It shines when fast training on large datasets and simplicity are desired, and it makes no assumptions about the underlying distribution of the data. Finally, it is an effective tool in predictive analytics that can be applied quickly and easily to a dataset for a first look at how the data might be classified.
k-NN is not without its weaknesses. For example, it can overfit the data. Overfitting is “[t]he production of an analysis which corresponds too closely or exactly to a particular set of data, and may therefore fail to fit additional data or predict future observations reliably” (Oxford Living Dictionaries, n.d.). Using a large value for k makes the classification less sensitive to noise in the data, but this can be an instance of “throwing the baby out with the bathwater”: the algorithm may also ignore important patterns hidden within the noise. This is known as the bias-variance tradeoff, and it is one reason that selecting an appropriate value of k can be difficult.
Nominal features, or features that have values that cannot necessarily be compared numerically to one another (like colors or labels), require additional processing prior to using the k-NN algorithm.
Another weakness of k-NN is that, despite its relatively quick training phase, its classification phase is comparatively slow. This is because no computation is done on the training data prior to classification, an approach known as lazy learning.
Finally, k-NN does not create a model of the data, which can make it difficult to understand what the classifications obtained actually mean.
As mentioned above, it can be difficult to select an appropriate value for k. According to Lantz, a good rule of thumb is \[ k = \sqrt{N} \] where \(N\) is the number of training examples.
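As a minimal sketch, for the 469 training records used later in this report, this rule of thumb gives roughly 21 (rounding down), which is the value of k used below:
n_train <- 469
floor(sqrt(n_train))  # 21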
Another method to determine an appropriate value of k is known as k-fold cross-validation. k-fold cross-validation divides the data into k subsets, and the k-NN algorithm is applied k times. Each time, one of the subsets is held out and used as the test set while the remaining subsets are used for training; in this way, every observation is used for training \(k-1\) times and for testing once (Schneider 1997). This method is more exact, but because the algorithm is applied k times, the running time also grows by a factor of k. It is important to note that the k in k-fold cross-validation is different from the k in the k-NN algorithm. A value of \(k=10\) is typically chosen, so the method is also called 10-fold cross-validation.
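A minimal sketch of how the folds could be constructed by hand in R, assuming a hypothetical data frame dat (in practice, packages such as caret automate this, as shown later):
set.seed(1)
folds <- cut(sample(seq_len(nrow(dat))), breaks = 10, labels = FALSE)
for (i in 1:10) {
  test_idx  <- which(folds == i)   # the held-out fold is the test set
  train_idx <- which(folds != i)   # the remaining 9 folds form the training set
  # fit and evaluate a model on dat[train_idx, ] and dat[test_idx, ] here
}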
It is important to note that the features in a dataset may have very different ranges. If the distance formula were applied to unmodified features, features with larger ranges could dominate or mask features with smaller ranges. Because of this, it is important to prepare the data with feature scaling.
There are a few different ways that an analyst could scale the data. The traditional method of feature scaling used in k-NN classification is called min-max normalization. This method rescales all of the values within a feature so that they fall between 0 and 1, and the equation representing it is given as \[ X_{\textrm{new}}=\frac{X-\textrm{min}\left(X\right)}{\textrm{max}\left(X\right)-\textrm{min}\left(X\right)} \] The formula takes any value \(X\), subtracts the minimum value of that feature, and then divides the result by the feature’s range (the difference between its largest and smallest values).
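As a small worked example, applying the formula to the made-up values 1, 5, and 10:
x <- c(1, 5, 10)
(x - min(x)) / (max(x) - min(x))  # 0.000 0.444 1.000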
Another method of scaling features is z-score standardization. It rescales each value using its \(z\)-score, which measures how many standard deviations the value lies from the feature’s mean. The equation for scaling is given as
\[ X_{\textrm{new}}=\frac{X-\mu}{\sigma}=\frac{X-\bar{X}}{\textrm{StDev}\left(X\right)} =z \]
The new value of the feature \(X\) is rescaled to its \(z\)-score. Unlike min-max normalization, these values can be positive or negative, and there is no fixed minimum or maximum that the new \(X\) must fall between.
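A minimal sketch of this rescaling as an R function (base R’s scale(), used later, performs the same transformation column by column):
standardize <- function(x) {
  # subtract the mean and divide by the standard deviation (z-score)
  (x - mean(x)) / sd(x)
}
standardize(c(10, 20, 30))  # -1  0  1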
Another form of preprocessing that must be done is known as dummy coding, which is important for nominal features (features that are categorical rather than numerical). Dummy coding takes each category in a feature and assigns it its own new feature with a value of 1 or 0: 1 means the record belongs to that category and 0 means it does not. Then, only \(N-1\) of the new features are kept, where \(N\) is the total number of categories. No information is lost, because a record with 0 in every kept feature must belong to the dropped category. In this way, all of the new features are either 1 or 0, which puts them on the same scale as min-max normalized features.
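A minimal sketch of dummy coding for a hypothetical nominal feature with three categories:
color <- c("red", "green", "blue", "green")
red   <- ifelse(color == "red",   1, 0)
green <- ifelse(color == "green", 1, 0)
# "blue" is the dropped category: a record with red = 0 and green = 0 must be blue
data.frame(red, green)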
According to Lantz, there are five steps required to apply the k-NN algorithm to a dataset:
1. Data collection
2. Data exploration and preparation
3. Model training
4. Model evaluation
5. Model improvement
The dataset that will be analyzed contains measurements of cells from cancer biopsies, such as radius, texture, smoothness, perimeter, area, compactness, concavity, concave points, symmetry, and fractal dimension. This data was obtained by a minimally invasive procedure from 569 different patients (Mangasarian, Street, and Wolberg 1995).
First, we will begin by importing the dataset (in this case, the modified dataset provided with the textbook was used).
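The examples below assume that the class, gmodels, and caret packages have already been installed and loaded; as a sketch:
library(class)    # provides knn()
library(gmodels)  # provides CrossTable()
library(caret)    # provides trainControl() and train(), used later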
We can use the read.csv() function to import the dataset.
wbcd <- read.csv("wisc_bc_data.csv", stringsAsFactors = FALSE)
Following along with the textbook, we can subset the data, starting with removing the first feature (an identification column that should not be used for prediction).
wbcd <- wbcd[-1]
Next we can make a table showing how many benign and malignant tumors there are. After that, we can show the percentage of each kind of tumor using prop.table().
table(wbcd$diagnosis)
B M
357 212
wbcd$diagnosis <- factor(wbcd$diagnosis, levels = c("B", "M"),
labels = c("Benign", "Malignant"))
round(prop.table(table(wbcd$diagnosis)) * 100, digits = 1)
Benign Malignant
62.7 37.3
We can also take a look at a summary of three of the features.
summary(wbcd[c("radius_mean", "area_mean", "smoothness_mean")])
radius_mean area_mean smoothness_mean
Min. : 6.981 Min. : 143.5 Min. :0.05263
1st Qu.:11.700 1st Qu.: 420.3 1st Qu.:0.08637
Median :13.370 Median : 551.1 Median :0.09587
Mean :14.127 Mean : 654.9 Mean :0.09636
3rd Qu.:15.780 3rd Qu.: 782.7 3rd Qu.:0.10530
Max. :28.110 Max. :2501.0 Max. :0.16340
The above shows us that the features are all scaled differently and need to be normalized.
normalize <- function(x) {
  # rescale a numeric vector to the range [0, 1] (min-max normalization)
  return ((x - min(x)) / (max(x) - min(x)))
}
wbcd_n <- as.data.frame(lapply(wbcd[2:31], normalize))
summary(wbcd_n$area_mean)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0000 0.1174 0.1729 0.2169 0.2711 1.0000
The data has been normalized. Now we need to split the data into training and test sets. (Splitting sequentially like this assumes the records are not stored in any meaningful order; otherwise a random split, as used later for the iris data, would be safer.)
wbcd_train <- wbcd_n[1:469, ]
wbcd_test <- wbcd_n[470:569, ]
wbcd_train_labels <- wbcd[1:469, 1]
wbcd_test_labels <- wbcd[470:569, 1]
Now we can train the model on the data that was just prepared.
wbcd_test_pred <- knn(train = wbcd_train, test = wbcd_test,
cl = wbcd_train_labels, k = 21)
Finally, we can take a look at the model we just created and determine how well it performed.
CrossTable(x = wbcd_test_labels, y = wbcd_test_pred,
prop.chisq = FALSE)
Cell Contents
|-------------------------|
| N |
| N / Row Total |
| N / Col Total |
| N / Table Total |
|-------------------------|
Total Observations in Table: 100
| wbcd_test_pred
wbcd_test_labels | Benign | Malignant | Row Total |
-----------------|-----------|-----------|-----------|
Benign | 61 | 0 | 61 |
| 1.000 | 0.000 | 0.610 |
| 0.968 | 0.000 | |
| 0.610 | 0.000 | |
-----------------|-----------|-----------|-----------|
Malignant | 2 | 37 | 39 |
| 0.051 | 0.949 | 0.390 |
| 0.032 | 1.000 | |
| 0.020 | 0.370 | |
-----------------|-----------|-----------|-----------|
Column Total | 63 | 37 | 100 |
| 0.630 | 0.370 | |
-----------------|-----------|-----------|-----------|
But how does it hold up? Let’s try a different method of standardization to find out.
Now we can apply \(z\)-score standardization to the data and see if this makes a difference.
wbcd_z <- as.data.frame(scale(wbcd[-1]))
wbcd_train <- wbcd_z[1:469, ]
wbcd_test <- wbcd_z[470:569, ]
wbcd_train_labels <- wbcd[1:469, 1]
wbcd_test_labels <- wbcd[470:569, 1]
After scaling, we can repeat the same training and evaluation steps as before.
wbcd_test_pred <- knn(train = wbcd_train, test = wbcd_test,
cl = wbcd_train_labels, k = 21)
CrossTable(x = wbcd_test_labels, y = wbcd_test_pred,
prop.chisq = FALSE)
Cell Contents
|-------------------------|
| N |
| N / Row Total |
| N / Col Total |
| N / Table Total |
|-------------------------|
Total Observations in Table: 100
| wbcd_test_pred
wbcd_test_labels | Benign | Malignant | Row Total |
-----------------|-----------|-----------|-----------|
Benign | 61 | 0 | 61 |
| 1.000 | 0.000 | 0.610 |
| 0.924 | 0.000 | |
| 0.610 | 0.000 | |
-----------------|-----------|-----------|-----------|
Malignant | 5 | 34 | 39 |
| 0.128 | 0.872 | 0.390 |
| 0.076 | 1.000 | |
| 0.050 | 0.340 | |
-----------------|-----------|-----------|-----------|
Column Total | 66 | 34 | 100 |
| 0.660 | 0.340 | |
-----------------|-----------|-----------|-----------|
It looks like our new model was a little less reliable than the first one: five malignant masses were misclassified as benign, compared with two before. Further evaluation and improvement are required.
We could go back to the first model that we created and modify some of the inputs to see if we get a better fit with different k values.
wbcd_train <- wbcd_n[1:469, ]
wbcd_test <- wbcd_n[470:569, ]
wbcd_train_labels <- wbcd[1:469, 1]
wbcd_test_labels <- wbcd[470:569, 1]
First, let’s try using an even value of k (this is usually avoided because an even number of neighbors can produce tied votes).
wbcd_test_pred <- knn(train = wbcd_train, test = wbcd_test,
cl = wbcd_train_labels, k = 22)
CrossTable(x = wbcd_test_labels, y = wbcd_test_pred,
prop.chisq = FALSE)
Cell Contents
|-------------------------|
| N |
| N / Row Total |
| N / Col Total |
| N / Table Total |
|-------------------------|
Total Observations in Table: 100
| wbcd_test_pred
wbcd_test_labels | Benign | Malignant | Row Total |
-----------------|-----------|-----------|-----------|
Benign | 61 | 0 | 61 |
| 1.000 | 0.000 | 0.610 |
| 0.953 | 0.000 | |
| 0.610 | 0.000 | |
-----------------|-----------|-----------|-----------|
Malignant | 3 | 36 | 39 |
| 0.077 | 0.923 | 0.390 |
| 0.047 | 1.000 | |
| 0.030 | 0.360 | |
-----------------|-----------|-----------|-----------|
Column Total | 64 | 36 | 100 |
| 0.640 | 0.360 | |
-----------------|-----------|-----------|-----------|
These results are still not as good as the results obtained with the first model, despite having a larger value of k.
Next, we can try picking a value of k that is \(\frac{\sqrt{N}}{2}\), or 11.
wbcd_test_pred <- knn(train = wbcd_train, test = wbcd_test,
cl = wbcd_train_labels, k = 11)
CrossTable(x = wbcd_test_labels, y = wbcd_test_pred,
prop.chisq = FALSE)
Cell Contents
|-------------------------|
| N |
| N / Row Total |
| N / Col Total |
| N / Table Total |
|-------------------------|
Total Observations in Table: 100
| wbcd_test_pred
wbcd_test_labels | Benign | Malignant | Row Total |
-----------------|-----------|-----------|-----------|
Benign | 61 | 0 | 61 |
| 1.000 | 0.000 | 0.610 |
| 0.953 | 0.000 | |
| 0.610 | 0.000 | |
-----------------|-----------|-----------|-----------|
Malignant | 3 | 36 | 39 |
| 0.077 | 0.923 | 0.390 |
| 0.047 | 1.000 | |
| 0.030 | 0.360 | |
-----------------|-----------|-----------|-----------|
Column Total | 64 | 36 | 100 |
| 0.640 | 0.360 | |
-----------------|-----------|-----------|-----------|
And we obtain the same results as for \(k=22\). Finally, let’s double our initial k and subtract 1 (41).
wbcd_test_pred <- knn(train = wbcd_train, test = wbcd_test,
cl = wbcd_train_labels, k = 41)
CrossTable(x = wbcd_test_labels, y = wbcd_test_pred,
prop.chisq = FALSE)
Cell Contents
|-------------------------|
| N |
| N / Row Total |
| N / Col Total |
| N / Table Total |
|-------------------------|
Total Observations in Table: 100
| wbcd_test_pred
wbcd_test_labels | Benign | Malignant | Row Total |
-----------------|-----------|-----------|-----------|
Benign | 61 | 0 | 61 |
| 1.000 | 0.000 | 0.610 |
| 0.938 | 0.000 | |
| 0.610 | 0.000 | |
-----------------|-----------|-----------|-----------|
Malignant | 4 | 35 | 39 |
| 0.103 | 0.897 | 0.390 |
| 0.062 | 1.000 | |
| 0.040 | 0.350 | |
-----------------|-----------|-----------|-----------|
Column Total | 65 | 35 | 100 |
| 0.650 | 0.350 | |
-----------------|-----------|-----------|-----------|
It appears that performance has gone down again: four malignant masses are now misclassified as benign. This is an interesting result: a larger k does not necessarily yield a better model.
Now we can take a look at the Iris dataset (Anderson 1936) provided by the University of California, Irvine’s machine learning dataset repository (Lichman 2013). It is a simple dataset of 150 observations of 5 features that can be classified easily using k-NN. Below we will follow the steps provided by Lantz.
The Iris dataset is already included in R, so it will be easy to preprocess. The features included in the dataset are Sepal Length, Sepal Width, Petal Length, Petal Width, and Species.
str(iris)
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
table(iris$Species)
setosa versicolor virginica
50 50 50
round(prop.table(table(factor(iris$Species, levels = c("setosa",
"versicolor",
"virginica"),
labels = c("Setosa", "Versicolor",
"Virginica"))))
* 100, digits = 1)
Setosa Versicolor Virginica
33.3 33.3 33.3
Looking at the structure of the dataset using str() shows us that we need to normalize the data. This can be done with our normalization function created above.
iris_n <- as.data.frame(lapply(iris[1:4], normalize))
iris_n <- cbind(iris_n, iris[5])
Next, we can split the dataset into two sets, the test set and the training set. For this dataset, I used the R function sample() to randomly select 3/4 of the data as a training set, using the remaining data as a test set to determine the validity of the model. (Note that for this split to be reproducible, a random seed should be set before calling sample().)
randomIris <- sample(nrow(iris), 112)
iris_test <- iris_n[-randomIris, ]
iris_train <- iris_n[randomIris, ]
Next, we can use the caret package to perform k-fold cross-validation. This method can give us some insight into the best value of k for our k-NN model.
set.seed(123)
ctrl <- trainControl(method = "repeatedcv", repeats = 5)
iris_pred <- train(Species ~ ., data = iris_train, method = "knn")
iris_pred
k-Nearest Neighbors
112 samples
4 predictors
3 classes: 'setosa', 'versicolor', 'virginica'
No pre-processing
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 112, 112, 112, 112, 112, 112, ...
Resampling results across tuning parameters:
k Accuracy Kappa
5 0.9436228 0.9146894
7 0.9466854 0.9193047
9 0.9466643 0.9192784
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was k = 7.
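Note that the ctrl object defined above was never passed to train(), so caret fell back to its default bootstrap resampling (the output reports "Bootstrapped (25 reps)"). A minimal sketch of the call that would actually use the repeated 10-fold cross-validation we set up, assuming the same iris_train data:
iris_cv <- train(Species ~ ., data = iris_train, method = "knn",
                 trControl = ctrl)
iris_cv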
Either way, the resampling results point to \(k = 7\) as the best value (accuracy 0.947), with \(k = 9\) performing nearly identically; we will try \(k = 9\) below.
Finally, we can train the model using our training data.
iris_test_pred <- knn(train = iris_train[1:4], test = iris_test[1:4],
cl = iris_train$Species, k = 9)
CrossTable(x = iris_test$Species, y = iris_test_pred,
prop.chisq = FALSE)
Cell Contents
|-------------------------|
| N |
| N / Row Total |
| N / Col Total |
| N / Table Total |
|-------------------------|
Total Observations in Table: 38
| iris_test_pred
iris_test$Species | setosa | versicolor | virginica | Row Total |
------------------|------------|------------|------------|------------|
setosa | 10 | 0 | 0 | 10 |
| 1.000 | 0.000 | 0.000 | 0.263 |
| 1.000 | 0.000 | 0.000 | |
| 0.263 | 0.000 | 0.000 | |
------------------|------------|------------|------------|------------|
versicolor | 0 | 13 | 0 | 13 |
| 0.000 | 1.000 | 0.000 | 0.342 |
| 0.000 | 1.000 | 0.000 | |
| 0.000 | 0.342 | 0.000 | |
------------------|------------|------------|------------|------------|
virginica | 0 | 0 | 15 | 15 |
| 0.000 | 0.000 | 1.000 | 0.395 |
| 0.000 | 0.000 | 1.000 | |
| 0.000 | 0.000 | 0.395 | |
------------------|------------|------------|------------|------------|
Column Total | 10 | 13 | 15 | 38 |
| 0.263 | 0.342 | 0.395 | |
------------------|------------|------------|------------|------------|
According to the cross table, every test observation was classified correctly: all of the setosa, versicolor, and virginica examples fall on the diagonal. Perhaps we could use the \(\sqrt{N}\) rule of thumb to evaluate the model further.
\(\sqrt{150}\approx 12.2\), which we round up to the odd value 13. Let’s retrain the model with \(k = 13\) to see what results we obtain.
iris_test_pred2 <- knn(train = iris_train[1:4], test = iris_test[1:4],
cl = iris_train$Species, k = 13)
CrossTable(x = iris_test$Species, y = iris_test_pred2,
prop.chisq = FALSE)
Cell Contents
|-------------------------|
| N |
| N / Row Total |
| N / Col Total |
| N / Table Total |
|-------------------------|
Total Observations in Table: 38
| iris_test_pred2
iris_test$Species | setosa | versicolor | virginica | Row Total |
------------------|------------|------------|------------|------------|
setosa | 10 | 0 | 0 | 10 |
| 1.000 | 0.000 | 0.000 | 0.263 |
| 1.000 | 0.000 | 0.000 | |
| 0.263 | 0.000 | 0.000 | |
------------------|------------|------------|------------|------------|
versicolor | 0 | 13 | 0 | 13 |
| 0.000 | 1.000 | 0.000 | 0.342 |
| 0.000 | 0.929 | 0.000 | |
| 0.000 | 0.342 | 0.000 | |
------------------|------------|------------|------------|------------|
virginica | 0 | 1 | 14 | 15 |
| 0.000 | 0.067 | 0.933 | 0.395 |
| 0.000 | 0.071 | 1.000 | |
| 0.000 | 0.026 | 0.368 | |
------------------|------------|------------|------------|------------|
Column Total | 10 | 14 | 14 | 38 |
| 0.263 | 0.368 | 0.368 | |
------------------|------------|------------|------------|------------|
These results are slightly worse than those obtained with \(k = 9\): one virginica observation is now misclassified as versicolor.
Let’s try \(k\approx\frac{\sqrt{N}}{2}\approx 6\); rounding up to the nearest odd number gives \(7\), the value caret selected above.
iris_test_pred <- knn(train = iris_train[1:4], test = iris_test[1:4],
cl = iris_train$Species, k = 7)
CrossTable(x = iris_test$Species, y = iris_test_pred,
prop.chisq = FALSE)
Cell Contents
|-------------------------|
| N |
| N / Row Total |
| N / Col Total |
| N / Table Total |
|-------------------------|
Total Observations in Table: 38
| iris_test_pred
iris_test$Species | setosa | versicolor | virginica | Row Total |
------------------|------------|------------|------------|------------|
setosa | 10 | 0 | 0 | 10 |
| 1.000 | 0.000 | 0.000 | 0.263 |
| 1.000 | 0.000 | 0.000 | |
| 0.263 | 0.000 | 0.000 | |
------------------|------------|------------|------------|------------|
versicolor | 0 | 12 | 1 | 13 |
| 0.000 | 0.923 | 0.077 | 0.342 |
| 0.000 | 1.000 | 0.062 | |
| 0.000 | 0.316 | 0.026 | |
------------------|------------|------------|------------|------------|
virginica | 0 | 0 | 15 | 15 |
| 0.000 | 0.000 | 1.000 | 0.395 |
| 0.000 | 0.000 | 0.938 | |
| 0.000 | 0.000 | 0.395 | |
------------------|------------|------------|------------|------------|
Column Total | 10 | 12 | 16 | 38 |
| 0.263 | 0.316 | 0.421 | |
------------------|------------|------------|------------|------------|
This model also misclassifies a single observation (one versicolor predicted as virginica), so it performs about as well as \(k = 13\) and slightly worse than the \(k = 9\) model above. On this small test set, several values of k give nearly identical results; sometimes simplicity is best.
## References
Anderson, Edgar. 1936. “The Species Problem in Iris.” Annals of the Missouri Botanical Garden. doi:10.2307/2394164.
Campello, Ricardo J. G. B., and Pablo A. Jaskowiak. 2011. “Comparing Correlation Coefficients as Dissimilarity Measures for Cancer Classification in Gene Expression Data.” Brazilian Symposium on Biometrics. doi:10.1.1.208.993.
Oxford Living Dictionaries. n.d. “Definition of Overfitting.” https://en.oxforddictionaries.com/definition/overfitting.
Lantz, Brett. 2015. Machine Learning with R. Birmingham, United Kingdom: Packt Publishing.
Lichman, M. 2013. “UCI Machine Learning Repository.” University of California, Irvine, School of Information; Computer Sciences. http://archive.ics.uci.edu/ml.
Mangasarian, Olvi L., W. Nick Street, and William H. Wolberg. 1995. “Breast Cancer Diagnosis and Prognosis via Linear Programming.” Operations Research 43 (4): 570–77. https://doi.org/10.1287/opre.43.4.570.
Schneider, Jeff. 1997. “Cross Validation.” https://www.cs.cmu.edu/~schneide/tut5/node42.html.