Discussion

Defining k-NN

The k-NN algorithm is a supervised learning algorithm that can be used for both classification and regression tasks. It works by finding the k nearest neighbors of a test instance in a labeled training set and then assigning the test instance the majority class label among those neighbors (for regression, the average of the neighbors' values is used instead).

The Euclidean distance formula is a commonly used method for calculating the distance between two points in n-dimensional space. It is defined as:

\[d(p,q)=\sqrt{(q_1-p_1)^2+(q_2-p_2)^2+\dots+(q_n-p_n)^2}\]

Where p and q are two n-dimensional points, and \(p_i\) and \(q_i\) are their respective coordinates in the i-th dimension.

The distance between two points can also be calculated using other distance metrics, such as Manhattan distance or cosine similarity, depending on the problem at hand.
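To make the voting procedure concrete, here is a minimal sketch of classifying a single test instance by majority vote among its k nearest Euclidean neighbors. This is illustrative only, not the implementation used by the class package later in this analysis, and the function and argument names are hypothetical.

# Minimal illustrative sketch: classify one test point by majority vote
# among its k nearest neighbors (Euclidean distance).
knn_one <- function(train, labels, test_point, k = 3) {
  # distance from the test point to every training row
  dists <- sqrt(rowSums(sweep(as.matrix(train), 2, test_point)^2))
  # labels of the k closest training instances
  nearest <- labels[order(dists)[1:k]]
  # the most frequent label among the neighbors wins
  names(which.max(table(nearest)))
}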

Strengths of k-NN

k-NN is one of the simplest learning algorithms in a data scientist’s toolkit. It shines when fast training and simplicity are desired, even on large datasets. It also makes no assumptions about the underlying distribution of the data. Finally, it is an effective predictive-analytics tool that can be applied quickly and easily to a dataset to get a first look at how the data classifies.

Weaknesses of k-NN

k-NN has some drawbacks. With a small k it can overfit, matching the training data too closely and generalizing poorly to new data; choosing a large k smooths over noise in the data, but it may also overlook small yet important patterns. Nominal features need extra processing (such as dummy coding) before the algorithm can be used. Additionally, k-NN is fast to train but slow to classify: it is called “lazy learning” because it does no computation on the training data beforehand, deferring all of the work to classification time. Lastly, k-NN does not build a model of the data, so it can be difficult to interpret what the classifications actually mean.

Selecting an appropriate k

Square Root of N method

As mentioned above, it can be difficult to select an appropriate value for k. According to Lantz, a good rule of thumb is

\[k=\sqrt{N}\]

Where N is the number of training examples.
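As a quick check of this rule against the analysis below (which uses 469 training examples), a one-line calculation suffices; rounding to a nearby odd number avoids tied votes in a two-class problem, which is consistent with the k = 21 used in the model training step.

# Rule-of-thumb k for the 469 training examples used below
sqrt(469)   # approximately 21.66; an odd value such as k = 21 avoids ties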

k-fold cross-validation method

One way to find the best k value for k-NN is k-fold cross-validation. This method splits the data into k subsets (folds) and runs the k-NN algorithm k times; each time, one fold is held out for validation and the remaining k-1 folds are used for training. In this way, every observation is used for training k-1 times and for validation exactly once. k is usually set to 10, giving 10-fold cross-validation. This approach gives a more reliable estimate of performance, but it takes longer because the algorithm must be run k times. It’s important to note that the k in k-fold cross-validation is different from the k in k-NN.
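As a sketch of how this could be done in R with the class package used later in this analysis (the cv_knn helper and its arguments are hypothetical, not part of the textbook workflow):

library(class)

# Hypothetical helper: estimate accuracy for several candidate k values
# with 10-fold cross-validation on a normalized feature data frame.
cv_knn <- function(features, labels, k_values, folds = 10) {
  fold_id <- sample(rep(1:folds, length.out = nrow(features)))
  sapply(k_values, function(k) {
    fold_acc <- sapply(1:folds, function(f) {
      pred <- knn(train = features[fold_id != f, ],
                  test  = features[fold_id == f, ],
                  cl    = labels[fold_id != f], k = k)
      mean(pred == labels[fold_id == f])
    })
    mean(fold_acc)   # average accuracy across the folds
  })
}
# Example call (after the data preparation shown below):
# cv_knn(wbcd_n, wbcd$diagnosis, k_values = c(1, 5, 11, 21, 27))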

Feature scaling

It’s essential to consider that the features in a dataset may have different ranges; when the distance formula is applied, features with larger ranges can dominate the calculation and overshadow those with smaller ranges. Therefore, it’s crucial to rescale the data before computing distances.
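One common approach, implemented later in this analysis with the normalize() function, is min-max normalization, which rescales each feature to fall between 0 and 1:

\[X_{new}=\frac{X-\min(X)}{\max(X)-\min(X)}\]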

z-score Standardization

Another method of scaling features is z-score standardization. It uses the z-score, which measures how many standard deviations a value lies from the mean of the feature, to rescale the feature. The equation for scaling is given as

\[X_{new}=\frac{X-\mu}{\sigma}=\frac{X-\overline{X}}{\text{StdDev}(X)}=z\]

The new value of the feature X is rescaled to the z-score. Unlike in the min-max normalization seen above, these numbers can be positive or negative, and there is no minimum or maximum value that the new X must be between.
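As a quick illustration (the vector x here is made up for demonstration), the same rescaling can be done by hand or with R’s built-in scale() function, which is used in the model improvement step below:

# Illustrative only: z-score standardization of a small numeric vector
x <- c(2, 4, 6, 8, 10)
(x - mean(x)) / sd(x)   # by hand
as.vector(scale(x))     # same result using the built-in scale() function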

Applying the k-NN Algorithm

Lantz outlines five essential steps to apply the k-NN algorithm to a data set: 1) collecting the data, 2) exploring and preparing the data, 3) training the model, 4) evaluating the model, and 5) improving the model. These steps are crucial to ensure that the k-NN algorithm is applied effectively and accurately to the data set.

Data Collection

The dataset that will be analyzed contains measurements of cells from cancer biopsies, such as radius, texture, smoothness, perimeter, area, compactness, concavity, concave points, symmetry, and fractal dimension. The data were obtained by a minimally invasive procedure from 569 different patients (Mangasarian and Wolberg 1995).

Data Exploration and Preparation

First, we will import the dataset (in this case, the modified dataset provided with the textbook was used).

We can use the read.csv() function to import the dataset.

library(class)
library(gmodels)
wbcd <- read.csv("wisc_bc_data.csv", stringsAsFactors = FALSE)

Following along with the textbook, we can subset the data, starting by removing the first feature (an ID column used only for identification).

wbcd <- wbcd[-1]

Next, we can make a table showing how many benign and malignant tumors there are. After that, we can show the percentage of each kind of tumor using prop.table().

table(wbcd$diagnosis)
## 
##   B   M 
## 357 212

Let’s convert the “diagnosis” column in the “wbcd” data frame to a factor variable with levels “B” and “M”, which correspond to “Benign” and “Malignant”, respectively. This ensures that the “diagnosis” variable is properly classified and labeled.

wbcd$diagnosis <- factor(wbcd$diagnosis, levels = c("B", "M"),
                         labels = c("Benign", "Malignant"))
round(prop.table(table(wbcd$diagnosis)) * 100, digits = 1)
## 
##    Benign Malignant 
##      62.7      37.3

We can also take a look at a summary of three of the features.

summary(wbcd[c("radius_mean", "area_mean", "smoothness_mean")])
##   radius_mean       area_mean      smoothness_mean  
##  Min.   : 6.981   Min.   : 143.5   Min.   :0.05263  
##  1st Qu.:11.700   1st Qu.: 420.3   1st Qu.:0.08637  
##  Median :13.370   Median : 551.1   Median :0.09587  
##  Mean   :14.127   Mean   : 654.9   Mean   :0.09636  
##  3rd Qu.:15.780   3rd Qu.: 782.7   3rd Qu.:0.10530  
##  Max.   :28.110   Max.   :2501.0   Max.   :0.16340

The above shows us that the features are all scaled differently and need to be normalized.

normalize <- function(x) {
  # rescale a numeric vector to the range [0, 1]
  return((x - min(x)) / (max(x) - min(x)))
}
# apply the normalization to the 30 numeric features (columns 2 to 31)
wbcd_n <- as.data.frame(lapply(wbcd[2:31], normalize))
summary(wbcd_n$area_mean)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.1174  0.1729  0.2169  0.2711  1.0000

The data has been normalized. Now we need to split the data into training and test sets.

wbcd_train <- wbcd_n[1:469, ]
wbcd_test <- wbcd_n[470:569, ]
wbcd_train_labels <- wbcd[1:469, 1]
wbcd_test_labels <- wbcd[470:569, 1]

Model Training

Now we can classify the test data using the training data we just prepared. The knn() function from the class package handles both the (trivial) training step and the classification in a single call.

wbcd_test_pred <- knn(train = wbcd_train, test = wbcd_test,
                      cl = wbcd_train_labels, k = 21)

Model Evaluation

Finally, we can take a look at the model we just created and determine how well it performed.

This will output a table showing, for each combination of actual and predicted class, the count of cases along with row, column, and table proportions. From these counts, the numbers of true positives, true negatives, false positives, and false negatives can be read off, and statistics such as sensitivity, specificity, positive predictive value, and negative predictive value can be computed.

CrossTable(x = wbcd_test_labels, y = wbcd_test_pred,
           prop.chisq = FALSE)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  100 
## 
##  
##                  | wbcd_test_pred 
## wbcd_test_labels |    Benign | Malignant | Row Total | 
## -----------------|-----------|-----------|-----------|
##           Benign |        61 |         0 |        61 | 
##                  |     1.000 |     0.000 |     0.610 | 
##                  |     0.968 |     0.000 |           | 
##                  |     0.610 |     0.000 |           | 
## -----------------|-----------|-----------|-----------|
##        Malignant |         2 |        37 |        39 | 
##                  |     0.051 |     0.949 |     0.390 | 
##                  |     0.032 |     1.000 |           | 
##                  |     0.020 |     0.370 |           | 
## -----------------|-----------|-----------|-----------|
##     Column Total |        63 |        37 |       100 | 
##                  |     0.630 |     0.370 |           | 
## -----------------|-----------|-----------|-----------|
## 
## 
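From the same predictions, summary statistics such as accuracy, sensitivity, and specificity can also be computed directly. This is a supplementary sketch, not part of the original analysis; it treats Malignant as the positive class.

# Supplementary sketch: summary statistics from the confusion matrix,
# treating Malignant as the positive class.
cm <- table(wbcd_test_labels, wbcd_test_pred)
accuracy    <- sum(diag(cm)) / sum(cm)
sensitivity <- cm["Malignant", "Malignant"] / sum(cm["Malignant", ])
specificity <- cm["Benign", "Benign"] / sum(cm["Benign", ])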

Model Improvement

Now we can apply z-score standardization to the data and see if this makes a difference.

# standardize all numeric features (dropping the diagnosis column) with scale()
wbcd_z <- as.data.frame(scale(wbcd[-1]))
wbcd_train <- wbcd_z[1:469, ]
wbcd_test <- wbcd_z[470:569, ]
wbcd_train_labels <- wbcd[1:469, 1]
wbcd_test_labels <- wbcd[470:569, 1]

After scaling we can perform the same methods that we did in training and evaluation.

wbcd_test_pred <- knn(train = wbcd_train, test = wbcd_test,
                      cl = wbcd_train_labels, k = 21)
CrossTable(x = wbcd_test_labels, y = wbcd_test_pred,
           prop.chisq = FALSE)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  100 
## 
##  
##                  | wbcd_test_pred 
## wbcd_test_labels |    Benign | Malignant | Row Total | 
## -----------------|-----------|-----------|-----------|
##           Benign |        61 |         0 |        61 | 
##                  |     1.000 |     0.000 |     0.610 | 
##                  |     0.924 |     0.000 |           | 
##                  |     0.610 |     0.000 |           | 
## -----------------|-----------|-----------|-----------|
##        Malignant |         5 |        34 |        39 | 
##                  |     0.128 |     0.872 |     0.390 | 
##                  |     0.076 |     1.000 |           | 
##                  |     0.050 |     0.340 |           | 
## -----------------|-----------|-----------|-----------|
##     Column Total |        66 |        34 |       100 | 
##                  |     0.660 |     0.340 |           | 
## -----------------|-----------|-----------|-----------|
## 
## 

Further Improvement

We could go back to the first model that we created and modify some of the inputs to see if we get a better fit with different k values.

wbcd_train <- wbcd_n[1:469, ]
wbcd_test <- wbcd_n[470:569, ]
wbcd_train_labels <- wbcd[1:469, 1]
wbcd_test_labels <- wbcd[470:569, 1]
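Rather than re-running the model by hand for each candidate, a small loop could report results for several k values at once; the sketch below is a hypothetical shortcut, not part of the textbook workflow, and the individual runs that follow do the same thing one value at a time.

# Sketch: compare several candidate k values on the hold-out test set.
for (k in c(1, 5, 11, 21, 27)) {
  pred <- knn(train = wbcd_train, test = wbcd_test,
              cl = wbcd_train_labels, k = k)
  fn <- sum(pred == "Benign" & wbcd_test_labels == "Malignant")
  cat("k =", k, "| false negatives =", fn,
      "| accuracy =", mean(pred == wbcd_test_labels), "\n")
}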

First, let’s try an even value of k. (Note that with two classes an even k can produce tied votes, which knn() breaks at random.)

wbcd_test_pred <- knn(train = wbcd_train, test = wbcd_test,
                      cl = wbcd_train_labels, k = 22)
CrossTable(x = wbcd_test_labels, y = wbcd_test_pred,
           prop.chisq = FALSE)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  100 
## 
##  
##                  | wbcd_test_pred 
## wbcd_test_labels |    Benign | Malignant | Row Total | 
## -----------------|-----------|-----------|-----------|
##           Benign |        61 |         0 |        61 | 
##                  |     1.000 |     0.000 |     0.610 | 
##                  |     0.968 |     0.000 |           | 
##                  |     0.610 |     0.000 |           | 
## -----------------|-----------|-----------|-----------|
##        Malignant |         2 |        37 |        39 | 
##                  |     0.051 |     0.949 |     0.390 | 
##                  |     0.032 |     1.000 |           | 
##                  |     0.020 |     0.370 |           | 
## -----------------|-----------|-----------|-----------|
##     Column Total |        63 |        37 |       100 | 
##                  |     0.630 |     0.370 |           | 
## -----------------|-----------|-----------|-----------|
## 
## 

Increasing k to 22 yields exactly the same results as the original k = 21 model, and both still outperform the z-score standardized model. Let’s try a smaller k value.

wbcd_test_pred <- knn(train = wbcd_train, test = wbcd_test,
                      cl = wbcd_train_labels, k = 11)
CrossTable(x = wbcd_test_labels, y = wbcd_test_pred,
           prop.chisq = FALSE)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  100 
## 
##  
##                  | wbcd_test_pred 
## wbcd_test_labels |    Benign | Malignant | Row Total | 
## -----------------|-----------|-----------|-----------|
##           Benign |        61 |         0 |        61 | 
##                  |     1.000 |     0.000 |     0.610 | 
##                  |     0.953 |     0.000 |           | 
##                  |     0.610 |     0.000 |           | 
## -----------------|-----------|-----------|-----------|
##        Malignant |         3 |        36 |        39 | 
##                  |     0.077 |     0.923 |     0.390 | 
##                  |     0.047 |     1.000 |           | 
##                  |     0.030 |     0.360 |           | 
## -----------------|-----------|-----------|-----------|
##     Column Total |        64 |        36 |       100 | 
##                  |     0.640 |     0.360 |           | 
## -----------------|-----------|-----------|-----------|
## 
## 

With \(k=11\) the model misclassifies three malignant cases, one more than with \(k=21\) or \(k=22\). Finally, let’s double our initial k and subtract one (k = 41).

wbcd_test_pred <- knn(train = wbcd_train, test = wbcd_test,
                      cl = wbcd_train_labels, k = 41)
CrossTable(x = wbcd_test_labels, y = wbcd_test_pred,
           prop.chisq = FALSE)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  100 
## 
##  
##                  | wbcd_test_pred 
## wbcd_test_labels |    Benign | Malignant | Row Total | 
## -----------------|-----------|-----------|-----------|
##           Benign |        61 |         0 |        61 | 
##                  |     1.000 |     0.000 |     0.610 | 
##                  |     0.938 |     0.000 |           | 
##                  |     0.610 |     0.000 |           | 
## -----------------|-----------|-----------|-----------|
##        Malignant |         4 |        35 |        39 | 
##                  |     0.103 |     0.897 |     0.390 | 
##                  |     0.062 |     1.000 |           | 
##                  |     0.040 |     0.350 |           | 
## -----------------|-----------|-----------|-----------|
##     Column Total |        65 |        35 |       100 | 
##                  |     0.650 |     0.350 |           | 
## -----------------|-----------|-----------|-----------|
## 
## 

With k = 41, performance has dropped further, with four malignant cases misclassified as benign. Overall, this analysis applies the k-NN algorithm to the WBCD dataset and evaluates the accuracy of the model’s predictions using a confusion table. This can help to determine how well the model is likely to perform on new, unseen data and whether it is suitable for use in practice.