Homework 1 - Practicing KNN

Dataset

In this assignment, we use the breast cancer dataset published on Kaggle to build a diagnostic system that classifies tumors as malignant or benign. As motivation, it would be valuable to have an accurate and explainable model able to detect instances of breast cancer. This kind of tool could assist clinicians in the diagnostic process, hopefully leading to better and faster outcomes for patients, meaning both money and lives saved.

In order to apply a kNN algorithm on this kind of dataset, or any model for that matter, it is always a good idea to inspect the actual data prior to applying the model. In order to do so:

First we must load relevant libraries to work with:

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.0     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readxl)
library(scales)
## 
## Attaching package: 'scales'
## 
## The following object is masked from 'package:purrr':
## 
##     discard
## 
## The following object is masked from 'package:readr':
## 
##     col_factor

Let us load in our particular dataset. Again this dataset can be found here.

# Load the cancer dataset
cancer_data <- read_csv("Cancer_Data.csv")
## New names:
## • `` -> `...33`
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
##   dat <- vroom(...)
##   problems(dat)
## Rows: 568 Columns: 33
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (1): diagnosis
## dbl (31): id, radius_mean, texture_mean, perimeter_mean, area_mean, smoothne...
## lgl  (1): ...33
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

To get a sense of the number of features/labels columns we are working with:

# How many different features of interest?
num_columns <- ncol(cancer_data)
print(paste("The dataset has", num_columns, "columns."))
## [1] "The dataset has 33 columns."

To train a model that sufficiently represents the true distribution from which data was sampled, we need to ensure that our dataset is of adequate size. In other words, how many rows does our dataset have?

num_rows <- nrow(cancer_data)
print(paste("The dataset has", num_rows, "rows"))
## [1] "The dataset has 568 rows"

We are working with instances of malignant and benign tumors, so we want to filter out any other labels (unknowns) that may appear in our dataset. One implication of this decision is that our classifier never learns to predict uncertainty for samples that may be unknown. However, for the sake of this project, we are not interested in quantifying the confidence of a prediction, only in assigning a hard label.

# The diagnosis types we are interested in predicting
diagnosis_types <- c("M", "B")
specific_cancer_data <- cancer_data |> 
  filter(diagnosis %in% diagnosis_types)

When we are using kNN, we are interested in some quantitative relationship between two or more variables; it is hard to assign a particular ordinal relationship to categorical variables, and if we do, the resulting distances may mean something we don’t intend to convey. Thus, we avoid this issue for now; if we wanted to work with categorical data, we would use a different model. We ask ourselves: how many of the columns that we are working with are usable (quantitative)? Let us ignore the “id” column for now, even though it is stored as a quantitative variable.

# Let's check numerical values vs. categorical
numeric_cols <- sapply(cancer_data, is.numeric)

# Print numerical columns
print("Numerical Columns:")
## [1] "Numerical Columns:"
print(names(cancer_data)[numeric_cols])
##  [1] "id"                      "radius_mean"            
##  [3] "texture_mean"            "perimeter_mean"         
##  [5] "area_mean"               "smoothness_mean"        
##  [7] "compactness_mean"        "concavity_mean"         
##  [9] "concave points_mean"     "symmetry_mean"          
## [11] "fractal_dimension_mean"  "radius_se"              
## [13] "texture_se"              "perimeter_se"           
## [15] "area_se"                 "smoothness_se"          
## [17] "compactness_se"          "concavity_se"           
## [19] "concave points_se"       "symmetry_se"            
## [21] "fractal_dimension_se"    "radius_worst"           
## [23] "texture_worst"           "perimeter_worst"        
## [25] "area_worst"              "smoothness_worst"       
## [27] "compactness_worst"       "concavity_worst"        
## [29] "concave points_worst"    "symmetry_worst"         
## [31] "fractal_dimension_worst"

In any data-science problem, we should always inspect the data prior to applying a model to it, as this can give us a lot of insight into what relationships we might expect and why the model might or might not perform well at inference. Thus, for these variables, we investigate the relationships between them:

Mean radius and mean area are very evidently correlated, as expected since area grows with the square of the radius. These would be relatively poor features for a model to use together, as we could effectively use one of them alone with nearly the same performance. That doesn’t mean that this figure conveys nothing interesting, however. Instances of smaller mean area/radius are very predictably benign. Thus, one of these features will be used with kNN later on.

specific_cancer_data |> 
  ggplot() +
  geom_point(aes(x = radius_mean, y = area_mean, color = diagnosis)) +
  ggtitle("Radius vs Area")

For texture and smoothness, there is substantial overlap among benign and malignant instances, though the general trend appears to be that greater mean texture and smoothness values are more likely to be malignant. In a kNN scenario, these two features alone would not be very effective.

specific_cancer_data |> 
  ggplot() +
  geom_point(aes(x = texture_mean, y = smoothness_mean, color = diagnosis)) +
  ggtitle("Texture vs Smoothness")

For compactness and concavity, the more concave and more compact, the more likely it tends to be malignant. While these features do appear slightly more correlated with one another, there is still a degree of variability that might be interesting to capture in a kNN model.

specific_cancer_data |> 
  ggplot() +
  geom_point(aes(x = compactness_mean, y = concavity_mean, color = diagnosis)) +
  ggtitle("Compactness vs Concavity")

Symmetry seems to generally be a weaker feature for distinguishing diagnosis, though greater concavity seems to be more associated with malignant cancers.

specific_cancer_data |> 
  ggplot() +
  geom_point(aes(x = symmetry_mean, y = concavity_mean, color = diagnosis)) +
  ggtitle("Symmetry vs Concavity")

Prior to training, it is probably a good idea to split our dataset into a training set and a testing set. If we begin testing on trained data, one outcome that is possible is that our model learns/memorizes our data extremely well, but generalizes poorly when introduced to new samples from the same distribution. We would effectively be blind in knowing how well our model truly performs. Thus, a more impartial way of doing this is to hold out a certain amount of our data (say 20%) to test on later.

set.seed(1)
training_data_rows <- sample(1:nrow(specific_cancer_data), 
                             size = 0.8 * nrow(specific_cancer_data))
train_data <- specific_cancer_data[training_data_rows, ]
test_data <- specific_cancer_data[-training_data_rows, ]

For this model, I will investigate concavity, texture, and radius, in relation to diagnosis. I must filter out other misc. cols I don’t care about, and also perform some scaling (it would be unfair to measure distances on different scales when using kNN as some would dominate others). This scaling should yield features that are all within the range of [0, 1]. We ignore diagnosis in the mutate across, as this is not numeric, but the label:

# notation guide when under pipe operator: https://magrittr.tidyverse.org/
scaled_train_data <- train_data |>
  select(concavity_mean, texture_mean, area_mean, radius_mean, compactness_mean, symmetry_mean, diagnosis) |>
  mutate(across(-diagnosis, ~ (. - min(.)) / (max(.) - min(.))))

scaled_test_data <- test_data |>
  select(concavity_mean, texture_mean, area_mean, radius_mean, compactness_mean, symmetry_mean, diagnosis) |> 
  mutate(across(-diagnosis, ~ (. - min(.)) / (max(.) - min(.))))
head(scaled_test_data)
## # A tibble: 6 × 7
##   concavity_mean texture_mean area_mean radius_mean compactness_mean
##            <dbl>        <dbl>     <dbl>       <dbl>            <dbl>
## 1         0.581         0.335     0.150       0.260            0.750
## 2         0.103         0.389     0.269       0.408            0.150
## 3         0.645         0.448     0.409       0.562            1    
## 4         0.0580        0.126     0.172       0.291            0.121
## 5         0.554         0.216     0.370       0.516            0.538
## 6         0.164         0.342     0.233       0.363            0.169
## # ℹ 2 more variables: symmetry_mean <dbl>, diagnosis <chr>
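Note that the code above scales each split with its own minimum and maximum, so the same raw measurement can map to different scaled values in the training and testing sets. One common alternative, sketched below on toy numbers, is to reuse the training minimum and maximum when scaling the test data. The helper name `scale_with_reference` and the toy vectors are assumptions for illustration and are not part of the original analysis.

```r
# Hypothetical helper: min-max scale `x` using the minimum and maximum
# of a reference vector (for example, the matching training column).
scale_with_reference <- function(x, reference) {
  (x - min(reference)) / (max(reference) - min(reference))
}

# Toy values standing in for one feature column in each split.
train_feature <- c(10, 20, 30, 40)
test_feature  <- c(15, 35, 45)

scaled_train_feature <- scale_with_reference(train_feature, train_feature)
scaled_test_feature  <- scale_with_reference(test_feature, train_feature)

print(scaled_train_feature)
print(scaled_test_feature)
```

With this approach, test values outside the training range fall outside [0, 1] (the toy value 45 does), which is expected and harmless for a distance-based model.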

Let's also plot our training and testing data separately to ensure that both look relatively similar. If we were to train on data that is unrepresentative of the testing data, we would naturally expect very poor performance at inference time.

ggplot(scaled_train_data, aes(x = concavity_mean, y = radius_mean, color = diagnosis)) +
  geom_point() +
  labs(title = "Training Data: Concavity vs. Radius", x = "Concavity Mean", y = "Radius Mean")

ggplot(scaled_test_data, aes(x = concavity_mean, y = radius_mean, color = diagnosis)) +
  geom_point() +
  labs(title = "Testing Data: Concavity vs. Radius", x = "Concavity Mean", y = "Radius Mean")

kNN Algorithm Implementation

In the 2D case, I adapt the implementation of my_knn:

# How to pass column names as arguments to r functions: https://stackoverflow.com/questions/2641653/pass-a-data-frame-column-name-to-a-function
# How to dynamically pass the column originally as a string arg and then have it treated as actual sym for col later: https://stackoverflow.com/questions/57136322/what-does-the-operator-mean-in-r-particularly-in-the-context-symx 
my_knn2d <- function(df, x0_col, x1_col, label_col, x_0, x_1, k = 3) {
  df |>
    mutate(distance = sqrt((df[[x0_col]] - x_0)^2 + 
                           (df[[x1_col]] - x_1)^2)) |>
    arrange(distance) |>
    head(k) |>
    count(!!sym(label_col)) |>
    arrange(desc(n)) |>
    slice(1) |>
    pull(!!sym(label_col))
}

Adapting to the 3D case is fairly trivial. We only need to add an \(x_2\) term to the distance:

# How to pass column names as arguments to r functions: https://stackoverflow.com/questions/2641653/pass-a-data-frame-column-name-to-a-function
# How to dynamically pass the column originally as a string arg and then have it treated as actual sym for col later: https://stackoverflow.com/questions/57136322/what-does-the-operator-mean-in-r-particularly-in-the-context-symx 
my_knn3d <- function(df, x0_col, x1_col, x2_col, label_col, x_0, x_1, x_2, k = 3) {
  df |>
    mutate(distance = sqrt((df[[x0_col]] - x_0)^2 + 
                           (df[[x1_col]] - x_1)^2 + 
                           (df[[x2_col]] - x_2)^2)) |>
    arrange(distance) |>
    head(k) |>
    count(!!sym(label_col)) |>
    arrange(desc(n)) |>
    slice(1) |>
    pull(!!sym(label_col))
}

I took the original my_knn implementation from class and extended it into my_knn2d and my_knn3d. Both versions accept the feature column names and the label column name as string arguments, so that they remain compatible with count and pull. Please see the comments in each implementation for references on how to pass data-frame columns dynamically as arguments.
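Since the 2D and 3D functions differ only in the number of squared terms, the same idea extends to any number of features. The following is a minimal base R sketch of that generalization; the function name `my_knn_nd`, its arguments, and the toy data frame are assumptions for illustration, not the version used in class or in the results below.

```r
# Hypothetical generalization of my_knn2d / my_knn3d.
# df:           data frame holding the training points
# feature_cols: character vector of feature column names
# label_col:    name of the label column
# query:        numeric vector with one value per entry of feature_cols
my_knn_nd <- function(df, feature_cols, label_col, query, k = 3) {
  features <- as.matrix(df[, feature_cols, drop = FALSE])
  # Subtract the query from every row, then take the Euclidean norm per row.
  diffs <- sweep(features, 2, query, FUN = "-")
  distance <- sqrt(rowSums(diffs^2))
  # Indices of the k closest training points.
  nearest <- order(distance)[seq_len(min(k, nrow(df)))]
  votes <- table(df[[label_col]][nearest])
  # which.max returns the first maximum, so ties go to the label that sorts first.
  names(votes)[which.max(votes)]
}

# Toy data with two well-separated groups.
toy_train <- data.frame(
  concavity_mean = c(0.10, 0.15, 0.20, 0.80, 0.85, 0.90),
  radius_mean    = c(0.10, 0.20, 0.15, 0.80, 0.90, 0.85),
  texture_mean   = c(0.30, 0.40, 0.35, 0.60, 0.70, 0.65),
  diagnosis      = c("B", "B", "B", "M", "M", "M")
)

pred_2d <- my_knn_nd(toy_train, c("concavity_mean", "radius_mean"),
                     "diagnosis", query = c(0.12, 0.12), k = 3)
pred_3d <- my_knn_nd(toy_train,
                     c("concavity_mean", "radius_mean", "texture_mean"),
                     "diagnosis", query = c(0.90, 0.90, 0.70), k = 3)
print(pred_2d)
print(pred_3d)
```

Passing the features as a character vector avoids writing one function per dimension, at the cost of requiring `query` to be in the same order as `feature_cols`.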

I then run test predictions to make sure my functions properly pass arguments and can produce a valid prediction. I will test whether these predictions are actually interesting or good in the subsequent results section.

# test both 2D and 3D
my_knn2d(x_0 = 0.35, 
         x_1 = 0.30, 
         df = scaled_train_data,
         x0_col = "concavity_mean",
         x1_col = "radius_mean",
         label_col = "diagnosis",
         k=5)
## [1] "M"
my_knn3d(x_0 = 0.35, 
         x_1 = 0.30, 
         x_2 = 0.50,
         df = scaled_train_data,
         x0_col = "concavity_mean",
         x1_col = "radius_mean",
         x2_col = "texture_mean",
         label_col = "diagnosis",
         k=5)
## [1] "M"

Results, Testing, & Visualization

In this section, I will be testing my model on 2D and 3D scenarios with \(k=3, 5, 10\), and I will use \(k=15\) for the visualizations. The features that I have identified as less highly correlated (unlike area_mean and radius_mean) and more interesting are mean concavity, radius, and texture. While not shown, I previously tried compactness in place of each of the other features, and noticed a cap on accuracy around 85% and poor confusion matrices across all k. I will use two of the selected features (concavity and radius) for 2D kNN and all three for 3D kNN. I will then compare the confusion matrices and accuracies at the end. For the 2D and 3D models, I will also visualize the regions/surfaces of my algorithm’s predictions.

2D Model

Let us try for \(k = 3\):

k = 3

k3_2d_preds <- NULL

for (row in 1:nrow(scaled_test_data)) {
  k3_2d_preds[row] <- my_knn2d(
    df = scaled_train_data,
    x0_col = "concavity_mean",
    x1_col = "radius_mean",
    label_col = "diagnosis",
    x_0 = scaled_test_data$concavity_mean[row], 
    x_1 = scaled_test_data$radius_mean[row], 
    k = k
  )
}

# Lets see the confusion matrix
k3_2d_cm = table(k3_2d_preds, scaled_test_data$diagnosis)
print(k3_2d_cm)
##            
## k3_2d_preds  B  M
##           B 67  6
##           M  4 37
# Calculate accuracy
k3_2d_accuracy <- sum(k3_2d_preds == scaled_test_data$diagnosis) / nrow(scaled_test_data)
print(k3_2d_accuracy)
## [1] 0.9122807

Now for \(k=5\):

k = 5

k5_2d_preds <- NULL

for (row in 1:nrow(scaled_test_data)) {
  k5_2d_preds[row] <- my_knn2d(
    df = scaled_train_data,
    x0_col = "concavity_mean",
    x1_col = "radius_mean",
    label_col = "diagnosis",
    x_0 = scaled_test_data$concavity_mean[row], 
    x_1 = scaled_test_data$radius_mean[row], 
    k = k
  )
}

# Lets see the confusion matrix
k5_2d_cm = table(k5_2d_preds, scaled_test_data$diagnosis)
print(k5_2d_cm)
##            
## k5_2d_preds  B  M
##           B 65  6
##           M  6 37
# Calculate accuracy
k5_2d_accuracy <- sum(k5_2d_preds == scaled_test_data$diagnosis) / nrow(scaled_test_data)
print(k5_2d_accuracy)
## [1] 0.8947368

Now for \(k=10\):

k = 10

k10_2d_preds <- NULL

for (row in 1:nrow(scaled_test_data)) {
  k10_2d_preds[row] <- my_knn2d(
    df = scaled_train_data,
    x0_col = "concavity_mean",
    x1_col = "radius_mean",
    label_col = "diagnosis",
    x_0 = scaled_test_data$concavity_mean[row], 
    x_1 = scaled_test_data$radius_mean[row], 
    k = k
  )
}

# Lets see the confusion matrix
k10_2d_cm = table(k10_2d_preds, scaled_test_data$diagnosis)
print(k10_2d_cm)
##             
## k10_2d_preds  B  M
##            B 68  7
##            M  3 36
# Calculate accuracy
k10_2d_accuracy <- sum(k10_2d_preds == scaled_test_data$diagnosis) / nrow(scaled_test_data)
print(k10_2d_accuracy)
## [1] 0.9122807

3D Model

Let us try for \(k = 3\):

k = 3

k3_3d_preds <- NULL

for (row in 1:nrow(scaled_test_data)) {
  k3_3d_preds[row] <- my_knn3d(
    df = scaled_train_data,
    x0_col = "concavity_mean",
    x1_col = "radius_mean",
    x2_col = "texture_mean",
    label_col = "diagnosis",
    x_0 = scaled_test_data$concavity_mean[row], 
    x_1 = scaled_test_data$radius_mean[row], 
    x_2 = scaled_test_data$texture_mean[row], 
    k = k
  )
}

k3_3d_cm = table(k3_3d_preds, scaled_test_data$diagnosis)
print(k3_3d_cm)
##            
## k3_3d_preds  B  M
##           B 70  9
##           M  1 34
k3_3d_accuracy <- sum(k3_3d_preds == scaled_test_data$diagnosis) / nrow(scaled_test_data)
print(k3_3d_accuracy)
## [1] 0.9122807

Let us try for \(k = 5\):

k = 5

k5_3d_preds <- NULL

for (row in 1:nrow(scaled_test_data)) {
  k5_3d_preds[row] <- my_knn3d(
    df = scaled_train_data,
    x0_col = "concavity_mean",
    x1_col = "radius_mean",
    x2_col = "texture_mean",
    label_col = "diagnosis",
    x_0 = scaled_test_data$concavity_mean[row], 
    x_1 = scaled_test_data$radius_mean[row], 
    x_2 = scaled_test_data$texture_mean[row], 
    k = k
  )
}

k5_3d_cm = table(k5_3d_preds, scaled_test_data$diagnosis)
print(k5_3d_cm)
##            
## k5_3d_preds  B  M
##           B 69 10
##           M  2 33
k5_3d_accuracy <- sum(k5_3d_preds == scaled_test_data$diagnosis) / nrow(scaled_test_data)
print(k5_3d_accuracy)
## [1] 0.8947368

Let us try for \(k = 10\):

k = 10

k10_3d_preds <- NULL

for (row in 1:nrow(scaled_test_data)) {
  k10_3d_preds[row] <- my_knn3d(
    df = scaled_train_data,
    x0_col = "concavity_mean",
    x1_col = "radius_mean",
    x2_col = "texture_mean",
    label_col = "diagnosis",
    x_0 = scaled_test_data$concavity_mean[row], 
    x_1 = scaled_test_data$radius_mean[row], 
    x_2 = scaled_test_data$texture_mean[row], 
    k = k
  )
}

k10_3d_cm = table(k10_3d_preds, scaled_test_data$diagnosis)
print(k10_3d_cm)
##             
## k10_3d_preds  B  M
##            B 70  9
##            M  1 34
k10_3d_accuracy <- sum(k10_3d_preds == scaled_test_data$diagnosis) / nrow(scaled_test_data)
print(k10_3d_accuracy)
## [1] 0.9122807
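The per-k loops above all follow the same pattern, so they could be condensed into one helper that returns the accuracy for a given k. The sketch below shows the idea on toy data; `knn_predict`, `accuracy_for_k`, and the toy data frames are hypothetical names introduced here, and the numbers it prints are unrelated to the results reported in this section.

```r
# Hypothetical compact kNN classifier (majority vote among the k nearest rows).
knn_predict <- function(train, feature_cols, label_col, query, k) {
  features <- as.matrix(train[, feature_cols, drop = FALSE])
  distance <- sqrt(rowSums(sweep(features, 2, query, FUN = "-")^2))
  votes <- table(train[[label_col]][order(distance)[seq_len(k)]])
  names(votes)[which.max(votes)]
}

# Hypothetical helper: fraction of test rows whose label is predicted correctly.
accuracy_for_k <- function(train, test, feature_cols, label_col, k) {
  preds <- vapply(seq_len(nrow(test)), function(row) {
    knn_predict(train, feature_cols, label_col,
                query = unlist(test[row, feature_cols]), k = k)
  }, character(1))
  mean(preds == test[[label_col]])
}

# Toy, well-separated data standing in for the scaled train/test sets.
toy_train <- data.frame(
  concavity_mean = c(0.10, 0.15, 0.20, 0.25, 0.75, 0.80, 0.85, 0.90),
  radius_mean    = c(0.10, 0.20, 0.15, 0.25, 0.80, 0.90, 0.85, 0.75),
  diagnosis      = c("B", "B", "B", "B", "M", "M", "M", "M")
)
toy_test <- data.frame(
  concavity_mean = c(0.12, 0.88),
  radius_mean    = c(0.18, 0.82),
  diagnosis      = c("B", "M")
)

k_values <- c(1, 3, 5)
accuracies <- sapply(k_values, function(k) {
  accuracy_for_k(toy_train, toy_test,
                 c("concavity_mean", "radius_mean"), "diagnosis", k)
})
names(accuracies) <- paste0("k=", k_values)
print(accuracies)
```

Computing every accuracy through one function also removes the risk of a copy-paste slip between blocks, since each result is derived from the predictions made for that same k.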

Interpreting Accuracy

Accuracy is not always the most reliable metric. When the data is highly imbalanced, accuracy can be high even if the model predicts only the most represented class in the dataset. To gauge how imbalanced our dataset is (and therefore how much we should distrust accuracy as a metric), here are the counts of each class and their ratios:

diagnosis_counts_train <- table(scaled_train_data$diagnosis)
diagnosis_ratio_train <- diagnosis_counts_train / sum(diagnosis_counts_train)
print(diagnosis_counts_train)
## 
##   B   M 
## 285 169
print(diagnosis_ratio_train)
## 
##         B         M 
## 0.6277533 0.3722467
diagnosis_counts_test <- table(scaled_test_data$diagnosis)
diagnosis_ratio_test <- diagnosis_counts_test / sum(diagnosis_counts_test)
print(diagnosis_counts_test)
## 
##  B  M 
## 71 43
print(diagnosis_ratio_test)
## 
##        B        M 
## 0.622807 0.377193

While the data is not exactly balanced, a roughly 63/37 split of benign to malignant cases is not that terrible, considering that we are working with cancer data.

The \(k=3,5,10\) 2D kNN accuracies were as follows:

print(k3_2d_accuracy)
## [1] 0.9122807
print(k5_2d_accuracy)
## [1] 0.8947368
print(k10_2d_accuracy)
## [1] 0.9122807

The \(k=3,5,10\) 3D kNN accuracies were as follows:

print(k3_3d_accuracy)
## [1] 0.9122807
print(k5_3d_accuracy)
## [1] 0.8947368
print(k10_3d_accuracy)
## [1] 0.9122807

As it turns out, our selection of k in this scenario does not have a very large sway on the accuracy (the exception being \(k=5\), which leads to a slight drop). I would surmise that this is because the classes generally appeared well separated in the plots at the beginning, and because the sample is large enough for the metric to remain stable under small changes in k. If we were to select a massive k, say \(k=400\), however, accuracy drops to the proportion of the largest class in the test data (roughly 0.62, as shown below). Such a model predicts only the largest class (benign):

bad_k = 400

bad_preds <- NULL

for (row in 1:nrow(scaled_test_data)) {
  bad_preds[row] <- my_knn2d(
    df = scaled_train_data,
    x0_col = "concavity_mean",
    x1_col = "radius_mean",
    label_col = "diagnosis",
    x_0 = scaled_test_data$concavity_mean[row], 
    x_1 = scaled_test_data$radius_mean[row], 
    k = bad_k
  )
}

# Lets see the confusion matrix
bad_cm = table(bad_preds, scaled_test_data$diagnosis)
print(bad_cm)
##          
## bad_preds  B  M
##         B 71 43
# Calculate accuracy
bad_accuracy <- sum(bad_preds == scaled_test_data$diagnosis) / nrow(scaled_test_data)
print(bad_accuracy)
## [1] 0.622807
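A short piece of arithmetic explains why \(k=400\) can never predict malignant here. The counts below are copied from the class tables printed earlier in this report; the variable names are introduced only for this illustration.

```r
# Counts taken from the training and testing class tables printed earlier.
n_malignant_train <- 169
n_benign_test     <- 71
n_malignant_test  <- 43
bad_k <- 400

# Worst case for benign: every malignant training point is among the k neighbors.
max_malignant_votes <- n_malignant_train
min_benign_votes    <- bad_k - n_malignant_train

# If even the worst case has a benign majority, the vote is always benign.
always_benign <- min_benign_votes > max_malignant_votes

# Accuracy of a model that always predicts benign on the test set.
majority_baseline <- n_benign_test / (n_benign_test + n_malignant_test)

print(min_benign_votes)
print(always_benign)
print(majority_baseline)
```

Any set of 400 neighbors contains at least 231 benign points against at most 169 malignant ones, so the majority vote is benign for every query, and the accuracy equals the benign share of the test set.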

The \(k=3,5,10\) 2D kNN confusion matrices were as follows:

print(k3_2d_cm)
##            
## k3_2d_preds  B  M
##           B 67  6
##           M  4 37
print(k5_2d_cm)
##            
## k5_2d_preds  B  M
##           B 65  6
##           M  6 37
print(k10_2d_cm)
##             
## k10_2d_preds  B  M
##            B 68  7
##            M  3 36

Across \(k=3,5,10\), the errors shifted only slightly: false malignant predictions (benign tumors predicted as malignant) numbered 4, 6, and 3, while missed malignant tumors numbered 6, 6, and 7. In the context of this particular problem, detecting cancer, a false malignant prediction is preferable to missing an instance of cancer. Overall, increasing k within this range had almost no effect on performance, whereas pushing k too far, as shown earlier with \(k=400\), caused the model to predict only one class (the dominant class).

The \(k=3,5,10\) 3D kNN confusion matrices were as follows:

print(k3_3d_cm)
##            
## k3_3d_preds  B  M
##           B 70  9
##           M  1 34
print(k5_3d_cm)
##            
## k5_3d_preds  B  M
##           B 69 10
##           M  2 33
print(k10_3d_cm)
##             
## k10_3d_preds  B  M
##            B 70  9
##            M  1 34

For the 3D kNN approach, our model almost never misclassified a benign tumor (1 to 2 false malignant predictions), but it missed more malignant tumors (9 to 10, compared with 6 to 7 for the 2D model). Thus, I am a little more skeptical of the 3D model than the 2D one, as I prefer the mistake of predicting malignant when the tumor is actually benign over the reverse.
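The trade-off described above can be made concrete with sensitivity (the share of malignant tumors that are caught) and specificity (the share of benign tumors that are correctly cleared). The sketch below rebuilds the \(k=3\) confusion matrices from the printed output; the helper names `sensitivity` and `specificity` are introduced here for illustration.

```r
# Confusion matrices copied from the printed k = 3 output
# (rows = predicted label, columns = actual label). matrix() fills by column.
cm_2d_k3 <- matrix(c(67, 4, 6, 37), nrow = 2,
                   dimnames = list(predicted = c("B", "M"),
                                   actual = c("B", "M")))
cm_3d_k3 <- matrix(c(70, 1, 9, 34), nrow = 2,
                   dimnames = list(predicted = c("B", "M"),
                                   actual = c("B", "M")))

# Sensitivity: malignant tumors predicted malignant / all malignant tumors.
sensitivity <- function(cm) cm["M", "M"] / sum(cm[, "M"])
# Specificity: benign tumors predicted benign / all benign tumors.
specificity <- function(cm) cm["B", "B"] / sum(cm[, "B"])

print(c(sens_2d = sensitivity(cm_2d_k3), spec_2d = specificity(cm_2d_k3)))
print(c(sens_3d = sensitivity(cm_3d_k3), spec_3d = specificity(cm_3d_k3)))
```

For these two matrices, the 2D model catches 37 of 43 malignant tumors while the 3D model catches 34 of 43, which is the difference that accuracy alone hides.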

Visualizations

2D

Here, we generate a grid of evenly spaced combinations of the (scaled) features that we are interested in investigating. Using this grid, we can visualize the regions in which our 2D kNN algorithm predicts malignant or benign.

every_cancer_ever <- expand.grid(concavity_mean = seq(from = 0, to = 1, by = 0.01),
                                 radius_mean = seq(from = 0, to = 1, by = 0.01))

In the following, we iterate over all combinations of our features of interest and generate predictions with our model. I select \(k=15\), as our training set size \(n=285\text{ B}+169 \text{ M}=454 \text{ total}\) seems sufficient and (relatively) balanced enough to work with this large a value of k.

preds_grid = NULL

for (row in 1:nrow(every_cancer_ever)) {
  preds_grid[row] <- my_knn2d(
    df = scaled_train_data,
    x0_col = "concavity_mean",
    x1_col = "radius_mean",
    label_col = "diagnosis",
    x_0 = every_cancer_ever[row, "concavity_mean"], 
    x_1 = every_cancer_ever[row, "radius_mean"], 
    k = 15
  )
}

The prediction loop above takes a long time to run. From the resulting plot, the relationship can be described as follows: benign predictions generally have a maximum radius around 0.625 and a maximum mean concavity near 0.75 (both scaled). Benign predictions with higher concavity have a smaller radius, and vice versa. That is, we can almost draw a strongly curved line with a negative slope to separate the benign and malignant predictions, with only a tiny cluster of malignant predictions in the red.

every_cancer_ever |>
  mutate(prediction=preds_grid) |>
  ggplot() +
  geom_tile(aes(x=concavity_mean, y=radius_mean, fill=prediction))

3D

Again, we need to generate the set of all combinations of features, this time including texture_mean. I use a coarser step size (0.05 instead of 0.01), as the grid would otherwise take far too long to generate and plot. In order to work with 3D visualizations, I will be using a package called “plotly”:

library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
every_cancer_ever <- expand.grid(concavity_mean = seq(from = 0, to = 1, by = 0.05),
                                 radius_mean = seq(from = 0, to = 1, by = 0.05),
                                 texture_mean = seq(from = 0, to = 1, by = 0.05))
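To see why the step size matters, we can count the grid points that each choice produces, since every point costs one kNN prediction. The helper `grid_points` below is a hypothetical name introduced for this check.

```r
# Hypothetical helper: number of grid points when each of `n_features`
# features takes the values seq(0, 1, by = step).
grid_points <- function(step, n_features) {
  length(seq(from = 0, to = 1, by = step))^n_features
}

points_2d_fine   <- grid_points(0.01, 2)  # grid used for the 2D plot
points_3d_coarse <- grid_points(0.05, 3)  # grid used for the 3D plot
points_3d_fine   <- grid_points(0.01, 3)  # 3D grid if the 2D step were kept

print(c(points_2d_fine, points_3d_coarse, points_3d_fine))
```

Keeping the 0.01 step in three dimensions would require over a million predictions, roughly a hundred times more than the 2D grid, while the 0.05 step keeps the 3D grid about the same size as the 2D one.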

Again, we loop over all possible combinations of input features that we generated, and ask our model to make a prediction that we then save under preds_grid. This time, we have included the texture_mean feature. To remain consistent with the other 2D visualization, I am selecting \(k=15\) again.

preds_grid <- NULL

for (row in 1:nrow(every_cancer_ever)) {
  preds_grid[row] <- my_knn3d(
    df = scaled_train_data,
    x0_col = "concavity_mean",
    x1_col = "radius_mean",
    x2_col = "texture_mean",
    label_col = "diagnosis",
    x_0 = every_cancer_ever[row, "concavity_mean"], 
    x_1 = every_cancer_ever[row, "radius_mean"], 
    x_2 = every_cancer_ever[row, "texture_mean"],
    k = 15
  )
}

plot_ly(
  x = every_cancer_ever$concavity_mean, 
  y = every_cancer_ever$radius_mean, 
  z = every_cancer_ever$texture_mean, 
  color = preds_grid,
  colors = c("red", "blue"),
  type = "scatter3d",
)
## No scatter3d mode specifed:
##   Setting the mode to markers
##   Read more about this attribute -> https://plotly.com/r/reference/#scatter-mode

Here, x corresponds to concavity, y to radius, and z to texture. Generally, for lower concavity and radius values, the model tends to predict benign, across a wide range of texture values along the z axis. Conversely, for greater values of all three features (closer to 1), we tend to get malignant predictions. These findings for breast cancer are fairly consistent with other types of cancer, such as melanoma, where we would be more concerned with moles that are (1) highly asymmetric, (2) irregular in border, (3) uneven in color, (4) large in diameter, and (5) changing in shape. In effect, our dataset contains many analogous features, tailored to breast cancer instead.

As for our actual model performance, I was a little surprised that \(k\) mattered so little for a fairly large dataset in which the feature values of the different classes tended to be well separated. It was also interesting to note that the 3D kNN provided nearly no advantage over 2D in our predictions, though I would need to test many other combinations of features before making such a generalization. Lastly, it was interesting to note that the choice of 2D vs. 3D kNN, and sometimes the value of \(k\), involved a trade-off between correct positive predictions and correct negative predictions (as was seen in the confusion matrices).

For future work, we could alternatively imagine separating our predictions along some hyper-plane, perhaps considering a different model like a support vector machine.