Research Question: How accurately can kNN classify benign and malignant breast cancer tumors, and what is the optimal K value for this process?

Introduction: Because breast cancer has a high mortality rate and is a leading cause of death among women worldwide, machine learning algorithms have gained importance for detecting benign and malignant tumors early and reliably. With many physicians unsure how to read mammogram results, and radiologists misdiagnosing 15% of their patients, these methods are essential for efficient and accurate outcomes. Using the Breast Cancer Wisconsin (Diagnostic) dataset from the University of California, Irvine Machine Learning Repository, I applied k-fold cross-validation to find an optimal K value, then set that value as the K parameter for kNN to measure how accurately the model classifies benign and malignant tumors.

Libraries

library(readr)  # Importing the readr library for reading data
library(class)  # Importing the class library for k-nearest neighbors algorithm
library(gmodels)  # Importing the gmodels library for creating cross-tabulation tables
library(dplyr)  # Importing the dplyr library for data manipulation
library(ggplot2)  # Importing the ggplot2 library for plot visualization
library(caret)  # Importing the caret library for model training and cross-validation

Read data

df <- read.csv("C:/Users/sahas/OneDrive/Documents/ISLRData/data.csv")

Display the structure and summary of the data frame

str(df)
## 'data.frame':    569 obs. of  33 variables:
##  $ id                     : int  842302 842517 84300903 84348301 84358402 843786 844359 84458202 844981 84501001 ...
##  $ diagnosis              : chr  "M" "M" "M" "M" ...
##  $ radius_mean            : num  18 20.6 19.7 11.4 20.3 ...
##  $ texture_mean           : num  10.4 17.8 21.2 20.4 14.3 ...
##  $ perimeter_mean         : num  122.8 132.9 130 77.6 135.1 ...
##  $ area_mean              : num  1001 1326 1203 386 1297 ...
##  $ smoothness_mean        : num  0.1184 0.0847 0.1096 0.1425 0.1003 ...
##  $ compactness_mean       : num  0.2776 0.0786 0.1599 0.2839 0.1328 ...
##  $ concavity_mean         : num  0.3001 0.0869 0.1974 0.2414 0.198 ...
##  $ concave.points_mean    : num  0.1471 0.0702 0.1279 0.1052 0.1043 ...
##  $ symmetry_mean          : num  0.242 0.181 0.207 0.26 0.181 ...
##  $ fractal_dimension_mean : num  0.0787 0.0567 0.06 0.0974 0.0588 ...
##  $ radius_se              : num  1.095 0.543 0.746 0.496 0.757 ...
##  $ texture_se             : num  0.905 0.734 0.787 1.156 0.781 ...
##  $ perimeter_se           : num  8.59 3.4 4.58 3.44 5.44 ...
##  $ area_se                : num  153.4 74.1 94 27.2 94.4 ...
##  $ smoothness_se          : num  0.0064 0.00522 0.00615 0.00911 0.01149 ...
##  $ compactness_se         : num  0.049 0.0131 0.0401 0.0746 0.0246 ...
##  $ concavity_se           : num  0.0537 0.0186 0.0383 0.0566 0.0569 ...
##  $ concave.points_se      : num  0.0159 0.0134 0.0206 0.0187 0.0188 ...
##  $ symmetry_se            : num  0.03 0.0139 0.0225 0.0596 0.0176 ...
##  $ fractal_dimension_se   : num  0.00619 0.00353 0.00457 0.00921 0.00511 ...
##  $ radius_worst           : num  25.4 25 23.6 14.9 22.5 ...
##  $ texture_worst          : num  17.3 23.4 25.5 26.5 16.7 ...
##  $ perimeter_worst        : num  184.6 158.8 152.5 98.9 152.2 ...
##  $ area_worst             : num  2019 1956 1709 568 1575 ...
##  $ smoothness_worst       : num  0.162 0.124 0.144 0.21 0.137 ...
##  $ compactness_worst      : num  0.666 0.187 0.424 0.866 0.205 ...
##  $ concavity_worst        : num  0.712 0.242 0.45 0.687 0.4 ...
##  $ concave.points_worst   : num  0.265 0.186 0.243 0.258 0.163 ...
##  $ symmetry_worst         : num  0.46 0.275 0.361 0.664 0.236 ...
##  $ fractal_dimension_worst: num  0.1189 0.089 0.0876 0.173 0.0768 ...
##  $ X                      : logi  NA NA NA NA NA NA ...
# Remove unnecessary columns from the data frame
df <- select(df, -id, -X)

# Create a table showing the count of each unique value in the 'diagnosis' column
table(df$diagnosis)
## 
##   B   M 
## 357 212
# Divide the B and M counts by the total and convert to rounded percentages
round(prop.table(table(df$diagnosis)) * 100, digits = 1)
## 
##    B    M 
## 62.7 37.3

Check for any missing/NULL values in data

sum(is.na(df))
## [1] 0

Display first 6 rows of data

head(df)
##   diagnosis radius_mean texture_mean perimeter_mean area_mean smoothness_mean
## 1         M       17.99        10.38         122.80    1001.0         0.11840
## 2         M       20.57        17.77         132.90    1326.0         0.08474
## 3         M       19.69        21.25         130.00    1203.0         0.10960
## 4         M       11.42        20.38          77.58     386.1         0.14250
## 5         M       20.29        14.34         135.10    1297.0         0.10030
## 6         M       12.45        15.70          82.57     477.1         0.12780
##   compactness_mean concavity_mean concave.points_mean symmetry_mean
## 1          0.27760         0.3001             0.14710        0.2419
## 2          0.07864         0.0869             0.07017        0.1812
## 3          0.15990         0.1974             0.12790        0.2069
## 4          0.28390         0.2414             0.10520        0.2597
## 5          0.13280         0.1980             0.10430        0.1809
## 6          0.17000         0.1578             0.08089        0.2087
##   fractal_dimension_mean radius_se texture_se perimeter_se area_se
## 1                0.07871    1.0950     0.9053        8.589  153.40
## 2                0.05667    0.5435     0.7339        3.398   74.08
## 3                0.05999    0.7456     0.7869        4.585   94.03
## 4                0.09744    0.4956     1.1560        3.445   27.23
## 5                0.05883    0.7572     0.7813        5.438   94.44
## 6                0.07613    0.3345     0.8902        2.217   27.19
##   smoothness_se compactness_se concavity_se concave.points_se symmetry_se
## 1      0.006399        0.04904      0.05373           0.01587     0.03003
## 2      0.005225        0.01308      0.01860           0.01340     0.01389
## 3      0.006150        0.04006      0.03832           0.02058     0.02250
## 4      0.009110        0.07458      0.05661           0.01867     0.05963
## 5      0.011490        0.02461      0.05688           0.01885     0.01756
## 6      0.007510        0.03345      0.03672           0.01137     0.02165
##   fractal_dimension_se radius_worst texture_worst perimeter_worst area_worst
## 1             0.006193        25.38         17.33          184.60     2019.0
## 2             0.003532        24.99         23.41          158.80     1956.0
## 3             0.004571        23.57         25.53          152.50     1709.0
## 4             0.009208        14.91         26.50           98.87      567.7
## 5             0.005115        22.54         16.67          152.20     1575.0
## 6             0.005082        15.47         23.75          103.40      741.6
##   smoothness_worst compactness_worst concavity_worst concave.points_worst
## 1           0.1622            0.6656          0.7119               0.2654
## 2           0.1238            0.1866          0.2416               0.1860
## 3           0.1444            0.4245          0.4504               0.2430
## 4           0.2098            0.8663          0.6869               0.2575
## 5           0.1374            0.2050          0.4000               0.1625
## 6           0.1791            0.5249          0.5355               0.1741
##   symmetry_worst fractal_dimension_worst
## 1         0.4601                 0.11890
## 2         0.2750                 0.08902
## 3         0.3613                 0.08758
## 4         0.6638                 0.17300
## 5         0.2364                 0.07678
## 6         0.3985                 0.12440
# Define a function to normalize a vector
normalize <- function(x) {
    return ((x - min(x)) / (max(x) - min(x)))
}
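# Example (illustrative): normalize(c(1, 3, 5)) returns 0.0, 0.5, 1.0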
# Apply the normalization function to all columns except 'diagnosis' in the data frame
new_df <- as.data.frame(lapply(select(df, -diagnosis), normalize))

# Display summary statistics of selected columns in the new data frame
summary(select(new_df, radius_mean, smoothness_mean))
##   radius_mean     smoothness_mean 
##  Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.2233   1st Qu.:0.3046  
##  Median :0.3024   Median :0.3904  
##  Mean   :0.3382   Mean   :0.3948  
##  3rd Qu.:0.4164   3rd Qu.:0.4755  
##  Max.   :1.0000   Max.   :1.0000
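As a quick sanity check, every normalized column should now span exactly 0 to 1:

# Overall min and max across all normalized columns; should print 0 and 1
range(sapply(new_df, range))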

Split the data frame into training and test sets

df_train <- new_df[1:140, ]   # first 140 rows (~25%) for training
df_test <- new_df[141:569, ]  # remaining 429 rows (~75%) for testing
df_train_labels <- df[1:140, 1]   # 'diagnosis' labels (first column) for the training rows
df_test_labels  <- df[141:569, 1] # 'diagnosis' labels for the test rows
# Apply k-nearest neighbors algorithm to predict the labels for the test set
df_test_pred <- knn(train = df_train, test = df_test, cl = df_train_labels, k = 12)
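Note that this split takes the first 140 rows in file order; since the rows are not guaranteed to be shuffled, a randomized split is generally safer. A minimal sketch (the seed and the _r variable names are illustrative):

set.seed(123)  # illustrative seed for reproducibility
train_idx <- sample(nrow(new_df), size = 140)  # randomly choose 140 training rows
df_train_r <- new_df[train_idx, ]
df_test_r  <- new_df[-train_idx, ]
df_train_labels_r <- df$diagnosis[train_idx]
df_test_labels_r  <- df$diagnosis[-train_idx]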

# Create a cross-tabulation table for the predicted labels and true labels
ct <- CrossTable(x = df_test_labels, y = df_test_pred, prop.chisq = FALSE)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  429 
## 
##  
##                | df_test_pred 
## df_test_labels |         B |         M | Row Total | 
## ---------------|-----------|-----------|-----------|
##              B |       278 |        20 |       298 | 
##                |     0.933 |     0.067 |     0.695 | 
##                |     0.986 |     0.136 |           | 
##                |     0.648 |     0.047 |           | 
## ---------------|-----------|-----------|-----------|
##              M |         4 |       127 |       131 | 
##                |     0.031 |     0.969 |     0.305 | 
##                |     0.014 |     0.864 |           | 
##                |     0.009 |     0.296 |           | 
## ---------------|-----------|-----------|-----------|
##   Column Total |       282 |       147 |       429 | 
##                |     0.657 |     0.343 |           | 
## ---------------|-----------|-----------|-----------|
## 
## 
ct
## $t
##    y
## x     B   M
##   B 278  20
##   M   4 127
## 
## $prop.row
##    y
## x            B          M
##   B 0.93288591 0.06711409
##   M 0.03053435 0.96946565
## 
## $prop.col
##    y
## x           B         M
##   B 0.9858156 0.1360544
##   M 0.0141844 0.8639456
## 
## $prop.tbl
##    y
## x             B           M
##   B 0.648018648 0.046620047
##   M 0.009324009 0.296037296

Calculate the accuracy by dividing the sum of correctly predicted labels by the total count

(278 + 127) / 429
## [1] 0.9440559
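The accuracy can also be computed directly from the prediction vector rather than from hard-coded cell counts, and caret's confusionMatrix() (already loaded above) reports sensitivity and specificity alongside it. A minimal sketch:

# Proportion of predictions that match the true test labels
mean(df_test_pred == df_test_labels)

# Fuller summary, treating 'M' (malignant) as the positive class
confusionMatrix(factor(df_test_pred), factor(df_test_labels), positive = "M")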
# Define a function to map 'M' to 1 and 'B' to 0
diagnosis_value <- function(diagnosis) {
    if (diagnosis == 'M') {
        return(1)
    } else {
        return(0)
    }
}

# Apply the diagnosis_value function to the 'diagnosis' column
temp <- sapply(df$diagnosis, diagnosis_value)
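Since temp is not used by the plots below (they color points by diagnosis directly), the same 0/1 encoding can also be produced with a single vectorized call:

# Equivalent vectorized encoding: 1 for malignant ('M'), 0 for benign ('B')
temp <- ifelse(df$diagnosis == "M", 1, 0)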

Create a scatter plot with regression line for ‘radius_mean’ and ‘texture_mean’ columns

ggplot(df, aes(x = radius_mean, y = texture_mean, color = diagnosis)) +
    geom_point() +
    geom_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'

Create a scatter plot for ‘smoothness_mean’ and ‘compactness_mean’ columns

ggplot(df, aes(x = smoothness_mean, y = compactness_mean, color = diagnosis)) +
    geom_point()

K-Fold Cross-Validation:

wbcd = read.csv("C:/Users/sahas/OneDrive/Documents/ISLRData/data.csv", stringsAsFactors = FALSE)
# Removing the empty 'X' column from the 'wbcd' dataset
wbcd$X = NULL
# Converting 'diagnosis' to a factor so caret treats the task as classification
wbcd$diagnosis = factor(wbcd$diagnosis)

# Training a k-nearest neighbors (knn) model using the 'train' function from 'caret' package
fit <- train(
  diagnosis ~ .,
  data = wbcd[, -1],  # drop the 'id' column
  method = "knn",
  tuneGrid = expand.grid(k = 1:70),
  trControl = trainControl(method = "cv", number = 10),
  metric = "Accuracy"
)
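Note that this cross-validation runs on the raw feature values, while the earlier kNN used min-max-normalized data. For consistency, caret can rescale the features inside each fold through its preProcess argument; a variant sketch (its results may differ slightly from the table below):

fit_scaled <- train(
  diagnosis ~ .,
  data = wbcd[, -1],
  method = "knn",
  tuneGrid = expand.grid(k = 1:70),
  trControl = trainControl(method = "cv", number = 10),
  preProcess = "range",  # min-max scaling, matching normalize() above
  metric = "Accuracy"
)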

Retrieve the results

results <- fit$results
best_k <- fit$bestTune$k

Print the results table

print(results)
##     k  Accuracy     Kappa AccuracySD    KappaSD
## 1   1 0.9173743 0.8214028 0.02350543 0.05054478
## 2   2 0.9174682 0.8222554 0.01993043 0.04301568
## 3   3 0.9351072 0.8592993 0.02429340 0.05314472
## 4   4 0.9316286 0.8528427 0.02749441 0.05802215
## 5   5 0.9367081 0.8639830 0.02548378 0.05305371
## 6   6 0.9332307 0.8557280 0.02331774 0.05023147
## 7   7 0.9349851 0.8594529 0.02219954 0.04741365
## 8   8 0.9350153 0.8596740 0.02355979 0.04994585
## 9   9 0.9349851 0.8593845 0.02369001 0.05102852
## 10 10 0.9332296 0.8552975 0.02316747 0.04981634
## 11 11 0.9349235 0.8590273 0.02067371 0.04448934
## 12 12 0.9402482 0.8702771 0.01897607 0.04049362
## 13 13 0.9367081 0.8625168 0.02242255 0.04839577
## 14 14 0.9315055 0.8509312 0.02788767 0.06115760
## 15 15 0.9279341 0.8429443 0.02805059 0.06116059
## 16 16 0.9279654 0.8432029 0.02393962 0.05266600
## 17 17 0.9279654 0.8428436 0.02537651 0.05587704
## 18 18 0.9262412 0.8390187 0.02572809 0.05637077
## 19 19 0.9315055 0.8502005 0.02521901 0.05606449
## 20 20 0.9315055 0.8502005 0.02521901 0.05606449
## 21 21 0.9332599 0.8539212 0.02287859 0.05072240
## 22 22 0.9314742 0.8499285 0.02395668 0.05280239
## 23 23 0.9297198 0.8462078 0.02606779 0.05766754
## 24 24 0.9279654 0.8421618 0.02532790 0.05616518
## 25 25 0.9279654 0.8421618 0.02532790 0.05616518
## 26 26 0.9279654 0.8421618 0.02532790 0.05616518
## 27 27 0.9262110 0.8380312 0.02578823 0.05759923
## 28 28 0.9279654 0.8417520 0.02393962 0.05336501
## 29 29 0.9261797 0.8376818 0.02595091 0.05763294
## 30 30 0.9279038 0.8416499 0.02428290 0.05387065
## 31 31 0.9261797 0.8378250 0.02464530 0.05428923
## 32 32 0.9226396 0.8294574 0.02914644 0.06519704
## 33 33 0.9226396 0.8294574 0.02914644 0.06519704
## 34 34 0.9226396 0.8294574 0.02914644 0.06519704
## 35 35 0.9209154 0.8254892 0.03021814 0.06754738
## 36 36 0.9208841 0.8254747 0.03260226 0.07267218
## 37 37 0.9226698 0.8297069 0.02901472 0.06420271
## 38 38 0.9226698 0.8297069 0.02901472 0.06420271
## 39 39 0.9244555 0.8333955 0.02616440 0.05834203
## 40 40 0.9244253 0.8331460 0.02631267 0.05945169
## 41 41 0.9261494 0.8369709 0.02610262 0.05940539
## 42 42 0.9209165 0.8248743 0.02655136 0.06125409
## 43 43 0.9243950 0.8330067 0.02368551 0.05410749
## 44 44 0.9209165 0.8248743 0.02655136 0.06125409
## 45 45 0.9226407 0.8286991 0.02659715 0.06178080
## 46 46 0.9226407 0.8286991 0.02659715 0.06178080
## 47 47 0.9191308 0.8207974 0.02521159 0.05828835
## 48 48 0.9173764 0.8171386 0.02208886 0.05184012
## 49 49 0.9191308 0.8207974 0.02521159 0.05828835
## 50 50 0.9191308 0.8207974 0.02521159 0.05828835
## 51 51 0.9191308 0.8207974 0.02521159 0.05828835
## 52 52 0.9191308 0.8207974 0.02521159 0.05828835
## 53 53 0.9191308 0.8207974 0.02521159 0.05828835
## 54 54 0.9173764 0.8170725 0.02499424 0.05790253
## 55 55 0.9173764 0.8167585 0.02759540 0.06312184
## 56 56 0.9155907 0.8126001 0.02736730 0.06267279
## 57 57 0.9155907 0.8126001 0.02736730 0.06267279
## 58 58 0.9173138 0.8163950 0.02787665 0.06298493
## 59 59 0.9173451 0.8169170 0.02769194 0.06230404
## 60 60 0.9172835 0.8165043 0.03031278 0.06887258
## 61 61 0.9155291 0.8127722 0.02637346 0.06063564
## 62 62 0.9190692 0.8207419 0.02798144 0.06335796
## 63 63 0.9138050 0.8089473 0.02580462 0.05896187
## 64 64 0.9138050 0.8089473 0.02580462 0.05896187
## 65 65 0.9138050 0.8089473 0.02580462 0.05896187
## 66 66 0.9155907 0.8131850 0.02331905 0.05308922
## 67 67 0.9138050 0.8089473 0.02580462 0.05896187
## 68 68 0.9172835 0.8165043 0.03031278 0.06887258
## 69 69 0.9155291 0.8127722 0.02637346 0.06063564
## 70 70 0.9155291 0.8127722 0.02637346 0.06063564

Report the optimal K value

print(paste("Optimal K value:", best_k))
## [1] "Optimal K value: 12"

The optimal value of k turned out to be 12, which achieved the highest cross-validated accuracy of about 94.0%. Using k = 12 on the held-out test set, the model classified 405 of 429 tumors correctly (about 94.4% accuracy), with only 4 malignant tumors misclassified as benign and 20 benign tumors misclassified as malignant. Accuracy fell to about 91.4% for several k values in the 60s: averaging over too many neighboring points under-fits the model by oversmoothing the decision boundary, which would lead to more misdiagnoses. Selecting k with cross-validation helps keep the number of false negatives and false positives to a minimum.