Research Question: How accurately can k-nearest neighbors (kNN) classify benign and malignant breast cancer tumors, and what is the optimal value of K for this task?
Introduction: Because breast cancer has a high mortality rate and is a leading cause of death among women worldwide, machine learning algorithms have received growing attention as a way to detect benign and malignant tumors early and reliably. With many physicians struggling to interpret mammogram results and radiologists misdiagnosing roughly 15% of their patients, such methods can support more efficient and accurate outcomes. Using the Breast Cancer Wisconsin (Diagnostic) dataset from the University of California, Irvine Machine Learning Repository, I applied k-fold cross-validation to find an optimal value of K, and then used that value as the K parameter of a kNN classifier to measure how accurately it distinguishes benign from malignant tumors.
Libraries
library(readr) # Importing the readr library for reading data
library(class) # Importing the class library for k-nearest neighbors algorithm
library(gmodels) # Importing the gmodels library for creating cross-tabulation tables
library(dplyr) # Importing the dplyr library for data manipulation
library(ggplot2) # Importing the ggplot2 library for plot visualization
library(caret) # Importing the 'caret' package for machine learning and data manipulation
Read data
df <- read.csv("C:/Users/sahas/OneDrive/Documents/ISLRData/data.csv")
Display the structure and summary of the data frame
str(df)
## 'data.frame': 569 obs. of 33 variables:
## $ id : int 842302 842517 84300903 84348301 84358402 843786 844359 84458202 844981 84501001 ...
## $ diagnosis : chr "M" "M" "M" "M" ...
## $ radius_mean : num 18 20.6 19.7 11.4 20.3 ...
## $ texture_mean : num 10.4 17.8 21.2 20.4 14.3 ...
## $ perimeter_mean : num 122.8 132.9 130 77.6 135.1 ...
## $ area_mean : num 1001 1326 1203 386 1297 ...
## $ smoothness_mean : num 0.1184 0.0847 0.1096 0.1425 0.1003 ...
## $ compactness_mean : num 0.2776 0.0786 0.1599 0.2839 0.1328 ...
## $ concavity_mean : num 0.3001 0.0869 0.1974 0.2414 0.198 ...
## $ concave.points_mean : num 0.1471 0.0702 0.1279 0.1052 0.1043 ...
## $ symmetry_mean : num 0.242 0.181 0.207 0.26 0.181 ...
## $ fractal_dimension_mean : num 0.0787 0.0567 0.06 0.0974 0.0588 ...
## $ radius_se : num 1.095 0.543 0.746 0.496 0.757 ...
## $ texture_se : num 0.905 0.734 0.787 1.156 0.781 ...
## $ perimeter_se : num 8.59 3.4 4.58 3.44 5.44 ...
## $ area_se : num 153.4 74.1 94 27.2 94.4 ...
## $ smoothness_se : num 0.0064 0.00522 0.00615 0.00911 0.01149 ...
## $ compactness_se : num 0.049 0.0131 0.0401 0.0746 0.0246 ...
## $ concavity_se : num 0.0537 0.0186 0.0383 0.0566 0.0569 ...
## $ concave.points_se : num 0.0159 0.0134 0.0206 0.0187 0.0188 ...
## $ symmetry_se : num 0.03 0.0139 0.0225 0.0596 0.0176 ...
## $ fractal_dimension_se : num 0.00619 0.00353 0.00457 0.00921 0.00511 ...
## $ radius_worst : num 25.4 25 23.6 14.9 22.5 ...
## $ texture_worst : num 17.3 23.4 25.5 26.5 16.7 ...
## $ perimeter_worst : num 184.6 158.8 152.5 98.9 152.2 ...
## $ area_worst : num 2019 1956 1709 568 1575 ...
## $ smoothness_worst : num 0.162 0.124 0.144 0.21 0.137 ...
## $ compactness_worst : num 0.666 0.187 0.424 0.866 0.205 ...
## $ concavity_worst : num 0.712 0.242 0.45 0.687 0.4 ...
## $ concave.points_worst : num 0.265 0.186 0.243 0.258 0.163 ...
## $ symmetry_worst : num 0.46 0.275 0.361 0.664 0.236 ...
## $ fractal_dimension_worst: num 0.1189 0.089 0.0876 0.173 0.0768 ...
## $ X : logi NA NA NA NA NA NA ...
# Remove unnecessary columns from the data frame
df <- select(df, -id, -X)
# Create a table showing the count of each unique value in the 'diagnosis' column
table(df$diagnosis)
##
## B M
## 357 212
# Divide the B and M counts by the total and express them as rounded percentages
round(prop.table(table(df$diagnosis)) * 100, digits = 1)
##
## B M
## 62.7 37.3
Check for any missing/NULL values in data
sum(is.na(df))
## [1] 0
Display first 6 rows of data
head(df)
## diagnosis radius_mean texture_mean perimeter_mean area_mean smoothness_mean
## 1 M 17.99 10.38 122.80 1001.0 0.11840
## 2 M 20.57 17.77 132.90 1326.0 0.08474
## 3 M 19.69 21.25 130.00 1203.0 0.10960
## 4 M 11.42 20.38 77.58 386.1 0.14250
## 5 M 20.29 14.34 135.10 1297.0 0.10030
## 6 M 12.45 15.70 82.57 477.1 0.12780
## compactness_mean concavity_mean concave.points_mean symmetry_mean
## 1 0.27760 0.3001 0.14710 0.2419
## 2 0.07864 0.0869 0.07017 0.1812
## 3 0.15990 0.1974 0.12790 0.2069
## 4 0.28390 0.2414 0.10520 0.2597
## 5 0.13280 0.1980 0.10430 0.1809
## 6 0.17000 0.1578 0.08089 0.2087
## fractal_dimension_mean radius_se texture_se perimeter_se area_se
## 1 0.07871 1.0950 0.9053 8.589 153.40
## 2 0.05667 0.5435 0.7339 3.398 74.08
## 3 0.05999 0.7456 0.7869 4.585 94.03
## 4 0.09744 0.4956 1.1560 3.445 27.23
## 5 0.05883 0.7572 0.7813 5.438 94.44
## 6 0.07613 0.3345 0.8902 2.217 27.19
## smoothness_se compactness_se concavity_se concave.points_se symmetry_se
## 1 0.006399 0.04904 0.05373 0.01587 0.03003
## 2 0.005225 0.01308 0.01860 0.01340 0.01389
## 3 0.006150 0.04006 0.03832 0.02058 0.02250
## 4 0.009110 0.07458 0.05661 0.01867 0.05963
## 5 0.011490 0.02461 0.05688 0.01885 0.01756
## 6 0.007510 0.03345 0.03672 0.01137 0.02165
## fractal_dimension_se radius_worst texture_worst perimeter_worst area_worst
## 1 0.006193 25.38 17.33 184.60 2019.0
## 2 0.003532 24.99 23.41 158.80 1956.0
## 3 0.004571 23.57 25.53 152.50 1709.0
## 4 0.009208 14.91 26.50 98.87 567.7
## 5 0.005115 22.54 16.67 152.20 1575.0
## 6 0.005082 15.47 23.75 103.40 741.6
## smoothness_worst compactness_worst concavity_worst concave.points_worst
## 1 0.1622 0.6656 0.7119 0.2654
## 2 0.1238 0.1866 0.2416 0.1860
## 3 0.1444 0.4245 0.4504 0.2430
## 4 0.2098 0.8663 0.6869 0.2575
## 5 0.1374 0.2050 0.4000 0.1625
## 6 0.1791 0.5249 0.5355 0.1741
## symmetry_worst fractal_dimension_worst
## 1 0.4601 0.11890
## 2 0.2750 0.08902
## 3 0.3613 0.08758
## 4 0.6638 0.17300
## 5 0.2364 0.07678
## 6 0.3985 0.12440
# Define a function to normalize a vector
normalize <- function(x) {
  return((x - min(x)) / (max(x) - min(x)))
}
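As a quick illustration of what this min-max normalization does, applying it to a small toy vector (chosen purely for demonstration) maps the minimum to 0, the maximum to 1, and the remaining values proportionally in between:
normalize(c(10, 20, 30, 40, 50))   # returns 0.00 0.25 0.50 0.75 1.00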
# Apply the normalization function to all columns except 'diagnosis' in the data frame
new_df <- as.data.frame(lapply(select(df, -diagnosis), normalize))
# Display summary statistics of selected columns in the new data frame
summary(select(new_df, radius_mean, smoothness_mean))
## radius_mean smoothness_mean
## Min. :0.0000 Min. :0.0000
## 1st Qu.:0.2233 1st Qu.:0.3046
## Median :0.3024 Median :0.3904
## Mean :0.3382 Mean :0.3948
## 3rd Qu.:0.4164 3rd Qu.:0.4755
## Max. :1.0000 Max. :1.0000
Split the data frame into training and test sets
df_train <- new_df[1:140, ]        # first 140 rows (~25% of the data) used for training
df_test <- new_df[141:569, ]       # remaining 429 rows (~75%) used for testing
df_train_labels <- df[1:140, 1]    # 'diagnosis' (first column) labels for the training rows
df_test_labels <- df[141:569, 1]   # 'diagnosis' labels for the test rows
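Because this split simply takes the first 140 rows, it depends on the original row order rather than on a random draw. A stratified random split is usually safer; a minimal sketch using caret's createDataPartition is shown below (the names train_idx, df_train_r, and so on are hypothetical and are not used in the rest of this report).
set.seed(123)                                        # make the split reproducible
train_idx <- createDataPartition(df$diagnosis, p = 0.75, list = FALSE)  # ~75% of each class
df_train_r <- new_df[train_idx, ]                    # training features
df_test_r  <- new_df[-train_idx, ]                   # test features
train_labels_r <- df$diagnosis[train_idx]            # training labels
test_labels_r  <- df$diagnosis[-train_idx]           # test labels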
# Apply k-nearest neighbors algorithm to predict the labels for the test set
df_test_pred <- knn(train = df_train, test = df_test, cl = df_train_labels, k = 12)
# Create a cross-tabulation table for the predicted labels and true labels
ct <- CrossTable(x = df_test_labels, y = df_test_pred, prop.chisq = FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 429
##
##
## | df_test_pred
## df_test_labels | B | M | Row Total |
## ---------------|-----------|-----------|-----------|
## B | 278 | 20 | 298 |
## | 0.933 | 0.067 | 0.695 |
## | 0.986 | 0.136 | |
## | 0.648 | 0.047 | |
## ---------------|-----------|-----------|-----------|
## M | 4 | 127 | 131 |
## | 0.031 | 0.969 | 0.305 |
## | 0.014 | 0.864 | |
## | 0.009 | 0.296 | |
## ---------------|-----------|-----------|-----------|
## Column Total | 282 | 147 | 429 |
## | 0.657 | 0.343 | |
## ---------------|-----------|-----------|-----------|
##
##
ct
## $t
## y
## x B M
## B 278 20
## M 4 127
##
## $prop.row
## y
## x B M
## B 0.93288591 0.06711409
## M 0.03053435 0.96946565
##
## $prop.col
## y
## x B M
## B 0.9858156 0.1360544
## M 0.0141844 0.8639456
##
## $prop.tbl
## y
## x B M
## B 0.648018648 0.046620047
## M 0.009324009 0.296037296
Calculate the accuracy by dividing the number of correctly predicted labels (the diagonal of the cross-tabulation above) by the total number of test observations
(278 + 127) / 429
## [1] 0.9440559
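Rather than typing the cell counts by hand, the same accuracy can be computed directly from the prediction vector, which avoids copy-and-paste errors. A minimal sketch using the objects created above; it should agree with the value just computed (about 0.944):
# Proportion of test predictions that match the true labels
mean(df_test_pred == df_test_labels)
# caret's confusionMatrix reports the same counts along with sensitivity and specificity
confusionMatrix(df_test_pred, factor(df_test_labels))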
# Define a function to map 'M' to 1 and 'B' to 0
diagnosis_value <- function(diagnosis) {
  if (diagnosis == 'M') {
    return(1)
  } else {
    return(0)
  }
}
# Apply the diagnosis_value function to the 'diagnosis' column
temp <- sapply(df$diagnosis, diagnosis_value)
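The same mapping can be written as a single vectorized call, which is faster and more idiomatic in R; a small alternative sketch that produces the same temp vector:
# ifelse operates on the whole column at once: 1 for malignant, 0 for benign
temp <- ifelse(df$diagnosis == 'M', 1, 0)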
Create a scatter plot with regression line for ‘radius_mean’ and ‘texture_mean’ columns
ggplot(df, aes(x = radius_mean, y = texture_mean, color = diagnosis)) +
geom_point() +
geom_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'
Create a scatter plot for ‘smoothness_mean’ and ‘compactness_mean’ columns
ggplot(df, aes(x = smoothness_mean, y = compactness_mean, color = diagnosis)) +
geom_point()
K-Fold Cross-Validation:
wbcd = read.csv("C:/Users/sahas/OneDrive/Documents/ISLRData/data.csv", stringsAsFactors = FALSE)
# Removing the 'X' column from the 'wbcd' dataset
wbcd$X = NULL
# Training a k-nearest neighbors (knn) model using the 'train' function from 'caret' package
fit <- train(
  diagnosis ~ .,                                          # predict diagnosis from all other columns
  data = wbcd[, -1],                                      # drop the 'id' column
  method = "knn",                                         # k-nearest neighbors
  tuneGrid = expand.grid(k = 1:70),                       # candidate values of k
  trControl = trainControl(method = "cv", number = 10),   # 10-fold cross-validation
  metric = "Accuracy"
)
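Note that this caret model is tuned on the raw feature values, whereas the manual kNN above used min-max-normalized features. If the same scaling were wanted inside cross-validation, caret's preProcess argument supports it. The sketch below only illustrates that option and is not the configuration that produced the results reported here (fit_scaled is a hypothetical name):
fit_scaled <- train(
  diagnosis ~ .,
  data = wbcd[, -1],
  method = "knn",
  preProcess = "range",                                   # min-max scale each predictor within each fold
  tuneGrid = expand.grid(k = 1:70),
  trControl = trainControl(method = "cv", number = 10),
  metric = "Accuracy"
)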
Retrieve the results
results <- fit$results
best_k <- fit$bestTune$k
Print the results table
print(results)
## k Accuracy Kappa AccuracySD KappaSD
## 1 1 0.9173743 0.8214028 0.02350543 0.05054478
## 2 2 0.9174682 0.8222554 0.01993043 0.04301568
## 3 3 0.9351072 0.8592993 0.02429340 0.05314472
## 4 4 0.9316286 0.8528427 0.02749441 0.05802215
## 5 5 0.9367081 0.8639830 0.02548378 0.05305371
## 6 6 0.9332307 0.8557280 0.02331774 0.05023147
## 7 7 0.9349851 0.8594529 0.02219954 0.04741365
## 8 8 0.9350153 0.8596740 0.02355979 0.04994585
## 9 9 0.9349851 0.8593845 0.02369001 0.05102852
## 10 10 0.9332296 0.8552975 0.02316747 0.04981634
## 11 11 0.9349235 0.8590273 0.02067371 0.04448934
## 12 12 0.9402482 0.8702771 0.01897607 0.04049362
## 13 13 0.9367081 0.8625168 0.02242255 0.04839577
## 14 14 0.9315055 0.8509312 0.02788767 0.06115760
## 15 15 0.9279341 0.8429443 0.02805059 0.06116059
## 16 16 0.9279654 0.8432029 0.02393962 0.05266600
## 17 17 0.9279654 0.8428436 0.02537651 0.05587704
## 18 18 0.9262412 0.8390187 0.02572809 0.05637077
## 19 19 0.9315055 0.8502005 0.02521901 0.05606449
## 20 20 0.9315055 0.8502005 0.02521901 0.05606449
## 21 21 0.9332599 0.8539212 0.02287859 0.05072240
## 22 22 0.9314742 0.8499285 0.02395668 0.05280239
## 23 23 0.9297198 0.8462078 0.02606779 0.05766754
## 24 24 0.9279654 0.8421618 0.02532790 0.05616518
## 25 25 0.9279654 0.8421618 0.02532790 0.05616518
## 26 26 0.9279654 0.8421618 0.02532790 0.05616518
## 27 27 0.9262110 0.8380312 0.02578823 0.05759923
## 28 28 0.9279654 0.8417520 0.02393962 0.05336501
## 29 29 0.9261797 0.8376818 0.02595091 0.05763294
## 30 30 0.9279038 0.8416499 0.02428290 0.05387065
## 31 31 0.9261797 0.8378250 0.02464530 0.05428923
## 32 32 0.9226396 0.8294574 0.02914644 0.06519704
## 33 33 0.9226396 0.8294574 0.02914644 0.06519704
## 34 34 0.9226396 0.8294574 0.02914644 0.06519704
## 35 35 0.9209154 0.8254892 0.03021814 0.06754738
## 36 36 0.9208841 0.8254747 0.03260226 0.07267218
## 37 37 0.9226698 0.8297069 0.02901472 0.06420271
## 38 38 0.9226698 0.8297069 0.02901472 0.06420271
## 39 39 0.9244555 0.8333955 0.02616440 0.05834203
## 40 40 0.9244253 0.8331460 0.02631267 0.05945169
## 41 41 0.9261494 0.8369709 0.02610262 0.05940539
## 42 42 0.9209165 0.8248743 0.02655136 0.06125409
## 43 43 0.9243950 0.8330067 0.02368551 0.05410749
## 44 44 0.9209165 0.8248743 0.02655136 0.06125409
## 45 45 0.9226407 0.8286991 0.02659715 0.06178080
## 46 46 0.9226407 0.8286991 0.02659715 0.06178080
## 47 47 0.9191308 0.8207974 0.02521159 0.05828835
## 48 48 0.9173764 0.8171386 0.02208886 0.05184012
## 49 49 0.9191308 0.8207974 0.02521159 0.05828835
## 50 50 0.9191308 0.8207974 0.02521159 0.05828835
## 51 51 0.9191308 0.8207974 0.02521159 0.05828835
## 52 52 0.9191308 0.8207974 0.02521159 0.05828835
## 53 53 0.9191308 0.8207974 0.02521159 0.05828835
## 54 54 0.9173764 0.8170725 0.02499424 0.05790253
## 55 55 0.9173764 0.8167585 0.02759540 0.06312184
## 56 56 0.9155907 0.8126001 0.02736730 0.06267279
## 57 57 0.9155907 0.8126001 0.02736730 0.06267279
## 58 58 0.9173138 0.8163950 0.02787665 0.06298493
## 59 59 0.9173451 0.8169170 0.02769194 0.06230404
## 60 60 0.9172835 0.8165043 0.03031278 0.06887258
## 61 61 0.9155291 0.8127722 0.02637346 0.06063564
## 62 62 0.9190692 0.8207419 0.02798144 0.06335796
## 63 63 0.9138050 0.8089473 0.02580462 0.05896187
## 64 64 0.9138050 0.8089473 0.02580462 0.05896187
## 65 65 0.9138050 0.8089473 0.02580462 0.05896187
## 66 66 0.9155907 0.8131850 0.02331905 0.05308922
## 67 67 0.9138050 0.8089473 0.02580462 0.05896187
## 68 68 0.9172835 0.8165043 0.03031278 0.06887258
## 69 69 0.9155291 0.8127722 0.02637346 0.06063564
## 70 70 0.9155291 0.8127722 0.02637346 0.06063564
Report the optimal K value
print(paste("Optimal K value:", best_k))
## [1] "Optimal K value: 12"
The optimal value of k turned out to be 12, which achieved the highest cross-validated accuracy in the grid, roughly 94.0%. Accuracy declined as k grew large: the lowest values in the table, about 91.4%, occurred for k around 63 to 67. Setting k that high under-fits the model by averaging over too many neighboring points, which blurs the boundary between benign and malignant cases and increases the chance of misdiagnosis; choosing k by cross-validation helps keep the number of false negatives and false positives to a minimum. With k = 12, the kNN classifier correctly labeled about 94.4% of the held-out test tumors, misclassifying only 4 of the 131 malignant cases.