Chapter 3: Classification using Nearest Neighbors

Example: Classifying Cancer Samples

Step 2: Exploring and preparing the data

getwd()

## [1] "C:/Users/m0961401/Downloads"

# import the CSV file
wbcd <- read.csv("wisc_bc_data.csv", stringsAsFactors = FALSE)

# examine the structure of the wbcd data frame
str(wbcd)

## 'data.frame':    569 obs. of  32 variables:
##  $ id               : int  87139402 8910251 905520 868871 9012568 906539 925291 87880 862989 89827 ...
##  $ diagnosis        : chr  "B" "B" "B" "B" ...
##  $ radius_mean      : num  12.3 10.6 11 11.3 15.2 ...
##  $ texture_mean     : num  12.4 18.9 16.8 13.4 13.2 ...
##  $ perimeter_mean   : num  78.8 69.3 70.9 73 97.7 ...
##  $ area_mean        : num  464 346 373 385 712 ...
##  $ smoothness_mean  : num  0.1028 0.0969 0.1077 0.1164 0.0796 ...
##  $ compactness_mean : num  0.0698 0.1147 0.078 0.1136 0.0693 ...
##  $ concavity_mean   : num  0.0399 0.0639 0.0305 0.0464 0.0339 ...
##  $ points_mean      : num  0.037 0.0264 0.0248 0.048 0.0266 ...
##  $ symmetry_mean    : num  0.196 0.192 0.171 0.177 0.172 ...
##  $ dimension_mean   : num  0.0595 0.0649 0.0634 0.0607 0.0554 ...
##  $ radius_se        : num  0.236 0.451 0.197 0.338 0.178 ...
##  $ texture_se       : num  0.666 1.197 1.387 1.343 0.412 ...
##  $ perimeter_se     : num  1.67 3.43 1.34 1.85 1.34 ...
##  $ area_se          : num  17.4 27.1 13.5 26.3 17.7 ...
##  $ smoothness_se    : num  0.00805 0.00747 0.00516 0.01127 0.00501 ...
##  $ compactness_se   : num  0.0118 0.03581 0.00936 0.03498 0.01485 ...
##  $ concavity_se     : num  0.0168 0.0335 0.0106 0.0219 0.0155 ...
##  $ points_se        : num  0.01241 0.01365 0.00748 0.01965 0.00915 ...
##  $ symmetry_se      : num  0.0192 0.035 0.0172 0.0158 0.0165 ...
##  $ dimension_se     : num  0.00225 0.00332 0.0022 0.00344 0.00177 ...
##  $ radius_worst     : num  13.5 11.9 12.4 11.9 16.2 ...
##  $ texture_worst    : num  15.6 22.9 26.4 15.8 15.7 ...
##  $ perimeter_worst  : num  87 78.3 79.9 76.5 104.5 ...
##  $ area_worst       : num  549 425 471 434 819 ...
##  $ smoothness_worst : num  0.139 0.121 0.137 0.137 0.113 ...
##  $ compactness_worst: num  0.127 0.252 0.148 0.182 0.174 ...
##  $ concavity_worst  : num  0.1242 0.1916 0.1067 0.0867 0.1362 ...
##  $ points_worst     : num  0.0939 0.0793 0.0743 0.0861 0.0818 ...
##  $ symmetry_worst   : num  0.283 0.294 0.3 0.21 0.249 ...
##  $ dimension_worst  : num  0.0677 0.0759 0.0788 0.0678 0.0677 ...

# drop the id feature
wbcd <- wbcd[-1]

# table of diagnosis
table(wbcd$diagnosis)

## 
##   B   M 
## 357 212

# The diagnosis is our target variable: we want to predict whether a tumor is benign or malignant from the other features. The resulting model is meant to help diagnose cancer.

# recode diagnosis as a factor
wbcd$diagnosis <- factor(wbcd$diagnosis, levels = c("B", "M"),
                         labels = c("Benign", "Malignant"))

# table of proportions with more informative labels
round(prop.table(table(wbcd$diagnosis)) * 100, digits = 1)

## 
##    Benign Malignant 
##      62.7      37.3

About 62.7% of the samples in our data set are benign and 37.3% are malignant, so the two classes are moderately imbalanced.

# summarize three numeric features
summary(wbcd[c("radius_mean", "area_mean", "smoothness_mean")])

##   radius_mean       area_mean      smoothness_mean  
##  Min.   : 6.981   Min.   : 143.5   Min.   :0.05263  
##  1st Qu.:11.700   1st Qu.: 420.3   1st Qu.:0.08637  
##  Median :13.370   Median : 551.1   Median :0.09587  
##  Mean   :14.127   Mean   : 654.9   Mean   :0.09636  
##  3rd Qu.:15.780   3rd Qu.: 782.7   3rd Qu.:0.10530  
##  Max.   :28.110   Max.   :2501.0   Max.   :0.16340

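Assuming that the measurements are in mm, the mean radius of a sample is about 14.13 mm. The mean area is about 654.9 (mm² under the same assumption), and the mean smoothness is about 0.096 (not sure about the units). Note how different the ranges are: area_mean spans 143.5 to 2501.0, while smoothness_mean spans only 0.05 to 0.16.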

Based on my limited domain knowledge in oncology, the more texture that there is in a biopsy, the higher the chances that the tumor may be cancerous.

# create normalization function
normalize <- function(x) {
  return ((x - min(x)) / (max(x) - min(x)))
}

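We normalize so that every feature lies in the range 0 to 1. k-NN relies on distances between samples, so without rescaling, features with large ranges (such as area_mean) would dominate features with small ranges (such as smoothness_mean). Note that min-max normalization does not reduce the effect of outliers: an extreme value still sets the minimum or maximum and compresses the remaining values.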
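To illustrate why rescaling matters for a distance-based method such as k-NN, here is a minimal sketch with three made-up samples. The values and the names `samples` and `min_max` are assumptions for illustration only; `min_max` uses the same formula as normalize() above.

```r
# Hypothetical samples: the columns loosely mimic area_mean and smoothness_mean
samples <- data.frame(area       = c(400, 800, 410),
                      smoothness = c(0.08, 0.09, 0.16))

# Min-max normalization, same formula as normalize()
min_max <- function(x) (x - min(x)) / (max(x) - min(x))

# Euclidean distances between samples on the raw scale
raw_dist <- as.matrix(dist(samples))

# Distances after rescaling every column to the range 0 to 1
scaled <- as.data.frame(lapply(samples, min_max))
scaled_dist <- as.matrix(dist(scaled))

# Distance from sample 1 to samples 2 and 3, before and after rescaling
raw_dist[1, c(2, 3)]
scaled_dist[1, c(2, 3)]
```

On the raw scale, sample 3 looks far closer to sample 1 than sample 2 does, because the area column drives the distance. After rescaling, the large smoothness gap between samples 1 and 3 counts as much as the area gap between samples 1 and 2.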

# test normalization function - result should be identical
normalize(c(1, 2, 3, 4, 5))

## [1] 0.00 0.25 0.50 0.75 1.00

normalize(c(10, 20, 30, 40, 50))

## [1] 0.00 0.25 0.50 0.75 1.00

# normalize the wbcd data
wbcd_n <- as.data.frame(lapply(wbcd[2:31], normalize))
wbcd_n

# Notice that all of the values are now in the range 0 to 1

# confirm that normalization worked
summary(wbcd_n$area_mean)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.1174  0.1729  0.2169  0.2711  1.0000

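The normalized mean of area_mean is about 0.22. Normalized values are unitless, so this is a position within the observed range (0 = smallest area, 1 = largest area), not a measurement in mm.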

# create training and test data
wbcd_train <- wbcd_n[1:456, ] # this is 80% of the data
wbcd_test <- wbcd_n[457:569, ] # 20% test size

# create labels for training and test data

wbcd_train_labels <- wbcd[1:456, 1] 
wbcd_test_labels <- wbcd[457:569, 1]
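The split above takes the first 456 rows as training data, which is only safe if the rows are already in random order. If that is in doubt, a shuffled split is a common alternative. The sketch below uses a small made-up data frame; the seed, the 80% fraction and the names `toy` and `train_idx` are assumptions for illustration, not part of the original analysis.

```r
set.seed(123)  # arbitrary seed, chosen only to make the example repeatable

# Made-up stand-in for a feature table with a label column
toy <- data.frame(x = 1:10, label = rep(c("B", "M"), 5))

# Sample 80% of the row indices without replacement for training
n <- nrow(toy)
train_idx <- sample(n, size = floor(0.8 * n))

toy_train <- toy[train_idx, ]   # rows drawn for training
toy_test  <- toy[-train_idx, ]  # all remaining rows

nrow(toy_train)  # 8
nrow(toy_test)   # 2
```

The same indices would have to be used for both the features and the labels so that they stay aligned.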

Step 3: Training a model on the data

# load the "class" library
library(class)

wbcd_test_pred <- knn(train = wbcd_train, test = wbcd_test,
                      cl = wbcd_train_labels, k = 21) # use 21 neighbors to classify each test point

Step 4: Evaluating model performance

# load the "gmodels" library
library(gmodels)

## Warning: package 'gmodels' was built under R version 4.2.3

# Create the cross tabulation of predicted vs. actual
CrossTable(x = wbcd_test_labels, y = wbcd_test_pred,
           prop.chisq = FALSE)

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  113 
## 
##  
##                  | wbcd_test_pred 
## wbcd_test_labels |    Benign | Malignant | Row Total | 
## -----------------|-----------|-----------|-----------|
##           Benign |        69 |         0 |        69 | 
##                  |     1.000 |     0.000 |     0.611 | 
##                  |     0.972 |     0.000 |           | 
##                  |     0.611 |     0.000 |           | 
## -----------------|-----------|-----------|-----------|
##        Malignant |         2 |        42 |        44 | 
##                  |     0.045 |     0.955 |     0.389 | 
##                  |     0.028 |     1.000 |           | 
##                  |     0.018 |     0.372 |           | 
## -----------------|-----------|-----------|-----------|
##     Column Total |        71 |        42 |       113 | 
##                  |     0.628 |     0.372 |           | 
## -----------------|-----------|-----------|-----------|
## 
##

Here we should focus on the false negatives: tumors predicted as benign that were actually malignant. There are 2 of them, which is only 1.8% of the 113 test samples.
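The share of false negatives can be recomputed directly from the counts in the table. This is a small sketch that types the four counts in by hand; the object name `conf` is an assumption for illustration.

```r
# Counts copied from the CrossTable output above (rows = actual, columns = predicted)
conf <- matrix(c(69,  0,
                  2, 42),
               nrow = 2, byrow = TRUE,
               dimnames = list(actual    = c("Benign", "Malignant"),
                               predicted = c("Benign", "Malignant")))

false_neg <- conf["Malignant", "Benign"]  # malignant cases predicted as benign
false_pos <- conf["Benign", "Malignant"]  # benign cases predicted as malignant

# Share of all test cases that are false negatives
round(false_neg / sum(conf) * 100, 1)        # 1.8

# Overall accuracy: correct predictions over all test cases
round(sum(diag(conf)) / sum(conf) * 100, 1)  # 98.2
```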

Step 5: Improving model performance

# use the scale() function to z-score standardize a data frame
wbcd_z <- as.data.frame(scale(wbcd[-1]))

# confirm that the transformation was applied correctly
summary(wbcd_z$area_mean)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -1.4532 -0.6666 -0.2949  0.0000  0.3632  5.2459

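The z-score standardization rescales each feature to a mean of 0 and a standard deviation of 1. Unlike min-max normalization, the values are not bounded: here area_mean ranges from about -1.45 to 5.25.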
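As a quick check of what scale() does, the z-score can be computed by hand and compared. The sketch below uses five values taken from the area_mean summary earlier in this notebook (minimum, quartiles and maximum); the names `z_manual` and `z_scale` are assumptions for illustration.

```r
# Five values from the area_mean summary, used only to illustrate the formula
x <- c(143.5, 420.3, 551.1, 782.7, 2501.0)

# z-score: subtract the mean, then divide by the sample standard deviation
z_manual <- (x - mean(x)) / sd(x)

# scale() applies the same transformation; as.numeric() drops its matrix attributes
z_scale <- as.numeric(scale(x))

all.equal(z_manual, z_scale)  # TRUE
```

After the transformation, the values have a mean of 0 and a standard deviation of 1, up to floating-point error.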

# create training and test datasets
wbcd_train <- wbcd_z[1:456, ]
wbcd_test <- wbcd_z[457:569, ]

# re-classify test cases
wbcd_test_pred <- knn(train = wbcd_train, test = wbcd_test, cl = wbcd_train_labels, k = 20)

# Create the cross tabulation of predicted vs. actual
CrossTable(x = wbcd_test_labels, y = wbcd_test_pred,
           prop.chisq = FALSE)

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  113 
## 
##  
##                  | wbcd_test_pred 
## wbcd_test_labels |    Benign | Malignant | Row Total | 
## -----------------|-----------|-----------|-----------|
##           Benign |        69 |         0 |        69 | 
##                  |     1.000 |     0.000 |     0.611 | 
##                  |     0.932 |     0.000 |           | 
##                  |     0.611 |     0.000 |           | 
## -----------------|-----------|-----------|-----------|
##        Malignant |         5 |        39 |        44 | 
##                  |     0.114 |     0.886 |     0.389 | 
##                  |     0.068 |     1.000 |           | 
##                  |     0.044 |     0.345 |           | 
## -----------------|-----------|-----------|-----------|
##     Column Total |        74 |        39 |       113 | 
##                  |     0.655 |     0.345 |           | 
## -----------------|-----------|-----------|-----------|
## 
##

# try several different values of k
wbcd_train <- wbcd_n[1:456, ]
wbcd_test <- wbcd_n[457:569, ]

# go back to the min-max normalized data and change the hyper-parameter k from 21 to 1
#K=1
wbcd_test_pred <- knn(train = wbcd_train, test = wbcd_test, cl = wbcd_train_labels, k=1)
CrossTable(x = wbcd_test_labels, y = wbcd_test_pred, prop.chisq=FALSE)

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  113 
## 
##  
##                  | wbcd_test_pred 
## wbcd_test_labels |    Benign | Malignant | Row Total | 
## -----------------|-----------|-----------|-----------|
##           Benign |        66 |         3 |        69 | 
##                  |     0.957 |     0.043 |     0.611 | 
##                  |     0.985 |     0.065 |           | 
##                  |     0.584 |     0.027 |           | 
## -----------------|-----------|-----------|-----------|
##        Malignant |         1 |        43 |        44 | 
##                  |     0.023 |     0.977 |     0.389 | 
##                  |     0.015 |     0.935 |           | 
##                  |     0.009 |     0.381 |           | 
## -----------------|-----------|-----------|-----------|
##     Column Total |        67 |        46 |       113 | 
##                  |     0.593 |     0.407 |           | 
## -----------------|-----------|-----------|-----------|
## 
##

With k=1, the number of false negatives decreases to 1, but 3 benign samples are now misclassified as malignant (false positives).

# We try a k of 5 to compare to 1
#k=5 
wbcd_test_pred <- knn(train = wbcd_train, test = wbcd_test, cl = wbcd_train_labels, k=5)
CrossTable(x = wbcd_test_labels, y = wbcd_test_pred, prop.chisq=FALSE)

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  113 
## 
##  
##                  | wbcd_test_pred 
## wbcd_test_labels |    Benign | Malignant | Row Total | 
## -----------------|-----------|-----------|-----------|
##           Benign |        69 |         0 |        69 | 
##                  |     1.000 |     0.000 |     0.611 | 
##                  |     0.972 |     0.000 |           | 
##                  |     0.611 |     0.000 |           | 
## -----------------|-----------|-----------|-----------|
##        Malignant |         2 |        42 |        44 | 
##                  |     0.045 |     0.955 |     0.389 | 
##                  |     0.028 |     1.000 |           | 
##                  |     0.018 |     0.372 |           | 
## -----------------|-----------|-----------|-----------|
##     Column Total |        71 |        42 |       113 | 
##                  |     0.628 |     0.372 |           | 
## -----------------|-----------|-----------|-----------|
## 
##

With k=5 there are 2 false negatives and no false positives. When we increase k to 11, the number of false negatives increases to 3, as the next table shows.

# Now we train the model using 11 neighbors
wbcd_test_pred <- knn(train = wbcd_train, test = wbcd_test, cl = wbcd_train_labels, k=11)
CrossTable(x = wbcd_test_labels, y = wbcd_test_pred, prop.chisq=FALSE)

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  113 
## 
##  
##                  | wbcd_test_pred 
## wbcd_test_labels |    Benign | Malignant | Row Total | 
## -----------------|-----------|-----------|-----------|
##           Benign |        69 |         0 |        69 | 
##                  |     1.000 |     0.000 |     0.611 | 
##                  |     0.958 |     0.000 |           | 
##                  |     0.611 |     0.000 |           | 
## -----------------|-----------|-----------|-----------|
##        Malignant |         3 |        41 |        44 | 
##                  |     0.068 |     0.932 |     0.389 | 
##                  |     0.042 |     1.000 |           | 
##                  |     0.027 |     0.363 |           | 
## -----------------|-----------|-----------|-----------|
##     Column Total |        72 |        41 |       113 | 
##                  |     0.637 |     0.363 |           | 
## -----------------|-----------|-----------|-----------|
## 
##

# Now we train the model using 15 neighbors
wbcd_test_pred <- knn(train = wbcd_train, test = wbcd_test, cl = wbcd_train_labels, k=15)
CrossTable(x = wbcd_test_labels, y = wbcd_test_pred, prop.chisq=FALSE)

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  113 
## 
##  
##                  | wbcd_test_pred 
## wbcd_test_labels |    Benign | Malignant | Row Total | 
## -----------------|-----------|-----------|-----------|
##           Benign |        69 |         0 |        69 | 
##                  |     1.000 |     0.000 |     0.611 | 
##                  |     0.958 |     0.000 |           | 
##                  |     0.611 |     0.000 |           | 
## -----------------|-----------|-----------|-----------|
##        Malignant |         3 |        41 |        44 | 
##                  |     0.068 |     0.932 |     0.389 | 
##                  |     0.042 |     1.000 |           | 
##                  |     0.027 |     0.363 |           | 
## -----------------|-----------|-----------|-----------|
##     Column Total |        72 |        41 |       113 | 
##                  |     0.637 |     0.363 |           | 
## -----------------|-----------|-----------|-----------|
## 
##

# the number of false negatives is still 3, which is 2.7% of the test set

wbcd_test_pred <- knn(train = wbcd_train, test = wbcd_test, cl = wbcd_train_labels, k=21)
CrossTable(x = wbcd_test_labels, y = wbcd_test_pred, prop.chisq=FALSE)

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  113 
## 
##  
##                  | wbcd_test_pred 
## wbcd_test_labels |    Benign | Malignant | Row Total | 
## -----------------|-----------|-----------|-----------|
##           Benign |        69 |         0 |        69 | 
##                  |     1.000 |     0.000 |     0.611 | 
##                  |     0.972 |     0.000 |           | 
##                  |     0.611 |     0.000 |           | 
## -----------------|-----------|-----------|-----------|
##        Malignant |         2 |        42 |        44 | 
##                  |     0.045 |     0.955 |     0.389 | 
##                  |     0.028 |     1.000 |           | 
##                  |     0.018 |     0.372 |           | 
## -----------------|-----------|-----------|-----------|
##     Column Total |        71 |        42 |       113 | 
##                  |     0.628 |     0.372 |           | 
## -----------------|-----------|-----------|-----------|
## 
##

wbcd_test_pred <- knn(train = wbcd_train, test = wbcd_test, cl = wbcd_train_labels, k=27)
CrossTable(x = wbcd_test_labels, y = wbcd_test_pred, prop.chisq=FALSE)

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  113 
## 
##  
##                  | wbcd_test_pred 
## wbcd_test_labels |    Benign | Malignant | Row Total | 
## -----------------|-----------|-----------|-----------|
##           Benign |        69 |         0 |        69 | 
##                  |     1.000 |     0.000 |     0.611 | 
##                  |     0.945 |     0.000 |           | 
##                  |     0.611 |     0.000 |           | 
## -----------------|-----------|-----------|-----------|
##        Malignant |         4 |        40 |        44 | 
##                  |     0.091 |     0.909 |     0.389 | 
##                  |     0.055 |     1.000 |           | 
##                  |     0.035 |     0.354 |           | 
## -----------------|-----------|-----------|-----------|
##     Column Total |        73 |        40 |       113 | 
##                  |     0.646 |     0.354 |           | 
## -----------------|-----------|-----------|-----------|
## 
##

# when we use 27 neighbors, the number of false negatives increases to 4.
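To compare the runs side by side, the error counts from the tables above can be collected in one data frame. This is a small sketch that types the counts in by hand from the CrossTable outputs for the min-max normalized data; the names `results` and `n_test` are assumptions for illustration.

```r
# Error counts read off the CrossTable outputs above (min-max normalized data)
results <- data.frame(
  k         = c(1, 5, 11, 15, 21, 27),
  false_neg = c(1, 2, 3, 3, 2, 4),  # malignant predicted as benign
  false_pos = c(3, 0, 0, 0, 0, 0)   # benign predicted as malignant
)

n_test <- 113  # size of the test set

# Total misclassifications and overall accuracy (in percent) for each k
results$errors   <- results$false_neg + results$false_pos
results$accuracy <- round((n_test - results$errors) / n_test * 100, 1)

results
```

In this table, k=1 has the fewest false negatives, while k=5 and k=21 have the fewest errors overall. Which row is preferable depends on how the two kinds of error are weighed.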

In conclusion: the best values of k for our tumor data set are k=1 and k=5. On this test split, smaller values of k generally gave fewer false negatives (1 at k=1, 2 at k=5 and k=21, 3 at k=11 and k=15, and 4 at k=27), although the trend is not strictly monotonic. False negatives are the patients who were diagnosed as “cancer free” when in reality they had the disease.

In these two cases, treating Benign as the positive class, the recall (the share of actual benign samples that were predicted as benign) is:

#k=1 is = 66/(66+3) = 95.7%, which is an acceptable level in the health care industry
#k=5 is = 69/(69+0) = 100%

I remain skeptical about the k=5 result, since no model is 100% accurate in general, so I would move forward with the model that uses k=1 to predict the diagnosis of a patient.

The precision (the share of benign predictions that were actually benign) in these cases is:

#k=1 is = 66/(66+1) = 98.5%
#k=5 is = 69/(69+2) = 97.2%

Once again, based on the precision and the recall, the ideal number for k is 1.
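The same counts can also be read with Malignant as the positive class, which is a common convention when the goal is to catch cancer cases. The sketch below is one way to do that; the helper name `pr_metrics` is hypothetical, and the counts are copied from the k=1 and k=5 tables above.

```r
# Precision and recall for a chosen positive class, computed from raw counts
pr_metrics <- function(tp, fp, fn) {
  c(precision = tp / (tp + fp),  # of the predicted positives, share that is correct
    recall    = tp / (tp + fn))  # of the actual positives, share that is found
}

# Counts from the k = 1 and k = 5 tables, with Malignant as the positive class
k1 <- pr_metrics(tp = 43, fp = 3, fn = 1)
k5 <- pr_metrics(tp = 42, fp = 0, fn = 2)

# One row per model, values in percent
round(rbind(k1, k5) * 100, 1)
```

Under this convention, k=1 finds 43 of the 44 malignant cases (recall of about 97.7%) against 42 of 44 for k=5 (about 95.5%), at the cost of 3 false alarms.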
