Hi, I’m Bright (that’s my name, not necessarily my description) and I’m on a personal machine learning journey, so I want to bring you along with me. I’ll be doing a series (à la Netflix); each series covers a branch of machine learning, starting with classification algorithms! The plan is simple: I learn, and then I share!
Also, use the code selector in the top right to show or hide the code. You can view ALL code, by setting the top right corner option to view all.
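OK, so firstly, what is k-NN? It stands for k-nearest neighbours. Here’s a very brief, simple explanation: it is a machine learning algorithm that classifies a new record by finding the k most similar records (its nearest neighbours) among examples whose labels are already known, and then taking a majority vote of those labels.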
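A minimal sketch of the idea in R, on made-up toy data. The function name `knn_by_hand` and all the values are my own illustrative assumptions, not part of any package: measure the distance from the new point to every known point, keep the k closest, and let their labels vote.

```r
# Minimal k-NN sketch: classify one new point by majority vote of its
# k nearest neighbours (Euclidean distance). Toy data, all values made up.
knn_by_hand <- function(train, labels, new_point, k) {
  # Euclidean distance from new_point to every training row
  distances <- sqrt(rowSums(sweep(train, 2, new_point)^2))
  # Labels of the k closest training rows
  nearest_labels <- labels[order(distances)[1:k]]
  # Majority vote: the most frequent label wins (first one on a tie)
  names(which.max(table(nearest_labels)))
}

toy_train <- matrix(c(1, 1,
                      1, 2,
                      2, 1,
                      8, 8,
                      8, 9,
                      9, 8), ncol = 2, byrow = TRUE)
toy_labels <- c("Benign", "Benign", "Benign",
                "Malignant", "Malignant", "Malignant")

# The new point sits next to the three "Benign" examples
knn_by_hand(toy_train, toy_labels, new_point = c(2, 2), k = 3)
```

There is no training step in the usual sense: the "model" is just the stored examples plus a distance rule.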
For more in-depth detail, I suggest you see the references and links section. Those are what I’ve used to gain an understanding of it too! I hope that my blog post gives you a practical, intuitive understanding of how it works and how to apply it to a good cause! So let’s jump in –>
Task: Detect cancerous cells from biopsied breast masses by using the k-NN classifier algorithm to accurately classify each biopsy record as ‘Malignant’ or ‘Benign’.
Data: The dataset for this exercise contains 569 records of fine needle aspirate (FNA) biopsies of breast masses. There are 32 features in total describing each record: 30 are characteristics of the cell nuclei, 1 is the label (Benign or Malignant) and 1 is the record ID.
Method: Application of a supervised machine learning algorithm called ‘K-Nearest Neighbour’.
Disclaimer: This exercise is based on amazing work done by Brett Lantz in his book, Machine Learning with R. I am recreating it, but in my own way, with my own version of code and explanations, using my own dialect and understanding of the technique.
In R, as with other statistical analysis programming languages, you have to load in the packages you need. As a wise person once said: We are standing on the shoulders of Giants - Thanks to the amazing R community for these super awesome packages.
library(tidyr)
library(dplyr)
library(tidyverse)
library(ggplot2)
library(lubridate)
library(skimr)
library(knitr)
library(readr)
library(class)
## Warning: package 'class' was built under R version 3.6.3
library(gmodels)
## Warning: package 'gmodels' was built under R version 3.6.3
I download the data from the required website into my working directory as a CSV file, then load it into my RStudio session as a data frame.
If you look closely at the glimpse of the dataset below, there are 569 observations (a.k.a rows, records or examples) and 32 variables (a.k.a features, dimensions, or characteristics).
The first two features are an id column identifying each biopsy record and the resulting diagnosis. The remaining 30 features actually work out as 3 different measurements (mean, se and worst) of the same 10 cell characteristics.
# Pull in the data from local folder
breast_cancer_biopsy_data <- read.csv("C:/Users/uduji00b/Desktop/Training & Development/ML/Chapter03/wisc_bc_data.csv", stringsAsFactors = FALSE)
# Have a glimpse of the dataset
glimpse(breast_cancer_biopsy_data)
## Observations: 569
## Variables: 32
## $ id <int> 87139402, 8910251, 905520, 868871, 9012568, 90653...
## $ diagnosis <chr> "B", "B", "B", "B", "B", "B", "B", "M", "B", "B",...
## $ radius_mean <dbl> 12.32, 10.60, 11.04, 11.28, 15.19, 11.57, 11.51, ...
## $ texture_mean <dbl> 12.39, 18.95, 16.83, 13.39, 13.21, 19.04, 23.93, ...
## $ perimeter_mean <dbl> 78.85, 69.28, 70.92, 73.00, 97.65, 74.20, 74.52, ...
## $ area_mean <dbl> 464.1, 346.4, 373.2, 384.8, 711.8, 409.7, 403.5, ...
## $ smoothness_mean <dbl> 0.10280, 0.09688, 0.10770, 0.11640, 0.07963, 0.08...
## $ compactness_mean <dbl> 0.06981, 0.11470, 0.07804, 0.11360, 0.06934, 0.07...
## $ concavity_mean <dbl> 0.039870, 0.063870, 0.030460, 0.046350, 0.033930,...
## $ points_mean <dbl> 0.037000, 0.026420, 0.024800, 0.047960, 0.026570,...
## $ symmetry_mean <dbl> 0.1959, 0.1922, 0.1714, 0.1771, 0.1721, 0.2031, 0...
## $ dimension_mean <dbl> 0.05955, 0.06491, 0.06340, 0.06072, 0.05544, 0.06...
## $ radius_se <dbl> 0.2360, 0.4505, 0.1967, 0.3384, 0.1783, 0.2864, 0...
## $ texture_se <dbl> 0.6656, 1.1970, 1.3870, 1.3430, 0.4125, 1.4400, 2...
## $ perimeter_se <dbl> 1.670, 3.430, 1.342, 1.851, 1.338, 2.206, 1.936, ...
## $ area_se <dbl> 17.43, 27.10, 13.54, 26.33, 17.72, 20.30, 16.97, ...
## $ smoothness_se <dbl> 0.008045, 0.007470, 0.005158, 0.011270, 0.005012,...
## $ compactness_se <dbl> 0.011800, 0.035810, 0.009355, 0.034980, 0.014850,...
## $ concavity_se <dbl> 0.016830, 0.033540, 0.010560, 0.021870, 0.015510,...
## $ points_se <dbl> 0.012410, 0.013650, 0.007483, 0.019650, 0.009155,...
## $ symmetry_se <dbl> 0.01924, 0.03504, 0.01718, 0.01580, 0.01647, 0.01...
## $ dimension_se <dbl> 0.002248, 0.003318, 0.002198, 0.003442, 0.001767,...
## $ radius_worst <dbl> 13.50, 11.88, 12.41, 11.92, 16.20, 13.07, 12.48, ...
## $ texture_worst <dbl> 15.64, 22.94, 26.44, 15.77, 15.73, 26.98, 37.16, ...
## $ perimeter_worst <dbl> 86.97, 78.28, 79.93, 76.53, 104.50, 86.43, 82.28,...
## $ area_worst <dbl> 549.1, 424.8, 471.4, 434.0, 819.1, 520.5, 474.2, ...
## $ smoothness_worst <dbl> 0.1385, 0.1213, 0.1369, 0.1367, 0.1126, 0.1249, 0...
## $ compactness_worst <dbl> 0.12660, 0.25150, 0.14820, 0.18220, 0.17370, 0.19...
## $ concavity_worst <dbl> 0.124200, 0.191600, 0.106700, 0.086690, 0.136200,...
## $ points_worst <dbl> 0.09391, 0.07926, 0.07431, 0.08611, 0.08178, 0.06...
## $ symmetry_worst <dbl> 0.2827, 0.2940, 0.2998, 0.2102, 0.2487, 0.3035, 0...
## $ dimension_worst <dbl> 0.06771, 0.07587, 0.07881, 0.06784, 0.06766, 0.08...
The id column could actually cause the model to overfit because, as is, it’s a nominal variable stored as a number of high magnitude, so it would weigh heavily on the distance calculation without carrying any real information. It is also not needed for nearest neighbour modelling. So, let’s deselect it.
breast_cancer_biopsy_data_no_id <- select(breast_cancer_biopsy_data, -id)
dim(breast_cancer_biopsy_data_no_id)
## [1] 569 31
Remember, the feature that labels each record as cancerous (malignant) or non-cancerous (benign) is the diagnosis variable.
Now, let’s make sure that the diagnosis feature is a factor-type variable with its levels correctly labelled. A factor tells R that the variable is categorical, meaning its values fall into a fixed set of groups (here, Benign and Malignant). The knn() function used later expects the class labels as a factor, and the descriptive labels make the results easier to communicate.
breast_cancer_biopsy_data_no_id <- breast_cancer_biopsy_data_no_id %>%
  mutate(diagnosis = factor(diagnosis,
                            levels = c("B", "M"),
                            labels = c("Benign", "Malignant")))
glimpse(breast_cancer_biopsy_data_no_id$diagnosis)
## Factor w/ 2 levels "Benign","Malignant": 1 1 1 1 1 1 1 2 1 1 ...
Here we see the total number of malignant and benign records in the data
table(breast_cancer_biopsy_data_no_id$diagnosis)
##
##    Benign Malignant
##       357       212
Let’s see the same as a proportion of the total
round(prop.table(table(breast_cancer_biopsy_data_no_id$diagnosis)) * 100, digits = 1)
##
##    Benign Malignant
##      62.7      37.3
It’s worth noting that the remaining 30 variables, everything other than the diagnosis, are all numeric. But how about the scale difference between them? Let’s have a look at some of them:
breast_cancer_biopsy_data_no_id %>%
  summarise(avg_radius_mean = mean(radius_mean),
            avg_radius_se = mean(radius_se),
            avg_radius_worst = mean(radius_worst))
##   avg_radius_mean avg_radius_se avg_radius_worst
## 1        14.12729     0.4051721         16.26919
Now, the distance calculation for k-NN is very reliant on the measurement scale of the inputs. For example, radius_worst will have a disproportionately higher impact on distance than radius_se, simply because its values are measured on a larger scale.
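To see how lopsided this can get, here is a small sketch with two hypothetical records. The numbers are made up for illustration and are not taken from the dataset. It splits the Euclidean distance into what each feature contributes.

```r
# Two hypothetical records measured on very different scales.
# The values are made up for illustration, not taken from the dataset.
record_a <- c(radius_se = 0.2, radius_worst = 12)
record_b <- c(radius_se = 0.9, radius_worst = 25)

# Squared difference contributed by each feature to the Euclidean distance
squared_differences <- (record_a - record_b)^2
squared_differences

# Share of the total squared distance that comes from each feature
round(squared_differences / sum(squared_differences), digits = 4)

# Overall Euclidean distance between the two records
sqrt(sum(squared_differences))
```

In this toy case almost all of the distance comes from radius_worst, so radius_se barely influences which neighbours count as "nearest".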
The solution to this is normalization (here, min-max normalization). This simply means rescaling every feature to a common range of values, in this case 0 to 1.
To normalize, you apply the following equation to each relevant feature: (x - min(x)) / (max(x) - min(x)).
# Create normalization function
normalization_function <- function(x) {
  return((x - min(x)) / (max(x) - min(x)))
}
# Test the function on varying scales
normalization_function(c(1, 2, 3, 4, 5))
## [1] 0.00 0.25 0.50 0.75 1.00
normalization_function(c(100, 200, 300, 400, 500))
## [1] 0.00 0.25 0.50 0.75 1.00
normalization_function(c(1000, 2000, 3000, 4000, 5000))
## [1] 0.00 0.25 0.50 0.75 1.00
So now, I remove the diagnosis column and then normalize all the features required for the model. I use mutate_all to apply the normalization function across all the necessary variables.
# Normalize the features required to train the model
breast_cancer_biopsy_data_no_id_normalized <- breast_cancer_biopsy_data_no_id %>%
  select(-diagnosis) %>%
  mutate_all(normalization_function)
# Test the normalization
summary(breast_cancer_biopsy_data_no_id_normalized$perimeter_mean)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##  0.0000  0.2168  0.2933  0.3329  0.4168  1.0000
summary(breast_cancer_biopsy_data_no_id_normalized$perimeter_se)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
## 0.00000 0.04000 0.07209 0.09938 0.12251 1.00000
Now that we’ve prepared the features, let’s separate the dataset into training and testing portions. Why? So that we can use the training portion to build the model and the testing portion to see how well the model does in terms of predictive accuracy.
# Create the training and test datasets of features (the diagnosis labels are handled separately below)
training_data_set <- breast_cancer_biopsy_data_no_id_normalized[1:450, ]
dim(training_data_set)
## [1] 450 30
testing_data_set <- breast_cancer_biopsy_data_no_id_normalized[451:569, ]
dim(testing_data_set)
## [1] 119 30
# Now create the training and test set labels
training_data_set_labels <- breast_cancer_biopsy_data_no_id[1:450, 1]
head(training_data_set_labels)
## [1] Benign Benign Benign Benign Benign Benign
## Levels: Benign Malignant
testing_data_set_labels <- breast_cancer_biopsy_data_no_id[451:569, 1]
head(testing_data_set_labels)
## [1] Benign    Malignant Malignant Benign    Benign    Benign
## Levels: Benign Malignant
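Taking the first 450 rows for training only works well if the rows are in no particular order. If that is ever in doubt, a random split is a safer habit. Here is a sketch on a toy data frame; `toy_data`, the seed and the 80/20 ratio are all illustrative assumptions of mine, not part of the original exercise.

```r
# A sketch of a random train/test split on a toy data frame.
# `toy_data`, the seed and the 80/20 ratio are illustrative assumptions.
set.seed(123)  # makes the random split reproducible

toy_data <- data.frame(
  diagnosis = factor(rep(c("Benign", "Malignant"), times = 10)),
  feature_1 = runif(20),
  feature_2 = runif(20)
)

# Randomly pick 80% of the row indices for training
training_rows <- sample(nrow(toy_data), size = floor(0.8 * nrow(toy_data)))

# Features (everything except the label column) for each portion
toy_training_set <- toy_data[training_rows, -1]
toy_testing_set  <- toy_data[-training_rows, -1]

# Matching labels, taken with the same row indices
toy_training_labels <- toy_data$diagnosis[training_rows]
toy_testing_labels  <- toy_data$diagnosis[-training_rows]

dim(toy_training_set)
dim(toy_testing_set)
```

The important detail is that features and labels are indexed with the same `training_rows`, so each record stays paired with its own label.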
We will use the knn() function from the class package. However, there are other k-nn packages and functions.
Since our training dataset has 450 examples, a rough starting point for k is the square root of the number of rows in our training dataset: approximately 21.
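That rule of thumb is easy to wrap in a tiny helper. `suggest_k` is a hypothetical function of my own, not something from the class package. It also nudges the result to an odd number, since an odd k cannot produce a tied vote between two classes.

```r
# Rule-of-thumb starting point for k: the square root of the number of
# training rows, nudged to an odd number so a two-class vote cannot tie.
# `suggest_k` is a hypothetical helper, not part of the class package.
suggest_k <- function(n_training_rows) {
  k <- floor(sqrt(n_training_rows))
  if (k %% 2 == 0) k <- k + 1  # make k odd
  k
}

suggest_k(450)
```

Treat the result as a starting point only; the better value of k is usually found by trying a few and comparing the results.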
# Apply the knn algorithm function to learn and model the data
breast_cancer_biopsy_data_predictions <- knn(train = training_data_set,
                                             test = testing_data_set,
                                             cl = training_data_set_labels,
                                             k = 21)
# The function returns a factor vector of its own classification predictions for the test dataset
glimpse(breast_cancer_biopsy_data_predictions)
## Factor w/ 2 levels "Benign","Malignant": 1 2 2 1 1 1 1 2 1 1 ...
Now let’s evaluate the model’s performance against the actual test data results. We’ll do this by comparing the classifications of the model to the test data classification labels.
# Use the table function to create a confusion matrix
confusion_matrix <- table(testing_data_set_labels, breast_cancer_biopsy_data_predictions)
confusion_matrix
##                        breast_cancer_biopsy_data_predictions
## testing_data_set_labels Benign Malignant
##               Benign        73         0
##               Malignant      2        44
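Let’s explain how the above CONFUSION MATRIX works: the rows are the actual labels of the test records, the columns are the model’s predictions, and the diagonal cells (73 and 44) count the records where the two agree.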
# Accuracy function that divides the correct predictions by total number of predictions
accuracy <- function(x) {
  acc <- round(sum(diag(x) / (sum(rowSums(x)))) * 100, digits = 1)
  return(acc)
}
accuracy(confusion_matrix)
## [1] 98.3
So, our model, despite a 98.3% accuracy, made 2 FALSE NEGATIVE predictions: two records that are actually malignant were classified as benign. This could be very dangerous (imagine telling two people their cells are benign when they are actually malignant!).
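Accuracy alone hides which kind of mistake was made, so it can help to pull the individual cells out of the matrix. This sketch rebuilds the matrix from the counts printed above (rows are actual labels, columns are predictions) and treats "Malignant" as the positive class. The object names are my own.

```r
# Rebuild the confusion matrix from the counts shown above
# (rows = actual labels, columns = predicted labels).
confusion_matrix_example <- matrix(
  c(73,  0,
     2, 44),
  nrow = 2, byrow = TRUE,
  dimnames = list(actual    = c("Benign", "Malignant"),
                  predicted = c("Benign", "Malignant"))
)

# Treating "Malignant" as the positive class:
true_negatives  <- confusion_matrix_example["Benign", "Benign"]
false_positives <- confusion_matrix_example["Benign", "Malignant"]
false_negatives <- confusion_matrix_example["Malignant", "Benign"]
true_positives  <- confusion_matrix_example["Malignant", "Malignant"]

# Sensitivity: share of actual malignant cases the model caught
sensitivity <- true_positives / (true_positives + false_negatives)

# Specificity: share of actual benign cases the model correctly cleared
specificity <- true_negatives / (true_negatives + false_positives)

round(c(sensitivity = sensitivity, specificity = specificity) * 100, digits = 1)
```

For a cancer screen, sensitivity is arguably the number to watch, because a missed malignant case is the costlier error.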
Now let’s try and see if we can reduce the number of wrongly predicted labels by trying two alterations to our model: rescaling with z-scores instead of min-max normalization, and trying different values of k.
The z-score, calculated as (x - mean(x)) / sd(x), rescales the records of feature x by measuring how many standard deviations each one is from x’s mean. Now let’s apply the z-score rescaling and re-apply the k-NN algorithm.
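Before running it on the real data, here is a quick sanity check in the same spirit as the normalization test earlier. `z_score_function` is an illustrative name of mine; base R’s scale() computes the same thing, just returned as a one-column matrix.

```r
# Hand-written z-score, mirroring the formula (x - mean(x)) / sd(x).
# `z_score_function` is an illustrative name, not part of the original code.
z_score_function <- function(x) {
  (x - mean(x)) / sd(x)
}

# Different scales, same z-scores
z_score_function(c(1, 2, 3, 4, 5))
z_score_function(c(100, 200, 300, 400, 500))

# Base R's scale() does the same calculation but returns a one-column
# matrix, so as.numeric() is used here to compare like with like
all.equal(z_score_function(c(1, 2, 3, 4, 5)),
          as.numeric(scale(c(1, 2, 3, 4, 5))))
```

Unlike min-max normalization, z-scores have no fixed minimum or maximum, so extreme values stay extreme instead of being squeezed into the 0 to 1 range.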
# Z-score rescale the features required to train the model
breast_cancer_biopsy_data_no_id_z_scored <- breast_cancer_biopsy_data_no_id %>%
select(-diagnosis) %>%
mutate_all(scale)
# Repeat data prep and model build with the z scored dataset
training_data_set_z_scored <- breast_cancer_biopsy_data_no_id_z_scored[1:450, ]
testing_data_set_z_scored <- breast_cancer_biopsy_data_no_id_z_scored[451:569, ]
# Labels dataset
training_data_set_labels_z_scored <- breast_cancer_biopsy_data_no_id[1:450, 1]
testing_data_set_labels_z_scored <- breast_cancer_biopsy_data_no_id[451:569, 1]
# Apply the knn algorithm function to learn and model the data
breast_cancer_biopsy_data_predictions_z_scored <- knn(train = training_data_set_z_scored,
test = testing_data_set_z_scored,
cl = training_data_set_labels_z_scored,
k = 21)
# Use the table function to create a confusion matrix
confusion_matrix_z_scored <- table(testing_data_set_labels_z_scored, breast_cancer_biopsy_data_predictions_z_scored)
confusion_matrix_z_scored
##                                 breast_cancer_biopsy_data_predictions_z_scored
## testing_data_set_labels_z_scored Benign Malignant
##                        Benign        73         0
##                        Malignant      5        41
# Apply accuracy function created earlier
accuracy(confusion_matrix_z_scored)
## [1] 95.8
As you can see, our accuracy actually dropped from 98.3% to 95.8% and the false negatives increased from 2 to 5. So, not a better option in this instance.
# Function for trying different values of k
k <- c(5, 11, 16, 21, 26, 31, 36)
different_k_results <- function(k) {
  breast_cancer_biopsy_data_predictions <- knn(train = training_data_set,
                                               test = testing_data_set,
                                               cl = training_data_set_labels,
                                               k = k)
  confusion_matrix <- table(testing_data_set_labels, breast_cancer_biopsy_data_predictions)
  acc <- round(sum(diag(confusion_matrix) / (sum(rowSums(confusion_matrix)))) * 100, digits = 1)
  return(acc)
}
# Accuracy results for different values of k
lapply(k, different_k_results)
## [[1]]
## [1] 98.3
##
## [[2]]
## [1] 97.5
##
## [[3]]
## [1] 97.5
##
## [[4]]
## [1] 98.3
##
## [[5]]
## [1] 97.5
##
## [[6]]
## [1] 96.6
##
## [[7]]
## [1] 96.6
The most accurate results were when k was 5 and 21, both at 98.3%.
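The list output above is a bit long to read. One option is sapply(), which returns a single named vector instead. Since the biopsy CSV can’t be assumed here, this sketch runs on R’s built-in iris data as a stand-in; the split, the seed and the k values are illustrative assumptions, and the accuracies it prints say nothing about the cancer data.

```r
library(class)  # provides knn()

# Toy stand-in for the biopsy data: the built-in iris dataset.
# The split, the seed and the k values below are illustrative assumptions.
set.seed(42)
training_rows <- sample(nrow(iris), 100)

toy_train        <- iris[training_rows, 1:4]
toy_test         <- iris[-training_rows, 1:4]
toy_train_labels <- iris$Species[training_rows]
toy_test_labels  <- iris$Species[-training_rows]

# Accuracy (in %) of knn() on the toy test set for a single value of k
accuracy_for_k <- function(k) {
  predictions <- knn(train = toy_train, test = toy_test,
                     cl = toy_train_labels, k = k)
  round(mean(predictions == toy_test_labels) * 100, digits = 1)
}

k_values <- c(5, 11, 21)

# sapply() returns one vector instead of a nested list
accuracy_by_k <- sapply(k_values, accuracy_for_k)
names(accuracy_by_k) <- paste0("k=", k_values)
accuracy_by_k
```

Keeping the k values as names on the result makes it harder to mix up which accuracy belongs to which k.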
In Closing
The difficulty with k-NN lies in explaining WHY. As you don’t really build a model, it’s hard to explain in any detail why an accuracy percentage or a prediction is what it is, other than by pointing to the choice of k, the distance method, or anomalies in the distance scale.
I do like the algorithm: it’s fun, intuitive, relatable to the real world, and relatively explainable. The challenges lie in:
Till next time, B