Introduction

Hi, I’m Bright (that’s my name, not necessarily my description) and I’m on a personal machine learning journey, so I want to bring you along with me. I’ll be doing a series (a la Netflix style); each season covers a branch of machine learning, starting with Classification algorithms! The plan is simple: I learn and then I share!


Ok so firstly, what is k-NN? It stands for K-Nearest Neighbours. Here’s a very brief, simple explanation. It is a machine learning algorithm that basically:

  1. Takes a new, unlabelled example that you want to classify,
  2. Calculates the distance between that example and every example in your training dataset, across each feature,
  3. Says “hey, which example(s), based on all these distance measurements, are the closest to me?” (k is the number of neighbours you want it to compare against),
  4. Then takes a majority vote among those k neighbours and voila!, you have a classification label for that example (see the tiny sketch just after this list).
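
To make that concrete, here’s a toy, from-scratch sketch of the idea. The points, labels and value of k below are entirely made up for illustration; later on we’ll use the proper knn() function from the class package instead.

# Toy training data: two numeric features and two classes (made-up values)
toy_train <- data.frame(x = c(1, 2, 8, 9), y = c(1, 2, 8, 9))
toy_labels <- c("A", "A", "B", "B")
new_point <- c(x = 1.5, y = 1.5)

# Step 2: Euclidean distance from the new example to every training example
distances <- sqrt((toy_train$x - new_point["x"])^2 + (toy_train$y - new_point["y"])^2)

# Steps 3 & 4: find the k nearest neighbours and take a majority vote
toy_k <- 3
nearest_neighbours <- order(distances)[1:toy_k]
names(which.max(table(toy_labels[nearest_neighbours])))
## [1] "A"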

For more in-depth detail, I suggest you see the references and links section. Those are what I’ve used to gain an understanding of it too! I hope that my blog post gives you a practical, intuitive understanding of how it works and how to apply it to a good cause! So let’s jump in –>

STEP 1: Understanding the Task

Task Detect cancerous cells from biopsied breast masses by using the k-NN classifier algorithm to accurately classify each biopsy record as ‘Malignant’ or ‘Benign’.

Data The dataset for this exercise contains 569 records of fine needle aspirate (FNA) biopsies of breast masses. There are 32 features in total describing each record: 30 are characteristics of the cell nuclei, 1 is the label of Benign or Malignant and 1 is the record ID.

Method Application of a supervised machine learning algorithm called ‘K-Nearest Neighbours’.

Disclaimer This exercise is based on amazing work done by Brett Lantz in his book, Machine Learning with R. I am recreating it, but in my own way, with my own version of the code and explanations, using my own dialect and understanding of the technique.

STEP 2: Load required libraries

In R, as with other statistical analysis programming languages, you have to load in the packages you need. As a wise person once said, we are standing on the shoulders of giants. Thanks to the amazing R community for these super awesome packages.

library(tidyr)
library(dplyr)
library(tidyverse)
library(ggplot2)
library(lubridate)
library(skimr)
library(knitr)
library(readr)
library(class) # provides the knn() function we use to train the model
## Warning: package 'class' was built under R version 3.6.3
library(gmodels)
## Warning: package 'gmodels' was built under R version 3.6.3

STEP 3: Connect and explore data

I download the data from the source website into my working directory as a CSV, then load it into my RStudio session as a dataframe.

If you look closely at the glimpse of the dataset below, there are 569 observations (a.k.a. rows, records or examples) and 32 variables (a.k.a. features, dimensions, or characteristics).

The first two features are an id column identifying each biopsied cell and the resultant diagnosis. The remaining 30 features actually work out as 3 different measurements (mean, se and worst) of the same 10 cell characteristics, as the little sketch after the glimpse below confirms.

# Pull in the data from local folder
breast_cancer_biopsy_data <- read.csv("C:/Users/uduji00b/Desktop/Training & Development/ML/Chapter03/wisc_bc_data.csv", stringsAsFactors = FALSE)

# Have a glimpse of the dataset
glimpse(breast_cancer_biopsy_data)
## Observations: 569
## Variables: 32
## $ id                <int> 87139402, 8910251, 905520, 868871, 9012568, 90653...
## $ diagnosis         <chr> "B", "B", "B", "B", "B", "B", "B", "M", "B", "B",...
## $ radius_mean       <dbl> 12.32, 10.60, 11.04, 11.28, 15.19, 11.57, 11.51, ...
## $ texture_mean      <dbl> 12.39, 18.95, 16.83, 13.39, 13.21, 19.04, 23.93, ...
## $ perimeter_mean    <dbl> 78.85, 69.28, 70.92, 73.00, 97.65, 74.20, 74.52, ...
## $ area_mean         <dbl> 464.1, 346.4, 373.2, 384.8, 711.8, 409.7, 403.5, ...
## $ smoothness_mean   <dbl> 0.10280, 0.09688, 0.10770, 0.11640, 0.07963, 0.08...
## $ compactness_mean  <dbl> 0.06981, 0.11470, 0.07804, 0.11360, 0.06934, 0.07...
## $ concavity_mean    <dbl> 0.039870, 0.063870, 0.030460, 0.046350, 0.033930,...
## $ points_mean       <dbl> 0.037000, 0.026420, 0.024800, 0.047960, 0.026570,...
## $ symmetry_mean     <dbl> 0.1959, 0.1922, 0.1714, 0.1771, 0.1721, 0.2031, 0...
## $ dimension_mean    <dbl> 0.05955, 0.06491, 0.06340, 0.06072, 0.05544, 0.06...
## $ radius_se         <dbl> 0.2360, 0.4505, 0.1967, 0.3384, 0.1783, 0.2864, 0...
## $ texture_se        <dbl> 0.6656, 1.1970, 1.3870, 1.3430, 0.4125, 1.4400, 2...
## $ perimeter_se      <dbl> 1.670, 3.430, 1.342, 1.851, 1.338, 2.206, 1.936, ...
## $ area_se           <dbl> 17.43, 27.10, 13.54, 26.33, 17.72, 20.30, 16.97, ...
## $ smoothness_se     <dbl> 0.008045, 0.007470, 0.005158, 0.011270, 0.005012,...
## $ compactness_se    <dbl> 0.011800, 0.035810, 0.009355, 0.034980, 0.014850,...
## $ concavity_se      <dbl> 0.016830, 0.033540, 0.010560, 0.021870, 0.015510,...
## $ points_se         <dbl> 0.012410, 0.013650, 0.007483, 0.019650, 0.009155,...
## $ symmetry_se       <dbl> 0.01924, 0.03504, 0.01718, 0.01580, 0.01647, 0.01...
## $ dimension_se      <dbl> 0.002248, 0.003318, 0.002198, 0.003442, 0.001767,...
## $ radius_worst      <dbl> 13.50, 11.88, 12.41, 11.92, 16.20, 13.07, 12.48, ...
## $ texture_worst     <dbl> 15.64, 22.94, 26.44, 15.77, 15.73, 26.98, 37.16, ...
## $ perimeter_worst   <dbl> 86.97, 78.28, 79.93, 76.53, 104.50, 86.43, 82.28,...
## $ area_worst        <dbl> 549.1, 424.8, 471.4, 434.0, 819.1, 520.5, 474.2, ...
## $ smoothness_worst  <dbl> 0.1385, 0.1213, 0.1369, 0.1367, 0.1126, 0.1249, 0...
## $ compactness_worst <dbl> 0.12660, 0.25150, 0.14820, 0.18220, 0.17370, 0.19...
## $ concavity_worst   <dbl> 0.124200, 0.191600, 0.106700, 0.086690, 0.136200,...
## $ points_worst      <dbl> 0.09391, 0.07926, 0.07431, 0.08611, 0.08178, 0.06...
## $ symmetry_worst    <dbl> 0.2827, 0.2940, 0.2998, 0.2102, 0.2487, 0.3035, 0...
## $ dimension_worst   <dbl> 0.06771, 0.07587, 0.07881, 0.06784, 0.06766, 0.08...
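
As a quick sanity check on that naming pattern, here’s a small sketch that strips the _mean/_se/_worst suffixes from the measurement columns; it should leave exactly the 10 underlying cell nuclei characteristics.

# Drop the id and diagnosis columns, then strip the measurement suffixes.
# Expect the 10 characteristics: radius, texture, perimeter, area, smoothness,
# compactness, concavity, points, symmetry and dimension
measurement_columns <- setdiff(names(breast_cancer_biopsy_data), c("id", "diagnosis"))
unique(sub("_(mean|se|worst)$", "", measurement_columns))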

Making sure the label column is factored correctly

The id column could actually cause the model to overfit because, as is, it’s a nominal numeric variable of high magnitude that uniquely identifies each record. It is also not needed for the nearest neighbour modelling. So, let’s deselect it.

breast_cancer_biopsy_data_no_id <- select(breast_cancer_biopsy_data, -id)
dim(breast_cancer_biopsy_data_no_id)
## [1] 569  31

Remember, the feature that labels all the records in terms of whether they are cancerous (malignant) or non-cancerous (benign) is the diagnosis variable.

Now, let’s make sure that the diagnosis feature is a factor-type variable with its levels correctly labelled. This means that the different options within this variable have an order/level or grouping to them. This is advised for better accuracy and ease of communication.

breast_cancer_biopsy_data_no_id <- breast_cancer_biopsy_data_no_id %>% mutate(diagnosis = factor(diagnosis, 
                                                    levels = c("B", "M"),
                                                    labels = c("Benign", "Malignant")))
glimpse(breast_cancer_biopsy_data_no_id$diagnosis)
##  Factor w/ 2 levels "Benign","Malignant": 1 1 1 1 1 1 1 2 1 1 ...

Here we see the total number of malignant and benign records in the data

table(breast_cancer_biopsy_data_no_id$diagnosis)
## 
##    Benign Malignant 
##       357       212

Let’s see the same as a proportion of the total

round(prop.table(table(breast_cancer_biopsy_data_no_id$diagnosis)) * 100, digits = 1)
## 
##    Benign Malignant 
##      62.7      37.3

STEP 4: Tidy and Prep Data

It’s worth noting that the remaining 30 variables, after the first 2, are all numeric. But how about the scale difference between them? Let’s have a look at some of them:

breast_cancer_biopsy_data_no_id %>% summarise(avg_radius_mean = mean(radius_mean),
                                              avg_radius_se = mean(radius_se),
                                              avg_radius_worst = mean(radius_worst))
##   avg_radius_mean avg_radius_se avg_radius_worst
## 1        14.12729     0.4051721         16.26919

Now, the distance calculation for k-NN is very reliant on the measurement scale of the inputs. For example, radius_worst will have a disproportionately higher impact on distance compared to radius_se, purely because of the difference in measurement scale.

The solution to this is Normalization. This simply means rescaling to a common range of values.

To normalize, you use the following equation: (x - min(x)) / (max(x) - min(x)) across all relevant features.

# Create normalization function
normalization_function <- function(x) {
  return((x - min(x)) / (max(x) - min(x)))
}

# Test the function on varying scales 
normalization_function(c(1, 2, 3, 4, 5))
## [1] 0.00 0.25 0.50 0.75 1.00
normalization_function(c(100, 200, 300, 400, 500))
## [1] 0.00 0.25 0.50 0.75 1.00
normalization_function(c(1000, 2000, 3000, 4000, 5000))
## [1] 0.00 0.25 0.50 0.75 1.00

So now, I remove the diagnosis column and then normalize all features required for the model. I use mutate_all to apply the normalization function across all the necessary variables.

# Normalize the features required to train the model
breast_cancer_biopsy_data_no_id_normalized <- breast_cancer_biopsy_data_no_id %>% 
  select(-diagnosis) %>%
    mutate_all(normalization_function)

# Test the normalization
summary(breast_cancer_biopsy_data_no_id_normalized$perimeter_mean) 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.2168  0.2933  0.3329  0.4168  1.0000
summary(breast_cancer_biopsy_data_no_id_normalized$perimeter_se)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.04000 0.07209 0.09938 0.12251 1.00000

Create Training and Test Sets

Now that we’ve prepared the features, let’s separate the dataset into training and testing portions. Why? To use the training portion to build the model and the testing portion to see how well the model does in terms of predictive accuracy.

# Split the normalized features into training and test sets
training_data_set <- breast_cancer_biopsy_data_no_id_normalized[1:450, ]
dim(training_data_set)
## [1] 450  30
testing_data_set <- breast_cancer_biopsy_data_no_id_normalized[451:569, ]
dim(testing_data_set)
## [1] 119  30
# Now create the training and test set labels
training_data_set_labels <- breast_cancer_biopsy_data_no_id[1:450, 1]
head(training_data_set_labels)
## [1] Benign Benign Benign Benign Benign Benign
## Levels: Benign Malignant
testing_data_set_labels <- breast_cancer_biopsy_data_no_id[451:569, 1]
head(testing_data_set_labels)
## [1] Benign    Malignant Malignant Benign    Benign    Benign   
## Levels: Benign Malignant

STEP 5: Train Model

We will use the knn() function from the class package. However, there are other k-NN packages and functions.

Since our training dataset has 450 examples, a rough rule of thumb for k is the square root of the number of rows in the training dataset, which works out to about 21.
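
Here’s that rule of thumb checked in code, using the training set we created above:

# Square root of the training set size, as a starting point for k
sqrt(nrow(training_data_set))
## [1] 21.2132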

# Apply the knn algorithm function to learn and model the data
breast_cancer_biopsy_data_predictions <- knn(train = training_data_set,
                                      test = testing_data_set,
                                      cl = training_data_set_labels,
                                      k = 21)

# The function returns a vector of its own classification predictions for the test dataset
glimpse(breast_cancer_biopsy_data_predictions)
##  Factor w/ 2 levels "Benign","Malignant": 1 2 2 1 1 1 1 2 1 1 ...

STEP 6: Evaluate Model

Now let’s evaluate the model performance against the actual test data results. We’ll do this by comparing the model’s classifications to the test data classification labels.

# Use the table function to create a confusion matrix
confusion_matrix <- table(testing_data_set_labels, breast_cancer_biopsy_data_predictions)
confusion_matrix
##                        breast_cancer_biopsy_data_predictions
## testing_data_set_labels Benign Malignant
##               Benign        73         0
##               Malignant      2        44

Let’s explain how the above CONFUSION MATRIX works:

  • The top LEFT value indicates True Negatives, i.e. benign masses correctly predicted as benign.
  • The bottom LEFT value indicates False Negatives, i.e. malignant masses wrongly predicted as benign (REALLY DANGEROUS).
  • The top RIGHT value indicates False Positives, i.e. benign masses wrongly flagged as malignant (A BIT LESS DANGEROUS).
  • The bottom RIGHT value indicates True Positives, i.e. malignant masses correctly predicted as malignant.
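
To make those positions concrete, here’s a small sketch that pulls each cell out of our confusion matrix by name; it relies on the Benign-first, Malignant-second ordering shown above, and the counts in the comments are from our matrix.

# Rows are the actual labels, columns are the model's predictions
true_negatives  <- confusion_matrix["Benign", "Benign"]       # 73 benign, correctly predicted
false_positives <- confusion_matrix["Benign", "Malignant"]    # 0 benign, wrongly flagged malignant
false_negatives <- confusion_matrix["Malignant", "Benign"]    # 2 malignant, missed (dangerous!)
true_positives  <- confusion_matrix["Malignant", "Malignant"] # 44 malignant, correctly predicted
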
# Accuracy function: correct predictions (the diagonal) divided by the total number of predictions
accuracy <- function(x){
  round(sum(diag(x)) / sum(x) * 100, digits = 1)
}
accuracy(confusion_matrix)
## [1] 98.3

So, our model, despite a 98.3% accuracy, made 2 FALSE NEGATIVE predictions. This could be very dangerous (imagine telling two people their malignant, cancerous masses are benign!).

STEP 7: Boost Model Performance

Now let’s try and see if we can reduce the number of wrongly predicted labels by trying two alterations to our model:

  1. Using z-score transformation to rescale rather than normalization
  2. Trying different values of nearest neighbours (k)

Z-score transformation

The z-score, calculated as (x - mean(x)) / sd(x), rescales the records of feature x by measuring how many standard deviations each one sits from x’s mean. Now let’s apply the z-score rescaling and re-apply the k-NN algorithm.
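
Before rebuilding the model, here’s a quick sanity check (a small sketch) that R’s built-in scale() function, with its defaults, matches the hand-rolled z-score formula:

# scale() centres by the mean and divides by the standard deviation by default
z_score_function <- function(x) (x - mean(x)) / sd(x)
all.equal(as.vector(scale(breast_cancer_biopsy_data_no_id$radius_mean)),
          z_score_function(breast_cancer_biopsy_data_no_id$radius_mean))
## [1] TRUE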

# Z-score rescale the features required to train the model
breast_cancer_biopsy_data_no_id_z_scored <- breast_cancer_biopsy_data_no_id %>% 
  select(-diagnosis) %>%
    mutate_all(scale) 

# Repeat data prep and model build with the z scored dataset
training_data_set_z_scored <- breast_cancer_biopsy_data_no_id_z_scored[1:450, ]
testing_data_set_z_scored <- breast_cancer_biopsy_data_no_id_z_scored[451:569, ]

# Labels dataset
training_data_set_labels_z_scored <- breast_cancer_biopsy_data_no_id[1:450, 1]
testing_data_set_labels_z_scored <- breast_cancer_biopsy_data_no_id[451:569, 1]

# Apply the knn algorithm function to learn and model the data
breast_cancer_biopsy_data_predictions_z_scored <- knn(train = training_data_set_z_scored,
                                      test = testing_data_set_z_scored,
                                      cl = training_data_set_labels_z_scored,
                                      k = 21)

# Use the table function to create a confusion matrix
confusion_matrix_z_scored <- table(testing_data_set_labels_z_scored, breast_cancer_biopsy_data_predictions_z_scored)
confusion_matrix_z_scored
##                                 breast_cancer_biopsy_data_predictions_z_scored
## testing_data_set_labels_z_scored Benign Malignant
##                        Benign        73         0
##                        Malignant      5        41
# Apply accuracy function created earlier
accuracy(confusion_matrix_z_scored)
## [1] 95.8

As you can see, our accuracy actually dropped from 98.3% to 95.8% and the false negatives increased from 2 to 5. So, not a better option in this instance.

Try different values of k

# Candidate values of k to try
k <- c(5, 11, 16, 21, 26, 31, 36)

# Function that re-runs knn() for a given k and returns the resulting accuracy
different_k_results <- function(k){
  predictions <- knn(train = training_data_set,
                     test = testing_data_set,
                     cl = training_data_set_labels,
                     k = k)
  accuracy(table(testing_data_set_labels, predictions))
}

# Accuracy results for different values of k
lapply(k, different_k_results)
## [[1]]
## [1] 98.3
## 
## [[2]]
## [1] 97.5
## 
## [[3]]
## [1] 97.5
## 
## [[4]]
## [1] 98.3
## 
## [[5]]
## [1] 97.5
## 
## [[6]]
## [1] 96.6
## 
## [[7]]
## [1] 96.6
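
A slightly tidier way to view those results (a small sketch; note that knn() breaks distance ties at random, so exact accuracies can vary a little between runs):

# Pair each accuracy with its k so the results are easier to scan
setNames(sapply(k, different_k_results), paste0("k = ", k))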

The most accurate results were when k was 5 or 21.

Conclusion

The difficulty with k-NN lies in explaining WHY. As you don’t really build a model, it’s hard to explain in much detail why an accuracy percentage or a prediction is what it is, beyond pointing to the choice of k, the distance method, or scale anomalies in the distances.

I do like the algorithm, it’s fun, intuitive, real world relatable, and relatively explainable. The challenges lie in:

  • Choosing and explaining the varying results for different values of k.
  • Applying a dummy matrix to character variables.
  • Choosing the right feature scaling methodology.
  • Choosing the right distance measure (this exercise uses the Euclidean distance).

Till next time, B