Introduction

This example is from the book Machine Learning with R by Brett Lantz, Chapter 3.

A link to the book: https://bit.ly/3gsf2e0

This project is for educational purposes only.

The aim is to diagnose breast cancer using the k-NN algorithm. Classification is based on the diagnosis column, which records the diagnosis of breast tissue with abnormal lumps and has two values: M for malignant and B for benign. Malignant is the bad one!

Required packages

The class and gmodels packages are required.
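
If they are not installed yet, a one-time install with install.packages() takes care of that (assuming an internet connection):

install.packages(c("class", "gmodels"))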

library(class)
library(gmodels)

Step 1 - collecting data

The dataset is the Wisconsin Breast Cancer Diagnostic dataset, sourced from the UCI Machine Learning Repository.

Step 2 - exploring and preparing the data

#reading data 
wbcd <- read.csv("wisc_bc_data.csv", stringsAsFactors = FALSE)

#check data structure
str(wbcd)
## 'data.frame':    569 obs. of  32 variables:
##  $ id               : int  87139402 8910251 905520 868871 9012568 906539 925291 87880 862989 89827 ...
##  $ diagnosis        : chr  "B" "B" "B" "B" ...
##  $ radius_mean      : num  12.3 10.6 11 11.3 15.2 ...
##  $ texture_mean     : num  12.4 18.9 16.8 13.4 13.2 ...
##  $ perimeter_mean   : num  78.8 69.3 70.9 73 97.7 ...
##  $ area_mean        : num  464 346 373 385 712 ...
##  $ smoothness_mean  : num  0.1028 0.0969 0.1077 0.1164 0.0796 ...
##  $ compactness_mean : num  0.0698 0.1147 0.078 0.1136 0.0693 ...
##  $ concavity_mean   : num  0.0399 0.0639 0.0305 0.0464 0.0339 ...
##  $ points_mean      : num  0.037 0.0264 0.0248 0.048 0.0266 ...
##  $ symmetry_mean    : num  0.196 0.192 0.171 0.177 0.172 ...
##  $ dimension_mean   : num  0.0595 0.0649 0.0634 0.0607 0.0554 ...
##  $ radius_se        : num  0.236 0.451 0.197 0.338 0.178 ...
##  $ texture_se       : num  0.666 1.197 1.387 1.343 0.412 ...
##  $ perimeter_se     : num  1.67 3.43 1.34 1.85 1.34 ...
##  $ area_se          : num  17.4 27.1 13.5 26.3 17.7 ...
##  $ smoothness_se    : num  0.00805 0.00747 0.00516 0.01127 0.00501 ...
##  $ compactness_se   : num  0.0118 0.03581 0.00936 0.03498 0.01485 ...
##  $ concavity_se     : num  0.0168 0.0335 0.0106 0.0219 0.0155 ...
##  $ points_se        : num  0.01241 0.01365 0.00748 0.01965 0.00915 ...
##  $ symmetry_se      : num  0.0192 0.035 0.0172 0.0158 0.0165 ...
##  $ dimension_se     : num  0.00225 0.00332 0.0022 0.00344 0.00177 ...
##  $ radius_worst     : num  13.5 11.9 12.4 11.9 16.2 ...
##  $ texture_worst    : num  15.6 22.9 26.4 15.8 15.7 ...
##  $ perimeter_worst  : num  87 78.3 79.9 76.5 104.5 ...
##  $ area_worst       : num  549 425 471 434 819 ...
##  $ smoothness_worst : num  0.139 0.121 0.137 0.137 0.113 ...
##  $ compactness_worst: num  0.127 0.252 0.148 0.182 0.174 ...
##  $ concavity_worst  : num  0.1242 0.1916 0.1067 0.0867 0.1362 ...
##  $ points_worst     : num  0.0939 0.0793 0.0743 0.0861 0.0818 ...
##  $ symmetry_worst   : num  0.283 0.294 0.3 0.21 0.249 ...
##  $ dimension_worst  : num  0.0677 0.0759 0.0788 0.0678 0.0677 ...

The first feature is a unique ID, which has no relation to the classification problem, so we will remove it.

#Removing the ID column from the dataset
wbcd <- wbcd[-1]

The second feature is diagnosis; it is the outcome of the classification/prediction process.

#check the diagnosis column
table(wbcd$diagnosis)
## 
##   B   M 
## 357 212

As the diagnosis column is of character type, we will convert it to a factor.

#converting the diagnosis column from character to factor with levels and labels
wbcd$diagnosis <- factor(wbcd$diagnosis, levels = c("B", "M"), labels =  c("Benign", "Malignant"))

#check the proportion table for the diagnosis column, as percentages

round(prop.table(table(wbcd$diagnosis)) * 100, digits = 1)
## 
##    Benign Malignant 
##      62.7      37.3

Now we want to explore the remaining numeric features in the dataset.

#the summary of a few columns shows they have very different ranges, which requires transformation and normalization
summary(wbcd[c("radius_mean", "area_mean", "smoothness_mean")])
##   radius_mean       area_mean      smoothness_mean  
##  Min.   : 6.981   Min.   : 143.5   Min.   :0.05263  
##  1st Qu.:11.700   1st Qu.: 420.3   1st Qu.:0.08637  
##  Median :13.370   Median : 551.1   Median :0.09587  
##  Mean   :14.127   Mean   : 654.9   Mean   :0.09636  
##  3rd Qu.:15.780   3rd Qu.: 782.7   3rd Qu.:0.10530  
##  Max.   :28.110   Max.   :2501.0   Max.   :0.16340
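
To see why these differing ranges are a problem for k-NN, consider the Euclidean distance between two hypothetical patients using just these three features (a toy illustration with made-up values, not part of the book's code): the area_mean term dominates the sum almost completely.

#squared differences in radius_mean, area_mean and smoothness_mean for two made-up patients;
#area_mean (range ~143 to ~2501) swamps smoothness_mean (range ~0.05 to ~0.16)
sqrt((14 - 12)^2 + (655 - 420)^2 + (0.096 - 0.086)^2)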

Transformation - normalizing numeric data

We will create a normalization function and apply it to all numeric columns.

normalize <- function(x){
    #min-max normalization: rescale x linearly onto the [0, 1] range
    return ((x-min(x)) / (max(x)-min(x)))
}
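
As a quick sanity check (test values of our own, not from the dataset), the function should map any vector linearly onto [0, 1]:

normalize(c(1, 2, 3, 4, 5))
## [1] 0.00 0.25 0.50 0.75 1.00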

#apply normalization to the numeric columns; lapply() returns a list, so we convert the output back to a data frame

wbcd_n <- as.data.frame(lapply(wbcd[2:31], normalize))

#check that the transformation worked as expected

summary(wbcd_n[c("radius_mean", "area_mean", "smoothness_mean")])
##   radius_mean       area_mean      smoothness_mean 
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.2233   1st Qu.:0.1174   1st Qu.:0.3046  
##  Median :0.3024   Median :0.1729   Median :0.3904  
##  Mean   :0.3382   Mean   :0.2169   Mean   :0.3948  
##  3rd Qu.:0.4164   3rd Qu.:0.2711   3rd Qu.:0.4755  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000

Data preparation - creating training and test datasets

We will split the data into training and test sets and store the target variable separately.

# keeping the last 100 rows for testing; random sampling is an option, but the rows are already shuffled in the dataset
wbcd_train <- wbcd_n[1:469,]
wbcd_test <- wbcd_n[470:569,]

#storing the diagnosis labels for the training and test rows

wbcd_train_labels <- wbcd[1:469, 1]
wbcd_test_labels <- wbcd[470:569, 1]
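
Since the rows of this dataset come pre-shuffled, the sequential split above is safe. Otherwise a random split could be used instead (a sketch; the seed value is arbitrary and train_idx is a helper name of our own):

#random alternative to the sequential split above
set.seed(123)                             # arbitrary seed for reproducibility
train_idx <- sample(nrow(wbcd_n), 469)    # draw 469 random row indices
wbcd_train <- wbcd_n[train_idx, ]
wbcd_test  <- wbcd_n[-train_idx, ]
wbcd_train_labels <- wbcd[train_idx, 1]
wbcd_test_labels  <- wbcd[-train_idx, 1]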

Step 3 - training a model on the data

We will use the knn() function to run the k-Nearest Neighbours classification algorithm.

#knn() parameters: train (the training dataset), test (the test dataset), cl (a factor vector of the classes for each training row), and k (the number of nearest neighbours)
# we use k = 21: roughly the square root of the number of training rows, sqrt(469) ≈ 21.7, and an odd k avoids tied votes between the two classes
wbcd_test_pred <- knn(train = wbcd_train, test = wbcd_test, cl = wbcd_train_labels, k = 21)

Step 4 - evaluating model performance

We will use the CrossTable() function from the gmodels package.

# we are not interested in the chi-square values
CrossTable(x = wbcd_test_labels, y = wbcd_test_pred, prop.chisq = FALSE)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  100 
## 
##  
##                  | wbcd_test_pred 
## wbcd_test_labels |    Benign | Malignant | Row Total | 
## -----------------|-----------|-----------|-----------|
##           Benign |        61 |         0 |        61 | 
##                  |     1.000 |     0.000 |     0.610 | 
##                  |     0.968 |     0.000 |           | 
##                  |     0.610 |     0.000 |           | 
## -----------------|-----------|-----------|-----------|
##        Malignant |         2 |        37 |        39 | 
##                  |     0.051 |     0.949 |     0.390 | 
##                  |     0.032 |     1.000 |           | 
##                  |     0.020 |     0.370 |           | 
## -----------------|-----------|-----------|-----------|
##     Column Total |        63 |        37 |       100 | 
##                  |     0.630 |     0.370 |           | 
## -----------------|-----------|-----------|-----------|
## 
## 
This is essentially a confusion matrix: the true negatives and true positives sit on the diagonal. However, two cases were predicted as Benign while they are actually Malignant; this type of error is called a false negative.
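
Overall accuracy can also be read off the table, 61 + 37 = 98 correct out of 100, or computed directly from the predictions with a quick one-liner:

#proportion of test cases classified correctly
mean(wbcd_test_pred == wbcd_test_labels)
## [1] 0.98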

Step 5 - Improving model performance

Transformation - z-score standardization

We will use the scale() function directly; with its default arguments it centers each column on its mean and divides by the standard deviation, producing z-scores.

wbcd_z <- as.data.frame(scale(wbcd[-1]))
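
For reference, scale() with its defaults is equivalent to applying this z-score function column by column (shown only to make the transformation explicit, not as an extra step in the analysis):

standardize <- function(x){
    #z-score: center on the mean, then divide by the standard deviation
    return ((x - mean(x)) / sd(x))
}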

#check that the transformation worked as expected

summary(wbcd_z[c("radius_mean", "area_mean", "smoothness_mean")])
##   radius_mean        area_mean       smoothness_mean   
##  Min.   :-2.0279   Min.   :-1.4532   Min.   :-3.10935  
##  1st Qu.:-0.6888   1st Qu.:-0.6666   1st Qu.:-0.71034  
##  Median :-0.2149   Median :-0.2949   Median :-0.03486  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.00000  
##  3rd Qu.: 0.4690   3rd Qu.: 0.3632   3rd Qu.: 0.63564  
##  Max.   : 3.9678   Max.   : 5.2459   Max.   : 4.76672

Now we will build the model again after this transformation.

# keeping the last 100 rows for testing, as before
wbcd_train <- wbcd_z[1:469,]
wbcd_test <- wbcd_z[470:569,]

#storing the diagnosis labels for the training and test rows

wbcd_train_labels <- wbcd[1:469, 1]
wbcd_test_labels <- wbcd[470:569, 1]

wbcd_test_pred <- knn(train = wbcd_train, test = wbcd_test, cl = wbcd_train_labels, k = 21)

CrossTable(x = wbcd_test_labels, y = wbcd_test_pred, prop.chisq = FALSE)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  100 
## 
##  
##                  | wbcd_test_pred 
## wbcd_test_labels |    Benign | Malignant | Row Total | 
## -----------------|-----------|-----------|-----------|
##           Benign |        61 |         0 |        61 | 
##                  |     1.000 |     0.000 |     0.610 | 
##                  |     0.924 |     0.000 |           | 
##                  |     0.610 |     0.000 |           | 
## -----------------|-----------|-----------|-----------|
##        Malignant |         5 |        34 |        39 | 
##                  |     0.128 |     0.872 |     0.390 | 
##                  |     0.076 |     1.000 |           | 
##                  |     0.050 |     0.340 |           | 
## -----------------|-----------|-----------|-----------|
##     Column Total |        66 |        34 |       100 | 
##                  |     0.660 |     0.340 |           | 
## -----------------|-----------|-----------|-----------|
## 
## 

We can see that the false negatives increased from 2 to 5, so z-score standardization performed worse here than min-max normalization.
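
Another common tweak is to try alternative values of k and compare the number of misclassifications on the test set (a sketch; the exact error counts depend on the split and the transformation used):

#re-run k-NN for several candidate k values and count misclassifications
for (k in c(1, 5, 11, 15, 21, 27)) {
    pred <- knn(train = wbcd_train, test = wbcd_test, cl = wbcd_train_labels, k = k)
    cat("k =", k, "-> errors:", sum(pred != wbcd_test_labels), "\n")
}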

Conclusion

Classification with k-NN is a good starting point for machine learning projects, but there is room for improvement by trying different algorithms or approaches to the same problem.