This example is from the book Machine Learning with R by Brett Lantz, chapter 3.
A link to the book: https://bit.ly/3gsf2e0
This project is for educational purposes only.
The aim is to diagnose breast cancer using the k-NN algorithm. The classification is based on the diagnosis column, which records the diagnosis of breast tissue for abnormal lumps and has two values: M for malignant and B for benign. Malignant is the bad one!
The class and gmodels packages are required.
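If they are not installed yet, they can be installed once from CRAN:
#one-time setup, uncomment to run
#install.packages(c("class", "gmodels"))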
library(class)
library(gmodels)
The dataset comes from the UCI Machine Learning Repository.
#reading data
wbcd <- read.csv("wisc_bc_data.csv", stringsAsFactors = FALSE)
#check data structure
str(wbcd)
## 'data.frame': 569 obs. of 32 variables:
## $ id : int 87139402 8910251 905520 868871 9012568 906539 925291 87880 862989 89827 ...
## $ diagnosis : chr "B" "B" "B" "B" ...
## $ radius_mean : num 12.3 10.6 11 11.3 15.2 ...
## $ texture_mean : num 12.4 18.9 16.8 13.4 13.2 ...
## $ perimeter_mean : num 78.8 69.3 70.9 73 97.7 ...
## $ area_mean : num 464 346 373 385 712 ...
## $ smoothness_mean : num 0.1028 0.0969 0.1077 0.1164 0.0796 ...
## $ compactness_mean : num 0.0698 0.1147 0.078 0.1136 0.0693 ...
## $ concavity_mean : num 0.0399 0.0639 0.0305 0.0464 0.0339 ...
## $ points_mean : num 0.037 0.0264 0.0248 0.048 0.0266 ...
## $ symmetry_mean : num 0.196 0.192 0.171 0.177 0.172 ...
## $ dimension_mean : num 0.0595 0.0649 0.0634 0.0607 0.0554 ...
## $ radius_se : num 0.236 0.451 0.197 0.338 0.178 ...
## $ texture_se : num 0.666 1.197 1.387 1.343 0.412 ...
## $ perimeter_se : num 1.67 3.43 1.34 1.85 1.34 ...
## $ area_se : num 17.4 27.1 13.5 26.3 17.7 ...
## $ smoothness_se : num 0.00805 0.00747 0.00516 0.01127 0.00501 ...
## $ compactness_se : num 0.0118 0.03581 0.00936 0.03498 0.01485 ...
## $ concavity_se : num 0.0168 0.0335 0.0106 0.0219 0.0155 ...
## $ points_se : num 0.01241 0.01365 0.00748 0.01965 0.00915 ...
## $ symmetry_se : num 0.0192 0.035 0.0172 0.0158 0.0165 ...
## $ dimension_se : num 0.00225 0.00332 0.0022 0.00344 0.00177 ...
## $ radius_worst : num 13.5 11.9 12.4 11.9 16.2 ...
## $ texture_worst : num 15.6 22.9 26.4 15.8 15.7 ...
## $ perimeter_worst : num 87 78.3 79.9 76.5 104.5 ...
## $ area_worst : num 549 425 471 434 819 ...
## $ smoothness_worst : num 0.139 0.121 0.137 0.137 0.113 ...
## $ compactness_worst: num 0.127 0.252 0.148 0.182 0.174 ...
## $ concavity_worst : num 0.1242 0.1916 0.1067 0.0867 0.1362 ...
## $ points_worst : num 0.0939 0.0793 0.0743 0.0861 0.0818 ...
## $ symmetry_worst : num 0.283 0.294 0.3 0.21 0.249 ...
## $ dimension_worst : num 0.0677 0.0759 0.0788 0.0678 0.0677 ...
The first feature is a unique ID, which carries no information for the classification problem, so we will remove it.
#Removing the ID column from the dataset
wbcd <- wbcd[-1]
The second feature, diagnosis, is the outcome of the classification/prediction process.
#check the diagnosis column
table(wbcd$diagnosis)
##
## B M
## 357 212
As the diagnosis column is of character type, we will convert it to a factor.
#converting the diagnosis column from character to factor with levels and labels
wbcd$diagnosis <- factor(wbcd$diagnosis, levels = c("B", "M"), labels = c("Benign", "Malignant"))
#check the proportion table for the diagnosis column
round(prop.table(table(wbcd$diagnosis)) * 100, digits = 1)
##
## Benign Malignant
## 62.7 37.3
Now we want to explore the other numerical features in the dataset.
#the summary of each column shows they have very different ranges, which requires transformation and normalization before k-NN
summary(wbcd[c("radius_mean", "area_mean", "smoothness_mean")])
## radius_mean area_mean smoothness_mean
## Min. : 6.981 Min. : 143.5 Min. :0.05263
## 1st Qu.:11.700 1st Qu.: 420.3 1st Qu.:0.08637
## Median :13.370 Median : 551.1 Median :0.09587
## Mean :14.127 Mean : 654.9 Mean :0.09636
## 3rd Qu.:15.780 3rd Qu.: 782.7 3rd Qu.:0.10530
## Max. :28.110 Max. :2501.0 Max. :0.16340
We will create a min-max normalization function and apply it to all numerical columns.
#min-max normalization: rescales a numeric vector to the [0, 1] range
normalize <- function(x) {
  return((x - min(x)) / (max(x) - min(x)))
}
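A quick sanity check on made-up vectors: the results should be identical regardless of the original scale.
#both calls should return 0.00 0.25 0.50 0.75 1.00
normalize(c(1, 2, 3, 4, 5))
normalize(c(10, 20, 30, 40, 50))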
#apply normalization to the 30 numeric feature columns (2 to 31); lapply() returns a list, so convert it back to a data frame
wbcd_n <- as.data.frame(lapply(wbcd[2:31], normalize))
#check if the transformation is done as we expect
summary(wbcd_n[c("radius_mean", "area_mean", "smoothness_mean")])
## radius_mean area_mean smoothness_mean
## Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.2233 1st Qu.:0.1174 1st Qu.:0.3046
## Median :0.3024 Median :0.1729 Median :0.3904
## Mean :0.3382 Mean :0.2169 Mean :0.3948
## 3rd Qu.:0.4164 3rd Qu.:0.2711 3rd Qu.:0.4755
## Max. :1.0000 Max. :1.0000 Max. :1.0000
We will split the data into training and test sets; the class labels are kept in separate vectors.
# we are keeping the last 100 rows for testing; random sampling could be used instead (see the sketch after the split), but the rows are already shuffled in this dataset
wbcd_train <- wbcd_n[1:469,]
wbcd_test <- wbcd_n[470:569,]
#storing the target variable (diagnosis) in separate label vectors for training and test
wbcd_train_labels <- wbcd[1:469, 1]
wbcd_test_labels <- wbcd[470:569, 1]
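As mentioned above, a random split could be drawn instead if the rows were not already shuffled. A minimal sketch, commented out so it does not overwrite the split above (the seed value is arbitrary, shown only for reproducibility):
#alternative: random train/test split
#set.seed(123)
#train_idx <- sample(nrow(wbcd_n), 469)
#wbcd_train <- wbcd_n[train_idx, ]
#wbcd_test <- wbcd_n[-train_idx, ]
#wbcd_train_labels <- wbcd$diagnosis[train_idx]
#wbcd_test_labels <- wbcd$diagnosis[-train_idx]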
We will use the knn() function to run the k-nearest neighbors classification algorithm.
#knn function parameters: train = training dataset, test = test dataset, cl = factor vector of classes for the training rows, k = number of nearest neighbors
# we will use k = 21, roughly the square root of the number of training rows (sqrt(469) is about 21.7); an odd value also avoids ties between the two classes
wbcd_test_pred <- knn(train = wbcd_train, test = wbcd_test, cl = wbcd_train_labels, k = 21)
We will use the CrossTable() function from the gmodels package to evaluate the predictions.
# we are not interested in the chi-square values, so prop.chisq = FALSE
CrossTable(x = wbcd_test_labels, y = wbcd_test_pred, prop.chisq = FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 100
##
##
## | wbcd_test_pred
## wbcd_test_labels | Benign | Malignant | Row Total |
## -----------------|-----------|-----------|-----------|
## Benign | 61 | 0 | 61 |
## | 1.000 | 0.000 | 0.610 |
## | 0.968 | 0.000 | |
## | 0.610 | 0.000 | |
## -----------------|-----------|-----------|-----------|
## Malignant | 2 | 37 | 39 |
## | 0.051 | 0.949 | 0.390 |
## | 0.032 | 1.000 | |
## | 0.020 | 0.370 | |
## -----------------|-----------|-----------|-----------|
## Column Total | 63 | 37 | 100 |
## | 0.630 | 0.370 | |
## -----------------|-----------|-----------|-----------|
##
##
This is essentially a confusion matrix: the true negatives and true positives sit on the diagonal. However, two cases were predicted as Benign while they are actually Malignant. This type of error is called a false negative, and in this setting it is the costlier kind of mistake, since a malignant tumor would go undetected.
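Overall accuracy can also be computed directly from the prediction vector; with the table above it should come out to (61 + 37) / 100 = 0.98.
#proportion of test cases classified correctly
mean(wbcd_test_pred == wbcd_test_labels)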
As an alternative transformation, we will z-score standardize the features with the built-in scale() function, which by default centers each column to mean 0 and rescales it by its standard deviation.
wbcd_z <- as.data.frame(scale(wbcd[-1]))
#check if the transformation is done as we expect
summary(wbcd_z[c("radius_mean", "area_mean", "smoothness_mean")])
## radius_mean area_mean smoothness_mean
## Min. :-2.0279 Min. :-1.4532 Min. :-3.10935
## 1st Qu.:-0.6888 1st Qu.:-0.6666 1st Qu.:-0.71034
## Median :-0.2149 Median :-0.2949 Median :-0.03486
## Mean : 0.0000 Mean : 0.0000 Mean : 0.00000
## 3rd Qu.: 0.4690 3rd Qu.: 0.3632 3rd Qu.: 0.63564
## Max. : 3.9678 Max. : 5.2459 Max. : 4.76672
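To make explicit what scale() does, here is a minimal by-hand equivalent for a single column; the values should match the corresponding wbcd_z column.
#z-score standardization by hand: (x - mean(x)) / sd(x)
head((wbcd$radius_mean - mean(wbcd$radius_mean)) / sd(wbcd$radius_mean))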
Now we will rebuild the model after this transformation.
# again keeping the last 100 rows for testing, as before
wbcd_train <- wbcd_z[1:469,]
wbcd_test <- wbcd_z[470:569,]
#storing the target variable (diagnosis) in label vectors, same as before
wbcd_train_labels <- wbcd[1:469, 1]
wbcd_test_labels <- wbcd[470:569, 1]
wbcd_test_pred <- knn(train = wbcd_train, test = wbcd_test, cl = wbcd_train_labels, k = 21)
CrossTable(x = wbcd_test_labels, y = wbcd_test_pred, prop.chisq = FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 100
##
##
## | wbcd_test_pred
## wbcd_test_labels | Benign | Malignant | Row Total |
## -----------------|-----------|-----------|-----------|
## Benign | 61 | 0 | 61 |
## | 1.000 | 0.000 | 0.610 |
## | 0.924 | 0.000 | |
## | 0.610 | 0.000 | |
## -----------------|-----------|-----------|-----------|
## Malignant | 5 | 34 | 39 |
## | 0.128 | 0.872 | 0.390 |
## | 0.076 | 1.000 | |
## | 0.050 | 0.340 | |
## -----------------|-----------|-----------|-----------|
## Column Total | 66 | 34 | 100 |
## | 0.660 | 0.340 | |
## -----------------|-----------|-----------|-----------|
##
##
We can see that the false negatives increased from 2 to 5, so z-score standardization performed worse than min-max normalization on this split.
Classification using k-NN is a good starting point for machine learning projects, but there is room for improvement, for example by tuning k (see the sketch below) or by trying different algorithms on the same problem.
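As one concrete follow-up, different values of k can be compared on the min-max normalized split. A minimal sketch; the candidate k values are arbitrary choices, not recommendations:
#re-create the min-max normalized split, then try several k values
wbcd_train <- wbcd_n[1:469, ]
wbcd_test <- wbcd_n[470:569, ]
for (k in c(1, 5, 11, 15, 21, 27)) {
  pred <- knn(train = wbcd_train, test = wbcd_test, cl = wbcd_train_labels, k = k)
  fn <- sum(pred == "Benign" & wbcd_test_labels == "Malignant") #count false negatives
  cat("k =", k, "accuracy =", mean(pred == wbcd_test_labels), "false negatives =", fn, "\n")
}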