Introduction

We will compare the performance of the following classification models on the Wisconsin breast cancer dataset:

  1. K-Nearest Neighbour
  2. Naive Bayes
  3. Random Forest
  4. Logistic Regression

Using the above models, we will predict the type of tumour, i.e. “Benign” or “Malignant”.

First, let's load the required packages.

packageslist <- list("readr", "caret", "MASS", "klaR", "randomForest")  # packages used in this analysis

load_packages <- lapply(packageslist, require, character.only = TRUE)
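
If any of these packages are missing on your machine, a quick one-off install along these lines should work (a minimal sketch reusing the same packageslist):

# install any packages from the list that are not yet present
new_packages <- setdiff(unlist(packageslist), rownames(installed.packages()))
if (length(new_packages) > 0) install.packages(new_packages)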

Now that we have loaded the required packages, let's import the dataset.

# import dataset
cancer_classify <- read_csv("C:/Users/welcome/Downloads/wisc_bc_data-KNN.csv")

head(cancer_classify, n = 5)
# A tibble: 5 x 32
        id diagnosis radius_mean texture_mean perimeter_mean area_mean
     <int>     <chr>       <dbl>        <dbl>          <dbl>     <dbl>
1   842302         M       17.99        10.38         122.80    1001.0
2   842517         M       20.57        17.77         132.90    1326.0
3 84300903         M       19.69        21.25         130.00    1203.0
4 84348301         M       11.42        20.38          77.58     386.1
5 84358402         M       20.29        14.34         135.10    1297.0
# ... with 26 more variables: smoothness_mean <dbl>,
#   compactness_mean <dbl>, concavity_mean <dbl>, `concave
#   points_mean` <dbl>, symmetry_mean <dbl>, fractal_dimension_mean <dbl>,
#   radius_se <dbl>, texture_se <dbl>, perimeter_se <dbl>, area_se <dbl>,
#   smoothness_se <dbl>, compactness_se <dbl>, concavity_se <dbl>,
#   `concave points_se` <dbl>, symmetry_se <dbl>,
#   fractal_dimension_se <dbl>, radius_worst <dbl>, texture_worst <dbl>,
#   perimeter_worst <dbl>, area_worst <dbl>, smoothness_worst <dbl>,
#   compactness_worst <dbl>, concavity_worst <dbl>, `concave
#   points_worst` <dbl>, symmetry_worst <dbl>,
#   fractal_dimension_worst <dbl>

Before going any further, let's run some sanity checks on the data (e.g. for missing values).

missingcols <- sapply(cancer_classify, function(x) {
    any(is.na(x))
})  # TRUE for every column that contains missing values

sum(missingcols)  # number of columns with missing values
[1] 0

No missing values were found in the dataset. Next, we will remove an unwanted variable from the dataset, in this case the patient ID.

cancer_classify <- cancer_classify[, -1]  # drop the patient ID column
cancer_classify$diagnosis <- factor(cancer_classify$diagnosis)  # encode the outcome as a factor for caret
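
It is also worth glancing at the class balance of the outcome, since a heavily skewed dataset would change how we read the accuracy figures later. A quick optional check:

# distribution of the two diagnosis classes (B = benign, M = malignant)
table(cancer_classify$diagnosis)
prop.table(table(cancer_classify$diagnosis))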

Data splitting

Split the dataset into train, validation and test sets

Now, we will split the dataset into a train set, a validation set and a test set.

We will use the train set and the validation set for building models through cross-validation.

The model that performs best on the validation set will be chosen to make predictions on the test set.

# Split dataset into training set and test set

splitSample <- sample(1:2, size = nrow(cancer_classify), prob = c(0.7, 0.3), 
    replace = T)

# training set
train_set <- cancer_classify[splitSample == 1, ]  # 397 observations

intrain <- sample(1:2, size = nrow(train_set), prob = c(0.7, 0.3), replace = T)

trainset <- train_set[intrain == 1, ]  # 277 observations

# validation set
validset <- train_set[intrain == 2, ]  # 120 observations

# test set
testset <- cancer_classify[splitSample == 2, ]  # 172 observations

First we split the dataset into two parts, a train set and a test set. The train set is further split into a train set and a validation set, which we will use to compare the cross-validated models.
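
As a quick check that the split behaved as expected, we can count the rows in each partition (without a seed set before sampling, these counts will vary slightly between runs):

# number of observations in each partition
sapply(list(train = trainset, validation = validset, test = testset), nrow)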

Cross-validation

We will be using k-fold cross-validation, with the number of folds (k) set to 10.

tcontrol <- trainControl(method = "cv", number = 10)
set.seed(1234)
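
If more stable accuracy estimates are wanted, caret also supports repeated cross-validation; a control like the one below could be swapped in (shown for illustration, not used in this post):

# alternative: 10-fold cross-validation repeated 3 times
tcontrol_repeated <- trainControl(method = "repeatedcv", number = 10, repeats = 3)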

Train models

Using the training partition, we fit a few different classification models, each tuned with 10-fold cross-validation.

# KNN
modelKNN <- train(diagnosis ~ ., data = trainset, method = "knn", preProcess = c("center", 
    "scale"), trControl = tcontrol)  # predictors are centred and scaled via preProcess
# Naive Bayes
modelNB <- train(diagnosis ~ ., data = trainset, method = "nb", trControl = tcontrol)
# Random Forest
modelRF <- train(diagnosis ~ ., data = trainset, method = "rf", ntree = 100, 
    importance = T, trControl = tcontrol)
# Logistic Regression
modelLG <- train(diagnosis ~ ., data = trainset, method = "glm", family = binomial, 
    trControl = tcontrol)
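
Since we passed importance = T to the random forest, we can optionally inspect which predictors the model relies on most:

# variable importance from the fitted random forest
varImp(modelRF)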

Predict on validation set

We will use the trained models to make predictions on the validation set.

# KNN
pKNN <- predict(modelKNN, validset)
# Naive Bayes
pNB <- predict(modelNB, validset)
# Random Forest
pRF <- predict(modelRF, validset)
# Logistic Regression
pLG <- predict(modelLG, validset)

Confusion matrix

Predictions from the validation set can be compared to the actual outcomes to create a confusion matrix for each model.

# KNN
cmKNN <- confusionMatrix(pKNN, validset$diagnosis)  # predicted classes first, then the reference
# Naive Bayes
cmNB <- confusionMatrix(pNB, validset$diagnosis)
# Random Forest
cmRF <- confusionMatrix(pRF, validset$diagnosis)
# Logistic Regression
cmLG <- confusionMatrix(pLG, validset$diagnosis)
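
Each confusionMatrix object carries the full cross-tabulation plus a set of summary statistics; for example, for KNN:

cmKNN$table  # cross-tabulation of predicted vs. actual classes
cmKNN$overall["Accuracy"]  # overall accuracy, used in the comparison table below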

Let's put all of this together in a table.

ModelType <- c("K nearest neighbor", "Naive Bayes", "Random forest", "Logistic regression")  # vector containing names of models

# Training classification accuracy
TrainAccuracy <- c(max(modelKNN$results$Accuracy), max(modelNB$results$Accuracy), 
    max(modelRF$results$Accuracy), max(modelLG$results$Accuracy))

# Training misclassification error
Train_missclass_Error <- 1 - TrainAccuracy

# validation classification accuracy
ValidationAccuracy <- c(cmKNN$overall[1], cmNB$overall[1], cmRF$overall[1], 
    cmLG$overall[1])

# Validation misclassification error, i.e. the out-of-sample error
Validation_missclass_Error <- 1 - ValidationAccuracy

metrics <- data.frame(ModelType, TrainAccuracy, Train_missclass_Error, ValidationAccuracy, 
    Validation_missclass_Error)  # data frame with above metrics

knitr::kable(metrics, digits = 5)  # print table using kable() from knitr package
ModelType           | TrainAccuracy| Train_missclass_Error| ValidationAccuracy| Validation_missclass_Error
:-------------------|-------------:|---------------------:|------------------:|--------------------------:
K nearest neighbor  |       0.97090|               0.02910|            0.96581|                    0.03419
Naive Bayes         |       0.94550|               0.05450|            0.94872|                    0.05128
Random forest       |       0.96413|               0.03587|            0.94872|                    0.05128
Logistic regression |       0.93581|               0.06419|            0.95726|                    0.04274

The model built using the KNN method has the lowest out-of-sample error and is chosen to make the final predictions.

Logistic regression comes a close second, with random forest and Naive Bayes just behind.

Predicting Test Values

We now predict the 172 test values, first with the random forest model for comparison and then with the chosen KNN model.

pTestingRF <- predict(modelRF, testset)
pTestingRF
 [1] M M M M B B M M M M B B M B B B B B M M M B B M M B B B M M B B B B B
[36] M B B B M B B B B B B B B B M B B M B B B B M M M B M M M B B B B M M
[71] B B M M M
 [ reached getOption("max.print") -- omitted 98 entries ]
Levels: B M
pTestingKNN <- predict(modelKNN, testset)
pTestingKNN
 [1] M M M M B B M M M M B B M B B B B B M M M B M M M B B B M B B B B B B
[36] M B B M B B B B B B B B B B M B B M B B B M M M M B B M M B B B B M M
[71] B B M M M
 [ reached getOption("max.print") -- omitted 98 entries ]
Levels: B M
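
Since the test set still contains the true diagnosis column, the chosen model's test performance can be scored directly (a final check; the exact numbers will depend on the random split above):

# test-set performance of the chosen KNN model
confusionMatrix(pTestingKNN, factor(testset$diagnosis))

# proportion of test cases on which the two models agree
mean(pTestingRF == pTestingKNN)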