We will compare the performance of four classification models on a cancer dataset: k-nearest neighbors (KNN), Naive Bayes, random forest, and logistic regression.
Using these models, we will predict the type of cancer, i.e. “Benign” or “Malignant”.
First, let's load the required packages.
packageslist <- list("readr", "caret", "MASS", "klaR", "randomForest")
load_packages <- lapply(packageslist, require, character.only = TRUE)
Now that we have loaded the required packages, let's import the dataset.
# import dataset
cancer_classify <- read_csv("C:/Users/welcome/Downloads/wisc_bc_data-KNN.csv")
head(cancer_classify, n = 5)
# A tibble: 5 x 32
id diagnosis radius_mean texture_mean perimeter_mean area_mean
<int> <chr> <dbl> <dbl> <dbl> <dbl>
1 842302 M 17.99 10.38 122.80 1001.0
2 842517 M 20.57 17.77 132.90 1326.0
3 84300903 M 19.69 21.25 130.00 1203.0
4 84348301 M 11.42 20.38 77.58 386.1
5 84358402 M 20.29 14.34 135.10 1297.0
# ... with 26 more variables: smoothness_mean <dbl>,
# compactness_mean <dbl>, concavity_mean <dbl>, `concave
# points_mean` <dbl>, symmetry_mean <dbl>, fractal_dimension_mean <dbl>,
# radius_se <dbl>, texture_se <dbl>, perimeter_se <dbl>, area_se <dbl>,
# smoothness_se <dbl>, compactness_se <dbl>, concavity_se <dbl>,
# `concave points_se` <dbl>, symmetry_se <dbl>,
# fractal_dimension_se <dbl>, radius_worst <dbl>, texture_worst <dbl>,
# perimeter_worst <dbl>, area_worst <dbl>, smoothness_worst <dbl>,
# compactness_worst <dbl>, concavity_worst <dbl>, `concave
# points_worst` <dbl>, symmetry_worst <dbl>,
# fractal_dimension_worst <dbl>
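The diagnosis column is read in as character (`<chr>`). Before modeling, it can be useful to encode it as a factor and look at the class balance. The following is a minimal sketch on a small made-up vector; `diagnosis_demo` and its values are illustrative and are not taken from the dataset.

```r
# Hypothetical stand-in for cancer_classify$diagnosis (values are made up)
diagnosis_demo <- c("M", "M", "B", "B", "B", "M", "B", "B")

# Encode as a factor with explicit, human-readable labels
diagnosis_fac <- factor(diagnosis_demo, levels = c("B", "M"),
                        labels = c("Benign", "Malignant"))

# Class counts and proportions
class_counts <- table(diagnosis_fac)
class_props <- prop.table(class_counts)
print(class_counts)
print(round(class_props, 3))
```

On the real data, the same two calls applied to cancer_classify$diagnosis would show how many benign and malignant cases there are.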
Before going any further, let's run a sanity check on the data (e.g. for missing values).
missingcols <- sapply(cancer_classify, function(x) {
any(is.na(x))
}) # no missing values
sum(missingcols)
[1] 0
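If a column did contain missing values, a per-column count would be more informative than a single total. A minimal sketch on a made-up data frame (`toy_df` is hypothetical and is not the cancer data):

```r
# Hypothetical data frame with a few missing values (not the cancer data)
toy_df <- data.frame(
  radius_mean  = c(17.99, NA, 19.69, 11.42),
  texture_mean = c(10.38, 17.77, NA, NA),
  area_mean    = c(1001.0, 1326.0, 1203.0, 386.1)
)

# Number of missing values in each column
na_per_col <- colSums(is.na(toy_df))
print(na_per_col)

# Names of the columns that contain at least one missing value
cols_with_na <- names(na_per_col)[na_per_col > 0]
print(cols_with_na)
```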
No missing values were found in the dataset. Next, we remove a variable we do not need for modeling: the patient ID.
cancer_classify <- cancer_classify[, -1] # remove the patient ID variable
Now we will split the dataset into a train set, a validation set, and a test set.
We will build the models on the train set using cross-validation and compare them on the validation set.
The model that performs best on the validation set will be chosen to make predictions on the test set.
# Split dataset into training set and test set (roughly 70/30).
# sample() is random, so the counts below come from one particular run;
# call set.seed() before this line to make the split reproducible.
splitSample <- sample(1:2, size = nrow(cancer_classify), prob = c(0.7, 0.3),
    replace = TRUE)
# training set
train_set <- cancer_classify[splitSample == 1, ] # 397 observations
intrain <- sample(1:2, size = nrow(train_set), prob = c(0.7, 0.3), replace = TRUE)
trainset <- train_set[intrain == 1, ] # 277 observations
# validation set
validset <- train_set[intrain == 2, ] # 120 observations
# test set
testset <- cancer_classify[splitSample == 2, ] # 172 observations
First, we split the dataset into two parts: a train set and a test set. The train set is then split again into a train set and a validation set, which is used to compare the models.
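Sampling labels with prob = c(0.7, 0.3) gives partitions that are only approximately 70/30, and the result changes on every run unless a seed is set first. As an alternative sketch (not the method used above), the rows can be shuffled once and cut at fixed positions. The data frame `toy` and the seed value are assumptions made for illustration.

```r
# Hypothetical data: 100 rows standing in for cancer_classify
toy <- data.frame(id = 1:100, x = seq(0.5, 50, by = 0.5))

set.seed(42)                      # set the seed BEFORE sampling
shuffled <- sample(nrow(toy))     # random permutation of the row indices

n_test  <- round(0.3 * nrow(toy))             # 30 rows held out for testing
n_valid <- round(0.3 * (nrow(toy) - n_test))  # 21 of the remaining 70 rows

test_idx  <- shuffled[seq_len(n_test)]
valid_idx <- shuffled[n_test + seq_len(n_valid)]
train_idx <- shuffled[-seq_len(n_test + n_valid)]

toy_test  <- toy[test_idx, ]
toy_valid <- toy[valid_idx, ]
toy_train <- toy[train_idx, ]

# Partition sizes are exact, and no row appears in more than one partition
print(c(train = nrow(toy_train), valid = nrow(toy_valid), test = nrow(toy_test)))
```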
We will be using k-fold cross-validation, with the number of folds (k) set to 10.
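To make the idea concrete: k-fold cross-validation assigns each training row to one of k folds, then holds out each fold in turn while fitting on the other k - 1. caret handles this internally; the base R sketch below only illustrates the mechanism. It uses a made-up row count and a trivial majority-class "model" so that it runs without extra packages, and all names in it are hypothetical.

```r
set.seed(1)
n <- 50                                   # hypothetical number of training rows
k <- 10                                   # number of folds
labels <- factor(rep(c("B", "M"), times = c(30, 20)))

# Assign every row to exactly one fold, in random order
fold_id <- sample(rep(seq_len(k), length.out = n))

fold_accuracy <- numeric(k)
for (i in seq_len(k)) {
  held_out <- which(fold_id == i)         # rows used for evaluation
  fit_rows <- which(fold_id != i)         # rows used for fitting
  # "Model": always predict the most frequent class among the fitting rows
  majority <- names(which.max(table(labels[fit_rows])))
  fold_accuracy[i] <- mean(labels[held_out] == majority)
}

# Cross-validated accuracy is the average over the k held-out folds
cv_accuracy <- mean(fold_accuracy)
print(cv_accuracy)
```

In caret, the equivalent resampling scheme is requested through trainControl, as in the next line.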
tcontrol <- trainControl(method = "cv", number = 10)
set.seed(1234)
Using the training partition, we fit several classification models with 10-fold cross-validation.
# KNN (predictors are centered and scaled via preProcess)
modelKNN <- train(diagnosis ~ ., data = trainset, method = "knn",
    preProcess = c("center", "scale"), trControl = tcontrol)
# Naive Bayes
modelNB <- train(diagnosis ~ ., data = trainset, method = "nb", trControl = tcontrol)
# Random Forest
modelRF <- train(diagnosis ~ ., data = trainset, method = "rf", ntree = 100,
    importance = TRUE, trControl = tcontrol)
# Logistic Regression
modelLG <- train(diagnosis ~ ., data = trainset, method = "glm", family = binomial,
    trControl = tcontrol)
We now use the trained models to make predictions on the validation set.
# KNN
pKNN <- predict(modelKNN, validset)
# Naive Bayes
pNB <- predict(modelNB, validset)
# Random Forest
pRF <- predict(modelRF, validset)
# Logistic Regression
pLG <- predict(modelLG, validset)
Predictions from the validation set can be compared to the actual outcomes to create a confusion matrix for each model.
# confusionMatrix() expects the predictions first and the true labels second,
# both as factors with the same levels
actual <- factor(validset$diagnosis, levels = levels(pKNN))
# KNN
cmKNN <- confusionMatrix(pKNN, actual)
# Naive Bayes
cmNB <- confusionMatrix(pNB, actual)
# Random Forest
cmRF <- confusionMatrix(pRF, actual)
# Logistic Regression
cmLG <- confusionMatrix(pLG, actual)
Let's put all of this together in a table.
ModelType <- c("K nearest neighbor", "Naive Bayes", "Random forest", "Logistic regression") # vector containing names of models
# Training classification accuracy (best cross-validated accuracy over the tuning grid)
TrainAccuracy <- c(max(modelKNN$results$Accuracy), max(modelNB$results$Accuracy),
max(modelRF$results$Accuracy), max(modelLG$results$Accuracy))
# Training misclassification error
Train_missclass_Error <- 1 - TrainAccuracy
# validation classification accuracy
ValidationAccuracy <- c(cmKNN$overall[1], cmNB$overall[1], cmRF$overall[1],
cmLG$overall[1])
# Validation misclassification error or out-of-sample-error
Validation_missclass_Error <- 1 - ValidationAccuracy
metrics <- data.frame(ModelType, TrainAccuracy, Train_missclass_Error, ValidationAccuracy,
Validation_missclass_Error) # data frame with above metrics
knitr::kable(metrics, digits = 5) # print table using kable() from knitr package
| ModelType | TrainAccuracy | Train_missclass_Error | ValidationAccuracy | Validation_missclass_Error |
|---|---|---|---|---|
| K nearest neighbor | 0.97090 | 0.02910 | 0.96581 | 0.03419 |
| Naive Bayes | 0.94550 | 0.05450 | 0.94872 | 0.05128 |
| Random forest | 0.96413 | 0.03587 | 0.94872 | 0.05128 |
| Logistic regression | 0.93581 | 0.06419 | 0.95726 | 0.04274 |
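The comparison can also be read off programmatically. The sketch below re-enters the validation errors from the table above by hand (in the real session they are already in the metrics data frame) and ranks the models by them; `valid_err` is a hypothetical name.

```r
# Validation misclassification errors copied from the table above
valid_err <- c("K nearest neighbor"  = 0.03419,
               "Naive Bayes"         = 0.05128,
               "Random forest"       = 0.05128,
               "Logistic regression" = 0.04274)

# Rank models from lowest to highest validation error
ranked <- sort(valid_err)
print(ranked)

# Model with the lowest validation error
best_model <- names(which.min(valid_err))
print(best_model)
```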
In the table above, KNN has the lowest out-of-sample (validation) error at 0.03419, followed by logistic regression at 0.04274; random forest and Naive Bayes tie at 0.05128.
Because the data are split at random, these figures change from run to run.
We now use both the random forest model and the KNN model to predict the test observations.
pTestingRF <- predict(modelRF, testset)
pTestingRF
 [1] M M M M B B M M M M B B M B B B B B M M M B B M M B B B M M B B B B B
[36] M B B B M B B B B B B B B B M B B M B B B B M M M B M M M B B B B M M
[71] B B M M M
[ reached getOption("max.print") -- omitted 98 entries ]
Levels: B M
pTestingKNN <- predict(modelKNN, testset)
pTestingKNN
 [1] M M M M B B M M M M B B M B B B B B M M M B M M M B B B M B B B B B B
[36] M B B M B B B B B B B B B B M B B M B B B M M M M B B M M B B B B M M
[71] B B M M M
[ reached getOption("max.print") -- omitted 98 entries ]
Levels: B M
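The predictions above can be scored against the true test labels in the same way as on the validation set. A minimal base R sketch, using short made-up vectors in place of pTestingRF and testset$diagnosis (`predicted` and `truth` are hypothetical stand-ins, not real results):

```r
# Hypothetical predicted and true labels (made up for illustration)
predicted <- factor(c("M", "M", "B", "B", "M", "B", "B", "M", "B", "B"),
                    levels = c("B", "M"))
truth     <- factor(c("M", "B", "B", "B", "M", "B", "M", "M", "B", "B"),
                    levels = c("B", "M"))

# Confusion table: rows are predictions, columns are true labels
conf_tab <- table(Predicted = predicted, Actual = truth)
print(conf_tab)

# Accuracy is the share of predictions that fall on the diagonal
test_accuracy <- sum(diag(conf_tab)) / sum(conf_tab)
test_error <- 1 - test_accuracy
print(c(accuracy = test_accuracy, error = test_error))
```

With caret loaded, calling confusionMatrix() on the real predictions and the test labels (as factors) should report the same accuracy, along with sensitivity and specificity.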