This vignette builds a classification model that predicts whether a tumour is malignant or benign, using the “BreastCancer” dataset from the mlbench package.
A classification problem is the task of predicting the target class of new observations from a given set of predictor variables. For example, if a dataset consists of images of birds, plants and animals, the classification problem is to decide whether each image shows a bird, a plant or an animal (the target class). This kind of prediction is made by a classification model, also known as a classifier. The classifier is built from case samples drawn from the population dataset: it is fitted on a training set and evaluated on a test set. Many algorithms can be used to build a classifier, such as logistic regression, Naïve Bayes, random forest and decision tree.
The dataset consists of a sample of clinical cases reported by Dr. Wolberg. The objective is to predict whether a new patient has a malignant tumour from a set of predictor variables.
#Install all the below packages using function install.packages()
library(mlbench) #Package which has dataset- BreastCancer
## Warning: package 'mlbench' was built under R version 3.4.4
library(caTools) #Package has split function which is used to split our dataset into training and test data.
## Warning: package 'caTools' was built under R version 3.4.4
library(caret) #Package has functions for training and plotting models
## Warning: package 'caret' was built under R version 3.4.4
## Loading required package: lattice
## Loading required package: ggplot2
library(mice) #Package has function to impute missing (NA) values in dataset
## Warning: package 'mice' was built under R version 3.4.4
library(e1071) #Package has function to implement naiveBayes classification algorithm
## Warning: package 'e1071' was built under R version 3.4.4
library(rpart) #Package has function to implement tree algorithm
library(randomForest) #Package has function to implement Random Forest Algorithm
## Warning: package 'randomForest' was built under R version 3.4.4
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
data("BreastCancer") #Loads dataset
The dataset consists of 699 observations, and the target class is BreastCancer$Class, which specifies whether the tumour in each observation is benign or malignant.
A detailed description of the dataset can be found at https://cran.r-project.org/web/packages/mlbench/mlbench.pdf.
str(BreastCancer) #Getting the structure of Dataset
## 'data.frame': 699 obs. of 11 variables:
## $ Id : chr "1000025" "1002945" "1015425" "1016277" ...
## $ Cl.thickness : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 5 5 3 6 4 8 1 2 2 4 ...
## $ Cell.size : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 1 4 1 8 1 10 1 1 1 2 ...
## $ Cell.shape : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 1 4 1 8 1 10 1 2 1 1 ...
## $ Marg.adhesion : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 1 5 1 1 3 8 1 1 1 1 ...
## $ Epith.c.size : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 2 7 2 3 2 7 2 2 2 2 ...
## $ Bare.nuclei : Factor w/ 10 levels "1","2","3","4",..: 1 10 2 4 1 10 10 1 1 1 ...
## $ Bl.cromatin : Factor w/ 10 levels "1","2","3","4",..: 3 3 3 3 3 9 3 3 1 2 ...
## $ Normal.nucleoli: Factor w/ 10 levels "1","2","3","4",..: 1 2 1 7 1 7 1 1 1 1 ...
## $ Mitoses : Factor w/ 9 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 5 1 ...
## $ Class : Factor w/ 2 levels "benign","malignant": 1 1 1 1 1 2 1 1 1 1 ...
levels(BreastCancer$Class) #Finding the levels of target class
## [1] "benign" "malignant"
A more detailed summary of the dataset is obtained with the summary() function.
summary(BreastCancer) #Summary of Dataset
## Id Cl.thickness Cell.size Cell.shape
## Length:699 1 :145 1 :384 1 :353
## Class :character 5 :130 10 : 67 2 : 59
## Mode :character 3 :108 3 : 52 10 : 58
## 4 : 80 2 : 45 3 : 56
## 10 : 69 4 : 40 4 : 44
## 2 : 50 5 : 30 5 : 34
## (Other):117 (Other): 81 (Other): 95
## Marg.adhesion Epith.c.size Bare.nuclei Bl.cromatin Normal.nucleoli
## 1 :407 2 :386 1 :402 2 :166 1 :443
## 2 : 58 3 : 72 10 :132 3 :165 10 : 61
## 3 : 58 4 : 48 2 : 30 1 :152 3 : 44
## 10 : 55 1 : 47 5 : 30 7 : 73 2 : 36
## 4 : 33 6 : 41 3 : 28 4 : 40 8 : 24
## 8 : 25 5 : 39 (Other): 61 5 : 34 6 : 22
## (Other): 63 (Other): 66 NA's : 16 (Other): 69 (Other): 69
## Mitoses Class
## 1 :579 benign :458
## 2 : 35 malignant:241
## 3 : 33
## 10 : 14
## 4 : 12
## 7 : 9
## (Other): 17
The summary shows 16 NA (missing) values, all in the Bare.nuclei column.
Missing values are a common problem in datasets. There are several ways to handle them, such as omitting the affected observations or replacing the missing values with the mean or mode of the variable. More techniques are described in https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3668100/. Here the mice library is used to fill the 16 NAs by imputing plausible values based on all nine predictor columns in the dataset.
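The two simpler strategies mentioned above, omitting incomplete observations and replacing missing values with the mode, can be sketched in base R. This is only an illustration on a made-up data frame: the names `toy` and `impute_mode` are hypothetical and are not part of this vignette's pipeline, which uses mice instead.

```r
# Toy data frame with one missing value in a factor column (illustrative only)
toy <- data.frame(
  Bare.nuclei = factor(c("1", "10", NA, "1", "2")),
  Class       = factor(c("benign", "malignant", "benign", "benign", "benign"))
)

# Strategy 1: omit every observation that contains an NA
toy_omitted <- na.omit(toy)

# Strategy 2: replace NAs with the mode (most frequent level) of the column
impute_mode <- function(x) {
  mode_level <- names(which.max(table(x)))  # table() ignores NA by default
  x[is.na(x)] <- mode_level
  x
}
toy_imputed <- toy
toy_imputed$Bare.nuclei <- impute_mode(toy$Bare.nuclei)

nrow(toy_omitted)   # rows left after omission
anyNA(toy_imputed)  # checks whether any missing value remains
```

Omission discards information from the other columns of the dropped rows, while mode imputation keeps every row but ignores the relationship between variables, which is what a model-based imputation tries to use.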
The Id column is filtered out as it is not needed for designing the classifier.
dataset_impute <- mice(BreastCancer[,2:10], print = FALSE) #Imputing NA values with library mice; the Id (1st) and Class (11th) columns are left out
BreastCancer <- cbind(BreastCancer[,11, drop = FALSE], mice::complete(dataset_impute, 1)) #Adding Target class to the imputed dataset without NA
Checking summary of the modified dataset
summary(BreastCancer) #Summary of Dataset
## Class Cl.thickness Cell.size Cell.shape Marg.adhesion
## benign :458 1 :145 1 :384 1 :353 1 :407
## malignant:241 5 :130 10 : 67 2 : 59 2 : 58
## 3 :108 3 : 52 10 : 58 3 : 58
## 4 : 80 2 : 45 3 : 56 10 : 55
## 10 : 69 4 : 40 4 : 44 4 : 33
## 2 : 50 5 : 30 5 : 34 8 : 25
## (Other):117 (Other): 81 (Other): 95 (Other): 63
## Epith.c.size Bare.nuclei Bl.cromatin Normal.nucleoli Mitoses
## 2 :386 1 :412 2 :166 1 :443 1 :579
## 3 : 72 10 :133 3 :165 10 : 61 2 : 35
## 4 : 48 2 : 31 1 :152 3 : 44 3 : 33
## 1 : 47 5 : 30 7 : 73 2 : 36 10 : 14
## 6 : 41 3 : 29 4 : 40 8 : 24 4 : 12
## 5 : 39 8 : 21 5 : 34 6 : 22 7 : 9
## (Other): 66 (Other): 43 (Other): 69 (Other): 69 (Other): 17
set.seed(150)
split=sample.split(BreastCancer, SplitRatio = 0.7) # Splitting data into training and test dataset
training_set=subset(BreastCancer,split==TRUE) # Training dataset
test_set=subset(BreastCancer,split==FALSE) # Test dataset
dim(training_set) # Dimensions of training dataset
## [1] 490 10
dim(test_set) # Dimensions of test dataset
## [1] 209 10
topredict_set<-test_set[2:10] # Removing target class
dim(topredict_set)
## [1] 209 9
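A note on the split: `sample.split()` expects a vector of class labels as its first argument. In the call above the whole data frame is passed, so the mask has one element per column (10), and `subset()` recycles it across the 699 rows. The split still comes out at roughly 70/30, but it follows a repeating row pattern and is not stratified by class. With caTools, passing `BreastCancer$Class` as the first argument would give a stratified split. The same idea can be sketched in base R as follows. The helper name `stratified_mask` is hypothetical, the labels are rebuilt from the class counts in the summary, and a split made this way would produce different confusion matrices from the ones reported below.

```r
# Hypothetical helper: TRUE marks training rows, sampled separately within each class
stratified_mask <- function(labels, ratio = 0.7) {
  mask <- logical(length(labels))
  for (cls in unique(as.character(labels))) {
    idx <- which(labels == cls)
    n_train <- round(length(idx) * ratio)
    # sample positions within idx, so each class contributes `ratio` of its rows
    mask[idx[sample.int(length(idx), n_train)]] <- TRUE
  }
  mask
}

# Labels with the same class counts as BreastCancer$Class (458 benign, 241 malignant)
labels <- factor(rep(c("benign", "malignant"), times = c(458, 241)))

set.seed(150)
mask <- stratified_mask(labels, ratio = 0.7)

table(labels, mask)  # training (TRUE) and test (FALSE) counts per class
```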
Now that the datasets are prepared, the next step is to build classification models with different algorithms and compare their accuracy. The different classification algorithms are explained in https://www.dezyre.com/article/top-10-machine-learning-algorithms/202. Naive Bayes, random forest and decision tree algorithms are used here.
model_naive<- naiveBayes(Class ~ ., data = training_set) #Implementing NaiveBayes
preds_naive <- predict(model_naive, newdata = topredict_set) #Predicting target class for the test set
(conf_matrix_naive <- table(preds_naive, test_set$Class))
##
## preds_naive benign malignant
## benign 129 2
## malignant 6 72
In the confusion matrix, rows are predictions and columns are actual classes. Of the 131 cases the Naive Bayes classifier predicted as benign, 129 were correct and 2 were actually malignant. Similarly, of the 78 cases predicted as malignant, 72 were correct and 6 were actually benign.
confusionMatrix(conf_matrix_naive) #Confusion matrix for finding Accuracy of the model
## Confusion Matrix and Statistics
##
##
## preds_naive benign malignant
## benign 129 2
## malignant 6 72
##
## Accuracy : 0.9617
## 95% CI : (0.926, 0.9833)
## No Information Rate : 0.6459
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9173
## Mcnemar's Test P-Value : 0.2888
##
## Sensitivity : 0.9556
## Specificity : 0.9730
## Pos Pred Value : 0.9847
## Neg Pred Value : 0.9231
## Prevalence : 0.6459
## Detection Rate : 0.6172
## Detection Prevalence : 0.6268
## Balanced Accuracy : 0.9643
##
## 'Positive' Class : benign
##
The accuracy of the Naive Bayes classifier is 96.17%.
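The statistics reported by confusionMatrix() follow directly from the four cell counts. As a minimal sketch, the counts from the Naive Bayes table above are re-entered by hand (the object name `cm` is only illustrative), with benign as the positive class, as in the output:

```r
# Rows are predictions, columns are actual classes (counts copied from the table above)
cm <- matrix(c(129, 6, 2, 72), nrow = 2,
             dimnames = list(predicted = c("benign", "malignant"),
                             actual    = c("benign", "malignant")))

# Share of all cases that were predicted correctly
accuracy <- sum(diag(cm)) / sum(cm)
# Share of actual benign cases predicted as benign (benign is the positive class)
sensitivity <- cm["benign", "benign"] / sum(cm[, "benign"])
# Share of actual malignant cases predicted as malignant
specificity <- cm["malignant", "malignant"] / sum(cm[, "malignant"])
# Accuracy of always predicting the most frequent actual class
no_information_rate <- max(colSums(cm)) / sum(cm)

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, no_information_rate = no_information_rate), 4)
```

The rounded values agree with the Accuracy (0.9617), Sensitivity (0.9556), Specificity (0.9730) and No Information Rate (0.6459) lines in the output above.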
model_rf <- randomForest(Class ~ ., data = training_set, importance=TRUE, ntree = 5) # Implementing RandomForest
preds_rf <- predict(model_rf, topredict_set)
(conf_matrix_forest <- table(preds_rf, test_set$Class))
##
## preds_rf benign malignant
## benign 129 3
## malignant 6 71
The confusion matrix shows that of the 132 cases the random forest classifier predicted as benign, 129 were correct and 3 were actually malignant. Similarly, of the 77 cases predicted as malignant, 71 were correct and 6 were actually benign.
confusionMatrix(conf_matrix_forest) #Confusion matrix for finding Accuracy of the model
## Confusion Matrix and Statistics
##
##
## preds_rf benign malignant
## benign 129 3
## malignant 6 71
##
## Accuracy : 0.9569
## 95% CI : (0.9198, 0.9801)
## No Information Rate : 0.6459
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9067
## Mcnemar's Test P-Value : 0.505
##
## Sensitivity : 0.9556
## Specificity : 0.9595
## Pos Pred Value : 0.9773
## Neg Pred Value : 0.9221
## Prevalence : 0.6459
## Detection Rate : 0.6172
## Detection Prevalence : 0.6316
## Balanced Accuracy : 0.9575
##
## 'Positive' Class : benign
##
The accuracy of the random forest classifier is 95.69%.
model_dtree<- rpart(Class ~ ., data=training_set) #Implementing Decision Tree
preds_dtree <- predict(model_dtree,newdata=topredict_set, type = "class")
## Warning: contrasts dropped from factor Cl.thickness
## Warning: contrasts dropped from factor Cell.size
## Warning: contrasts dropped from factor Cell.shape
## Warning: contrasts dropped from factor Marg.adhesion
## Warning: contrasts dropped from factor Epith.c.size
## Warning: contrasts dropped from factor Bl.cromatin
## Warning: contrasts dropped from factor Normal.nucleoli
## Warning: contrasts dropped from factor Mitoses
#plot(model_dtree, main="Decision tree created using rpart"); text(model_dtree) #Plots the fitted tree with its split labels
(conf_matrix_dtree <- table(preds_dtree, test_set$Class))
##
## preds_dtree benign malignant
## benign 127 5
## malignant 8 69
The confusion matrix shows that of the 132 cases the decision tree classifier predicted as benign, 127 were correct and 5 were actually malignant. Similarly, of the 77 cases predicted as malignant, 69 were correct and 8 were actually benign.
confusionMatrix(conf_matrix_dtree) #Confusion matrix for finding Accuracy of the model
## Confusion Matrix and Statistics
##
##
## preds_dtree benign malignant
## benign 127 5
## malignant 8 69
##
## Accuracy : 0.9378
## 95% CI : (0.896, 0.9665)
## No Information Rate : 0.6459
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.8652
## Mcnemar's Test P-Value : 0.5791
##
## Sensitivity : 0.9407
## Specificity : 0.9324
## Pos Pred Value : 0.9621
## Neg Pred Value : 0.8961
## Prevalence : 0.6459
## Detection Rate : 0.6077
## Detection Prevalence : 0.6316
## Balanced Accuracy : 0.9366
##
## 'Positive' Class : benign
##
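The accuracy of the decision tree classifier is 93.78%.
Comparing the accuracies shows that the Naive Bayes classifier performs best on this test set (96.17%), ahead of random forest (95.69%) and the decision tree (93.78%). The 95% confidence intervals of the three models overlap, so the differences between them are small.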
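The comparison can be reproduced from the three confusion matrices alone. The sketch below re-enters the counts of correct predictions by hand (the names `correct`, `total` and `accuracy` are illustrative) and uses base R's `binom.test()` to obtain an exact binomial 95% confidence interval for each accuracy, which should be close to the intervals printed by confusionMatrix().

```r
# Correct predictions per model, read off the diagonals of the tables above
correct <- c(naive_bayes   = 129 + 72,
             random_forest = 129 + 71,
             decision_tree = 127 + 69)
total <- 209  # number of test observations

# Accuracy of each model, highest first
accuracy <- correct / total
round(sort(accuracy, decreasing = TRUE), 4)

# Exact binomial 95% confidence interval (lower, upper) for each accuracy
conf_int <- sapply(correct, function(k) round(binom.test(k, total)$conf.int, 4))
conf_int
```

Because the intervals overlap, a single 209-case test set is not enough to say that one of these algorithms is reliably better than the others on this data.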
https://docs.oracle.com/cd/B28359_01/datamine.111/b28129/classify.htm#DMCON034
https://shiring.github.io/machine_learning/2017/01/15/rfe_ga_post
Borges, Lucas Rodrigues, 2015.[Online] Available at https://www.researchgate.net/publication/311950799_Analysis_of_the_Wisconsin_Breast_Cancer_Dataset_and_Machine_Learning_for_Breast_Cancer_Detection