This vignette shows how to design a classification model that predicts whether a tumour is malignant or benign, using the “BreastCancer” dataset shipped with the mlbench package.

Classification Problem

A classification problem refers to predicting the target class of new observations, that is, predicting the output from a given set of predicting variables. For example, if a dataset consists of images of birds, plants and animals, the classification problem is to decide whether a given image shows a bird, a plant or an animal (the target class). This kind of prediction is done by a classification model, also known as a classifier. The classifier is designed using case samples (training and test datasets) drawn from the population dataset. Many classification algorithms, such as logistic regression, Naive Bayes, random forest and decision trees, can be used to design a classifier.

Breast Cancer Classification Problem

The dataset consists of a sample of patient records collected by Dr. William H. Wolberg at the University of Wisconsin Hospitals. The objective is to predict whether a new patient has a malignant tumour from a set of predicting variables.

First Steps

#Install all the below packages using the function install.packages()
library(mlbench)      #Package which contains the dataset BreastCancer
library(caTools)      #Package with sample.split(), used to split our dataset into training and test data
library(caret)        #Package with functions for training models and computing confusion matrices
library(mice)         #Package with functions to impute missing (NA) values in a dataset
library(e1071)        #Package with the naiveBayes() classification algorithm
library(rpart)        #Package with the decision tree algorithm
library(randomForest) #Package with the random forest algorithm
data("BreastCancer")  #Loads the dataset

Exploring the Dataset

The dataset consists of 699 observations. The target class is BreastCancer$Class, which specifies whether an observation corresponds to a benign or a malignant tumour.

A detailed description of the dataset can be found at https://cran.r-project.org/web/packages/mlbench/mlbench.pdf.

str(BreastCancer)    #Getting the structure of the dataset
## 'data.frame':    699 obs. of  11 variables:
##  $ Id             : chr  "1000025" "1002945" "1015425" "1016277" ...
##  $ Cl.thickness   : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 5 5 3 6 4 8 1 2 2 4 ...
##  $ Cell.size      : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 1 4 1 8 1 10 1 1 1 2 ...
##  $ Cell.shape     : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 1 4 1 8 1 10 1 2 1 1 ...
##  $ Marg.adhesion  : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 1 5 1 1 3 8 1 1 1 1 ...
##  $ Epith.c.size   : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 2 7 2 3 2 7 2 2 2 2 ...
##  $ Bare.nuclei    : Factor w/ 10 levels "1","2","3","4",..: 1 10 2 4 1 10 10 1 1 1 ...
##  $ Bl.cromatin    : Factor w/ 10 levels "1","2","3","4",..: 3 3 3 3 3 9 3 3 1 2 ...
##  $ Normal.nucleoli: Factor w/ 10 levels "1","2","3","4",..: 1 2 1 7 1 7 1 1 1 1 ...
##  $ Mitoses        : Factor w/ 9 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 5 1 ...
##  $ Class          : Factor w/ 2 levels "benign","malignant": 1 1 1 1 1 2 1 1 1 1 ...
levels(BreastCancer$Class) #Finding the levels of the target class
## [1] "benign"    "malignant"

A more detailed summary of the dataset is obtained using the summary() function.

summary(BreastCancer) #Summary of Dataset
##       Id             Cl.thickness   Cell.size     Cell.shape 
##  Length:699         1      :145   1      :384   1      :353  
##  Class :character   5      :130   10     : 67   2      : 59  
##  Mode  :character   3      :108   3      : 52   10     : 58  
##                     4      : 80   2      : 45   3      : 56  
##                     10     : 69   4      : 40   4      : 44  
##                     2      : 50   5      : 30   5      : 34  
##                     (Other):117   (Other): 81   (Other): 95  
##  Marg.adhesion  Epith.c.size  Bare.nuclei   Bl.cromatin  Normal.nucleoli
##  1      :407   2      :386   1      :402   2      :166   1      :443    
##  2      : 58   3      : 72   10     :132   3      :165   10     : 61    
##  3      : 58   4      : 48   2      : 30   1      :152   3      : 44    
##  10     : 55   1      : 47   5      : 30   7      : 73   2      : 36    
##  4      : 33   6      : 41   3      : 28   4      : 40   8      : 24    
##  8      : 25   5      : 39   (Other): 61   5      : 34   6      : 22    
##  (Other): 63   (Other): 66   NA's   : 16   (Other): 69   (Other): 69    
##     Mitoses          Class    
##  1      :579   benign   :458  
##  2      : 35   malignant:241  
##  3      : 33                  
##  10     : 14                  
##  4      : 12                  
##  7      :  9                  
##  (Other): 17

The summary reveals 16 NA (missing) values in our dataset, all in the Bare.nuclei column.
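
To confirm that Bare.nuclei is the only affected column, the NA values can be counted per column; a small check of this kind:

colSums(is.na(BreastCancer))   # Bare.nuclei should report 16, all other columns 0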

Data Cleaning

Missing values are a common problem in datasets. There are several ways to handle them, such as omitting the affected observations or replacing the missing values with the mean/mode of the variable; more techniques are described in https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3668100/. The mice library is used here to impute the 16 NA values with the most suitable values, considering all nine predicting columns in the dataset.
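
For comparison, the simplest of these options is to drop the incomplete observations outright, at the cost of losing 16 rows; a minimal sketch (BreastCancer_complete is a hypothetical name, used only for illustration):

BreastCancer_complete <- na.omit(BreastCancer)   # drop every row that contains an NA
nrow(BreastCancer_complete)                      # 683 of the 699 observations would remain

Imputation is preferred here because it keeps all 699 observations.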

The Id column is filtered out as it is not needed for designing the classifier.

dataset_impute <- mice(BreastCancer[,2:10],  print = FALSE) #Imputing NA values with mice; the Id column (column 1) is excluded
BreastCancer <- cbind(BreastCancer[,11, drop = FALSE], mice::complete(dataset_impute, 1)) #Re-attaching the target class to the imputed predictors
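
A quick sanity check (not in the original code) confirms that no NA values remain after imputation:

anyNA(BreastCancer)   # should return FALSE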

Checking the summary of the modified dataset

summary(BreastCancer) #Summary of Dataset
##        Class      Cl.thickness   Cell.size     Cell.shape  Marg.adhesion
##  benign   :458   1      :145   1      :384   1      :353   1      :407  
##  malignant:241   5      :130   10     : 67   2      : 59   2      : 58  
##                  3      :108   3      : 52   10     : 58   3      : 58  
##                  4      : 80   2      : 45   3      : 56   10     : 55  
##                  10     : 69   4      : 40   4      : 44   4      : 33  
##                  2      : 50   5      : 30   5      : 34   8      : 25  
##                  (Other):117   (Other): 81   (Other): 95   (Other): 63  
##   Epith.c.size  Bare.nuclei   Bl.cromatin  Normal.nucleoli    Mitoses   
##  2      :386   1      :412   2      :166   1      :443     1      :579  
##  3      : 72   10     :133   3      :165   10     : 61     2      : 35  
##  4      : 48   2      : 31   1      :152   3      : 44     3      : 33  
##  1      : 47   5      : 30   7      : 73   2      : 36     10     : 14  
##  6      : 41   3      : 29   4      : 40   8      : 24     4      : 12  
##  5      : 39   8      : 21   5      : 34   6      : 22     7      :  9  
##  (Other): 66   (Other): 43   (Other): 69   (Other): 69     (Other): 17

Splitting the Dataset into Training and Test Sets

set.seed(150)
split=sample.split(BreastCancer$Class, SplitRatio = 0.7)  # Splitting data 70/30 into training and test sets, preserving the class ratio
training_set=subset(BreastCancer,split==TRUE)       # Training dataset
test_set=subset(BreastCancer,split==FALSE)          # Test dataset
dim(training_set)                                   # Dimensions of training dataset
## [1] 490  10
dim(test_set)                                       # Dimensions of test dataset
## [1] 209  10
topredict_set<-test_set[2:10]                       # Removing the target class
dim(topredict_set)
## [1] 209   9
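
As an additional check, one can verify that both subsets roughly preserve the overall class proportions; a short sketch:

round(prop.table(table(training_set$Class)), 3)  # class proportions in the training set
round(prop.table(table(test_set$Class)), 3)      # should be close to the overall 0.655/0.345 split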

Now that all datasets are prepared, the next step is to design classification models using different algorithms and compare their accuracy. Various classification algorithms are explained at https://www.dezyre.com/article/top-10-machine-learning-algorithms/202; the Naive Bayes, random forest and decision tree algorithms are used here.

Naive Bayes Classifier

model_naive<- naiveBayes(Class ~ ., data = training_set)  #Implementing NaiveBayes 

preds_naive <- predict(model_naive, newdata = topredict_set)        #Predicting the target class for the test set

(conf_matrix_naive <- table(preds_naive, test_set$Class))       
##            
## preds_naive benign malignant
##   benign       129         2
##   malignant      6        72

The confusion matrix shows that, of the 135 benign cases in the test set, the Naive Bayes classifier predicted 129 correctly and misclassified 6 as malignant. Of the 74 malignant cases, it predicted 72 correctly and misclassified 2 as benign.
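
Accuracy can also be computed directly from this table as the share of correct predictions on its diagonal; confusionMatrix() below reports the same value together with further statistics:

sum(diag(conf_matrix_naive)) / sum(conf_matrix_naive)   # (129 + 72) / 209 ≈ 0.9617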

confusionMatrix(conf_matrix_naive)                        #Confusion matrix for finding Accuracy of the model
## Confusion Matrix and Statistics
## 
##            
## preds_naive benign malignant
##   benign       129         2
##   malignant      6        72
##                                          
##                Accuracy : 0.9617         
##                  95% CI : (0.926, 0.9833)
##     No Information Rate : 0.6459         
##     P-Value [Acc > NIR] : <2e-16         
##                                          
##                   Kappa : 0.9173         
##  Mcnemar's Test P-Value : 0.2888         
##                                          
##             Sensitivity : 0.9556         
##             Specificity : 0.9730         
##          Pos Pred Value : 0.9847         
##          Neg Pred Value : 0.9231         
##              Prevalence : 0.6459         
##          Detection Rate : 0.6172         
##    Detection Prevalence : 0.6268         
##       Balanced Accuracy : 0.9643         
##                                          
##        'Positive' Class : benign         
## 

The accuracy of the Naive Bayes classifier is 96.17%.

Random Forest Classifier

model_rf <- randomForest(Class ~ ., data = training_set, importance=TRUE, ntree = 5) # Implementing random forest with only 5 trees (the randomForest default is 500)
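
Because the model is fitted with importance = TRUE, the variable importance scores can be inspected before predicting; a short aside, not part of the original output:

importance(model_rf)   # MeanDecreaseAccuracy and MeanDecreaseGini per predictor
varImpPlot(model_rf)   # plot of the same importance measures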

preds_rf <- predict(model_rf, topredict_set)              

(conf_matrix_forest <- table(preds_rf, test_set$Class))
##            
## preds_rf    benign malignant
##   benign       129         3
##   malignant      6        71

The confusion matrix shows that, of the 135 benign cases, the random forest classifier predicted 129 correctly and misclassified 6 as malignant. Of the 74 malignant cases, it predicted 71 correctly and misclassified 3 as benign.

confusionMatrix(conf_matrix_forest)                       #Confusion matrix for finding Accuracy of the model
## Confusion Matrix and Statistics
## 
##            
## preds_rf    benign malignant
##   benign       129         3
##   malignant      6        71
##                                           
##                Accuracy : 0.9569          
##                  95% CI : (0.9198, 0.9801)
##     No Information Rate : 0.6459          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9067          
##  Mcnemar's Test P-Value : 0.505           
##                                           
##             Sensitivity : 0.9556          
##             Specificity : 0.9595          
##          Pos Pred Value : 0.9773          
##          Neg Pred Value : 0.9221          
##              Prevalence : 0.6459          
##          Detection Rate : 0.6172          
##    Detection Prevalence : 0.6316          
##       Balanced Accuracy : 0.9575          
##                                           
##        'Positive' Class : benign          
## 

The accuracy of the random forest classifier is 95.69%.

Decision Tree Classifier

model_dtree<- rpart(Class ~ ., data=training_set)       #Implementing Decision Tree
preds_dtree <- predict(model_dtree,newdata=topredict_set, type = "class")
#plot(model_dtree, main="Decision tree created using rpart"); text(model_dtree)   #Optional visualization of the fitted tree
(conf_matrix_dtree <- table(preds_dtree, test_set$Class))
##            
## preds_dtree benign malignant
##   benign       127         5
##   malignant      8        69

The confusion matrix shows that, of the 135 benign cases, the decision tree classifier predicted 127 correctly and misclassified 8 as malignant. Of the 74 malignant cases, it predicted 69 correctly and misclassified 5 as benign.

confusionMatrix(conf_matrix_dtree)                     #Confusion matrix for finding Accuracy of the model
## Confusion Matrix and Statistics
## 
##            
## preds_dtree benign malignant
##   benign       127         5
##   malignant      8        69
##                                          
##                Accuracy : 0.9378         
##                  95% CI : (0.896, 0.9665)
##     No Information Rate : 0.6459         
##     P-Value [Acc > NIR] : <2e-16         
##                                          
##                   Kappa : 0.8652         
##  Mcnemar's Test P-Value : 0.5791         
##                                          
##             Sensitivity : 0.9407         
##             Specificity : 0.9324         
##          Pos Pred Value : 0.9621         
##          Neg Pred Value : 0.8961         
##              Prevalence : 0.6459         
##          Detection Rate : 0.6077         
##    Detection Prevalence : 0.6316         
##       Balanced Accuracy : 0.9366         
##                                          
##        'Positive' Class : benign         
## 

The accuracy of the decision tree classifier is 93.78%.

Comparing the accuracy of the three models therefore shows that the Naive Bayes classifier is the best of the three on this dataset.
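
The same comparison can be made programmatically from the three confusion matrices computed above; a compact sketch:

accuracies <- c(naive_bayes   = sum(diag(conf_matrix_naive))  / sum(conf_matrix_naive),
                random_forest = sum(diag(conf_matrix_forest)) / sum(conf_matrix_forest),
                decision_tree = sum(diag(conf_matrix_dtree))  / sum(conf_matrix_dtree))
round(accuracies, 4)   # 0.9617, 0.9569, 0.9378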

References

https://docs.oracle.com/cd/B28359_01/datamine.111/b28129/classify.htm#DMCON034

https://shiring.github.io/machine_learning/2017/01/15/rfe_ga_post

Borges, Lucas Rodrigues, 2015. Analysis of the Wisconsin Breast Cancer Dataset and Machine Learning for Breast Cancer Detection. [Online] Available at: https://www.researchgate.net/publication/311950799_Analysis_of_the_Wisconsin_Breast_Cancer_Dataset_and_Machine_Learning_for_Breast_Cancer_Detection

https://cran.r-project.org/web/packages/mlbench/mlbench.pdf