Classification

About Dataset

Before starting the analysis, it is important to understand what exactly we are trying to predict and what the variables in the dataset mean. “Benign” refers to a medical condition or growth that is not cancerous, as opposed to “malignant”.

Libraries and Setup

We’ll set up caching for this notebook, since some of the code we will write can be computationally expensive.
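The caching setup itself does not appear in the rendered output. A minimal sketch of one way it might be enabled, assuming the notebook is knitted with knitr (an assumption; the original setup chunk is not shown):

```r
# Hypothetical setup chunk (not from the original notebook):
# cache chunk results so expensive steps, such as the random forest
# training, are not rerun on every knit.
if (requireNamespace("knitr", quietly = TRUE)) {
  knitr::opts_chunk$set(cache = TRUE)
}
```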

library(dplyr)
library(inspectdf)

library(e1071)
library(caret)
library(partykit)
library(ROCR)
library(randomForest)

The libraries loaded above are used throughout the analysis.

Dataset

data = read.csv("E:/Algoritma/6_lbb_classification/breast cancer.csv")
head(data)
  • Clump thickness is a measure of how thick the cells are within a tumor. Benign cells tend to be grouped in mono-layers, while cancerous cells are often grouped in multi-layers (Sarkar et al. 2017, p. 1).

  • Uniformity of cell size and uniformity of cell shape are two characteristics that can be used to describe the appearance of cells under a microscope. Here we are checking the degree to which the cells in a sample are similar in size and shape.

  • Marginal adhesion is the degree to which cells in a tissue sample adhere, or stick, to one another at the edges of the sample. Loss of adhesion might be a sign of malignancy.

  • Single epithelial cell size is the size of individual cells in an epithelial tissue sample. Epithelial tissue is a type of tissue that covers the surface of the body and lines internal organs and structures. It is made up of cells that are tightly packed together and held in place by specialized junctions.

  • Bare nuclei refers to cells in a tissue sample that are missing their cell membranes and cytoplasm, leaving only the nucleus visible.

  • Bland chromatin is the appearance of the genetic material (chromatin) in the nucleus of a cell under a microscope. Chromatin is made up of DNA and proteins, and it contains the genetic information that controls the cell’s functions. When the chromatin in a cell’s nucleus is compact and uniform in appearance, it is said to be “bland.”

  • Normal nucleoli are small, spherical structures found within the nucleus of a cell. They are composed of DNA, RNA, and proteins and are responsible for synthesizing ribosomes, which are the cellular structures that produce proteins. Nucleoli are usually visible under a microscope and can vary in size and appearance depending on the stage of the cell cycle and the cell’s function. In normal, healthy cells, nucleoli are usually small and have a distinct, well-defined border.

  • Mitosis is the process of cell division that occurs in all living organisms. During mitosis, a single cell divides into two daughter cells, each of which contains a copy of the parent cell’s DNA. The process of mitosis is essential for the growth and repair of tissues and the production of new cells.

  • Class is the target variable, with two values: ‘malignant’ = 1 and ‘benign’ = 0.

Data Preprocessing

Data Wrangling

glimpse(data)
## Rows: 683
## Columns: 10
## $ Clump.Thickness             <int> 5, 5, 3, 6, 4, 8, 1, 2, 2, 4, 1, 2, 5, 1, …
## $ Uniformity.of.Cell.Size     <int> 1, 4, 1, 8, 1, 10, 1, 1, 1, 2, 1, 1, 3, 1,…
## $ Uniformity.of.Cell.Shape    <int> 1, 4, 1, 8, 1, 10, 1, 2, 1, 1, 1, 1, 3, 1,…
## $ Marginal.Adhesion           <int> 1, 5, 1, 1, 3, 8, 1, 1, 1, 1, 1, 1, 3, 1, …
## $ Single.Epithelial.Cell.Size <int> 2, 7, 2, 3, 2, 7, 2, 2, 2, 2, 1, 2, 2, 2, …
## $ Bare.Nuclei                 <int> 1, 10, 2, 4, 1, 10, 10, 1, 1, 1, 1, 1, 3, …
## $ Bland.Chromatin             <int> 3, 3, 3, 3, 3, 9, 3, 3, 1, 2, 3, 2, 4, 3, …
## $ Normal.Nucleoli             <int> 1, 2, 1, 7, 1, 7, 1, 1, 1, 1, 1, 1, 4, 1, …
## $ Mitoses                     <int> 1, 1, 1, 1, 1, 1, 1, 1, 5, 1, 1, 1, 1, 1, …
## $ Class                       <int> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, …

All columns are read as integers, so the data types need to be adjusted where appropriate.

data = data %>%
  mutate(
    Class = as.factor(Class)
  )
glimpse(data)
## Rows: 683
## Columns: 10
## $ Clump.Thickness             <int> 5, 5, 3, 6, 4, 8, 1, 2, 2, 4, 1, 2, 5, 1, …
## $ Uniformity.of.Cell.Size     <int> 1, 4, 1, 8, 1, 10, 1, 1, 1, 2, 1, 1, 3, 1,…
## $ Uniformity.of.Cell.Shape    <int> 1, 4, 1, 8, 1, 10, 1, 2, 1, 1, 1, 1, 3, 1,…
## $ Marginal.Adhesion           <int> 1, 5, 1, 1, 3, 8, 1, 1, 1, 1, 1, 1, 3, 1, …
## $ Single.Epithelial.Cell.Size <int> 2, 7, 2, 3, 2, 7, 2, 2, 2, 2, 1, 2, 2, 2, …
## $ Bare.Nuclei                 <int> 1, 10, 2, 4, 1, 10, 10, 1, 1, 1, 1, 1, 3, …
## $ Bland.Chromatin             <int> 3, 3, 3, 3, 3, 9, 3, 3, 1, 2, 3, 2, 4, 3, …
## $ Normal.Nucleoli             <int> 1, 2, 1, 7, 1, 7, 1, 1, 1, 1, 1, 1, 4, 1, …
## $ Mitoses                     <int> 1, 1, 1, 1, 1, 1, 1, 1, 5, 1, 1, 1, 1, 1, …
## $ Class                       <fct> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, …

All predictor variables are measurement results, so they stay numeric; only the target variable, which is categorical, is converted to a factor.

Descriptive Statistics

summary(data)
##  Clump.Thickness  Uniformity.of.Cell.Size Uniformity.of.Cell.Shape
##  Min.   : 1.000   Min.   : 1.000          Min.   : 1.000          
##  1st Qu.: 2.000   1st Qu.: 1.000          1st Qu.: 1.000          
##  Median : 4.000   Median : 1.000          Median : 1.000          
##  Mean   : 4.442   Mean   : 3.151          Mean   : 3.215          
##  3rd Qu.: 6.000   3rd Qu.: 5.000          3rd Qu.: 5.000          
##  Max.   :10.000   Max.   :10.000          Max.   :10.000          
##  Marginal.Adhesion Single.Epithelial.Cell.Size  Bare.Nuclei    
##  Min.   : 1.00     Min.   : 1.000              Min.   : 1.000  
##  1st Qu.: 1.00     1st Qu.: 2.000              1st Qu.: 1.000  
##  Median : 1.00     Median : 2.000              Median : 1.000  
##  Mean   : 2.83     Mean   : 3.234              Mean   : 3.545  
##  3rd Qu.: 4.00     3rd Qu.: 4.000              3rd Qu.: 6.000  
##  Max.   :10.00     Max.   :10.000              Max.   :10.000  
##  Bland.Chromatin  Normal.Nucleoli    Mitoses       Class  
##  Min.   : 1.000   Min.   : 1.00   Min.   : 1.000   0:444  
##  1st Qu.: 2.000   1st Qu.: 1.00   1st Qu.: 1.000   1:239  
##  Median : 3.000   Median : 1.00   Median : 1.000          
##  Mean   : 3.445   Mean   : 2.87   Mean   : 1.603          
##  3rd Qu.: 5.000   3rd Qu.: 4.00   3rd Qu.: 1.000          
##  Max.   :10.000   Max.   :10.00   Max.   :10.000

The summary shows the descriptive statistics of each variable. The target variable contains 444 benign and 239 malignant observations.

Missing Value

colSums(is.na(data))
##             Clump.Thickness     Uniformity.of.Cell.Size 
##                           0                           0 
##    Uniformity.of.Cell.Shape           Marginal.Adhesion 
##                           0                           0 
## Single.Epithelial.Cell.Size                 Bare.Nuclei 
##                           0                           0 
##             Bland.Chromatin             Normal.Nucleoli 
##                           0                           0 
##                     Mitoses                       Class 
##                           0                           0

The dataset has no missing values, so no missing-value handling is required and the analysis can proceed.

Dimension and Proportion of Dataset

dim(data)
## [1] 683  10
prop.table(table(data$Class))
## 
##         0         1 
## 0.6500732 0.3499268

The dataset has 683 rows and 10 variables.

The classes are not balanced: about 65% of the observations are benign (0) and about 35% are malignant (1).
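These proportions follow directly from the class counts in `summary(data)`, and they also set a baseline: a trivial model that always predicts the majority class would already be right about 65% of the time. A small check using only the counts reported above (the variable names are illustrative):

```r
# Class counts as reported by summary(data): 444 benign (0), 239 malignant (1)
class_counts <- c(benign = 444, malignant = 239)

# Proportions, matching prop.table(table(data$Class))
class_props <- class_counts / sum(class_counts)
print(round(class_props, 7))

# Accuracy of a trivial model that always predicts the majority class
baseline_accuracy <- max(class_props)
print(baseline_accuracy)
```

Any model worth keeping should clearly beat this baseline on the test data.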

Cross Validation

RNGkind(sample.kind = "Rounding")
set.seed(17)

index = sample(x = nrow(data), size = nrow(data)*0.8)
data_train <- data[index,]
data_test <- data[-index,]
nrow(data_train)
## [1] 546
nrow(data_test)
## [1] 137

The data is split 80:20 into training and testing sets, giving 546 rows for training and 137 rows for testing.
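The row counts follow from the arithmetic: 683 × 0.8 = 546.4, which ends up as 546 training rows and leaves 137 for testing. The sketch below reproduces only that arithmetic on row indices. It makes the truncation explicit with `floor()` and uses the default random number generator, so it will not draw the same rows as the notebook.

```r
n_rows  <- 683                    # rows in the dataset, from dim(data)
n_train <- floor(n_rows * 0.8)    # 546.4 truncated to 546
n_test  <- n_rows - n_train       # remaining rows go to the test set

set.seed(17)
index      <- sample(x = n_rows, size = n_train)  # row indices for training
test_index <- setdiff(seq_len(n_rows), index)     # everything not sampled

print(c(train = n_train, test = n_test))
```

Because the test indices are the complement of the training indices, no row can appear in both sets.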

Upsampling

RNGkind(sample.kind = "Rounding")
set.seed(7)

data_train <- upSample(x = data_train %>% select(-Class), 
                           y = data_train$Class, 
                           yname = "Class")
prop.table(table(data_train$Class))
## 
##   0   1 
## 0.5 0.5

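This step balances the class proportions in the training data by upsampling: rows from the minority class are sampled with replacement until both classes are the same size. Only the training data is upsampled; the test data keeps its original proportions.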
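To make the idea concrete, here is a base-R sketch of upsampling on a vector of labels. It illustrates the technique and is not caret's implementation of `upSample()`. The class counts are derived from the outputs in this notebook rather than printed in it: 712 rows at a 50/50 split means 356 per class, so the 546 training rows must have held 356 benign and 190 malignant cases.

```r
# Toy labels with the class counts implied by the notebook outputs
labels <- factor(c(rep("0", 356), rep("1", 190)))

set.seed(7)
target_n <- max(table(labels))  # size of the largest class

# For each class, keep every row and resample (with replacement)
# just enough extra rows to reach target_n.
upsampled_idx <- unlist(lapply(levels(labels), function(lv) {
  idx   <- which(labels == lv)
  extra <- target_n - length(idx)
  c(idx, idx[sample.int(length(idx), size = extra, replace = TRUE)])
}))

balanced <- labels[upsampled_idx]
print(table(balanced))
```

The majority class is left untouched, while the minority class gains duplicated rows.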

Modeling

Classification modeling uses 3 methods:

  • Naive Bayes

  • Decision Tree

  • Random Forest

Naive Bayes

model_nb = naiveBayes(Class ~ .,data=data_train)
pred_nb = predict(model_nb,
                  newdata= data_test,
                  type = "class")
naive_matrix <-confusionMatrix(data = pred_nb,
                reference = data_test$Class, 
                positive = "0")
naive_matrix
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 87  1
##          1  1 48
##                                              
##                Accuracy : 0.9854             
##                  95% CI : (0.9483, 0.9982)   
##     No Information Rate : 0.6423             
##     P-Value [Acc > NIR] : <0.0000000000000002
##                                              
##                   Kappa : 0.9682             
##                                              
##  Mcnemar's Test P-Value : 1                  
##                                              
##             Sensitivity : 0.9886             
##             Specificity : 0.9796             
##          Pos Pred Value : 0.9886             
##          Neg Pred Value : 0.9796             
##              Prevalence : 0.6423             
##          Detection Rate : 0.6350             
##    Detection Prevalence : 0.6423             
##       Balanced Accuracy : 0.9841             
##                                              
##        'Positive' Class : 0                  
## 

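Because positive = "0", the positive class in this matrix is benign:

  • True positive (TP): predicted benign and actually benign: 87

  • True negative (TN): predicted malignant and actually malignant: 48

  • False positive (FP): predicted benign but actually malignant: 1

  • False negative (FN): predicted malignant but actually benign: 1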

  • Accuracy: the model correctly predicts the target class for 98.54% of the test observations.

  • Sensitivity/Recall: the model identifies 98.86% of the actual positive (benign) class.

  • Specificity: the model identifies 97.96% of the actual negative (malignant) class.

  • Pos Pred Value/Precision: 98.86% of the observations predicted as positive (benign) are actually positive.
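The figures reported by `confusionMatrix()` can be recomputed by hand from the four cells of the matrix. A minimal sketch in base R, using the Naive Bayes counts from the output above and treating class 0 (benign) as the positive class, as the `positive = "0"` argument does (the variable names are illustrative):

```r
# Naive Bayes confusion matrix from the output above
# (rows = prediction, columns = reference; matrix() fills column by column)
cm <- matrix(c(87, 1,
               1, 48),
             nrow = 2,
             dimnames = list(Prediction = c("0", "1"),
                             Reference  = c("0", "1")))

tp <- cm["0", "0"]  # predicted benign, actually benign
fn <- cm["1", "0"]  # predicted malignant, actually benign
fp <- cm["0", "1"]  # predicted benign, actually malignant
tn <- cm["1", "1"]  # predicted malignant, actually malignant

metrics <- c(
  accuracy    = (tp + tn) / sum(cm),
  sensitivity = tp / (tp + fn),
  specificity = tn / (tn + fp),
  precision   = tp / (tp + fp)
)
print(round(metrics, 4))
```

The same cell-by-cell calculation applies to the Decision Tree and Random Forest matrices later in the notebook.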

Decision Tree

Decision Tree Illustration

data_tree <- ctree(Class~., data = data_train)
plot(data_tree, type = "simple")

  • 313 observations fall in the node where Bare.Nuclei ≤ 3, Bland.Chromatin ≤ 4, Bare.Nuclei ≤ 5, and Uniformity.of.Cell.Shape ≤ 2, with an error of 0.0%.

  • 9 observations fall in the node where Bare.Nuclei > 3, Bland.Chromatin ≤ 4, Bare.Nuclei ≤ 5, and Uniformity.of.Cell.Shape ≤ 2, with an error of 11.1%.

  • 8 observations fall in the node where Bland.Chromatin > 4, Bare.Nuclei ≤ 5, and Uniformity.of.Cell.Shape ≤ 2, with an error of 50%.

  • 8 observations fall in the node where Bare.Nuclei > 5 and Uniformity.of.Cell.Shape ≤ 2, with an error of 0.0%.

  • 14 observations fall in the node where Uniformity.of.Cell.Size ≤ 3, Bare.Nuclei ≤ 1, and Uniformity.of.Cell.Shape > 2, with an error of 0.0%.

  • 20 observations fall in the node where Uniformity.of.Cell.Size > 3, Bare.Nuclei ≤ 1, and Uniformity.of.Cell.Shape > 2, with an error of 10%.

  • 15 observations fall in the node where Uniformity.of.Cell.Size ≤ 4, Bare.Nuclei ≤ 3, Bare.Nuclei > 1, and Uniformity.of.Cell.Size > 2, with an error of 46.7%.

  • 25 observations fall in the node where Uniformity.of.Cell.Size > 4, Bare.Nuclei ≤ 3, Bare.Nuclei > 1, and Uniformity.of.Cell.Size > 2, with an error of 0.0%.

  • 300 observations fall in the node where Bare.Nuclei > 3, Bare.Nuclei > 1, and Uniformity.of.Cell.Size > 2, with an error of 2.7%.

Predict Decision Tree

pred_tree <- predict(object = data_tree,  
                          newdata = data_test)

tree_matrix <- confusionMatrix(data = pred_tree, 
                reference = data_test$Class, 
                positive = "0")
tree_matrix
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 85  2
##          1  3 47
##                                              
##                Accuracy : 0.9635             
##                  95% CI : (0.9169, 0.988)    
##     No Information Rate : 0.6423             
##     P-Value [Acc > NIR] : <0.0000000000000002
##                                              
##                   Kappa : 0.9209             
##                                              
##  Mcnemar's Test P-Value : 1                  
##                                              
##             Sensitivity : 0.9659             
##             Specificity : 0.9592             
##          Pos Pred Value : 0.9770             
##          Neg Pred Value : 0.9400             
##              Prevalence : 0.6423             
##          Detection Rate : 0.6204             
##    Detection Prevalence : 0.6350             
##       Balanced Accuracy : 0.9625             
##                                              
##        'Positive' Class : 0                  
## 

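Because positive = "0", the positive class in this matrix is benign:

  • True positive (TP): predicted benign and actually benign: 85

  • True negative (TN): predicted malignant and actually malignant: 47

  • False positive (FP): predicted benign but actually malignant: 2

  • False negative (FN): predicted malignant but actually benign: 3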

  • Accuracy: the model correctly predicts the target class for 96.35% of the test observations.

  • Sensitivity/Recall: the model identifies 96.59% of the actual positive (benign) class.

  • Specificity: the model identifies 95.92% of the actual negative (malignant) class.

  • Pos Pred Value/Precision: 97.70% of the observations predicted as positive (benign) are actually positive.

Random Forest

K-Fold Cross Validation

set.seed(6)
 
ctrl <- trainControl(method = "repeatedcv",
                      number = 3, 
                      repeats = 5) 
 
data_forest <- train(Class ~ .,
                    data = data_train,
                    method = "rf", 
                    trControl = ctrl)
 
#saveRDS(data_forest, "data_forest_2.RDS") # save the model
#data_forest <- readRDS("E:/Algoritma/6_lbb_classification/Classification/data_forest_2.RDS")
data_forest
## Random Forest 
## 
## 712 samples
##   9 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (3 fold, repeated 5 times) 
## Summary of sample sizes: 475, 474, 475, 475, 474, 475, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##   2     0.9764020  0.9528021
##   5     0.9741528  0.9483033
##   9     0.9730252  0.9460484
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.

K-fold cross-validation divides the data into k equal parts, and each part takes a turn as the validation data while the model is fit on the remaining parts. Here we use 3 folds, repeated 5 times, and compare several values of mtry, the number of predictors randomly sampled as split candidates at each node. The selected model uses mtry = 2, which has the highest accuracy, 0.9764020.
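The fold mechanics can be sketched in base R. This is a simplified illustration, not what caret does internally: folds here are assigned purely at random, whereas caret also balances the classes within folds. With 712 rows and 3 folds, each model is fit on 474 or 475 rows, in line with the "Summary of sample sizes" line in the output above.

```r
n <- 712  # rows in the upsampled training data
k <- 3    # number of folds, as in trainControl(number = 3)

set.seed(6)
# Assign each row to one of k folds at random (unstratified, for simplicity)
fold_id <- sample(rep(seq_len(k), length.out = n))

# For each fold, the other k - 1 folds are used for fitting
fit_sizes <- sapply(seq_len(k), function(f) sum(fold_id != f))
print(fit_sizes)
```

Repeating the procedure 5 times with fresh fold assignments gives 15 accuracy estimates per mtry value, which are then averaged.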

Out Of Bag Error

data_forest$finalModel
## 
## Call:
##  randomForest(x = x, y = y, mtry = param$mtry) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 2.39%
## Confusion matrix:
##     0   1 class.error
## 0 342  14 0.039325843
## 1   3 353 0.008426966

The out-of-bag (OOB) error evaluates the model on the observations left out of each tree’s bootstrap sample. The output above gives an OOB error estimate of 2.39%; in other words, the accuracy of the model on OOB data is 97.61%.
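The OOB error rate can be recomputed from the confusion matrix printed by `data_forest$finalModel`. The sketch below also shows why OOB data is available at all: in a bootstrap sample of size n, each row is left out with probability (1 - 1/n)^n, which is close to exp(-1), or about 37%. Variable names are illustrative.

```r
# OOB confusion matrix from data_forest$finalModel
# (rows = actual class, columns = OOB prediction; filled column by column)
oob_cm <- matrix(c(342, 3,
                   14, 353),
                 nrow = 2,
                 dimnames = list(actual    = c("0", "1"),
                                 predicted = c("0", "1")))

# Misclassified rows are the off-diagonal cells
oob_error <- (oob_cm["0", "1"] + oob_cm["1", "0"]) / sum(oob_cm)
print(round(100 * oob_error, 2))        # OOB error in percent
print(round(100 * (1 - oob_error), 2))  # OOB accuracy in percent

# Chance that a given row is left out of one bootstrap sample of size n
n <- sum(oob_cm)
p_out_of_bag <- (1 - 1 / n)^n
print(p_out_of_bag)
```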

Predict Random Forest

pred_forest <- predict(data_forest, 
                              data_test)
forest_matrix <- confusionMatrix(data = pred_forest,
                  reference = data_test$Class)
forest_matrix
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 88  0
##          1  0 49
##                                                
##                Accuracy : 1                    
##                  95% CI : (0.9734, 1)          
##     No Information Rate : 0.6423               
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 1                    
##                                                
##  Mcnemar's Test P-Value : NA                   
##                                                
##             Sensitivity : 1.0000               
##             Specificity : 1.0000               
##          Pos Pred Value : 1.0000               
##          Neg Pred Value : 1.0000               
##              Prevalence : 0.6423               
##          Detection Rate : 0.6423               
##    Detection Prevalence : 0.6423               
##       Balanced Accuracy : 1.0000               
##                                                
##        'Positive' Class : 0                    
## 

The positive class is 0 (benign), as shown in the output:

  • True positive (TP): predicted benign and actually benign: 88

  • True negative (TN): predicted malignant and actually malignant: 49

  • False positive (FP): predicted benign but actually malignant: 0

  • False negative (FN): predicted malignant but actually benign: 0

  • Accuracy: the model correctly predicts the target class for 100% of the test observations.

  • Sensitivity/Recall: the model identifies 100% of the actual positive (benign) class.

  • Specificity: the model identifies 100% of the actual negative (malignant) class.

  • Pos Pred Value/Precision: 100% of the observations predicted as positive (benign) are actually positive.

Model Comparison

tibble( accuracy_naive = naive_matrix$overall[1],
        accuracy_tree = tree_matrix$overall[1],
        accuracy_forest = forest_matrix$overall[1]
  )

Comparing the 3 models, Random Forest has the highest test accuracy (100%), followed by Naive Bayes (98.54%) and Decision Tree (96.35%). Random Forest was nevertheless not chosen: because it repeats the decision tree model many times, we consider it less suitable from a business perspective, and its perfect score on the test data looks too good to rely on. The chosen model is therefore Naive Bayes, with an accuracy of 98.54%.

tibble( sensitivity_naive = naive_matrix$byClass[1],
        sensitivity_tree = tree_matrix$byClass[1],
        sensitivity_forest = forest_matrix$byClass[1]
  )

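Based on the confusion matrix, the error we most want to minimize is predicting benign when the tumor is actually malignant, so recall is the metric of interest. Note that with positive = "0", the Sensitivity reported by confusionMatrix() is recall for the benign class; recall for the malignant class is reported as Specificity. Leaving Random Forest aside, Naive Bayes outperforms Decision Tree on both (Sensitivity 98.86% vs 96.59%, Specificity 97.96% vs 95.92%), so the model used is Naive Bayes.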
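Since the recall that matters for missed malignant cases depends on which class is treated as positive, it can help to compute recall for both classes side by side. A sketch in base R using the three test-set confusion matrices printed above (the list layout and names are illustrative):

```r
# Test-set confusion matrices from the outputs above,
# stored as c(pred0_ref0, pred1_ref0, pred0_ref1, pred1_ref1)
cms <- list(
  naive_bayes   = c(87, 1, 1, 48),
  decision_tree = c(85, 3, 2, 47),
  random_forest = c(88, 0, 0, 49)
)

recall_by_class <- t(sapply(cms, function(v) {
  c(benign_recall    = v[1] / (v[1] + v[2]),  # share of actual benign found
    malignant_recall = v[4] / (v[3] + v[4]))  # share of actual malignant found
}))
print(round(recall_by_class, 4))
```

The `malignant_recall` column is the one that reflects how often a malignant tumor is wrongly predicted as benign.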

Checking Model

Overfitting check

pred_naive_train <- predict(object = model_nb,  
                          newdata = data_train)

train_matrix = confusionMatrix(data = pred_naive_train, 
                  reference = data_train$Class, 
                  positive = "0")
train_matrix
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 337   7
##          1  19 349
##                                               
##                Accuracy : 0.9635              
##                  95% CI : (0.947, 0.976)      
##     No Information Rate : 0.5                 
##     P-Value [Acc > NIR] : < 0.0000000000000002
##                                               
##                   Kappa : 0.927               
##                                               
##  Mcnemar's Test P-Value : 0.03098             
##                                               
##             Sensitivity : 0.9466              
##             Specificity : 0.9803              
##          Pos Pred Value : 0.9797              
##          Neg Pred Value : 0.9484              
##              Prevalence : 0.5000              
##          Detection Rate : 0.4733              
##    Detection Prevalence : 0.4831              
##       Balanced Accuracy : 0.9635              
##                                               
##        'Positive' Class : 0                   
## 
tibble( accuracy_naive = naive_matrix$overall[1],
        accuracy_train = train_matrix$overall[1],
        sensitivity_naive = naive_matrix$byClass[1],
        sensitivity_train = train_matrix$byClass[1]
)

Using the Naive Bayes model, the accuracy is 0.9634831 on the training data and 0.9854015 on the test data, and the sensitivity is 0.9466292 on the training data and 0.9886364 on the test data. The differences are 0.0219184 and 0.0420072 respectively. As a rule of thumb, we treat a model as overfit if the difference exceeds 0.1. Since 0.0219184 < 0.1 and 0.0420072 < 0.1, and the test scores are in fact higher than the training scores, the model does not appear to be overfit.
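The gap check can be written out explicitly. A short sketch using the four values reported in the confusion matrices above; the 0.1 cutoff is this notebook's rule of thumb, not a standard threshold.

```r
# Naive Bayes metrics as reported in the confusion matrices above
train <- c(accuracy = 0.9634831, sensitivity = 0.9466292)
test  <- c(accuracy = 0.9854015, sensitivity = 0.9886364)

gap       <- abs(train - test)  # absolute train/test difference per metric
threshold <- 0.1                # rule-of-thumb cutoff used in this notebook

print(round(gap, 7))
print(gap < threshold)
```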

It can be concluded that the model generalizes well to the available test data; in other words, the model is very good at classifying tumors as benign or malignant.

ROC & AUC

roc_test <- predict(object = model_nb, 
                     newdata = data_test, 
                     type = "raw")
pred_prob <- roc_test[,2]
model_roc <- prediction(predictions =  pred_prob, 
                        labels =  data_test$Class)
model_roc_vec <- performance(prediction.obj = model_roc, 
                             measure = "tpr", 
                             x.measure = "fpr" 
                             )
plot(model_roc_vec)

abline(0,1 , lty = 2)

Based on the ROC plot above, the model reaches a high True Positive Rate at a low False Positive Rate.

model_auc <- performance(model_roc,
                         measure = "auc")

model_auc@y.values
## [[1]]
## [1] 0.9997681

Based on the AUC value, it can be concluded that the model is very good at separating the benign and malignant classes, because the AUC is 0.9997681, which is close to 1.
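One way to read the AUC is as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below computes it from ranks in base R. The function `auc_rank` and the toy scores are made up for illustration and are not part of ROCR or of this dataset.

```r
# AUC via the rank-sum (Mann-Whitney) formulation
auc_rank <- function(scores, labels) {
  r     <- rank(scores)       # average ranks are used for tied scores
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy scores (made up for illustration, not from the dataset)
toy_scores <- c(0.10, 0.40, 0.35, 0.80)
toy_labels <- c(0, 0, 1, 1)
print(auc_rank(toy_scores, toy_labels))
```

In the toy data, 3 of the 4 positive/negative pairs are ordered correctly, so the AUC is 0.75; a model that ranks every positive above every negative scores 1.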

Conclusion

Based on the analysis, it can be concluded that:

  • Random Forest has the best test performance, but we consider it less suitable from a business perspective because it repeats the decision tree model many times, and its accuracy of 100% looks too perfect. The model used is therefore Naive Bayes, with an accuracy of 98.54%, higher than the Decision Tree at 96.35%.

  • The AUC obtained from Naive Bayes, 0.9997681, also shows that it is very good at separating the benign and malignant classes.

  • Naive Bayes only requires a small amount of training data to estimate the parameters needed for classification, which may explain why its accuracy is higher than the Decision Tree’s in this case.