Before starting the analysis, it is important to understand exactly what we are trying to predict and what the variables in the dataset mean. “Benign” refers to a medical condition or growth that is not cancerous or dangerous, as opposed to “malignant”.
We’ll set up caching for this notebook, since some of the code we will write can be computationally expensive.
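The setup chunk itself is not shown in this notebook, so the following is only a minimal sketch of how caching is commonly enabled, assuming the notebook is rendered with knitr:

```r
# Assumption: the notebook is knitted with knitr. This sets a global chunk
# option so that chunk results are cached and reused on later renders.
library(knitr)
opts_chunk$set(cache = TRUE)
```

Caching can also be switched on per chunk instead of globally, which is the safer choice when only the model-training chunks are slow.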
library(dplyr)
library(inspectdf)
library(e1071)
library(caret)
library(partykit)
library(ROCR)
library(randomForest)
The libraries loaded above are used throughout the analysis.
data = read.csv("E:/Algoritma/6_lbb_classification/breast cancer.csv")
head(data)
Clump thickness is a measure of how thick the cells
are within a tumor. Benign cells tend to be grouped in mono-layers,
while cancerous cells tend to be grouped in multi-layers (Sarkar et al. 2017, p. 1).
Uniformity of cell size and uniformity of cell shape
are two characteristics that can be used to describe the appearance of
cells under a microscope. Here we are checking the degree to which the
cells in a sample are similar in size and shape.
Marginal adhesion is the degree to which cells in a
tissue sample adhere, or stick, to one another at the edges of the
sample. Loss of adhesion might be a sign of malignancy.
Single epithelial cell size is the size of
individual cells in an epithelial tissue sample. Epithelial tissue is a
type of tissue that covers the surface of the body and lines internal
organs and structures. It is made up of cells that are tightly packed
together and held in place by specialized junctions.
Bare nuclei refers to cells in a tissue sample that
are missing their cell membranes and cytoplasm, leaving only the nucleus
visible.
Bland chromatin is the appearance of the genetic
material (chromatin) in the nucleus of a cell under a microscope.
Chromatin is made up of DNA and proteins, and it contains the genetic
information that controls the cell’s functions. When the chromatin in a
cell’s nucleus is compact and uniform in appearance, it is said to be
“bland.”
Normal nucleoli are small, spherical structures
found within the nucleus of a cell. They are composed of DNA, RNA, and
proteins and are responsible for synthesizing ribosomes, which are the
cellular structures that produce proteins. Nucleoli are usually visible
under a microscope and can vary in size and appearance depending on the
stage of the cell cycle and the cell’s function. In normal, healthy
cells, nucleoli are usually small and have a distinct, well-defined
border.
Mitosis is the process of cell division that occurs
in all living organisms. During mitosis, a single cell divides into two
daughter cells, each of which contains a copy of the parent cell’s DNA.
The process of mitosis is essential for the growth and repair of tissues
and the production of new cells.
Class is the target variable. Its two values are ‘malignant’ = 1 and
‘benign’ = 0.
glimpse(data)
## Rows: 683
## Columns: 10
## $ Clump.Thickness <int> 5, 5, 3, 6, 4, 8, 1, 2, 2, 4, 1, 2, 5, 1, …
## $ Uniformity.of.Cell.Size <int> 1, 4, 1, 8, 1, 10, 1, 1, 1, 2, 1, 1, 3, 1,…
## $ Uniformity.of.Cell.Shape <int> 1, 4, 1, 8, 1, 10, 1, 2, 1, 1, 1, 1, 3, 1,…
## $ Marginal.Adhesion <int> 1, 5, 1, 1, 3, 8, 1, 1, 1, 1, 1, 1, 3, 1, …
## $ Single.Epithelial.Cell.Size <int> 2, 7, 2, 3, 2, 7, 2, 2, 2, 2, 1, 2, 2, 2, …
## $ Bare.Nuclei <int> 1, 10, 2, 4, 1, 10, 10, 1, 1, 1, 1, 1, 3, …
## $ Bland.Chromatin <int> 3, 3, 3, 3, 3, 9, 3, 3, 1, 2, 3, 2, 4, 3, …
## $ Normal.Nucleoli <int> 1, 2, 1, 7, 1, 7, 1, 1, 1, 1, 1, 1, 4, 1, …
## $ Mitoses <int> 1, 1, 1, 1, 1, 1, 1, 1, 5, 1, 1, 1, 1, 1, …
## $ Class <int> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, …
All of the variables were read as integers, so the data types need to be adjusted where appropriate.
data = data %>%
  mutate(
    Class = as.factor(Class)
  )
glimpse(data)
## Rows: 683
## Columns: 10
## $ Clump.Thickness <int> 5, 5, 3, 6, 4, 8, 1, 2, 2, 4, 1, 2, 5, 1, …
## $ Uniformity.of.Cell.Size <int> 1, 4, 1, 8, 1, 10, 1, 1, 1, 2, 1, 1, 3, 1,…
## $ Uniformity.of.Cell.Shape <int> 1, 4, 1, 8, 1, 10, 1, 2, 1, 1, 1, 1, 3, 1,…
## $ Marginal.Adhesion <int> 1, 5, 1, 1, 3, 8, 1, 1, 1, 1, 1, 1, 3, 1, …
## $ Single.Epithelial.Cell.Size <int> 2, 7, 2, 3, 2, 7, 2, 2, 2, 2, 1, 2, 2, 2, …
## $ Bare.Nuclei <int> 1, 10, 2, 4, 1, 10, 10, 1, 1, 1, 1, 1, 3, …
## $ Bland.Chromatin <int> 3, 3, 3, 3, 3, 9, 3, 3, 1, 2, 3, 2, 4, 3, …
## $ Normal.Nucleoli <int> 1, 2, 1, 7, 1, 7, 1, 1, 1, 1, 1, 1, 4, 1, …
## $ Mitoses <int> 1, 1, 1, 1, 1, 1, 1, 1, 5, 1, 1, 1, 1, 1, …
## $ Class <fct> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, …
Because all the predictor variables are numeric measurements, only the target variable, which is categorical, is converted to a factor.
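As a small illustration of the conversion above, the sketch below uses a toy vector in place of the Class column. The labelled variant is a hypothetical alternative and is not used in this notebook, where the levels stay "0" and "1".

```r
# Toy vector standing in for the Class column (0 = benign, 1 = malignant)
class_int <- c(0L, 0L, 1L, 0L, 1L)

# as.factor() keeps the codes "0" and "1" as the level names
class_fct <- as.factor(class_int)
levels(class_fct)

# Hypothetical variant: attach readable labels to the same codes
class_lbl <- factor(class_int, levels = c(0, 1),
                    labels = c("benign", "malignant"))
table(class_lbl)
```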
summary(data)
## Clump.Thickness Uniformity.of.Cell.Size Uniformity.of.Cell.Shape
## Min. : 1.000 Min. : 1.000 Min. : 1.000
## 1st Qu.: 2.000 1st Qu.: 1.000 1st Qu.: 1.000
## Median : 4.000 Median : 1.000 Median : 1.000
## Mean : 4.442 Mean : 3.151 Mean : 3.215
## 3rd Qu.: 6.000 3rd Qu.: 5.000 3rd Qu.: 5.000
## Max. :10.000 Max. :10.000 Max. :10.000
## Marginal.Adhesion Single.Epithelial.Cell.Size Bare.Nuclei
## Min. : 1.00 Min. : 1.000 Min. : 1.000
## 1st Qu.: 1.00 1st Qu.: 2.000 1st Qu.: 1.000
## Median : 1.00 Median : 2.000 Median : 1.000
## Mean : 2.83 Mean : 3.234 Mean : 3.545
## 3rd Qu.: 4.00 3rd Qu.: 4.000 3rd Qu.: 6.000
## Max. :10.00 Max. :10.000 Max. :10.000
## Bland.Chromatin Normal.Nucleoli Mitoses Class
## Min. : 1.000 Min. : 1.00 Min. : 1.000 0:444
## 1st Qu.: 2.000 1st Qu.: 1.00 1st Qu.: 1.000 1:239
## Median : 3.000 Median : 1.00 Median : 1.000
## Mean : 3.445 Mean : 2.87 Mean : 1.603
## 3rd Qu.: 5.000 3rd Qu.: 4.00 3rd Qu.: 1.000
## Max. :10.000 Max. :10.00 Max. :10.000
The summary shows the descriptive statistics of each variable. The target variable contains 444 benign and 239 malignant cases.
colSums(is.na(data))
## Clump.Thickness Uniformity.of.Cell.Size
## 0 0
## Uniformity.of.Cell.Shape Marginal.Adhesion
## 0 0
## Single.Epithelial.Cell.Size Bare.Nuclei
## 0 0
## Bland.Chromatin Normal.Nucleoli
## 0 0
## Mitoses Class
## 0 0
The dataset has no missing values, so no missing-value handling is required and the analysis can proceed.
dim(data)
## [1] 683 10
prop.table(table(data$Class))
##
## 0 1
## 0.6500732 0.3499268
The data has 683 rows and 10 variables.
The classes are not balanced: about 65% of the cases are benign (0) and about 35% are malignant (1).
RNGkind(sample.kind = "Rounding")
set.seed(17)
index = sample(x = nrow(data), size = nrow(data)*0.8)
data_train <- data[index,]
data_test <- data[-index,]
nrow(data_train)
## [1] 546
nrow(data_test)
## [1] 137
The data is split 80:20 into training and testing sets, giving 546 rows of training data and 137 rows of testing data.
RNGkind(sample.kind = "Rounding")
set.seed(7)
data_train <- upSample(x = data_train %>% select(-Class),
y = data_train$Class,
yname = "Class")
prop.table(table(data_train$Class))
##
## 0 1
## 0.5 0.5
This step balances the class proportions in the training data by upsampling the minority class.
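To make the idea of upsampling concrete, here is a minimal base-R sketch on a toy data frame. It illustrates the technique (keep every row and redraw minority rows with replacement until the classes match) and is not the internal code of `upSample`. The data frame `toy` and its column `x` are made up for the example.

```r
set.seed(7)  # arbitrary seed so the toy example is reproducible

# Toy training set: 6 benign (0) rows and 2 malignant (1) rows
toy <- data.frame(x = 1:8,
                  Class = factor(c(0, 0, 0, 0, 0, 0, 1, 1)))

counts <- table(toy$Class)
target <- max(counts)  # size of the majority class

# For each class, keep all rows and add rows drawn with replacement
# until the class reaches the target size
balanced <- do.call(rbind, lapply(names(counts), function(cl) {
  rows  <- which(toy$Class == cl)
  extra <- rows[sample.int(length(rows), target - length(rows), replace = TRUE)]
  toy[c(rows, extra), ]
}))

table(balanced$Class)
```

Only the training data is upsampled. The test set keeps its original class proportions so that the evaluation reflects the real distribution.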
Classification modeling uses three methods:
* Naive Bayes
* Decision Tree
* Random Forest
model_nb = naiveBayes(Class ~ ., data = data_train)
pred_nb = predict(model_nb,
                  newdata = data_test,
                  type = "class")
naive_matrix <- confusionMatrix(data = pred_nb,
                                reference = data_test$Class,
                                positive = "0")
naive_matrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 87 1
## 1 1 48
##
## Accuracy : 0.9854
## 95% CI : (0.9483, 0.9982)
## No Information Rate : 0.6423
## P-Value [Acc > NIR] : <0.0000000000000002
##
## Kappa : 0.9682
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9886
## Specificity : 0.9796
## Pos Pred Value : 0.9886
## Neg Pred Value : 0.9796
## Prevalence : 0.6423
## Detection Rate : 0.6350
## Detection Prevalence : 0.6423
## Balanced Accuracy : 0.9841
##
## 'Positive' Class : 0
##
Because positive = "0", the positive class in this output is benign:
* True positive (TP): predicted benign and actually benign, 87
* True negative (TN): predicted malignant and actually malignant, 48
* False positive (FP): predicted benign but actually malignant, 1
* False negative (FN): predicted malignant but actually benign, 1
The main metrics are:
* Accuracy: the model classifies 98.54% of the test cases correctly
* Sensitivity/Recall: 98.86% of the actual positive (benign) cases are identified
* Specificity: 97.96% of the actual negative (malignant) cases are identified
* Pos Pred Value/Precision: 98.86% of the cases predicted as positive (benign) are actually benign
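The reported metrics can be reproduced by hand from the four cells of the Naive Bayes confusion matrix. This is a sketch using the counts printed above, with benign (0) as the positive class as set in the `confusionMatrix` call. The variable names are chosen for illustration.

```r
# Counts read from the Naive Bayes confusion matrix (positive class = benign)
tp <- 87  # predicted benign, actually benign
fn <- 1   # predicted malignant, actually benign
fp <- 1   # predicted benign, actually malignant
tn <- 48  # predicted malignant, actually malignant

accuracy    <- (tp + tn) / (tp + tn + fp + fn)
sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)
precision   <- tp / (tp + fp)

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, precision = precision), 4)
```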
data_tree <- ctree(Class~., data = data_train)
plot(data_tree, type = "simple")
The terminal nodes of the tree can be read as follows:
* 313 observations have Uniformity.of.Cell.Shape ≤ 2, Bare.Nuclei ≤ 5, Bland.Chromatin ≤ 4 and Bare.Nuclei ≤ 3, with an error of 0.0%.
* 9 observations have Uniformity.of.Cell.Shape ≤ 2, Bare.Nuclei ≤ 5, Bland.Chromatin ≤ 4 and Bare.Nuclei > 3, with an error of 11.1%.
* 8 observations have Uniformity.of.Cell.Shape ≤ 2, Bare.Nuclei ≤ 5 and Bland.Chromatin > 4, with an error of 50%.
* 8 observations have Uniformity.of.Cell.Shape ≤ 2 and Bare.Nuclei > 5, with an error of 0.0%.
* 14 observations have Uniformity.of.Cell.Shape > 2, Bare.Nuclei ≤ 1 and Uniformity.of.Cell.Size ≤ 3, with an error of 0.0%.
* 20 observations have Uniformity.of.Cell.Shape > 2, Bare.Nuclei ≤ 1 and Uniformity.of.Cell.Size > 3, with an error of 10%.
* 15 observations have Uniformity.of.Cell.Shape > 2, Bare.Nuclei > 1, Bare.Nuclei ≤ 3 and Uniformity.of.Cell.Size ≤ 4, with an error of 46.7%.
* 25 observations have Uniformity.of.Cell.Shape > 2, Bare.Nuclei > 1, Bare.Nuclei ≤ 3 and Uniformity.of.Cell.Size > 4, with an error of 0.0%.
* 300 observations have Uniformity.of.Cell.Shape > 2, Bare.Nuclei > 1 and Bare.Nuclei > 3, with an error of 2.7%.
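As a quick sanity check on the node descriptions, the node sizes should add up to the size of the upsampled training set, and the per-node errors can be combined into an approximate overall training error. This is a sketch using the numbers quoted above. The vector names are illustrative.

```r
# Observation counts and error rates of the nine terminal nodes, as quoted above
leaf_n   <- c(313, 9, 8, 8, 14, 20, 15, 25, 300)
leaf_err <- c(0.0, 11.1, 50, 0.0, 0.0, 10, 46.7, 0.0, 2.7) / 100

sum(leaf_n)  # 712, the number of rows in the upsampled training data

# Approximate training error of the tree: node errors weighted by node size
sum(leaf_n * leaf_err) / sum(leaf_n)
```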
pred_tree <- predict(object = data_tree,
                     newdata = data_test)
tree_matrix <- confusionMatrix(data = pred_tree,
                               reference = data_test$Class,
                               positive = "0")
tree_matrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 85 2
## 1 3 47
##
## Accuracy : 0.9635
## 95% CI : (0.9169, 0.988)
## No Information Rate : 0.6423
## P-Value [Acc > NIR] : <0.0000000000000002
##
## Kappa : 0.9209
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9659
## Specificity : 0.9592
## Pos Pred Value : 0.9770
## Neg Pred Value : 0.9400
## Prevalence : 0.6423
## Detection Rate : 0.6204
## Detection Prevalence : 0.6350
## Balanced Accuracy : 0.9625
##
## 'Positive' Class : 0
##
Because positive = "0", the positive class in this output is benign:
* True positive (TP): predicted benign and actually benign, 85
* True negative (TN): predicted malignant and actually malignant, 47
* False positive (FP): predicted benign but actually malignant, 2
* False negative (FN): predicted malignant but actually benign, 3
The main metrics are:
* Accuracy: the model classifies 96.35% of the test cases correctly
* Sensitivity/Recall: 96.59% of the actual positive (benign) cases are identified
* Specificity: 95.92% of the actual negative (malignant) cases are identified
* Pos Pred Value/Precision: 97.70% of the cases predicted as positive (benign) are actually benign
set.seed(6)
ctrl <- trainControl(method = "repeatedcv",
                     number = 3,
                     repeats = 5)
data_forest <- train(Class ~ .,
                     data = data_train,
                     method = "rf",
                     trControl = ctrl)
#saveRDS(data_forest, "data_forest_2.RDS") # save the model
#data_forest <- readRDS("E:/Algoritma/6_lbb_classification/Classification/data_forest_2.RDS")
data_forest
## Random Forest
##
## 712 samples
## 9 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (3 fold, repeated 5 times)
## Summary of sample sizes: 475, 474, 475, 475, 474, 475, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.9764020 0.9528021
## 5 0.9741528 0.9483033
## 9 0.9730252 0.9460484
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
K-fold cross-validation divides the data into k equal parts, with each part used in turn as the validation data. Here, 3-fold cross-validation repeated 5 times was used to compare several values of mtry, the number of randomly selected predictors considered when splitting a node. The selected model uses mtry = 2, which has the highest accuracy, 0.9764020.
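The fold mechanics can be sketched in a few lines of base R. This is a toy illustration of one round of 3-fold cross-validation on 12 made-up observations, not a reproduction of caret's internal resampling.

```r
set.seed(6)  # arbitrary seed so the toy example is reproducible

n <- 12  # toy number of observations
k <- 3   # number of folds, matching number = 3 in trainControl

# Randomly assign each observation to one of k folds of equal size
fold_id <- sample(rep(seq_len(k), length.out = n))

for (fold in seq_len(k)) {
  held_out <- which(fold_id == fold)  # rows used for validation in this round
  fit_on   <- which(fold_id != fold)  # rows used for fitting in this round
  cat("Fold", fold, ": fit on", length(fit_on),
      "rows, validate on", length(held_out), "rows\n")
}
```

With repeats = 5, this whole procedure is run five times with different random fold assignments, and the accuracy for each mtry value is averaged over all 15 fits.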
data_forest$finalModel
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 2.39%
## Confusion matrix:
## 0 1 class.error
## 0 342 14 0.039325843
## 1 3 353 0.008426966
The out-of-bag (OOB) observations, which are the rows each tree did not see during training, are used to estimate the error. The output above gives an OOB error estimate of 2.39%; in other words, the accuracy of the model on the OOB data is 97.61%.
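The OOB figures can be recomputed from the confusion matrix printed by the final model. This is a sketch using those four counts. The object name `oob` is chosen for illustration.

```r
# OOB confusion matrix as printed above (rows = actual class, columns = predicted class)
oob <- matrix(c(342, 14,
                  3, 353),
              nrow = 2, byrow = TRUE,
              dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))

# Share of OOB predictions that fall off the diagonal
oob_error <- 1 - sum(diag(oob)) / sum(oob)

round(100 * oob_error, 2)        # OOB error in percent
round(100 * (1 - oob_error), 2)  # OOB accuracy in percent
```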
pred_forest <- predict(data_forest,
                       data_test)
forest_matrix <- confusionMatrix(data = pred_forest,
                                 reference = data_test$Class)
forest_matrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 88 0
## 1 0 49
##
## Accuracy : 1
## 95% CI : (0.9734, 1)
## No Information Rate : 0.6423
## P-Value [Acc > NIR] : < 0.00000000000000022
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Sensitivity : 1.0000
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 1.0000
## Prevalence : 0.6423
## Detection Rate : 0.6423
## Detection Prevalence : 0.6423
## Balanced Accuracy : 1.0000
##
## 'Positive' Class : 0
##
The positive class in this output is benign (0):
* True positive (TP): predicted benign and actually benign, 88
* True negative (TN): predicted malignant and actually malignant, 49
* False positive (FP): predicted benign but actually malignant, 0
* False negative (FN): predicted malignant but actually benign, 0
The main metrics are:
* Accuracy: the model classifies 100% of the test cases correctly
* Sensitivity/Recall: 100% of the actual positive (benign) cases are identified
* Specificity: 100% of the actual negative (malignant) cases are identified
* Pos Pred Value/Precision: 100% of the cases predicted as positive (benign) are actually benign
tibble(accuracy_naive = naive_matrix$overall[1],
       accuracy_tree = tree_matrix$overall[1],
       accuracy_forest = forest_matrix$overall[1])
Comparing the three models, Random Forest has the highest test accuracy (100%), followed by Naive Bayes (98.54%) and the decision tree (96.35%). Random Forest was nevertheless not chosen, because from a business perspective it is a less suitable model: it repeats the decision tree model many times, so it can fit the patterns of this particular data too closely. The model chosen is therefore Naive Bayes, with an accuracy of 98.54%.
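tibble(sensitivity_naive = naive_matrix$byClass[1],
       sensitivity_tree = tree_matrix$byClass[1],
       sensitivity_forest = forest_matrix$byClass[1])
Based on the confusion matrix, the error we most want to minimize is predicting benign when the tumor is actually malignant, so recall is used for the comparison. Note that with positive = "0" the reported sensitivity is the recall for the benign class, while the recall for the malignant class appears in the output as specificity. Setting Random Forest aside, Naive Bayes has a higher sensitivity than the decision tree (0.9886 versus 0.9659) and also a higher specificity (0.9796 versus 0.9592), so the model used is Naive Bayes.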
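Since the error of interest is a malignant tumor predicted as benign, it can help to compute the recall for the malignant class directly. This is a sketch using the counts from the three test confusion matrices shown earlier. The vector names are illustrative.

```r
# Malignant test cases (49 in total) split by how each model classified them
malignant_hit  <- c(naive_bayes = 48, decision_tree = 47, random_forest = 49)  # predicted malignant
malignant_miss <- c(naive_bayes = 1,  decision_tree = 2,  random_forest = 0)   # predicted benign

# Recall for the malignant class: share of malignant cases the model catches
recall_malignant <- malignant_hit / (malignant_hit + malignant_miss)
round(recall_malignant, 4)
```

These values match the Specificity rows of the outputs above, because those outputs were produced with benign as the positive class.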
pred_naive_train <- predict(object = model_nb,
                            newdata = data_train)
train_matrix = confusionMatrix(data = pred_naive_train,
                               reference = data_train$Class,
                               positive = "0")
train_matrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 337 7
## 1 19 349
##
## Accuracy : 0.9635
## 95% CI : (0.947, 0.976)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : < 0.0000000000000002
##
## Kappa : 0.927
##
## Mcnemar's Test P-Value : 0.03098
##
## Sensitivity : 0.9466
## Specificity : 0.9803
## Pos Pred Value : 0.9797
## Neg Pred Value : 0.9484
## Prevalence : 0.5000
## Detection Rate : 0.4733
## Detection Prevalence : 0.4831
## Balanced Accuracy : 0.9635
##
## 'Positive' Class : 0
##
tibble(accuracy_naive = naive_matrix$overall[1],
       accuracy_train = train_matrix$overall[1],
       sensitivity_naive = naive_matrix$byClass[1],
       sensitivity_train = train_matrix$byClass[1])
Using the Naive Bayes model, the accuracy is 0.9634831 on the training data and 0.9854015 on the test data, and the sensitivity is 0.9466292 on the training data and 0.9886364 on the test data. The differences are 0.0219184 and 0.0420072 respectively. Here a model is considered overfit if the difference exceeds 0.1; in this case 0.0219184 < 0.1 and 0.0420072 < 0.1, and the test scores are in fact slightly higher than the training scores.
It can be concluded that the model generalizes well to the available test data; in other words, the model is very good at classifying tumors as benign or malignant.
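The train/test comparison amounts to the following arithmetic, sketched here with the four values reported for the Naive Bayes model. The 0.1 threshold is the rule of thumb used in this notebook, not a formal test.

```r
# Values reported for the Naive Bayes model
acc_train  <- 0.9634831
acc_test   <- 0.9854015
sens_train <- 0.9466292
sens_test  <- 0.9886364

# Absolute gap between test and training performance
gap <- c(accuracy    = abs(acc_test - acc_train),
         sensitivity = abs(sens_test - sens_train))
gap

gap < 0.1  # TRUE for both metrics means the gap is below the chosen threshold
```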
roc_test <- predict(object = model_nb,
                    newdata = data_test,
                    type = "raw")
pred_prob <- roc_test[,2]
model_roc <- prediction(predictions = pred_prob,
                        labels = data_test$Class)
model_roc_vec <- performance(prediction.obj = model_roc,
                             measure = "tpr",
                             x.measure = "fpr")
plot(model_roc_vec)
abline(0, 1, lty = 2)
The ROC curve plotted above shows that the model reaches a high true positive rate.
model_auc <- performance(model_roc,
                         measure = "auc")
model_auc@y.values
## [[1]]
## [1] 0.9997681
Based on the AUC value of 0.9997681, which is close to 1, it can be concluded that the model is very good at separating the benign and malignant classes.
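To show what the AUC measures, the sketch below computes it by hand on a tiny made-up example using the rank-based (Mann-Whitney) formulation. The helper `auc_rank` and the toy vectors are hypothetical and are not part of ROCR.

```r
# Toy data (hypothetical): predicted probability of the positive class and true labels
score <- c(0.10, 0.40, 0.35, 0.80)
label <- c(0, 0, 1, 1)

# AUC is the probability that a randomly chosen positive case is scored higher
# than a randomly chosen negative case (Mann-Whitney U over the number of pairs)
auc_rank <- function(score, label) {
  r     <- rank(score)  # average ranks, so ties are handled
  n_pos <- sum(label == 1)
  n_neg <- sum(label == 0)
  (sum(r[label == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

auc_rank(score, label)  # 0.75: three of the four positive/negative pairs are ordered correctly
```

An AUC close to 1, as obtained for the Naive Bayes model, means that almost every malignant case receives a higher predicted probability of malignancy than almost every benign case.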
Based on the analysis that has been done, it can be concluded that:
* Random Forest has the highest accuracy, but from a business perspective it is not a good choice, because it repeats the decision tree model many times and can fit the patterns of this particular data too closely; its accuracy of 100% is also too perfect. The model chosen is therefore Naive Bayes, with an accuracy of 98.54%, higher than the decision tree at only 96.35%.
* The AUC value of 0.9997681 obtained from Naive Bayes also shows that it is very good at separating the benign and malignant classes.
* Naive Bayes requires only a small amount of training data to estimate the parameters needed for classification, which may explain why its accuracy is higher than that of the decision tree in this case.