Data Source : https://www.kaggle.com/uciml/breast-cancer-wisconsin-data
Introduction
Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image. The linear program used to obtain the separating plane in the 3-dimensional space is described in: [K. P. Bennett and O. L. Mangasarian: “Robust Linear Programming Discrimination of Two Linearly Inseparable Sets”, Optimization Methods and Software 1, 1992, 23-34].
This database is also available through the UW CS ftp server: ftp ftp.cs.wisc.edu, then cd math-prog/cpo-dataset/machine-learn/WDBC/
It can also be found in the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29
Attribute Information:
- ID number
- Diagnosis (M = malignant, B = benign)
Fields 3-32 hold ten real-valued features computed for each cell nucleus:
- radius (mean of distances from center to points on the perimeter)
- texture (standard deviation of gray-scale values)
- perimeter
- area
- smoothness (local variation in radius lengths)
- compactness (perimeter^2 / area - 1.0)
- concavity (severity of concave portions of the contour)
- concave points (number of concave portions of the contour)
- symmetry
- fractal dimension (“coastline approximation” - 1)
The mean, standard error and “worst” or largest (mean of the three largest values) of these features were computed for each image, resulting in 30 features. For instance, field 3 is Mean Radius, field 13 is Radius SE, field 23 is Worst Radius.
All feature values are recoded with four significant digits.
Missing attribute values: none
Class distribution: 357 benign, 212 malignant
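As a rough illustration of how the 30 feature columns arise from the ten measurements and three statistics, the sketch below rebuilds the column names in base R. The vectors base_features and stats are typed out by hand to match the column names printed later in this document; they are an assumption made for illustration, not something read from the file.

```r
# ten base measurements, spelled as in the CSV column names
base_features <- c("radius", "texture", "perimeter", "area", "smoothness",
                   "compactness", "concavity", "concave.points",
                   "symmetry", "fractal_dimension")
# three statistics computed per image
stats <- c("mean", "se", "worst")

# outer() pairs every measurement with every statistic; as.vector() reads
# the result column by column, so all "_mean" names come first
feature_names <- as.vector(outer(base_features, stats, paste, sep = "_"))

# fields 1 and 2 are the ID and the diagnosis, so feature k sits in field k + 2
field_number <- seq_along(feature_names) + 2
names(field_number) <- feature_names

length(feature_names)
field_number[c("radius_mean", "radius_se", "radius_worst")]
```

The last line should report fields 3, 13 and 23, in line with the field numbering given above.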
Load the required libraries
library(tidyverse)
library(readr)
library(GGally)
# Naive Bayes
library(e1071)
# confusionMatrix
library(caret)
# Split data Train & Test
library(rsample)
library(ROCR)
# Decision Tree
library(partykit)
# randomForest
library(randomForest)
Read Data
Data Frame
cancer <- read.csv("breastcancer.csv")
head(cancer,3)
## id diagnosis radius_mean texture_mean perimeter_mean area_mean
## 1 842302 M 17.99 10.38 122.8 1001
## 2 842517 M 20.57 17.77 132.9 1326
## 3 84300903 M 19.69 21.25 130.0 1203
## smoothness_mean compactness_mean concavity_mean concave.points_mean
## 1 0.11840 0.27760 0.3001 0.14710
## 2 0.08474 0.07864 0.0869 0.07017
## 3 0.10960 0.15990 0.1974 0.12790
## symmetry_mean fractal_dimension_mean radius_se texture_se perimeter_se
## 1 0.2419 0.07871 1.0950 0.9053 8.589
## 2 0.1812 0.05667 0.5435 0.7339 3.398
## 3 0.2069 0.05999 0.7456 0.7869 4.585
## area_se smoothness_se compactness_se concavity_se concave.points_se
## 1 153.40 0.006399 0.04904 0.05373 0.01587
## 2 74.08 0.005225 0.01308 0.01860 0.01340
## 3 94.03 0.006150 0.04006 0.03832 0.02058
## symmetry_se fractal_dimension_se radius_worst texture_worst perimeter_worst
## 1 0.03003 0.006193 25.38 17.33 184.6
## 2 0.01389 0.003532 24.99 23.41 158.8
## 3 0.02250 0.004571 23.57 25.53 152.5
## area_worst smoothness_worst compactness_worst concavity_worst
## 1 2019 0.1622 0.6656 0.7119
## 2 1956 0.1238 0.1866 0.2416
## 3 1709 0.1444 0.4245 0.4504
## concave.points_worst symmetry_worst fractal_dimension_worst X
## 1 0.2654 0.4601 0.11890 NA
## 2 0.1860 0.2750 0.08902 NA
## 3 0.2430 0.3613 0.08758 NA
Data Type, Rows and Columns
glimpse(cancer)
## Rows: 569
## Columns: 33
## $ id <int> 842302, 842517, 84300903, 84348301, 84358402, ~
## $ diagnosis <chr> "M", "M", "M", "M", "M", "M", "M", "M", "M", "~
## $ radius_mean <dbl> 17.990, 20.570, 19.690, 11.420, 20.290, 12.450~
## $ texture_mean <dbl> 10.38, 17.77, 21.25, 20.38, 14.34, 15.70, 19.9~
## $ perimeter_mean <dbl> 122.80, 132.90, 130.00, 77.58, 135.10, 82.57, ~
## $ area_mean <dbl> 1001.0, 1326.0, 1203.0, 386.1, 1297.0, 477.1, ~
## $ smoothness_mean <dbl> 0.11840, 0.08474, 0.10960, 0.14250, 0.10030, 0~
## $ compactness_mean <dbl> 0.27760, 0.07864, 0.15990, 0.28390, 0.13280, 0~
## $ concavity_mean <dbl> 0.30010, 0.08690, 0.19740, 0.24140, 0.19800, 0~
## $ concave.points_mean <dbl> 0.14710, 0.07017, 0.12790, 0.10520, 0.10430, 0~
## $ symmetry_mean <dbl> 0.2419, 0.1812, 0.2069, 0.2597, 0.1809, 0.2087~
## $ fractal_dimension_mean <dbl> 0.07871, 0.05667, 0.05999, 0.09744, 0.05883, 0~
## $ radius_se <dbl> 1.0950, 0.5435, 0.7456, 0.4956, 0.7572, 0.3345~
## $ texture_se <dbl> 0.9053, 0.7339, 0.7869, 1.1560, 0.7813, 0.8902~
## $ perimeter_se <dbl> 8.589, 3.398, 4.585, 3.445, 5.438, 2.217, 3.18~
## $ area_se <dbl> 153.40, 74.08, 94.03, 27.23, 94.44, 27.19, 53.~
## $ smoothness_se <dbl> 0.006399, 0.005225, 0.006150, 0.009110, 0.0114~
## $ compactness_se <dbl> 0.049040, 0.013080, 0.040060, 0.074580, 0.0246~
## $ concavity_se <dbl> 0.05373, 0.01860, 0.03832, 0.05661, 0.05688, 0~
## $ concave.points_se <dbl> 0.015870, 0.013400, 0.020580, 0.018670, 0.0188~
## $ symmetry_se <dbl> 0.03003, 0.01389, 0.02250, 0.05963, 0.01756, 0~
## $ fractal_dimension_se <dbl> 0.006193, 0.003532, 0.004571, 0.009208, 0.0051~
## $ radius_worst <dbl> 25.38, 24.99, 23.57, 14.91, 22.54, 15.47, 22.8~
## $ texture_worst <dbl> 17.33, 23.41, 25.53, 26.50, 16.67, 23.75, 27.6~
## $ perimeter_worst <dbl> 184.60, 158.80, 152.50, 98.87, 152.20, 103.40,~
## $ area_worst <dbl> 2019.0, 1956.0, 1709.0, 567.7, 1575.0, 741.6, ~
## $ smoothness_worst <dbl> 0.1622, 0.1238, 0.1444, 0.2098, 0.1374, 0.1791~
## $ compactness_worst <dbl> 0.6656, 0.1866, 0.4245, 0.8663, 0.2050, 0.5249~
## $ concavity_worst <dbl> 0.71190, 0.24160, 0.45040, 0.68690, 0.40000, 0~
## $ concave.points_worst <dbl> 0.26540, 0.18600, 0.24300, 0.25750, 0.16250, 0~
## $ symmetry_worst <dbl> 0.4601, 0.2750, 0.3613, 0.6638, 0.2364, 0.3985~
## $ fractal_dimension_worst <dbl> 0.11890, 0.08902, 0.08758, 0.17300, 0.07678, 0~
## $ X <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
Data pre-processing
Data wrangling
The X column contains only NA values, so we remove it. We also drop id, which is an identifier rather than a predictor, and convert diagnosis to a factor. This makes the later wrangling and modelling steps easier.
cancer <- cancer %>%
select(-c(id,X)) %>%
mutate(diagnosis = as.factor(diagnosis))
head(cancer,3)
## diagnosis radius_mean texture_mean perimeter_mean area_mean smoothness_mean
## 1 M 17.99 10.38 122.8 1001 0.11840
## 2 M 20.57 17.77 132.9 1326 0.08474
## 3 M 19.69 21.25 130.0 1203 0.10960
## compactness_mean concavity_mean concave.points_mean symmetry_mean
## 1 0.27760 0.3001 0.14710 0.2419
## 2 0.07864 0.0869 0.07017 0.1812
## 3 0.15990 0.1974 0.12790 0.2069
## fractal_dimension_mean radius_se texture_se perimeter_se area_se
## 1 0.07871 1.0950 0.9053 8.589 153.40
## 2 0.05667 0.5435 0.7339 3.398 74.08
## 3 0.05999 0.7456 0.7869 4.585 94.03
## smoothness_se compactness_se concavity_se concave.points_se symmetry_se
## 1 0.006399 0.04904 0.05373 0.01587 0.03003
## 2 0.005225 0.01308 0.01860 0.01340 0.01389
## 3 0.006150 0.04006 0.03832 0.02058 0.02250
## fractal_dimension_se radius_worst texture_worst perimeter_worst area_worst
## 1 0.006193 25.38 17.33 184.6 2019
## 2 0.003532 24.99 23.41 158.8 1956
## 3 0.004571 23.57 25.53 152.5 1709
## smoothness_worst compactness_worst concavity_worst concave.points_worst
## 1 0.1622 0.6656 0.7119 0.2654
## 2 0.1238 0.1866 0.2416 0.1860
## 3 0.1444 0.4245 0.4504 0.2430
## symmetry_worst fractal_dimension_worst
## 1 0.4601 0.11890
## 2 0.2750 0.08902
## 3 0.3613 0.08758
Check missing value
Before we start exploring the data, we make sure that it has no missing values.
colSums(is.na(cancer))
## diagnosis radius_mean texture_mean
## 0 0 0
## perimeter_mean area_mean smoothness_mean
## 0 0 0
## compactness_mean concavity_mean concave.points_mean
## 0 0 0
## symmetry_mean fractal_dimension_mean radius_se
## 0 0 0
## texture_se perimeter_se area_se
## 0 0 0
## smoothness_se compactness_se concavity_se
## 0 0 0
## concave.points_se symmetry_se fractal_dimension_se
## 0 0 0
## radius_worst texture_worst perimeter_worst
## 0 0 0
## area_worst smoothness_worst compactness_worst
## 0 0 0
## concavity_worst concave.points_worst symmetry_worst
## 0 0 0
## fractal_dimension_worst
## 0
Since every column has zero missing values, we can proceed to the next step.
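To make the check concrete, here is a small sketch on a made-up data frame (toy, with hypothetical columns a, b and X). It shows what colSums(is.na(...)) counts and how a column that is entirely NA, like X in the raw file, could be found programmatically. Note that a stored value of 0 is not a missing value.

```r
# toy data: column b has one NA and one genuine zero, column X is entirely NA
toy <- data.frame(a = c(1, 2, 3),
                  b = c(0, NA, 5),
                  X = c(NA, NA, NA))

# number of missing values per column (the zero in b is not counted)
na_counts <- colSums(is.na(toy))
na_counts

# columns in which every row is missing carry no information
all_na <- names(na_counts)[na_counts == nrow(toy)]

# drop those columns and keep the rest
toy_clean <- toy[, !(names(toy) %in% all_na), drop = FALSE]
names(toy_clean)
```

Partially missing columns such as b are kept here; how to treat them (drop rows, impute) is a separate decision that does not arise in this dataset.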
Columns Names
colnames(cancer)
## [1] "diagnosis" "radius_mean"
## [3] "texture_mean" "perimeter_mean"
## [5] "area_mean" "smoothness_mean"
## [7] "compactness_mean" "concavity_mean"
## [9] "concave.points_mean" "symmetry_mean"
## [11] "fractal_dimension_mean" "radius_se"
## [13] "texture_se" "perimeter_se"
## [15] "area_se" "smoothness_se"
## [17] "compactness_se" "concavity_se"
## [19] "concave.points_se" "symmetry_se"
## [21] "fractal_dimension_se" "radius_worst"
## [23] "texture_worst" "perimeter_worst"
## [25] "area_worst" "smoothness_worst"
## [27] "compactness_worst" "concavity_worst"
## [29] "concave.points_worst" "symmetry_worst"
## [31] "fractal_dimension_worst"
colnames() shows the column names in the data frame.
Exploratory data analysis (EDA)
Before we proceed any further, it is essential to perform a basic exploratory data analysis. Both outcomes are reasonably well represented in the data (357 benign, 212 malignant), so oversampling won’t be required. There is also nothing unusual in the distribution plots.
Correlation plot
ggcorr(
cancer,
label = TRUE,
label_size = 3,
hjust = 1,
layout.exp = 4
)
Distribution of each feature
ggplot(gather(cancer[,2:ncol(cancer)]), aes(value)) +
geom_histogram(bins = 5, fill = "blue", alpha = 0.6) +
facet_wrap(~key, scales = 'free_x')
Distribution of the objective variable
p1 <- ggplot(cancer, aes(x = diagnosis, fill = diagnosis)) +
  geom_bar(stat = "count", position = "stack", show.legend = FALSE) +
  theme_minimal(base_size = 16) +
  # after_stat(count) replaces the deprecated ..count.. notation (ggplot2 >= 3.3.0)
  geom_label(stat = "count", aes(label = after_stat(count)), position = position_stack(vjust = 0.5),
             size = 5, show.legend = FALSE)
p1
Density plot
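# plot the class-conditional densities of four features at a time
# after dropping id and X, the features start at column 2 (column 1 is diagnosis)
densityplot <- function(cancer, i = 2) {
  featurePlot(x = cancer[, i:min(i + 3, ncol(cancer))],
              y = cancer$diagnosis,
              plot = "density",
              auto.key = list(columns = 2))
}
# to cover every feature, i would run over seq(2, ncol(cancer), 4)
densityplot(cancer, 7)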
Naive Bayes Model
The Naive Bayes classifier is based on Bayes’ theorem and assumes that the predictors are independent given the class. A Naive Bayes model is easy to build, with no complicated iterative parameter estimation, which makes it particularly useful for very large datasets. Despite its simplicity, it often performs surprisingly well and can be competitive with more sophisticated classification methods.
Data Manipulation / Train-Test Split
We now need to ensure our data is in the correct format and split it into training and test datasets. We use an 80-20 split here, stratified by diagnosis.
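For numeric predictors such as the ones in this dataset, naiveBayes() in e1071 models each feature within each class as a normal distribution. The base R sketch below reproduces that idea on a made-up two-feature dataset. The function gaussian_nb_posterior and the toy values are hypothetical and are only meant to show how the class prior and the per-feature likelihoods combine; it is not the e1071 implementation.

```r
# made-up training data: two numeric features and a two-level class
train <- data.frame(
  radius    = c(11, 12, 13, 12.5, 18, 20, 19, 21),
  texture   = c(15, 17, 16, 18, 21, 22, 25, 20),
  diagnosis = factor(c("B", "B", "B", "B", "M", "M", "M", "M"))
)

# hypothetical helper: posterior class probabilities for one new observation
gaussian_nb_posterior <- function(train, new_row, target = "diagnosis") {
  classes  <- levels(train[[target]])
  features <- setdiff(names(train), target)
  log_scores <- sapply(classes, function(cl) {
    rows <- train[train[[target]] == cl, features, drop = FALSE]
    # prior: share of training rows in this class
    log_prior <- log(nrow(rows) / nrow(train))
    # "naive" step: per-feature normal log-likelihoods are simply added up
    log_lik <- sum(mapply(function(x, col) {
      dnorm(x, mean = mean(col), sd = sd(col), log = TRUE)
    }, new_row[features], rows))
    log_prior + log_lik
  })
  # normalise on the log scale to avoid numerical underflow
  probs <- exp(log_scores - max(log_scores))
  probs / sum(probs)
}

post <- gaussian_nb_posterior(train, list(radius = 19.5, texture = 22))
round(post, 4)
```

Working on the log scale is a common design choice here: with 30 features, multiplying raw densities directly can underflow to zero.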
set.seed(777)
RNGkind(sample.kind = "Rounding")
index <- initial_split(cancer, prop = 0.8, strata = "diagnosis")
data_test <- testing(index)
data_train <- training(index)
Proportion tables
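Now we check that the class proportions are similar in the two sets. The output lists the proportions of B and M in the test set, followed by those in the training set.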
cat(
prop.table(table(data_test$diagnosis)),
prop.table(table(data_train$diagnosis))
)
## 0.626087 0.373913 0.6277533 0.3722467
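The idea behind a stratified split can be sketched in base R by sampling row indices separately within each class. This is only an illustration of the principle on simulated labels with the same class counts as the dataset; it is not how rsample implements initial_split(), and the resulting set sizes differ slightly from the ones used in this document.

```r
set.seed(777)
# class labels with the same counts as the full dataset
diagnosis <- factor(rep(c("B", "M"), times = c(357, 212)))

# draw 80% of the row indices separately within each class
rows_by_class <- split(seq_along(diagnosis), diagnosis)
train_idx <- unlist(lapply(rows_by_class, function(idx) {
  sample(idx, size = round(0.8 * length(idx)))
}), use.names = FALSE)

train_labels <- diagnosis[train_idx]
test_labels  <- diagnosis[-train_idx]

# class proportions should be nearly identical in both parts
rbind(train = prop.table(table(train_labels)),
      test  = prop.table(table(test_labels)))
```

Because the sampling is done per class, the malignant share stays close to 37% in both parts regardless of the seed.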
Model Fitting
set.seed(777)
RNGkind(sample.kind = "Rounding")
naive_model <- naiveBayes(diagnosis ~ . , data_train, laplace = 1)
data_test$pred <-
predict(naive_model, newdata = data_test, type = "class")
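type = "class" returns the most probable class for each row; with two classes this is equivalent to a threshold of 0.5 on the posterior probability.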
Result of Naive Bayes Model prediction
The confusion matrix gives us a good understanding of how well the model is doing. Overall accuracy is about 92% (0.9217), while sensitivity and precision (positive predictive value) are about 91% and 89% respectively. We can still try other models and see which one is more accurate.
set.seed(777)
RNGkind(sample.kind = "Rounding")
confusionMatrix(data_test$pred, reference = data_test$diagnosis,
positive = "M")
## Confusion Matrix and Statistics
##
## Reference
## Prediction B M
## B 67 4
## M 5 39
##
## Accuracy : 0.9217
## 95% CI : (0.8566, 0.9636)
## No Information Rate : 0.6261
## P-Value [Acc > NIR] : 3.235e-13
##
## Kappa : 0.8336
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9070
## Specificity : 0.9306
## Pos Pred Value : 0.8864
## Neg Pred Value : 0.9437
## Prevalence : 0.3739
## Detection Rate : 0.3391
## Detection Prevalence : 0.3826
## Balanced Accuracy : 0.9188
##
## 'Positive' Class : M
##
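The headline statistics can be recomputed by hand from the four cell counts of the Naive Bayes confusion matrix above, which is a useful check on which number means what. The variable names tp, fn, fp and tn are introduced here for illustration only.

```r
# counts copied from the Naive Bayes confusion matrix (positive class = M)
tp <- 39  # predicted M, actually M
fn <- 4   # predicted B, actually M
fp <- 5   # predicted M, actually B
tn <- 67  # predicted B, actually B

accuracy    <- (tp + tn) / (tp + tn + fp + fn)
sensitivity <- tp / (tp + fn)   # recall: share of malignant cases found
specificity <- tn / (tn + fp)   # share of benign cases correctly identified
precision   <- tp / (tp + fp)   # positive predictive value

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, precision = precision), 4)
```

These reproduce the Accuracy, Sensitivity, Specificity and Pos Pred Value lines of the caret output (0.9217, 0.9070, 0.9306 and 0.8864).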
Evaluation ROC and AUC
Return Matrix of Class Probabilities
The ROC curve shows the relationship between the True Positive Rate (sensitivity, or recall) and the False Positive Rate (1 - specificity) at each threshold. A good model ideally has a high True Positive Rate and a low False Positive Rate.
To plot our ROC curve, we can have our Naive Bayes model return the posterior probabilities for each class instead of the “class” with maximal probability.
# change prediction to probability
data_test$pred <-
predict(naive_model, newdata = data_test, type = "raw")
The predict function lets you specify whether you want the most probable class or the probability of every class. The only change is that the type parameter is set to “raw”.
data_test$actual <- ifelse(data_test$diagnosis == "B", 1 , 0)
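Recode the labels as 1 (benign) and 0 (malignant). Column 1 of the probability matrix holds the probability of B, so benign is treated as the positive class for the ROC curve, whereas the confusion matrices use M as the positive class.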
Creating our ROC Curve:
# object prediction
cancer_pred <- prediction(prediction = data_test$pred[, 1],
labels = data_test$actual)
# ROC Curve
plot(performance(prediction.obj = cancer_pred, measure = "tpr",
x.measure = "fpr"))
AUC score
auc_score <-
performance(prediction.obj = cancer_pred, measure = "auc")
auc_score@y.values
## [[1]]
## [1] 0.9838501
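To show what the ROC curve and the AUC measure, the following base R sketch computes both by hand on eight made-up scores. The vectors scores and labels are invented for illustration and have nothing to do with the model output above; ROCR performs the equivalent bookkeeping internally.

```r
# made-up predicted probabilities and true labels (1 = positive class)
scores <- c(0.95, 0.85, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10)
labels <- c(1, 1, 0, 1, 1, 0, 0, 0)

# every distinct score is tried as a classification threshold
thresholds <- sort(unique(scores), decreasing = TRUE)
tpr <- sapply(thresholds, function(thr) {
  sum(scores >= thr & labels == 1) / sum(labels == 1)
})
fpr <- sapply(thresholds, function(thr) {
  sum(scores >= thr & labels == 0) / sum(labels == 0)
})

# AUC as the probability that a random positive case is scored higher
# than a random negative case (ties count one half)
pos <- scores[labels == 1]
neg <- scores[labels == 0]
auc <- mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))
auc  # 14 of the 16 positive-negative pairs are ordered correctly: 0.875

# the ROC curve itself, starting from the origin
plot(c(0, fpr), c(0, tpr), type = "s",
     xlab = "False positive rate", ylab = "True positive rate")
```

Read this way, an AUC of about 0.98 means that the model ranks a randomly chosen case of one class above a randomly chosen case of the other class about 98% of the time.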
Decision tree model
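Decision tree algorithms use the training data to segment the predictor space into non-overlapping regions, the terminal nodes of the tree. Each node is described by a set of rules, which are then used to predict new responses. The predicted value for each node is the most common response in the node (classification) or the mean response in the node (regression). Note that the tree here is fitted on the full cancer data frame, which includes the rows of data_test, so the test accuracy reported for it is likely optimistic; fitting on data_train would give a fairer comparison with the other models.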
tree_cancer <- ctree(diagnosis ~ . , cancer)
plot(tree_cancer, type = "simple")
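The simplest possible tree has a single split. The sketch below finds the threshold on one made-up feature that minimises the number of misclassified rows when each side predicts its majority class. The data and the variable names are hypothetical, and the criterion is an assumption chosen for simplicity: ctree() selects its splits with statistical significance tests rather than by counting errors.

```r
# made-up feature values (already sorted) and class labels
x <- c(11, 12, 13, 14, 15, 17, 18, 20)
y <- factor(c("B", "B", "B", "B", "M", "M", "B", "M"))

# candidate cut points: midpoints between neighbouring values
xs <- sort(unique(x))
candidates <- (head(xs, -1) + tail(xs, -1)) / 2

# training errors if each side of the cut predicts its majority class
errors <- sapply(candidates, function(cut) {
  left  <- y[x < cut]
  right <- y[x >= cut]
  sum(left  != names(which.max(table(left)))) +
    sum(right != names(which.max(table(right))))
})

best_cut <- candidates[which.min(errors)]
best_cut     # 14.5
min(errors)  # 1 row is misclassified (the B at x = 18)
```

A full tree repeats this search recursively inside each resulting region and over all predictors.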
Decision tree model prediction on data test
set.seed(777)
RNGkind(sample.kind = "Rounding")
# predict classes on the test data
# (stored in its own variable instead of being attached to the model object)
tree_pred <- predict(tree_cancer, newdata = data_test,
                     type = "response")
# confusion matrix on the test data
confusionMatrix(data = tree_pred,
                reference = data_test$diagnosis,
                positive = "M")
## Confusion Matrix and Statistics
##
## Reference
## Prediction B M
## B 68 3
## M 4 40
##
## Accuracy : 0.9391
## 95% CI : (0.8786, 0.9752)
## No Information Rate : 0.6261
## P-Value [Acc > NIR] : 5.446e-15
##
## Kappa : 0.8706
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9302
## Specificity : 0.9444
## Pos Pred Value : 0.9091
## Neg Pred Value : 0.9577
## Prevalence : 0.3739
## Detection Rate : 0.3478
## Detection Prevalence : 0.3826
## Balanced Accuracy : 0.9373
##
## 'Positive' Class : M
##
Decision tree model prediction on data train
set.seed(777)
RNGkind(sample.kind = "Rounding")
pred_train <- predict(tree_cancer, newdata = data_train,
type = "response")
confusionMatrix(pred_train,
reference = data_train$diagnosis,
positive = "M")
## Confusion Matrix and Statistics
##
## Reference
## Prediction B M
## B 278 13
## M 7 156
##
## Accuracy : 0.9559
## 95% CI : (0.9328, 0.9729)
## No Information Rate : 0.6278
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9051
##
## Mcnemar's Test P-Value : 0.2636
##
## Sensitivity : 0.9231
## Specificity : 0.9754
## Pos Pred Value : 0.9571
## Neg Pred Value : 0.9553
## Prevalence : 0.3722
## Detection Rate : 0.3436
## Detection Prevalence : 0.3590
## Balanced Accuracy : 0.9493
##
## 'Positive' Class : M
##
- Accuracy on data train: 0.9559
- Accuracy on data test: 0.9391
Random Forest model
A random forest is an ensemble of decision trees. Each tree is grown on a bootstrap sample of the training data, and only a random subset of the predictors is considered at each split, so the trees differ from one another. To classify a new observation, every tree casts a vote and the forest predicts the class with the most votes. Averaging over many such trees usually gives more stable predictions than a single decision tree.
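A minimal sketch of two ingredients that turn individual trees into a forest, bootstrap sampling and majority voting, is given below. The vote matrix is made up for illustration (three hypothetical trees, four hypothetical cases), and the row count of 454 is taken from the training confusion matrix shown earlier; randomForest() handles all of this internally.

```r
set.seed(42)
n <- 454  # number of training rows (assumed from the training confusion matrix)

# each tree is grown on a bootstrap sample: n rows drawn with replacement
boot_rows <- sample(n, size = n, replace = TRUE)
# share of distinct training rows that ended up in this sample
unique_share <- length(unique(boot_rows)) / n
unique_share

# made-up votes of three trees for four new cases
votes <- matrix(c("M", "M", "B",
                  "B", "B", "B",
                  "M", "B", "M",
                  "B", "M", "M"),
                nrow = 4, byrow = TRUE,
                dimnames = list(paste0("case", 1:4), paste0("tree", 1:3)))

# the forest predicts the class that receives the most votes in each row
forest_pred <- apply(votes, 1, function(v) names(which.max(table(v))))
forest_pred
```

A bootstrap sample typically contains roughly 63% of the distinct training rows; the rows left out of a given tree are what randomForest uses for its out-of-bag error estimate.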
Model Fitting
set.seed(777)
RNGkind(sample.kind = "Rounding")
model_rf<- randomForest(diagnosis ~ . ,
data = data_train)
Prediction
model_rf_pred <- predict(model_rf, data_test)
model_rf_pred
## 5 9 10 14 16 19 27 35 42 45 47 59 70 71 77 89 91 93 100 104
## M M M M M M M M B M B B B M B B B B M B
## 105 108 110 113 114 119 125 133 134 137 145 150 153 155 161 164 183 186 188 189
## B B B M B M B M M B B B B B B B M B B B
## 190 191 208 211 214 216 224 228 231 232 234 235 245 248 255 257 285 286 289 298
## B M M M M M M B M B M B M B M M B B B B
## 302 307 309 311 315 316 322 331 340 346 350 351 356 360 365 367 370 377 383 389
## B B B B B B M M M B B B B B B M M B B B
## 393 394 397 399 401 407 416 420 425 429 432 433 440 447 450 453 454 455 462 465
## M M B B M M B B B B B M B M M B B B M B
## 472 473 479 483 492 495 499 500 525 527 540 555 560 565 568
## B B B B M B M M B B B B B M M
## Levels: B M
Result of Random Forest model prediction
confusionMatrix(model_rf_pred,
data_test$diagnosis,
positive = "M")
## Confusion Matrix and Statistics
##
## Reference
## Prediction B M
## B 68 2
## M 4 41
##
## Accuracy : 0.9478
## 95% CI : (0.8899, 0.9806)
## No Information Rate : 0.6261
## P-Value [Acc > NIR] : 5.754e-16
##
## Kappa : 0.8896
##
## Mcnemar's Test P-Value : 0.6831
##
## Sensitivity : 0.9535
## Specificity : 0.9444
## Pos Pred Value : 0.9111
## Neg Pred Value : 0.9714
## Prevalence : 0.3739
## Detection Rate : 0.3565
## Detection Prevalence : 0.3913
## Balanced Accuracy : 0.9490
##
## 'Positive' Class : M
##
Summary
In conclusion, I built Naive Bayes, Decision Tree and Random Forest classifiers to predict the diagnosis (M = malignant, B = benign). All three models reach a test accuracy of about 92% or higher: roughly 92% for Naive Bayes, 94% for the Decision Tree and 95% for the Random Forest. The Random Forest therefore has the highest accuracy of the three, although the differences are small on a test set of 115 observations.