Prediction of Breast Cancer

Harnsen

24/12/2021

Data Source : https://www.kaggle.com/uciml/breast-cancer-wisconsin-data

Introduction

Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image. The linear program used to obtain the separating plane in the 3-dimensional space is described in: [K. P. Bennett and O. L. Mangasarian: “Robust Linear Programming Discrimination of Two Linearly Inseparable Sets”, Optimization Methods and Software 1, 1992, 23-34].

This database is also available through the UW CS ftp server: ftp ftp.cs.wisc.edu, then cd math-prog/cpo-dataset/machine-learn/WDBC/

It can also be found on the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29

Attribute Information:

  1. ID number
  2. Diagnosis (M = malignant, B = benign)
  3-32. Ten real-valued features are computed for each cell nucleus:

  1. radius (mean of distances from center to points on the perimeter)
  2. texture (standard deviation of gray-scale values)
  3. perimeter
  4. area
  5. smoothness (local variation in radius lengths)
  6. compactness (perimeter^2 / area - 1.0)
  7. concavity (severity of concave portions of the contour)
  8. concave points (number of concave portions of the contour)
  9. symmetry
  10. fractal dimension (“coastline approximation” - 1)

The mean, standard error and “worst” or largest (mean of the three largest values) of these features were computed for each image, resulting in 30 features. For instance, field 3 is Mean Radius, field 13 is Radius SE, field 23 is Worst Radius.
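
As a quick check of this layout, the sketch below rebuilds the 32 field names in base R. The naming pattern (feature, underscore, statistic) is assumed from the column names of the CSV loaded later in this report; the variable names `features`, `stats` and `fields` are only illustrative.

```r
# Ten base features, in the order listed above
features <- c("radius", "texture", "perimeter", "area", "smoothness",
              "compactness", "concavity", "concave.points", "symmetry",
              "fractal_dimension")
stats <- c("mean", "se", "worst")

# outer() builds a 10 x 3 matrix of names; as.vector() reads it column by
# column, so all means come first, then all standard errors, then all worsts
feature_cols <- as.vector(outer(features, stats, paste, sep = "_"))
fields <- c("id", "diagnosis", feature_cols)

length(feature_cols)
# [1] 30
fields[c(3, 13, 23)]
# [1] "radius_mean"  "radius_se"    "radius_worst"
```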

All feature values are recoded with four significant digits.

Missing attribute values: none

Class distribution: 357 benign, 212 malignant
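
These counts also fix the baseline that any classifier has to beat. A minimal sketch, using only the two counts above (the object names are illustrative):

```r
# Class counts as stated in the dataset description
counts <- c(B = 357, M = 212)

# Share of each class; always predicting the majority class (B)
# would already be right about 63% of the time
props <- prop.table(counts)
round(props, 3)
#     B     M
# 0.627 0.373

baseline_accuracy <- max(props)
```

This majority-class share is what caret reports as the "No Information Rate" in its confusion matrix output.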

Load the required packages

library(tidyverse)
library(readr)
library(GGally)

# Naive Bayes
library(e1071)

# confusionMatrix
library(caret)

# Split data Train & Test
library(rsample)
library(ROCR)

# Decision Tree
library(partykit)

# randomForest
library(randomForest)

Read Data

Data Frame

cancer <- read.csv("breastcancer.csv")
head(cancer,3)
##         id diagnosis radius_mean texture_mean perimeter_mean area_mean
## 1   842302         M       17.99        10.38          122.8      1001
## 2   842517         M       20.57        17.77          132.9      1326
## 3 84300903         M       19.69        21.25          130.0      1203
##   smoothness_mean compactness_mean concavity_mean concave.points_mean
## 1         0.11840          0.27760         0.3001             0.14710
## 2         0.08474          0.07864         0.0869             0.07017
## 3         0.10960          0.15990         0.1974             0.12790
##   symmetry_mean fractal_dimension_mean radius_se texture_se perimeter_se
## 1        0.2419                0.07871    1.0950     0.9053        8.589
## 2        0.1812                0.05667    0.5435     0.7339        3.398
## 3        0.2069                0.05999    0.7456     0.7869        4.585
##   area_se smoothness_se compactness_se concavity_se concave.points_se
## 1  153.40      0.006399        0.04904      0.05373           0.01587
## 2   74.08      0.005225        0.01308      0.01860           0.01340
## 3   94.03      0.006150        0.04006      0.03832           0.02058
##   symmetry_se fractal_dimension_se radius_worst texture_worst perimeter_worst
## 1     0.03003             0.006193        25.38         17.33           184.6
## 2     0.01389             0.003532        24.99         23.41           158.8
## 3     0.02250             0.004571        23.57         25.53           152.5
##   area_worst smoothness_worst compactness_worst concavity_worst
## 1       2019           0.1622            0.6656          0.7119
## 2       1956           0.1238            0.1866          0.2416
## 3       1709           0.1444            0.4245          0.4504
##   concave.points_worst symmetry_worst fractal_dimension_worst  X
## 1               0.2654         0.4601                 0.11890 NA
## 2               0.1860         0.2750                 0.08902 NA
## 3               0.2430         0.3613                 0.08758 NA

Data Type, Rows and Columns

glimpse(cancer)
## Rows: 569
## Columns: 33
## $ id                      <int> 842302, 842517, 84300903, 84348301, 84358402, ~
## $ diagnosis               <chr> "M", "M", "M", "M", "M", "M", "M", "M", "M", "~
## $ radius_mean             <dbl> 17.990, 20.570, 19.690, 11.420, 20.290, 12.450~
## $ texture_mean            <dbl> 10.38, 17.77, 21.25, 20.38, 14.34, 15.70, 19.9~
## $ perimeter_mean          <dbl> 122.80, 132.90, 130.00, 77.58, 135.10, 82.57, ~
## $ area_mean               <dbl> 1001.0, 1326.0, 1203.0, 386.1, 1297.0, 477.1, ~
## $ smoothness_mean         <dbl> 0.11840, 0.08474, 0.10960, 0.14250, 0.10030, 0~
## $ compactness_mean        <dbl> 0.27760, 0.07864, 0.15990, 0.28390, 0.13280, 0~
## $ concavity_mean          <dbl> 0.30010, 0.08690, 0.19740, 0.24140, 0.19800, 0~
## $ concave.points_mean     <dbl> 0.14710, 0.07017, 0.12790, 0.10520, 0.10430, 0~
## $ symmetry_mean           <dbl> 0.2419, 0.1812, 0.2069, 0.2597, 0.1809, 0.2087~
## $ fractal_dimension_mean  <dbl> 0.07871, 0.05667, 0.05999, 0.09744, 0.05883, 0~
## $ radius_se               <dbl> 1.0950, 0.5435, 0.7456, 0.4956, 0.7572, 0.3345~
## $ texture_se              <dbl> 0.9053, 0.7339, 0.7869, 1.1560, 0.7813, 0.8902~
## $ perimeter_se            <dbl> 8.589, 3.398, 4.585, 3.445, 5.438, 2.217, 3.18~
## $ area_se                 <dbl> 153.40, 74.08, 94.03, 27.23, 94.44, 27.19, 53.~
## $ smoothness_se           <dbl> 0.006399, 0.005225, 0.006150, 0.009110, 0.0114~
## $ compactness_se          <dbl> 0.049040, 0.013080, 0.040060, 0.074580, 0.0246~
## $ concavity_se            <dbl> 0.05373, 0.01860, 0.03832, 0.05661, 0.05688, 0~
## $ concave.points_se       <dbl> 0.015870, 0.013400, 0.020580, 0.018670, 0.0188~
## $ symmetry_se             <dbl> 0.03003, 0.01389, 0.02250, 0.05963, 0.01756, 0~
## $ fractal_dimension_se    <dbl> 0.006193, 0.003532, 0.004571, 0.009208, 0.0051~
## $ radius_worst            <dbl> 25.38, 24.99, 23.57, 14.91, 22.54, 15.47, 22.8~
## $ texture_worst           <dbl> 17.33, 23.41, 25.53, 26.50, 16.67, 23.75, 27.6~
## $ perimeter_worst         <dbl> 184.60, 158.80, 152.50, 98.87, 152.20, 103.40,~
## $ area_worst              <dbl> 2019.0, 1956.0, 1709.0, 567.7, 1575.0, 741.6, ~
## $ smoothness_worst        <dbl> 0.1622, 0.1238, 0.1444, 0.2098, 0.1374, 0.1791~
## $ compactness_worst       <dbl> 0.6656, 0.1866, 0.4245, 0.8663, 0.2050, 0.5249~
## $ concavity_worst         <dbl> 0.71190, 0.24160, 0.45040, 0.68690, 0.40000, 0~
## $ concave.points_worst    <dbl> 0.26540, 0.18600, 0.24300, 0.25750, 0.16250, 0~
## $ symmetry_worst          <dbl> 0.4601, 0.2750, 0.3613, 0.6638, 0.2364, 0.3985~
## $ fractal_dimension_worst <dbl> 0.11890, 0.08902, 0.08758, 0.17300, 0.07678, 0~
## $ X                       <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~

Data pre-processing

Data wrangling

The X column contains only NA values, so we drop it. We also drop id, because it is an identifier rather than a predictor, and convert diagnosis to a factor so that the models treat it as a class label.

cancer <- cancer %>%
  select(-c(id,X)) %>%
  mutate(diagnosis = as.factor(diagnosis))
head(cancer,3)
##   diagnosis radius_mean texture_mean perimeter_mean area_mean smoothness_mean
## 1         M       17.99        10.38          122.8      1001         0.11840
## 2         M       20.57        17.77          132.9      1326         0.08474
## 3         M       19.69        21.25          130.0      1203         0.10960
##   compactness_mean concavity_mean concave.points_mean symmetry_mean
## 1          0.27760         0.3001             0.14710        0.2419
## 2          0.07864         0.0869             0.07017        0.1812
## 3          0.15990         0.1974             0.12790        0.2069
##   fractal_dimension_mean radius_se texture_se perimeter_se area_se
## 1                0.07871    1.0950     0.9053        8.589  153.40
## 2                0.05667    0.5435     0.7339        3.398   74.08
## 3                0.05999    0.7456     0.7869        4.585   94.03
##   smoothness_se compactness_se concavity_se concave.points_se symmetry_se
## 1      0.006399        0.04904      0.05373           0.01587     0.03003
## 2      0.005225        0.01308      0.01860           0.01340     0.01389
## 3      0.006150        0.04006      0.03832           0.02058     0.02250
##   fractal_dimension_se radius_worst texture_worst perimeter_worst area_worst
## 1             0.006193        25.38         17.33           184.6       2019
## 2             0.003532        24.99         23.41           158.8       1956
## 3             0.004571        23.57         25.53           152.5       1709
##   smoothness_worst compactness_worst concavity_worst concave.points_worst
## 1           0.1622            0.6656          0.7119               0.2654
## 2           0.1238            0.1866          0.2416               0.1860
## 3           0.1444            0.4245          0.4504               0.2430
##   symmetry_worst fractal_dimension_worst
## 1         0.4601                 0.11890
## 2         0.2750                 0.08902
## 3         0.3613                 0.08758
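
The same clean-up can be written without hard-coding the name of the empty column. The sketch below uses base R on a small made-up data frame (`toy` and its values are hypothetical, not rows from the real file) and drops every column that is entirely NA.

```r
# Hypothetical three-row stand-in for the raw CSV
toy <- data.frame(
  id          = 1:3,
  diagnosis   = c("M", "B", "M"),
  radius_mean = c(17.99, 11.42, 20.57),
  X           = NA,
  stringsAsFactors = FALSE
)

# Columns in which every value is missing
all_na <- names(toy)[colSums(is.na(toy)) == nrow(toy)]

# Drop the identifier and the empty columns, then make the target a factor
clean <- toy[, setdiff(names(toy), c("id", all_na)), drop = FALSE]
clean$diagnosis <- factor(clean$diagnosis, levels = c("B", "M"))

all_na
# [1] "X"
```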

Check missing value

Before we start exploring the data, we make sure that it has no missing values.

colSums(is.na(cancer))
##               diagnosis             radius_mean            texture_mean 
##                       0                       0                       0 
##          perimeter_mean               area_mean         smoothness_mean 
##                       0                       0                       0 
##        compactness_mean          concavity_mean     concave.points_mean 
##                       0                       0                       0 
##           symmetry_mean  fractal_dimension_mean               radius_se 
##                       0                       0                       0 
##              texture_se            perimeter_se                 area_se 
##                       0                       0                       0 
##           smoothness_se          compactness_se            concavity_se 
##                       0                       0                       0 
##       concave.points_se             symmetry_se    fractal_dimension_se 
##                       0                       0                       0 
##            radius_worst           texture_worst         perimeter_worst 
##                       0                       0                       0 
##              area_worst        smoothness_worst       compactness_worst 
##                       0                       0                       0 
##         concavity_worst    concave.points_worst          symmetry_worst 
##                       0                       0                       0 
## fractal_dimension_worst 
##                       0

Since every column has a count of zero missing values, we can proceed to the next step.

Columns Names

colnames(cancer)
##  [1] "diagnosis"               "radius_mean"            
##  [3] "texture_mean"            "perimeter_mean"         
##  [5] "area_mean"               "smoothness_mean"        
##  [7] "compactness_mean"        "concavity_mean"         
##  [9] "concave.points_mean"     "symmetry_mean"          
## [11] "fractal_dimension_mean"  "radius_se"              
## [13] "texture_se"              "perimeter_se"           
## [15] "area_se"                 "smoothness_se"          
## [17] "compactness_se"          "concavity_se"           
## [19] "concave.points_se"       "symmetry_se"            
## [21] "fractal_dimension_se"    "radius_worst"           
## [23] "texture_worst"           "perimeter_worst"        
## [25] "area_worst"              "smoothness_worst"       
## [27] "compactness_worst"       "concavity_worst"        
## [29] "concave.points_worst"    "symmetry_worst"         
## [31] "fractal_dimension_worst"

colnames() lists the names of the columns in the data frame.

Exploratory data analysis (EDA)

Before going any further, we perform a basic exploratory data analysis. Both outcomes are reasonably well represented in the data (357 benign, 212 malignant), so oversampling is not required. The distribution plots also show nothing unusual.

Correlation plot

ggcorr(
  cancer,
  label = TRUE,
  label_size = 3,
  hjust = 1,
  layout.exp = 4
)
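
A correlation plot is easier to act on when the strongly correlated pairs are also listed numerically. The sketch below does this with base R `cor()` on synthetic data (the data frame `toy` is made up so that perimeter and area depend on radius); the 0.9 cutoff is an arbitrary illustrative choice, not a value used in this report.

```r
set.seed(1)
radius <- runif(100, min = 5, max = 25)
toy <- data.frame(
  radius    = radius,
  perimeter = 2 * pi * radius,   # exactly linear in radius
  area      = pi * radius^2,     # increasing function of radius
  texture   = rnorm(100)         # unrelated noise
)

corr <- cor(toy)

# Keep each pair once (upper triangle) and only if |r| exceeds the cutoff
high <- which(abs(corr) > 0.9 & upper.tri(corr), arr.ind = TRUE)
pairs <- data.frame(
  var1 = rownames(corr)[high[, "row"]],
  var2 = colnames(corr)[high[, "col"]],
  r    = corr[high],
  stringsAsFactors = FALSE
)
pairs
```

On the real data the same steps could be applied to `cancer[, -1]`, that is, all columns except diagnosis.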

Distribution of each feature

ggplot(gather(cancer[,2:ncol(cancer)]), aes(value)) + 
  geom_histogram(bins = 5, fill = "blue", alpha = 0.6) + 
  facet_wrap(~key, scales = 'free_x')

Distribution of the objective variable

p1 <- ggplot(cancer, aes(x = diagnosis, fill = diagnosis)) +
  geom_bar(stat = "count", position = "stack", show.legend = FALSE) +
  theme_minimal(base_size = 16) +
  # after_stat(count) is the current replacement for the older ..count.. notation
  geom_label(stat = "count", aes(label = after_stat(count)), position = position_stack(vjust = 0.5),
             size = 5, show.legend = FALSE)

p1 

Density plot

# Density plots for four features at a time, starting at column i
densityplot <- function(cancer, i = 3) {
  last <- min(i + 3, ncol(cancer))  # do not run past the last column
  featurePlot(x = cancer[, i:last],
              y = cancer$diagnosis,
              plot = "density",
              auto.key = list(columns = 2))
}
# Column 1 is diagnosis, so all features are covered by i in seq(2, ncol(cancer), 4)
densityplot(cancer, 7)

Naive Bayes Model

The Naive Bayes classifier is based on Bayes’ theorem with an assumption of independence between predictors. A Naive Bayes model is easy to build, with no complicated iterative parameter estimation, which makes it particularly useful for very large datasets. Despite its simplicity, it often performs surprisingly well and can be competitive with more sophisticated classification methods.
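
To make the independence assumption concrete, here is a minimal sketch of a Gaussian Naive Bayes posterior computed by hand in base R. The training values and the helper `gaussian_nb_posterior()` are hypothetical and only illustrate the idea; this is not the internal code of e1071.

```r
# Made-up training data with two numeric features
train <- data.frame(
  diagnosis = factor(c("B", "B", "B", "M", "M", "M")),
  radius    = c(11, 12, 13, 18, 20, 19),
  texture   = c(15, 17, 16, 21, 23, 22)
)
new_obs <- c(radius = 19.5, texture = 22.5)

gaussian_nb_posterior <- function(train, new_obs, target = "diagnosis") {
  classes <- levels(train[[target]])
  scores <- sapply(classes, function(cl) {
    rows  <- train[train[[target]] == cl, names(new_obs), drop = FALSE]
    prior <- nrow(rows) / nrow(train)
    # Independence assumption: the joint likelihood is the product of
    # one Gaussian density per feature, using the class mean and sd
    lik <- prod(mapply(function(x, col) dnorm(x, mean(col), sd(col)),
                       new_obs, rows))
    prior * lik
  })
  scores / sum(scores)  # normalise so the posteriors sum to 1
}

posterior <- gaussian_nb_posterior(train, new_obs)
names(which.max(posterior))
# [1] "M"
```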

Data Manipulation / Train-Test Split

We now need to ensure our data is in the correct format and split it into training and test datasets. We use an 80-20 split, stratified by diagnosis.

set.seed(777)
RNGkind(sample.kind = "Rounding")
index <- initial_split(cancer, prop = 0.8, strata = "diagnosis")
data_test <- testing(index)
data_train <- training(index)

Proportion tables

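Now we check that the class proportions are similar in both sets. The output lists the proportions of B and M in the test set, followed by those in the training set.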

cat(
prop.table(table(data_test$diagnosis)),
prop.table(table(data_train$diagnosis))
)
## 0.626087 0.373913 0.6277533 0.3722467
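
The effect of stratification can be sketched in base R: sample 80% of the rows within each class separately, so both sets keep roughly the same class mix. The helper `stratified_split()` is hypothetical and only mimics the idea behind `initial_split(..., strata = ...)`; it is not how rsample is implemented.

```r
# Return the row indices of a training set that keeps `prop` of each class
stratified_split <- function(labels, prop = 0.8) {
  by_class <- split(seq_along(labels), labels)
  idx <- lapply(by_class, function(i) {
    i[sample.int(length(i), size = floor(prop * length(i)))]
  })
  sort(unname(unlist(idx)))
}

set.seed(777)
# Labels with the class counts of this dataset (357 benign, 212 malignant)
labels <- factor(rep(c("B", "M"), times = c(357, 212)))
train_idx <- stratified_split(labels)

table(labels[train_idx])
#
#   B   M
# 285 169
round(prop.table(table(labels[train_idx])), 4)
#
#      B      M
# 0.6278 0.3722
```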

Model Fitting

set.seed(777)
RNGkind(sample.kind = "Rounding")
naive_model <- naiveBayes(diagnosis ~ . , data_train, laplace = 1)
data_test$pred <-
  predict(naive_model, newdata = data_test, type = "class")
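  • type = "class" returns the most probable class (with two classes, this is equivalent to a 0.5 probability threshold)
  • laplace = 1 only affects categorical predictors; all predictors here are numeric, so it has no effect on this model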

Result of Naive Bayes Model prediction

The confusion matrix gives us a good understanding of how well the model is doing. Overall accuracy is about 92% (0.9217), sensitivity is about 91% (0.9070) and precision (positive predictive value) is about 89% (0.8864). Next, we try other models and see which one is more accurate.

set.seed(777)
RNGkind(sample.kind = "Rounding")
confusionMatrix(data_test$pred, reference = data_test$diagnosis,
                positive = "M")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  B  M
##          B 67  4
##          M  5 39
##                                           
##                Accuracy : 0.9217          
##                  95% CI : (0.8566, 0.9636)
##     No Information Rate : 0.6261          
##     P-Value [Acc > NIR] : 3.235e-13       
##                                           
##                   Kappa : 0.8336          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.9070          
##             Specificity : 0.9306          
##          Pos Pred Value : 0.8864          
##          Neg Pred Value : 0.9437          
##              Prevalence : 0.3739          
##          Detection Rate : 0.3391          
##    Detection Prevalence : 0.3826          
##       Balanced Accuracy : 0.9188          
##                                           
##        'Positive' Class : M               
## 
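
The headline statistics can be recomputed by hand from the four cell counts, which is a useful check on how caret defines each one when the positive class is M. A minimal sketch in base R (the object names are illustrative):

```r
# Counts copied from the confusion matrix above (rows = prediction)
cm <- matrix(c(67, 5, 4, 39), nrow = 2,
             dimnames = list(Prediction = c("B", "M"),
                             Reference  = c("B", "M")))

tp <- cm["M", "M"]  # malignant, predicted malignant
fn <- cm["B", "M"]  # malignant, predicted benign
fp <- cm["M", "B"]  # benign, predicted malignant
tn <- cm["B", "B"]  # benign, predicted benign

metrics <- c(
  accuracy    = (tp + tn) / sum(cm),
  sensitivity = tp / (tp + fn),
  specificity = tn / (tn + fp),
  precision   = tp / (tp + fp)   # "Pos Pred Value" in the caret output
)
round(metrics, 4)
#    accuracy sensitivity specificity   precision
#      0.9217      0.9070      0.9306      0.8864
```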

Evaluation ROC and AUC

Return Matrix of Class Probabilities

The ROC curve shows the relationship between the True Positive Rate (sensitivity or recall) and the False Positive Rate (1 - specificity) at each classification threshold. A good model ideally has a high True Positive Rate and a low False Positive Rate.

To plot the ROC curve, we have the Naive Bayes model return the posterior probabilities for each class instead of the class with the maximal probability.

# change prediction to probability
data_test$pred <-
  predict(naive_model, newdata = data_test, type = "raw")

The predict function lets you choose between the most probable class and the probability of every class. The call is unchanged except that the type parameter is set to “raw”.

data_test$actual <- ifelse(data_test$diagnosis == "B", 1 , 0)

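Encode the true labels as 1 (benign) or 0 (malignant). Benign is treated as the positive class here because the first column of the probability matrix, which is used to build the prediction object, holds the probability of class B.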

Creating our ROC Curve:

# object prediction
cancer_pred <- prediction(prediction = data_test$pred[, 1],
                          labels = data_test$actual)

# ROC Curve
plot(performance(prediction.obj = cancer_pred, measure = "tpr",
                 x.measure = "fpr"))

AUC score

auc_score <-
  performance(prediction.obj = cancer_pred, measure = "auc")
auc_score@y.values
## [[1]]
## [1] 0.9838501
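
The AUC has a simple interpretation: it is the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below computes it from ranks in base R on six made-up scores; `auc_by_rank()` is a hypothetical helper, not part of ROCR.

```r
# Rank-based (Mann-Whitney) AUC; ties receive average ranks
auc_by_rank <- function(scores, labels) {
  r     <- rank(scores)
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

scores <- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.3)  # hypothetical predicted probabilities
labels <- c(1,   1,   0,   1,   0,   0)    # 1 = positive class

auc <- auc_by_rank(scores, labels)
auc
# [1] 0.8888889
```

Here 8 of the 9 positive-negative pairs are ordered correctly, which gives 8/9.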

Decision tree model

Decision tree algorithms use the training data to segment the predictor space into non-overlapping regions, which correspond to the terminal nodes of the tree. Each node is described by a set of rules, which are then used to predict new responses. The predicted value for each node is the most common response in the node (classification) or the mean response in the node (regression).
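
The idea of segmenting the predictor space can be illustrated with a single split. The sketch below searches for the cut point on one feature that minimises misclassifications, with each side predicting its majority class. Note that this is a simplified, assumed criterion for illustration: `ctree()` chooses splits with conditional inference tests, not by this search, and `best_stump()` and the data are hypothetical.

```r
best_stump <- function(x, y) {
  vals <- sort(unique(x))
  cuts <- (head(vals, -1) + tail(vals, -1)) / 2  # midpoints between values
  errs <- sapply(cuts, function(cut) {
    left  <- y[x <= cut]
    right <- y[x > cut]
    # Each region predicts its most common class; count the mistakes
    sum(left  != names(which.max(table(left)))) +
      sum(right != names(which.max(table(right))))
  })
  list(cut = cuts[which.min(errs)], errors = min(errs))
}

# Made-up radius values and labels
x <- c(11, 12, 13, 14, 18, 19, 20)
y <- factor(c("B", "B", "B", "B", "M", "M", "M"))

stump <- best_stump(x, y)
stump$cut
# [1] 16
stump$errors
# [1] 0
```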

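# Note: the tree is fitted on the full cancer data (training and test rows),
# so the test-set metrics reported below are optimistic; fitting on data_train
# would give an honest out-of-sample estimate.
tree_cancer <- ctree(diagnosis ~ . , cancer)
plot(tree_cancer, type = "simple")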

Decision tree model prediction on data test

set.seed(777)
RNGkind(sample.kind = "Rounding")

# predict classes on the test data
pred_test <- predict(tree_cancer, newdata = data_test,
                     type = "response")

# confusion matrix on the test data
confusionMatrix(data = pred_test,
                reference = data_test$diagnosis,
                positive = "M")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  B  M
##          B 68  3
##          M  4 40
##                                           
##                Accuracy : 0.9391          
##                  95% CI : (0.8786, 0.9752)
##     No Information Rate : 0.6261          
##     P-Value [Acc > NIR] : 5.446e-15       
##                                           
##                   Kappa : 0.8706          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.9302          
##             Specificity : 0.9444          
##          Pos Pred Value : 0.9091          
##          Neg Pred Value : 0.9577          
##              Prevalence : 0.3739          
##          Detection Rate : 0.3478          
##    Detection Prevalence : 0.3826          
##       Balanced Accuracy : 0.9373          
##                                           
##        'Positive' Class : M               
## 

Decision tree model prediction on data train

set.seed(777)
RNGkind(sample.kind = "Rounding")
pred_train <- predict(tree_cancer, newdata = data_train,
                      type = "response")

confusionMatrix(pred_train,
                reference = data_train$diagnosis,
                positive = "M")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   B   M
##          B 278  13
##          M   7 156
##                                           
##                Accuracy : 0.9559          
##                  95% CI : (0.9328, 0.9729)
##     No Information Rate : 0.6278          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9051          
##                                           
##  Mcnemar's Test P-Value : 0.2636          
##                                           
##             Sensitivity : 0.9231          
##             Specificity : 0.9754          
##          Pos Pred Value : 0.9571          
##          Neg Pred Value : 0.9553          
##              Prevalence : 0.3722          
##          Detection Rate : 0.3436          
##    Detection Prevalence : 0.3590          
##       Balanced Accuracy : 0.9493          
##                                           
##        'Positive' Class : M               
## 
  • Accuracy on data train: 0.9559
  • Accuracy on data test: 0.9391

Random Forest model

A Random Forest is an ensemble of decision trees. Each tree is grown on a bootstrap sample of the training data (rows drawn at random with replacement), and only a random subset of the predictors is considered at each split. To classify a new observation, every tree predicts a class and the forest returns the majority vote. Combining many decorrelated trees usually makes the forest more accurate and more stable than a single tree.
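
Two ingredients of a forest, bootstrap sampling and majority voting, can be sketched in a few lines of base R. The tree predictions below are made up for illustration; a real forest such as the one built by randomForest() would grow each tree from its own bootstrap sample.

```r
set.seed(777)
n <- 10

# Bootstrap sample: n row indices drawn with replacement for one tree
boot_idx <- sample.int(n, size = n, replace = TRUE)
# "Out-of-bag" rows that this tree never sees
oob_idx <- setdiff(seq_len(n), boot_idx)

# Hypothetical class predictions of three trees for three new observations
votes <- rbind(
  tree1 = c("M", "B", "B"),
  tree2 = c("M", "M", "B"),
  tree3 = c("B", "M", "B")
)

# Majority vote per observation (column)
forest_pred <- apply(votes, 2, function(v) names(which.max(table(v))))
forest_pred
# [1] "M" "M" "B"
```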

Model Fitting

set.seed(777)
RNGkind(sample.kind = "Rounding")
model_rf <- randomForest(diagnosis ~ .,
                         data = data_train)

Prediction

model_rf_pred <- predict(model_rf, data_test)
model_rf_pred
##   5   9  10  14  16  19  27  35  42  45  47  59  70  71  77  89  91  93 100 104 
##   M   M   M   M   M   M   M   M   B   M   B   B   B   M   B   B   B   B   M   B 
## 105 108 110 113 114 119 125 133 134 137 145 150 153 155 161 164 183 186 188 189 
##   B   B   B   M   B   M   B   M   M   B   B   B   B   B   B   B   M   B   B   B 
## 190 191 208 211 214 216 224 228 231 232 234 235 245 248 255 257 285 286 289 298 
##   B   M   M   M   M   M   M   B   M   B   M   B   M   B   M   M   B   B   B   B 
## 302 307 309 311 315 316 322 331 340 346 350 351 356 360 365 367 370 377 383 389 
##   B   B   B   B   B   B   M   M   M   B   B   B   B   B   B   M   M   B   B   B 
## 393 394 397 399 401 407 416 420 425 429 432 433 440 447 450 453 454 455 462 465 
##   M   M   B   B   M   M   B   B   B   B   B   M   B   M   M   B   B   B   M   B 
## 472 473 479 483 492 495 499 500 525 527 540 555 560 565 568 
##   B   B   B   B   M   B   M   M   B   B   B   B   B   M   M 
## Levels: B M

Result of Random Forest model prediction

confusionMatrix(model_rf_pred,
                data_test$diagnosis,
                positive = "M")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  B  M
##          B 68  2
##          M  4 41
##                                           
##                Accuracy : 0.9478          
##                  95% CI : (0.8899, 0.9806)
##     No Information Rate : 0.6261          
##     P-Value [Acc > NIR] : 5.754e-16       
##                                           
##                   Kappa : 0.8896          
##                                           
##  Mcnemar's Test P-Value : 0.6831          
##                                           
##             Sensitivity : 0.9535          
##             Specificity : 0.9444          
##          Pos Pred Value : 0.9111          
##          Neg Pred Value : 0.9714          
##              Prevalence : 0.3739          
##          Detection Rate : 0.3565          
##    Detection Prevalence : 0.3913          
##       Balanced Accuracy : 0.9490          
##                                           
##        'Positive' Class : M               
## 

Summary

In conclusion, I built three classifiers (Naive Bayes, Decision Tree and Random Forest) to predict the diagnosis (M = malignant, B = benign). All three reach a test-set accuracy of about 92% or higher. The Random Forest has the highest accuracy (~95%, 0.9478), ahead of the Decision Tree (~94%, 0.9391) and Naive Bayes (~92%, 0.9217). Note that the decision tree was fitted on the full data set rather than on the training split only, so its test accuracy is likely optimistic.
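
The comparison can be reproduced from the three test-set confusion matrices reported in this document. The sketch below puts accuracy, sensitivity and precision side by side, with M as the positive class; the helper `metrics_from_counts()` is hypothetical.

```r
metrics_from_counts <- function(tn, fn, fp, tp) {
  c(accuracy    = (tp + tn) / (tp + tn + fp + fn),
    sensitivity = tp / (tp + fn),
    precision   = tp / (tp + fp))
}

# Cell counts copied from the confusion matrices on the test data
results <- rbind(
  naive_bayes   = metrics_from_counts(tn = 67, fn = 4, fp = 5, tp = 39),
  decision_tree = metrics_from_counts(tn = 68, fn = 3, fp = 4, tp = 40),
  random_forest = metrics_from_counts(tn = 68, fn = 2, fp = 4, tp = 41)
)
round(results, 4)
#               accuracy sensitivity precision
# naive_bayes     0.9217      0.9070    0.8864
# decision_tree   0.9391      0.9302    0.9091
# random_forest   0.9478      0.9535    0.9111
```

With only 115 test rows, the gap between models amounts to one to three observations, so the ranking should be read with some caution.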