Breast Cancer

Introduction

Welcome back to my RPubs. This time we will study breast cancer with machine learning. Today we will build classification models on this data to learn what separates benign cases from malignant ones.

About Breast Cancer

Breast cancer is a disease in which cells in the breast grow out of control. There are different kinds of breast cancer. The kind of breast cancer depends on which cells in the breast turn into cancer.

Breast cancer can begin in different parts of the breast. A breast is made up of three main parts: lobules, ducts, and connective tissue. The lobules are the glands that produce milk. The ducts are tubes that carry milk to the nipple. The connective tissue (which consists of fibrous and fatty tissue) surrounds and holds everything together. Most breast cancers begin in the ducts or lobules.

Source : https://www.cdc.gov/cancer/breast/basic_info/what-is-breast-cancer.htm

Prepare Data

Import Dataset

dataset = read.csv("breast-cancer-wisconsin.csv")
head(dataset)

##   X1000025 X5 X1 X1.1 X1.2 X2 X1.3 X3 X1.4 X1.5 X2.1
## 1  1002945  5  4    4    5  7   10  3    2    1    2
## 2  1015425  3  1    1    1  2    2  3    1    1    2
## 3  1016277  6  8    8    1  3    4  3    7    1    2
## 4  1017023  4  1    1    3  2    1  3    1    1    2
## 5  1017122  8 10   10    8  7   10  9    7    1    4
## 6  1018099  1  1    1    1  2   10  3    1    1    2

As we can see, the columns do not have proper names: the file has no header row, so the first record was used as the column names. We need to rename the columns, and for that we need the variable descriptions on this site, which has a lot of information about the data.
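One way to avoid losing the first record is to tell `read.csv()` that the file has no header and to supply the names directly. The sketch below is only an illustration: it writes two sample rows (the first two records visible in the `head()` output above) to a temporary file, and the column names are the ones shown by `glimpse()` in this post.

```r
# Two sample records in the same layout as the raw file (no header row).
raw_lines <- c(
  "1000025,5,1,1,1,2,1,3,1,1,2",
  "1002945,5,4,4,5,7,10,3,2,1,2"
)
tmp <- tempfile(fileext = ".csv")
writeLines(raw_lines, tmp)

# Column names as they appear in the glimpse() output of this post.
col_names <- c("Sample_code_number", "Clump_Thickness",
               "Uniformity_of_Cell_Size", "Uniformity_of_Cell_Shape",
               "Marginal_Adhesion", "Single_Epithelial_Cell_Size",
               "Bare_Nuclei", "Bland_Chromatin", "Normal_Nucleoli",
               "Mitoses", "Class")

# header = FALSE keeps the first record as data instead of using it as names.
sample_data <- read.csv(tmp, header = FALSE, col.names = col_names)
str(sample_data)
```

Read this way, both records stay in the data and every column already carries its descriptive name.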

glimpse(dataset)

## Rows: 698
## Columns: 11
## $ Sample_code_number          <int> 1002945, 1015425, 1016277, 1017023, 101...
## $ Clump_Thickness             <int> 5, 3, 6, 4, 8, 1, 2, 2, 4, 1, 2, 5, 1, ...
## $ Uniformity_of_Cell_Size     <int> 4, 1, 8, 1, 10, 1, 1, 1, 2, 1, 1, 3, 1,...
## $ Uniformity_of_Cell_Shape    <int> 4, 1, 8, 1, 10, 1, 2, 1, 1, 1, 1, 3, 1,...
## $ Marginal_Adhesion           <int> 5, 1, 1, 3, 8, 1, 1, 1, 1, 1, 1, 3, 1, ...
## $ Single_Epithelial_Cell_Size <int> 7, 2, 3, 2, 7, 2, 2, 2, 2, 1, 2, 2, 2, ...
## $ Bare_Nuclei                 <chr> "10", "2", "4", "1", "10", "10", "1", "...
## $ Bland_Chromatin             <int> 3, 3, 3, 3, 9, 3, 3, 1, 2, 3, 2, 4, 3, ...
## $ Normal_Nucleoli             <int> 2, 1, 7, 1, 7, 1, 1, 1, 1, 1, 1, 4, 1, ...
## $ Mitoses                     <int> 1, 1, 1, 1, 1, 1, 1, 5, 1, 1, 1, 1, 1, ...
## $ Class                       <int> 2, 2, 2, 2, 4, 2, 2, 2, 2, 2, 2, 4, 2, ...

Cleaning Dataset

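library(dplyr)
library(tidyr)

dataset = dataset %>%
  mutate(Class = as.factor(Class),
         # non-numeric entries in Bare_Nuclei become NA here
         Bare_Nuclei = suppressWarnings(as.numeric(Bare_Nuclei))) %>%
  drop_na() %>%                  # remove rows with missing values
  select(-Sample_code_number)    # the sample ID is not a predictor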

colSums(is.na(dataset))

##             Clump_Thickness     Uniformity_of_Cell_Size 
##                           0                           0 
##    Uniformity_of_Cell_Shape           Marginal_Adhesion 
##                           0                           0 
## Single_Epithelial_Cell_Size                 Bare_Nuclei 
##                           0                           0 
##             Bland_Chromatin             Normal_Nucleoli 
##                           0                           0 
##                     Mitoses                       Class 
##                           0                           0

Check Class Proportion

Before we split the data, we need to check whether the class proportions are balanced. If they are not, we will need to balance the data in a later step.

prop.table(table(dataset$Class))

## 
##         2         4 
## 0.6495601 0.3504399

The proportion is quite good. The numbers may look confusing at first, so here is what they mean: class 2 is a benign tumor and class 4 is a malignant one. About 65% of the samples are benign and 35% are malignant.
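If the 2/4 codes are hard to read, the classes can be given readable labels before tabulating. This is a small sketch on made-up codes (the vector `class_codes` is hypothetical, not taken from the dataset); the analysis in this post keeps the original 2/4 coding.

```r
# Hypothetical class codes using the same 2/4 encoding as the dataset.
class_codes <- c(2, 2, 4, 2, 4, 2, 2, 4)

# Attach readable labels; the level order (benign first) is stated explicitly.
class_labels <- factor(class_codes,
                       levels = c(2, 4),
                       labels = c("benign", "malignant"))

# Share of each class, rounded for display.
round(prop.table(table(class_labels)), 3)
```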

Splitting Dataset

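library(caTools)
set.seed(123)
# Note: sample.split() is meant to receive a vector of labels (e.g. dataset$Class).
# Given the whole data frame it returns one TRUE/FALSE flag per column, and
# subset() recycles those flags across the rows, so the split that results
# is about 70/30 (478 training rows, 204 test rows) rather than 75/25.
split = sample.split(dataset, SplitRatio = 0.75)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)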
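For comparison, here is a minimal sketch of a row-wise split that keeps the class proportions the same in both parts (a stratified split). It uses only base R and a hypothetical toy data frame `toy`, so it is not the split used for the results in this post. With caTools, passing the label vector, as in `sample.split(dataset$Class, SplitRatio = 0.75)`, is the documented way to get a comparable result.

```r
set.seed(123)

# Hypothetical toy data: 20 rows, 12 of class 2 and 8 of class 4.
toy <- data.frame(
  x = 1:20,
  Class = factor(rep(c(2, 4), times = c(12, 8)))
)

# For each class, draw 75% of its row indices for the training set.
rows_by_class <- split(seq_len(nrow(toy)), toy$Class)
train_idx <- unlist(lapply(rows_by_class, function(idx) {
  idx[sample.int(length(idx), size = round(0.75 * length(idx)))]
}), use.names = FALSE)

toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

# Both parts keep the 60/40 class mix of the full toy data.
table(toy_train$Class)
table(toy_test$Class)
```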

Our goal

From this data we will predict whether a patient's breast tumor is benign or malignant. We will build two classification models: logistic regression and K-NN. Why only two? Because all of our predictors are numeric (integer), and both models work well with this kind of predictor. Let's start.

Logistic Regression

Create a Logistic Regression Model

logistic = glm(Class ~ ., training_set , family = "binomial")
summary(logistic)

## 
## Call:
## glm(formula = Class ~ ., family = "binomial", data = training_set)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.5989  -0.1241  -0.0624   0.0163   2.3436  
## 
## Coefficients:
##                              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                 -10.04176    1.37542  -7.301 2.86e-13 ***
## Clump_Thickness               0.54341    0.16450   3.303 0.000955 ***
## Uniformity_of_Cell_Size      -0.04772    0.31313  -0.152 0.878878    
## Uniformity_of_Cell_Shape      0.19168    0.33922   0.565 0.572041    
## Marginal_Adhesion             0.30149    0.14112   2.136 0.032654 *  
## Single_Epithelial_Cell_Size   0.09594    0.18477   0.519 0.603593    
## Bare_Nuclei                   0.44786    0.11485   3.900 9.63e-05 ***
## Bland_Chromatin               0.36847    0.19614   1.879 0.060296 .  
## Normal_Nucleoli               0.30874    0.15303   2.018 0.043639 *  
## Mitoses                       0.75888    0.34638   2.191 0.028460 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 614.773  on 477  degrees of freedom
## Residual deviance:  73.615  on 468  degrees of freedom
## AIC: 93.615
## 
## Number of Fisher Scoring iterations: 8

Predict the Test Data with the Logistic Regression Model

log.Risk <- predict(logistic, newdata = test_set, type = "response")

log.Label <- ifelse(log.Risk < 0.5 , 2, 4)
log.Label = as.factor(log.Label)
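When the response is a factor, `glm()` with a binomial family treats the first level as "failure" and every other level as "success", so the predicted probabilities here refer to class 4 (malignant). The sketch below shows the same 0.5 cutoff on a few hypothetical probabilities (`risk` is made up for illustration). Stating the factor levels explicitly is an optional safeguard: it keeps both classes as levels even if every prediction falls on one side of the cutoff.

```r
# Hypothetical predicted probabilities of the malignant class ("4").
risk <- c(0.02, 0.91, 0.47, 0.50, 0.13)

# Probabilities below the cutoff become 2 (benign), the rest 4 (malignant).
cutoff <- 0.5
label <- factor(ifelse(risk < cutoff, 2, 4), levels = c(2, 4))

label
table(label)
```

A probability of exactly 0.5 is not below the cutoff, so it is labeled 4.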

Confusion Matrix Logistic Regression

suppressMessages(library(caret))
logistic_confus = confusionMatrix(data = log.Label , reference = test_set$Class , positive = '4')
logistic_confus

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   2   4
##          2 127   3
##          4   2  72
##                                          
##                Accuracy : 0.9755         
##                  95% CI : (0.9437, 0.992)
##     No Information Rate : 0.6324         
##     P-Value [Acc > NIR] : <2e-16         
##                                          
##                   Kappa : 0.9471         
##                                          
##  Mcnemar's Test P-Value : 1              
##                                          
##             Sensitivity : 0.9600         
##             Specificity : 0.9845         
##          Pos Pred Value : 0.9730         
##          Neg Pred Value : 0.9769         
##              Prevalence : 0.3676         
##          Detection Rate : 0.3529         
##    Detection Prevalence : 0.3627         
##       Balanced Accuracy : 0.9722         
##                                          
##        'Positive' Class : 4              
##

K-NN

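Split Data into Predictors and Target

K-NN is based on distances, so the scale of the predictors matters. If the predictors have a known, bounded range, we scale them with normalization (min-max); if not, we scale them with standardization (z-score). In this case every predictor ranges from 1 to 10 (the target is not included), so normalization is the appropriate choice. Because all predictors already share that same scale, the code below passes them to K-NN as they are.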
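As an illustration of normalization, here is a minimal min-max sketch on hypothetical values (`train_x` and `test_x` are made up). It assumes the minimum and range are learned from the training data only and then reused for the test data, so that no information from the test set leaks into the scaling.

```r
# Hypothetical predictor values on the 1-10 scale used in this dataset.
train_x <- data.frame(a = c(1, 5, 10), b = c(2, 4, 8))
test_x  <- data.frame(a = c(3, 10),    b = c(8, 5))

# Min-max normalization: (x - min) / (max - min),
# with min and max taken from the training data only.
mins   <- sapply(train_x, min)
ranges <- sapply(train_x, max) - mins

train_scaled <- as.data.frame(scale(train_x, center = mins, scale = ranges))
test_scaled  <- as.data.frame(scale(test_x,  center = mins, scale = ranges))

train_scaled
test_scaled
```

After scaling, every training column runs from 0 to 1; test values land in the same range only if they fall inside the training minimum and maximum.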

# predictors
data_train_x <- training_set %>% select(- Class)

data_test_x <- test_set %>% select(- Class)

# target
data_train_y <- training_set %>% select(Class)

data_test_y <- test_set %>% select(Class)

Create a K-NN Model

library(class)
knn_pred <- knn(train = data_train_x,
                test = data_test_x,
                cl = data_train_y$Class,
                k = 21)

Confusion Matrix K-NN

knn_confus = confusionMatrix(data = knn_pred , reference = data_test_y$Class, positive = "4")
knn_confus

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   2   4
##          2 127   5
##          4   2  70
##                                           
##                Accuracy : 0.9657          
##                  95% CI : (0.9306, 0.9861)
##     No Information Rate : 0.6324          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9256          
##                                           
##  Mcnemar's Test P-Value : 0.4497          
##                                           
##             Sensitivity : 0.9333          
##             Specificity : 0.9845          
##          Pos Pred Value : 0.9722          
##          Neg Pred Value : 0.9621          
##              Prevalence : 0.3676          
##          Detection Rate : 0.3431          
##    Detection Prevalence : 0.3529          
##       Balanced Accuracy : 0.9589          
##                                           
##        'Positive' Class : 4               
##

Comparison of Logistic Regression and K-NN

To compare the models we look at the scores in each confusion matrix. The confusion matrix output contains a lot of information, but we only need to know four of its metrics:

Recall/Sensitivity = of all the actual positive cases, the proportion that the model correctly predicts as positive.

Specificity = of all the actual negative cases, the proportion that the model correctly predicts as negative.

Precision/Pos Pred Value = of all the cases predicted as positive, the proportion that are actually positive.

Accuracy = of all the cases, the proportion for which the model correctly predicts the target Y.
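To make the four definitions concrete, they can be computed by hand from the counts in the logistic regression confusion matrix above, with class 4 as the positive class. The variable names `TP`, `FN`, `FP` and `TN` are just labels chosen for this sketch.

```r
# Counts from the logistic regression confusion matrix (positive class = 4).
TP <- 72   # predicted 4, actually 4
FN <- 3    # predicted 2, actually 4
FP <- 2    # predicted 4, actually 2
TN <- 127  # predicted 2, actually 2

metrics <- c(
  Recall      = TP / (TP + FN),
  Specificity = TN / (TN + FP),
  Precision   = TP / (TP + FP),
  Accuracy    = (TP + TN) / (TP + TN + FP + FN)
)
round(metrics, 4)
```

The results agree with the Sensitivity, Specificity, Pos Pred Value and Accuracy lines that caret reports for the logistic regression model.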

# tibble() replaces the deprecated data_frame(); metrics are picked by name
eval_logit <- tibble(Accuracy = logistic_confus$overall[["Accuracy"]],
           Recall = logistic_confus$byClass[["Sensitivity"]],
           Specificity = logistic_confus$byClass[["Specificity"]],
           Precision = logistic_confus$byClass[["Pos Pred Value"]])

eval_knn <- tibble(Accuracy = knn_confus$overall[["Accuracy"]],
           Recall = knn_confus$byClass[["Sensitivity"]],
           Specificity = knn_confus$byClass[["Specificity"]],
           Precision = knn_confus$byClass[["Pos Pred Value"]])

eval_logit

## # A tibble: 1 x 4
##   Accuracy Recall Specificity Precision
##      <dbl>  <dbl>       <dbl>     <dbl>
## 1    0.975   0.96       0.984     0.973

eval_knn

## # A tibble: 1 x 4
##   Accuracy Recall Specificity Precision
##      <dbl>  <dbl>       <dbl>     <dbl>
## 1    0.966  0.933       0.984     0.972

In this case we want to predict malignant or benign, which means we do not want to favor either class over the other. For that reason we compare the two models on accuracy. The accuracy of logistic regression (0.975) is higher than that of K-NN (0.966), so I will use logistic regression.

Conclusion

Suppose I put myself in a doctor's position. In the long run, the treatment I give a patient with a benign tumor is very different from the treatment I give a patient with a malignant one. At the first step, however, the two are handled the same way: whether the tumor is benign or not, the patient needs a further stage of examination in the breast cancer process, and that is not detrimental to a benign case. That is why I look closely at the accuracy metric when deciding how to place each case.