Breast Cancer
Introduction
Welcome back to my RPubs. Today we will study a breast cancer dataset with machine learning: we will build classification models that predict whether a tumor is benign or malignant.
About Breast Cancer
Breast cancer is a disease in which cells in the breast grow out of control. There are different kinds of breast cancer. The kind of breast cancer depends on which cells in the breast turn into cancer.
Breast cancer can begin in different parts of the breast. A breast is made up of three main parts: lobules, ducts, and connective tissue. The lobules are the glands that produce milk. The ducts are tubes that carry milk to the nipple. The connective tissue (which consists of fibrous and fatty tissue) surrounds and holds everything together. Most breast cancers begin in the ducts or lobules.
Source : https://www.cdc.gov/cancer/breast/basic_info/what-is-breast-cancer.htm
Prepare Data
Import Dataset
## X1000025 X5 X1 X1.1 X1.2 X2 X1.3 X3 X1.4 X1.5 X2.1
## 1 1002945 5 4 4 5 7 10 3 2 1 2
## 2 1015425 3 1 1 1 2 2 3 1 1 2
## 3 1016277 6 8 8 1 3 4 3 7 1 2
## 4 1017023 4 1 1 3 2 1 3 1 1 2
## 5 1017122 8 10 10 8 7 10 9 7 1 4
## 6 1018099 1 1 1 1 2 10 3 1 1 2
We can see that the columns do not have proper names: R appears to have used the first record of the file as the header. We need to rename the columns, and for that we need the attribute information published on the dataset's site, which documents this data in detail.
## Rows: 698
## Columns: 11
## $ Sample_code_number <int> 1002945, 1015425, 1016277, 1017023, 101...
## $ Clump_Thickness <int> 5, 3, 6, 4, 8, 1, 2, 2, 4, 1, 2, 5, 1, ...
## $ Uniformity_of_Cell_Size <int> 4, 1, 8, 1, 10, 1, 1, 1, 2, 1, 1, 3, 1,...
## $ Uniformity_of_Cell_Shape <int> 4, 1, 8, 1, 10, 1, 2, 1, 1, 1, 1, 3, 1,...
## $ Marginal_Adhesion <int> 5, 1, 1, 3, 8, 1, 1, 1, 1, 1, 1, 3, 1, ...
## $ Single_Epithelial_Cell_Size <int> 7, 2, 3, 2, 7, 2, 2, 2, 2, 1, 2, 2, 2, ...
## $ Bare_Nuclei <chr> "10", "2", "4", "1", "10", "10", "1", "...
## $ Bland_Chromatin <int> 3, 3, 3, 3, 9, 3, 3, 1, 2, 3, 2, 4, 3, ...
## $ Normal_Nucleoli <int> 2, 1, 7, 1, 7, 1, 1, 1, 1, 1, 1, 4, 1, ...
## $ Mitoses <int> 1, 1, 1, 1, 1, 1, 1, 5, 1, 1, 1, 1, 1, ...
## $ Class <int> 2, 2, 2, 2, 4, 2, 2, 2, 2, 2, 2, 4, 2, ...
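The import and renaming code is not shown above. A minimal sketch of that step, assuming the raw file has no header row: the three records are typed in from the printed output so the example runs on its own, `raw_text` is a hypothetical stand-in for the real file, and `header = FALSE` is my assumption for keeping the first record as data instead of losing it to the column names.

```r
# Three records taken from the output above; the first one is the record
# that ended up as column names (X1000025, X5, ...) in the original import.
raw_text <- "1000025,5,1,1,1,2,1,3,1,1,2
1002945,5,4,4,5,7,10,3,2,1,2
1015425,3,1,1,1,2,2,3,1,1,2"

# header = FALSE keeps the first record as a data row
dataset <- read.csv(text = raw_text, header = FALSE)

# Column names as they appear in the glimpse() output
names(dataset) <- c("Sample_code_number", "Clump_Thickness",
                    "Uniformity_of_Cell_Size", "Uniformity_of_Cell_Shape",
                    "Marginal_Adhesion", "Single_Epithelial_Cell_Size",
                    "Bare_Nuclei", "Bland_Chromatin", "Normal_Nucleoli",
                    "Mitoses", "Class")

str(dataset)
```

In this small sample `Bare_Nuclei` is read as an integer. In the full data it is read as character, so it still needs the conversion done in the cleaning step.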
Cleaning Dataset
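library(dplyr)
library(tidyr)

# Convert the target to a factor, coerce Bare_Nuclei to numeric
# (non-numeric entries become NA), then drop incomplete rows and the ID column
dataset <- dataset %>%
  mutate(Class = as.factor(Class),
         Bare_Nuclei = suppressWarnings(as.numeric(Bare_Nuclei))) %>%
  drop_na() %>%
  select(-Sample_code_number)

colSums(is.na(dataset))
## Clump_Thickness Uniformity_of_Cell_Size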
## 0 0
## Uniformity_of_Cell_Shape Marginal_Adhesion
## 0 0
## Single_Epithelial_Cell_Size Bare_Nuclei
## 0 0
## Bland_Chromatin Normal_Nucleoli
## 0 0
## Mitoses Class
## 0 0
Check Class Proportion
Before we split the data, we need to check whether the class proportions are balanced. If they are not, we will need to balance the data in a later step.
##
## 2 4
## 0.6495601 0.3504399
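The code behind this output is not shown. One way to get such a table is `prop.table(table(...))`. In the sketch below the class vector is rebuilt from the printed proportions: the counts 443 and 239 (682 rows in total) are my back-calculation, not values taken from the original code.

```r
# Class labels rebuilt from the printed proportions (assumed counts:
# 443 benign and 239 malignant out of 682 rows)
Class <- factor(rep(c(2, 4), times = c(443, 239)))

# Share of each class in the data
class_prop <- prop.table(table(Class))
round(class_prop, 4)
```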
The proportion is reasonably balanced. In case the labels are confusing: class 2 means the tumor is benign and class 4 means it is malignant, so about 65% of the samples are benign and 35% are malignant.
Our Goal
From this data we will predict whether a breast tumor is benign or malignant. We will build two classification models: logistic regression and K-NN. Why only two? Because all of our predictors are numeric (integers), and both models work well with this kind of predictor. Let's start.
Logistic Regression
Create a Logistic Regression Model
##
## Call:
## glm(formula = Class ~ ., family = "binomial", data = training_set)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.5989 -0.1241 -0.0624 0.0163 2.3436
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -10.04176 1.37542 -7.301 2.86e-13 ***
## Clump_Thickness 0.54341 0.16450 3.303 0.000955 ***
## Uniformity_of_Cell_Size -0.04772 0.31313 -0.152 0.878878
## Uniformity_of_Cell_Shape 0.19168 0.33922 0.565 0.572041
## Marginal_Adhesion 0.30149 0.14112 2.136 0.032654 *
## Single_Epithelial_Cell_Size 0.09594 0.18477 0.519 0.603593
## Bare_Nuclei 0.44786 0.11485 3.900 9.63e-05 ***
## Bland_Chromatin 0.36847 0.19614 1.879 0.060296 .
## Normal_Nucleoli 0.30874 0.15303 2.018 0.043639 *
## Mitoses 0.75888 0.34638 2.191 0.028460 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 614.773 on 477 degrees of freedom
## Residual deviance: 73.615 on 468 degrees of freedom
## AIC: 93.615
##
## Number of Fisher Scoring iterations: 8
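The split and the model call are not shown, only the summary. A minimal sketch of how a model like this can be fitted, assuming a roughly 70/30 train/test split (which fits the 478 training and 204 test rows seen in the outputs). The data here is synthetic, and `toy`, `train_idx` and `model_logistic` are hypothetical names. Only `training_set`, `test_set` and the `glm()` formula come from the original.

```r
set.seed(123)

# Synthetic stand-in for the cleaned data: two predictors scored 1 to 10
n <- 300
toy <- data.frame(
  Clump_Thickness = sample(1:10, n, replace = TRUE),
  Bare_Nuclei     = sample(1:10, n, replace = TRUE)
)

# Synthetic target: higher scores make class 4 (malignant) more likely
p <- plogis(-6 + 0.5 * toy$Clump_Thickness + 0.5 * toy$Bare_Nuclei)
toy$Class <- factor(ifelse(runif(n) < p, 4, 2), levels = c(2, 4))

# Random 70/30 split into training and test rows
train_idx    <- sample(seq_len(n), size = round(0.7 * n))
training_set <- toy[train_idx, ]
test_set     <- toy[-train_idx, ]

# With a factor target, glm() models the probability of the second level ("4")
model_logistic <- glm(Class ~ ., family = "binomial", data = training_set)
coef(model_logistic)
```

A positive coefficient means that a higher score on that predictor raises the log-odds of the malignant class.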
Predict the Test Data with the Logistic Regression Model
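The prediction code is not shown. A minimal sketch of this step on a tiny made-up dataset: `predict()` with `type = "response"` returns the probability of class 4, and a cutoff turns it into a label. The 0.5 cutoff and the names `train`, `model_logistic` and `log.prob` are my assumptions. `log.Label` and `test_set` are the names used in this post.

```r
# Tiny made-up training data with overlapping classes
train <- data.frame(
  x     = 1:10,
  Class = factor(c(2, 2, 4, 2, 2, 4, 2, 4, 4, 4))
)
test_set <- data.frame(x = c(1, 10))

model_logistic <- glm(Class ~ x, family = "binomial", data = train)

# Probability of the second factor level ("4") for each test row
log.prob <- predict(model_logistic, newdata = test_set, type = "response")

# Assumed cutoff of 0.5: above it we label the sample as malignant ("4")
log.Label <- factor(ifelse(log.prob > 0.5, "4", "2"), levels = c("2", "4"))
log.Label
```

Keeping both levels in `factor()` matters, because `confusionMatrix()` expects the prediction and the reference to have the same levels.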
Confusion Matrix Logistic Regression
suppressMessages(library(caret))
logistic_confus <- confusionMatrix(data = log.Label, reference = test_set$Class, positive = "4")
logistic_confus
## Confusion Matrix and Statistics
##
## Reference
## Prediction 2 4
## 2 127 3
## 4 2 72
##
## Accuracy : 0.9755
## 95% CI : (0.9437, 0.992)
## No Information Rate : 0.6324
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9471
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9600
## Specificity : 0.9845
## Pos Pred Value : 0.9730
## Neg Pred Value : 0.9769
## Prevalence : 0.3676
## Detection Rate : 0.3529
## Detection Prevalence : 0.3627
## Balanced Accuracy : 0.9722
##
## 'Positive' Class : 4
##
K-NN
Split Data into Predictors and Target
When the features have a known, bounded range, we scale them with min-max normalization. Otherwise we scale them with standardization. In this case every predictor ranges from 1 to 10 (the target is not included), so we use normalization.
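Min-max normalization maps each value to (x - min) / (max - min), so the result lies between 0 and 1. A minimal sketch, with `normalize` and `scores` as hypothetical names:

```r
# Min-max normalization: rescale a numeric vector to the range [0, 1]
normalize <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

# Example scores on the 1 to 10 scale used by the predictors
scores <- c(1, 3, 5, 10)
normalize(scores)
```

To scale a whole data frame of predictors, the same function can be applied per column, for example with `as.data.frame(lapply(df, normalize))`.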
Create a K-NN Model
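The K-NN code is not shown. A minimal sketch using `knn()` from the `class` package on a tiny made-up dataset that is already scaled to 0 to 1. The value `k = 3` and the names `train_x`, `train_y` and `test_x` are my assumptions, since the post does not state the k that was used. `knn_pred` is the name used in this post.

```r
library(class)

# Made-up, already normalized predictors: three benign-like points near 0
# and three malignant-like points near 1
train_x <- data.frame(a = c(0.0, 0.1, 0.2, 0.8, 0.9, 1.0),
                      b = c(0.1, 0.0, 0.2, 0.9, 1.0, 0.8))
train_y <- factor(c(2, 2, 2, 4, 4, 4))

# Two new points to classify
test_x <- data.frame(a = c(0.05, 0.95),
                     b = c(0.05, 0.95))

# Each test point takes the majority class of its k nearest training points
knn_pred <- knn(train = train_x, test = test_x, cl = train_y, k = 3)
knn_pred
```

An odd k avoids ties between the two classes.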
Confusion Matrix K-NN
knn_confus <- confusionMatrix(data = knn_pred, reference = data_test_y$Class, positive = "4")
knn_confus
## Confusion Matrix and Statistics
##
## Reference
## Prediction 2 4
## 2 127 5
## 4 2 70
##
## Accuracy : 0.9657
## 95% CI : (0.9306, 0.9861)
## No Information Rate : 0.6324
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9256
##
## Mcnemar's Test P-Value : 0.4497
##
## Sensitivity : 0.9333
## Specificity : 0.9845
## Pos Pred Value : 0.9722
## Neg Pred Value : 0.9621
## Prevalence : 0.3676
## Detection Rate : 0.3431
## Detection Prevalence : 0.3529
## Balanced Accuracy : 0.9589
##
## 'Positive' Class : 4
##
Comparison of Logistic Regression and K-NN
To compare the models we look at the scores in each confusion matrix. The confusion matrix output contains a lot of information, but we only need four of its metrics:
Recall/Sensitivity = of all the actual positive data, the proportion that the model predicts correctly.
Specificity = of all the actual negative data, the proportion that the model predicts correctly.
Precision/Pos Pred Value = of all the data predicted as positive, the proportion that is actually positive.
Accuracy = of all the data, the proportion for which the model predicts the target Y correctly.
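These four metrics can be checked by hand from the logistic regression confusion matrix shown earlier, with class 4 as the positive class. A minimal sketch, where `TP`, `FN`, `FP` and `TN` are my own names for the four cells of that table:

```r
# Cells of the logistic regression confusion matrix (positive class = 4)
TP <- 72   # predicted 4, actual 4
FN <- 3    # predicted 2, actual 4
FP <- 2    # predicted 4, actual 2
TN <- 127  # predicted 2, actual 2

recall      <- TP / (TP + FN)                   # 72 / 75
specificity <- TN / (TN + FP)                   # 127 / 129
precision   <- TP / (TP + FP)                   # 72 / 74
accuracy    <- (TP + TN) / (TP + TN + FP + FN)  # 199 / 204

round(c(Accuracy = accuracy, Recall = recall,
        Specificity = specificity, Precision = precision), 4)
```

The rounded values match the Accuracy, Sensitivity, Specificity and Pos Pred Value lines of the `confusionMatrix()` output.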
# tibble() replaces the deprecated data_frame()
eval_logit <- tibble(Accuracy = logistic_confus$overall[1],
                     Recall = logistic_confus$byClass[1],
                     Specificity = logistic_confus$byClass[2],
                     Precision = logistic_confus$byClass[3])
eval_knn <- tibble(Accuracy = knn_confus$overall[1],
                   Recall = knn_confus$byClass[1],
                   Specificity = knn_confus$byClass[2],
                   Precision = knn_confus$byClass[3])
eval_logit
eval_knn
## # A tibble: 1 x 4
## Accuracy Recall Specificity Precision
## <dbl> <dbl> <dbl> <dbl>
## 1 0.975 0.96 0.984 0.973
## # A tibble: 1 x 4
## Accuracy Recall Specificity Precision
## <dbl> <dbl> <dbl> <dbl>
## 1 0.966 0.933 0.984 0.972
In this case we want to predict whether a tumor is malignant or benign, and we do not want to favor either class. For that reason we compare the two models on accuracy. The accuracy of logistic regression (0.975) is higher than that of K-NN (0.966), so I will use logistic regression.
Conclusion
Suppose I were a doctor. The treatment I would give a patient with a benign tumor is very different from the treatment for a malignant one, but the first step is the same for both: benign or not, the patient needs further examination of the breast cancer, and that does no harm to a benign patient. For this reason I look mainly at the accuracy metric when choosing the model for this case.