Breast Cancer Classification
Introduction
In this project, we will try to predict the type of breast cancer for patients in a hospital, that is, whether a tumor is malignant or benign. We will use three classification models: Naive Bayes, Decision Tree, and Random Forest.
Library
library(dplyr)        # for data wrangling
library(caret)        # for upsampling the training data, model training, and evaluation
library(e1071)        # for the Naive Bayes model
library(partykit)     # for the Decision Tree model
library(randomForest) # to check the summary of our random forest model
Data Exploration
breast <- read.csv("breast_cancer_diagnose.csv")
glimpse(breast)
## Rows: 569
## Columns: 33
## $ id <int> 842302, 842517, 84300903, 84348301, 84358402, ~
## $ diagnosis <chr> "M", "M", "M", "M", "M", "M", "M", "M", "M", "~
## $ radius_mean <dbl> 17.990, 20.570, 19.690, 11.420, 20.290, 12.450~
## $ texture_mean <dbl> 10.38, 17.77, 21.25, 20.38, 14.34, 15.70, 19.9~
## $ perimeter_mean <dbl> 122.80, 132.90, 130.00, 77.58, 135.10, 82.57, ~
## $ area_mean <dbl> 1001.0, 1326.0, 1203.0, 386.1, 1297.0, 477.1, ~
## $ smoothness_mean <dbl> 0.11840, 0.08474, 0.10960, 0.14250, 0.10030, 0~
## $ compactness_mean <dbl> 0.27760, 0.07864, 0.15990, 0.28390, 0.13280, 0~
## $ concavity_mean <dbl> 0.30010, 0.08690, 0.19740, 0.24140, 0.19800, 0~
## $ concave.points_mean <dbl> 0.14710, 0.07017, 0.12790, 0.10520, 0.10430, 0~
## $ symmetry_mean <dbl> 0.2419, 0.1812, 0.2069, 0.2597, 0.1809, 0.2087~
## $ fractal_dimension_mean <dbl> 0.07871, 0.05667, 0.05999, 0.09744, 0.05883, 0~
## $ radius_se <dbl> 1.0950, 0.5435, 0.7456, 0.4956, 0.7572, 0.3345~
## $ texture_se <dbl> 0.9053, 0.7339, 0.7869, 1.1560, 0.7813, 0.8902~
## $ perimeter_se <dbl> 8.589, 3.398, 4.585, 3.445, 5.438, 2.217, 3.18~
## $ area_se <dbl> 153.40, 74.08, 94.03, 27.23, 94.44, 27.19, 53.~
## $ smoothness_se <dbl> 0.006399, 0.005225, 0.006150, 0.009110, 0.0114~
## $ compactness_se <dbl> 0.049040, 0.013080, 0.040060, 0.074580, 0.0246~
## $ concavity_se <dbl> 0.05373, 0.01860, 0.03832, 0.05661, 0.05688, 0~
## $ concave.points_se <dbl> 0.015870, 0.013400, 0.020580, 0.018670, 0.0188~
## $ symmetry_se <dbl> 0.03003, 0.01389, 0.02250, 0.05963, 0.01756, 0~
## $ fractal_dimension_se <dbl> 0.006193, 0.003532, 0.004571, 0.009208, 0.0051~
## $ radius_worst <dbl> 25.38, 24.99, 23.57, 14.91, 22.54, 15.47, 22.8~
## $ texture_worst <dbl> 17.33, 23.41, 25.53, 26.50, 16.67, 23.75, 27.6~
## $ perimeter_worst <dbl> 184.60, 158.80, 152.50, 98.87, 152.20, 103.40,~
## $ area_worst <dbl> 2019.0, 1956.0, 1709.0, 567.7, 1575.0, 741.6, ~
## $ smoothness_worst <dbl> 0.1622, 0.1238, 0.1444, 0.2098, 0.1374, 0.1791~
## $ compactness_worst <dbl> 0.6656, 0.1866, 0.4245, 0.8663, 0.2050, 0.5249~
## $ concavity_worst <dbl> 0.71190, 0.24160, 0.45040, 0.68690, 0.40000, 0~
## $ concave.points_worst <dbl> 0.26540, 0.18600, 0.24300, 0.25750, 0.16250, 0~
## $ symmetry_worst <dbl> 0.4601, 0.2750, 0.3613, 0.6638, 0.2364, 0.3985~
## $ fractal_dimension_worst <dbl> 0.11890, 0.08902, 0.08758, 0.17300, 0.07678, 0~
## $ X <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
Based on the output above, the breast cancer data consists of 569 observations and 33 variables. Each feature is described below:
- ID number
- Diagnosis (M = malignant, B = benign)
Ten real-valued features are computed for each cell nucleus:
a) radius (mean of distances from center to points on the perimeter)
b) texture (standard deviation of gray-scale values)
c) perimeter
d) area
e) smoothness (local variation in radius lengths)
f) compactness (perimeter^2 / area - 1.0)
g) concavity (severity of concave portions of the contour)
h) concave points (number of concave portions of the contour)
i) symmetry
j) fractal dimension ("coastline approximation" - 1)
The mean, standard error, and “worst” or largest (mean of the three largest values) of these features were computed for each image, resulting in 30 features. For instance, field 3 is Mean Radius, field 13 is Radius SE, field 23 is Worst Radius.
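As a small illustration of how the 30 feature columns arise, the sketch below rebuilds the column names by crossing the ten measurements with the three summary statistics. The object names (measurements, stats, feature_names, fields) are chosen here for illustration only; the naming pattern follows the glimpse() output above.

```r
# Ten measurements computed for each cell nucleus
measurements <- c("radius", "texture", "perimeter", "area", "smoothness",
                  "compactness", "concavity", "concave.points", "symmetry",
                  "fractal_dimension")
# Three summary statistics computed for each measurement
stats <- c("mean", "se", "worst")

# Cross every measurement with every statistic: 10 x 3 = 30 feature names,
# ordered as all "_mean" columns, then all "_se", then all "_worst"
feature_names <- as.vector(outer(measurements, stats, paste, sep = "_"))
length(feature_names)

# With id and diagnosis in front, fields 3, 13, and 23 are the radius columns
fields <- c("id", "diagnosis", feature_names)
fields[c(3, 13, 23)]
```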
We will use the diagnosis variable as the target class that we will try to predict.
Data Pre-Processing
We will convert the diagnosis variable into a factor. We will also drop the id and X variables since we don’t need them.
breast <- breast %>%
  mutate(diagnosis = factor(diagnosis, levels = c("B", "M"), labels = c("benign", "malignant"))) %>%
  select(-c(id, X))
Next, we check whether the data has any missing values.
colSums(is.na(breast))
## diagnosis radius_mean texture_mean
## 0 0 0
## perimeter_mean area_mean smoothness_mean
## 0 0 0
## compactness_mean concavity_mean concave.points_mean
## 0 0 0
## symmetry_mean fractal_dimension_mean radius_se
## 0 0 0
## texture_se perimeter_se area_se
## 0 0 0
## smoothness_se compactness_se concavity_se
## 0 0 0
## concave.points_se symmetry_se fractal_dimension_se
## 0 0 0
## radius_worst texture_worst perimeter_worst
## 0 0 0
## area_worst smoothness_worst compactness_worst
## 0 0 0
## concavity_worst concave.points_worst symmetry_worst
## 0 0 0
## fractal_dimension_worst
## 0
Cross Validation
Before we build our models, we should split the dataset into training and test data. We will split the data into 75% training and 25% test sets using the sample() function with set.seed(100), and store them as data_train and data_test.
RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(100)
index <- sample(nrow(breast), nrow(breast)*0.75)
data_train <- breast[index,]
data_test <- breast[-index,]
Let’s look at the proportion of our target classes in the training data using prop.table(table(object$target)) to check whether the classes are balanced.
prop.table(table(data_train$diagnosis))
##
## benign malignant
## 0.6267606 0.3732394
Based on the proportions above, the target variable is moderately imbalanced (about 63% benign versus 37% malignant), so we will balance the data before using it for our models. One important thing to keep in mind is that all sub-sampling operations must be applied only to the training dataset. We will therefore use the upSample() function from the caret package on the training data.
set.seed(100)
# note: despite the "_down" suffix, this object holds the up-sampled training data
data_train_down <- upSample(x = data_train %>% select(-diagnosis),
                            y = data_train$diagnosis,
                            yname = "diagnosis")
prop.table(table(data_train_down$diagnosis))
##
## benign malignant
## 0.5 0.5
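To make the idea behind up-sampling concrete, here is a minimal base R sketch of random over-sampling: rows of the smaller class are drawn with replacement until both classes have the same size. The toy data frame and all object names are made up for illustration, and this is not necessarily how upSample() is implemented internally.

```r
set.seed(100)

# Toy data: 7 benign rows and 3 malignant rows (made up for illustration)
toy <- data.frame(
  x = rnorm(10),
  diagnosis = factor(rep(c("benign", "malignant"), times = c(7, 3)))
)

class_counts <- table(toy$diagnosis)
target_size <- max(class_counts)  # every class is grown to the largest class size

# For each class, keep all original rows and draw the missing rows with replacement
rows_by_class <- lapply(names(class_counts), function(cl) {
  rows <- which(toy$diagnosis == cl)
  n_extra <- target_size - length(rows)
  c(rows, rows[sample.int(length(rows), n_extra, replace = TRUE)])
})

toy_up <- toy[unlist(rows_by_class), ]
prop.table(table(toy_up$diagnosis))
```

Because the extra rows are copies of existing observations, up-sampling adds no new information; it only changes the class weights the model sees during training.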
Algorithms
NAIVE BAYES
After splitting our data into training and test sets and up-sampling the training data, let’s build our first model, Naive Bayes. This model has several advantages, for example:
- The model is relatively fast to train
- It produces probabilistic predictions
- It can handle irrelevant features
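We will build a Naive Bayes model using the naiveBayes() function from the e1071 package. We will set the laplace parameter to 1. Note that Laplace smoothing only affects categorical predictors; since all predictors here are numeric, it is not expected to change the fitted model.
model_naive <- naiveBayes(x = data_train_down %>% select(-diagnosis),
                          y = data_train_down$diagnosis,
                          laplace = 1)
Model Prediction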
We will predict the test data using model_naive, with type = "class" to obtain class predictions.
pred_naive <- predict(object = model_naive,
                      newdata = data_test,
                      type = "class")
Model Evaluation
The last part of model building is model evaluation. We can check the performance of the Naive Bayes model using confusionMatrix(), comparing the predicted class (pred_naive) with the actual label in data_test. We will use malignant as the positive class.
We can use the following metrics to interpret the model:
- Sensitivity/Recall: the proportion of actual positives that are correctly identified as positive.
- Pos Pred Value/Precision: the proportion of predicted positives that are actually positive.
- Accuracy: the proportion of all cases that are correctly identified.
- Specificity: the proportion of actual negatives that are correctly identified as negative.
confusionMatrix(data = pred_naive,
                reference = data_test$diagnosis,
                positive = "malignant")
## Confusion Matrix and Statistics
##
## Reference
## Prediction benign malignant
## benign 85 5
## malignant 5 48
##
## Accuracy : 0.9301
## 95% CI : (0.8752, 0.966)
## No Information Rate : 0.6294
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.8501
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9057
## Specificity : 0.9444
## Pos Pred Value : 0.9057
## Neg Pred Value : 0.9444
## Prevalence : 0.3706
## Detection Rate : 0.3357
## Detection Prevalence : 0.3706
## Balanced Accuracy : 0.9251
##
## 'Positive' Class : malignant
##
Observe from the confusion matrix that:
- Out of 53 actual “malignant”, we classified 48 of them correctly.
- Out of 90 actual “benign”, we classified 85 of them correctly.
- Out of 143 observations in our test data, we classified 133 of them correctly.
- The accuracy is 93.01%
- The sensitivity or recall is 90.57%
- The pos pred value or precision is 90.57%
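The figures in the list above can be reproduced by hand from the four cells of the Naive Bayes confusion matrix. The following sketch uses the counts reported in the output, with malignant as the positive class; the variable names are illustrative.

```r
# Cells of the Naive Bayes confusion matrix (positive class = malignant)
tp <- 48  # malignant predicted as malignant
fn <- 5   # malignant predicted as benign
fp <- 5   # benign predicted as malignant
tn <- 85  # benign predicted as benign

sensitivity <- tp / (tp + fn)                   # recall
specificity <- tn / (tn + fp)
precision   <- tp / (tp + fp)                   # positive predictive value
accuracy    <- (tp + tn) / (tp + fn + fp + tn)

round(c(sensitivity = sensitivity, specificity = specificity,
        precision = precision, accuracy = accuracy), 4)
```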
DECISION TREE
The next model we will build is a Decision Tree. We use the ctree() function to build the model. To tune our model, we will set the parameter mincriterion = 0.99.
set.seed(100)
model_dt <- ctree(formula = diagnosis ~ .,
                  data = data_train_down,
                  control = ctree_control(mincriterion = 0.99))
Now, we will visualize the model to gain a better understanding.
plot(model_dt, type = "simple")
[Figure: plot of the fitted decision tree]
Based on the visualization, we can interpret it as:
- a patient who has concave.points_worst > 0.142, with concavity_se > 0.104, is expected to be malignant.
- a patient who has concave.points_worst > 0.142, with concavity_se <= 0.104, and concave.points_mean > 0.052 is expected to be malignant.
- a patient who has concave.points_worst <= 0.142, with area_worst > 862.1, and texture_mean > 18.29 is expected to be malignant.
- a patient who has concave.points_worst <= 0.142, with area_worst <= 862.1, and area_se > 32.96 or <= 32 is expected to be benign.
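The rules listed above can be written as nested conditions. The sketch below is a hand-written illustration, not output from model_dt: the function name classify_by_rules is hypothetical, branches that the list does not describe return NA, and the last rule is read as "benign on either side of the area_se split", which is an assumption.

```r
# Hand-coded version of the tree paths described in the text.
# `p` is a list (or one-row data frame) holding the relevant measurements.
classify_by_rules <- function(p) {
  if (p$concave.points_worst > 0.142) {
    if (p$concavity_se > 0.104) return("malignant")
    if (p$concave.points_mean > 0.052) return("malignant")
    return(NA_character_)  # path not described in the text
  }
  if (p$area_worst > 862.1) {
    if (p$texture_mean > 18.29) return("malignant")
    return(NA_character_)  # path not described in the text
  }
  # area_worst <= 862.1: read as benign whichever side of the area_se split (assumption)
  "benign"
}

# First observation shown in the glimpse() output (diagnosis "M")
first_patient <- list(concave.points_worst = 0.2654, concavity_se = 0.05373,
                      concave.points_mean = 0.1471, area_worst = 2019.0,
                      texture_mean = 10.38)
classify_by_rules(first_patient)
```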
Model Prediction
Now that we have the model, we will predict the test data with model_dt using the predict() function, setting type = "response" to obtain class predictions.
pred_dt <- predict(object = model_dt,
                   newdata = data_test,
                   type = "response")
Model Evaluation
We will use confusionMatrix() to measure our model performance. We will use malignant as the positive class.
confusionMatrix(data = pred_dt,
                reference = data_test$diagnosis,
                positive = "malignant")
## Confusion Matrix and Statistics
##
## Reference
## Prediction benign malignant
## benign 86 5
## malignant 4 48
##
## Accuracy : 0.9371
## 95% CI : (0.8839, 0.9708)
## No Information Rate : 0.6294
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.8646
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9057
## Specificity : 0.9556
## Pos Pred Value : 0.9231
## Neg Pred Value : 0.9451
## Prevalence : 0.3706
## Detection Rate : 0.3357
## Detection Prevalence : 0.3636
## Balanced Accuracy : 0.9306
##
## 'Positive' Class : malignant
##
Observe from the confusion matrix that:
- Out of 53 actual “malignant”, we classified 48 of them correctly.
- Out of 90 actual “benign”, we classified 86 of them correctly.
- Out of 143 observations in our test data, we classified 134 of them correctly.
- The accuracy is 93.71%
- The sensitivity or recall is 90.57%
- The pos pred value or precision is 92.31%
RANDOM FOREST
The last model that we want to build is Random Forest. The following are among the advantages of the random forest model:
- It reduces the variance (overfitting) of a single decision tree by aggregating many trees
- Automatic feature selection
- It generates an unbiased estimate of the out-of-bag (OOB) error
We will build the Random Forest model with the following settings:
- set.seed(417) # the seed number
- number = 5 # the number of folds in k-fold cross-validation
- repeats = 3 # the number of times the cross-validation is repeated
Finally, we will save the model to a local file so we don’t have to run the model fitting code again.
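To show what 5-fold cross-validation repeated 3 times means in terms of resamples, here is a minimal base R sketch. The row count n and all object names are hypothetical, and caret builds its own folds internally, so this only illustrates the idea.

```r
set.seed(417)

n <- 20       # hypothetical number of training rows
k <- 5        # number of folds
repeats <- 3  # number of times the k-fold split is repeated

resamples <- list()
for (r in seq_len(repeats)) {
  # Randomly assign each row to one of k folds for this repeat
  fold_id <- sample(rep(seq_len(k), length.out = n))
  for (f in seq_len(k)) {
    # Rows held out for validation in fold f of repeat r
    resamples[[paste0("Fold", f, ".Rep", r)]] <- which(fold_id == f)
  }
}

length(resamples)  # k * repeats held-out sets in total
```

Each candidate model is fitted once per resample on the rows not held out, and the performance is averaged over all of the held-out sets.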
set.seed(417)
ctrl <- trainControl(method = "repeatedcv",
                     number = 5,
                     repeats = 3)
breast_rf <- train(diagnosis ~ .,
                   data = data_train_down,
                   method = "rf",
                   trControl = ctrl)
## Warning in (function (kind = NULL, normal.kind = NULL, sample.kind = NULL) :
## non-uniform 'Rounding' sampler used
saveRDS(breast_rf, "breast_rf.RDS")
Now check the summary of the final model using breast_rf$finalModel.
breast_rf$finalModel
##
## Call:
## randomForest(x = x, y = y, mtry = min(param$mtry, ncol(x)))
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 2.25%
## Confusion matrix:
## benign malignant class.error
## benign 259 8 0.02996255
## malignant 4 263 0.01498127
Based on the breast_rf$finalModel summary above, the out-of-bag (OOB) error estimate is 2.25%, which serves as an estimate of the error on unseen data.
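The OOB error rate and the per-class errors can be checked by hand from the confusion matrix printed above. The sketch below re-enters those counts; the object names are illustrative.

```r
# OOB confusion matrix from the summary (rows = actual class, columns = predicted class)
oob_confusion <- matrix(c(259,   8,
                            4, 263),
                        nrow = 2, byrow = TRUE,
                        dimnames = list(actual = c("benign", "malignant"),
                                        predicted = c("benign", "malignant")))

# Per-class error: share of each actual class that was misclassified
class_error <- 1 - diag(oob_confusion) / rowSums(oob_confusion)

# Overall OOB error: share of all observations that were misclassified
oob_error <- 1 - sum(diag(oob_confusion)) / sum(oob_confusion)

round(class_error, 8)
round(100 * oob_error, 2)
```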
Model Prediction
After building the model, we can now predict the test data with breast_rf. We use the predict() function and set the parameter type = "raw" to obtain class predictions.
rf_pred <- predict(breast_rf, data_test, type = "raw")
Model Evaluation
Next, we will evaluate the random forest model we built using confusionMatrix().
confusionMatrix(data = rf_pred,
                reference = data_test$diagnosis,
                positive = "malignant")
## Confusion Matrix and Statistics
##
## Reference
## Prediction benign malignant
## benign 85 4
## malignant 5 49
##
## Accuracy : 0.9371
## 95% CI : (0.8839, 0.9708)
## No Information Rate : 0.6294
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.8656
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9245
## Specificity : 0.9444
## Pos Pred Value : 0.9074
## Neg Pred Value : 0.9551
## Prevalence : 0.3706
## Detection Rate : 0.3427
## Detection Prevalence : 0.3776
## Balanced Accuracy : 0.9345
##
## 'Positive' Class : malignant
##
Observe from the confusion matrix that:
- Out of 53 actual “malignant”, we classified 49 of them correctly.
- Out of 90 actual “benign”, we classified 85 of them correctly.
- Out of 143 observations in our test data, we classified 134 of them correctly.
- The accuracy is 93.71%
- The sensitivity or recall is 92.45%
- The pos pred value or precision is 90.74%
Conclusion
From a doctor’s perspective, we want to minimize the number of malignant cases that we fail to detect. If a patient has a malignant tumor but we misclassify it as benign, the cancer may worsen because the patient does not receive proper treatment. We will therefore prioritize recall (sensitivity) over precision. As the Model Evaluation sections show, the random forest has the highest sensitivity (92.45%, compared with 90.57% for both Naive Bayes and the decision tree). So, the random forest is the best model in this project for predicting breast cancer.
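The comparison behind this conclusion can be sketched in a few lines of R, using the true positive and false negative counts from the three confusion matrices reported earlier. The data frame and column names are chosen here for illustration.

```r
# Malignant cases detected and missed by each model on the test data
results <- data.frame(
  model = c("Naive Bayes", "Decision Tree", "Random Forest"),
  true_positive = c(48, 48, 49),
  false_negative = c(5, 5, 4),
  stringsAsFactors = FALSE
)

# Recall = detected malignant cases / all actual malignant cases
results$recall <- results$true_positive /
  (results$true_positive + results$false_negative)

# Model with the highest recall
best_model <- results$model[which.max(results$recall)]
best_model
```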