Breast Cancer Classification

Introduction

In this project, we predict whether the breast cancer diagnosed in a hospital's patients is malignant or benign. We will use three classification models: Naive Bayes, Decision Tree, and Random Forest.

Library

library(dplyr) # for data wrangling
library(caret) # for upsampling the training data and evaluating the models
library(e1071) # for the Naive Bayes model
library(partykit) # for the Decision Tree model
library(randomForest) # for the Random Forest model fitted through caret

Data Exploration

breast <- read.csv("breast_cancer_diagnose.csv")
glimpse(breast)

## Rows: 569
## Columns: 33
## $ id                      <int> 842302, 842517, 84300903, 84348301, 84358402, ~
## $ diagnosis               <chr> "M", "M", "M", "M", "M", "M", "M", "M", "M", "~
## $ radius_mean             <dbl> 17.990, 20.570, 19.690, 11.420, 20.290, 12.450~
## $ texture_mean            <dbl> 10.38, 17.77, 21.25, 20.38, 14.34, 15.70, 19.9~
## $ perimeter_mean          <dbl> 122.80, 132.90, 130.00, 77.58, 135.10, 82.57, ~
## $ area_mean               <dbl> 1001.0, 1326.0, 1203.0, 386.1, 1297.0, 477.1, ~
## $ smoothness_mean         <dbl> 0.11840, 0.08474, 0.10960, 0.14250, 0.10030, 0~
## $ compactness_mean        <dbl> 0.27760, 0.07864, 0.15990, 0.28390, 0.13280, 0~
## $ concavity_mean          <dbl> 0.30010, 0.08690, 0.19740, 0.24140, 0.19800, 0~
## $ concave.points_mean     <dbl> 0.14710, 0.07017, 0.12790, 0.10520, 0.10430, 0~
## $ symmetry_mean           <dbl> 0.2419, 0.1812, 0.2069, 0.2597, 0.1809, 0.2087~
## $ fractal_dimension_mean  <dbl> 0.07871, 0.05667, 0.05999, 0.09744, 0.05883, 0~
## $ radius_se               <dbl> 1.0950, 0.5435, 0.7456, 0.4956, 0.7572, 0.3345~
## $ texture_se              <dbl> 0.9053, 0.7339, 0.7869, 1.1560, 0.7813, 0.8902~
## $ perimeter_se            <dbl> 8.589, 3.398, 4.585, 3.445, 5.438, 2.217, 3.18~
## $ area_se                 <dbl> 153.40, 74.08, 94.03, 27.23, 94.44, 27.19, 53.~
## $ smoothness_se           <dbl> 0.006399, 0.005225, 0.006150, 0.009110, 0.0114~
## $ compactness_se          <dbl> 0.049040, 0.013080, 0.040060, 0.074580, 0.0246~
## $ concavity_se            <dbl> 0.05373, 0.01860, 0.03832, 0.05661, 0.05688, 0~
## $ concave.points_se       <dbl> 0.015870, 0.013400, 0.020580, 0.018670, 0.0188~
## $ symmetry_se             <dbl> 0.03003, 0.01389, 0.02250, 0.05963, 0.01756, 0~
## $ fractal_dimension_se    <dbl> 0.006193, 0.003532, 0.004571, 0.009208, 0.0051~
## $ radius_worst            <dbl> 25.38, 24.99, 23.57, 14.91, 22.54, 15.47, 22.8~
## $ texture_worst           <dbl> 17.33, 23.41, 25.53, 26.50, 16.67, 23.75, 27.6~
## $ perimeter_worst         <dbl> 184.60, 158.80, 152.50, 98.87, 152.20, 103.40,~
## $ area_worst              <dbl> 2019.0, 1956.0, 1709.0, 567.7, 1575.0, 741.6, ~
## $ smoothness_worst        <dbl> 0.1622, 0.1238, 0.1444, 0.2098, 0.1374, 0.1791~
## $ compactness_worst       <dbl> 0.6656, 0.1866, 0.4245, 0.8663, 0.2050, 0.5249~
## $ concavity_worst         <dbl> 0.71190, 0.24160, 0.45040, 0.68690, 0.40000, 0~
## $ concave.points_worst    <dbl> 0.26540, 0.18600, 0.24300, 0.25750, 0.16250, 0~
## $ symmetry_worst          <dbl> 0.4601, 0.2750, 0.3613, 0.6638, 0.2364, 0.3985~
## $ fractal_dimension_worst <dbl> 0.11890, 0.08902, 0.08758, 0.17300, 0.07678, 0~
## $ X                       <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~

Based on the output above, the breast cancer data consists of 569 observations and 33 variables. Each feature is described below:

ID number
Diagnosis (M = malignant, B = benign)

Ten real-valued features are computed for each cell nucleus:

a) radius (mean of distances from center to points on the perimeter)
b) texture (standard deviation of gray-scale values)
c) perimeter
d) area
e) smoothness (local variation in radius lengths)
f) compactness (perimeter^2 / area - 1.0)
g) concavity (severity of concave portions of the contour)
h) concave points (number of concave portions of the contour)
i) symmetry 
j) fractal dimension ("coastline approximation" - 1)

The mean, standard error, and “worst” or largest (mean of the three largest values) of these features were computed for each image, resulting in 30 features. For instance, field 3 is Mean Radius, field 13 is Radius SE, field 23 is Worst Radius.
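The naming scheme described above can be sketched in base R. The snippet rebuilds the 30 expected column names from the ten base features and the three statistic suffixes; the vector names `base_features`, `stat_suffixes`, and `feature_names` are illustrative and not part of the original analysis.

```r
# Ten base features, spelled as they appear in the glimpse() output
base_features <- c("radius", "texture", "perimeter", "area", "smoothness",
                   "compactness", "concavity", "concave.points", "symmetry",
                   "fractal_dimension")
stat_suffixes <- c("mean", "se", "worst")

# outer() pairs every feature with every suffix; as.vector() reads the result
# column by column, so all "_mean" names come first, then "_se", then "_worst"
feature_names <- as.vector(outer(base_features, stat_suffixes, paste, sep = "_"))

length(feature_names)   # 10 features x 3 statistics
head(feature_names, 3)
```

With the real data loaded, `setdiff(feature_names, names(breast))` would be one way to confirm that no expected column is missing.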

We will use the diagnosis variable as the target class to predict.

Data Pre-Processing

We will convert the diagnosis variable into a factor. We will also drop the id and X variables: id is only an identifier, and X appears to contain nothing but missing values.

breast <- breast %>% 
  mutate(diagnosis = factor(diagnosis, levels = c("B", "M"), labels = c("benign","malignant"))) %>% 
  select(-c(id,X))
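As a small illustration of the recoding step, the sketch below applies the same `factor()` call to a toy character vector (the vector `raw_diagnosis` is made up for the example). It shows that the order given in `levels` determines the order of the resulting factor levels.

```r
# Toy vector standing in for the raw diagnosis column
raw_diagnosis <- c("M", "B", "B", "M", "B")

diagnosis <- factor(raw_diagnosis,
                    levels = c("B", "M"),
                    labels = c("benign", "malignant"))

levels(diagnosis)   # "benign" comes first because "B" is listed first
table(diagnosis)
```

The level order matters later on: for a two-class problem, caret's confusionMatrix() treats the first level as the positive class unless told otherwise, which is why the evaluation steps pass positive = "malignant" explicitly.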

Next, we check whether the data has any missing values.

colSums(is.na(breast))

##               diagnosis             radius_mean            texture_mean 
##                       0                       0                       0 
##          perimeter_mean               area_mean         smoothness_mean 
##                       0                       0                       0 
##        compactness_mean          concavity_mean     concave.points_mean 
##                       0                       0                       0 
##           symmetry_mean  fractal_dimension_mean               radius_se 
##                       0                       0                       0 
##              texture_se            perimeter_se                 area_se 
##                       0                       0                       0 
##           smoothness_se          compactness_se            concavity_se 
##                       0                       0                       0 
##       concave.points_se             symmetry_se    fractal_dimension_se 
##                       0                       0                       0 
##            radius_worst           texture_worst         perimeter_worst 
##                       0                       0                       0 
##              area_worst        smoothness_worst       compactness_worst 
##                       0                       0                       0 
##         concavity_worst    concave.points_worst          symmetry_worst 
##                       0                       0                       0 
## fractal_dimension_worst 
##                       0
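Every count in the output above is zero, so no values are missing once X has been dropped. A more compact version of the same check might look like the sketch below, which uses a toy data frame (`toy`, invented for the example) with one all-NA column like X.

```r
# Toy data frame with an entirely missing column, mimicking X in the raw file
toy <- data.frame(
  id = 1:4,
  radius_mean = c(17.99, 20.57, 19.69, 11.42),
  X = c(NA, NA, NA, NA)
)

# Columns in which every value is missing
all_na_cols <- names(toy)[colSums(is.na(toy)) == nrow(toy)]

# Drop them, then confirm that nothing is missing any more
toy_clean <- toy[, setdiff(names(toy), all_na_cols), drop = FALSE]
anyNA(toy_clean)
```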

Cross Validation

Before we build our models, we split the dataset into training and test data. We take 75% of the rows for training and the remaining 25% for testing using the sample() function with set.seed(100), and store the results as data_train and data_test.

RNGkind(sample.kind = "Rounding")

## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used

set.seed(100)

index <- sample(nrow(breast), nrow(breast)*0.75)

data_train <- breast[index,]
data_test <- breast[-index,]
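One detail worth spelling out is the size of each subset: 569 * 0.75 is 426.75, which is not a whole number. The sketch below rounds down explicitly with floor(), which is an assumption about how the fractional size is handled, but one that agrees with the 143 test rows seen in the confusion matrices later on. The names `n`, `train_size`, and `toy_index` are illustrative.

```r
n <- 569                       # number of rows reported by glimpse()
train_size <- floor(n * 0.75)  # 426.75 is not a whole number, so round down

set.seed(100)
toy_index <- sample(n, train_size)  # row numbers drawn without replacement

length(toy_index)       # rows in the training set
n - length(toy_index)   # rows left for the test set
```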

Let’s look at the proportion of our target classes in the training data using prop.table(table(object$target)) to check whether the classes are balanced.

prop.table(table(data_train$diagnosis))

## 
##    benign malignant 
## 0.6267606 0.3732394

Based on the proportions above, the target variable is moderately imbalanced (about 63% benign versus 37% malignant), so we will balance the data before using it for our models. One important point is that any subsampling operation must be applied only to the training dataset. We will therefore use the upSample() function from the caret package on the training data.

set.seed(100)
# note: despite the "_down" suffix, this object holds the upsampled training data
data_train_down <- upSample(x = data_train %>% select(-diagnosis),
                            y = data_train$diagnosis,
                            yname = "diagnosis")

prop.table(table(data_train_down$diagnosis))

## 
##    benign malignant 
##       0.5       0.5
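To show what upsampling does conceptually, here is a rough base R analogue on a toy data frame. It is a sketch of the idea (resample the minority class with replacement until the classes match), not caret's actual implementation, and the names `toy_train`, `target_n`, and `toy_up` are made up for the example.

```r
set.seed(100)
toy_train <- data.frame(
  x = 1:10,
  diagnosis = factor(c(rep("benign", 7), rep("malignant", 3)))
)

class_counts <- table(toy_train$diagnosis)
target_n <- max(class_counts)   # size of the largest class

# Keep every original row; for a class that is short of target_n,
# draw the missing number of rows from that class with replacement
upsampled_rows <- unlist(lapply(names(class_counts), function(cls) {
  rows <- which(toy_train$diagnosis == cls)
  extra <- target_n - length(rows)
  if (extra > 0) {
    c(rows, rows[sample.int(length(rows), extra, replace = TRUE)])
  } else {
    rows
  }
}))

toy_up <- toy_train[upsampled_rows, ]
table(toy_up$diagnosis)
```

Because the added rows are duplicates of existing minority-class rows, upsampling adds no new information; it only changes how much weight each class carries during training.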

Algorithms

NAIVE BAYES

After splitting our data into training and test sets and upsampling the training data, let’s build our first model, Naive Bayes. This model has several advantages, for example:

It is relatively fast to train
It produces probabilistic predictions
It is fairly robust to irrelevant features

We will build a Naive Bayes model using the naiveBayes() function from the e1071 package. We will set the laplace parameter to 1; note that Laplace smoothing only affects categorical predictors, so with the all-numeric predictors used here it does not change the fitted model.

model_naive <- naiveBayes(x = data_train_down %>% select(-diagnosis),
                          y = data_train_down$diagnosis,
                          laplace = 1)
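To make the mechanics more concrete, below is a minimal base R sketch of a Gaussian Naive Bayes classifier, fitted to the built-in iris data rather than the breast cancer data. It illustrates the idea (class priors combined with per-feature normal densities under an independence assumption) and is not a reimplementation of e1071; the function names `gaussian_nb_fit` and `gaussian_nb_predict` are hypothetical.

```r
# Estimate class priors plus the per-class mean and sd of every feature
gaussian_nb_fit <- function(x, y) {
  classes <- levels(y)
  list(
    classes = classes,
    prior = as.numeric(table(y)[classes]) / length(y),
    # both matrices have one row per feature and one column per class
    means = sapply(classes, function(k) colMeans(x[y == k, , drop = FALSE])),
    sds   = sapply(classes, function(k) apply(x[y == k, , drop = FALSE], 2, sd))
  )
}

# Score each class as log prior + sum of per-feature Gaussian log densities
gaussian_nb_predict <- function(fit, x) {
  x <- as.matrix(x)
  log_post <- sapply(seq_along(fit$classes), function(j) {
    total <- rep(log(fit$prior[j]), nrow(x))
    for (f in seq_len(ncol(x))) {
      total <- total + dnorm(x[, f], mean = fit$means[f, j],
                             sd = fit$sds[f, j], log = TRUE)
    }
    total
  })
  log_post <- matrix(log_post, nrow = nrow(x))  # keep a matrix even for one row
  factor(fit$classes[max.col(log_post, ties.method = "first")],
         levels = fit$classes)
}

features <- iris[, 1:4]
fit <- gaussian_nb_fit(features, iris$Species)
pred <- gaussian_nb_predict(fit, features)
mean(pred == iris$Species)   # training accuracy on the toy data
```

Working with log densities instead of multiplying raw probabilities avoids numerical underflow when there are many features, as with the 30 predictors in this project.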

Model Prediction

We will predict the test data using model_naive, with type = "class" to obtain class predictions.

pred_naive <- predict(object = model_naive, 
                      newdata = data_test, 
                      type="class")

Model Evaluation

The last part of model building is model evaluation. We can check the performance of the Naive Bayes model using confusionMatrix(), comparing the predicted class (pred_naive) with the actual label in data_test. We will use malignant as the positive class.

Several evaluation metrics help us interpret the model:

Sensitivity/Recall: the proportion of actual positives that are correctly identified as positive.
Pos Pred Value/Precision: the proportion of predicted positives that are actually positive.
Accuracy: the proportion of all cases that are correctly classified.
Specificity: the proportion of actual negatives that are correctly identified as negative.

confusionMatrix(data = pred_naive, 
                reference = data_test$diagnosis,
                positive = "malignant")

## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  benign malignant
##   benign        85         5
##   malignant      5        48
##                                          
##                Accuracy : 0.9301         
##                  95% CI : (0.8752, 0.966)
##     No Information Rate : 0.6294         
##     P-Value [Acc > NIR] : <2e-16         
##                                          
##                   Kappa : 0.8501         
##                                          
##  Mcnemar's Test P-Value : 1              
##                                          
##             Sensitivity : 0.9057         
##             Specificity : 0.9444         
##          Pos Pred Value : 0.9057         
##          Neg Pred Value : 0.9444         
##              Prevalence : 0.3706         
##          Detection Rate : 0.3357         
##    Detection Prevalence : 0.3706         
##       Balanced Accuracy : 0.9251         
##                                          
##        'Positive' Class : malignant      
##

Observe from the confusion matrix that:

Out of 53 actual “malignant”, we classified 48 of them correctly.
Out of 90 actual “benign”, we classified 85 of them correctly.
Out of 143 breast cancer cases in our test data, we classified 133 of them correctly.
The accuracy is 93.01%
The sensitivity or recall is 90.57%
The pos pred value or precision is 90.57%
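The figures above can be checked by hand from the four cells of the confusion matrix. The sketch below copies the counts from the Naive Bayes output and recomputes the metrics; the variable names are illustrative.

```r
# Counts copied from the Naive Bayes confusion matrix (positive class: malignant)
tp <- 48  # predicted malignant, actually malignant
fn <- 5   # predicted benign, actually malignant
fp <- 5   # predicted malignant, actually benign
tn <- 85  # predicted benign, actually benign

metrics <- c(
  accuracy    = (tp + tn) / (tp + tn + fp + fn),
  sensitivity = tp / (tp + fn),
  specificity = tn / (tn + fp),
  precision   = tp / (tp + fp)
)
round(metrics, 4)
```

Sensitivity and precision coincide here only because the numbers of false negatives and false positives happen to be equal.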

DECISION TREE

The next model we will build is a Decision Tree. We use the ctree() function from the partykit package. To tune the model, we set mincriterion = 0.99, so a split is made only when 1 - p-value of the split test exceeds 0.99.

set.seed(100)
model_dt <- ctree(formula = diagnosis ~ .,
                  data = data_train_down,
                  control = ctree_control(mincriterion = 0.99))

Now, we will try to visualize the model to have a better understanding.

plot(model_dt, type="simple")

Based on the visualization, we can interpret the tree as follows:

A patient with concave.points_worst > 0.142 and concavity_se > 0.104 is expected to be malignant.
A patient with concave.points_worst > 0.142, concavity_se <= 0.104, and concave.points_mean > 0.052 is expected to be malignant.
A patient with concave.points_worst <= 0.142, area_worst > 862.1, and texture_mean > 18.29 is expected to be malignant.
A patient with concave.points_worst <= 0.142 and area_worst <= 862.1 is expected to be benign, whether area_se is above 32.96 or not.
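The rules listed above can also be written as nested conditions. The function below is a hand-written sketch of those four rules only, not an export of model_dt; the name `classify_by_rules` is hypothetical, and branches that the text does not describe return NA instead of a guess.

```r
classify_by_rules <- function(concave.points_worst, concavity_se,
                              concave.points_mean, area_worst, texture_mean) {
  if (concave.points_worst > 0.142) {
    if (concavity_se > 0.104) return("malignant")
    if (concave.points_mean > 0.052) return("malignant")
    return(NA_character_)  # branch not described in the text
  }
  if (area_worst > 862.1) {
    if (texture_mean > 18.29) return("malignant")
    return(NA_character_)  # branch not described in the text
  }
  "benign"  # the text describes both sides of the area_se split as benign
}

# Values taken from the first row of the glimpse() output
classify_by_rules(concave.points_worst = 0.2654, concavity_se = 0.05373,
                  concave.points_mean = 0.1471, area_worst = 2019.0,
                  texture_mean = 10.38)
# "malignant"
```

The first patient falls under the second rule, and the result agrees with the recorded diagnosis ("M") for that row.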

Model Prediction

Now that we have the model, we will predict the test data with model_dt using the predict() function, setting type = "response" to obtain class predictions.

pred_dt <- predict(object = model_dt,
                   newdata = data_test,
                   type = "response")

Model Evaluation

We will use confusionMatrix() to evaluate model performance, again with malignant as the positive class.

confusionMatrix(data = pred_dt,
                reference = data_test$diagnosis,
                positive = "malignant")

## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  benign malignant
##   benign        86         5
##   malignant      4        48
##                                           
##                Accuracy : 0.9371          
##                  95% CI : (0.8839, 0.9708)
##     No Information Rate : 0.6294          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.8646          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.9057          
##             Specificity : 0.9556          
##          Pos Pred Value : 0.9231          
##          Neg Pred Value : 0.9451          
##              Prevalence : 0.3706          
##          Detection Rate : 0.3357          
##    Detection Prevalence : 0.3636          
##       Balanced Accuracy : 0.9306          
##                                           
##        'Positive' Class : malignant       
##

Observe from the confusion matrix that:

Out of 53 actual “malignant”, we classified 48 of them correctly.
Out of 90 actual “benign”, we classified 86 of them correctly.
Out of 143 breast cancer cases in our test data, we classified 134 of them correctly.
The accuracy is 93.71%
The sensitivity or recall is 90.57%
The pos pred value or precision is 92.31%

RANDOM FOREST

The last model that we want to build is Random Forest. Its advantages include:

It reduces variance (overfitting) compared with a single tree, because it aggregates many decision trees
It performs implicit feature selection
It provides an out-of-bag (OOB) estimate of the prediction error

We will build the Random Forest model with the following settings:

set.seed(417): the seed number
number = 5: the number of folds in k-fold cross-validation
repeats = 3: the number of times the k-fold cross-validation is repeated

Finally, we will save the model to a local file so we don’t have to rerun the model fitting.

set.seed(417)

ctrl <- trainControl(method="repeatedcv", 
                     number = 5, 
                     repeats = 3) 

breast_rf <- train(diagnosis ~ .,  
                   data = data_train_down, 
                   method = "rf", 
                   trControl = ctrl)

## Warning in (function (kind = NULL, normal.kind = NULL, sample.kind = NULL) :
## non-uniform 'Rounding' sampler used
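To illustrate what method = "repeatedcv" with number = 5 and repeats = 3 implies, the base R sketch below builds the 15 train/validation partitions by hand. It is a simplified illustration: caret creates its folds internally and may differ in detail (for example in how it balances classes across folds), and the names `fold_id` and `resamples` are made up for the example.

```r
set.seed(417)
n_rows  <- 534   # 267 + 267 upsampled training rows (see the OOB confusion matrix)
k       <- 5     # number of folds
repeats <- 3     # number of times the whole k-fold procedure is repeated

resamples <- list()
for (r in seq_len(repeats)) {
  # Shuffle the rows, then deal them into k folds of near-equal size
  fold_id <- sample(rep(seq_len(k), length.out = n_rows))
  for (f in seq_len(k)) {
    resamples[[sprintf("Fold%d.Rep%d", f, r)]] <- list(
      train    = which(fold_id != f),   # rows used to fit the model
      validate = which(fold_id == f)    # held-out rows used to score it
    )
  }
}

length(resamples)   # k * repeats fits per candidate parameter value
range(sapply(resamples, function(s) length(s$validate)))
```

Each candidate tuning value is scored on all 15 held-out folds, and the averaged performance decides which value is used for the final model.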

saveRDS(breast_rf, "breast_rf.RDS")
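A common pattern for reusing a saved model is to fit only when the file does not exist yet and to read it back otherwise. The sketch below shows that pattern with a small linear model as a stand-in for the expensive training step; the path and the object names (`model_path`, `toy_model`, `restored`) are hypothetical.

```r
model_path <- file.path(tempdir(), "toy_model.RDS")  # hypothetical path for the demo

if (file.exists(model_path)) {
  toy_model <- readRDS(model_path)          # reuse the stored fit
} else {
  toy_model <- lm(mpg ~ wt, data = mtcars)  # stand-in for an expensive training step
  saveRDS(toy_model, model_path)
}

# A later session can restore the same object without refitting
restored <- readRDS(model_path)
all.equal(coef(toy_model), coef(restored))
```

Applied to this project, the same idea would mean calling readRDS("breast_rf.RDS") in later sessions instead of rerunning train().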

Now check the summary of the final model using breast_rf$finalModel.

breast_rf$finalModel

## 
## Call:
##  randomForest(x = x, y = y, mtry = min(param$mtry, ncol(x))) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 2.25%
## Confusion matrix:
##           benign malignant class.error
## benign       259         8  0.02996255
## malignant      4       263  0.01498127

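Based on the breast_rf$finalModel summary above, the out-of-bag (OOB) error estimate is 2.25%. Because this estimate is computed on the upsampled training data, where duplicated rows can appear both in and out of a tree's bootstrap sample, it is likely optimistic as a measure of the error on truly unseen data.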
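The OOB figures in the summary follow directly from the printed confusion matrix. The sketch below copies the four counts and recomputes the overall and per-class error rates; the object names are illustrative.

```r
# OOB confusion matrix copied from the finalModel output (rows: actual class)
oob <- matrix(c(259,   8,
                  4, 263),
              nrow = 2, byrow = TRUE,
              dimnames = list(actual = c("benign", "malignant"),
                              predicted = c("benign", "malignant")))

oob_error   <- 1 - sum(diag(oob)) / sum(oob)   # misclassified rows / all rows
class_error <- 1 - diag(oob) / rowSums(oob)    # matches the class.error column

round(100 * oob_error, 2)   # overall OOB error in percent
round(class_error, 4)
```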

Model Prediction

After building the model, we can now predict the test data with breast_rf. We use the predict() function and set type = "raw" to obtain class predictions.

rf_pred <- predict(breast_rf, data_test, type="raw")

Model Evaluation

Next, we will evaluate the random forest model we built using confusionMatrix().

confusionMatrix(data = rf_pred,
                reference = data_test$diagnosis,
                positive = "malignant")

## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  benign malignant
##   benign        85         4
##   malignant      5        49
##                                           
##                Accuracy : 0.9371          
##                  95% CI : (0.8839, 0.9708)
##     No Information Rate : 0.6294          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.8656          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.9245          
##             Specificity : 0.9444          
##          Pos Pred Value : 0.9074          
##          Neg Pred Value : 0.9551          
##              Prevalence : 0.3706          
##          Detection Rate : 0.3427          
##    Detection Prevalence : 0.3776          
##       Balanced Accuracy : 0.9345          
##                                           
##        'Positive' Class : malignant       
##

Observe from the confusion matrix that:

Out of 53 actual “malignant”, we classified 49 of them correctly.
Out of 90 actual “benign”, we classified 85 of them correctly.
Out of 143 breast cancer cases in our test data, we classified 134 of them correctly.
The accuracy is 93.71%
The sensitivity or recall is 92.45%
The pos pred value or precision is 90.74%

Conclusion

From a doctor's perspective, the priority is to minimize the malignant cases that we fail to detect. If a patient has a malignant tumor but we classify it as benign, the cancer may worsen because the patient does not receive proper treatment. We therefore prioritize recall (sensitivity), even at the cost of somewhat lower precision. Across all of the model evaluations, Random Forest has the highest sensitivity (92.45%, versus 90.57% for both Naive Bayes and Decision Tree). Random Forest is therefore the best model in this project for predicting breast cancer.
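The comparison behind this conclusion can be laid out side by side. The sketch below copies the test-set counts from the three confusion matrices and recomputes recall, precision, and accuracy for each model; the data frame `results` is made up for the example.

```r
# Test-set counts copied from the three confusion matrices (positive class: malignant)
results <- data.frame(
  model = c("Naive Bayes", "Decision Tree", "Random Forest"),
  tp = c(48, 48, 49),   # malignant cases predicted as malignant
  fn = c(5, 5, 4),      # malignant cases predicted as benign
  fp = c(5, 4, 5),      # benign cases predicted as malignant
  tn = c(85, 86, 85)    # benign cases predicted as benign
)

results$recall    <- results$tp / (results$tp + results$fn)
results$precision <- results$tp / (results$tp + results$fp)
results$accuracy  <- (results$tp + results$tn) / 143

# Rank by recall first, because missed malignant cases are the costliest error
results[order(-results$recall), c("model", "recall", "precision", "accuracy", "fn")]
```

Random Forest ranks first on recall, but its lead over the other two models comes down to one fewer missed malignant case among the 53 in the test set, so the ranking should be read with some caution.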