Breast Cancer Wisconsin (Diagnostic)

Introduction

Breast cancer is cancer that develops from breast tissue. Worldwide, it is the most common invasive cancer in women, affecting about 1 in 7 (14%) of women.

Most types of breast cancer are easy to diagnose by microscopic analysis of a sample, or biopsy, of the affected area of the breast. The two most commonly used screening methods, physical examination of the breasts by a healthcare provider and mammography, can offer an approximate likelihood that a lump is cancer, and may also detect some other lesions, such as a simple cyst. When these examinations are inconclusive, a healthcare provider can remove a sample of the fluid in the lump for microscopic analysis (a procedure known as fine needle aspiration, or fine needle aspiration cytology, FNAC) to help establish the diagnosis.

In this project, we explore the Breast Cancer Wisconsin (Diagnostic) data set, which contains the diagnosis and features computed from a digitized image of a fine needle aspirate (FNA), to discover the underlying patterns between the FNA features and the diagnosis.

Data Preparation

Load data

The original data set comes from Kaggle.
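
The loading code is not shown in the original output; presumably the CSV was read with readr. A minimal sketch, assuming the file is saved locally as data.csv: the CSV has a trailing comma in its header, which creates an empty column X33 and triggers the parsing warnings shown below, so that column is dropped here.

# Assumed loading step: read the raw Kaggle CSV and drop the empty trailing column
library(tidyverse)  # readr, dplyr, tidyr and ggplot2 are used throughout

breastcancer <- read_csv("data.csv")
breastcancer <- breastcancer %>% select(-X33)  # X33 is all NA (trailing-comma artifact)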

## Warning: Missing column names filled in: 'X33' [33]
## Warning: 569 parsing failures.
## row col   expected     actual       file
##   1  -- 33 columns 32 columns 'data.csv'
##   2  -- 33 columns 32 columns 'data.csv'
##   3  -- 33 columns 32 columns 'data.csv'
##   4  -- 33 columns 32 columns 'data.csv'
##   5  -- 33 columns 32 columns 'data.csv'
## ... ... .......... .......... ..........
## See problems(...) for more details.

There are 569 cases in the data set, and ten real-valued features are computed for each cell nucleus:

  1. radius (mean of distances from center to points on the perimeter)
  2. texture (standard deviation of gray-scale values)
  3. perimeter
  4. area
  5. smoothness (local variation in radius lengths)
  6. compactness (perimeter^2 / area - 1.0)
  7. concavity (severity of concave portions of the contour)
  8. concave points (number of concave portions of the contour)
  9. symmetry
  10. fractal dimension (“coastline approximation” - 1)

The mean, standard error and “worst” or largest (mean of the three largest values) of these features were computed for each image, resulting in 30 features.

Data Cleaning

In this step, we convert the response variable diagnosis to a factor to facilitate our analysis. We also replace the white spaces in variable names with underscores to keep them consistent.

There are no missing or abnormal values in the data set.

# Convert the response to a factor (B = benign, M = malignant)
breastcancer <- breastcancer %>%
  mutate(diagnosis = as.factor(diagnosis))

colnames(breastcancer)
##  [1] "id"                      "diagnosis"              
##  [3] "radius_mean"             "texture_mean"           
##  [5] "perimeter_mean"          "area_mean"              
##  [7] "smoothness_mean"         "compactness_mean"       
##  [9] "concavity_mean"          "concave points_mean"    
## [11] "symmetry_mean"           "fractal_dimension_mean" 
## [13] "radius_se"               "texture_se"             
## [15] "perimeter_se"            "area_se"                
## [17] "smoothness_se"           "compactness_se"         
## [19] "concavity_se"            "concave points_se"      
## [21] "symmetry_se"             "fractal_dimension_se"   
## [23] "radius_worst"            "texture_worst"          
## [25] "perimeter_worst"         "area_worst"             
## [27] "smoothness_worst"        "compactness_worst"      
## [29] "concavity_worst"         "concave points_worst"   
## [31] "symmetry_worst"          "fractal_dimension_worst"
# Replace the space in the three "concave points" columns with an underscore
colnames(breastcancer)[10] <- "concave_points_mean"
colnames(breastcancer)[20] <- "concave_points_se"
colnames(breastcancer)[30] <- "concave_points_worst"
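
Only the three “concave points” columns contain spaces, so renaming by index is enough here. A more general alternative (an equivalent sketch, not used in the original) replaces every space at once:

# Equivalent general approach: replace all spaces in column names with underscores
colnames(breastcancer) <- gsub(" ", "_", colnames(breastcancer))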

We can also check the distribution of the response variable: about 62.7% of the cases are benign and 37.3% are malignant.

table(breastcancer$diagnosis)
## 
##   B   M 
## 357 212
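
The proportions quoted above can be verified directly (a quick check, not shown in the original):

# Class proportions of the response
prop.table(table(breastcancer$diagnosis))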

Data Description

Variable Name: Description
id: ID number
diagnosis: the diagnosis of breast tissue (M = malignant, B = benign)
radius_mean: mean of distances from center to points on the perimeter
texture_mean: standard deviation of gray-scale values
perimeter_mean: mean size of the core tumor
area_mean: mean area of the core tumor
smoothness_mean: mean of local variation in radius lengths
compactness_mean: mean of perimeter^2 / area - 1.0
concavity_mean: mean of severity of concave portions of the contour
concave_points_mean: mean number of concave portions of the contour
symmetry_mean: mean of symmetry
fractal_dimension_mean: mean for “coastline approximation” - 1
radius_se: standard error for the mean of distances from center to points on the perimeter
texture_se: standard error for standard deviation of gray-scale values
perimeter_se: standard error for the perimeter of the core tumor
area_se: standard error for area
smoothness_se: standard error for local variation in radius lengths
compactness_se: standard error for perimeter^2 / area - 1.0
concavity_se: standard error for severity of concave portions of the contour
concave_points_se: standard error for number of concave portions of the contour
symmetry_se: standard error for symmetry
fractal_dimension_se: standard error for “coastline approximation” - 1
radius_worst: “worst” or largest mean value for distances from center to points on the perimeter
texture_worst: “worst” or largest mean value for standard deviation of gray-scale values
perimeter_worst: “worst” or largest mean value for the perimeter of the core tumor
area_worst: “worst” or largest mean value for area
smoothness_worst: “worst” or largest mean value for local variation in radius lengths
compactness_worst: “worst” or largest mean value for perimeter^2 / area - 1.0
concavity_worst: “worst” or largest mean value for severity of concave portions of the contour
concave_points_worst: “worst” or largest mean value for number of concave portions of the contour
symmetry_worst: “worst” or largest mean value for symmetry
fractal_dimension_worst: “worst” or largest mean value for “coastline approximation” - 1

EDA

First, let’s look at the distribution of the response.

ggplot(breastcancer, aes(x = diagnosis, fill = diagnosis)) +
  geom_bar() +
  labs(title = 'Number of Benign vs. Malignant Cases')

Then let’s look at the density plot of each predictor by diagnosis types.

feature_names <- names(breastcancer)[3:32]

# Reshape to long format so all 30 predictors can be faceted in one plot
breastcancer %>%
  select(diagnosis, all_of(feature_names)) %>%
  pivot_longer(cols = all_of(feature_names)) %>%
  ggplot(aes(x = value)) +
  geom_density(aes(fill = diagnosis), alpha = 0.5) +
  facet_wrap(~name, scales = 'free') +
  labs(title = 'Variable Density Plot',
       subtitle = 'by Diagnosis',
       x = '', y = 'density') +
  theme(axis.text.y = element_blank())

From the figure above, we can see that for many predictors the two diagnosis types have very different distributions. For example, in area_mean, area_se, area_worst, concave_points_mean and concave_points_worst, the benign group has a significantly smaller mean than the malignant group. Those predictors could be very useful when we build a classification model.
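
As a quick numeric check of this observation (not part of the original analysis), we can compare the group means of a few of these features:

# Compare group means for a few of the features highlighted above
breastcancer %>%
  group_by(diagnosis) %>%
  summarise(area_mean = mean(area_mean),
            area_worst = mean(area_worst),
            concave_points_mean = mean(concave_points_mean),
            concave_points_worst = mean(concave_points_worst))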

Let’s also check the correlation between predictors and draw a correlation plot.

breastcancer %>%
  select(all_of(feature_names)) %>%
  cor() %>%
  corrplot::corrplot(method = 'color',
                     order = 'hclust',
                     type = 'upper',
                     diag = FALSE,
                     main = 'Correlation between Predictors',
                     mar = c(.5, .5, .5, .5))

It’s clear that there are strong positive correlations among area_mean, radius_mean and perimeter_mean, and likewise between area_worst and perimeter_worst.
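
These relationships can be confirmed numerically (a quick check, not in the original):

# Pairwise correlations among the size-related predictors
cor_mat <- cor(breastcancer[, feature_names])
cor_mat["radius_mean", c("perimeter_mean", "area_mean")]
cor_mat["area_worst", "perimeter_worst"]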

Supervised Learning

The next step is to build classification models. In this part, we’ll build a random forest, a support vector machine, and more.

We first conduct a train-test split for model validation.

set.seed(7047)
# 75/25 train-test split
idx <- sample(1:nrow(breastcancer), nrow(breastcancer) * 0.75)

# Drop the id column, which carries no predictive information
bcancer <- breastcancer[, -1]

bcancer_train <- bcancer[idx, ]
bcancer_test <- bcancer[-idx, ]
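
Since the split is random rather than stratified, a quick check (not in the original) confirms that both classes are well represented in each partition:

# Class counts in the training and test sets
table(bcancer_train$diagnosis)
table(bcancer_test$diagnosis)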

Random Forest

For the number of variables tried at each split we use m = sqrt(p); with p = 30 predictors, sqrt(30) ≈ 5.5, so we round up to mtry = 6.

library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
cancer_rf <- randomForest(diagnosis~., data = bcancer_train, mtry = 6)
cancer_rf
## 
## Call:
##  randomForest(formula = diagnosis ~ ., data = bcancer_train, mtry = 6) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 6
## 
##         OOB estimate of  error rate: 4.69%
## Confusion matrix:
##     B   M class.error
## B 266   9  0.03272727
## M  11 140  0.07284768
# define cost function
library(ROCR)
## Loading required package: gplots
## 
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
## 
##     lowess
# AUC-based evaluation function: returns the area under the ROC curve
cost_auc <- function(prob, obs) {
  pred <- prediction(prob, obs)                  # ROCR prediction object
  perf <- performance(pred, "tpr", "fpr")        # ROC curve object (not returned, kept for plotting)
  cost <- slot(performance(pred, "auc"), "y.values")[[1]]  # extract the AUC value
  
  return(cost)
}

# in-sample AUC (predict() on a randomForest object without newdata returns out-of-bag probabilities)
cancer_rf_prob_train <- predict(cancer_rf, type = "prob")[,2]
cost_auc(cancer_rf_prob_train, bcancer_train$diagnosis)
## [1] 0.989693
# out-of-sample AUC on the held-out test set
cancer_rf_prob_test <- predict(cancer_rf, newdata = bcancer_test,type = "prob")[,2]
cost_auc(cancer_rf_prob_test, bcancer_test$diagnosis)
## [1] 0.9880048
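
If a visual check is wanted, the test-set ROC curve can also be plotted with ROCR (a sketch, not part of the original analysis):

# Plot the test-set ROC curve for the random forest
pred_obj <- prediction(cancer_rf_prob_test, bcancer_test$diagnosis)
perf_roc <- performance(pred_obj, "tpr", "fpr")
plot(perf_roc, main = "Random Forest ROC (test set)")
abline(0, 1, lty = 2)  # reference line for a random classifier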

SVM

library(e1071)
## Warning: package 'e1071' was built under R version 3.6.3
# Fit an SVM with e1071 defaults (radial basis kernel)
cancer_svm <- svm(diagnosis ~ ., data = bcancer_train)

# in-sample
pred_svm_train <-  predict(cancer_svm)
table(pred_svm_train, bcancer_train$diagnosis)
##               
## pred_svm_train   B   M
##              B 275   6
##              M   0 145
mean(pred_svm_train != bcancer_train$diagnosis)
## [1] 0.01408451
# out-of-sample 
pred_svm_test  <-  predict(cancer_svm, bcancer_test)
table(pred_svm_test, bcancer_test$diagnosis)
##              
## pred_svm_test  B  M
##             B 82  1
##             M  0 60
mean(pred_svm_test != bcancer_test$diagnosis)
## [1] 0.006993007
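
To compare the two models on the same AUC metric, the SVM can be refit with probability estimates enabled (a sketch under that assumption, not part of the original):

# Refit the SVM with probability estimates so cost_auc() can be reused
cancer_svm_prob <- svm(diagnosis ~ ., data = bcancer_train, probability = TRUE)
svm_pred_test <- predict(cancer_svm_prob, bcancer_test, probability = TRUE)
svm_prob_test <- attr(svm_pred_test, "probabilities")[, "M"]  # probability of malignant

# Out-of-sample AUC for the SVM
cost_auc(svm_prob_test, bcancer_test$diagnosis)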