Breast Cancer Wisconsin (Diagnostic)

Introduction

Breast cancer is cancer that develops from breast tissue. Worldwide, it is the most common invasive cancer in women, affecting about 1 in 7 (14%) of women.

Most types of breast cancer are easy to diagnose by microscopic analysis of a sample, or biopsy, of the affected area of the breast. The two most commonly used screening methods, physical examination of the breasts by a healthcare provider and mammography, can offer an approximate likelihood that a lump is cancer, and may also detect some other lesions, such as a simple cyst. When these examinations are inconclusive, a healthcare provider can remove a sample of the fluid in the lump for microscopic analysis (a procedure known as fine needle aspiration, or fine needle aspiration cytology, FNAC) to help establish the diagnosis.

In this project, we explore the Breast Cancer Wisconsin (Diagnostic) data set, which contains the diagnosis and features computed from a digitized image of a fine needle aspirate (FNA), to discover the underlying patterns between the FNA features and the diagnosis.

Data Preparation

Load data

The original data set comes from Kaggle.
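
The loading code is not shown in the original output; presumably the CSV was read with readr. A minimal sketch, assuming the file is saved locally as data.csv: the CSV has a trailing comma in its header, which creates an empty column X33 and triggers the parsing warnings shown below, so that column is dropped here.

# Assumed loading step: read the raw Kaggle CSV and drop the empty trailing column
library(tidyverse)  # readr, dplyr, tidyr and ggplot2 are used throughout

breastcancer <- read_csv("data.csv")
breastcancer <- breastcancer %>% select(-X33)  # X33 is all NA (trailing-comma artifact)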

## Warning: Missing column names filled in: 'X33' [33]
## Warning: 569 parsing failures.
## row col   expected     actual       file
##   1  -- 33 columns 32 columns 'data.csv'
##   2  -- 33 columns 32 columns 'data.csv'
##   3  -- 33 columns 32 columns 'data.csv'
##   4  -- 33 columns 32 columns 'data.csv'
##   5  -- 33 columns 32 columns 'data.csv'
## ... ... .......... .......... ..........
## See problems(...) for more details.

There are 569 cases in the data set, and ten real-valued features are computed for each cell nucleus:

  1. radius (mean of distances from center to points on the perimeter)
  2. texture (standard deviation of gray-scale values)
  3. perimeter
  4. area
  5. smoothness (local variation in radius lengths)
  6. compactness (perimeter^2 / area - 1.0)
  7. concavity (severity of concave portions of the contour)
  8. concave points (number of concave portions of the contour)
  9. symmetry
  10. fractal dimension (“coastline approximation” - 1)

The mean, standard error and “worst” or largest (mean of the three largest values) of these features were computed for each image, resulting in 30 features.

Data Cleaning

In this step, we convert the response variable diagnosis to a factor to facilitate our analysis. We also replace the white spaces in variable names with underscores to keep them consistent.

There are no missing or abnormal values in the data set.

# Convert the response to a factor (B = benign, M = malignant)
breastcancer <- breastcancer %>%
  mutate(diagnosis = as.factor(diagnosis))

colnames(breastcancer)
##  [1] "id"                      "diagnosis"              
##  [3] "radius_mean"             "texture_mean"           
##  [5] "perimeter_mean"          "area_mean"              
##  [7] "smoothness_mean"         "compactness_mean"       
##  [9] "concavity_mean"          "concave points_mean"    
## [11] "symmetry_mean"           "fractal_dimension_mean" 
## [13] "radius_se"               "texture_se"             
## [15] "perimeter_se"            "area_se"                
## [17] "smoothness_se"           "compactness_se"         
## [19] "concavity_se"            "concave points_se"      
## [21] "symmetry_se"             "fractal_dimension_se"   
## [23] "radius_worst"            "texture_worst"          
## [25] "perimeter_worst"         "area_worst"             
## [27] "smoothness_worst"        "compactness_worst"      
## [29] "concavity_worst"         "concave points_worst"   
## [31] "symmetry_worst"          "fractal_dimension_worst"
# Replace the space in the three "concave points" columns with an underscore
colnames(breastcancer)[10] <- "concave_points_mean"
colnames(breastcancer)[20] <- "concave_points_se"
colnames(breastcancer)[30] <- "concave_points_worst"
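
Only the three “concave points” columns contain spaces, so renaming by index is enough here. A more general alternative (an equivalent sketch, not used in the original) replaces every space at once:

# Equivalent general approach: replace all spaces in column names with underscores
colnames(breastcancer) <- gsub(" ", "_", colnames(breastcancer))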

We can also check the distribution of the response variable: about 62.7% of the cases are benign and 37.3% are malignant.

table(breastcancer$diagnosis)
## 
##   B   M 
## 357 212
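
The proportions quoted above can be verified directly (a quick check, not shown in the original):

# Class proportions of the response
prop.table(table(breastcancer$diagnosis))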

Data Description

Variable Name: Description
id: ID number
diagnosis: the diagnosis of breast tissue (M = malignant, B = benign)
radius_mean: mean of distances from center to points on the perimeter
texture_mean: standard deviation of gray-scale values
perimeter_mean: mean size of the core tumor
area_mean: mean area of the core tumor
smoothness_mean: mean of local variation in radius lengths
compactness_mean: mean of perimeter^2 / area - 1.0
concavity_mean: mean of severity of concave portions of the contour
concave_points_mean: mean number of concave portions of the contour
symmetry_mean: mean of symmetry
fractal_dimension_mean: mean for “coastline approximation” - 1
radius_se: standard error for the mean of distances from center to points on the perimeter
texture_se: standard error for standard deviation of gray-scale values
perimeter_se: standard error for the perimeter of the core tumor
area_se: standard error for area
smoothness_se: standard error for local variation in radius lengths
compactness_se: standard error for perimeter^2 / area - 1.0
concavity_se: standard error for severity of concave portions of the contour
concave_points_se: standard error for number of concave portions of the contour
symmetry_se: standard error for symmetry
fractal_dimension_se: standard error for “coastline approximation” - 1
radius_worst: “worst” or largest mean value for distances from center to points on the perimeter
texture_worst: “worst” or largest mean value for standard deviation of gray-scale values
perimeter_worst: “worst” or largest mean value for the perimeter of the core tumor
area_worst: “worst” or largest mean value for area
smoothness_worst: “worst” or largest mean value for local variation in radius lengths
compactness_worst: “worst” or largest mean value for perimeter^2 / area - 1.0
concavity_worst: “worst” or largest mean value for severity of concave portions of the contour
concave_points_worst: “worst” or largest mean value for number of concave portions of the contour
symmetry_worst: “worst” or largest mean value for symmetry
fractal_dimension_worst: “worst” or largest mean value for “coastline approximation” - 1

EDA

First, let’s look at the distribution of the response.

ggplot(breastcancer, aes(x = diagnosis, fill = diagnosis)) +
  geom_bar() +
  labs(title = 'Number of Benign vs. Malignant Cases')

Then let’s look at the density plot of each predictor by diagnosis types.

feature_names <- names(breastcancer)[3:32]

# Reshape to long format so all 30 predictors can be faceted in one plot
breastcancer %>%
  select(diagnosis, all_of(feature_names)) %>%
  pivot_longer(cols = all_of(feature_names)) %>%
  ggplot(aes(x = value)) +
  geom_density(aes(fill = diagnosis), alpha = 0.5) +
  facet_wrap(~name, scales = 'free') +
  labs(title = 'Variable Density Plot',
       subtitle = 'by Diagnosis',
       x = '', y = 'density') +
  theme(axis.text.y = element_blank())

From the figure above, we can see that for many predictors the two diagnosis types have very different distributions. For example, in area_mean, area_se, area_worst, concave_points_mean and concave_points_worst, the benign group has a significantly smaller mean than the malignant group. Those predictors could be very useful when we build a classification model.
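
As a quick numeric check of this observation (not part of the original analysis), we can compare the group means of a few of these features:

# Compare group means for a few of the features highlighted above
breastcancer %>%
  group_by(diagnosis) %>%
  summarise(area_mean = mean(area_mean),
            area_worst = mean(area_worst),
            concave_points_mean = mean(concave_points_mean),
            concave_points_worst = mean(concave_points_worst))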

Let’s also check the correlation between predictors and draw a correlation plot.

breastcancer %>%
  select(all_of(feature_names)) %>%
  cor() %>%
  corrplot::corrplot(method = 'color',
                     order = 'hclust',
                     type = 'upper',
                     diag = FALSE,
                     main = 'Correlation between Predictors',
                     mar = c(.5, .5, .5, .5))

It’s clear that there are strong positive correlations among area_mean, radius_mean and perimeter_mean, and likewise between area_worst and perimeter_worst.
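
These relationships can be confirmed numerically (a quick check, not in the original):

# Pairwise correlations among the size-related predictors
cor_mat <- cor(breastcancer[, feature_names])
cor_mat["radius_mean", c("perimeter_mean", "area_mean")]
cor_mat["area_worst", "perimeter_worst"]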

Supervised Learning

The next step is to build classification models. In this part, we’ll build a random forest, a support vector machine, and more.

We first conduct a train-test split for model validation.

set.seed(7047)
# 75/25 train-test split
idx <- sample(1:nrow(breastcancer), nrow(breastcancer) * 0.75)

# Drop the id column, which carries no predictive information
bcancer <- breastcancer[, -1]

bcancer_train <- bcancer[idx, ]
bcancer_test <- bcancer[-idx, ]
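
Since the split is random rather than stratified, a quick check (not in the original) confirms that both classes are well represented in each partition:

# Class counts in the training and test sets
table(bcancer_train$diagnosis)
table(bcancer_test$diagnosis)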

Random Forest

For the number of variables tried at each split we use m = sqrt(p); with p = 30 predictors, sqrt(30) ≈ 5.5, so we round up to mtry = 6.

library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
cancer_rf <- randomForest(diagnosis~., data = bcancer_train, mtry = 6)
cancer_rf
## 
## Call:
##  randomForest(formula = diagnosis ~ ., data = bcancer_train, mtry = 6) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 6
## 
##         OOB estimate of  error rate: 4.69%
## Confusion matrix:
##     B   M class.error
## B 266   9  0.03272727
## M  11 140  0.07284768
# define cost function
library(ROCR)
## Loading required package: gplots
## 
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
## 
##     lowess
# AUC-based evaluation function: returns the area under the ROC curve
cost_auc <- function(prob, obs) {
  pred <- prediction(prob, obs)                  # ROCR prediction object
  perf <- performance(pred, "tpr", "fpr")        # ROC curve object (not returned, kept for plotting)
  cost <- slot(performance(pred, "auc"), "y.values")[[1]]  # extract the AUC value
  
  return(cost)
}

# in-sample AUC (predict() on a randomForest object without newdata returns out-of-bag probabilities)
cancer_rf_prob_train <- predict(cancer_rf, type = "prob")[,2]
cost_auc(cancer_rf_prob_train, bcancer_train$diagnosis)
## [1] 0.989693
# out-of-sample AUC on the held-out test set
cancer_rf_prob_test <- predict(cancer_rf, newdata = bcancer_test,type = "prob")[,2]
cost_auc(cancer_rf_prob_test, bcancer_test$diagnosis)
## [1] 0.9880048
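
If a visual check is wanted, the test-set ROC curve can also be plotted with ROCR (a sketch, not part of the original analysis):

# Plot the test-set ROC curve for the random forest
pred_obj <- prediction(cancer_rf_prob_test, bcancer_test$diagnosis)
perf_roc <- performance(pred_obj, "tpr", "fpr")
plot(perf_roc, main = "Random Forest ROC (test set)")
abline(0, 1, lty = 2)  # reference line for a random classifier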

SVM

library(e1071)
## Warning: package 'e1071' was built under R version 3.6.3
# Fit an SVM with e1071 defaults (radial basis kernel)
cancer_svm <- svm(diagnosis ~ ., data = bcancer_train)

# in-sample
pred_svm_train <-  predict(cancer_svm)
table(pred_svm_train, bcancer_train$diagnosis)
##               
## pred_svm_train   B   M
##              B 275   6
##              M   0 145
mean(pred_svm_train != bcancer_train$diagnosis)
## [1] 0.01408451
# out-of-sample 
pred_svm_test  <-  predict(cancer_svm, bcancer_test)
table(pred_svm_test, bcancer_test$diagnosis)
##              
## pred_svm_test  B  M
##             B 82  1
##             M  0 60
mean(pred_svm_test != bcancer_test$diagnosis)
## [1] 0.006993007
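
To compare the two models on the same AUC metric, the SVM can be refit with probability estimates enabled (a sketch under that assumption, not part of the original):

# Refit the SVM with probability estimates so cost_auc() can be reused
cancer_svm_prob <- svm(diagnosis ~ ., data = bcancer_train, probability = TRUE)
svm_pred_test <- predict(cancer_svm_prob, bcancer_test, probability = TRUE)
svm_prob_test <- attr(svm_pred_test, "probabilities")[, "M"]  # probability of malignant

# Out-of-sample AUC for the SVM
cost_auc(svm_prob_test, bcancer_test$diagnosis)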