Breast cancer is cancer that develops from breast tissue. Worldwide, breast cancer is the most common invasive cancer in women. It affects 1 in 7 (14%) women worldwide.
Most types of breast cancer are easy to diagnose by microscopic analysis of a sample - or biopsy - of the affected area of the breast. The two most commonly used screening methods, physical examination of the breasts by a healthcare provider and mammography, can offer an approximate likelihood that a lump is cancer, and may also detect some other lesions, such as a simple cyst. When these examinations are inconclusive, a healthcare provider can remove a sample of the fluid in the lump for microscopic analysis (a procedure known as fine needle aspiration, or fine needle aspiration and cytology, FNAC) to help establish the diagnosis.
In this project, we’ll explore the Breast Cancer Wisconsin (Diagnostic) data set, which contains the diagnosis and features computed from a digitized image of a fine needle aspirate (FNA), to discover the relationship between the FNA features and the diagnosis.
The original data set comes from Kaggle.
## Warning: Missing column names filled in: 'X33' [33]
## Warning: 569 parsing failures.
## row col expected actual file
## 1 -- 33 columns 32 columns 'data.csv'
## 2 -- 33 columns 32 columns 'data.csv'
## 3 -- 33 columns 32 columns 'data.csv'
## 4 -- 33 columns 32 columns 'data.csv'
## 5 -- 33 columns 32 columns 'data.csv'
## ... ... .......... .......... ..........
## See problems(...) for more details.
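The warnings above are consistent with a header row that ends in a trailing comma: the parser then expects a 33rd, unnamed column that no data row supplies. A minimal sketch of one way to handle this, assuming base R's `read.csv()` and a tiny made-up file standing in for `data.csv` (the values and the temporary file are illustrative, not the real data):

```r
# Toy stand-in for the real file: the header ends with a trailing comma,
# so it declares one more column than the data rows contain.
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,diagnosis,radius_mean,",
             "1,M,17.99",
             "2,B,13.54"), tmp)

raw <- read.csv(tmp, stringsAsFactors = FALSE)

# Drop every column that contains only missing values (the phantom last column).
all_missing <- vapply(raw, function(col) all(is.na(col)), logical(1))
toy <- raw[, !all_missing, drop = FALSE]

names(toy)
```

Dropping columns by "all values missing" rather than by position keeps the cleanup valid even if the file layout changes.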
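The data set contains 569 diagnosed cases. Ten real-valued features are computed for each cell nucleus: radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry and fractal dimension.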
The mean, standard error and “worst” or largest (mean of the three largest values) of these features were computed for each image, resulting in 30 features.
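To make the three summary statistics concrete, here is a minimal sketch for a single feature of one image. The per-nucleus values are made up, `summarise_feature()` is a hypothetical helper, and the standard error is assumed to be the standard deviation divided by the square root of the number of nuclei:

```r
# Hypothetical helper: collapse per-nucleus measurements of one feature
# into the three image-level summaries described above.
summarise_feature <- function(x) {
  c(mean  = mean(x),
    se    = sd(x) / sqrt(length(x)),                # assumed definition of the standard error
    worst = mean(sort(x, decreasing = TRUE)[1:3]))  # mean of the three largest values
}

# Made-up radii for six nuclei in one image (illustrative values only).
radii <- c(1, 2, 3, 4, 5, 6)
summarise_feature(radii)
```

Applying the same helper to each of the ten features would yield the 30 columns per image.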
In this step, we convert the response variable diagnosis to a categorical variable to facilitate our analysis. We also replace the white space in variable names with underscores to make them consistent.
There are no missing or abnormal values in the data set.
breastcancer <- breastcancer %>%
  mutate(diagnosis = as.factor(diagnosis))
colnames(breastcancer)
## [1] "id" "diagnosis"
## [3] "radius_mean" "texture_mean"
## [5] "perimeter_mean" "area_mean"
## [7] "smoothness_mean" "compactness_mean"
## [9] "concavity_mean" "concave points_mean"
## [11] "symmetry_mean" "fractal_dimension_mean"
## [13] "radius_se" "texture_se"
## [15] "perimeter_se" "area_se"
## [17] "smoothness_se" "compactness_se"
## [19] "concavity_se" "concave points_se"
## [21] "symmetry_se" "fractal_dimension_se"
## [23] "radius_worst" "texture_worst"
## [25] "perimeter_worst" "area_worst"
## [27] "smoothness_worst" "compactness_worst"
## [29] "concavity_worst" "concave points_worst"
## [31] "symmetry_worst" "fractal_dimension_worst"
colnames(breastcancer)[10] = "concave_points_mean"
colnames(breastcancer)[20] = "concave_points_se"
colnames(breastcancer)[30] = "concave_points_worst"
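Renaming by position works here but would break if the column order ever changed. As an alternative sketch (not the approach used above), the spaces can be replaced by pattern, shown on a toy data frame whose column names mimic the affected ones:

```r
# Toy data frame; check.names = FALSE keeps the space in the column names.
toy <- data.frame(`concave points_mean` = 0.15,
                  `concave points_se`   = 0.02,
                  radius_mean           = 17.99,
                  check.names = FALSE)

# Replace every space in every column name with an underscore.
names(toy) <- gsub(" ", "_", names(toy), fixed = TRUE)
names(toy)
```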
We can also check the distribution of the response variable: about 62.7% of cases (0.6274165) are benign and 37.3% (0.3725835) are malignant.
table(breastcancer$diagnosis)
##
## B M
## 357 212
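The class proportions follow directly from these counts. A small sketch using the counts from the table (with the data frame loaded, `prop.table(table(breastcancer$diagnosis))` should give the same shares):

```r
# Class counts copied from the table above.
counts <- c(B = 357, M = 212)

# Share of each class among all cases.
props <- counts / sum(counts)
round(props, 4)
```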
| Variable Name | Description |
|---|---|
| id | ID number |
| diagnosis | The diagnosis of breast tissues (M = malignant, B = benign) |
| radius_mean | mean of distances from center to points on the perimeter |
| texture_mean | mean of standard deviation of gray-scale values |
| perimeter_mean | mean size of the core tumor |
| area_mean | mean size of area |
| smoothness_mean | mean of local variation in radius lengths |
| compactness_mean | mean of perimeter^2 / area - 1.0 |
| concavity_mean | mean of severity of concave portions of the contour |
| concave_points_mean | mean for number of concave portions of the contour |
| symmetry_mean | mean of symmetry |
| fractal_dimension_mean | mean for “coastline approximation” - 1 |
| radius_se | standard error for the mean of distances from center to points on the perimeter |
| texture_se | standard error for standard deviation of gray-scale values |
| perimeter_se | standard error for size of the core tumor |
| area_se | standard error for size of area |
| smoothness_se | standard error for local variation in radius lengths |
| compactness_se | standard error for perimeter^2 / area - 1.0 |
| concavity_se | standard error for severity of concave portions of the contour |
| concave_points_se | standard error for number of concave portions of the contour |
| symmetry_se | standard error for symmetry |
| fractal_dimension_se | standard error for “coastline approximation” - 1 |
| radius_worst | “worst” or largest mean value for mean of distances from center to points on the perimeter |
| texture_worst | “worst” or largest mean value for standard deviation of gray-scale values |
| perimeter_worst | “worst” or largest mean value for size of the core tumor |
| area_worst | “worst” or largest mean value for size of area |
| smoothness_worst | “worst” or largest mean value for local variation in radius lengths |
| compactness_worst | “worst” or largest mean value for perimeter^2 / area - 1.0 |
| concavity_worst | “worst” or largest mean value for severity of concave portions of the contour |
| concave_points_worst | “worst” or largest mean value for number of concave portions of the contour |
| symmetry_worst | “worst” or largest mean value for symmetry |
| fractal_dimension_worst | “worst” or largest mean value for “coastline approximation” - 1 |
First, let’s look at the distribution of response.
ggplot(breastcancer, aes(x = diagnosis, fill = diagnosis)) +
  geom_bar() +
  labs(title = 'Number of Benign vs. Malignant')
Then let’s look at the density plots of each predictor by diagnosis type.
feature_names <- names(breastcancer)[3:32]
# all_of() makes it explicit that feature_names is an external character vector
breastcancer %>%
  select(diagnosis, all_of(feature_names)) %>%
  pivot_longer(cols = all_of(feature_names)) %>%
  ggplot(aes(x = value)) +
  geom_density(aes(fill = diagnosis), alpha = 0.5) +
  facet_wrap(~name, scales = 'free') +
  labs(title = 'Variable Density Plot',
       subtitle = 'by Diagnosis',
       x = '', y = 'density') +
  theme(axis.text.y = element_blank())
From the figure above, we can see that for many predictors the two diagnosis types have very different distributions. For example, for area_mean, area_se, area_worst, concave_points_mean and concave_points_worst, the benign group has a noticeably smaller mean than the malignant group. These predictors could be very useful when we build a classification model.
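One way to back up the visual impression with numbers is to compare group means per predictor. A sketch on a made-up data frame (the values are illustrative only and are not taken from the data set; on the real data the same call would use `breastcancer`):

```r
# Made-up values for two predictors; not the real measurements.
toy <- data.frame(
  diagnosis           = factor(c("B", "B", "B", "M", "M", "M")),
  area_mean           = c(450, 500, 520, 900, 1000, 1100),
  concave_points_mean = c(0.02, 0.03, 0.025, 0.09, 0.10, 0.08)
)

# Mean of each predictor within each diagnosis group.
group_means <- aggregate(cbind(area_mean, concave_points_mean) ~ diagnosis,
                         data = toy, FUN = mean)
group_means
```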
Let’s also check the correlation between predictors and draw a correlation plot.
breastcancer %>%
  select(all_of(feature_names)) %>%
  cor() %>%
  corrplot::corrplot(method = 'color',
                     order = 'hclust',
                     type = 'upper',
                     diag = FALSE,
                     main = 'Correlation between Predictors',
                     mar = c(.5, .5, .5, .5))
There are clearly strong positive correlations among area_mean, radius_mean, perimeter_mean, area_worst and perimeter_worst.
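These correlations are expected on geometric grounds: for a roughly circular nucleus, perimeter and area are both functions of the radius. A sketch with idealized circles (an assumption made purely for illustration; real nuclei are not perfect circles and the radii below are made up):

```r
# Idealized circular nuclei with radii spanning a made-up range.
radius    <- seq(6, 28, by = 0.5)
perimeter <- 2 * pi * radius   # circumference of a circle
area      <- pi * radius^2     # area of a circle

# Pairwise Pearson correlations between the three size measures.
size_cor <- cor(cbind(radius, perimeter, area))
round(size_cor, 3)
```

Because these predictors carry largely overlapping information, models that are sensitive to collinearity may need only one of them.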
The next step is to build classification models. In this part, we’ll build a random forest, a support vector machine and ……
First, we split the data into training and test sets for model validation.
set.seed(7047)
idx <- sample(1:nrow(breastcancer), nrow(breastcancer)*0.75)
bcancer <- breastcancer[,-1]
bcancer_train <- bcancer[idx,]
bcancer_test <- bcancer[-idx,]
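For the random forest, we set the number of variables tried at each split to roughly the square root of the number of predictors: with p = 30, sqrt(p) ≈ 5.48, which we round up to m = 6.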
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
cancer_rf <- randomForest(diagnosis~., data = bcancer_train, mtry = 6)
cancer_rf
##
## Call:
## randomForest(formula = diagnosis ~ ., data = bcancer_train, mtry = 6)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 6
##
## OOB estimate of error rate: 4.69%
## Confusion matrix:
## B M class.error
## B 266 9 0.03272727
## M 11 140 0.07284768
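The error rates in the output above can be reproduced by hand from the confusion matrix. A short sketch, with the counts copied from the output and rows taken to be the actual classes (which matches the per-row `class.error` column):

```r
# Out-of-bag confusion matrix copied from the output above
# (rows = actual class, columns = predicted class).
conf <- matrix(c(266,   9,
                  11, 140),
               nrow = 2, byrow = TRUE,
               dimnames = list(actual = c("B", "M"), predicted = c("B", "M")))

# Per-class error: share of each actual class that was misclassified.
class_error <- 1 - diag(conf) / rowSums(conf)

# Overall OOB error: all off-diagonal cases divided by the total.
oob_error <- 1 - sum(diag(conf)) / sum(conf)

round(class_error, 4)
round(100 * oob_error, 2)
```

Note that the malignant class has the higher error rate, which matters in practice because a missed malignant case is usually the costlier mistake.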
# define cost function
library(ROCR)
## Loading required package: gplots
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
## lowess
cost_auc <- function(prob, obs) {
  # Build a ROCR prediction object and extract the area under the ROC curve
  pred <- prediction(prob, obs)
  auc <- slot(performance(pred, "auc"), "y.values")[[1]]
  return(auc)
}
# training-set AUC (predict() without newdata returns the out-of-bag predictions)
cancer_rf_prob_train <- predict(cancer_rf, type = "prob")[,2]
cost_auc(cancer_rf_prob_train, bcancer_train$diagnosis)
## [1] 0.989693
#out-of-sample AUC
cancer_rf_prob_test <- predict(cancer_rf, newdata = bcancer_test,type = "prob")[,2]
cost_auc(cancer_rf_prob_test, bcancer_test$diagnosis)
## [1] 0.9880048
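As a cross-check that does not depend on ROCR, the AUC can also be computed from ranks (the Mann-Whitney formulation): it is the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. `auc_rank()` below is a hypothetical helper, and the scores and labels are made up:

```r
# Hypothetical helper: rank-based AUC for a binary outcome.
auc_rank <- function(prob, obs, positive) {
  is_pos <- obs == positive
  n_pos  <- sum(is_pos)
  n_neg  <- sum(!is_pos)
  r <- rank(prob)  # average ranks, so tied scores count as half a win
  (sum(r[is_pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Made-up scores: three of the four (malignant, benign) pairs are ordered correctly.
scores <- c(0.10, 0.40, 0.35, 0.80)
labels <- c("B", "B", "M", "M")
auc_rank(scores, labels, positive = "M")
```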
library(e1071)
## Warning: package 'e1071' was built under R version 3.6.3
cancer_svm <- svm(diagnosis ~ ., data = bcancer_train)
# in-sample
pred_svm_train <- predict(cancer_svm)
table(pred_svm_train, bcancer_train$diagnosis)
##
## pred_svm_train B M
## B 275 6
## M 0 145
mean(pred_svm_train != bcancer_train$diagnosis)
## [1] 0.01408451
# out-of-sample
pred_svm_test <- predict(cancer_svm, bcancer_test)
table(pred_svm_test, bcancer_test$diagnosis)
##
## pred_svm_test B M
## B 82 1
## M 0 60
mean(pred_svm_test != bcancer_test$diagnosis)
## [1] 0.006993007
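Beyond the overall error rate, the test-set confusion matrix also gives sensitivity and specificity, which are often more informative for a diagnostic task. A sketch with the counts copied from the output above, treating malignant (M) as the positive class:

```r
# Test-set confusion matrix copied from the output above
# (rows = predicted class, columns = actual class).
conf_test <- matrix(c(82,  1,
                       0, 60),
                    nrow = 2, byrow = TRUE,
                    dimnames = list(predicted = c("B", "M"), actual = c("B", "M")))

error_rate  <- 1 - sum(diag(conf_test)) / sum(conf_test)    # misclassified share
sensitivity <- conf_test["M", "M"] / sum(conf_test[, "M"])  # malignant cases detected
specificity <- conf_test["B", "B"] / sum(conf_test[, "B"])  # benign cases correctly cleared

round(c(error_rate = error_rate,
        sensitivity = sensitivity,
        specificity = specificity), 4)
```

The single test error is a malignant case predicted as benign, so the error falls entirely on sensitivity.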