This report describes the prediction of breast cancer diagnosis using machine learning algorithms. The dataset used in this report is the Breast Cancer Wisconsin dataset hosted on Kaggle.
The dataset can be downloaded [here](https://www.kaggle.com/uciml/breast-cancer-wisconsin-data).
Report Outline
1. Data Extraction
2. Exploratory Data Analysis
3. Data Preparation
4. Modeling
5. Evaluation
The dataset is downloaded from Kaggle and saved in the data folder.
bcw_df <- read.csv("data/data.csv")
To see the number of rows and columns, we use the dim() function. The dataset has 569 rows and 33 columns.
dim(bcw_df)
## [1] 569 33
To find out the column names and types, we use the str() function.
str(bcw_df)
## 'data.frame': 569 obs. of 33 variables:
## $ id : int 842302 842517 84300903 84348301 84358402 843786 844359 84458202 844981 84501001 ...
## $ diagnosis : chr "M" "M" "M" "M" ...
## $ radius_mean : num 18 20.6 19.7 11.4 20.3 ...
## $ texture_mean : num 10.4 17.8 21.2 20.4 14.3 ...
## $ perimeter_mean : num 122.8 132.9 130 77.6 135.1 ...
## $ area_mean : num 1001 1326 1203 386 1297 ...
## $ smoothness_mean : num 0.1184 0.0847 0.1096 0.1425 0.1003 ...
## $ compactness_mean : num 0.2776 0.0786 0.1599 0.2839 0.1328 ...
## $ concavity_mean : num 0.3001 0.0869 0.1974 0.2414 0.198 ...
## $ concave.points_mean : num 0.1471 0.0702 0.1279 0.1052 0.1043 ...
## $ symmetry_mean : num 0.242 0.181 0.207 0.26 0.181 ...
## $ fractal_dimension_mean : num 0.0787 0.0567 0.06 0.0974 0.0588 ...
## $ radius_se : num 1.095 0.543 0.746 0.496 0.757 ...
## $ texture_se : num 0.905 0.734 0.787 1.156 0.781 ...
## $ perimeter_se : num 8.59 3.4 4.58 3.44 5.44 ...
## $ area_se : num 153.4 74.1 94 27.2 94.4 ...
## $ smoothness_se : num 0.0064 0.00522 0.00615 0.00911 0.01149 ...
## $ compactness_se : num 0.049 0.0131 0.0401 0.0746 0.0246 ...
## $ concavity_se : num 0.0537 0.0186 0.0383 0.0566 0.0569 ...
## $ concave.points_se : num 0.0159 0.0134 0.0206 0.0187 0.0188 ...
## $ symmetry_se : num 0.03 0.0139 0.0225 0.0596 0.0176 ...
## $ fractal_dimension_se : num 0.00619 0.00353 0.00457 0.00921 0.00511 ...
## $ radius_worst : num 25.4 25 23.6 14.9 22.5 ...
## $ texture_worst : num 17.3 23.4 25.5 26.5 16.7 ...
## $ perimeter_worst : num 184.6 158.8 152.5 98.9 152.2 ...
## $ area_worst : num 2019 1956 1709 568 1575 ...
## $ smoothness_worst : num 0.162 0.124 0.144 0.21 0.137 ...
## $ compactness_worst : num 0.666 0.187 0.424 0.866 0.205 ...
## $ concavity_worst : num 0.712 0.242 0.45 0.687 0.4 ...
## $ concave.points_worst : num 0.265 0.186 0.243 0.258 0.163 ...
## $ symmetry_worst : num 0.46 0.275 0.361 0.664 0.236 ...
## $ fractal_dimension_worst: num 0.1189 0.089 0.0876 0.173 0.0768 ...
## $ X : logi NA NA NA NA NA NA ...
From the result above, we know the following:
1. The first column is id. It is a unique identifier and carries no information for prediction, so it should be removed.
2. The second column is diagnosis. This is the class variable. Its type is currently character, and it should be converted to a factor.
3. The last column is X. All of its values are NA, so it should be removed.
# remove unnecessary columns
bcw_df$id <- NULL
bcw_df$X <- NULL
# change to factor for a target variable
bcw_df$diagnosis <- as.factor(bcw_df$diagnosis)
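The clean-up steps above can be tried on a small stand-in data frame. The sketch below is only an illustration: `toy_df` and its values are made up, and detecting all-NA columns with `vapply()` is one possible way to generalize the removal of X, not the approach used in this report.

```r
# Hypothetical stand-in for bcw_df: same column roles, made-up values
toy_df <- data.frame(
  id = c(101L, 102L, 103L, 104L),
  diagnosis = c("M", "B", "B", "M"),
  radius_mean = c(18.0, 11.4, 12.1, 20.3),
  X = c(NA, NA, NA, NA),
  stringsAsFactors = FALSE
)

# Flag columns in which every value is NA (the role X plays in the report)
all_na <- vapply(toy_df, function(col) all(is.na(col)), logical(1))

# Keep everything except the identifier and the all-NA columns
keep <- !(names(toy_df) %in% "id") & !all_na
toy_clean <- toy_df[, keep, drop = FALSE]

# Convert the target to a factor with explicit level order: B first, M second
toy_clean$diagnosis <- factor(toy_clean$diagnosis, levels = c("B", "M"))

str(toy_clean)
```

Setting the levels explicitly gives the same order as as.factor() here (alphabetical), but makes it visible that M is the second level. For a binomial glm() with a factor response, the first level is treated as failure and the others as success, so the fitted probabilities refer to M.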
Analysis of a single variable. Number of benign (B) and malignant (M) cases in the dataset.
library(ggplot2)
ggplot(data=bcw_df, aes(x=diagnosis)) + geom_bar()
Distribution of the *radius mean* variable as a boxplot.
ggplot(data=bcw_df, aes(y=radius_mean)) +
  geom_boxplot() +
  labs(title = "Breast Cancer Wisconsin Data", y="Radius Mean")
Distribution of the *radius mean* variable as a histogram.
ggplot(data = bcw_df, aes(x=radius_mean)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# build the three single-variable plots and show them side by side
p1 <- ggplot(data=bcw_df, aes(x=diagnosis)) + geom_bar()
p2 <- ggplot(data=bcw_df, aes(y=radius_mean)) +
  geom_boxplot() +
  labs(title = "Breast Cancer Wisconsin Data", y="Radius Mean")
p3 <- ggplot(data = bcw_df, aes(x=radius_mean)) + geom_histogram()
library(gridExtra)
grid.arrange(p1, p2, p3, ncol = 3)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Analysis of two variables. Distribution of the radius mean variable by diagnosis.
ggplot(data=bcw_df, aes(x=diagnosis, y=radius_mean)) +
geom_boxplot() +
geom_jitter(alpha=0.3,
color = "blue",
width = 0.2) +
labs(title="Breast Cancer Wisconsin Data", x="Diagnosis", y="Radius Mean")
ggplot(data=bcw_df, aes(x=radius_mean, fill=diagnosis)) +
geom_density(alpha=.3)
Observations plotted by radius mean and texture mean. Each point is a single observation. The color and shape of each point are based on the diagnosis (benign or malignant).
ggplot(data=bcw_df, aes(x=radius_mean, y=texture_mean,
shape=diagnosis, color=diagnosis)) +
geom_point() +
labs(title="Breast Cancer Wisconsin Data", x="Radius Mean", y="Texture Mean")
In general, benign cases have lower radius mean and texture mean measurements than malignant cases. However, these two variables are not enough to separate the classes.
There are three types of measurement: mean, standard error (se), and worst (the mean of the three largest values). Each type has 10 variables, so there are 30 variables in total. We want to compute and visualize the correlation coefficients within each type.
Visualize Pearson’s Correlation Coefficient for *_mean variables.
# install.packages("corrgram")
library(corrgram)
corrgram(bcw_df[2:11], order = TRUE,
upper.panel = panel.pie)
Visualize Pearson’s Correlation Coefficient for *_se variables.
library(corrgram)
corrgram(bcw_df[12:21], order = TRUE,
upper.panel = panel.pie)
Visualize Pearson’s Correlation Coefficient for *_worst variables.
library(corrgram)
corrgram(bcw_df[22:31], order = TRUE,
upper.panel = panel.pie)
From the correlation coefficients, we can see that area, radius, and perimeter are collinear. So, we need to remove two of them: area and perimeter.
We can also see that compactness, concavity, and concave.points are collinear, so we keep only concave.points.
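The collinearity of radius, perimeter, and area is expected from geometry: for a roughly circular shape, perimeter grows linearly and area quadratically with radius. The sketch below illustrates this on made-up data and lists the highly correlated pairs numerically. The data, the 0.9 cut-off, and the names `toy` and `high_pairs` are assumptions for illustration and are not taken from the report.

```r
# Hypothetical radii; perimeter and area follow from circle geometry
set.seed(1)
radius <- runif(200, min = 6, max = 28)
toy <- data.frame(
  radius = radius,
  perimeter = 2 * pi * radius,
  area = pi * radius^2,
  texture = rnorm(200, mean = 19, sd = 4)  # independent of radius by construction
)

cor_mat <- cor(toy)  # Pearson correlation is the default method

# List variable pairs whose absolute correlation exceeds a cut-off
cutoff <- 0.9  # assumed threshold, not taken from the report
high <- which(abs(cor_mat) > cutoff & upper.tri(cor_mat), arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(cor_mat)[high[, "row"]],
  var2 = colnames(cor_mat)[high[, "col"]],
  r = round(cor_mat[high], 3)
)
print(high_pairs)
```

Using `upper.tri()` reports each pair once and skips the diagonal. On the real data, the same steps could be applied to bcw_df[2:11] in place of `toy`.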
Remove the *_worst variables. Based on discussion with a domain expert, all variables ending in worst should be removed.
bcw_df2 <- bcw_df[1:21]
Remove the area, perimeter, compactness, and concavity variables.
bcw_df2$area_mean <- NULL
bcw_df2$perimeter_mean <- NULL
bcw_df2$compactness_mean <- NULL
bcw_df2$concavity_mean <- NULL
bcw_df2$area_se <- NULL
bcw_df2$perimeter_se <- NULL
bcw_df2$compactness_se <- NULL
bcw_df2$concavity_se <- NULL
dim(bcw_df2)
## [1] 569 13
Use set.seed() for a reproducible result. The train-to-test ratio is 70:30.
m <- nrow(bcw_df2)
set.seed(2021)
train_idx <- sample(m, 0.7* m)
train_df <- bcw_df2[train_idx, ]
test_df <- bcw_df2[-train_idx, ]
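As a quick check of the split sizes, the arithmetic can be reproduced without the data. The sketch below assumes only the row count of 569 reported earlier. The names `n_train`, `n_test`, and `test_idx` are illustrative, and `floor()` is written out explicitly because 0.7 * 569 is not a whole number.

```r
m <- 569                     # number of rows reported by dim()
n_train <- floor(0.7 * m)    # 0.7 * 569 = 398.3, so 398 training rows
n_test <- m - n_train        # remaining rows go to the test set

set.seed(2021)
train_idx <- sample(m, n_train)
test_idx <- setdiff(seq_len(m), train_idx)

# Sizes of the two partitions
c(train = n_train, test = n_test)
```

These sizes agree with the rest of the report: the logistic regression output shows 397 null-deviance degrees of freedom (398 observations minus one), and each test confusion matrix sums to 171.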
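We use four machine learning algorithms: logistic regression, a conditional inference tree, a random forest, and a support vector machine.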
fit.logit <- glm(diagnosis ~. ,
data = train_df,
family = binomial)
summary(fit.logit)
##
## Call:
## glm(formula = diagnosis ~ ., family = binomial, data = train_df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.29130 -0.06276 -0.01079 0.00266 2.41194
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -38.2941 10.8434 -3.532 0.000413 ***
## radius_mean 0.9770 0.3246 3.009 0.002618 **
## texture_mean 0.5801 0.1110 5.226 1.73e-07 ***
## smoothness_mean -0.2969 51.6663 -0.006 0.995414
## concave.points_mean 134.7234 38.6187 3.489 0.000486 ***
## symmetry_mean 42.5947 20.3845 2.090 0.036657 *
## fractal_dimension_mean 22.7464 105.3806 0.216 0.829105
## radius_se 5.0455 2.9014 1.739 0.082037 .
## texture_se -1.5634 0.9470 -1.651 0.098737 .
## smoothness_se 72.5956 159.1434 0.456 0.648272
## concave.points_se -160.9316 122.3363 -1.315 0.188347
## symmetry_se -27.4352 60.9517 -0.450 0.652628
## fractal_dimension_se -351.4736 288.8728 -1.217 0.223716
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 516.229 on 397 degrees of freedom
## Residual deviance: 75.269 on 385 degrees of freedom
## AIC: 101.27
##
## Number of Fisher Scoring iterations: 9
library(party)
fit.ctree <- ctree(diagnosis~., data=train_df)
plot(fit.ctree, main="Conditional Inference Tree")
library(randomForest)
set.seed(2021)
fit.forest <- randomForest(diagnosis~., data=train_df,
na.action=na.roughfix,
importance=TRUE)
fit.forest
##
## Call:
## randomForest(formula = diagnosis ~ ., data = train_df, importance = TRUE, na.action = na.roughfix)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 5.28%
## Confusion matrix:
## B M class.error
## B 249 9 0.03488372
## M 12 128 0.08571429
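The out-of-bag (OOB) figures in this output follow directly from the confusion matrix counts. The sketch below redoes the arithmetic with the counts copied from the output. The names `oob`, `class_error`, and `oob_error` are illustrative.

```r
# OOB confusion matrix counts copied from the randomForest output
# (rows are actual classes, columns are predicted classes)
oob <- matrix(c(249, 12, 9, 128), nrow = 2,
              dimnames = list(Actual = c("B", "M"), Predicted = c("B", "M")))

# Per-class error: misclassified cases in a row divided by the row total
class_error <- 1 - diag(oob) / rowSums(oob)

# Overall OOB error: all off-diagonal cases divided by the grand total
oob_error <- 1 - sum(diag(oob)) / sum(oob)

round(class_error, 4)
round(100 * oob_error, 2)
```

The grand total of 398 is the number of training rows, and 21 misclassified cases out of 398 gives the reported 5.28%.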
library(e1071)
set.seed(2021)
fit.svm <- svm(diagnosis~., data=train_df)
fit.svm
##
## Call:
## svm(formula = diagnosis ~ ., data = train_df)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
##
## Number of Support Vectors: 119
From each confusion matrix we compute sensitivity (recall), specificity, positive predictive value (precision), negative predictive value, and accuracy.
performance <- function(table, n = 2) {
  # rows are actual classes, columns are predicted classes
  tn <- table[1, 1]
  fp <- table[1, 2]
  fn <- table[2, 1]
  tp <- table[2, 2]
  sensitivity <- tp / (tp + fn) # recall
  specificity <- tn / (tn + fp)
  ppp <- tp / (tp + fp) # precision
  npp <- tn / (tn + fn)
  hitrate <- (tp + tn) / (tp + tn + fp + fn) # accuracy
  result <- paste("Sensitivity = ", round(sensitivity, n),
                  "\nSpecificity = ", round(specificity, n),
                  "\nPositive Predictive Value = ", round(ppp, n),
                  "\nNegative Predictive Value = ", round(npp, n),
                  "\nAccuracy = ", round(hitrate, n), "\n", sep = "")
  cat(result)
}
prob <- predict(fit.logit,test_df, type="response")
logit.pred <- factor(prob > .5, levels=c(FALSE, TRUE),
labels=c("benign", "malignant"))
logit.perf <- table(test_df$diagnosis, logit.pred,
dnn=c("Actual","Predicted"))
logit.perf
## Predicted
## Actual benign malignant
## B 95 4
## M 7 65
performance(logit.perf)
## Sensitivity = 0.9
## Specificity = 0.96
## Positive Predictive Value = 0.94
## Negative Predictive Value = 0.93
## Accuracy = 0.94
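The performance() helper does not print an F1 score, but it can be derived from the same table as the harmonic mean of precision and recall. The sketch below rebuilds the logistic regression confusion matrix from the counts printed above. `f1_score` and `logit_counts` are illustrative names, not part of the report's code.

```r
# Counts copied from the logistic regression confusion matrix
# (rows are actual classes, columns are predicted classes)
logit_counts <- matrix(c(95, 7, 4, 65), nrow = 2,
                       dimnames = list(Actual = c("B", "M"),
                                       Predicted = c("benign", "malignant")))

# F1 for the positive (malignant) class
f1_score <- function(tab) {
  tp <- tab[2, 2]
  fp <- tab[1, 2]
  fn <- tab[2, 1]
  precision <- tp / (tp + fp)
  recall <- tp / (tp + fn)
  2 * precision * recall / (precision + recall)
}

f1 <- f1_score(logit_counts)
round(f1, 2)
```

The same value can be written as 2 * tp / (2 * tp + fp + fn), which here is 130 / 141.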
ctree.pred <- predict(fit.ctree, test_df, type = "response")
ctree.perf <- table(test_df$diagnosis, ctree.pred,
dnn=c("Actual","Predicted"))
ctree.perf
## Predicted
## Actual B M
## B 92 7
## M 7 65
forest.pred <- predict(fit.forest, test_df, type = "response")
forest.perf <- table(test_df$diagnosis, forest.pred,
dnn=c("Actual", "Predicted"))
forest.perf
## Predicted
## Actual B M
## B 97 2
## M 7 65
svm.pred <- predict(fit.svm, test_df, type = "response")
svm.perf <- table(test_df$diagnosis, svm.pred,
dnn=c("Actual", "Predicted"))
svm.perf
## Predicted
## Actual B M
## B 96 3
## M 8 64
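To compare the four models side by side, the same metrics can be computed for every confusion matrix. The sketch below rebuilds the four tables from the counts printed above, so it runs without the fitted models. `make_tab`, `tabs`, and `summary_df` are illustrative names.

```r
# Build a 2x2 table from counts given row by row (actual B first, then actual M)
make_tab <- function(tn, fp, fn, tp) {
  matrix(c(tn, fp, fn, tp), nrow = 2, byrow = TRUE,
         dimnames = list(Actual = c("B", "M"), Predicted = c("B", "M")))
}

# Counts copied from the four test-set confusion matrices
tabs <- list(
  logit  = make_tab(95, 4, 7, 65),
  ctree  = make_tab(92, 7, 7, 65),
  forest = make_tab(97, 2, 7, 65),
  svm    = make_tab(96, 3, 8, 64)
)

# One row of metrics per model
summary_df <- do.call(rbind, lapply(names(tabs), function(nm) {
  tab <- tabs[[nm]]
  data.frame(model = nm,
             sensitivity = tab[2, 2] / sum(tab[2, ]),
             specificity = tab[1, 1] / sum(tab[1, ]),
             accuracy = sum(diag(tab)) / sum(tab))
}))

print(summary_df, digits = 3)
```

On this single 70:30 split, the random forest has the highest accuracy (162 of 171 test cases), but the differences between models are only a few cases, so the ranking could change with a different seed.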