Breast cancer is the second most common cancer and the second leading cause of cancer death among women in the United States. According to the American Cancer Society, about 1 in 8 women in the United States will develop breast cancer in her lifetime, and about 2.6% will die from it. One warning sign of breast cancer is the development of a tumor in the breast. A tumor, however, can be either benign or malignant.
This project aims to predict whether an individual has breast cancer and to determine which cytological attributes are significant in distinguishing benign from malignant tumors. To achieve this, I applied four machine learning classification models, namely Logistic Regression, Decision Tree, Random Forest, and Gradient Boosting Machine, to a dataset obtained from the UCI Machine Learning Repository. The dataset was created by Dr. William H. Wolberg at the University of Wisconsin, who digitized images of fine-needle aspirates taken from patients with solid breast masses. He then used a graphical computer program called Xcyt to assess nine cytological characteristics in each digitized image. The attributes, including the sample ID and the class label, are as follows:
| # | Attribute | Domain |
|---|---|---|
| 1 | Sample Code Number | ID number |
| 2 | Clump Thickness | 1 - 10 |
| 3 | Uniformity of Cell Size | 1 - 10 |
| 4 | Uniformity of Cell Shape | 1 - 10 |
| 5 | Marginal Adhesion | 1 - 10 |
| 6 | Single Epithelial Cell Size | 1 - 10 |
| 7 | Bare Nuclei | 1 - 10 |
| 8 | Bland Chromatin | 1 - 10 |
| 9 | Normal Nucleoli | 1 - 10 |
| 10 | Mitoses | 1 - 10 |
| 11 | Class | 2 = benign, 4 = malignant |
library(tidyverse)
library(dplyr)
library(car)
library(corrplot)
library(pROC)
library(MLmetrics)
library(rpart)
library(rpart.plot)
library(randomForest)
library(varImp)
library(gbm)
library(caret)
library(knitr)       # kable()
library(kableExtra)  # kable_styling()
url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data"
data <- read.csv(file = url, header = FALSE,
                 col.names = c("ID", "clump_thickness", "uniformity_size",
                               "uniformity_shape", "marginal_adhesion",
                               "single_epithelial_cell_size", "bare_nuclei",
                               "bland_chromatin", "normal_nucleoli",
                               "mitoses", "diagnosis"))
str(data)
## 'data.frame': 699 obs. of 11 variables:
## $ ID : int 1000025 1002945 1015425 1016277 1017023 1017122 1018099 1018561 1033078 1033078 ...
## $ clump_thickness : int 5 5 3 6 4 8 1 2 2 4 ...
## $ uniformity_size : int 1 4 1 8 1 10 1 1 1 2 ...
## $ uniformity_shape : int 1 4 1 8 1 10 1 2 1 1 ...
## $ marginal_adhesion : int 1 5 1 1 3 8 1 1 1 1 ...
## $ single_epithelial_cell_size: int 2 7 2 3 2 7 2 2 2 2 ...
## $ bare_nuclei : Factor w/ 11 levels "?","1","10","2",..: 2 3 4 6 2 3 3 2 2 2 ...
## $ bland_chromatin : int 3 3 3 3 3 9 3 3 1 2 ...
## $ normal_nucleoli : int 1 2 1 7 1 7 1 1 1 1 ...
## $ mitoses : int 1 1 1 1 1 1 1 1 5 1 ...
## $ diagnosis : int 2 2 2 2 2 4 2 2 2 2 ...
sum(data$bare_nuclei == "?")
## [1] 16
The dataset includes cytological characteristics of fine-needle aspirate samples from 699 patients. The first column consists of unique identifiers that won’t help our models, so we’ll drop it first.
data <- select(data, -1)
We’ll also exclude the 16 observations that have missing values ("?") in the bare_nuclei column and convert that column back to integers.
data <- data[data$bare_nuclei != "?", ] %>%
  mutate(bare_nuclei = as.integer(as.character(bare_nuclei)))
The dependent variable diagnosis is currently coded as 2 for “benign” and 4 for “malignant”. We’ll recode it as a binary factor, with 0 for benign and 1 for malignant.
data <- data %>% mutate(diagnosis = ifelse(diagnosis == 2, 0, 1),
diagnosis = as.factor(diagnosis))
summary(data)
## clump_thickness uniformity_size uniformity_shape marginal_adhesion
## Min. : 1.000 Min. : 1.000 Min. : 1.000 Min. : 1.00
## 1st Qu.: 2.000 1st Qu.: 1.000 1st Qu.: 1.000 1st Qu.: 1.00
## Median : 4.000 Median : 1.000 Median : 1.000 Median : 1.00
## Mean : 4.442 Mean : 3.151 Mean : 3.215 Mean : 2.83
## 3rd Qu.: 6.000 3rd Qu.: 5.000 3rd Qu.: 5.000 3rd Qu.: 4.00
## Max. :10.000 Max. :10.000 Max. :10.000 Max. :10.00
## single_epithelial_cell_size bare_nuclei bland_chromatin normal_nucleoli
## Min. : 1.000 Min. : 1.000 Min. : 1.000 Min. : 1.00
## 1st Qu.: 2.000 1st Qu.: 1.000 1st Qu.: 2.000 1st Qu.: 1.00
## Median : 2.000 Median : 1.000 Median : 3.000 Median : 1.00
## Mean : 3.234 Mean : 3.545 Mean : 3.445 Mean : 2.87
## 3rd Qu.: 4.000 3rd Qu.: 6.000 3rd Qu.: 5.000 3rd Qu.: 4.00
## Max. :10.000 Max. :10.000 Max. :10.000 Max. :10.00
## mitoses diagnosis
## Min. : 1.000 0:444
## 1st Qu.: 1.000 1:239
## Median : 1.000
## Mean : 1.603
## 3rd Qu.: 1.000
## Max. :10.000
ggplot(data, aes(x = diagnosis)) +
geom_bar(fill = "#fc9272") +
ggtitle("Distribution of Diagnosis in the Entire Dataset") +
theme_minimal() +
theme(legend.position = "none")
After data cleaning, we now have 683 valid observations, of which 444 have a benign breast tumor and the remaining 239 have a malignant one.
correlation <- cor(data[,-10])
corrplot(correlation, type = "lower", col = c("#fcbba1", "#b2d2e8"), addCoef.col = "black", tl.col = "black")
Looking at the correlations between the independent variables, we can see that uniformity_size and uniformity_shape are highly correlated, which might be an indication of multicollinearity.
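As a quick check, we can pull out that single pairwise correlation directly; this one-liner on the cleaned data frame should print the same coefficient shown in the plot.
# Pairwise correlation between the two uniformity measures
cor(data$uniformity_size, data$uniformity_shape)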
In order to develop an accurate binary classification model, we first split our dataset randomly into a training and a test set.
set.seed(3011)
train_index <- sample(nrow(data), size = round(0.75 * nrow(data)), replace = FALSE)
train <- data[train_index,]
test <- data[-train_index,]
# Develop and tune the logistic regression model on the training dataset
lm <- glm(formula = diagnosis ~ ., data = train, family = binomial())
summary(lm)
##
## Call:
## glm(formula = diagnosis ~ ., family = binomial(), data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.78585 -0.09901 -0.05573 0.01655 2.67351
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -10.9307 1.4908 -7.332 2.27e-13 ***
## clump_thickness 0.5503 0.1717 3.205 0.00135 **
## uniformity_size -0.3486 0.2811 -1.240 0.21488
## uniformity_shape 0.4196 0.3033 1.384 0.16647
## marginal_adhesion 0.3406 0.1492 2.283 0.02240 *
## single_epithelial_cell_size 0.2727 0.2099 1.299 0.19390
## bare_nuclei 0.3466 0.1086 3.191 0.00142 **
## bland_chromatin 0.5763 0.1978 2.913 0.00357 **
## normal_nucleoli 0.3801 0.1562 2.434 0.01494 *
## mitoses 0.5530 0.3774 1.465 0.14283
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 648.031 on 511 degrees of freedom
## Residual deviance: 68.586 on 502 degrees of freedom
## AIC: 88.586
##
## Number of Fisher Scoring iterations: 8
vif(lm)
## clump_thickness uniformity_size
## 1.225315 4.265839
## uniformity_shape marginal_adhesion
## 4.023195 1.196004
## single_epithelial_cell_size bare_nuclei
## 1.409701 1.148769
## bland_chromatin normal_nucleoli
## 1.326350 1.158661
## mitoses
## 1.116602
As predicted, both uniformity_size and uniformity_shape have a notably higher VIF than the other variables because of their high correlation (a variable’s VIF measures how much the variance of its coefficient estimate is inflated by correlation with the other predictors). However, as a rule of thumb, a VIF below 5 is still acceptable, so we don’t have to exclude these variables from the model. Still, we can try simplifying the model by dropping some of the predictors that were not statistically significant in the full model:
lm2 <- glm(formula = diagnosis ~ .,
           data = train %>% select(-c(uniformity_size, single_epithelial_cell_size, mitoses)),
           family = binomial())
summary(lm2)
##
## Call:
## glm(formula = diagnosis ~ ., family = binomial(), data = train %>%
## select(-c(uniformity_size, single_epithelial_cell_size, mitoses)))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.58047 -0.10760 -0.05289 0.01724 2.48016
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -10.1772 1.3467 -7.557 4.12e-14 ***
## clump_thickness 0.6471 0.1602 4.039 5.36e-05 ***
## uniformity_shape 0.2518 0.1864 1.351 0.17661
## marginal_adhesion 0.3635 0.1410 2.578 0.00993 **
## bare_nuclei 0.3534 0.1086 3.256 0.00113 **
## bland_chromatin 0.5391 0.1877 2.871 0.00409 **
## normal_nucleoli 0.3725 0.1408 2.645 0.00817 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 648.031 on 511 degrees of freedom
## Residual deviance: 74.553 on 505 degrees of freedom
## AIC: 88.553
##
## Number of Fisher Scoring iterations: 8
We also fit a third candidate that drops bare_nuclei as well; in this model, every remaining predictor is statistically significant.
lm3 <- glm(formula = diagnosis ~ .,
           data = train %>% select(-c(uniformity_size, single_epithelial_cell_size,
                                      bare_nuclei, mitoses)),
           family = binomial())
summary(lm3)
##
## Call:
## glm(formula = diagnosis ~ ., family = binomial(), data = train %>%
## select(-c(uniformity_size, single_epithelial_cell_size, bare_nuclei,
## mitoses)))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.51441 -0.11798 -0.06096 0.01500 2.88791
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -10.1932 1.2372 -8.239 < 2e-16 ***
## clump_thickness 0.6364 0.1470 4.329 1.5e-05 ***
## uniformity_shape 0.5146 0.1730 2.975 0.002934 **
## marginal_adhesion 0.4052 0.1283 3.159 0.001582 **
## bland_chromatin 0.6861 0.1765 3.888 0.000101 ***
## normal_nucleoli 0.3906 0.1394 2.802 0.005081 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 648.03 on 511 degrees of freedom
## Residual deviance: 86.48 on 506 degrees of freedom
## AIC: 98.48
##
## Number of Fisher Scoring iterations: 8
We’ll take lm3 as our final logistic regression model and first evaluate it on the training set.
pred_train_lm <- predict(lm3, train, type = 'response')
AUC_train_lm <- roc(train$diagnosis, pred_train_lm, percent = TRUE, plot = TRUE, print.auc = TRUE)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
hist(pred_train_lm,
main = "Distribution of Predicted Values",
xlab = "Predicted Values",
col = "#6baed6",
border = NULL)
The histogram is positively skewed, which is understandable because the original dataset contains almost twice as many benign tumors as malignant ones. With that said, if we use the usual threshold of 0.5 to convert the predicted probabilities into binary classes, we may not get the best-performing classifier. Therefore, we loop over candidate thresholds to find the one that maximizes our classification performance metrics.
accuracy <- 0
f1 <- 0
threshold <- 0
for(i in seq(0.1, 0.9, by = 0.01)){
  pred_cat_train <- ifelse(pred_train_lm < i, 0, 1)
  a <- Accuracy(y_true = train$diagnosis, y_pred = pred_cat_train)
  b <- F1_Score(y_true = train$diagnosis, y_pred = pred_cat_train)
  # Keep threshold i only if it improves both accuracy and F1
  if(a > accuracy & b > f1){
    accuracy <- a
    f1 <- b
    threshold <- i
  }
}
accuracy
## [1] 0.9726562
f1
## [1] 0.9796512
threshold
## [1] 0.48
Now let’s apply the final logistic regression model to our test dataset and calculate different performance measures using the optimized threshold of 0.48.
pred_test_lm <- predict(lm3, test, type = 'response')
AUC_test_lm <- roc(test$diagnosis, pred_test_lm, percent = TRUE, plot = TRUE, print.auc = TRUE)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
pred_cat_test <- ifelse(pred_test_lm >= 0.48, 1, 0)
Accuracy(y_true = test$diagnosis, y_pred = pred_cat_test)
## [1] 0.9590643
F1_Score(y_true = test$diagnosis, y_pred = pred_cat_test)
## [1] 0.964467
ConfusionMatrix(y_true = test$diagnosis, y_pred = pred_cat_test)
## y_pred
## y_true 0 1
## 0 95 5
## 1 2 69
We will tune the decision tree through a random search over minsplit (the minimum number of observations a node must contain for a split to be attempted) and maxdepth (the maximum depth of the tree) in order to find a good combination of hyperparameters.
AUC_train_besttree <- 0
AUC_test_besttree <- 0
AUC_tree <- data.frame(AUC_train_tree = numeric(), AUC_test_tree = numeric())
set.seed(3011)
tree_parameters <- data.frame(minsplit_para = floor(runif(8, 10, 60)),
                              maxdepth_para = floor(runif(8, 10, 30)))
for(para_comb in 1:nrow(tree_parameters)){
  decision_tree <- rpart(diagnosis ~ ., data = train,
                         control = rpart.control(minsplit = tree_parameters[para_comb, "minsplit_para"],
                                                 maxdepth = tree_parameters[para_comb, "maxdepth_para"]))
  # AUC on the training set
  pred_train_tree <- as.data.frame(predict(decision_tree, train, type = 'prob'))
  AUC_train_tree <- roc(train$diagnosis, pred_train_tree$`1`, percent = TRUE, plot = TRUE)
  # AUC on the test set
  pred_test_tree <- as.data.frame(predict(decision_tree, test, type = 'prob'))
  AUC_test_tree <- roc(test$diagnosis, pred_test_tree$`1`, percent = TRUE, plot = TRUE)
  AUC_tree[para_comb, ] <- c(round(AUC_train_tree$auc, 2), round(AUC_test_tree$auc, 2))
  # Track the best AUCs seen so far
  AUC_train_besttree <- max(AUC_train_besttree, AUC_train_tree$auc)
  AUC_test_besttree <- max(AUC_test_besttree, AUC_test_tree$auc)
}
| minsplit_para | maxdepth_para | AUC_train_tree | AUC_test_tree |
|---|---|---|---|
| 22 | 12 | 95.52 | 93.97 |
| 14 | 24 | 96.65 | 95.87 |
| 10 | 18 | 97.14 | 96.44 |
| 15 | 14 | 96.65 | 95.87 |
| 29 | 15 | 95.52 | 93.97 |
| 42 | 25 | 95.52 | 93.97 |
| 59 | 28 | 96.51 | 94.51 |
| 11 | 10 | 97.57 | 95.04 |
According to our search, the decision tree with a minsplit of 11 and a maxdepth of 10 achieves the highest training AUC, so we refit it as our best tree.
best_decision_tree <- rpart(diagnosis ~., data = train,
control = rpart.control(minsplit = 11,
maxdepth = 10))
rpart.plot(x = best_decision_tree, box.palette="RdBu", shadow.col="gray", nn=TRUE, yesno = 2)
Similarly, we will tune the random forest through a random search. The hyperparameters sampled are nodesize (the minimum number of observations in the terminal nodes), sampsize (the number of observations drawn to grow each tree), mtry (the number of variables randomly sampled as split candidates at each node), and ntree (the number of decision trees that constitute the forest).
AUC_train_bestrf <- 0
AUC_test_bestrf <- 0
AUC_rf <- data.frame(AUC_train_rf = numeric(), AUC_test_rf = numeric())
set.seed(160)
rf_parameters <- data.frame(nodesize = round(runif(10, 5, 20)),
                            sampsize = round(runif(10, 1, 400)),
                            mtry = round(runif(10, 1, 10)),
                            ntree = round(runif(10, 1, 400)))
for(paracomb_rf in 1:nrow(rf_parameters)){
  random_forest <- randomForest(diagnosis ~ ., data = train,
                                nodesize = rf_parameters[paracomb_rf, "nodesize"],
                                sampsize = rf_parameters[paracomb_rf, "sampsize"],
                                mtry = rf_parameters[paracomb_rf, "mtry"],
                                ntree = rf_parameters[paracomb_rf, "ntree"])
  # AUC on the training set
  pred_train_rf <- as.data.frame(predict(random_forest, train, type = 'prob'))
  AUC_train_rf <- roc(train$diagnosis, pred_train_rf$`1`, percent = TRUE, plot = TRUE)
  # AUC on the test set
  pred_test_rf <- as.data.frame(predict(random_forest, test, type = 'prob'))
  AUC_test_rf <- roc(test$diagnosis, pred_test_rf$`1`, percent = TRUE, plot = TRUE)
  AUC_rf[paracomb_rf, ] <- c(round(AUC_train_rf$auc, 2), round(AUC_test_rf$auc, 2))
  # Track the best AUCs seen so far
  AUC_train_bestrf <- max(AUC_train_bestrf, AUC_train_rf$auc)
  AUC_test_bestrf <- max(AUC_test_bestrf, AUC_test_rf$auc)
}
| nodesize | sampsize | mtry | ntree | AUC_train_rf | AUC_test_rf |
|---|---|---|---|---|---|
| 16 | 398 | 7 | 205 | 99.66 | 98.66 |
| 15 | 366 | 6 | 358 | 99.66 | 98.43 |
| 8 | 32 | 9 | 339 | 99.36 | 98.75 |
| 13 | 305 | 4 | 8 | 99.05 | 97.52 |
| 17 | 50 | 5 | 188 | 99.48 | 98.81 |
| 9 | 293 | 7 | 189 | 99.80 | 98.77 |
| 15 | 37 | 9 | 172 | 99.42 | 99.09 |
| 19 | 303 | 3 | 222 | 99.56 | 98.58 |
| 9 | 329 | 7 | 210 | 99.83 | 98.87 |
| 11 | 196 | 6 | 118 | 99.60 | 98.68 |
According to the search, the random forest with a nodesize of 9, sampsize of 329, mtry of 7, and ntree of 210 achieves the highest training AUC, so we refit it as our best model.
best_random_forest <- randomForest(diagnosis ~ ., data = train,
nodesize = 9,
sampsize = 329,
mtry = 7,
ntree = 210)
best_random_forest
##
## Call:
## randomForest(formula = diagnosis ~ ., data = train, nodesize = 9, sampsize = 329, mtry = 7, ntree = 210)
## Type of random forest: classification
## Number of trees: 210
## No. of variables tried at each split: 7
##
## OOB estimate of error rate: 3.12%
## Confusion matrix:
## 0 1 class.error
## 0 334 10 0.02906977
## 1 6 162 0.03571429
# Identify the most significant independent variables
varImpPlot(best_random_forest)
Lastly, we will tune the gradient boosting model through a random search. The hyperparameters sampled are n.trees (the number of decision trees), shrinkage (the learning rate), bag.fraction (the fraction of the training set sampled to grow each tree), and n.minobsinnode (the minimum number of observations in the terminal nodes); interaction.depth (the depth of each tree) is held fixed at 3.
AUC_train_bestgb <- 0
AUC_test_bestgb <- 0
AUC_gb <- data.frame(AUC_train_gb = numeric(), AUC_test_gb = numeric())
set.seed(3011)
gb_parameters <- data.frame(sample_size = round(runif(10, 0.5, 1), 2),
                            min_size = round(runif(10, 5, 20)),
                            num_tree = round(runif(10, 20, 200)),
                            shrink = round(runif(10, 0.1, 0.5), 2))
for(paracomb_gb in 1:nrow(gb_parameters)){
  gradient_boosting <- gbm(diagnosis ~ ., data = train,
                           distribution = "bernoulli",
                           n.trees = gb_parameters[paracomb_gb, 'num_tree'],
                           shrinkage = gb_parameters[paracomb_gb, 'shrink'],
                           interaction.depth = 3,
                           bag.fraction = gb_parameters[paracomb_gb, 'sample_size'],
                           n.minobsinnode = gb_parameters[paracomb_gb, 'min_size'],
                           verbose = TRUE)
  # AUC on the training set
  pred_train_gb <- predict(gradient_boosting, train, type = "response", n.trees = gb_parameters[paracomb_gb, 'num_tree'])
  AUC_train_gb <- roc(train$diagnosis, pred_train_gb, percent = TRUE, plot = TRUE)
  # AUC on the test set
  pred_test_gb <- predict(gradient_boosting, test, type = "response", n.trees = gb_parameters[paracomb_gb, 'num_tree'])
  AUC_test_gb <- roc(test$diagnosis, pred_test_gb, percent = TRUE, plot = TRUE)
  AUC_gb[paracomb_gb, ] <- c(round(AUC_train_gb$auc, 2), round(AUC_test_gb$auc, 2))
  # Track the best AUCs seen so far
  AUC_train_bestgb <- max(AUC_train_bestgb, AUC_train_gb$auc)
  AUC_test_bestgb <- max(AUC_test_bestgb, AUC_test_gb$auc)
}
| sample_size | min_size | num_tree | shrink | AUC_train_gb | AUC_test_gb |
|---|---|---|---|---|---|
| 0.63 | 11 | 87 | 0.24 | 100 | 98.73 |
| 0.54 | 8 | 154 | 0.17 | 100 | 98.89 |
| 0.51 | 9 | 195 | 0.29 | 100 | 98.66 |
| 0.55 | 17 | 171 | 0.43 | 100 | 98.75 |
| 0.70 | 19 | 103 | 0.47 | 100 | 98.58 |
| 0.83 | 5 | 148 | 0.14 | 100 | 99.08 |
| 0.99 | 17 | 199 | 0.39 | 100 | 98.87 |
| 0.51 | 9 | 150 | 0.36 | 100 | 98.46 |
| 0.55 | 6 | 159 | 0.42 | 100 | 98.94 |
| 0.86 | 17 | 192 | 0.45 | 100 | 98.94 |
According to our search, the best gradient boosting model would be as follows:
best_gradient_boosting <- gbm(diagnosis ~ ., data = train,
distribution = "bernoulli",
n.trees = 159,
shrinkage = 0.42,
interaction.depth = 3,
bag.fraction = 0.55,
n.minobsinnode = 6,
verbose = TRUE)
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 0.7387 nan 0.4200 0.2509
## 2 0.5173 nan 0.4200 0.0913
## 3 0.3859 nan 0.4200 0.0658
## 4 0.2842 nan 0.4200 0.0462
## 5 0.2249 nan 0.4200 0.0230
## 6 0.1973 nan 0.4200 0.0032
## 7 0.1627 nan 0.4200 0.0110
## 8 0.1470 nan 0.4200 0.0016
## 9 0.1394 nan 0.4200 -0.0030
## 10 0.1208 nan 0.4200 0.0009
## 20 0.0558 nan 0.4200 -0.0045
## 40 0.0202 nan 0.4200 -0.0019
## 60 0.0088 nan 0.4200 -0.0002
## 80 0.0047 nan 0.4200 -0.0004
## 100 0.0028 nan 0.4200 -0.0001
## 120 0.0014 nan 0.4200 -0.0000
## 140 0.0006 nan 0.4200 0.0000
## 159 0.0004 nan 0.4200 -0.0000
Now we will evaluate all four classification models by comparing their areas under the ROC curve (AUC) on the training set, the test set, and the entire dataset.
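The AUC_values object used below isn’t assembled in the code shown above; here is a minimal sketch of how it could be built, reusing the best-AUC trackers from the tuning loops and recomputing each best model’s AUC on the full cleaned dataset (the helper auc_on and the exact rounding are assumptions, not the original code).
# Hypothetical helper: numeric AUC (in percent) for a set of predictions
auc_on <- function(pred, truth) round(as.numeric(roc(truth, pred, percent = TRUE)$auc), 2)
# Each best model's predicted probabilities on the full dataset
pred_data_lm   <- predict(lm3, data, type = "response")
pred_data_tree <- predict(best_decision_tree, data, type = "prob")[, "1"]
pred_data_rf   <- predict(best_random_forest, data, type = "prob")[, "1"]
pred_data_gb   <- predict(best_gradient_boosting, data, type = "response", n.trees = 159)
AUC_values <- data.frame(
  Model = c("Logistic Regression", "Decision Tree",
            "Random Forest", "Gradient Boosting Machine"),
  AUC_train = round(c(AUC_train_lm$auc, AUC_train_besttree,
                      AUC_train_bestrf, AUC_train_bestgb), 2),
  AUC_test = round(c(AUC_test_lm$auc, AUC_test_besttree,
                     AUC_test_bestrf, AUC_test_bestgb), 2),
  AUC_data = c(auc_on(pred_data_lm, data$diagnosis),
               auc_on(pred_data_tree, data$diagnosis),
               auc_on(pred_data_rf, data$diagnosis),
               auc_on(pred_data_gb, data$diagnosis))
)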
kable(AUC_values) %>% kable_styling(bootstrap_options = c("hover", "condensed", "responsive"), full_width = F)
| Model | AUC_train | AUC_test | AUC_data |
|---|---|---|---|
| Logistic Regression | 99.53 | 99.06 | 99.43 |
| Decision Tree | 97.57 | 96.44 | 96.93 |
| Random Forest | 99.83 | 99.09 | 99.51 |
| Gradient Boosting Machine | 100.00 | 99.08 | 99.75 |
All four models perform very well, probably owing to the stark differences in cytological features between benign and malignant tumors. In this case, I’d choose the gradient boosting model for its exceptionally high and consistent performance across all three datasets. Let’s take a closer look at other performance measures of this prediction model.
pred_test_bestgb <- predict(best_gradient_boosting, test, type = "response", n.trees = 159)
pred_cat_test_gb <- as.factor(ifelse(pred_test_bestgb < 0.5, 0, 1))
confusionMatrix(pred_cat_test_gb, test$diagnosis)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 94 1
## 1 6 70
##
## Accuracy : 0.9591
## 95% CI : (0.9175, 0.9834)
## No Information Rate : 0.5848
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9166
##
## Mcnemar's Test P-Value : 0.1306
##
## Sensitivity : 0.9400
## Specificity : 0.9859
## Pos Pred Value : 0.9895
## Neg Pred Value : 0.9211
## Prevalence : 0.5848
## Detection Rate : 0.5497
## Detection Prevalence : 0.5556
## Balanced Accuracy : 0.9630
##
## 'Positive' Class : 0
##
The gradient boosting model correctly predicted 164 out of 171 diagnoses, giving us an accuracy of 95.91%. This performance measure, however, can be misleading with an imbalanced dataset like ours. Fortunately, the balanced accuracy of 96.30% implies that the classifier performs well on both classes rather than taking advantage of the skewed class distribution.
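As a sanity check, the balanced accuracy is just the mean of the sensitivity and specificity reported above:
# Balanced accuracy = (sensitivity + specificity) / 2
(0.9400 + 0.9859) / 2
## [1] 0.96295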
rel_inf_gb <- as.data.frame(summary.gbm(best_gradient_boosting, plotit = FALSE))
rel_inf_gb %>%
arrange(desc(rel.inf)) %>%
top_n(4) %>%
ggplot(aes(x = reorder(var, rel.inf),
y = rel.inf,
fill = rel.inf)) +
geom_col() +
coord_flip() +
xlab('Features') +
ylab('Relative Influence') +
ggtitle("Top 4 Predictors of Breast Cancer") +
theme_minimal()
## Selecting by rel.inf
Based on this gradient boosting model, the four most influential variables are uniformity_size, uniformity_shape, bland_chromatin, and clump_thickness.
In this project, we’ve created four machine learning classification models that predict whether a person has breast cancer based on digitized image readings of patients’ fine-needle aspirates. The best-performing model, the gradient boosting machine, correctly classifies patients with and without breast cancer 95.9% of the time. Its test AUC of 99.1% indicates an excellent ability to distinguish between a benign lump and a malignant tumor. The final model also has a lift of 2.2, meaning that it is more than twice as good as random guessing. The top cytological characteristics for identifying breast cancer are the uniformity of cell size, the uniformity of cell shape, the blandness of the chromatin texture, and the thickness of the clump.
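(For reference, that lift figure can be recovered from the test-set confusion matrix above under one common definition: the precision for the malignant class divided by the prevalence of malignant cases, at the default 0.5 threshold.)
# Lift for the malignant class at the 0.5 threshold
precision_mal <- 70 / (70 + 6)    # true malignant among predicted malignant
prevalence_mal <- (70 + 1) / 171  # malignant share of the test set
precision_mal / prevalence_mal
## [1] 2.21831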