Predicting Breast Cancer Using Machine Learning Algorithms

Jasmine Pham

July 1, 2020



Introduction

Problem Identification

Breast cancer is the second most common cancer and the second leading cause of cancer death among women in the United States. According to the American Cancer Society, about 1 in 8 women in the United States will develop breast cancer in her lifetime, and about 2.6% will die from it. One of the warning signs of breast cancer is the development of a tumor in the breast. A tumor, however, can be either benign or malignant.

Objective

This project aims to predict whether an individual has breast cancer and to determine which cytological attributes are significant in distinguishing benign from malignant tumors. To achieve this, I trained four classification models, namely Logistic Regression, Decision Tree, Random Forest, and Gradient Boosting Machine, on a dataset obtained from the UCI Machine Learning Repository. The dataset was created by Dr. William H. Wolberg of the University of Wisconsin, who took digital scans of fine-needle aspirates from patients with solid breast masses and then used a graphical computer program called Xcyt to compute nine cytological characteristics from each digitized image. Together with the sample ID, the attributes are as follows:

 #   Attribute                     Domain
 1   Sample Code Number            ID number
 2   Clump Thickness               1 - 10
 3   Uniformity of Cell Size       1 - 10
 4   Uniformity of Cell Shape      1 - 10
 5   Marginal Adhesion             1 - 10
 6   Single Epithelial Cell Size   1 - 10
 7   Bare Nuclei                   1 - 10
 8   Bland Chromatin               1 - 10
 9   Normal Nucleoli               1 - 10
10   Mitoses                       1 - 10

Data Exploration

library(tidyverse)    # data manipulation (dplyr) and plotting (ggplot2)
library(car)          # vif()
library(corrplot)     # correlation matrix plot
library(pROC)         # ROC curves and AUC
library(MLmetrics)    # Accuracy(), F1_Score(), ConfusionMatrix()
library(rpart)        # decision trees
library(rpart.plot)   # decision tree visualization
library(randomForest) # random forests and varImpPlot()
library(gbm)          # gradient boosting machines
library(caret)        # confusionMatrix()
library(knitr)        # kable()
library(kableExtra)   # kable_styling()
url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data"
data <- read.csv(file = url, header = FALSE,
                 col.names = c("ID", "clump_thickness", "uniformity_size",
                               "uniformity_shape", "marginal_adhesion",
                               "single_epithelial_cell_size", "bare_nuclei",
                               "bland_chromatin", "normal_nucleoli",
                               "mitoses", "diagnosis"))

str(data)
## 'data.frame':    699 obs. of  11 variables:
##  $ ID                         : int  1000025 1002945 1015425 1016277 1017023 1017122 1018099 1018561 1033078 1033078 ...
##  $ clump_thickness            : int  5 5 3 6 4 8 1 2 2 4 ...
##  $ uniformity_size            : int  1 4 1 8 1 10 1 1 1 2 ...
##  $ uniformity_shape           : int  1 4 1 8 1 10 1 2 1 1 ...
##  $ marginal_adhesion          : int  1 5 1 1 3 8 1 1 1 1 ...
##  $ single_epithelial_cell_size: int  2 7 2 3 2 7 2 2 2 2 ...
##  $ bare_nuclei                : Factor w/ 11 levels "?","1","10","2",..: 2 3 4 6 2 3 3 2 2 2 ...
##  $ bland_chromatin            : int  3 3 3 3 3 9 3 3 1 2 ...
##  $ normal_nucleoli            : int  1 2 1 7 1 7 1 1 1 1 ...
##  $ mitoses                    : int  1 1 1 1 1 1 1 1 5 1 ...
##  $ diagnosis                  : int  2 2 2 2 2 4 2 2 2 2 ...
sum(data$bare_nuclei == "?")
## [1] 16

The dataset includes the cytological characteristics of fluid samples from 699 patients. The first column contains unique identifiers that won't help the model, so we drop it first.

data <- select(data, -1)

We'll also exclude the 16 observations that have missing values (recorded as "?") in the bare_nuclei column.

data <- data[data$bare_nuclei != "?",] %>% mutate(bare_nuclei = as.integer(as.character(bare_nuclei)))

The dependent variable diagnosis is currently coded as 2 for "benign" and 4 for "malignant". We'll convert it into a binary factor, with 0 for benign and 1 for malignant.

data <- data %>% mutate(diagnosis = ifelse(diagnosis == 2, 0, 1),
                        diagnosis = as.factor(diagnosis))
summary(data)
##  clump_thickness  uniformity_size  uniformity_shape marginal_adhesion
##  Min.   : 1.000   Min.   : 1.000   Min.   : 1.000   Min.   : 1.00    
##  1st Qu.: 2.000   1st Qu.: 1.000   1st Qu.: 1.000   1st Qu.: 1.00    
##  Median : 4.000   Median : 1.000   Median : 1.000   Median : 1.00    
##  Mean   : 4.442   Mean   : 3.151   Mean   : 3.215   Mean   : 2.83    
##  3rd Qu.: 6.000   3rd Qu.: 5.000   3rd Qu.: 5.000   3rd Qu.: 4.00    
##  Max.   :10.000   Max.   :10.000   Max.   :10.000   Max.   :10.00    
##  single_epithelial_cell_size  bare_nuclei     bland_chromatin  normal_nucleoli
##  Min.   : 1.000              Min.   : 1.000   Min.   : 1.000   Min.   : 1.00  
##  1st Qu.: 2.000              1st Qu.: 1.000   1st Qu.: 2.000   1st Qu.: 1.00  
##  Median : 2.000              Median : 1.000   Median : 3.000   Median : 1.00  
##  Mean   : 3.234              Mean   : 3.545   Mean   : 3.445   Mean   : 2.87  
##  3rd Qu.: 4.000              3rd Qu.: 6.000   3rd Qu.: 5.000   3rd Qu.: 4.00  
##  Max.   :10.000              Max.   :10.000   Max.   :10.000   Max.   :10.00  
##     mitoses       diagnosis
##  Min.   : 1.000   0:444    
##  1st Qu.: 1.000   1:239    
##  Median : 1.000            
##  Mean   : 1.603            
##  3rd Qu.: 1.000            
##  Max.   :10.000
ggplot(data, aes(x = diagnosis)) +
  geom_bar(fill = "#fc9272") +
  ggtitle("Distribution of Diagnosis in the Entire Dataset") +
  theme_minimal() +
  theme(legend.position = "none")

After data cleaning, we now have 683 valid observations: 444 with a benign breast tumor and 239 with a malignant one.
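As a quick sanity check, we can confirm these counts directly (a minimal snippet; the values match the summary above):

# Sanity check on the cleaned data: 683 rows, 444 benign (0), 239 malignant (1)
nrow(data)
table(data$diagnosis)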

correlation <- cor(data[,-10])
corrplot(correlation, type = "lower", col = c("#fcbba1", "#b2d2e8"), addCoef.col = "black", tl.col = "black")  

Looking at the correlations between the independent variables, we can see that uniformity_size and uniformity_shape are highly correlated, which might be an indication of multicollinearity.
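For reference, we can pull out the exact coefficient behind that observation:

# The strongest pairwise correlation among the predictors,
# between uniformity of cell size and uniformity of cell shape
cor(data$uniformity_size, data$uniformity_shape)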

Classification Models in Machine Learning

In order to develop an accurate binary classification model, we first split our dataset randomly into a training set (75%) and a test set (25%).

set.seed(3011) 
train_index <- sample(nrow(data), size = round(0.75 * nrow(data)), replace = FALSE)
train <- data[train_index,]
test <- data[-train_index,]

Logistic Regression

# Develop and tune the logistic regression model on the training dataset
lm <- glm(formula = diagnosis ~ ., data = train, family = binomial())
summary(lm)
## 
## Call:
## glm(formula = diagnosis ~ ., family = binomial(), data = train)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.78585  -0.09901  -0.05573   0.01655   2.67351  
## 
## Coefficients:
##                             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                 -10.9307     1.4908  -7.332 2.27e-13 ***
## clump_thickness               0.5503     0.1717   3.205  0.00135 ** 
## uniformity_size              -0.3486     0.2811  -1.240  0.21488    
## uniformity_shape              0.4196     0.3033   1.384  0.16647    
## marginal_adhesion             0.3406     0.1492   2.283  0.02240 *  
## single_epithelial_cell_size   0.2727     0.2099   1.299  0.19390    
## bare_nuclei                   0.3466     0.1086   3.191  0.00142 ** 
## bland_chromatin               0.5763     0.1978   2.913  0.00357 ** 
## normal_nucleoli               0.3801     0.1562   2.434  0.01494 *  
## mitoses                       0.5530     0.3774   1.465  0.14283    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 648.031  on 511  degrees of freedom
## Residual deviance:  68.586  on 502  degrees of freedom
## AIC: 88.586
## 
## Number of Fisher Scoring iterations: 8
vif(lm)
##             clump_thickness             uniformity_size 
##                    1.225315                    4.265839 
##            uniformity_shape           marginal_adhesion 
##                    4.023195                    1.196004 
## single_epithelial_cell_size                 bare_nuclei 
##                    1.409701                    1.148769 
##             bland_chromatin             normal_nucleoli 
##                    1.326350                    1.158661 
##                     mitoses 
##                    1.116602

As predicted, both uniformity_size and uniformity_shape have notably higher VIFs than the other variables because of their high correlation. However, as a rule of thumb, a VIF below 5 is still acceptable, so we don't have to exclude these variables from the model. Instead, we refit the model without the predictors that were not statistically significant in the full model (uniformity_size, single_epithelial_cell_size, and mitoses):

lm2 <- glm(formula = diagnosis ~ ., data = train %>% select(-c(uniformity_size, single_epithelial_cell_size, mitoses)), family = binomial())
summary(lm2)
## 
## Call:
## glm(formula = diagnosis ~ ., family = binomial(), data = train %>% 
##     select(-c(uniformity_size, single_epithelial_cell_size, mitoses)))
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.58047  -0.10760  -0.05289   0.01724   2.48016  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       -10.1772     1.3467  -7.557 4.12e-14 ***
## clump_thickness     0.6471     0.1602   4.039 5.36e-05 ***
## uniformity_shape    0.2518     0.1864   1.351  0.17661    
## marginal_adhesion   0.3635     0.1410   2.578  0.00993 ** 
## bare_nuclei         0.3534     0.1086   3.256  0.00113 ** 
## bland_chromatin     0.5391     0.1877   2.871  0.00409 ** 
## normal_nucleoli     0.3725     0.1408   2.645  0.00817 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 648.031  on 511  degrees of freedom
## Residual deviance:  74.553  on 505  degrees of freedom
## AIC: 88.553
## 
## Number of Fisher Scoring iterations: 8

We then fit an even more parsimonious model that also drops bare_nuclei; this is the final logistic regression model used for prediction below.

lm3 <- glm(formula = diagnosis ~ ., data = train %>% select(-c(uniformity_size, single_epithelial_cell_size, bare_nuclei, mitoses)), family = binomial())
summary(lm3)
## 
## Call:
## glm(formula = diagnosis ~ ., family = binomial(), data = train %>% 
##     select(-c(uniformity_size, single_epithelial_cell_size, bare_nuclei, 
##         mitoses)))
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.51441  -0.11798  -0.06096   0.01500   2.88791  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       -10.1932     1.2372  -8.239  < 2e-16 ***
## clump_thickness     0.6364     0.1470   4.329  1.5e-05 ***
## uniformity_shape    0.5146     0.1730   2.975 0.002934 ** 
## marginal_adhesion   0.4052     0.1283   3.159 0.001582 ** 
## bland_chromatin     0.6861     0.1765   3.888 0.000101 ***
## normal_nucleoli     0.3906     0.1394   2.802 0.005081 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 648.03  on 511  degrees of freedom
## Residual deviance:  86.48  on 506  degrees of freedom
## AIC: 98.48
## 
## Number of Fisher Scoring iterations: 8
pred_train_lm <- predict(lm3, train, type = 'response')
AUC_train_lm <- roc(train$diagnosis, pred_train_lm, percent = TRUE, plot = TRUE, print.auc = TRUE)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

hist(pred_train_lm, 
     main = "Distribution of Predicted Values",
     xlab = "Predicted Values", 
     col = "#6baed6",
     border = NULL)

The histogram is positively skewed, which is understandable: the original dataset contains almost twice as many benign tumors as malignant ones. Given this imbalance, the conventional threshold of 0.5 may not be the best cutoff for converting predicted probabilities into class labels. We therefore loop over candidate thresholds to find the one that maximizes our classification performance metrics.

accuracy <- 0
f1 <- 0
threshold <- 0

for(i in seq(0.1, 0.9, by = 0.01)){
  pred_cat_train <- ifelse(pred_train_lm < i, 0, 1)
  a = Accuracy(y_true = train$diagnosis, y_pred = pred_cat_train)
  b = F1_Score(y_true = train$diagnosis, y_pred = pred_cat_train)
  
  if(a > accuracy & b > f1){
    accuracy = a
    f1 = b
    threshold = i
  }
}
accuracy
## [1] 0.9726562
f1 
## [1] 0.9796512
threshold 
## [1] 0.48

Now let's apply the final logistic regression model to the test dataset and calculate its performance measures using the optimized threshold of 0.48.

pred_test_lm <- predict(lm3, test, type = 'response')

AUC_test_lm <- roc(test$diagnosis, pred_test_lm, percent = TRUE, plot = TRUE, print.auc = TRUE)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

pred_cat_test <- ifelse(pred_test_lm >= 0.48, 1, 0)
Accuracy(y_true = test$diagnosis, y_pred = pred_cat_test) 
## [1] 0.9590643
F1_Score(y_true = test$diagnosis, y_pred = pred_cat_test) 
## [1] 0.964467
ConfusionMatrix(y_true = test$diagnosis, y_pred = pred_cat_test)
##       y_pred
## y_true  0  1
##      0 95  5
##      1  2 69

Decision Tree

We will tune the decision tree with a search over randomly sampled combinations of minsplit (the minimum number of observations a node must contain for a split to be attempted) and maxdepth (the maximum depth of the tree) in order to find the best combination of hyperparameters.

AUC_train_besttree <- 0
AUC_test_besttree <- 0
AUC_tree <- data.frame(AUC_train_tree = numeric(), AUC_test_tree = numeric())

set.seed(3011)
tree_parameters <- data.frame(minsplit_para = floor(runif(8, 10, 60)), 
                              maxdepth_para = floor(runif(8, 10, 30)))

for(para_comb in 1:nrow(tree_parameters)){
  decision_tree <- rpart(diagnosis ~ .,  data = train,
                      control = rpart.control(minsplit = tree_parameters[para_comb, "minsplit_para"], 
                                              maxdepth = tree_parameters[para_comb, "maxdepth_para"])) 
  
  pred_train_tree <- as.data.frame(predict(decision_tree, train, type='prob'))
  AUC_train_tree <- roc(train$diagnosis, pred_train_tree$`1`, percent = TRUE, plot = TRUE)
  
  pred_test_tree <- as.data.frame(predict(decision_tree, test, type='prob'))
  AUC_test_tree <- roc(test$diagnosis, pred_test_tree$`1`, percent = TRUE, plot = TRUE)

  AUC_tree[para_comb, ] <- c(round(AUC_train_tree$auc, 2), round(AUC_test_tree$auc, 2))
  AUC_train_besttree = ifelse(AUC_train_besttree > AUC_train_tree$auc, AUC_train_besttree, AUC_train_tree$auc)
  AUC_test_besttree = ifelse(AUC_test_besttree > AUC_test_tree$auc, AUC_test_besttree, AUC_test_tree$auc)
}

minsplit_para   maxdepth_para   AUC_train_tree   AUC_test_tree
           22              12            95.52           93.97
           14              24            96.65           95.87
           10              18            97.14           96.44
           15              14            96.65           95.87
           29              15            95.52           93.97
           42              25            95.52           93.97
           59              28            96.51           94.51
           11              10            97.57           95.04

According to our search, the decision tree with the highest training AUC has a minsplit of 11 and a maxdepth of 10.

best_decision_tree <- rpart(diagnosis ~., data = train,
                            control = rpart.control(minsplit = 11,
                                                    maxdepth = 10))
rpart.plot(x = best_decision_tree, box.palette="RdBu", shadow.col="gray", nn=TRUE, yesno = 2)

Random Forest

Similarly, we will tune the random forest with a search over randomly sampled hyperparameter combinations. The hyperparameters are nodesize (the minimum number of observations in the terminal nodes), sampsize (the number of observations sampled to grow each tree), mtry (the number of variables considered as candidates at each split), and ntree (the number of trees in the forest).

AUC_train_bestrf <- 0
AUC_test_bestrf <- 0
AUC_rf <- data.frame(AUC_train_rf = numeric(), AUC_test_rf = numeric()) 

set.seed(160)
rf_parameters <- data.frame(nodesize = round(runif(10,5,20)),
                            sampsize= round(runif(10,1,400)),
                            mtry = round(runif(10,1,10)),
                            ntree = round(runif(10,1,400)))

for(paracomb_rf in 1:nrow(rf_parameters)){
  random_forest <- randomForest(diagnosis ~ ., data = train,
                                nodesize = rf_parameters[paracomb_rf, "nodesize"],
                                sampsize = rf_parameters[paracomb_rf, "sampsize"],
                                mtry = rf_parameters[paracomb_rf, "mtry"],
                                ntree = rf_parameters[paracomb_rf, "ntree"])
  
  pred_train_rf <- as.data.frame(predict(random_forest, train, type='prob'))
  AUC_train_rf <- roc(train$diagnosis, pred_train_rf$`1`, percent = TRUE, plot = TRUE)
  
  pred_test_rf <- as.data.frame(predict(random_forest, test, type='prob'))
  AUC_test_rf <- roc(test$diagnosis, pred_test_rf$`1`, percent = TRUE, plot = TRUE) 
  
  AUC_rf[paracomb_rf, ] <- c(round(AUC_train_rf$auc, 2), round(AUC_test_rf$auc, 2))
  AUC_train_bestrf = ifelse(AUC_train_bestrf > AUC_train_rf$auc, AUC_train_bestrf, AUC_train_rf$auc)
  AUC_test_bestrf = ifelse(AUC_test_bestrf > AUC_test_rf$auc, AUC_test_bestrf, AUC_test_rf$auc)
}

nodesize   sampsize   mtry   ntree   AUC_train_rf   AUC_test_rf
      16        398      7     205          99.66         98.66
      15        366      6     358          99.66         98.43
       8         32      9     339          99.36         98.75
      13        305      4       8          99.05         97.52
      17         50      5     188          99.48         98.81
       9        293      7     189          99.80         98.77
      15         37      9     172          99.42         99.09
      19        303      3     222          99.56         98.58
       9        329      7     210          99.83         98.87
      11        196      6     118          99.60         98.68

According to the search, the random forest with the highest training AUC has a nodesize of 9, a sampsize of 329, an mtry of 7, and an ntree of 210.

best_random_forest <- randomForest(diagnosis ~ ., data = train,
                                   nodesize = 9,
                                   sampsize = 329,
                                   mtry = 7,
                                   ntree = 210)
best_random_forest
## 
## Call:
##  randomForest(formula = diagnosis ~ ., data = train, nodesize = 9,      sampsize = 329, mtry = 7, ntree = 210) 
##                Type of random forest: classification
##                      Number of trees: 210
## No. of variables tried at each split: 7
## 
##         OOB estimate of  error rate: 3.12%
## Confusion matrix:
##     0   1 class.error
## 0 334  10  0.02906977
## 1   6 162  0.03571429
# Identify the most significant independent variables
varImpPlot(best_random_forest)

Gradient Boosting Machine

Lastly, we will tune the gradient boosting model the same way. The hyperparameters are n.trees (the number of trees), shrinkage (the learning rate), interaction.depth (the maximum depth of each tree, fixed at 3 here), bag.fraction (the fraction of the training observations randomly sampled to fit each tree), and n.minobsinnode (the minimum number of observations in the terminal nodes).

AUC_train_bestgb <- 0
AUC_test_bestgb <-0
AUC_gb <- data.frame(AUC_train_gb = numeric(), AUC_test_gb = numeric()) 

set.seed(3011)
gb_parameters <- data.frame(sample_size = round(runif(10,0.5,1), 2),
                           min_size= round(runif(10,5,20)),
                           num_tree = round(runif(10,20,200)),
                           shrink = round(runif(10,0.1,0.5), 2))

for(paracomb_gb in 1:nrow(gb_parameters)){
  gradient_boosting <- gbm(diagnosis ~ ., data = train, 
                           distribution = "bernoulli",
                           n.trees = gb_parameters[paracomb_gb,'num_tree'],
                           shrinkage = gb_parameters[paracomb_gb,'shrink'], 
                           interaction.depth = 3,
                           bag.fraction = gb_parameters[paracomb_gb,'sample_size'], 
                           n.minobsinnode = gb_parameters[paracomb_gb,'min_size'], 
                           verbose = TRUE)

  pred_train_gb <- predict(gradient_boosting, train, type = "response", n.trees = gb_parameters[paracomb_gb,'num_tree'])
  AUC_train_gb <- roc(train$diagnosis, pred_train_gb, percent = TRUE, plot = TRUE)
  
  pred_test_gb <- predict(gradient_boosting, test, type = "response", n.trees = gb_parameters[paracomb_gb,'num_tree'])
  AUC_test_gb <- roc(test$diagnosis, pred_test_gb, percent = TRUE, plot = TRUE) 
  
  AUC_gb[paracomb_gb, ] <- c(round(AUC_train_gb$auc,2), round(AUC_test_gb$auc,2))
  AUC_train_bestgb = ifelse(AUC_train_bestgb > AUC_train_gb$auc, AUC_train_bestgb, AUC_train_gb$auc)
  AUC_test_bestgb = ifelse(AUC_test_bestgb > AUC_test_gb$auc, AUC_test_bestgb, AUC_test_gb$auc)
  
}

sample_size   min_size   num_tree   shrink   AUC_train_gb   AUC_test_gb
       0.63         11         87     0.24            100         98.73
       0.54          8        154     0.17            100         98.89
       0.51          9        195     0.29            100         98.66
       0.55         17        171     0.43            100         98.75
       0.70         19        103     0.47            100         98.58
       0.83          5        148     0.14            100         99.08
       0.99         17        199     0.39            100         98.87
       0.51          9        150     0.36            100         98.46
       0.55          6        159     0.42            100         98.94
       0.86         17        192     0.45            100         98.94

According to our search, the best gradient boosting model is configured as follows:

best_gradient_boosting <- gbm(diagnosis ~ ., data = train, 
                           distribution = "bernoulli",
                           n.trees = 159,
                           shrinkage = 0.42, 
                           interaction.depth = 3,
                           bag.fraction = 0.55, 
                           n.minobsinnode = 6, 
                           verbose = TRUE)
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        0.7387             nan     0.4200    0.2509
##      2        0.5173             nan     0.4200    0.0913
##      3        0.3859             nan     0.4200    0.0658
##      4        0.2842             nan     0.4200    0.0462
##      5        0.2249             nan     0.4200    0.0230
##      6        0.1973             nan     0.4200    0.0032
##      7        0.1627             nan     0.4200    0.0110
##      8        0.1470             nan     0.4200    0.0016
##      9        0.1394             nan     0.4200   -0.0030
##     10        0.1208             nan     0.4200    0.0009
##     20        0.0558             nan     0.4200   -0.0045
##     40        0.0202             nan     0.4200   -0.0019
##     60        0.0088             nan     0.4200   -0.0002
##     80        0.0047             nan     0.4200   -0.0004
##    100        0.0028             nan     0.4200   -0.0001
##    120        0.0014             nan     0.4200   -0.0000
##    140        0.0006             nan     0.4200    0.0000
##    159        0.0004             nan     0.4200   -0.0000

Model Selection and Evaluation

Now, we will evaluate all four classification models by comparing their area under the ROC curve (AUC) on the training set, the test set, and the entire dataset.
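The comparison table below is rendered with kable(). Since the assembly of AUC_values isn't shown above, here is a minimal sketch of how it could be built from the AUC objects we tracked; the AUC_data column (AUC on the entire dataset) is assumed to come from analogous predictions on data, which aren't shown either, so it is omitted from the sketch.

# A sketch of assembling AUC_values from the objects computed above.
# AUC_train_lm / AUC_test_lm are roc objects; the tree, forest, and GBM
# entries are the best AUCs tracked in the tuning loops. The AUC_data
# column is assumed to be computed analogously on `data`, e.g.
# roc(data$diagnosis, predict(lm3, data, type = "response"), percent = TRUE)$auc
AUC_values <- data.frame(
  Model     = c("Logistic Regression", "Decision Tree",
                "Random Forest", "Gradient Boosting Machine"),
  AUC_train = round(c(AUC_train_lm$auc, AUC_train_besttree,
                      AUC_train_bestrf, AUC_train_bestgb), 2),
  AUC_test  = round(c(AUC_test_lm$auc, AUC_test_besttree,
                      AUC_test_bestrf, AUC_test_bestgb), 2)
)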

kable(AUC_values) %>%  kable_styling(bootstrap_options = c("hover", "condensed", "responsive"), full_width = F)
Model                       AUC_train   AUC_test   AUC_data
Logistic Regression             99.53      99.06      99.43
Decision Tree                   97.57      96.44      96.93
Random Forest                   99.83      99.09      99.51
Gradient Boosting Machine      100.00      99.08      99.75

All four models perform well, probably because of the pronounced differences in cytological features between benign and malignant tumors. In this case, I'd choose the gradient boosting model for its exceptionally high and consistent performance across all three datasets. Let's take a closer look at the other performance measures of this model.

pred_test_bestgb <- predict(best_gradient_boosting, test, type = "response", n.trees = 159)
pred_cat_test_gb <- as.factor(ifelse(pred_test_bestgb < 0.5, 0, 1))
confusionMatrix(pred_cat_test_gb, test$diagnosis)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 94  1
##          1  6 70
##                                           
##                Accuracy : 0.9591          
##                  95% CI : (0.9175, 0.9834)
##     No Information Rate : 0.5848          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9166          
##                                           
##  Mcnemar's Test P-Value : 0.1306          
##                                           
##             Sensitivity : 0.9400          
##             Specificity : 0.9859          
##          Pos Pred Value : 0.9895          
##          Neg Pred Value : 0.9211          
##              Prevalence : 0.5848          
##          Detection Rate : 0.5497          
##    Detection Prevalence : 0.5556          
##       Balanced Accuracy : 0.9630          
##                                           
##        'Positive' Class : 0               
## 

The gradient boosting model correctly predicted 164 out of 171 diagnoses, giving us an accuracy of 95.91%. This performance measure, however, can be misleading when we have an imbalanced dataset like this one. Fortunately, the balanced accuracy of 96.30% implies that the classifier performs well on both classes rather than exploiting the skewed class distribution.
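Balanced accuracy is simply the mean of sensitivity and specificity, which we can verify directly from the confusion matrix above (a quick sketch; note that caret treats class 0 as the positive class here):

# Verify balanced accuracy by hand: the mean of sensitivity and specificity
cm <- table(Prediction = pred_cat_test_gb, Reference = test$diagnosis)
sens <- cm["0", "0"] / sum(cm[, "0"])  # 94 / 100 = 0.9400
spec <- cm["1", "1"] / sum(cm[, "1"])  # 70 /  71 ≈ 0.9859
(sens + spec) / 2                      # ≈ 0.9630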

rel_inf_gb <- as.data.frame(summary.gbm(best_gradient_boosting, plotit = FALSE))
rel_inf_gb %>% 
  arrange(desc(rel.inf)) %>%
  top_n(4) %>%
  ggplot(aes(x = reorder(var, rel.inf), 
             y = rel.inf,
             fill = rel.inf)) +
  geom_col() +
  coord_flip() +
  xlab('Features') +
  ylab('Relative Influence') +
  ggtitle("Top 4 Predictors of Breast Cancer") +
  theme_minimal()
## Selecting by rel.inf

Based on this gradient boosting model, the four most influential variables are uniformity_size, uniformity_shape, bland_chromatin, and clump_thickness.

Conclusion

In this project, we've created four machine learning classification models that can predict whether a person has breast cancer based on digitized image readings of the patient's fine-needle aspirate. The best performing model, the gradient boosting machine, correctly classifies patients with and without breast cancer 95.9% of the time. Its test AUC of 99.1% indicates an excellent ability to distinguish a benign lump from a malignant tumor. The final model also has a lift of 2.2, meaning it performs more than twice as well as random guessing. The top cytological characteristics for identifying breast cancer are the uniformity of cell size, the uniformity of cell shape, the uniformity of nucleus texture (bland chromatin), and the thickness of the clump.
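The lift figure quoted above can be checked from the confusion matrix, assuming the common definition of lift at a cutoff: the model's precision on the malignant class divided by the baseline malignant rate (a quick sketch):

# Verify the lift of the final GBM at the 0.5 threshold
cm <- table(Prediction = pred_cat_test_gb, Reference = test$diagnosis)
precision <- cm["1", "1"] / sum(cm["1", ])  # 70 / 76  ≈ 0.92
baseline  <- sum(cm[, "1"]) / sum(cm)       # 71 / 171 ≈ 0.42
precision / baseline                        # ≈ 2.2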