Cancer is the second leading cause of death globally and was responsible for an estimated 9.6 million deaths in 2018; worldwide, about 1 in 6 deaths are due to cancer. It is a leading cause of death in both developed and developing countries, with annual deaths projected to rise to 13.1 million by 2030. However, some forms of cancer, such as breast cancer, have a high chance of total remission if they are detected at an early stage and adequately treated.
Model comparison is a common exercise in machine learning, as multiple models are often fit to the same dataset to see how their performance differs. Generally, no single model dominates for a given type of data, and models also differ in interpretability and presentability. Some of the nuances of this topic are discussed here and here, by authors Taniya and Nischitha Sadananda respectively.
The data used for our project was sourced from the UC Irvine Machine Learning Repository and was created by Dr. Mangasarian, Dr. Wolberg, and Dr. Street. The dataset can be found here, and a description of their research can be found here.
The covariates included are:
The first step was to validate the dataset to ensure there were no missing values and to remove any redundant information. The standard error columns were removed from the analysis, since they were unlikely to provide any meaningful insight beyond what could be gleaned from the mean and worst-case columns.
library(tidyverse)
#install.packages("ggcorrplot")
library(ggcorrplot)
library(grid)
library(gridExtra)
#Loading dataset
breast_cancer <- readxl::read_xlsx('Breast Cancer data - CS 5610.xlsx')
#Removing id and standard error columns
b_cancer <- breast_cancer[, -1]
b_cancer <- b_cancer[, !grepl('_se', colnames(b_cancer))]
colnames(b_cancer)
## [1] "diagnosis" "radius_mean"
## [3] "texture_mean" "perimeter_mean"
## [5] "area_mean" "smoothness_mean"
## [7] "compactness_mean" "concavity_mean"
## [9] "concave points_mean" "symmetry_mean"
## [11] "fractal_dimension_mean" "radius_worst"
## [13] "texture_worst" "perimeter_worst"
## [15] "area_worst" "smoothness_worst"
## [17] "compactness_worst" "concavity_worst"
## [19] "concave points_worst" "symmetry_worst"
## [21] "fractal_dimension_worst"
colnames(b_cancer)[c(9, 19)] <- c("concave_points_mean", "concave_points_worst")
#converting diagnosis to factor
b_cancer$diagnosis <- b_cancer$diagnosis %>% as.factor()
dim(b_cancer)
## [1] 569 21
summary(b_cancer$diagnosis)
## B M
## 357 212
any(is.na(b_cancer)) #No missing data
## [1] FALSE
Next, the covariate matrix was checked for potential multicollinearity issues. There did appear to be several highly correlated covariates in the dataset that could impact model performance. Most of the highly correlated variables were related to radii, perimeters, and areas.
ggcorrplot(cor(b_cancer[-1]), type = 'lower', lab = TRUE) +
ggtitle("Correlation Plot of All Covariates") +
theme(plot.title = element_text(hjust = 0.5, size = 22))
knitr::include_graphics(paste0(getwd(),"/Cancer_data_plots/correlation_plot.png"))
It also appears that there is a variety of spread across the covariates, ranging from extremely right skewed to moderately skewed to approximately normally distributed. Additionally, not all of the covariates share similar scales. Covariates on dissimilar scales can sometimes impact model performance, and some models are more susceptible to this performance loss than others.
ggplot(gather(b_cancer[,-1]), aes(x = value, color = key, fill = key)) +
geom_histogram(bins = 32) +
ggtitle("Covariates Used for Breast Cancer Diagnosis") +
xlab("Value") + ylab("Count") +
theme(plot.title = element_text(hjust = 0.5, size = 22)) +
facet_wrap(~key, scales = 'free_x')
knitr::include_graphics(paste0(getwd(), "/Cancer_data_plots/histograms_all_covariates.png"))
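Should a scale-sensitive method be added later, the covariates could first be put on a common scale; a minimal sketch (not used in the models below):
#Minimal sketch (not part of this analysis): centering and standardizing the covariates
b_cancer_scaled <- b_cancer %>%
  mutate(across(-diagnosis, ~ as.numeric(scale(.x)))) #each covariate now has mean 0, sd 1
summary(b_cancer_scaled$area_mean) #area_mean is roughly centered at 0 after scaling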
Lastly, the dataset was split into mean and worst cases, and histograms were plotted by diagnosis (Benign or Malignant). This was done to identify any potential differences in frequencies by diagnosis.
#Plotting histograms of covariates grouped by diagnosis, for mean/worst
hist <- list()
for(i in names(b_cancer[,-1])){
hist[[i]] <- ggplot(data = b_cancer, aes_string(x = i,
fill = "diagnosis")) +
geom_histogram(position = 'identity', alpha = 0.8, bins = 32)
}
#Worst count covariates
grep('worst', names(b_cancer[,-1]))
grid.arrange(hist[[11]], hist[[12]], hist[[13]], hist[[14]], hist[[15]],
hist[[16]], hist[[17]], hist[[18]], hist[[19]], hist[[20]],
nrow = 4,
top = textGrob("Worst Cancer Data",
gp = gpar(fontsize = 22, font = 2)))
knitr::include_graphics(paste0(getwd(), "/Cancer_data_plots/worst_histogram_by_diag.png"))
#Mean count covariates
grep('mean', names(b_cancer[,-1]))
grid.arrange(hist[[1]], hist[[2]], hist[[3]], hist[[4]], hist[[5]],
hist[[6]], hist[[7]], hist[[8]], hist[[9]], hist[[10]],
nrow = 4,
top = textGrob("Mean Cancer Data",
gp = gpar(fontsize =22, font = 2)))
knitr::include_graphics(paste0(getwd(), "/Cancer_data_plots/mean_histogram_by diag.png"))
These plots clearly show differences in the values of most covariates by diagnosis. Separations like these should benefit classification model training, improving accuracy and reducing misclassification rates. The variables with clear separations are also likely to be highly predictive of cancer diagnosis.
Three different types of models were used for this analysis: logistic regression, random forests, and gradient boosting with XGBoost. Each was applied to the total dataset, to just the mean cancer data, and to just the worst cancer data. Each model generally followed these steps:
Logistic regression is a type of generalized linear model that works with binary response variables and can be easily interpreted. One benefit of logistic regression is that it provides log-odds and odds ratios for covariates of interest. More information about logistic regression can be found here.
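Concretely, with the diagnosis coded as 1 = malignant and 0 = benign, the model relates the log-odds of malignancy linearly to the covariates:

$$\log\!\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k, \qquad p = P(\text{diagnosis} = \text{M} \mid x_1, \dots, x_k)$$

so exponentiating a coefficient, $e^{\beta_j}$, gives the odds ratio associated with a one-unit increase in $x_j$.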
### Installing necessary packages
#install.packages("tidyverse")
#install.packages("caTools") # For Logistic regression
#install.packages("ROCR") # For ROC curve to evaluate model
#install.packages("pscl") # Model evaluation
### Loading package
library(plyr)
library(tidyverse)
library(caTools)
library(ROCR)
library(carData)
library(caret)
library(car)
library(pscl)
###load dataset
data_all <- readxl::read_xlsx("Breast Cancer data - CS 5610.xlsx")
#remove ids and standard errors, setting diagnosis to factor variable
# factor set 1 == "M", 0 == "B"
data_all <- data_all[,-1]
data_all <- data_all[, !grepl('_se', colnames(data_all))]
colnames(data_all)[c(9, 19)] <- c("concave_points_mean", "concave_points_worst")
data_all$diagnosis <- as.factor(data_all$diagnosis)
data_all$diagnosis <- as.integer(data_all$diagnosis)-1
### Correlation plot for whole dataset
findCorrelation(cor(data_all[-1]), cutoff = 0.75, names = TRUE)
## [1] "concave_points_worst" "concave_points_mean"
## [3] "concavity_mean" "compactness_mean"
## [5] "concavity_worst" "perimeter_worst"
## [7] "compactness_worst" "radius_worst"
## [9] "perimeter_mean" "area_worst"
## [11] "area_mean" "smoothness_worst"
## [13] "fractal_dimension_worst" "texture_mean"
# 14 variables that have at least one correlation above 0.75
### Splitting dataset dividing data 75/25 split
set.seed(99)
split <- sample.split(data_all$diagnosis, SplitRatio = 0.75)
train_all <- subset(data_all, split == "TRUE")
test_all <- subset(data_all, split == "FALSE")
### Training the full model and summary output
logistic_full <- glm(diagnosis ~ ., data=train_all, family="binomial")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
pR2(logistic_full)["McFadden"]
## fitting null model for pseudo-r2
## McFadden
## 0.9084415
vif(logistic_full)
## radius_mean texture_mean perimeter_mean
## 1814.660498 13.252368 1077.329499
## area_mean smoothness_mean compactness_mean
## 603.890780 17.151417 80.098313
## concavity_mean concave_points_mean symmetry_mean
## 41.137726 16.287288 5.390728
## fractal_dimension_mean radius_worst texture_worst
## 31.875074 391.940065 9.281670
## perimeter_worst area_worst smoothness_worst
## 74.294101 279.561432 5.543660
## compactness_worst concavity_worst concave_points_worst
## 58.542447 36.312709 8.271853
## symmetry_worst fractal_dimension_worst
## 4.359011 32.192332
The full model appeared to fit the data well with a high McFadden R2 = 0.9084415, but it had only one significant coefficient, and several coefficients had high variance inflation factors (VIFs).
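For reference, McFadden's pseudo-R2 compares the log-likelihood of the fitted model to that of an intercept-only null model:

$$R^2_{\text{McFadden}} = 1 - \frac{\log \hat{L}(M_{\text{fitted}})}{\log \hat{L}(M_{\text{null}})}$$

with values closer to 1 indicating a better fit.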
#Several VIF values above 5 (e.g., radius_worst and perimeter_worst) indicate
#severe multicollinearity, so multicollinearity is an issue in this model.
#Set a VIF threshold; all variables with a VIF above the threshold are
#sequentially dropped from the model.
threshold=4.99
### Sequentially drop the variable with the largest VIF until
# all variables have VIF less than threshold
logistic_all <- logistic_full
drop=TRUE
aftervif=data.frame()
while(drop==TRUE) {
vmodel=vif(logistic_all)
aftervif=rbind.fill(aftervif,as.data.frame(t(vmodel)))
if(max(vmodel)>threshold) {
logistic_all=update(logistic_all,as.formula(paste(".","~",".","-",names(which.max(vmodel))))) }
else { drop=FALSE }}
#Model after removing correlated Variables with their VIFs
print(as.data.frame(vmodel))
## vmodel
## texture_mean 1.554087
## concavity_mean 3.728442
## concave_points_mean 4.798353
## symmetry_mean 2.777392
## fractal_dimension_mean 3.269113
## perimeter_worst 1.545516
## smoothness_worst 2.385307
## concave_points_worst 3.296249
## symmetry_worst 2.866333
### Use the Model to Make Predictions on test data
# Predict test data, converting to 0 or 1 based on 0.5 cutoff value
predict_reg <- predict(logistic_all, test_all, type = "response")
predict_reg <- ifelse(predict_reg > 0.5, 1, 0)
predict_reg <- as.vector(predict_reg)
# Diagnostics plots
par(mfrow = c(2,2))
plot(logistic_all, which = 1:4, main = "All Cancer Data")
### ROC-AUC Curve
ROCPred <- prediction(predict_reg, test_all$diagnosis)
ROCPer <- performance(ROCPred, measure = "tpr",
x.measure = "fpr")
auc <- performance(ROCPred, measure = "auc")
auc <- auc@y.values[[1]]
auc
### Plotting curve
par(mfrow = c(1,1))
plot(ROCPer, main = "ROC Curve for All Cancer Data")
abline(a = 0, b = 1)
auc <- round(auc, 4)
legend(.6, .4, auc, title = "AUC", cex = 1)
### Evaluating model accuracy
predict_reg <- factor(ifelse(predict_reg > 0.5, 1, 0),
labels = c("B", "M"))
test_all$diagnosis <- factor(ifelse(test_all$diagnosis > 0.5, 1, 0),
labels = c("B", "M"))
all_confusion <- caret::confusionMatrix(test_all$diagnosis, predict_reg,
mode = 'everything',
positive = 'M')
all_r2 <- pR2(logistic_all)["McFadden"]
Here is the same setup, but for just the mean cancer data.
###load dataset
data_mean <- read.csv("breast_cancer_mean.csv")
# Setting diagnosis to factor variable: 1 == "M", 0 == "B"
data_mean$diagnosis <- as.factor(data_mean$diagnosis)
data_mean$diagnosis <- as.integer(data_mean$diagnosis)-1
### Summary of the dataset
summary(data_mean)
nrow(data_mean)
### Correlation plot for whole dataset
#pairs(data_mean[-1])
findCorrelation(cor(data_mean[-1]), cutoff = 0.7, names = TRUE)
### Splitting dataset dividing data 75/25 split
set.seed(99)
split <- sample.split(data_mean$diagnosis, SplitRatio = 0.75)
head(split)
train_mean <- subset(data_mean, split == "TRUE")
test_mean <- subset(data_mean, split == "FALSE")
### Training the full model and summary output
logistic_mean <- glm(diagnosis ~ ., data=train_mean, family="binomial")
logistic_mean
summary(logistic_mean)
#Assessing Model Fit
#We can compute McFadden's R2 for our model using the pR2 function from the pscl package.
pR2(logistic_mean)["McFadden"]
#A high McFadden's R2 indicates that the model fits the data well
#and has high predictive power.
#Variable Importance
varImp(logistic_mean, sort = TRUE)
#calculate VIF values for each predictor variable in our model
vif(logistic_mean)
#Several VIF values above 5 indicate severe multicollinearity,
#so multicollinearity is an issue in this model.
#Set a VIF threshold; all variables with a VIF above the threshold are
#sequentially dropped from the model.
threshold=4.99
### Sequentially drop the variable with the largest VIF until
# all variables have VIF less than threshold
drop=TRUE
aftervif=data.frame()
while(drop==TRUE) {
vmodel=vif(logistic_mean)
aftervif=rbind.fill(aftervif,as.data.frame(t(vmodel)))
if(max(vmodel)>threshold) {
logistic_mean=update(logistic_mean,as.formula(paste(".","~",".","-",names(which.max(vmodel))))) }
else { drop=FALSE }}
### How variables removed sequentially
t_aftervif= as.data.frame(t(aftervif))
# Final (uncorrelated) variables with their VIFs
print(as.data.frame(vmodel))
## vmodel
## texture_mean 1.778788
## area_mean 1.983117
## smoothness_mean 2.954579
## compactness_mean 3.592998
## concavity_mean 2.585441
## symmetry_mean 1.817450
### Use the Model to Make Predictions on test data
# Predict test data, converting to 0 or 1 based on 0.5 cutoff value
predict_reg <- predict(logistic_mean, test_mean, type = "response")
predict_reg <- ifelse(predict_reg > 0.5, 1, 0)
predict_reg <- as.vector(predict_reg)
### Model Diagnostics
# Diagnostic plots
par(mfrow = c(2,2))
plot(logistic_mean, which = 1:4, main = "Mean Cancer Data")
# ROC-AUC Curve
ROCPred <- prediction(predict_reg, test_mean$diagnosis)
ROCPer <- performance(ROCPred, measure = "tpr",
x.measure = "fpr")
auc <- performance(ROCPred, measure = "auc")
auc <- auc@y.values[[1]]
auc
### Plotting curve
par(mfrow = c(1,1))
plot(ROCPer, main = "ROC Curve for Mean Cancer Data")
abline(a = 0, b = 1)
auc <- round(auc, 4)
legend(.6, .4, auc, title = "AUC", cex = 1)
### Evaluating model accuracy
predict_reg <- factor(ifelse(predict_reg > 0.5, 1, 0),
labels = c("B", "M"))
test_mean$diagnosis <- factor(ifelse(test_mean$diagnosis > 0.5, 1, 0),
labels = c("B", "M"))
mean_confusion<- caret::confusionMatrix(test_mean$diagnosis, predict_reg,
mode = 'everything',
positive = 'M')
mean_r2 <- pR2(logistic_mean)["McFadden"]
## fitting null model for pseudo-r2
Here is the same setup, but for just the worst cancer data.
###load dataset
data_worst <- read.csv("breast_cancer_worst.csv")
# Setting diagnosis to factor variable: 1 == "M", 0 == "B"
data_worst$diagnosis <- as.factor(data_worst$diagnosis)
data_worst$diagnosis <- as.integer(data_worst$diagnosis)-1
### Summary of the dataset
summary(data_worst)
nrow(data_worst)
### Correlation plot for whole dataset
#pairs(data_worst[-1])
findCorrelation(cor(data_worst[-1]), cutoff = 0.7, names = TRUE)
### Splitting dataset dividing data 75/25 split
set.seed(99)
split <- sample.split(data_worst$diagnosis, SplitRatio = 0.75)
head(split)
train_worst <- subset(data_worst, split == "TRUE")
test_worst <- subset(data_worst, split == "FALSE")
### Training the full model and summary output
logistic_worst <- glm(diagnosis ~ ., data=train_worst, family="binomial")
logistic_worst
summary(logistic_worst)
#Assessing Model Fit
#We can compute McFadden's R2 for our model using the pR2 function from the pscl package.
pR2(logistic_worst)["McFadden"]
#A high McFadden's R2 indicates that the model fits the data well
#and has high predictive power.
#Variable Importance
varImp(logistic_worst, sort = TRUE)
#calculate VIF values for each predictor variable in our model
vif(logistic_worst)
#Several VIF values above 5 indicate severe multicollinearity,
#so multicollinearity is an issue in this model.
#Set a VIF threshold; all variables with a VIF above the threshold are
#sequentially dropped from the model.
threshold=4.99
### Sequentially drop the variable with the largest VIF until
# all variables have VIF less than threshold
drop=TRUE
aftervif=data.frame()
while(drop==TRUE) {
vmodel=vif(logistic_worst)
aftervif=rbind.fill(aftervif,as.data.frame(t(vmodel)))
if(max(vmodel)>threshold) {
logistic_worst=update(logistic_worst,as.formula(paste(".","~",".","-",names(which.max(vmodel))))) }
else { drop=FALSE }}
#Model after removing correlated Variables
summary(logistic_worst)
vif(logistic_worst)
### How variables removed sequentially
t_aftervif= as.data.frame(t(aftervif))
# Final (uncorrelated) variables with their VIFs
print(as.data.frame(vmodel))
## vmodel
## texture_worst 1.429116
## area_worst 1.496540
## smoothness_worst 2.555601
## concavity_worst 3.040412
## concave_points_worst 2.707778
## symmetry_worst 1.368137
## fractal_dimension_worst 3.729195
### Use the Model to Make Predictions on test data
# Predict test data, converting to 0 or 1 based on 0.5 cutoff value
predict_reg <- predict(logistic_worst, test_worst, type = "response")
predict_reg <- ifelse(predict_reg > 0.5, 1, 0)
predict_reg <- as.vector(predict_reg)
### Model Diagnostics
# Diagnostic plots
par(mfrow = c(2,2))
plot(logistic_worst, which = 1:4, main = "Worst Cancer Data")
# ROC-AUC Curve
ROCPred <- prediction(predict_reg, test_worst$diagnosis)
ROCPer <- performance(ROCPred, measure = "tpr",
x.measure = "fpr")
auc <- performance(ROCPred, measure = "auc")
auc <- auc@y.values[[1]]
auc
### Plotting curve
par(mfrow = c(1,1))
plot(ROCPer, main = "ROC Curve for Worst Cancer Data")
abline(a = 0, b = 1)
auc <- round(auc, 4)
legend(.6, .4, auc, title = "AUC", cex = 1)
### Evaluating model accuracy
predict_reg <- factor(ifelse(predict_reg > 0.5, 1, 0),
labels = c("B", "M"))
test_worst$diagnosis <- factor(ifelse(test_worst$diagnosis > 0.5, 1, 0),
labels = c("B", "M"))
worst_confusion<- caret::confusionMatrix(test_worst$diagnosis, predict_reg,
mode = 'everything',
positive = 'M')
worst_r2 <- pR2(logistic_worst)["McFadden"]
A random forest is an algorithm that uses ensemble learning for either regression or classification problems. A random forest makes predictions based on the aggregate outcomes of multiple decision trees. Decision trees are intuitive, as they resemble heuristics like flow charts. Random forests rely on a majority-voting system, in which the final prediction is the outcome that the majority of the decision trees agree on. That is, if a majority of the ensemble's decision trees predict "Yes", the final random forest prediction is also "Yes". More information about random forests can be found here.
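To make the majority vote concrete, the per-tree votes behind a single prediction can be inspected with the predict.all = TRUE argument of randomForest's predict(); a minimal sketch, assuming the rf_all model and test_all split created in the code below:
#Minimal sketch of majority voting (assumes rf_all and test_all from the code below)
votes <- predict(rf_all, newdata = test_all[1, ], predict.all = TRUE)
table(votes$individual) #how the 500 trees voted ("B" vs "M") for the first test case
votes$aggregate #the majority-vote class returned by predict()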
#Loading required libraries
library(tidyverse)
library(randomForest)
library(caTools)
library(caret)
#Loading datasets
breast_cancer <- readxl::read_xlsx('Breast Cancer data - CS 5610.xlsx')
cancer_mean <- read.csv('breast_cancer_mean.csv')
cancer_worst <- read.csv('breast_cancer_worst.csv')
#Removing SE columns and renaming two columns for cancer_all
cancer_all <- breast_cancer[, -1]
cancer_all <- cancer_all[, !grepl('_se', colnames(cancer_all))]
colnames(cancer_all)[c(9, 19)] <- c("concave_points_mean", "concave_points_worst")
#Setting 'diagnosis' to factor variable
cancer_all$diagnosis <- as.factor(cancer_all$diagnosis)
cancer_mean$diagnosis <- as.factor(cancer_mean$diagnosis)
cancer_worst$diagnosis <- as.factor(cancer_worst$diagnosis)
set.seed(99)
sampl_all <- sample.split(cancer_all$diagnosis, SplitRatio = 0.75)
train_all <- subset(cancer_all, sampl_all == TRUE)
test_all <- subset(cancer_all, sampl_all != TRUE)
control <- trainControl(method = 'cv', number = 5, search = 'grid')
tunegrid <- expand.grid(mtry = c(1:ncol(train_all)))
set.seed(99)
test_rf <- train(diagnosis ~., data = train_all, method = 'rf',
metric = 'Accuracy',
tuneGrid = tunegrid,
trControl = control)
test_rf$bestTune
## building training model
set.seed(99)
rf_all <- randomForest(diagnosis ~., data = train_all,
ntree = 500,
mtry = 20,
importance = TRUE)
cancer_all_rf.pred <- predict(rf_all, newdata = test_all)
confusionMatrix(cancer_all_rf.pred, test_all$diagnosis,
mode = 'everything',
positive = 'M')
plot(test_rf, main = "CV Accuracy per Number of Included Predictors",
sub = "All Cancer Data")
varImpPlot(rf_all, main = "Variable Importance: All Cancer Data")
set.seed(99)
sampl_mean <- sample.split(cancer_mean$diagnosis, SplitRatio = 0.75)
train_mean <- subset(cancer_mean, sampl_mean == TRUE)
test_mean <- subset(cancer_mean, sampl_mean != TRUE)
control <- trainControl(method = 'cv', number = 5, search = 'grid')
tunegrid <- expand.grid(mtry = c(1:ncol(train_mean)))
set.seed(99)
test_rf <- train(diagnosis ~., data = train_mean, method = 'rf',
metric = 'Accuracy',
tuneGrid = tunegrid,
trControl = control)
test_rf$bestTune
## building training model
set.seed(99)
rf_mean <- randomForest(diagnosis ~., data = train_mean,
ntree = 500,
mtry = 8,
importance = TRUE)
cancer_mean_rf.pred <- predict(rf_mean, newdata = test_mean)
confusionMatrix(cancer_mean_rf.pred, test_mean$diagnosis,
mode = 'everything',
positive = 'M')
plot(test_rf, main = "CV Accuracy per Number of Included Predictors",
sub = "Mean Cancer Data")
varImpPlot(rf_mean, main = "Variable Importance: Mean Cancer Data")
set.seed(99)
sampl_worst <- sample.split(cancer_worst$diagnosis, SplitRatio = 0.75)
train_worst <- subset(cancer_worst, sampl_worst == TRUE)
test_worst <- subset(cancer_worst, sampl_worst != TRUE)
control <- trainControl(method = 'cv', number = 5, search = 'grid')
tunegrid <- expand.grid(mtry = c(1:ncol(train_worst)))
set.seed(99)
test_rf <- train(diagnosis ~., data = train_worst, method = 'rf',
metric = 'Accuracy',
tuneGrid = tunegrid,
trControl = control)
test_rf$bestTune
## building training model
set.seed(99)
rf_worst <- randomForest(diagnosis ~., data = train_worst,
ntree = 500,
mtry = 10,
importance = TRUE)
cancer_worst_rf.pred <- predict(rf_worst, newdata = test_worst)
confusionMatrix(cancer_worst_rf.pred, test_worst$diagnosis,
mode = 'everything',
positive = 'M')
plot(test_rf, main = "CV Accuracy per Number of Included Predictors",
sub = "Worst Cancer Data")
varImpPlot(rf_worst, main = "Variable Importance: Worst Cancer Data")
XGBoost stands for “extreme gradient boosting”; it is an implementation of gradient boosted decision trees that is optimized for speed and performance. In general, gradient boosting refers to iteratively fitting each new model to the residuals of the loss function left by the previously fitted models. The XGBoost library provides gradient boosted decision trees (and, optionally, boosted linear models) for both classification and regression problems. More information about the XGBoost library can be found here.
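In symbols, each boosting round adds a new tree $h_m$ fit to the pseudo-residuals (the negative gradient of the loss) of the current ensemble, scaled by the learning rate $\eta$ (the eta parameter tuned below):

$$F_m(x) = F_{m-1}(x) + \eta\, h_m(x), \qquad h_m \approx -\,\frac{\partial L\big(y, F_{m-1}(x)\big)}{\partial F_{m-1}(x)}$$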
#Loading packages
#install.packages('xgboost')
library(xgboost)
library(caret)
library(caTools)
library(tidyverse)
library(pROC)
#Loading datasets
breast_cancer <- readxl::read_xlsx('Breast Cancer data - CS 5610.xlsx')
cancer_mean <- read.csv('breast_cancer_mean.csv')
cancer_worst <- read.csv('breast_cancer_worst.csv')
#Removing SE columns and renaming two columns for cancer_all
cancer_all <- breast_cancer[, -1]
cancer_all <- cancer_all[, !grepl('_se', colnames(cancer_all))]
colnames(cancer_all)[c(9, 19)] <- c("concave_points_mean", "concave_points_worst")
#Setting 'diagnosis' to factor variable
cancer_all$diagnosis <- as.factor(cancer_all$diagnosis)
cancer_mean$diagnosis <- as.factor(cancer_mean$diagnosis)
cancer_worst$diagnosis <- as.factor(cancer_worst$diagnosis)
set.seed(99)
sampl_all <- sample.split(cancer_all$diagnosis, SplitRatio = 0.75)
train_all <- subset(cancer_all, sampl_all == TRUE)
test_all <- subset(cancer_all, sampl_all != TRUE)
## Creating the independent variable and label matrices of train/test data
train_all_data <- as.matrix(train_all[-1])
train_all_label <- train_all$diagnosis
## Converting labels to 0,1 where "M" is coded as 1
train_all_label <- as.integer(train_all_label)-1
train_all$diagnosis[1:5]; train_all_label[1:5]
## [1] M M M M M
## Levels: B M
## [1] 1 1 1 1 1
## Repeat for test dataset
test_all_data <- as.matrix(test_all[-1])
test_all_label <- test_all$diagnosis
test_all_label <- as.integer(test_all_label)-1
test_all$diagnosis[1:5]; test_all_label[1:5]
## [1] M M B M M
## Levels: B M
## [1] 1 1 0 1 1
## Formatting data as XGBoost matrices
all_dtrain <- xgb.DMatrix(data = train_all_data, label = train_all_label)
all_dtest <- xgb.DMatrix(data = test_all_data, label = test_all_label)
### parameters: max_depth, eta, subsample, colsample_bytree, and min_child_weight
all_low_err_list <- list()
all_parameters_list <- list()
set.seed(99)
for(i in 1:3000){
params <- list(booster = "gbtree",
objective = "binary:logistic",
max_depth = sample(3:25, 1),
eta = runif(1, 0.01, 0.3),
subsample = runif(1, 0.5, 1),
colsample_bytree = runif(1, 0.5, 1),
min_child_weight = sample(0:10, 1)
)
parameters <- as.data.frame(params)
all_parameters_list[[i]] <- parameters
}
all_parameters_df <- do.call(rbind, all_parameters_list) #df containing random search params
### Fitting xgboost models based on search parameters
for (row in 1:nrow(all_parameters_df)){
set.seed(99)
all_tmp_mdl <- xgb.cv(data = all_dtrain,
booster = "gbtree",
objective = "binary:logistic",
nfold = 5,
prediction = TRUE,
max_depth = all_parameters_df$max_depth[row],
eta = all_parameters_df$eta[row],
subsample = all_parameters_df$subsample[row],
colsample_bytree = all_parameters_df$colsample_bytree[row],
min_child_weight = all_parameters_df$min_child_weight[row],
nrounds = 200,
eval_metric = "error",
early_stopping_rounds = 20,
print_every_n = 500,
verbose = 0
)
#store the highest CV accuracy (1 minus the lowest CV error) for this parameter set
all_low_err <- as.data.frame(1 - min(all_tmp_mdl$evaluation_log$test_error_mean))
all_low_err_list[[row]] <- all_low_err
}
all_low_err_df <- do.call(rbind, all_low_err_list) #accuracies
all_randsearch <- cbind(all_low_err_df, all_parameters_df) #data frame with everything
###Reformatting the dataframe
all_randsearch <- all_randsearch %>%
dplyr::rename(val_acc = '1 - min(all_tmp_mdl$evaluation_log$test_error_mean)') %>%
dplyr::arrange(-val_acc)
###Grabbing just the top model
all_randsearch_best <- all_randsearch[1,]
###Storing best parameters in list
all_best_params <- list(booster = all_randsearch_best$booster,
objective = all_randsearch_best$objective,
max_depth = all_randsearch_best$max_depth,
eta = all_randsearch_best$eta,
subsample = all_randsearch_best$subsample,
colsample_bytree = all_randsearch_best$colsample_bytree,
min_child_weight = all_randsearch_best$min_child_weight)
### Finding the best nround parameter for the model using 5-fold cross validation
set.seed(99)
all_xgbcv <- xgb.cv(params = all_best_params,
data = all_dtrain,
nrounds = 500,
nfold = 5,
prediction = TRUE,
print_every_n = 50,
early_stopping_rounds = 25,
eval_metric = "error",
verbose = 0
)
all_xgbcv$best_iteration
set.seed(99)
all_best_xgb <- xgb.train(params = all_best_params,
data = all_dtrain,
nrounds = all_xgbcv$best_iteration,
eval_metric = "error",
)
xgb.save(all_best_xgb, 'final_xgb_cancerall')
## [1] TRUE
cancer_all.pred <- predict(all_best_xgb, all_dtest)
cancer_all.pred <- factor(ifelse(cancer_all.pred > 0.5, 1, 0),
labels = c("B", "M"))
confusionMatrix(cancer_all.pred, test_all$diagnosis,
mode = 'everything',
positive = 'M')
## Visualizations
all_impt_mtx <- xgb.importance(feature_names = colnames(test_all_data), model = all_best_xgb)
xgb.plot.importance(importance_matrix = all_impt_mtx,
xlab = "Variable Importance")
### ROC curve for 5-fold CV random parameter search
all_randsearch_roc <- roc(response = train_all_label,
predictor = all_tmp_mdl$pred,
print.auc = TRUE,
plot = TRUE)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
### ROC curve for 5-fold CV nround parameter search
all_nround_roc <- roc(response = train_all_label,
predictor = all_xgbcv$pred,
print.auc = TRUE,
plot = TRUE)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
set.seed(99)
sampl_mean <- sample.split(cancer_mean$diagnosis, SplitRatio = 0.75)
train_mean <- subset(cancer_mean, sampl_mean == TRUE)
test_mean <- subset(cancer_mean, sampl_mean != TRUE)
## Creating the independent variable and label matrices of train/test data
train_mean_data <- as.matrix(train_mean[-1])
train_mean_label <- train_mean$diagnosis
## Converting labels to 0,1 where "M" is coded as 1
train_mean_label <- as.integer(train_mean_label)-1
train_mean$diagnosis[1:5]; train_mean_label[1:5]
## [1] M M M M M
## Levels: B M
## [1] 1 1 1 1 1
## Repeat for test dataset
test_mean_data <- as.matrix(test_mean[-1])
test_mean_label <- test_mean$diagnosis
test_mean_label <- as.integer(test_mean_label)-1
test_mean$diagnosis[1:5]; test_mean_label[1:5]
## [1] M M B M M
## Levels: B M
## [1] 1 1 0 1 1
## Formatting data as XGBoost matrices
mean_dtrain <- xgb.DMatrix(data = train_mean_data, label = train_mean_label)
mean_dtest <- xgb.DMatrix(data = test_mean_data, label = test_mean_label)
### parameters: max_depth, eta, subsample, colsample_bytree, and min_child_weight
mean_low_err_list <- list()
mean_parameters_list <- list()
set.seed(99)
for(i in 1:3000){
params <- list(booster = "gbtree",
objective = "binary:logistic",
max_depth = sample(3:25, 1),
eta = runif(1, 0.01, 0.3),
subsample = runif(1, 0.5, 1),
colsample_bytree = runif(1, 0.5, 1),
min_child_weight = sample(0:10, 1)
)
parameters <- as.data.frame(params)
mean_parameters_list[[i]] <- parameters
}
mean_parameters_df <- do.call(rbind, mean_parameters_list) #df containing random search params
### Fitting xgboost models based on search parameters
for (row in 1:nrow(mean_parameters_df)){
set.seed(99)
mean_tmp_mdl <- xgb.cv(data = mean_dtrain,
booster = "gbtree",
objective = "binary:logistic",
nfold = 5,
prediction = TRUE,
max_depth = mean_parameters_df$max_depth[row],
eta = mean_parameters_df$eta[row],
subsample = mean_parameters_df$subsample[row],
colsample_bytree = mean_parameters_df$colsample_bytree[row],
min_child_weight = mean_parameters_df$min_child_weight[row],
nrounds = 200,
eval_metric = "error",
early_stopping_rounds = 20,
print_every_n = 500,
verbose = 0
)
#store the highest CV accuracy (1 minus the lowest CV error) for this parameter set
mean_low_err <- as.data.frame(1 - min(mean_tmp_mdl$evaluation_log$test_error_mean))
mean_low_err_list[[row]] <- mean_low_err
}
mean_low_err_df <- do.call(rbind, mean_low_err_list) #accuracies
mean_randsearch <- cbind(mean_low_err_df, mean_parameters_df) #data frame with everything
###Reformatting the dataframe
mean_randsearch <- mean_randsearch %>%
dplyr::rename(val_acc = '1 - min(mean_tmp_mdl$evaluation_log$test_error_mean)') %>%
dplyr::arrange(-val_acc)
###Grabbing just the top model
mean_randsearch_best <- mean_randsearch[1,]
### Storing best parameters in list
mean_best_params <- list(booster = mean_randsearch_best$booster,
objective = mean_randsearch_best$objective,
max_depth = mean_randsearch_best$max_depth,
eta = mean_randsearch_best$eta,
subsample = mean_randsearch_best$subsample,
colsample_bytree = mean_randsearch_best$colsample_bytree,
min_child_weight = mean_randsearch_best$min_child_weight)
set.seed(99)
mean_xgbcv <- xgb.cv(params = mean_best_params,
data = mean_dtrain,
nrounds = 500,
nfold = 5,
prediction = TRUE,
print_every_n = 50,
early_stopping_rounds = 25,
eval_metric = "error",
verbose = 0
)
mean_xgbcv$best_iteration
set.seed(99)
mean_best_xgb <- xgb.train(params = mean_best_params,
data = mean_dtrain,
nrounds = mean_xgbcv$best_iteration,
eval_metric = "error",
)
xgb.save(mean_best_xgb, 'final_xgb_cancermean')
## [1] TRUE
cancer_mean.pred <- predict(mean_best_xgb, mean_dtest)
cancer_mean.pred <- factor(ifelse(cancer_mean.pred > 0.5, 1, 0),
labels = c("B", "M"))
## Visualizations
mean_impt_mtx <- xgb.importance(feature_names = colnames(test_mean_data), model = mean_best_xgb)
xgb.plot.importance(importance_matrix = mean_impt_mtx,
xlab = "Variable Importance")
### ROC curve for 5-fold CV random parameter search
mean_randsearch_roc <- roc(response = train_mean_label,
predictor = mean_tmp_mdl$pred,
print.auc = TRUE,
plot = TRUE)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
### ROC curve for 5-fold CV nround parameter search
mean_nround_roc <- roc(response = train_mean_label,
predictor = mean_xgbcv$pred,
print.auc = TRUE,
plot = TRUE)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
set.seed(99)
sampl_worst <- sample.split(cancer_worst$diagnosis, SplitRatio = 0.75)
train_worst <- subset(cancer_worst, sampl_worst == TRUE)
test_worst <- subset(cancer_worst, sampl_worst != TRUE)
## Creating the independent variable and label matrices of train/test data
train_worst_data <- as.matrix(train_worst[-1])
train_worst_label <- train_worst$diagnosis
## Converting labels to 0,1 where "M" is coded as 1
train_worst_label <- as.integer(train_worst_label)-1
train_worst$diagnosis[1:5]; train_worst_label[1:5]
## [1] M M M M M
## Levels: B M
## [1] 1 1 1 1 1
## Repeat for test dataset
test_worst_data <- as.matrix(test_worst[-1])
test_worst_label <- test_worst$diagnosis
test_worst_label <- as.integer(test_worst_label)-1
test_worst$diagnosis[1:5]; test_worst_label[1:5]
## [1] M M B M M
## Levels: B M
## [1] 1 1 0 1 1
## Formatting data as XGBoost matrices
worst_dtrain <- xgb.DMatrix(data = train_worst_data, label = train_worst_label)
worst_dtest <- xgb.DMatrix(data = test_worst_data, label = test_worst_label)
### parameters: max_depth, eta, subsample, colsample_bytree, and min_child_weight
worst_low_err_list <- list()
worst_parameters_list <- list()
set.seed(99)
for(i in 1:3000){
params <- list(booster = "gbtree",
objective = "binary:logistic",
max_depth = sample(3:25, 1),
eta = runif(1, 0.01, 0.3),
subsample = runif(1, 0.5, 1),
colsample_bytree = runif(1, 0.5, 1),
min_child_weight = sample(0:10, 1)
)
parameters <- as.data.frame(params)
worst_parameters_list[[i]] <- parameters
}
worst_parameters_df <- do.call(rbind, worst_parameters_list) #df containing random search params
### Fitting 5-fold CV xgboost models based on search parameters
for (row in 1:nrow(worst_parameters_df)){
set.seed(99)
worst_tmp_mdl <- xgb.cv(data = worst_dtrain,
booster = "gbtree",
objective = "binary:logistic",
nfold = 5,
prediction = TRUE,
max_depth = worst_parameters_df$max_depth[row],
eta = worst_parameters_df$eta[row],
subsample = worst_parameters_df$subsample[row],
colsample_bytree = worst_parameters_df$colsample_bytree[row],
min_child_weight = worst_parameters_df$min_child_weight[row],
nrounds = 200,
eval_metric = "error",
early_stopping_rounds = 20,
print_every_n = 500,
verbose = 0
)
#store the highest CV accuracy (1 minus the lowest CV error) for this parameter set
worst_low_err <- as.data.frame(1 - min(worst_tmp_mdl$evaluation_log$test_error_mean))
worst_low_err_list[[row]] <- worst_low_err
}
worst_low_err_df <- do.call(rbind, worst_low_err_list) #accuracies
worst_randsearch <- cbind(worst_low_err_df, worst_parameters_df) #data frame with everything
###Reformatting the dataframe
worst_randsearch <- worst_randsearch %>%
dplyr::rename(val_acc = '1 - min(worst_tmp_mdl$evaluation_log$test_error_mean)') %>%
dplyr::arrange(-val_acc)
###Grabbing just the top model
worst_randsearch_best <- worst_randsearch[1,]
### Storing best parameters in list
worst_best_params <- list(booster = worst_randsearch_best$booster,
objective = worst_randsearch_best$objective,
max_depth = worst_randsearch_best$max_depth,
eta = worst_randsearch_best$eta,
subsample = worst_randsearch_best$subsample,
colsample_bytree = worst_randsearch_best$colsample_bytree,
min_child_weight = worst_randsearch_best$min_child_weight)
set.seed(99)
worst_xgbcv <- xgb.cv(params = worst_best_params,
data = worst_dtrain,
nrounds = 500,
nfold = 5,
prediction = TRUE,
print_every_n = 50,
early_stopping_rounds = 25,
eval_metric = "error",
verbose = 0
)
worst_xgbcv$best_iteration
set.seed(99)
worst_best_xgb <- xgb.train(params = worst_best_params,
data = worst_dtrain,
nrounds = worst_xgbcv$best_iteration,
eval_metric = "error"
)
xgb.save(worst_best_xgb, 'final_xgb_cancerworst')
## [1] TRUE
cancer_worst.pred <- predict(worst_best_xgb, worst_dtest)
cancer_worst.pred <- factor(ifelse(cancer_worst.pred> 0.5, 1, 0),
labels = c("B", "M"))
### variable importance plot
worst_impt_mtx <- xgb.importance(feature_names = colnames(test_worst_data), model = worst_best_xgb)
xgb.plot.importance(importance_matrix = worst_impt_mtx,
xlab = "Variable Importance")
### ROC curve for 5-fold CV random parameter search
worst_randsearch_roc <- roc(response = train_worst_label,
predictor = worst_tmp_mdl$pred,
print.auc = TRUE,
plot = TRUE)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
### ROC curve for 5-fold CV nround parameter search
worst_nround_roc <- roc(response = train_worst_label,
predictor = worst_xgbcv$pred,
print.auc = TRUE,
plot = TRUE)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Confusion Matrix and Statistics
##
## Reference
## Prediction B M
## B 89 0
## M 4 49
##
## Accuracy : 0.9718
## 95% CI : (0.9294, 0.9923)
## No Information Rate : 0.6549
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9389
##
## Mcnemar's Test P-Value : 0.1336
##
## Sensitivity : 1.0000
## Specificity : 0.9570
## Pos Pred Value : 0.9245
## Neg Pred Value : 1.0000
## Precision : 0.9245
## Recall : 1.0000
## F1 : 0.9608
## Prevalence : 0.3451
## Detection Rate : 0.3451
## Detection Prevalence : 0.3732
## Balanced Accuracy : 0.9785
##
## 'Positive' Class : M
##
##
## Call:
## glm(formula = diagnosis ~ texture_mean + concavity_mean + concave_points_mean +
## symmetry_mean + fractal_dimension_mean + perimeter_worst +
## smoothness_worst + concave_points_worst + symmetry_worst,
## family = "binomial", data = train_all)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0045 -0.0902 -0.0124 0.0045 4.0534
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -24.83421 7.50130 -3.311 0.000931 ***
## texture_mean 0.38508 0.08764 4.394 1.11e-05 ***
## concavity_mean 5.62729 14.58252 0.386 0.699576
## concave_points_mean 58.21992 36.38346 1.600 0.109560
## symmetry_mean 0.79034 20.70848 0.038 0.969556
## fractal_dimension_mean -215.73218 93.21534 -2.314 0.020649 *
## perimeter_worst 0.13176 0.03570 3.690 0.000224 ***
## smoothness_worst 56.25842 24.20205 2.325 0.020097 *
## concave_points_worst 4.89131 15.50494 0.315 0.752406
## symmetry_worst 14.72512 8.02238 1.836 0.066431 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 563.813 on 426 degrees of freedom
## Residual deviance: 75.096 on 417 degrees of freedom
## AIC: 95.096
##
## Number of Fisher Scoring iterations: 9
## fitting null model for pseudo-r2
## McFadden
## 0.8668077
## Confusion Matrix and Statistics
##
## Reference
## Prediction B M
## B 87 2
## M 9 44
##
## Accuracy : 0.9225
## 95% CI : (0.8656, 0.9607)
## No Information Rate : 0.6761
## P-Value [Acc > NIR] : 2.116e-12
##
## Kappa : 0.8299
##
## Mcnemar's Test P-Value : 0.07044
##
## Sensitivity : 0.9565
## Specificity : 0.9062
## Pos Pred Value : 0.8302
## Neg Pred Value : 0.9775
## Precision : 0.8302
## Recall : 0.9565
## F1 : 0.8889
## Prevalence : 0.3239
## Detection Rate : 0.3099
## Detection Prevalence : 0.3732
## Balanced Accuracy : 0.9314
##
## 'Positive' Class : M
##
##
## Call:
## glm(formula = diagnosis ~ texture_mean + area_mean + smoothness_mean +
## compactness_mean + concavity_mean + symmetry_mean, family = "binomial",
## data = train_mean)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1678 -0.1344 -0.0271 0.0051 3.3066
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -35.33487 5.47241 -6.457 1.07e-10 ***
## texture_mean 0.41225 0.07396 5.574 2.49e-08 ***
## area_mean 0.01736 0.00264 6.576 4.83e-11 ***
## smoothness_mean 122.34322 32.54862 3.759 0.000171 ***
## compactness_mean -22.80941 11.84340 -1.926 0.054115 .
## concavity_mean 27.63625 8.02356 3.444 0.000572 ***
## symmetry_mean 20.51731 12.36067 1.660 0.096937 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 563.81 on 426 degrees of freedom
## Residual deviance: 109.94 on 420 degrees of freedom
## AIC: 123.94
##
## Number of Fisher Scoring iterations: 8
## fitting null model for pseudo-r2
## McFadden
## 0.8050058
## Confusion Matrix and Statistics
##
## Reference
## Prediction B M
## B 89 0
## M 1 52
##
## Accuracy : 0.993
## 95% CI : (0.9614, 0.9998)
## No Information Rate : 0.6338
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9849
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 1.0000
## Specificity : 0.9889
## Pos Pred Value : 0.9811
## Neg Pred Value : 1.0000
## Precision : 0.9811
## Recall : 1.0000
## F1 : 0.9905
## Prevalence : 0.3662
## Detection Rate : 0.3662
## Detection Prevalence : 0.3732
## Balanced Accuracy : 0.9944
##
## 'Positive' Class : M
##
##
## Call:
## glm(formula = diagnosis ~ texture_worst + area_worst + smoothness_worst +
## concavity_worst + concave_points_worst + symmetry_worst +
## fractal_dimension_worst, family = "binomial", data = train_worst)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4564 -0.0837 -0.0116 0.0018 4.0632
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -26.942363 4.919300 -5.477 4.33e-08 ***
## texture_worst 0.267815 0.063886 4.192 2.76e-05 ***
## area_worst 0.012137 0.002427 5.001 5.71e-07 ***
## smoothness_worst 51.235210 23.666386 2.165 0.0304 *
## concavity_worst 4.444385 4.073509 1.091 0.2753
## concave_points_worst 29.575980 14.443473 2.048 0.0406 *
## symmetry_worst 10.518961 5.986965 1.757 0.0789 .
## fractal_dimension_worst -66.389888 33.621091 -1.975 0.0483 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 563.813 on 426 degrees of freedom
## Residual deviance: 72.707 on 419 degrees of freedom
## AIC: 88.707
##
## Number of Fisher Scoring iterations: 9
## fitting null model for pseudo-r2
## McFadden
## 0.8710438
## Confusion Matrix and Statistics
##
## Reference
## Prediction B M
## B 88 2
## M 1 51
##
## Accuracy : 0.9789
## 95% CI : (0.9395, 0.9956)
## No Information Rate : 0.6268
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9547
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9623
## Specificity : 0.9888
## Pos Pred Value : 0.9808
## Neg Pred Value : 0.9778
## Precision : 0.9808
## Recall : 0.9623
## F1 : 0.9714
## Prevalence : 0.3732
## Detection Rate : 0.3592
## Detection Prevalence : 0.3662
## Balanced Accuracy : 0.9755
##
## 'Positive' Class : M
##
## Confusion Matrix and Statistics
##
## Reference
## Prediction B M
## B 88 7
## M 1 46
##
## Accuracy : 0.9437
## 95% CI : (0.892, 0.9754)
## No Information Rate : 0.6268
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.8768
##
## Mcnemar's Test P-Value : 0.0771
##
## Sensitivity : 0.8679
## Specificity : 0.9888
## Pos Pred Value : 0.9787
## Neg Pred Value : 0.9263
## Precision : 0.9787
## Recall : 0.8679
## F1 : 0.9200
## Prevalence : 0.3732
## Detection Rate : 0.3239
## Detection Prevalence : 0.3310
## Balanced Accuracy : 0.9283
##
## 'Positive' Class : M
##
## Confusion Matrix and Statistics
##
## Reference
## Prediction B M
## B 89 2
## M 0 51
##
## Accuracy : 0.9859
## 95% CI : (0.95, 0.9983)
## No Information Rate : 0.6268
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9697
##
## Mcnemar's Test P-Value : 0.4795
##
## Sensitivity : 0.9623
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 0.9780
## Precision : 1.0000
## Recall : 0.9623
## F1 : 0.9808
## Prevalence : 0.3732
## Detection Rate : 0.3592
## Detection Prevalence : 0.3592
## Balanced Accuracy : 0.9811
##
## 'Positive' Class : M
##
## Confusion Matrix and Statistics
##
## Reference
## Prediction B M
## B 89 3
## M 0 50
##
## Accuracy : 0.9789
## 95% CI : (0.9395, 0.9956)
## No Information Rate : 0.6268
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9543
##
## Mcnemar's Test P-Value : 0.2482
##
## Sensitivity : 0.9434
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 0.9674
## Precision : 1.0000
## Recall : 0.9434
## F1 : 0.9709
## Prevalence : 0.3732
## Detection Rate : 0.3521
## Detection Prevalence : 0.3521
## Balanced Accuracy : 0.9717
##
## 'Positive' Class : M
##
## Confusion Matrix and Statistics
##
## Reference
## Prediction B M
## B 88 6
## M 1 47
##
## Accuracy : 0.9507
## 95% CI : (0.9011, 0.98)
## No Information Rate : 0.6268
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.8926
##
## Mcnemar's Test P-Value : 0.1306
##
## Sensitivity : 0.8868
## Specificity : 0.9888
## Pos Pred Value : 0.9792
## Neg Pred Value : 0.9362
## Precision : 0.9792
## Recall : 0.8868
## F1 : 0.9307
## Prevalence : 0.3732
## Detection Rate : 0.3310
## Detection Prevalence : 0.3380
## Balanced Accuracy : 0.9378
##
## 'Positive' Class : M
##
## Confusion Matrix and Statistics
##
## Reference
## Prediction B M
## B 89 2
## M 0 51
##
## Accuracy : 0.9859
## 95% CI : (0.95, 0.9983)
## No Information Rate : 0.6268
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9697
##
## Mcnemar's Test P-Value : 0.4795
##
## Sensitivity : 0.9623
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 0.9780
## Precision : 1.0000
## Recall : 0.9623
## F1 : 0.9808
## Prevalence : 0.3732
## Detection Rate : 0.3592
## Detection Prevalence : 0.3592
## Balanced Accuracy : 0.9811
##
## 'Positive' Class : M
##
Since the main focus of this project was to identify the model that best identifies malignant breast cancer tumors, accuracy scores and false negative rates were emphasized. A false negative is a misclassification in which the predicted value is negative even though the true value is positive. Although overall model accuracy is important for cancer diagnoses, incorrectly identifying a cancerous cell as benign can lead to more harm than incorrectly identifying a benign cell as cancerous.
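In terms of the confusion matrix, the false negative rate is the complement of sensitivity:

$$\text{FNR} = \frac{FN}{FN + TP} = 1 - \text{Sensitivity}$$

so a sensitivity of 1 corresponds to zero false negatives.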
For logistic regression, the model with the highest accuracy score (0.993) and lowest false negative rate (sensitivity = 1.000) was found when looking only at the worst cancer data. The confusion matrix showed zero false negatives and one false positive. Combined with an AUC of 0.991, the model performed very well at classification. The model summary showed that all covariates except worst cell concavity and worst cell symmetry were significant at an $\alpha = 0.05$ level. Odds ratios for all covariates in the model can be calculated by exponentiating the coefficient estimates; for example, the odds ratio for worst cell texture is 1.3071052. That is, holding all other covariates constant, each one-unit increase in worst cell texture (the standard deviation of grey-scale image values) corresponded to a 30.7% increase in the odds of classifying the sample as malignant. A McFadden R2 of 0.867 indicated a good overall model fit. The model diagnostics indicated some outlier issues that could have influenced model performance.
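A minimal sketch of this calculation, assuming the reduced worst-data model logistic_worst from above is still in memory:
#Odds ratios from the reduced worst-data logistic regression; exp() of the
#texture_worst coefficient reproduces the ~1.307 value quoted above.
exp(coef(logistic_worst))
exp(confint.default(logistic_worst)) #Wald 95% confidence intervals on the odds-ratio scale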
For random forests, the model with the highest accuracy score (0.986) was found when looking only at the worst cancer data. This model had two false negatives (sensitivity = 0.962) and zero false positives (specificity = 1). During hyperparameter tuning, it was found that including all ten covariates at each split in the decision trees led to the highest training accuracy. A variable importance plot indicated that the top three covariates influencing model performance were, in order, the number of worst concave points, the worst perimeter, and the worst cell texture.
For the XGBoost algorithm, the model with the highest accuracy score (0.986) and lowest false negative rate (sensitivity = 0.962) was found when looking only at the worst cancer data. This model had 2 false negatives and 0 false positives (specificity = 1.00). A variable importance plot showed that the top three covariates influencing the training model were the cell radii, the number of concave points on the cell perimeters, and the perimeters themselves. ROC curves from the 5-fold cross-validated hyperparameter searches had high AUC values (AUC = 0.981 and AUC = 0.984), indicating that the training model performed well at classification.
Overall, the model that best classified cancer cells as benign or malignant was a logistic regression using characteristics of the worst cell data. The XGBoost algorithm performed competitively, but had a lower sensitivity score than the logistic regression. The logistic regression results were also easier to interpret than those of XGBoost. An additional benefit of the logistic regression was that inferences on the covariates could be made alongside the overall accuracy of the model.
These models did not include a validation set during the splitting process. A validation set would provide another subset of data with which to evaluate the trained models before final testing. Although subset selection procedures were used, feature engineering was not; all models started with their full sets of covariates during model construction. In the future, better domain knowledge could be used to assess which features are most appropriate for the problem. Model diagnostics for the logistic regression models indicated an outlier problem: there were at least two influential data points that affected the assumptions of the linear models. Future work should determine whether these points can be removed and whether their removal changes the results of the logistic regression models. Although random forests and XGBoost should be more robust to outliers, all models should be run again on the cleaned dataset to test for improvements.
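A minimal sketch of the kind of three-way split future work could adopt (the proportions are illustrative, not part of the original analysis):
#Hypothetical 60/20/20 train/validation/test split of the full dataset
set.seed(99)
idx <- sample(c("train", "valid", "test"), size = nrow(cancer_all),
              replace = TRUE, prob = c(0.6, 0.2, 0.2))
train_set <- cancer_all[idx == "train", ]
valid_set <- cancer_all[idx == "valid", ] #used for model selection and tuning
test_set  <- cancer_all[idx == "test", ]  #held out for the final evaluation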
https://medium.com/@taniyaghosh29/machine-learning-algorithms-what-are-the-differences-9b71df4f248f
https://archive-beta.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+diagnostic
https://pages.cs.wisc.edu/~olvi/uwmp/cancer.html#diag
https://courses.lumenlearning.com/introstats1/chapter/introduction-to-logistic-regression/
https://www.section.io/engineering-education/introduction-to-random-forest-in-machine-learning/
https://xgboost.readthedocs.io/en/stable/tutorials/model.html