Cancer is the second leading cause of death globally and was responsible for an estimated 9.6 million deaths in 2018; worldwide, about 1 in 6 deaths are due to cancer. It is a leading cause of death in both developed and developing countries, with annual deaths projected to rise to 13.1 million by 2030. However, some forms of cancer, such as breast cancer, have a high chance of total remission if they are detected at an early stage and adequately treated.
Model comparison is a common exercise in machine learning, as multiple models are often fit to the same dataset to see how their performance differs. Generally, no single model dominates for a given type of data, and models also differ in interpretability and presentability. Some of the nuances of this topic are discussed here and here, by authors Taniya and Nischitha Sadananda respectively.
The data used for our project was sourced from the UC Irvine Machine Learning Repository and was created by Dr. Mangasarian, Dr. Wolberg, and Dr. Street. The dataset can be found here, and a description of their research can be found here.
The covariates included are:
The first step was to validate the dataset to ensure there were no missing values and to remove any redundant information. The standard error columns were removed from the analysis, since they were unlikely to provide any meaningful insight beyond what could be gleaned from the mean and worst-case columns.
library(tidyverse)
#install.packages("ggcorrplot")
library(ggcorrplot)
library(grid)
library(gridExtra)
#Loading dataset
breast_cancer <- readxl::read_xlsx('Breast Cancer data - CS 5610.xlsx')
#Removing id and standard error columns
b_cancer <- breast_cancer[, -1]
b_cancer <- b_cancer[, !grepl('_se', colnames(b_cancer))]
colnames(b_cancer)
## [1] "diagnosis" "radius_mean"
## [3] "texture_mean" "perimeter_mean"
## [5] "area_mean" "smoothness_mean"
## [7] "compactness_mean" "concavity_mean"
## [9] "concave points_mean" "symmetry_mean"
## [11] "fractal_dimension_mean" "radius_worst"
## [13] "texture_worst" "perimeter_worst"
## [15] "area_worst" "smoothness_worst"
## [17] "compactness_worst" "concavity_worst"
## [19] "concave points_worst" "symmetry_worst"
## [21] "fractal_dimension_worst"
colnames(b_cancer)[c(9, 19)] <- c("concave_points_mean", "concave_points_worst")
#converting diagnosis to factor
b_cancer$diagnosis <- b_cancer$diagnosis %>% as.factor()
dim(b_cancer)
## [1] 569 21
summary(b_cancer$diagnosis)
## B M
## 357 212
any(is.na(b_cancer)) #No missing data
## [1] FALSE
Next, the covariate matrix was checked for potential multicollinearity issues. There did appear to be several highly correlated covariates in the dataset that could impact model performance. Most of the highly correlated variables were related to radii, perimeters, and areas.
ggcorrplot(cor(b_cancer[-1]), type = 'lower', lab = TRUE) +
ggtitle("Correlation Plot of All Covariates") +
theme(plot.title = element_text(hjust = 0.5, size = 22))
knitr::include_graphics(paste0(getwd(),"/Cancer_data_plots/correlation_plot.png"))
It also appears that there is a variety of spread across the covariates, ranging from extremely right skewed to moderately skewed to approximately normally distributed. Additionally, not all of the covariates share similar scales. Covariates on dissimilar scales can sometimes impact model performance, and some models are more susceptible to this performance loss than others.
ggplot(gather(b_cancer[,-1]), aes(x = value, color = key, fill = key)) +
geom_histogram(bins = 32) +
ggtitle("Covariates Used for Breast Cancer Diagnosis") +
xlab("Value") + ylab("Count") +
theme(plot.title = element_text(hjust = 0.5, size = 22)) +
facet_wrap(~key, scales = 'free_x')
knitr::include_graphics(paste0(getwd(), "/Cancer_data_plots/histograms_all_covariates.png"))
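Should a scale-sensitive method be added later, the covariates could first be put on a common scale; a minimal sketch (not used in the models below):
#Minimal sketch (not part of this analysis): centering and standardizing the covariates
b_cancer_scaled <- b_cancer %>%
  mutate(across(-diagnosis, ~ as.numeric(scale(.x)))) #each covariate now has mean 0, sd 1
summary(b_cancer_scaled$area_mean) #area_mean is roughly centered at 0 after scaling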
Lastly, the dataset was split into mean and worst cases, and histograms were plotted by diagnosis (Benign or Malignant). This was done to identify any potential differences in frequencies by diagnosis.
#Plotting histograms of covariates grouped by diagnosis, for mean/worst
hist <- list()
for(i in names(b_cancer[,-1])){
hist[[i]] <- ggplot(data = b_cancer, aes_string(x = i,
fill = "diagnosis")) +
geom_histogram(position = 'identity', alpha = 0.8, bins = 32)
}
#Worst count covariates
grep('worst', names(b_cancer[,-1]))
grid.arrange(hist[[11]], hist[[12]], hist[[13]], hist[[14]], hist[[15]],
hist[[16]], hist[[17]], hist[[18]], hist[[19]], hist[[20]],
nrow = 4,
top = textGrob("Worst Cancer Data",
gp = gpar(fontsize = 22, font = 2)))
knitr::include_graphics(paste0(getwd(), "/Cancer_data_plots/worst_histogram_by_diag.png"))
#Mean count covariates
grep('mean', names(b_cancer[,-1]))
grid.arrange(hist[[1]], hist[[2]], hist[[3]], hist[[4]], hist[[5]],
hist[[6]], hist[[7]], hist[[8]], hist[[9]], hist[[10]],
nrow = 4,
top = textGrob("Mean Cancer Data",
gp = gpar(fontsize =22, font = 2)))
knitr::include_graphics(paste0(getwd(), "/Cancer_data_plots/mean_histogram_by diag.png"))
These plots clearly show differences in the values of most covariates by diagnosis. Separations like these should benefit classification model training, improving accuracy and reducing misclassification rates. The variables with clear separations are also likely to be highly predictive of cancer diagnosis.
Three different types of models were used for this analysis: logistic regression, random forests, and gradient boosting with XGBoost. Each was applied to the total dataset, to just the mean cancer data, and to just the worst cancer data. Each model generally followed these steps:
Logistic regression is a type of generalized linear model that works with binary response variables and can be easily interpreted. One benefit of logistic regression is that it provides log-odds and odds ratios for covariates of interest. More information about logistic regression can be found here.
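Concretely, with the diagnosis coded as 1 = malignant and 0 = benign, the model relates the log-odds of malignancy linearly to the covariates:

$$\log\!\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k, \qquad p = P(\text{diagnosis} = \text{M} \mid x_1, \dots, x_k)$$

so exponentiating a coefficient, $e^{\beta_j}$, gives the odds ratio associated with a one-unit increase in $x_j$.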
### Installing necessary packages
#install.packages("tidyverse")
#install.packages("caTools") # For Logistic regression
#install.packages("ROCR") # For ROC curve to evaluate model
#install.packages("pscl") # Model evaluation
### Loading package
library(plyr)
library(tidyverse)
library(caTools)
library(ROCR)
library(carData)
library(caret)
library(car)
library(pscl)
###load dataset
data_all <- readxl::read_xlsx("Breast Cancer data - CS 5610.xlsx")
#remove ids and standard errors, setting diagnosis to factor variable
# factor set 1 == "M", 0 == "B"
data_all <- data_all[,-1]
data_all <- data_all[, !grepl('_se', colnames(data_all))]
colnames(data_all)[c(9, 19)] <- c("concave_points_mean", "concave_points_worst")
data_all$diagnosis <- as.factor(data_all$diagnosis)
data_all$diagnosis <- as.integer(data_all$diagnosis)-1
### Correlation plot for whole dataset
findCorrelation(cor(data_all[-1]), cutoff = 0.75, names = TRUE)
## [1] "concave_points_worst" "concave_points_mean"
## [3] "concavity_mean" "compactness_mean"
## [5] "concavity_worst" "perimeter_worst"
## [7] "compactness_worst" "radius_worst"
## [9] "perimeter_mean" "area_worst"
## [11] "area_mean" "smoothness_worst"
## [13] "fractal_dimension_worst" "texture_mean"
# 14 variables that have at least one correlation above 0.75
### Splitting dataset dividing data 75/25 split
set.seed(99)
split <- sample.split(data_all$diagnosis, SplitRatio = 0.75)
train_all <- subset(data_all, split == "TRUE")
test_all <- subset(data_all, split == "FALSE")
### Training the full model and summary output
logistic_full <- glm(diagnosis ~ ., data=train_all, family="binomial")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
pR2(logistic_full)["McFadden"]
## fitting null model for pseudo-r2
## McFadden
## 0.9084415
vif(logistic_full)
## radius_mean texture_mean perimeter_mean
## 1814.660498 13.252368 1077.329499
## area_mean smoothness_mean compactness_mean
## 603.890780 17.151417 80.098313
## concavity_mean concave_points_mean symmetry_mean
## 41.137726 16.287288 5.390728
## fractal_dimension_mean radius_worst texture_worst
## 31.875074 391.940065 9.281670
## perimeter_worst area_worst smoothness_worst
## 74.294101 279.561432 5.543660
## compactness_worst concavity_worst concave_points_worst
## 58.542447 36.312709 8.271853
## symmetry_worst fractal_dimension_worst
## 4.359011 32.192332
The full model appeared to fit the data well with a high McFadden R2 = 0.9084415, but it had only one significant coefficient, and several coefficients had high variance inflation factors (VIFs).
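For reference, McFadden's pseudo-R2 compares the log-likelihood of the fitted model to that of an intercept-only null model:

$$R^2_{\text{McFadden}} = 1 - \frac{\log \hat{L}(M_{\text{fitted}})}{\log \hat{L}(M_{\text{null}})}$$

with values closer to 1 indicating a better fit.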
#Several VIF values above 5 (e.g., radius_worst and perimeter_worst) indicate
#severe multicollinearity, so multicollinearity is an issue in this model.
#Set a VIF threshold; all variables with a VIF above the threshold are
#sequentially dropped from the model.
threshold=4.99
### Sequentially drop the variable with the largest VIF until
# all variables have VIF less than threshold
logistic_all <- logistic_full
drop=TRUE
aftervif=data.frame()
while(drop==TRUE) {
vmodel=vif(logistic_all)
aftervif=rbind.fill(aftervif,as.data.frame(t(vmodel)))
if(max(vmodel)>threshold) {
logistic_all=update(logistic_all,as.formula(paste(".","~",".","-",names(which.max(vmodel))))) }
else { drop=FALSE }}
#Model after removing correlated Variables with their VIFs
print(as.data.frame(vmodel))
## vmodel
## texture_mean 1.554087
## concavity_mean 3.728442
## concave_points_mean 4.798353
## symmetry_mean 2.777392
## fractal_dimension_mean 3.269113
## perimeter_worst 1.545516
## smoothness_worst 2.385307
## concave_points_worst 3.296249
## symmetry_worst 2.866333
### Use the Model to Make Predictions on test data
# Predict test data, converting to 0 or 1 based on 0.5 cutoff value
predict_reg <- predict(logistic_all, test_all, type = "response")
predict_reg <- ifelse(predict_reg > 0.5, 1, 0)
predict_reg <- as.vector(predict_reg)
# Diagnostics plots
par(mfrow = c(2,2))
plot(logistic_all, which = 1:4, main = "All Cancer Data")
### ROC-AUC Curve
ROCPred <- prediction(predict_reg, test_all$diagnosis)
ROCPer <- performance(ROCPred, measure = "tpr",
x.measure = "fpr")
auc <- performance(ROCPred, measure = "auc")
auc <- auc@y.values[[1]]
auc
### Plotting curve
par(mfrow = c(1,1))
plot(ROCPer, main = "ROC Curve for All Cancer Data")
abline(a = 0, b = 1)
auc <- round(auc, 4)
legend(.6, .4, auc, title = "AUC", cex = 1)
### Evaluating model accuracy
predict_reg <- factor(ifelse(predict_reg > 0.5, 1, 0),
labels = c("B", "M"))
test_all$diagnosis <- factor(ifelse(test_all$diagnosis > 0.5, 1, 0),
labels = c("B", "M"))
all_confusion <- caret::confusionMatrix(test_all$diagnosis, predict_reg,
mode = 'everything',
positive = 'M')
all_r2 <- pR2(logistic_all)["McFadden"]
Here is the same setup, but for just the mean cancer data.
###load dataset
data_mean <- read.csv("breast_cancer_mean.csv")
# Setting diagnosis to factor variable: 1 == "M", 0 == "B"
data_mean$diagnosis <- as.factor(data_mean$diagnosis)
data_mean$diagnosis <- as.integer(data_mean$diagnosis)-1
### Summary of the dataset
summary(data_mean)
nrow(data_mean)
### Correlation plot for whole dataset
#pairs(data_mean[-1])
findCorrelation(cor(data_mean[-1]), cutoff = 0.7, names = TRUE)
### Splitting dataset dividing data 75/25 split
set.seed(99)
split <- sample.split(data_mean$diagnosis, SplitRatio = 0.75)
head(split)
train_mean <- subset(data_mean, split == "TRUE")
test_mean <- subset(data_mean, split == "FALSE")
### Training the full model and summary output
logistic_mean <- glm(diagnosis ~ ., data=train_mean, family="binomial")
logistic_mean
summary(logistic_mean)
#Assessing Model Fit
#We can compute McFadden's R2 for our model using the pR2 function from the pscl package.
pR2(logistic_mean)["McFadden"]
#A high McFadden's R2 indicates that the model fits the data well
#and has high predictive power.
#Variable Importance
varImp(logistic_mean, sort = TRUE)
#calculate VIF values for each predictor variable in our model
vif(logistic_mean)
#Several VIF values above 5 indicate severe multicollinearity,
#so multicollinearity is an issue in this model.
#Set a VIF threshold; all variables with a VIF above the threshold are
#sequentially dropped from the model.
threshold=4.99
### Sequentially drop the variable with the largest VIF until
# all variables have VIF less than threshold
drop=TRUE
aftervif=data.frame()
while(drop==TRUE) {
vmodel=vif(logistic_mean)
aftervif=rbind.fill(aftervif,as.data.frame(t(vmodel)))
if(max(vmodel)>threshold) {
logistic_mean=update(logistic_mean,as.formula(paste(".","~",".","-",names(which.max(vmodel))))) }
else { drop=FALSE }}
### How variables removed sequentially
t_aftervif= as.data.frame(t(aftervif))
# Final (uncorrelated) variables with their VIFs
print(as.data.frame(vmodel))
## vmodel
## texture_mean 1.778788
## area_mean 1.983117
## smoothness_mean 2.954579
## compactness_mean 3.592998
## concavity_mean 2.585441
## symmetry_mean 1.817450
### Use the Model to Make Predictions on test data
# Predict test data, converting to 0 or 1 based on 0.5 cutoff value
predict_reg <- predict(logistic_mean, test_mean, type = "response")
predict_reg <- ifelse(predict_reg > 0.5, 1, 0)
predict_reg <- as.vector(predict_reg)
### Model Diagnostics
# Diagnostic plots
par(mfrow = c(2,2))
plot(logistic_mean, which = 1:4, main = "Mean Cancer Data")
# ROC-AUC Curve
ROCPred <- prediction(predict_reg, test_mean$diagnosis)
ROCPer <- performance(ROCPred, measure = "tpr",
x.measure = "fpr")
auc <- performance(ROCPred, measure = "auc")
auc <- auc@y.values[[1]]
auc
### Plotting curve
par(mfrow = c(1,1))
plot(ROCPer, main = "ROC Curve for Mean Cancer Data")
abline(a = 0, b = 1)
auc <- round(auc, 4)
legend(.6, .4, auc, title = "AUC", cex = 1)
### Evaluating model accuracy
predict_reg <- factor(ifelse(predict_reg > 0.5, 1, 0),
labels = c("B", "M"))
test_mean$diagnosis <- factor(ifelse(test_mean$diagnosis > 0.5, 1, 0),
labels = c("B", "M"))
mean_confusion<- caret::confusionMatrix(test_mean$diagnosis, predict_reg,
mode = 'everything',
positive = 'M')
mean_r2 <- pR2(logistic_mean)["McFadden"]
## fitting null model for pseudo-r2
Here is the same setup, but for just the worst cancer data.
###load dataset
data_worst <- read.csv("breast_cancer_worst.csv")
# Setting diagnosis to factor variable: 1 == "M", 0 == "B"
data_worst$diagnosis <- as.factor(data_worst$diagnosis)
data_worst$diagnosis <- as.integer(data_worst$diagnosis)-1
### Summary of the dataset
summary(data_worst)
nrow(data_worst)
### Correlation plot for whole dataset
#pairs(data_worst[-1])
findCorrelation(cor(data_worst[-1]), cutoff = 0.7, names = TRUE)
### Splitting dataset dividing data 75/25 split
set.seed(99)
split <- sample.split(data_worst$diagnosis, SplitRatio = 0.75)
head(split)
train_worst <- subset(data_worst, split == "TRUE")
test_worst <- subset(data_worst, split == "FALSE")
### Training the full model and summary output
logistic_worst <- glm(diagnosis ~ ., data=train_worst, family="binomial")
logistic_worst
summary(logistic_worst)
#Assessing Model Fit
#We can compute McFadden's R2 for our model using the pR2 function from the pscl package.
pR2(logistic_worst)["McFadden"]
#A high McFadden's R2 indicates that the model fits the data well
#and has high predictive power.
#Variable Importance
varImp(logistic_worst, sort = TRUE)
#calculate VIF values for each predictor variable in our model
vif(logistic_worst)
#Several VIF values above 5 indicate severe multicollinearity,
#so multicollinearity is an issue in this model.
#Set a VIF threshold; all variables with a VIF above the threshold are
#sequentially dropped from the model.
threshold=4.99
### Sequentially drop the variable with the largest VIF until
# all variables have VIF less than threshold
drop=TRUE
aftervif=data.frame()
while(drop==TRUE) {
vmodel=vif(logistic_worst)
aftervif=rbind.fill(aftervif,as.data.frame(t(vmodel)))
if(max(vmodel)>threshold) {
logistic_worst=update(logistic_worst,as.formula(paste(".","~",".","-",names(which.max(vmodel))))) }
else { drop=FALSE }}
#Model after removing correlated Variables
summary(logistic_worst)
vif(logistic_worst)
### How variables removed sequentially
t_aftervif= as.data.frame(t(aftervif))
# Final (uncorrelated) variables with their VIFs
print(as.data.frame(vmodel))
## vmodel
## texture_worst 1.429116
## area_worst 1.496540
## smoothness_worst 2.555601
## concavity_worst 3.040412
## concave_points_worst 2.707778
## symmetry_worst 1.368137
## fractal_dimension_worst 3.729195
### Use the Model to Make Predictions on test data
# Predict test data, converting to 0 or 1 based on 0.5 cutoff value
predict_reg <- predict(logistic_worst, test_worst, type = "response")
predict_reg <- ifelse(predict_reg > 0.5, 1, 0)
predict_reg <- as.vector(predict_reg)
### Model Diagnostics
# Diagnostic plots
par(mfrow = c(2,2))
plot(logistic_worst, which = 1:4, main = "Worst Cancer Data")
# ROC-AUC Curve
ROCPred <- prediction(predict_reg, test_worst$diagnosis)
ROCPer <- performance(ROCPred, measure = "tpr",
x.measure = "fpr")
auc <- performance(ROCPred, measure = "auc")
auc <- auc@y.values[[1]]
auc
### Plotting curve
par(mfrow = c(1,1))
plot(ROCPer, main = "ROC Curve for Worst Cancer Data")
abline(a = 0, b = 1)
auc <- round(auc, 4)
legend(.6, .4, auc, title = "AUC", cex = 1)
### Evaluating model accuracy
predict_reg <- factor(ifelse(predict_reg > 0.5, 1, 0),
labels = c("B", "M"))
test_worst$diagnosis <- factor(ifelse(test_worst$diagnosis > 0.5, 1, 0),
labels = c("B", "M"))
worst_confusion<- caret::confusionMatrix(test_worst$diagnosis, predict_reg,
mode = 'everything',
positive = 'M')
worst_r2 <- pR2(logistic_worst)["McFadden"]
A random forest is an algorithm that uses ensemble learning for either regression or classification problems. A random forest makes predictions based on the aggregate outcomes of multiple decision trees. Decision trees are intuitive, as they resemble heuristics like flow charts. Random forests rely on a majority-voting system, in which the final prediction is the outcome that the majority of the decision trees agree on. That is, if a majority of the ensemble's decision trees predict "Yes", the final random forest prediction is also "Yes". More information about random forests can be found here.
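To make the majority vote concrete, the per-tree votes behind a single prediction can be inspected with the predict.all = TRUE argument of randomForest's predict(); a minimal sketch, assuming the rf_all model and test_all split created in the code below:
#Minimal sketch of majority voting (assumes rf_all and test_all from the code below)
votes <- predict(rf_all, newdata = test_all[1, ], predict.all = TRUE)
table(votes$individual) #how the 500 trees voted ("B" vs "M") for the first test case
votes$aggregate #the majority-vote class returned by predict()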
#Loading required libraries
library(tidyverse)
library(randomForest)
library(caTools)
library(caret)
#Loading datasets
breast_cancer <- readxl::read_xlsx('Breast Cancer data - CS 5610.xlsx')
cancer_mean <- read.csv('breast_cancer_mean.csv')
cancer_worst <- read.csv('breast_cancer_worst.csv')
#Removing SE columns and renaming two columns for cancer_all
cancer_all <- breast_cancer[, -1]
cancer_all <- cancer_all[, !grepl('_se', colnames(cancer_all))]
colnames(cancer_all)[c(9, 19)] <- c("concave_points_mean", "concave_points_worst")
#Setting 'diagnosis' to factor variable
cancer_all$diagnosis <- as.factor(cancer_all$diagnosis)
cancer_mean$diagnosis <- as.factor(cancer_mean$diagnosis)
cancer_worst$diagnosis <- as.factor(cancer_worst$diagnosis)
set.seed(99)
sampl_all <- sample.split(cancer_all$diagnosis, SplitRatio = 0.75)
train_all <- subset(cancer_all, sampl_all == TRUE)
test_all <- subset(cancer_all, sampl_all != TRUE)
control <- trainControl(method = 'cv', number = 5, search = 'grid')
tunegrid <- expand.grid(mtry = c(1:ncol(train_all)))
set.seed(99)
test_rf <- train(diagnosis ~., data = train_all, method = 'rf',
metric = 'Accuracy',
tuneGrid = tunegrid,
trControl = control)
test_rf$bestTune
## building training model
set.seed(99)
rf_all <- randomForest(diagnosis ~., data = train_all,
ntree = 500,
mtry = 20,
importance = TRUE)
cancer_all_rf.pred <- predict(rf_all, newdata = test_all)
confusionMatrix(cancer_all_rf.pred, test_all$diagnosis,
mode = 'everything',
positive = 'M')
plot(test_rf, main = "CV Accuracy per Number of Included Predictors",
sub = "All Cancer Data")
varImpPlot(rf_all, main = "Variable Importance: All Cancer Data")
set.seed(99)
sampl_mean <- sample.split(cancer_mean$diagnosis, SplitRatio = 0.75)
train_mean <- subset(cancer_mean, sampl_mean == TRUE)
test_mean <- subset(cancer_mean, sampl_mean != TRUE)
control <- trainControl(method = 'cv', number = 5, search = 'grid')
tunegrid <- expand.grid(mtry = c(1:ncol(train_mean)))
set.seed(99)
test_rf <- train(diagnosis ~., data = train_mean, method = 'rf',
metric = 'Accuracy',
tuneGrid = tunegrid,
trControl = control)
test_rf$bestTune
## building training model
set.seed(99)
rf_mean <- randomForest(diagnosis ~., data = train_mean,
ntree = 500,
mtry = 8,
importance = TRUE)
cancer_mean_rf.pred <- predict(rf_mean, newdata = test_mean)
confusionMatrix(cancer_mean_rf.pred, test_mean$diagnosis,
mode = 'everything',
positive = 'M')
plot(test_rf, main = "CV Accuracy per Number of Included Predictors",
sub = "Mean Cancer Data")
varImpPlot(rf_mean, main = "Variable Importance: Mean Cancer Data")
set.seed(99)
sampl_worst <- sample.split(cancer_worst$diagnosis, SplitRatio = 0.75)
train_worst <- subset(cancer_worst, sampl_worst == TRUE)
test_worst <- subset(cancer_worst, sampl_worst != TRUE)
control <- trainControl(method = 'cv', number = 5, search = 'grid')
tunegrid <- expand.grid(mtry = c(1:ncol(train_worst)))
set.seed(99)
test_rf <- train(diagnosis ~., data = train_worst, method = 'rf',
metric = 'Accuracy',
tuneGrid = tunegrid,
trControl = control)
test_rf$bestTune
## building training model
set.seed(99)
rf_worst <- randomForest(diagnosis ~., data = train_worst,
ntree = 500,
mtry = 10,
importance = TRUE)
cancer_worst_rf.pred <- predict(rf_worst, newdata = test_worst)
confusionMatrix(cancer_worst_rf.pred, test_worst$diagnosis,
mode = 'everything',
positive = 'M')
plot(test_rf, main = "CV Accuracy per Number of Included Predictors",
sub = "Worst Cancer Data")
varImpPlot(rf_worst, main = "Variable Importance: Worst Cancer Data")
XGBoost stands for “extreme gradient boosting”; it is an implementation of gradient boosted decision trees that is optimized for speed and performance. In general, gradient boosting refers to iteratively fitting each new model to the residuals of the loss function left by the previously fitted models. The XGBoost library provides gradient boosted decision trees (and, optionally, boosted linear models) for both classification and regression problems. More information about the XGBoost library can be found here.
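In symbols, each boosting round adds a new tree $h_m$ fit to the pseudo-residuals (the negative gradient of the loss) of the current ensemble, scaled by the learning rate $\eta$ (the eta parameter tuned below):

$$F_m(x) = F_{m-1}(x) + \eta\, h_m(x), \qquad h_m \approx -\,\frac{\partial L\big(y, F_{m-1}(x)\big)}{\partial F_{m-1}(x)}$$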
#Loading packages
#install.packages('xgboost')
library(xgboost)
library(caret)
library(caTools)
library(tidyverse)
library(pROC)
#Loading datasets
breast_cancer <- readxl::read_xlsx('Breast Cancer data - CS 5610.xlsx')
cancer_mean <- read.csv('breast_cancer_mean.csv')
cancer_worst <- read.csv('breast_cancer_worst.csv')
#Removing SE columns and renaming two columns for cancer_all
cancer_all <- breast_cancer[, -1]
cancer_all <- cancer_all[, !grepl('_se', colnames(cancer_all))]
colnames(cancer_all)[c(9, 19)] <- c("concave_points_mean", "concave_points_worst")
#Setting 'diagnosis' to factor variable
cancer_all$diagnosis <- as.factor(cancer_all$diagnosis)
cancer_mean$diagnosis <- as.factor(cancer_mean$diagnosis)
cancer_worst$diagnosis <- as.factor(cancer_worst$diagnosis)
set.seed(99)
sampl_all <- sample.split(cancer_all$diagnosis, SplitRatio = 0.75)
train_all <- subset(cancer_all, sampl_all == TRUE)
test_all <- subset(cancer_all, sampl_all != TRUE)
## Creating the independent variable and label matrices of train/test data
train_all_data <- as.matrix(train_all[-1])
train_all_label <- train_all$diagnosis
## Converting labels to 0,1 where "M" is coded as 1
train_all_label <- as.integer(train_all_label)-1
train_all$diagnosis[1:5]; train_all_label[1:5]
## [1] M M M M M
## Levels: B M
## [1] 1 1 1 1 1
## Repeat for test dataset
test_all_data <- as.matrix(test_all[-1])
test_all_label <- test_all$diagnosis
test_all_label <- as.integer(test_all_label)-1
test_all$diagnosis[1:5]; test_all_label[1:5]
## [1] M M B M M
## Levels: B M
## [1] 1 1 0 1 1
## Formatting data as XGBoost matrices
all_dtrain <- xgb.DMatrix(data = train_all_data, label = train_all_label)
all_dtest <- xgb.DMatrix(data = test_all_data, label = test_all_label)
### parameters: max_depth, eta, subsample, colsample_bytree, and min_child_weight
all_low_err_list <- list()
all_parameters_list <- list()
set.seed(99)
for(i in 1:3000){
params <- list(booster = "gbtree",
objective = "binary:logistic",
max_depth = sample(3:25, 1),
eta = runif(1, 0.01, 0.3),
subsample = runif(1, 0.5, 1),
colsample_bytree = runif(1, 0.5, 1),
min_child_weight = sample(0:10, 1)
)
parameters <- as.data.frame(params)
all_parameters_list[[i]] <- parameters
}
all_parameters_df <- do.call(rbind, all_parameters_list) #df containing random search params
### Fitting xgboost models based on search parameters
for (row in 1:nrow(all_parameters_df)){
set.seed(99)
all_tmp_mdl <- xgb.cv(data = all_dtrain,
booster = "gbtree",
objective = "binary:logistic",
nfold = 5,
prediction = TRUE,
max_depth = all_parameters_df$max_depth[row],
eta = all_parameters_df$eta[row],
subsample = all_parameters_df$subsample[row],
colsample_bytree = all_parameters_df$colsample_bytree[row],
min_child_weight = all_parameters_df$min_child_weight[row],
nrounds = 200,
eval_metric = "error",
early_stopping_rounds = 20,
print_every_n = 500,
verbose = 0
)
#store the highest CV accuracy (1 minus the lowest CV error) for this parameter set
all_low_err <- as.data.frame(1 - min(all_tmp_mdl$evaluation_log$test_error_mean))
all_low_err_list[[row]] <- all_low_err
}
all_low_err_df <- do.call(rbind, all_low_err_list) #accuracies
all_randsearch <- cbind(all_low_err_df, all_parameters_df) #data frame with everything
###Reformatting the dataframe
all_randsearch <- all_randsearch %>%
dplyr::rename(val_acc = '1 - min(all_tmp_mdl$evaluation_log$test_error_mean)') %>%
dplyr::arrange(-val_acc)
###Grabbing just the top model
all_randsearch_best <- all_randsearch[1,]
###Storing best parameters in list
all_best_params <- list(booster = all_randsearch_best$booster,
objective = all_randsearch_best$objective,
max_depth = all_randsearch_best$max_depth,
eta = all_randsearch_best$eta,
subsample = all_randsearch_best$subsample,
colsample_bytree = all_randsearch_best$colsample_bytree,
min_child_weight = all_randsearch_best$min_child_weight)
### Finding the best nround parameter for the model using 5-fold cross validation
set.seed(99)
all_xgbcv <- xgb.cv(params = all_best_params,
data = all_dtrain,
nrounds = 500,
nfold = 5,
prediction = TRUE,
print_every_n = 50,
early_stopping_rounds = 25,
eval_metric = "error",
verbose = 0
)
all_xgbcv$best_iteration
set.seed(99)
all_best_xgb <- xgb.train(params = all_best_params,
data = all_dtrain,
nrounds = all_xgbcv$best_iteration,
eval_metric = "error",
)
xgb.save(all_best_xgb, 'final_xgb_cancerall')
## [1] TRUE
cancer_all.pred <- predict(all_best_xgb, all_dtest)
cancer_all.pred <- factor(ifelse(cancer_all.pred > 0.5, 1, 0),
labels = c("B", "M"))
confusionMatrix(cancer_all.pred, test_all$diagnosis,
mode = 'everything',
positive = 'M')
## Visualizations
all_impt_mtx <- xgb.importance(feature_names = colnames(test_all_data), model = all_best_xgb)
xgb.plot.importance(importance_matrix = all_impt_mtx,
xlab = "Variable Importance")
### ROC curve for 5-fold CV random parameter search
all_randsearch_roc <- roc(response = train_all_label,
predictor = all_tmp_mdl$pred,
print.auc = TRUE,
plot = TRUE)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
### ROC curve for 5-fold CV nround parameter search
all_nround_roc <- roc(response = train_all_label,
predictor = all_xgbcv$pred,
print.auc = TRUE,
plot = TRUE)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
set.seed(99)
sampl_mean <- sample.split(cancer_mean$diagnosis, SplitRatio = 0.75)
train_mean <- subset(cancer_mean, sampl_mean == TRUE)
test_mean <- subset(cancer_mean, sampl_mean != TRUE)
## Creating the independent variable and label matrices of train/test data
train_mean_data <- as.matrix(train_mean[-1])
train_mean_label <- train_mean$diagnosis
## Converting labels to 0,1 where "M" is coded as 1
train_mean_label <- as.integer(train_mean_label)-1
train_mean$diagnosis[1:5]; train_mean_label[1:5]
## [1] M M M M M
## Levels: B M
## [1] 1 1 1 1 1
## Repeat for test dataset
test_mean_data <- as.matrix(test_mean[-1])
test_mean_label <- test_mean$diagnosis
test_mean_label <- as.integer(test_mean_label)-1
test_mean$diagnosis[1:5]; test_mean_label[1:5]
## [1] M M B M M
## Levels: B M
## [1] 1 1 0 1 1
## Formatting data as XGBoost matrices
mean_dtrain <- xgb.DMatrix(data = train_mean_data, label = train_mean_label)
mean_dtest <- xgb.DMatrix(data = test_mean_data, label = test_mean_label)
### parameters: max_depth, eta, subsample, colsample_bytree, and min_child_weight
mean_low_err_list <- list()
mean_parameters_list <- list()
set.seed(99)
for(i in 1:3000){
params <- list(booster = "gbtree",
objective = "binary:logistic",
max_depth = sample(3:25, 1),
eta = runif(1, 0.01, 0.3),
subsample = runif(1, 0.5, 1),
colsample_bytree = runif(1, 0.5, 1),
min_child_weight = sample(0:10, 1)
)
parameters <- as.data.frame(params)
mean_parameters_list[[i]] <- parameters
}
mean_parameters_df <- do.call(rbind, mean_parameters_list) #df containing random search params
### Fitting xgboost models based on search parameters
for (row in 1:nrow(mean_parameters_df)){
set.seed(99)
mean_tmp_mdl <- xgb.cv(data = mean_dtrain,
booster = "gbtree",
objective = "binary:logistic",
nfold = 5,
prediction = TRUE,
max_depth = mean_parameters_df$max_depth[row],
eta = mean_parameters_df$eta[row],
subsample = mean_parameters_df$subsample[row],
colsample_bytree = mean_parameters_df$colsample_bytree[row],
min_child_weight = mean_parameters_df$min_child_weight[row],
nrounds = 200,
eval_metric = "error",
early_stopping_rounds = 20,
print_every_n = 500,
verbose = 0
)
#store the highest CV accuracy (1 minus the lowest CV error) for this parameter set
mean_low_err <- as.data.frame(1 - min(mean_tmp_mdl$evaluation_log$test_error_mean))
mean_low_err_list[[row]] <- mean_low_err
}
mean_low_err_df <- do.call(rbind, mean_low_err_list) #accuracies
mean_randsearch <- cbind(mean_low_err_df, mean_parameters_df) #data frame with everything
###Reformatting the dataframe
mean_randsearch <- mean_randsearch %>%
dplyr::rename(val_acc = '1 - min(mean_tmp_mdl$evaluation_log$test_error_mean)') %>%
dplyr::arrange(-val_acc)
###Grabbing just the top model
mean_randsearch_best <- mean_randsearch[1,]
### Storing best parameters in list
mean_best_params <- list(booster = mean_randsearch_best$booster,
objective = mean_randsearch_best$objective,
max_depth = mean_randsearch_best$max_depth,
eta = mean_randsearch_best$eta,
subsample = mean_randsearch_best$subsample,
colsample_bytree = mean_randsearch_best$colsample_bytree,
min_child_weight = mean_randsearch_best$min_child_weight)
set.seed(99)
mean_xgbcv <- xgb.cv(params = mean_best_params,
data = mean_dtrain,
nrounds = 500,
nfold = 5,
prediction = TRUE,
print_every_n = 50,
early_stopping_rounds = 25,
eval_metric = "error",
verbose = 0
)
mean_xgbcv$best_iteration
set.seed(99)
mean_best_xgb <- xgb.train(params = mean_best_params,
data = mean_dtrain,
nrounds = mean_xgbcv$best_iteration,
eval_metric = "error",
)
xgb.save(mean_best_xgb, 'final_xgb_cancermean')
## [1] TRUE
cancer_mean.pred <- predict(mean_best_xgb, mean_dtest)
cancer_mean.pred <- factor(ifelse(cancer_mean.pred > 0.5, 1, 0),
labels = c("B", "M"))
## Visualizations
mean_impt_mtx <- xgb.importance(feature_names = colnames(test_mean_data), model = mean_best_xgb)
xgb.plot.importance(importance_matrix = mean_impt_mtx,
xlab = "Variable Importance")
### ROC curve for 5-fold CV random parameter search
mean_randsearch_roc <- roc(response = train_mean_label,
predictor = mean_tmp_mdl$pred,
print.auc = TRUE,
plot = TRUE)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
### ROC curve for 5-fold CV nround parameter search
mean_nround_roc <- roc(response = train_mean_label,
predictor = mean_xgbcv$pred,
print.auc = TRUE,
plot = TRUE)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
set.seed(99)
sampl_worst <- sample.split(cancer_worst$diagnosis, SplitRatio = 0.75)
train_worst <- subset(cancer_worst, sampl_worst == TRUE)
test_worst <- subset(cancer_worst, sampl_worst != TRUE)
## Creating the independent variable and label matrices of train/test data
train_worst_data <- as.matrix(train_worst[-1])
train_worst_label <- train_worst$diagnosis
## Converting labels to 0,1 where "M" is coded as 1
train_worst_label <- as.integer(train_worst_label)-1
train_worst$diagnosis[1:5]; train_worst_label[1:5]
## [1] M M M M M
## Levels: B M
## [1] 1 1 1 1 1
## Repeat for test dataset
test_worst_data <- as.matrix(test_worst[-1])
test_worst_label <- test_worst$diagnosis
test_worst_label <- as.integer(test_worst_label)-1
test_worst$diagnosis[1:5]; test_worst_label[1:5]
## [1] M M B M M
## Levels: B M
## [1] 1 1 0 1 1
## Formatting data as XGBoost matrices
worst_dtrain <- xgb.DMatrix(data = train_worst_data, label = train_worst_label)
worst_dtest <- xgb.DMatrix(data = test_worst_data, label = test_worst_label)
### parameters: max_depth, eta, subsample, colsample_bytree, and min_child_weight
worst_low_err_list <- list()
worst_parameters_list <- list()
set.seed(99)
for(i in 1:3000){
params <- list(booster = "gbtree",
objective = "binary:logistic",
max_depth = sample(3:25, 1),
eta = runif(1, 0.01, 0.3),
subsample = runif(1, 0.5, 1),
colsample_bytree = runif(1, 0.5, 1),
min_child_weight = sample(0:10, 1)
)
parameters <- as.data.frame(params)
worst_parameters_list[[i]] <- parameters
}
worst_parameters_df <- do.call(rbind, worst_parameters_list) #df containing random search params
### Fitting 5-fold CV xgboost models based on search parameters
for (row in 1:nrow(worst_parameters_df)){
set.seed(99)
worst_tmp_mdl <- xgb.cv(data = worst_dtrain,
booster = "gbtree",
objective = "binary:logistic",
nfold = 5,
prediction = TRUE,
max_depth = worst_parameters_df$max_depth[row],
eta = worst_parameters_df$eta[row],
subsample = worst_parameters_df$subsample[row],
colsample_bytree = worst_parameters_df$colsample_bytree[row],
min_child_weight = worst_parameters_df$min_child_weight[row],
nrounds = 200,
eval_metric = "error",
early_stopping_rounds = 20,
print_every_n = 500,
verbose = 0
)
#store the highest CV accuracy (1 minus the lowest CV error) for this parameter set
worst_low_err <- as.data.frame(1 - min(worst_tmp_mdl$evaluation_log$test_error_mean))
worst_low_err_list[[row]] <- worst_low_err
}
worst_low_err_df <- do.call(rbind, worst_low_err_list) #accuracies
worst_randsearch <- cbind(worst_low_err_df, worst_parameters_df) #data frame with everything
###Reformatting the dataframe
worst_randsearch <- worst_randsearch %>%
dplyr::rename(val_acc = '1 - min(worst_tmp_mdl$evaluation_log$test_error_mean)') %>%
dplyr::arrange(-val_acc)
###Grabbing just the top model
worst_randsearch_best <- worst_randsearch[1,]
### Storing best parameters in list
worst_best_params <- list(booster = worst_randsearch_best$booster,
objective = worst_randsearch_best$objective,
max_depth = worst_randsearch_best$max_depth,
eta = worst_randsearch_best$eta,
subsample = worst_randsearch_best$subsample,
colsample_bytree = worst_randsearch_best$colsample_bytree,
min_child_weight = worst_randsearch_best$min_child_weight)
set.seed(99)
worst_xgbcv <- xgb.cv(params = worst_best_params,
data = worst_dtrain,
nrounds = 500,
nfold = 5,
prediction = TRUE,
print_every_n = 50,
early_stopping_rounds = 25,
eval_metric = "error",
verbose = 0
)
worst_xgbcv$best_iteration
set.seed(99)
worst_best_xgb <- xgb.train(params = worst_best_params,
data = worst_dtrain,
nrounds = worst_xgbcv$best_iteration,
eval_metric = "error"
)
xgb.save(worst_best_xgb, 'final_xgb_cancerworst')
## [1] TRUE
cancer_worst.pred <- predict(worst_best_xgb, worst_dtest)
cancer_worst.pred <- factor(ifelse(cancer_worst.pred> 0.5, 1, 0),
labels = c("B", "M"))
### variable importance plot
worst_impt_mtx <- xgb.importance(feature_names = colnames(test_worst_data), model = worst_best_xgb)
xgb.plot.importance(importance_matrix = worst_impt_mtx,
xlab = "Variable Importance")
### ROC curve for 5-fold CV random parameter search
worst_randsearch_roc <- roc(response = train_worst_label,
predictor = worst_tmp_mdl$pred,
print.auc = TRUE,
plot = TRUE)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
### ROC curve for 5-fold CV nround parameter search
worst_nround_roc <- roc(response = train_worst_label,
predictor = worst_xgbcv$pred,
print.auc = TRUE,
plot = TRUE)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Confusion Matrix and Statistics
##
## Reference
## Prediction B M
## B 89 0
## M 4 49
##
## Accuracy : 0.9718
## 95% CI : (0.9294, 0.9923)
## No Information Rate : 0.6549
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9389
##
## Mcnemar's Test P-Value : 0.1336
##
## Sensitivity : 1.0000
## Specificity : 0.9570
## Pos Pred Value : 0.9245
## Neg Pred Value : 1.0000
## Precision : 0.9245
## Recall : 1.0000
## F1 : 0.9608
## Prevalence : 0.3451
## Detection Rate : 0.3451
## Detection Prevalence : 0.3732
## Balanced Accuracy : 0.9785
##
## 'Positive' Class : M
##
##
## Call:
## glm(formula = diagnosis ~ texture_mean + concavity_mean + concave_points_mean +
## symmetry_mean + fractal_dimension_mean + perimeter_worst +
## smoothness_worst + concave_points_worst + symmetry_worst,
## family = "binomial", data = train_all)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0045 -0.0902 -0.0124 0.0045 4.0534
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -24.83421 7.50130 -3.311 0.000931 ***
## texture_mean 0.38508 0.08764 4.394 1.11e-05 ***
## concavity_mean 5.62729 14.58252 0.386 0.699576
## concave_points_mean 58.21992 36.38346 1.600 0.109560
## symmetry_mean 0.79034 20.70848 0.038 0.969556
## fractal_dimension_mean -215.73218 93.21534 -2.314 0.020649 *
## perimeter_worst 0.13176 0.03570 3.690 0.000224 ***
## smoothness_worst 56.25842 24.20205 2.325 0.020097 *
## concave_points_worst 4.89131 15.50494 0.315 0.752406
## symmetry_worst 14.72512 8.02238 1.836 0.066431 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 563.813 on 426 degrees of freedom
## Residual deviance: 75.096 on 417 degrees of freedom
## AIC: 95.096
##
## Number of Fisher Scoring iterations: 9
## fitting null model for pseudo-r2
## McFadden
## 0.8668077
## Confusion Matrix and Statistics
##
## Reference
## Prediction B M
## B 87 2
## M 9 44
##
## Accuracy : 0.9225
## 95% CI : (0.8656, 0.9607)
## No Information Rate : 0.6761
## P-Value [Acc > NIR] : 2.116e-12
##
## Kappa : 0.8299
##
## Mcnemar's Test P-Value : 0.07044
##
## Sensitivity : 0.9565
## Specificity : 0.9062
## Pos Pred Value : 0.8302
## Neg Pred Value : 0.9775
## Precision : 0.8302
## Recall : 0.9565
## F1 : 0.8889
## Prevalence : 0.3239
## Detection Rate : 0.3099
## Detection Prevalence : 0.3732
## Balanced Accuracy : 0.9314
##
## 'Positive' Class : M
##
##
## Call:
## glm(formula = diagnosis ~ texture_mean + area_mean + smoothness_mean +
## compactness_mean + concavity_mean + symmetry_mean, family = "binomial",
## data = train_mean)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1678 -0.1344 -0.0271 0.0051 3.3066
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -35.33487 5.47241 -6.457 1.07e-10 ***
## texture_mean 0.41225 0.07396 5.574 2.49e-08 ***
## area_mean 0.01736 0.00264 6.576 4.83e-11 ***
## smoothness_mean 122.34322 32.54862 3.759 0.000171 ***
## compactness_mean -22.80941 11.84340 -1.926 0.054115 .
## concavity_mean 27.63625 8.02356 3.444 0.000572 ***
## symmetry_mean 20.51731 12.36067 1.660 0.096937 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 563.81 on 426 degrees of freedom
## Residual deviance: 109.94 on 420 degrees of freedom
## AIC: 123.94
##
## Number of Fisher Scoring iterations: 8
## fitting null model for pseudo-r2
## McFadden
## 0.8050058
## Confusion Matrix and Statistics
##
## Reference
## Prediction B M
## B 89 0
## M 1 52
##
## Accuracy : 0.993
## 95% CI : (0.9614, 0.9998)
## No Information Rate : 0.6338
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9849
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 1.0000
## Specificity : 0.9889
## Pos Pred Value : 0.9811
## Neg Pred Value : 1.0000
## Precision : 0.9811
## Recall : 1.0000
## F1 : 0.9905
## Prevalence : 0.3662
## Detection Rate : 0.3662
## Detection Prevalence : 0.3732
## Balanced Accuracy : 0.9944
##
## 'Positive' Class : M
##
##
## Call:
## glm(formula = diagnosis ~ texture_worst + area_worst + smoothness_worst +
## concavity_worst + concave_points_worst + symmetry_worst +
## fractal_dimension_worst, family = "binomial", data = train_worst)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4564 -0.0837 -0.0116 0.0018 4.0632
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -26.942363 4.919300 -5.477 4.33e-08 ***
## texture_worst 0.267815 0.063886 4.192 2.76e-05 ***
## area_worst 0.012137 0.002427 5.001 5.71e-07 ***
## smoothness_worst 51.235210 23.666386 2.165 0.0304 *
## concavity_worst 4.444385 4.073509 1.091 0.2753
## concave_points_worst 29.575980 14.443473 2.048 0.0406 *
## symmetry_worst 10.518961 5.986965 1.757 0.0789 .
## fractal_dimension_worst -66.389888 33.621091 -1.975 0.0483 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 563.813 on 426 degrees of freedom
## Residual deviance: 72.707 on 419 degrees of freedom
## AIC: 88.707
##
## Number of Fisher Scoring iterations: 9
## fitting null model for pseudo-r2
## McFadden
## 0.8710438
## Confusion Matrix and Statistics
##
## Reference
## Prediction B M
## B 88 2
## M 1 51
##
## Accuracy : 0.9789
## 95% CI : (0.9395, 0.9956)
## No Information Rate : 0.6268
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9547
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9623
## Specificity : 0.9888
## Pos Pred Value : 0.9808
## Neg Pred Value : 0.9778
## Precision : 0.9808
## Recall : 0.9623
## F1 : 0.9714
## Prevalence : 0.3732
## Detection Rate : 0.3592
## Detection Prevalence : 0.3662
## Balanced Accuracy : 0.9755
##
## 'Positive' Class : M
##
## Confusion Matrix and Statistics
##
## Reference
## Prediction B M
## B 88 7
## M 1 46
##
## Accuracy : 0.9437
## 95% CI : (0.892, 0.9754)
## No Information Rate : 0.6268
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.8768
##
## Mcnemar's Test P-Value : 0.0771
##
## Sensitivity : 0.8679
## Specificity : 0.9888
## Pos Pred Value : 0.9787
## Neg Pred Value : 0.9263
## Precision : 0.9787
## Recall : 0.8679
## F1 : 0.9200
## Prevalence : 0.3732
## Detection Rate : 0.3239
## Detection Prevalence : 0.3310
## Balanced Accuracy : 0.9283
##
## 'Positive' Class : M
##
## Confusion Matrix and Statistics
##
## Reference
## Prediction B M
## B 89 2
## M 0 51
##
## Accuracy : 0.9859
## 95% CI : (0.95, 0.9983)
## No Information Rate : 0.6268
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9697
##
## Mcnemar's Test P-Value : 0.4795
##
## Sensitivity : 0.9623
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 0.9780
## Precision : 1.0000
## Recall : 0.9623
## F1 : 0.9808
## Prevalence : 0.3732
## Detection Rate : 0.3592
## Detection Prevalence : 0.3592
## Balanced Accuracy : 0.9811
##
## 'Positive' Class : M
##
## Confusion Matrix and Statistics
##
## Reference
## Prediction B M
## B 89 3
## M 0 50
##
## Accuracy : 0.9789
## 95% CI : (0.9395, 0.9956)
## No Information Rate : 0.6268
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9543
##
## Mcnemar's Test P-Value : 0.2482
##
## Sensitivity : 0.9434
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 0.9674
## Precision : 1.0000
## Recall : 0.9434
## F1 : 0.9709
## Prevalence : 0.3732
## Detection Rate : 0.3521
## Detection Prevalence : 0.3521
## Balanced Accuracy : 0.9717
##
## 'Positive' Class : M
##
## Confusion Matrix and Statistics
##
## Reference
## Prediction B M
## B 88 6
## M 1 47
##
## Accuracy : 0.9507
## 95% CI : (0.9011, 0.98)
## No Information Rate : 0.6268
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.8926
##
## Mcnemar's Test P-Value : 0.1306
##
## Sensitivity : 0.8868
## Specificity : 0.9888
## Pos Pred Value : 0.9792
## Neg Pred Value : 0.9362
## Precision : 0.9792
## Recall : 0.8868
## F1 : 0.9307
## Prevalence : 0.3732
## Detection Rate : 0.3310
## Detection Prevalence : 0.3380
## Balanced Accuracy : 0.9378
##
## 'Positive' Class : M
##
## Confusion Matrix and Statistics
##
## Reference
## Prediction B M
## B 89 2
## M 0 51
##
## Accuracy : 0.9859
## 95% CI : (0.95, 0.9983)
## No Information Rate : 0.6268
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9697
##
## Mcnemar's Test P-Value : 0.4795
##
## Sensitivity : 0.9623
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 0.9780
## Precision : 1.0000
## Recall : 0.9623
## F1 : 0.9808
## Prevalence : 0.3732
## Detection Rate : 0.3592
## Detection Prevalence : 0.3592
## Balanced Accuracy : 0.9811
##
## 'Positive' Class : M
##
Since the main focus of this project was to identify the model that best identifies malignant breast cancer tumors, accuracy scores and false negative rates were emphasized. A false negative is a misclassification in which the predicted value is negative even though the true value is positive. Although overall model accuracy is important for cancer diagnoses, incorrectly identifying a cancerous cell as benign can lead to more harm than incorrectly identifying a benign cell as cancerous.
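In terms of the confusion matrix, the false negative rate is the complement of sensitivity:

$$\text{FNR} = \frac{FN}{FN + TP} = 1 - \text{Sensitivity}$$

so a sensitivity of 1 corresponds to zero false negatives.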
For logistic regression, the model with the highest accuracy score (0.993) and lowest false negative rate (sensitivity = 1.000) was found when looking only at the worst cancer data. The confusion matrix showed zero false negatives and one false positive. Combined with an AUC of 0.991, the model performed very well at classification. The model summary showed that all covariates except worst cell concavity and worst cell symmetry were significant at an $\alpha = 0.05$ level. Odds ratios for all covariates in the model can be calculated by exponentiating the coefficient estimates; for example, the odds ratio for worst cell texture is 1.3071052. That is, holding all other covariates constant, each one-unit increase in worst cell texture (the standard deviation of grey-scale image values) corresponded to a 30.7% increase in the odds of classifying the sample as malignant. A McFadden R2 of 0.867 indicated a good overall model fit. The model diagnostics indicated some outlier issues that could have influenced model performance.
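A minimal sketch of this calculation, assuming the reduced worst-data model logistic_worst from above is still in memory:
#Odds ratios from the reduced worst-data logistic regression; exp() of the
#texture_worst coefficient reproduces the ~1.307 value quoted above.
exp(coef(logistic_worst))
exp(confint.default(logistic_worst)) #Wald 95% confidence intervals on the odds-ratio scale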
For random forests, the model with the highest accuracy score (0.986) was found when looking only at the worst cancer data. This model had two false negatives (sensitivity = 0.962) and zero false positives (specificity = 1). During hyperparameter tuning, it was found that including all ten covariates at each split in the decision trees led to the highest training accuracy. A variable importance plot indicated that the top three covariates influencing model performance were, in order, the number of worst concave points, the worst perimeter, and the worst cell texture.
For the XGBoost algorithm, the model with the highest accuracy score (0.986) and lowest false negative rate (sensitivity = 0.962) was found when looking only at the worst cancer data. This model had 2 false negatives and 0 false positives (specificity = 1.00). A variable importance plot showed that the top three covariates influencing the training model were the cell radii, the number of concave points on the cell perimeters, and the perimeters themselves. ROC curves from the 5-fold cross-validated hyperparameter searches had high AUC values (AUC = 0.981 and AUC = 0.984), indicating that the training model performed well at classification.
Overall, the model that best classified cancer cells as benign or malignant was a logistic regression using characteristics of the worst cell data. The XGBoost algorithm performed competitively, but had a lower sensitivity score than the logistic regression. The logistic regression results were also easier to interpret than those of XGBoost. An additional benefit of the logistic regression was that inferences on the covariates could be made alongside the overall accuracy of the model.
These models did not include a validation set during the splitting process. A validation set would provide another subset of data with which to evaluate the trained models before final testing. Although subset selection procedures were used, feature engineering was not; all models started with their full sets of covariates during model construction. In the future, better domain knowledge could be used to assess which features are most appropriate for the problem. Model diagnostics for the logistic regression models indicated an outlier problem: there were at least two influential data points that affected the assumptions of the linear models. Future work should determine whether these points can be removed and whether their removal changes the results of the logistic regression models. Although random forests and XGBoost should be more robust to outliers, all models should be run again on the cleaned dataset to test for improvements.
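A minimal sketch of the kind of three-way split future work could adopt (the proportions are illustrative, not part of the original analysis):
#Hypothetical 60/20/20 train/validation/test split of the full dataset
set.seed(99)
idx <- sample(c("train", "valid", "test"), size = nrow(cancer_all),
              replace = TRUE, prob = c(0.6, 0.2, 0.2))
train_set <- cancer_all[idx == "train", ]
valid_set <- cancer_all[idx == "valid", ] #used for model selection and tuning
test_set  <- cancer_all[idx == "test", ]  #held out for the final evaluation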
https://medium.com/@taniyaghosh29/machine-learning-algorithms-what-are-the-differences-9b71df4f248f
https://archive-beta.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+diagnostic
https://pages.cs.wisc.edu/~olvi/uwmp/cancer.html#diag
https://courses.lumenlearning.com/introstats1/chapter/introduction-to-logistic-regression/
https://www.section.io/engineering-education/introduction-to-random-forest-in-machine-learning/
https://xgboost.readthedocs.io/en/stable/tutorials/model.html