knitr::opts_chunk$set(echo = T,warning=F,message=F)
This data set pertains to breast cancer and was obtained from the University Medical Centre, Institute of Oncology, Ljubjana, Yogoslvia. M. Zwitter and M. Soklic were able to provide the data. This data set is one of three domains provided by Oncology Institute that has been used repeatedly in machine learning literature.
This data set has 286 observations, 10 variables, with one being the target variable and nine covariates. The target variable, referred to as class, has two responses: no-recurrence-events and recurrence-events. The no-recurrence-events has 201 observations and recurrence-events has 85 observations.
The 9 attributes are categorical with some only containing two reponses while others contain up to 13 different responses. Below is the entire attribute information:
Attribute Information: 1. Class: no-recurrence-events, recurrence-events
age: 10-19, 20-29, 30-39, 40-49, 50-59, 60-69, 70-79, 80-89, 90-99.
menopause: lt40, ge40, premeno.
tumor-size: 0-4, 5-9, 10-14, 15-19, 20-24, 25-29, 30-34, 35-39, 40-44, 45-49, 50-54, 55-59.
inv-nodes: 0-2, 3-5, 6-8, 9-11, 12-14, 15-17, 18-20, 21-23, 24-26, 27-29, 30-32, 33-35, 36-39.
node-caps: yes, no.
deg-malig: 1, 2, 3.
breast: left, right.
breast-quad: left-up, left-low, right-up, right-low, central.
irradiat: yes, no.
node-caps is missing 8 values and breast-quad is missing 1 value. They are denoted as “?” in the data set.
The data set can be found here: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer
What we are interested in in this dataset is can we predict the class variable: no-recurrence-events, recurrence-events. For a good majority of the models, coding variables as to be used for all the different variables in this data set since they are categorical. Before we get to that lets do some data exploration. Inititaly, I produced bar graphs in ggplot looking at each attribute with the class variable highlighted in color to see if there are any interesting interactions between the covariates and the class variable. A graph was produced for each of the 9 attributes. Age group and breast quandrant and roughly normal distributions. Below are the two most interesting attributes on class which are also used the model selection process that contained significant p-value.
#install.packags(data.table)
library(data.table)
library(tidyverse)
bc.dat <- fread('https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer/breast-cancer.data', col.names=c('class','age','menopause','tumorsize','invnodes','nodecaps','degmalig','breast','breastquad','irradiant'))
ggplot(bc.dat, aes(x=invnodes, fill=class)) + geom_bar(position='dodge') + labs(title='Histogram of Inv Nodes Grouped by Class',x='Inv Nodes',y='Count')
ggplot(bc.dat, aes(x=degmalig, fill=class)) + geom_bar(position='dodge') + labs(title='Histogram of Degree of Malignancy Grouped by Class', x='Degree Of Malignancy',y='Count')
With the histograms Inv Nodes the data is occurs the most with the fewer numbers for both class variables. Looking at Degree of Malignancy it appears recurrence-events goes up as degree of malignancy increases. However, no-reccurence-events bounces around, it is most frequent at 2, then 1, and then 3.
Initially, a full generalized linear model was fit to determine the p-values and significance of each attribute variable on the class variable. It turns out that the variables that are statistically significant on the class variable are inv nodes and degree of malignancy. The variables needed to be assigned dummy/coding variables to fit different models.
class variable will be 0 for no-recurrence-events and 1 otherwise.
invnodes will be: 0-2 = 0
3-5 = 1
6-8 = 2
9-11 = 3
12-14 = 4
15-17 = 5
24-26 = 6
The data was then split between training and testing. 75% of the data was used for training and the remaining 25% was used for testing. Cross validation was also used on GLM for LOOCV and K-folds = 5. Below is a table of the error rates for the models to test the data against.
#Splitting the data and GLM Method
bc.dat$class <- ifelse(bc.dat$class=="no-recurrence-events",0,1)
bc.dat$dummy <- ifelse(bc.dat$invnodes=='0-2',0,(ifelse(bc.dat$invnodes=='3-5',1,
(ifelse(bc.dat$invnodes=='6-8',2,(ifelse(bc.dat$invnodes=='9-11',3,(ifelse(bc.dat$invnodes=='12-14',4,(ifelse(bc.dat$invnodes=='15-17',5,(ifelse(bc.dat$invnodes=='24-26',6,99)))))))))))))
bc.dat <- subset(bc.dat, select=c(class, degmalig, dummy))
set.seed(123)
index <- sample(x=nrow(bc.dat), size=0.75*nrow(bc.dat))
training <- bc.dat[index,]
testing <- bc.dat[-index,]
glm.train <- glm(class ~ degmalig + dummy, data=training, family='binomial')
#putting together the confusion matrix
glm.probs <- predict(glm.train, testing, type='response', na.action=na.pass)
glm.preds <- ifelse(glm.probs > 0.50, '1','0')
cm.glm <- table(testing$class, glm.preds)
print(cm.glm)
## glm.preds
## 0 1
## 0 46 2
## 1 15 9
glm.error.rate <- 1-((cm.glm[1] + cm.glm[2,2]) / sum(cm.glm))
cat("the glm error rate is: ",glm.error.rate)
## the glm error rate is: 0.2361111
#KNN Method
library(class)
library(MASS)
library(tidyverse)
set.seed(123)
end <- data.frame(k=1:100, correct=NA)
for(i in 1:100){
knn.pred = knn(train=data.frame(training$degmalig, training$dummy),
test=data.frame(testing$degmalig, testing$dummy), cl=training$class, k=i, prob=TRUE)
cm_knn <-table(testing$class, knn.pred)
correct <- (cm_knn['0','0'] + cm_knn['1','1'])/sum(cm_knn)
end$correct[i] <- correct
}
library(tidyverse)
ggplot(data=end, aes(k,correct)) + geom_line() +labs(title='Plot of K for KNN classifiers vs Accuracy of Model', x='No. K', y='Accuracy')
max_accuracy <- end[order(-end$correct),]
head(max_accuracy,10)
#LDA Method and error rate
lda.bc <- lda(class ~ degmalig + dummy, data=training)
lda.bc.preds <- predict(lda.bc, newdata=testing, type='response')
cm.lda <- table(lda.bc.preds$class, testing$class)
print(cm.lda)
##
## 0 1
## 0 46 15
## 1 2 9
lda.error.rate <- 1-((cm.lda[1] + cm.lda[2,2]) / sum(cm.lda))
cat("The LDA error rate is: ", lda.error.rate)
## The LDA error rate is: 0.2361111
#QDA Method
qda.bc <- qda(class ~ degmalig + dummy, data=training)
qda.bc.preds <- predict(qda.bc, newdata=testing, type='response')
cm.qda <- table(qda.bc.preds$class, testing$class)
print(cm.qda)
##
## 0 1
## 0 46 18
## 1 2 6
qda.error.rate <- 1-((cm.qda[1] + cm.qda[2,2]) /sum(cm.qda))
cat("The QDA error rate is: ", qda.error.rate)
## The QDA error rate is: 0.2777778
library('mclust')
head(training)
mclust.mod <- MclustDA(training[,2:3], training$class, G=9)
summary(mclust.mod, newdata=testing[,2:3], newclass=testing$class)
## ------------------------------------------------
## Gaussian finite mixture model for classification
## ------------------------------------------------
##
## MclustDA model summary:
##
## log-likelihood n df BIC
## 76.63528 214 57 -152.5901
##
## Classes n % Model G
## 0 153 71.5 EEE 9
## 1 61 28.5 EEI 9
##
## Training confusion matrix:
## Predicted
## Class 0 1
## 0 144 9
## 1 41 20
## Classification error = 0.2336
## Brier score = 0.1753
##
## Test confusion matrix:
## Predicted
## Class 0 1
## 0 45 3
## 1 14 10
## Classification error = 0.2361
## Brier score = 0.2111
#install.packages('e1071')
library(e1071)
svm.train <- svm(class ~ degmalig + dummy, data=training, family='binomial')
#putting together the confusion matrix
svm.probs <- predict(svm.train, testing, type='response', na.action=na.pass)
svm.preds <- ifelse(svm.probs > 0.50, '1','0')
cm.svm <- table(testing$class, svm.preds)
print(cm.svm)
## svm.preds
## 0 1
## 0 46 2
## 1 15 9
svm.error.rate <- 1-((cm.svm[1] + cm.svm[2,2]) / sum(cm.svm))
cat("the svm error rate is: ",svm.error.rate)
## the svm error rate is: 0.2361111
mclust.mod2.edda <- MclustDA(training[,2:3], training$class, modelType="EDDA")
summary(mclust.mod2.edda, newdata=testing[,2:3], newclass = testing$class)
## ------------------------------------------------
## Gaussian finite mixture model for classification
## ------------------------------------------------
##
## EDDA model summary:
##
## log-likelihood n df BIC
## -528.2006 214 8 -1099.329
##
## Classes n % Model G
## 0 153 71.5 VVI 1
## 1 61 28.5 VVI 1
##
## Training confusion matrix:
## Predicted
## Class 0 1
## 0 146 7
## 1 46 15
## Classification error = 0.2477
## Brier score = 0.1842
##
## Test confusion matrix:
## Predicted
## Class 0 1
## 0 46 2
## 1 18 6
## Classification error = 0.2778
## Brier score = 0.2065
library(boot)
# leave-one-out and 5-fold cross-validation prediction error for
# the nodal data set. Since the response is a binary variable an
# appropriate cost function is
set.seed(123)
cost <- function(r, pi = 0) mean(abs(r-pi) > 0.5)
bc.glm <- glm(class ~ degmalig + dummy, binomial, data = bc.dat)
cv.err <- cv.glm(bc.dat, bc.glm, cost, K = nrow(bc.dat))$delta
cv.5.err <- cv.glm(bc.dat, bc.glm, cost, K = 5)$delta
cv.err
## [1] 0.2587413 0.2589246
cv.5.err
## [1] 0.2482517 0.2552203
df <- data.frame(
GLM = 0.319,
KNN = 0.292,
LDA = 0.319,
MclustDA = 0.278,
MclustDA_EDDA = 0.306,
SVM = 0.292,
GLM_LOOCV = 0.259,
GLM_CV5 = 0.266
)
library(knitr)
kable(df, caption = "Model Error Rates")
GLM | KNN | LDA | MclustDA | MclustDA_EDDA | SVM | GLM_LOOCV | GLM_CV5 |
---|---|---|---|---|---|---|---|
0.319 | 0.292 | 0.319 | 0.278 | 0.306 | 0.292 | 0.259 | 0.266 |
It appears that the cross validation and leave one out approach on the generalized linear model produces wthe lowest error rates. when looking at the validation and test approach, it appears that the MclustDA model follows closely behind the LOOCV and CV5 approach. The model of choice is the one with the lowest error rate.
This means that in this data set with the GLM LOOCV one out approach, it can predicted with roughly 74.1% accuracy of whether or not there will be a recurrence-events or non-recurrence-events based on inv nodes and degree of malginancy.
If comparing this where the data set came from other people mentioned in past usage of this data have had accuracy 66%-78%. So the methods produce an a model accuracy score closer to the high side compared to other contributors.