Introduction

Breast cancer is cancer that develops from breast tissue. Signs of breast cancer may include a lump in the breast, a change in breast shape, dimpling of the skin, fluid coming from the nipple, a newly inverted nipple, or a red or scaly patch of skin. In this report, we investigate factors that may affect the recurrence of breast cancer: age, menopause, tumor.size, inv.nodes, node.caps, deg.malig, breast, breast.quad, and irradiat.

This dataset is one of three domains provided by the Oncology Institute that has repeatedly appeared in the machine learning literature.

It includes 201 instances of one class and 85 instances of another class. The instances are described by 9 attributes, some of which are linear and some are nominal.

Attribute Information:

Class: no-recurrence-events, recurrence-events
age: 10-19, 20-29, 30-39, 40-49, 50-59, 60-69, 70-79, 80-89, 90-99.
menopause: lt40, ge40, premeno.
tumor-size: 0-4, 5-9, 10-14, 15-19, 20-24, 25-29, 30-34, 35-39, 40-44, 45-49, 50-54, 55-59.
inv-nodes: 0-2, 3-5, 6-8, 9-11, 12-14, 15-17, 18-20, 21-23, 24-26, 27-29, 30-32, 33-35, 36-39.
node-caps: yes, no.
deg-malig: 1, 2, 3.
breast: left, right.
breast-quad: left-up, left-low, right-up, right-low, central.
irradiat: yes, no.

Data loading

# Install any required packages that are missing
list.of.packages <- c("dplyr", "qwraps2", "ggplot2", "gridExtra", "car", "arm", "psych", "caTools", "caret", "keras")
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
if(length(new.packages)) install.packages(new.packages)
library(qwraps2)
options(qwraps2_markup = "markdown")
library(dplyr)
library(ggplot2)
library(gridExtra)
library(arm)
library(car)
library(psych)
library(caTools)
library(caret)
library(keras)
# Load the dataset. stringsAsFactors = TRUE keeps the categorical columns as factors,
# which R >= 4.0 no longer does by default
data <- read.csv("https://raw.githubusercontent.com/pnhuy/bioinfo/master/datasets/breast_cancer/breast-cancer.csv", stringsAsFactors = TRUE)

Exploratory data analysis

The data contain 286 rows (one per patient) and 10 fields: Class, age, menopause, tumor.size, inv.nodes, node.caps, deg.malig, breast, breast.quad, and irradiat.

head(data)

##                  Class   age menopause tumor.size inv.nodes node.caps deg.malig
## 1 no-recurrence-events 30-39   premeno      30-34       0-2        no         3
## 2 no-recurrence-events 40-49   premeno      20-24       0-2        no         2
## 3 no-recurrence-events 40-49   premeno      20-24       0-2        no         2
## 4 no-recurrence-events 60-69      ge40      15-19       0-2        no         2
## 5 no-recurrence-events 40-49   premeno        0-4       0-2        no         2
## 6 no-recurrence-events 60-69      ge40      15-19       0-2        no         2
##   breast breast.quad irradiat
## 1   left    left_low       no
## 2  right    right_up       no
## 3   left    left_low       no
## 4  right     left_up       no
## 5  right   right_low       no
## 6   left    left_low       no

#Show structure of the dataset
str(data)

## 'data.frame':    286 obs. of  10 variables:
##  $ Class      : Factor w/ 2 levels "no-recurrence-events",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ age        : Factor w/ 6 levels "20-29","30-39",..: 2 3 3 5 3 5 4 5 3 3 ...
##  $ menopause  : Factor w/ 3 levels "ge40","lt40",..: 3 3 3 1 3 1 3 1 3 3 ...
##  $ tumor.size : Factor w/ 11 levels "0-4","10-14",..: 6 4 4 3 1 3 5 4 11 4 ...
##  $ inv.nodes  : Factor w/ 7 levels "0-2","12-14",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ node.caps  : Factor w/ 3 levels "?","no","yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ deg.malig  : int  3 2 2 2 2 2 2 1 2 2 ...
##  $ breast     : Factor w/ 2 levels "left","right": 1 2 1 2 2 1 1 1 1 2 ...
##  $ breast.quad: Factor w/ 6 levels "?","central",..: 3 6 3 4 5 3 3 3 3 4 ...
##  $ irradiat   : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...

The basic statistics of the data are shown below:

summary(data)

##                   Class        age       menopause     tumor.size inv.nodes  
##  no-recurrence-events:201   20-29: 1   ge40   :129   30-34  :60   0-2  :213  
##  recurrence-events   : 85   30-39:36   lt40   :  7   25-29  :54   12-14:  3  
##                             40-49:90   premeno:150   20-24  :50   15-17:  6  
##                             50-59:96                 15-19  :30   24-26:  1  
##                             60-69:57                 10-14  :28   3-5  : 36  
##                             70-79: 6                 40-44  :22   6-8  : 17  
##                                                      (Other):42   9-11 : 10  
##  node.caps   deg.malig       breast       breast.quad  irradiat 
##  ?  :  8   Min.   :1.000   left :152   ?        :  1   no :218  
##  no :222   1st Qu.:2.000   right:134   central  : 21   yes: 68  
##  yes: 56   Median :2.000               left_low :110            
##            Mean   :2.049               left_up  : 97            
##            3rd Qu.:3.000               right_low: 24            
##            Max.   :3.000               right_up : 33            
##

We run the summary function to show each column, its data type, and a few other useful attributes (min, 1st quartile, median, mean, 3rd quartile, and max values for numeric columns, or the number of patients in each level for factors).
node.caps is missing 8 values and breast.quad is missing 1 value. They are denoted as “?” in the data set.
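An alternative to handling the “?” placeholders by hand is to have read.csv() convert them to NA while reading. The sketch below is not the approach used in this report. It runs on a small inline excerpt with made-up rows (`toy_csv` and `toy` are illustrative names), so it does not touch the real file.

```r
# Hypothetical three-row excerpt in the same layout as the dataset (made-up values)
toy_csv <- "Class,node.caps,breast.quad
no-recurrence-events,no,left_low
recurrence-events,?,right_up
no-recurrence-events,yes,?"

# na.strings = "?" turns the placeholder into NA while reading
toy <- read.csv(text = toy_csv, na.strings = "?", stringsAsFactors = TRUE)

# Count the missing values in each column
missing_per_column <- colSums(is.na(toy))
print(missing_per_column)

# Keep only the rows without any missing value
toy_complete <- na.omit(toy)
print(nrow(toy_complete))
```

With this option, is.na() and na.omit() see the missing entries directly, and no unused “?” factor level is left behind.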

#Illustrate the relationship between Class and age
table(data$Class,data$age)

##                       
##                        20-29 30-39 40-49 50-59 60-69 70-79
##   no-recurrence-events     1    21    63    71    40     5
##   recurrence-events        0    15    27    25    17     1

Patients in the youngest and oldest age groups are rare. Most cases in both classes, “no-recurrence-events” and “recurrence-events”, fall in the age range “40-69”. Specifically, age “40-49” has the highest count of “recurrence-events”, while age “50-59” has the highest count in the opposite class.
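Raw counts depend on how many patients fall in each age group, so it can help to look at the share of recurrences within each group as well. The following sketch rebuilds the table from the counts printed above and applies prop.table(). The object names (`counts`, `within_age`, `recurrence_rate`) are illustrative.

```r
# Counts copied from the table(data$Class, data$age) output above
age_groups <- c("20-29", "30-39", "40-49", "50-59", "60-69", "70-79")
counts <- rbind(
  "no-recurrence-events" = c(1, 21, 63, 71, 40, 5),
  "recurrence-events"    = c(0, 15, 27, 25, 17, 1)
)
colnames(counts) <- age_groups

# Share of each class within every age group (each column sums to 1)
within_age <- prop.table(counts, margin = 2)
print(round(within_age, 3))

# Recurrence rate per age group
recurrence_rate <- within_age["recurrence-events", ]
print(round(recurrence_rate, 3))
```

On these counts, the recurrence rate is highest in the “30-39” group (15 of 36 patients), even though “40-49” has the largest number of recurrences.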

#correlation coefficient between variables
pairs.panels(data)

This graph shows the pairwise correlation coefficients between all variables, together with a scatter plot for each pair of variables and a histogram for each variable. The highest correlation coefficient (0.72) is between age and menopause and is statistically significant, while the correlation coefficient between menopause and tumor.size is 0 and is not statistically significant. Because most of the variables are categorical, these coefficients are computed on the integer codes of the factor levels and should be read as rough indications only.

In addition, the data might contain NA values, so we check for them before building the model.

sum(is.na(data))

## [1] 0

No NA values appear in the dataset, but the summary shows some “?” entries. Therefore, we remove the rows with “?” values in the breast.quad and node.caps columns.

data = filter(data, breast.quad != '?' & node.caps != '?')

Before exploring other aspects of the dataset and building the model, we produce bar graphs in ggplot for each attribute, with the class variable highlighted in color, to see whether there are any interesting interactions between the covariates and the class variable. A graph was produced for each of the 9 attributes.

p1 = ggplot(data, aes(x=inv.nodes, fill=Class)) + geom_bar(position='dodge') + labs(title='Histogram of Inv Nodes Grouped by Class',x='Inv Nodes',y='Count')
p2 = ggplot(data, aes(x=menopause, fill=Class)) + geom_bar(position='dodge') + labs(title='Histogram of Menopause Grouped by Class',x='Menopause',y='Count')
p3 = ggplot(data, aes(x=irradiat, fill=Class)) + geom_bar(position='dodge') + labs(title='Histogram of Irradiat Grouped by Class',x='Irradiat',y='Count')
p4 = ggplot(data, aes(x=age, fill=Class)) + geom_bar(position='dodge') + labs(title='Histogram of Age Grouped by Class',x='Age',y='Count')
p5 = ggplot(data, aes(x=breast.quad, fill=Class)) + geom_bar(position='dodge') + labs(title='Histogram of Breast Quadrant Grouped by Class',x='Breast Quadrant',y='Count')
p6 = ggplot(data, aes(x=tumor.size, fill=Class)) + geom_bar(position='dodge') + labs(title='Histogram of Tumor size Grouped by Class',x='Tumor size',y='Count')
p7 = ggplot(data, aes(x= node.caps, fill=Class)) + geom_bar(position='dodge') + labs(title='Histogram of node.caps Grouped by Class',x='Node.caps',y='Count')
p8 = ggplot(data, aes(x=breast, fill=Class)) + geom_bar(position='dodge') + labs(title='Histogram of Breast Grouped by Class',x='Breast',y='Count')
p9 = ggplot(data, aes(x=deg.malig, fill=Class)) + geom_bar(position='dodge') + labs(title='Histogram of Deg.malig Grouped by Class',x='Degree of Malignancy',y='Count')
grid.arrange(p1,p2,p3,p4,p5,p6,p7,p8,p9)

From these graphs, we make the following comments:

The counts across age groups are roughly bell-shaped. Breast quadrant is a nominal variable, so its bars only show that most cases fall in the left_low and left_up quadrants.
In the Inv.nodes graph, both classes occur most often with the fewest axillary lymph nodes (0-2).
In the Degree of Malignancy graph, recurrence-events increase with the degree of malignancy, whereas no-recurrence-events are most frequent at degree 2, then 1, and then 3.

#Density plot
ggplot(data,aes(x=`deg.malig`,fill=`tumor.size`))+geom_density(alpha=0.4)+ggtitle("Deg Malignant vs Tumor size")+xlab("Deg Malignant")+ylab("Density")

By visualizing the distribution of the degree of malignancy for the different levels of tumor.size as a multi-density plot, we can compare how deg.malig is distributed across tumor-size groups.

# No NA values remain after filtering, so no imputation is needed.
# Labeling features: recode each attribute as a factor with numeric labels.
# Explicit level names also drop the unused "?" level of node.caps and breast.quad.
data$Class <- factor(data$Class, levels = c("no-recurrence-events", "recurrence-events"), labels = c(0, 1))
data$age <- factor(data$age, levels = c("20-29", "30-39", "40-49", "50-59", "60-69", "70-79"), labels = c(0, 1, 2, 3, 4, 5))
data$menopause <- factor(data$menopause, levels = c("premeno", "ge40", "lt40"), labels = c(0, 1, 2))
data$tumor.size <- factor(data$tumor.size, levels = c("0-4", "5-9", "10-14", "15-19", "20-24", "25-29", "30-34", "35-39", "40-44", "45-49", "50-54"), labels = c(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10))
data$inv.nodes <- factor(data$inv.nodes, levels = c("0-2", "3-5", "6-8", "9-11", "12-14", "15-17", "24-26"), labels = c(0, 1, 2, 3, 4, 5, 6))
data$node.caps <- factor(data$node.caps, levels = c("no", "yes"), labels = c(0, 1))
data$deg.malig <- factor(data$deg.malig, levels = c("1", "2", "3"), labels = c(0, 1, 2))
data$breast <- factor(data$breast, levels = c("left", "right"), labels = c(0, 1))
data$breast.quad <- factor(data$breast.quad, levels = c("central", "left_low", "left_up", "right_low", "right_up"), labels = c(0, 1, 2, 3, 4))
data$irradiat <- factor(data$irradiat, levels = c("no", "yes"), labels = c(0, 1))
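Recoding with factor() changes what as.numeric() returns, which matters for the steps that follow. The short sketch below uses a made-up vector with the same coding as deg.malig (`deg`, `deg_f`, `codes` and `label_values` are illustrative names) to show the difference between a factor's internal codes and its labels.

```r
# Made-up values using the same coding as deg.malig (1, 2, 3)
deg <- c(1, 2, 3, 2, 3)
deg_f <- factor(deg, levels = c("1", "2", "3"), labels = c(0, 1, 2))

# as.numeric() on a factor returns the internal codes, which start at 1
codes <- as.numeric(deg_f)
print(codes)

# Converting through character recovers the labels 0, 1, 2 instead
label_values <- as.numeric(as.character(deg_f))
print(label_values)
```

This is why as.numeric(data$deg.malig) in the t-test runs on the 1 to 3 scale rather than on the labels 0 to 2.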

Hypothesis testing

T-test

#Testing deg.malig vs Class
t.test(as.numeric(data$deg.malig)~data$Class)

## 
##  Welch Two Sample t-test
## 
## data:  as.numeric(data$deg.malig) by data$Class
## t = -5.8131, df = 149.93, p-value = 3.563e-08
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.7088626 -0.3492125
## sample estimates:
## mean in group 0 mean in group 1 
##        1.903061        2.432099

From the t-test results, the p-value (3.563e-08) is far below the chosen Type I error rate of 0.05 (5%). We therefore reject H0, that the mean degree of malignancy is the same in both classes, in favor of the two-sided alternative Ha that the means differ. The recurrence group has the higher mean (2.43 versus 1.90), and the 95% confidence interval for the difference (-0.71 to -0.35) excludes zero.
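For reference, the Welch statistic and its approximate degrees of freedom can be written as follows. The symbols are notation assumed here rather than taken from the report: \(\bar{x}_k\), \(s_k^2\) and \(n_k\) are the sample mean, sample variance and size of class \(k\), and \(\nu\) is the Welch-Satterthwaite approximation.

```latex
\[
t = \frac{\bar{x}_0 - \bar{x}_1}{\sqrt{\dfrac{s_0^2}{n_0} + \dfrac{s_1^2}{n_1}}},
\qquad
\nu \approx
\frac{\left(\dfrac{s_0^2}{n_0} + \dfrac{s_1^2}{n_1}\right)^2}
     {\dfrac{\left(s_0^2 / n_0\right)^2}{n_0 - 1} + \dfrac{\left(s_1^2 / n_1\right)^2}{n_1 - 1}}
\]
```

In the output above the numerator is 1.903061 - 2.432099 = -0.529038, and the fractional df = 149.93 comes from this approximation, since the two group variances are not pooled.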

Building the model

Logistic regression Model for Breast Cancer’s Class

We want to predict an outcome variable that is categorical (recurrence or no recurrence), so we use logistic regression.

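# Keep the response and the two predictors used in the model (columns 1, 7 and 10)
log_data <- data[c("Class", "deg.malig", "irradiat")]
# Shuffle the rows
set.seed(1000)
shuf_ind <- sample(1:nrow(log_data))
log_data <- log_data[shuf_ind, ]
# Split the data: 80% training, 20% test
set.seed(123)
split <- sample.split(log_data$Class, SplitRatio = 0.8)
# Training set
training_set <- subset(log_data, split == TRUE)
# Test set
test_set <- subset(log_data, split == FALSE)
# Generalized linear model with the binomial family (logistic regression)
classifier <- glm(formula = Class ~ ., family = binomial(), data = training_set)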
summary(classifier)

## 
## Call:
## glm(formula = Class ~ ., family = binomial(), data = training_set)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5177  -0.5991  -0.5991   0.8717   2.0011  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -1.8571     0.3870  -4.798  1.6e-06 ***
## deg.malig1    0.2305     0.4683   0.492 0.622508    
## deg.malig2    1.7253     0.4610   3.743 0.000182 ***
## irradiat1     0.9035     0.3636   2.485 0.012962 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 268.46  on 221  degrees of freedom
## Residual deviance: 233.96  on 218  degrees of freedom
## AIC: 241.96
## 
## Number of Fisher Scoring iterations: 4

pre = predict(classifier,type="response",newdata=test_set[-1])
y_pre <- ifelse(pre>0.5,"1","0")
confusionMatrix(factor(y_pre), factor(test_set$Class))

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 37 13
##          1  2  3
##                                           
##                Accuracy : 0.7273          
##                  95% CI : (0.5904, 0.8386)
##     No Information Rate : 0.7091          
##     P-Value [Acc > NIR] : 0.449190        
##                                           
##                   Kappa : 0.1709          
##                                           
##  Mcnemar's Test P-Value : 0.009823        
##                                           
##             Sensitivity : 0.9487          
##             Specificity : 0.1875          
##          Pos Pred Value : 0.7400          
##          Neg Pred Value : 0.6000          
##              Prevalence : 0.7091          
##          Detection Rate : 0.6727          
##    Detection Prevalence : 0.9091          
##       Balanced Accuracy : 0.5681          
##                                           
##        'Positive' Class : 0               
##

From the output above, the coefficients table shows the beta coefficient estimates and their significance levels. The columns are:

Estimate: the intercept (b0) and the beta coefficient estimates associated with each predictor variable.
Std. Error: the standard error of the coefficient estimates. This represents the precision of the coefficients. The larger the standard error, the less confident we are about the estimate.
z value: the z-statistic, which is the coefficient estimate (column 2) divided by the standard error of the estimate (column 3).
Pr(>|z|): the p-value corresponding to the z-statistic. The smaller the p-value, the more significant the estimate is.
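The relationship between these columns can be checked by hand. The sketch below copies the estimates and standard errors from the summary(classifier) output in this report and recomputes the z values and two-sided p-values. The data frame `coef_table` and its column names are illustrative.

```r
# Estimates and standard errors copied from the summary(classifier) output
coef_table <- data.frame(
  term      = c("(Intercept)", "deg.malig1", "deg.malig2", "irradiat1"),
  estimate  = c(-1.8571, 0.2305, 1.7253, 0.9035),
  std_error = c(0.3870, 0.4683, 0.4610, 0.3636)
)

# Wald z statistic: the estimate divided by its standard error
coef_table$z_value <- coef_table$estimate / coef_table$std_error

# Two-sided p-value from the standard normal distribution
coef_table$p_value <- 2 * pnorm(-abs(coef_table$z_value))

print(coef_table)
```

Up to rounding of the printed inputs, the recomputed z values agree with the z value column of the summary.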

Next, we make predictions on the test data in order to evaluate the performance of our logistic regression model.

The procedure is as follows:

Predict the class membership probabilities of the observations based on the predictor variables.
Assign each observation to the class with the highest probability score (i.e. recurrence when the probability is above 0.5).

The R function predict() can be used to predict the probability of recurrence-events, given the predictor values. Use the option type = "response" to obtain the probabilities directly.
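To make the two steps concrete, the sketch below reproduces by hand what predict(type = "response") computes, using the coefficients printed in the summary(classifier) output. It is an illustration under that assumption rather than a replacement for predict(). The names `grid`, `log_odds`, `probability` and `predicted` are illustrative.

```r
# Coefficients copied from the summary(classifier) output
b0     <- -1.8571
b_deg1 <-  0.2305
b_deg2 <-  1.7253
b_irr  <-  0.9035

# All six combinations of malignancy label (0, 1, 2) and irradiation label (0, 1)
grid <- expand.grid(deg.malig = 0:2, irradiat = 0:1)

# Step 1a: linear predictor (log odds) built from the dummy variables
grid$log_odds <- b0 +
  b_deg1 * (grid$deg.malig == 1) +
  b_deg2 * (grid$deg.malig == 2) +
  b_irr  * grid$irradiat

# Step 1b: plogis() is the inverse logit, exp(x) / (1 + exp(x))
grid$probability <- plogis(grid$log_odds)

# Step 2: assign the class using the 0.5 threshold
grid$predicted <- ifelse(grid$probability > 0.5, "recurrence", "no_recurrence")
print(grid)
```

With these coefficients, only the combination of the highest malignancy label and irradiation exceeds 0.5, which is consistent with the small number of "recurrence" predictions on the test set.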

#probability predict
pre = predict(classifier,type="response",newdata=test_set[-1])
y_pre <- ifelse(pre>0.5,"recurrence","no_recurrence")
y_pre

##              29             154              96             140             135 
## "no_recurrence" "no_recurrence" "no_recurrence" "no_recurrence" "no_recurrence" 
##             136             163             109             269              40 
## "no_recurrence" "no_recurrence" "no_recurrence"    "recurrence" "no_recurrence" 
##              56             108             265             113              66 
## "no_recurrence" "no_recurrence" "no_recurrence" "no_recurrence" "no_recurrence" 
##             212             105              95             190             125 
## "no_recurrence" "no_recurrence" "no_recurrence" "no_recurrence" "no_recurrence" 
##             210             240             197             179               2 
## "no_recurrence"    "recurrence" "no_recurrence"    "recurrence" "no_recurrence" 
##             116             204             189             104              63 
## "no_recurrence" "no_recurrence" "no_recurrence" "no_recurrence" "no_recurrence" 
##             251             120              55             166             106 
## "no_recurrence" "no_recurrence" "no_recurrence" "no_recurrence" "no_recurrence" 
##             215             244              86             187              48 
## "no_recurrence" "no_recurrence" "no_recurrence" "no_recurrence" "no_recurrence" 
##              19              80             202             245             205 
## "no_recurrence" "no_recurrence" "no_recurrence" "no_recurrence" "no_recurrence" 
##             257             236             184             142              12 
## "no_recurrence"    "recurrence" "no_recurrence" "no_recurrence" "no_recurrence" 
##             183             100             235             178             141 
## "no_recurrence" "no_recurrence" "no_recurrence"    "recurrence" "no_recurrence"

Neural network model

set.seed(100)
split <- sample.split(data$Class, SplitRatio = 0.8)
# Labels: convert the 0/1 factor to numeric 0/1
y_train <- as.numeric(as.character(data$Class[split]))
y_test <- as.numeric(as.character(data$Class[!split]))
# Predictors: every column except Class, with factors replaced by their integer codes
predictors <- setdiff(colnames(data), "Class")
x_train <- data.matrix(data[split, predictors])
x_test <- data.matrix(data[!split, predictors])
# Standardize the training set, then apply its means and standard deviations to the test set
x_train <- scale(x_train)
col_means_train <- attr(x_train, "scaled:center")
col_stddevs_train <- attr(x_train, "scaled:scale")
x_test <- scale(x_test, center = col_means_train, scale = col_stddevs_train)
x_test

##            age  menopause  tumor.size  inv.nodes  node.caps   deg.malig
## 7    0.3701586 -0.8858197  0.04825758 -0.4344863 -0.4677502 -0.07810125
## 12   0.3701586  0.9520493  0.04825758 -0.4344863 -0.4677502 -0.07810125
## 28   1.3369259  0.9520493  0.04825758 -0.4344863 -0.4677502  1.25562783
## 32   0.3701586 -0.8858197 -0.88332348 -0.4344863 -0.4677502 -0.07810125
## 34   0.3701586  0.9520493 -1.34911400 -0.4344863 -0.4677502 -1.41183033
## 36  -1.5633759 -0.8858197  0.51404810 -0.4344863 -0.4677502 -0.07810125
## 39  -0.5966086 -0.8858197 -1.34911400 -0.4344863 -0.4677502 -0.07810125
## 42   1.3369259  0.9520493  0.04825758 -0.4344863 -0.4677502 -0.07810125
## 44  -0.5966086 -0.8858197 -1.34911400 -0.4344863 -0.4677502 -0.07810125
## 48   0.3701586 -0.8858197  0.04825758 -0.4344863 -0.4677502 -1.41183033
## 63   0.3701586  0.9520493 -2.28069506 -0.4344863 -0.4677502 -1.41183033
## 74   0.3701586 -0.8858197 -0.88332348 -0.4344863 -0.4677502 -0.07810125
## 77   0.3701586  0.9520493 -1.34911400 -0.4344863 -0.4677502 -0.07810125
## 79   0.3701586 -0.8858197  0.04825758 -0.4344863 -0.4677502 -0.07810125
## 83   1.3369259  0.9520493 -0.88332348 -0.4344863 -0.4677502 -0.07810125
## 84   0.3701586  0.9520493 -0.88332348 -0.4344863 -0.4677502 -0.07810125
## 91  -1.5633759 -0.8858197 -2.28069506 -0.4344863 -0.4677502 -0.07810125
## 95   0.3701586 -0.8858197 -1.34911400 -0.4344863 -0.4677502 -1.41183033
## 107 -0.5966086 -0.8858197 -0.41753295 -0.4344863 -0.4677502 -0.07810125
## 111  0.3701586  0.9520493  0.04825758 -0.4344863 -0.4677502 -1.41183033
## 115 -0.5966086 -0.8858197  0.97983863 -0.4344863 -0.4677502 -0.07810125
## 117 -1.5633759 -0.8858197 -0.88332348 -0.4344863 -0.4677502 -1.41183033
## 120  1.3369259  0.9520493 -0.88332348 -0.4344863 -0.4677502 -1.41183033
## 127 -1.5633759 -0.8858197  0.51404810  1.4568070  2.1282633 -0.07810125
## 128 -1.5633759 -0.8858197  0.04825758  1.4568070  2.1282633 -0.07810125
## 132 -0.5966086 -0.8858197  1.44562916  0.5111604  2.1282633  1.25562783
## 144 -0.5966086 -0.8858197  1.91141969 -0.4344863 -0.4677502 -0.07810125
## 147  0.3701586 -0.8858197  0.51404810  0.5111604  2.1282633 -0.07810125
## 152  0.3701586  0.9520493  0.97983863  4.2937470 -0.4677502  1.25562783
## 156  0.3701586  0.9520493  0.04825758  0.5111604  2.1282633  1.25562783
## 157  0.3701586 -0.8858197  0.51404810 -0.4344863 -0.4677502 -1.41183033
## 161 -0.5966086 -0.8858197  0.51404810  0.5111604  2.1282633 -0.07810125
## 170  1.3369259  0.9520493  0.51404810  1.4568070  2.1282633 -0.07810125
## 174 -0.5966086  0.9520493  0.04825758 -0.4344863 -0.4677502 -0.07810125
## 177 -0.5966086 -0.8858197 -0.41753295 -0.4344863 -0.4677502  1.25562783
## 183  1.3369259  0.9520493 -0.88332348 -0.4344863 -0.4677502 -0.07810125
## 190  0.3701586 -0.8858197  2.37721021 -0.4344863  2.1282633 -0.07810125
## 191  0.3701586  0.9520493  0.97983863 -0.4344863 -0.4677502 -0.07810125
## 194  0.3701586  0.9520493 -0.88332348 -0.4344863  2.1282633 -0.07810125
## 202  0.3701586 -0.8858197  0.04825758 -0.4344863 -0.4677502 -0.07810125
## 203  0.3701586 -0.8858197  0.51404810 -0.4344863 -0.4677502  1.25562783
## 204 -0.5966086 -0.8858197  0.97983863 -0.4344863 -0.4677502 -1.41183033
## 206  0.3701586  0.9520493 -0.41753295 -0.4344863 -0.4677502 -0.07810125
## 207 -0.5966086 -0.8858197  0.51404810 -0.4344863 -0.4677502  1.25562783
## 217  1.3369259  0.9520493  0.04825758 -0.4344863 -0.4677502  1.25562783
## 224  1.3369259  0.9520493  1.91141969 -0.4344863 -0.4677502 -1.41183033
## 233 -0.5966086 -0.8858197 -0.41753295  0.5111604  2.1282633 -0.07810125
## 239 -1.5633759 -0.8858197  0.51404810  2.4024537 -0.4677502 -0.07810125
## 252  0.3701586  0.9520493  0.51404810  1.4568070  2.1282633 -0.07810125
## 255 -0.5966086  0.9520493  0.04825758  3.3481003  2.1282633  1.25562783
## 256  1.3369259  0.9520493  0.04825758 -0.4344863 -0.4677502  1.25562783
## 262  0.3701586  0.9520493  0.51404810  1.4568070  2.1282633  1.25562783
## 266  1.3369259  0.9520493  0.51404810  0.5111604  2.1282633 -0.07810125
## 269  1.3369259  0.9520493 -1.34911400  1.4568070  2.1282633  1.25562783
## 270  0.3701586 -0.8858197  0.97983863  4.2937470  2.1282633  1.25562783
##         breast breast.quad   irradiat
## 7   -0.9711339  -0.7102848 -0.5100846
## 12  -0.9711339  -0.7102848 -0.5100846
## 28   1.0250858   0.2283058 -0.5100846
## 32   1.0250858  -0.7102848 -0.5100846
## 34   1.0250858   0.2283058 -0.5100846
## 36  -0.9711339   0.2283058 -0.5100846
## 39  -0.9711339  -0.7102848 -0.5100846
## 42  -0.9711339  -0.7102848 -0.5100846
## 44  -0.9711339   0.2283058 -0.5100846
## 48   1.0250858   0.2283058 -0.5100846
## 63  -0.9711339  -0.7102848 -0.5100846
## 74   1.0250858   1.1668965 -0.5100846
## 77  -0.9711339  -0.7102848 -0.5100846
## 79  -0.9711339  -0.7102848 -0.5100846
## 83   1.0250858  -0.7102848 -0.5100846
## 84   1.0250858  -0.7102848 -0.5100846
## 91   1.0250858  -1.6488754 -0.5100846
## 95  -0.9711339  -0.7102848 -0.5100846
## 107 -0.9711339   0.2283058 -0.5100846
## 111 -0.9711339  -0.7102848 -0.5100846
## 115  1.0250858   2.1054871 -0.5100846
## 117 -0.9711339  -0.7102848 -0.5100846
## 120 -0.9711339   1.1668965 -0.5100846
## 127  1.0250858   2.1054871 -0.5100846
## 128  1.0250858   0.2283058  1.9516281
## 132  1.0250858   0.2283058  1.9516281
## 144 -0.9711339  -0.7102848  1.9516281
## 147 -0.9711339  -0.7102848  1.9516281
## 152 -0.9711339  -0.7102848 -0.5100846
## 156  1.0250858   0.2283058 -0.5100846
## 157 -0.9711339  -1.6488754 -0.5100846
## 161  1.0250858  -0.7102848 -0.5100846
## 170  1.0250858   2.1054871 -0.5100846
## 174 -0.9711339  -0.7102848 -0.5100846
## 177  1.0250858  -0.7102848  1.9516281
## 183 -0.9711339   0.2283058  1.9516281
## 190  1.0250858   0.2283058  1.9516281
## 191 -0.9711339   0.2283058 -0.5100846
## 194 -0.9711339  -1.6488754  1.9516281
## 202 -0.9711339   2.1054871 -0.5100846
## 203 -0.9711339   2.1054871 -0.5100846
## 204  1.0250858   0.2283058 -0.5100846
## 206  1.0250858  -1.6488754 -0.5100846
## 207  1.0250858   2.1054871 -0.5100846
## 217 -0.9711339   1.1668965  1.9516281
## 224  1.0250858   2.1054871  1.9516281
## 233  1.0250858   2.1054871  1.9516281
## 239  1.0250858   0.2283058  1.9516281
## 252 -0.9711339   1.1668965  1.9516281
## 255 -0.9711339   1.1668965  1.9516281
## 256 -0.9711339   0.2283058 -0.5100846
## 262 -0.9711339   1.1668965 -0.5100846
## 266 -0.9711339  -1.6488754  1.9516281
## 269 -0.9711339   0.2283058  1.9516281
## 270  1.0250858   2.1054871 -0.5100846
## attr(,"scaled:center")
##         age   menopause  tumor.size   inv.nodes   node.caps   deg.malig 
##    3.617117    1.481982    5.896396    1.459459    1.180180    2.058559 
##      breast breast.quad    irradiat 
##    1.486486    2.756757    1.207207 
## attr(,"scaled:scale")
##         age   menopause  tumor.size   inv.nodes   node.caps   deg.malig 
##   1.0343751   0.5441084   2.1468878   1.0574774   0.3852060   0.7497775 
##      breast breast.quad    irradiat 
##   0.5009469   1.0654272   0.4062212
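The "scaled:center" and "scaled:scale" attributes printed above are the training-set means and standard deviations, reused so that the test set does not influence the scaling. A minimal sketch of the same pattern on made-up matrices (`toy_train` and `toy_test` are hypothetical stand-ins for x_train and x_test):

```r
# Made-up matrices standing in for x_train and x_test
toy_train <- matrix(c(1, 2, 3, 4, 10, 20, 30, 40), ncol = 2,
                    dimnames = list(NULL, c("a", "b")))
toy_test <- matrix(c(2.5, 5, 25, 50), ncol = 2,
                   dimnames = list(NULL, c("a", "b")))

# Fit the scaling on the training data only
toy_train_scaled <- scale(toy_train)
train_center <- attr(toy_train_scaled, "scaled:center")
train_scale  <- attr(toy_train_scaled, "scaled:scale")

# Apply the same centre and scale to the test data
toy_test_scaled <- scale(toy_test, center = train_center, scale = train_scale)
print(toy_test_scaled)
```

A test row equal to the training means maps to zero in every column, while the scaled test columns themselves need not have mean 0 or standard deviation 1.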

# Build neural network with Class
model <- keras_model_sequential()
model %>%
  layer_dense(units = 32, activation = 'relu', input_shape = dim(x_train)[2]) %>%
  layer_dense(units = 16, activation = "relu") %>%
  layer_dense(units = 1, activation = "sigmoid")
model %>% compile(
  optimizer = optimizer_adam(),
  loss = 'mse',
  metrics = c('accuracy')
)
model %>% fit(x_train, y_train, epochs = 200, batch_size = 64)
score <- model %>% evaluate(x_test, y_test)
p1 <- model %>% predict(x_test)
pred = as.factor(ifelse(p1 > 0.5 , '1','0'))
confusionMatrix(factor(pred),factor(y_test))

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 31  8
##          1  8  8
##                                          
##                Accuracy : 0.7091         
##                  95% CI : (0.571, 0.8237)
##     No Information Rate : 0.7091         
##     P-Value [Acc > NIR] : 0.5669         
##                                          
##                   Kappa : 0.2949         
##                                          
##  Mcnemar's Test P-Value : 1.0000         
##                                          
##             Sensitivity : 0.7949         
##             Specificity : 0.5000         
##          Pos Pred Value : 0.7949         
##          Neg Pred Value : 0.5000         
##              Prevalence : 0.7091         
##          Detection Rate : 0.5636         
##    Detection Prevalence : 0.7091         
##       Balanced Accuracy : 0.6474         
##                                          
##        'Positive' Class : 0              
##
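The headline statistics in both confusionMatrix() outputs follow directly from the four cell counts. The sketch below recomputes accuracy, sensitivity and specificity for the two models from the counts printed in this report, with class 0 (no recurrence) as the positive class, as in the outputs. `summarise_cm` is a helper defined here for illustration, not a caret function.

```r
# Summarise a 2x2 confusion matrix whose rows are predictions and whose
# columns are the reference, treating class "0" as the positive class
summarise_cm <- function(cm) {
  tp <- cm["0", "0"]  # predicted 0, actually 0
  fn <- cm["1", "0"]  # predicted 1, actually 0
  fp <- cm["0", "1"]  # predicted 0, actually 1
  tn <- cm["1", "1"]  # predicted 1, actually 1
  c(accuracy    = (tp + tn) / sum(cm),
    sensitivity = tp / (tp + fn),
    specificity = tn / (tn + fp))
}

cm_names <- list(Prediction = c("0", "1"), Reference = c("0", "1"))

# Counts copied from the two confusionMatrix() outputs (filled column by column)
cm_logistic <- matrix(c(37, 2, 13, 3), nrow = 2, dimnames = cm_names)
cm_neural   <- matrix(c(31, 8, 8, 8),  nrow = 2, dimnames = cm_names)

metrics_logistic <- summarise_cm(cm_logistic)
metrics_neural   <- summarise_cm(cm_neural)
print(round(rbind(logistic = metrics_logistic, neural = metrics_neural), 4))
```

Because class 0 is the positive class, specificity here is the share of actual recurrences that each model detects: 3 of 16 for the logistic regression and 8 of 16 for the neural network.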

Brief interpretation of the models

Logistic regression model

The output above shows the estimates of the regression beta coefficients and their significance levels.

The intercept \(b_{0}\) is -1.8571. It is the log odds of recurrence for the reference group (lowest degree of malignancy, no irradiation).
deg.malig2 is the indicator for the highest degree of malignancy. It increases the log odds of recurrence by 1.7253 relative to the lowest degree, and its p-value (0.000182) indicates that it is highly significant.
deg.malig1 is the indicator for the middle degree of malignancy. Its coefficient is 0.2305 with a p-value of 0.62, so it is not significantly different from the lowest degree.
irradiat1 is the indicator for irradiated patients. It increases the log odds of recurrence by 0.9035, and its p-value (0.013) indicates that it is significant at the 5% level.

The output also shows Null deviance, Residual deviance, AIC:

Null deviance measures how well the response is predicted by a model with only an intercept. The lower the value, the better the model.
Residual deviance measures how well the response is predicted once the independent variables are added. The lower the value, the better the model. The larger the drop from the null deviance to the residual deviance, the more the predictors improve the fit. Here the deviance falls from 268.46 to 233.96, a drop of 34.50 on 3 degrees of freedom.

AIC provides a method for assessing the quality of a model through comparison with related models. It is based on the deviance but penalizes model complexity. Much like adjusted R-squared, its intent is to prevent the inclusion of irrelevant predictors.
The logistic equation can be written as p = \(\frac{e^{-1.8571 + 0.2305 \cdot \mathrm{deg.malig1} + 1.7253 \cdot \mathrm{deg.malig2} + 0.9035 \cdot \mathrm{irradiat1}}}{1 + e^{-1.8571 + 0.2305 \cdot \mathrm{deg.malig1} + 1.7253 \cdot \mathrm{deg.malig2} + 0.9035 \cdot \mathrm{irradiat1}}}\)
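Whether the drop in deviance is large enough can be judged with a likelihood-ratio (deviance) test against the intercept-only model. The sketch below applies it to the deviances printed in the summary(classifier) output and also recovers the reported AIC. It assumes the usual chi-squared approximation, and the variable names are illustrative.

```r
# Values copied from the summary(classifier) output
null_deviance     <- 268.46
null_df           <- 221
residual_deviance <- 233.96
residual_df       <- 218

# Likelihood-ratio test: the drop in deviance is compared with a chi-squared
# distribution whose degrees of freedom equal the number of added coefficients
deviance_drop <- null_deviance - residual_deviance
df_drop       <- null_df - residual_df
lr_p_value    <- pchisq(deviance_drop, df = df_drop, lower.tail = FALSE)
print(c(deviance_drop = deviance_drop, df = df_drop, p_value = lr_p_value))

# For a binary response, AIC = residual deviance + 2 * number of coefficients
n_coefficients <- 4
aic <- residual_deviance + 2 * n_coefficients
print(aic)
```

The resulting p-value is far below 0.05, so the two predictors together improve on the intercept-only model, and the recomputed AIC matches the 241.96 shown in the output.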

Neural network model

The graph shows good performance on the training set (training loss: 0.0912, training accuracy: 0.9324). However, the accuracy on the test set is only 0.7091, which illustrates a common problem with neural network models: overfitting. Overfitting is a modeling error that occurs when a function is fit too closely to a limited set of data points. It generally takes the form of an overly complex model that explains idiosyncrasies in the data under study.

Conclusion

Logistic regression and artificial neural networks are the models of choice in many medical data classification tasks. In this comparison, the logistic regression model was slightly more accurate in predicting the breast cancer class than the neural network model (test accuracy 0.7273 versus 0.7091). Therefore, logistic regression appears to be an appropriate method for classifying the breast cancer class.

Comparison between logistic regression and neural network models on Breast Cancer’s class

Hoang H. Ha and Thu A. Nguyen

4/28/2020
