Breast cancer is cancer that develops from breast tissue. Signs of breast cancer may include a lump in the breast, a change in breast shape, dimpling of the skin, fluid coming from the nipple, a newly inverted nipple, or a red or scaly patch of skin. In this report, we examine factors that may affect the recurrence of breast cancer: age, menopause, tumor.size, inv.nodes, node.caps, deg.malig, breast, breast.quad, and irradiat.
This dataset is one of three domains provided by the Oncology Institute that has repeatedly appeared in the machine learning literature.
It includes 201 instances of one class and 85 instances of another class. The instances are described by 9 attributes, some of which are linear and some are nominal.
Attribute Information:
# Install the required library
list.of.packages <- c("dplyr", "qwraps2", "ggplot2", "gridExtra", "car","arm","psych","caTools","caret","keras")
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
if(length(new.packages)) install.packages(new.packages)
library(qwraps2)
options(qwraps2_markup = "markdown")
library(dplyr)
library(ggplot2)
library(gridExtra)
library(arm)
library(car)
library(psych)
library(caTools)
library(caret)
library(keras)
# Load the dataset
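# stringsAsFactors = TRUE reproduces the factor columns shown by str() below (the default changed in R 4.0)
data = read.csv("https://raw.githubusercontent.com/pnhuy/bioinfo/master/datasets/breast_cancer/breast-cancer.csv", stringsAsFactors = TRUE)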
attach(data)
The data have 286 rows (one per patient) and 10 fields: Class, age, menopause, tumor.size, inv.nodes, node.caps, deg.malig, breast, breast.quad, and irradiat.
head(data)
## Class age menopause tumor.size inv.nodes node.caps deg.malig
## 1 no-recurrence-events 30-39 premeno 30-34 0-2 no 3
## 2 no-recurrence-events 40-49 premeno 20-24 0-2 no 2
## 3 no-recurrence-events 40-49 premeno 20-24 0-2 no 2
## 4 no-recurrence-events 60-69 ge40 15-19 0-2 no 2
## 5 no-recurrence-events 40-49 premeno 0-4 0-2 no 2
## 6 no-recurrence-events 60-69 ge40 15-19 0-2 no 2
## breast breast.quad irradiat
## 1 left left_low no
## 2 right right_up no
## 3 left left_low no
## 4 right left_up no
## 5 right right_low no
## 6 left left_low no
#Show structure of the dataset
str(data)
## 'data.frame': 286 obs. of 10 variables:
## $ Class : Factor w/ 2 levels "no-recurrence-events",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ age : Factor w/ 6 levels "20-29","30-39",..: 2 3 3 5 3 5 4 5 3 3 ...
## $ menopause : Factor w/ 3 levels "ge40","lt40",..: 3 3 3 1 3 1 3 1 3 3 ...
## $ tumor.size : Factor w/ 11 levels "0-4","10-14",..: 6 4 4 3 1 3 5 4 11 4 ...
## $ inv.nodes : Factor w/ 7 levels "0-2","12-14",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ node.caps : Factor w/ 3 levels "?","no","yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ deg.malig : int 3 2 2 2 2 2 2 1 2 2 ...
## $ breast : Factor w/ 2 levels "left","right": 1 2 1 2 2 1 1 1 1 2 ...
## $ breast.quad: Factor w/ 6 levels "?","central",..: 3 6 3 4 5 3 3 3 3 4 ...
## $ irradiat : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
The basic statistics of the data are shown below:
summary(data)
## Class age menopause tumor.size inv.nodes
## no-recurrence-events:201 20-29: 1 ge40 :129 30-34 :60 0-2 :213
## recurrence-events : 85 30-39:36 lt40 : 7 25-29 :54 12-14: 3
## 40-49:90 premeno:150 20-24 :50 15-17: 6
## 50-59:96 15-19 :30 24-26: 1
## 60-69:57 10-14 :28 3-5 : 36
## 70-79: 6 40-44 :22 6-8 : 17
## (Other):42 9-11 : 10
## node.caps deg.malig breast breast.quad irradiat
## ? : 8 Min. :1.000 left :152 ? : 1 no :218
## no :222 1st Qu.:2.000 right:134 central : 21 yes: 68
## yes: 56 Median :2.000 left_low :110
## Mean :2.049 left_up : 97
## 3rd Qu.:3.000 right_low: 24
## Max. :3.000 right_up : 33
##
#Illustrate the relationship between Class and age
table(data$Class,data$age)
##
## 20-29 30-39 40-49 50-59 60-69 70-79
## no-recurrence-events 1 21 63 71 40 5
## recurrence-events 0 15 27 25 17 1
Patients in the youngest (20-29) and oldest (70-79) age groups are rare. Most patients in both classes are aged 40-69. Specifically, the 40-49 group has the largest number of “recurrence-events” (27), while the 50-59 group has the largest number of “no-recurrence-events” (71).
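Raw counts can hide how often recurrence occurs within each age group, because the groups differ a lot in size. The sketch below rebuilds the table above by hand (the object names `counts` and `recurrence_share` are illustrative, not part of the original analysis) and converts it to within-group proportions.

```r
# Counts copied from the table(data$Class, data$age) output above
counts <- matrix(
  c(1, 21, 63, 71, 40, 5,
    0, 15, 27, 25, 17, 1),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    Class = c("no-recurrence-events", "recurrence-events"),
    age   = c("20-29", "30-39", "40-49", "50-59", "60-69", "70-79")
  )
)

# Proportion of each class within every age group (each column sums to 1)
print(round(prop.table(counts, margin = 2), 3))

# Share of recurrence events per age group
recurrence_share <- counts["recurrence-events", ] / colSums(counts)
print(round(recurrence_share, 3))
```

Read this way, the 40-49 group has the most recurrence events in absolute terms, but the share of recurrence is highest in the 30-39 group (15 of 36 patients).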
#correlation coefficient between variables
pairs.panels(data)
This graph shows the correlation coefficients between all pairs of variables. The highest correlation coefficient (0.72) is between age and menopause (statistically significant), while the correlation coefficient between menopause and tumor.size is 0 (not statistically significant). The graph also shows scatter plots for each pair of variables and a histogram for each variable.
The data might also contain NA values, so we check for them before building the model.
sum(is.na(data))
## [1] 0
No NA values appear in the dataset, but the summary shows some “?” values. Therefore, we remove the rows with “?” in the breast.quad and node.caps columns.
data = filter(data, breast.quad != '?' & node.caps != '?')
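As an alternative sketch, the “?” markers could be turned into proper NA values while reading the file, using the `na.strings` argument of `read.csv()`. The small inline CSV and the names `csv_text`, `toy`, and `toy_clean` below are made up for illustration; only the column names come from the dataset.

```r
# Hypothetical four-row extract with the same "?" convention as the dataset
csv_text <- "Class,node.caps,breast.quad
no-recurrence-events,no,left_low
recurrence-events,?,right_up
no-recurrence-events,yes,?
recurrence-events,yes,central"

# na.strings = "?" makes read.csv() store "?" as NA instead of as a factor level
toy <- read.csv(text = csv_text, na.strings = "?", stringsAsFactors = TRUE)

# Number of missing values per column
print(colSums(is.na(toy)))

# Keep only the rows without any missing value
toy_clean <- toy[complete.cases(toy), ]
print(toy_clean)
```

With this approach `sum(is.na(data))` would report the missing entries directly, and “?” never becomes a factor level.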
Before exploring other aspects of the dataset and building the model, we produce bar graphs in ggplot for each attribute, with the class variable highlighted in color, to see whether there are any interesting interactions between the covariates and the class variable. One graph is produced for each of the 9 attributes.
p1 = ggplot(data, aes(x=inv.nodes, fill=Class)) + geom_bar(position='dodge') + labs(title='Histogram of Inv Nodes Grouped by Class',x='Inv Nodes',y='Count')
p2 = ggplot(data, aes(x=menopause, fill=Class)) + geom_bar(position='dodge') + labs(title='Histogram of Menopause Grouped by Class',x='Menopause',y='Count')
p3 = ggplot(data, aes(x=irradiat, fill=Class)) + geom_bar(position='dodge') + labs(title='Histogram of Irradiat Grouped by Class',x='Irradiat',y='Count')
p4 = ggplot(data, aes(x=age, fill=Class)) + geom_bar(position='dodge') + labs(title='Histogram of Age Grouped by Class',x='Age',y='Count')
p5 = ggplot(data, aes(x=breast.quad, fill=Class)) + geom_bar(position='dodge') + labs(title='Histogram of Breast Quadrant Grouped by Class',x='Breast Quadrant',y='Count')
p6 = ggplot(data, aes(x=tumor.size, fill=Class)) + geom_bar(position='dodge') + labs(title='Histogram of Tumor size Grouped by Class',x=' Tumor size',y='Count')
p7 = ggplot(data, aes(x= node.caps, fill=Class)) + geom_bar(position='dodge') + labs(title='Histogram of node.caps Grouped by Class',x='Node.caps',y='Count')
p8 = ggplot(data, aes(x=breast, fill=Class)) + geom_bar(position='dodge') + labs(title='Histogram of Breast Grouped by Class',x='Breast',y='Count')
p9 = ggplot(data, aes(x=deg.malig, fill=Class)) + geom_bar(position='dodge') + labs(title='Histogram of Deg.malig Grouped by Class',x='Degree of Malignancy',y='Count')
grid.arrange(p1,p2,p3,p4,p5,p6,p7,p8,p9)
From these graphs, we can make the following observations:
+ The age groups are roughly bell-shaped, with most patients aged 40-59. Breast quadrant is a nominal variable, so the shape of its bar chart does not describe a distribution in the same sense; left_low and left_up are by far the most common quadrants.
+ In the inv.nodes chart, both classes are most frequent in the lowest group of involved axillary lymph nodes (0-2).
+ In the degree of malignancy chart, recurrence events become more frequent as the degree increases, whereas no-recurrence events are most frequent at degree 2, then 1, and then 3.
#Density plot
ggplot(data,aes(x=`deg.malig`,fill=`tumor.size`))+geom_density(alpha=0.4)+ggtitle("Deg Malignant vs Tumor size")+xlab("Deg Malignant")+ylab("Density")
Visualizing the distribution of the degree of malignancy at different levels of tumor.size as a multi-density plot lets us compare how deg.malig varies with tumor.size.
#No NA values remain after filtering out "?", so no imputation is needed
#Labeling features (levels are matched by name, which also drops the unused "?" level)
data$Class = factor(data$Class, labels = c(0,1), levels = c("no-recurrence-events", "recurrence-events"))
data$age = factor(data$age, labels = c(0,1,2,3,4,5), levels = c("20-29","30-39","40-49","50-59","60-69","70-79"))
data$menopause = factor(data$menopause, labels = c(0,1,2), levels = c("premeno","ge40","lt40"))
data$tumor.size = factor(data$tumor.size, labels = c(0,1,2,3,4,5,6,7,8,9,10), levels = c("0-4","5-9","10-14", "15-19", "20-24","25-29", "30-34", "35-39","40-44","45-49","50-54"))
data$inv.nodes = factor(data$inv.nodes, labels = c(0,1,2,3,4,5,6), levels = c("0-2","3-5","6-8", "9-11","12-14", "15-17", "24-26"))
data$node.caps = factor(data$node.caps, labels = c(0,1), levels = c("no", "yes"))
data$deg.malig = factor(data$deg.malig, labels = c(0,1,2), levels = c("1", "2", "3"))
data$breast = factor(data$breast, labels = c(0,1), levels = c("left", "right"))
data$breast.quad = factor(data$breast.quad, labels = c(0,1,2,3,4), levels = c("central", "left_low", "left_up", "right_low", "right_up"))
data$irradiat = factor(data$irradiat, labels = c(0,1), levels = c("no", "yes"))
#Testing deg.malig vs Class
t.test(as.numeric(data$deg.malig)~data$Class)
##
## Welch Two Sample t-test
##
## data: as.numeric(data$deg.malig) by data$Class
## t = -5.8131, df = 149.93, p-value = 3.563e-08
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.7088626 -0.3492125
## sample estimates:
## mean in group 0 mean in group 1
## 1.903061 2.432099
From the t-test results, we see that the p-value (3.563e-08) is much smaller than the chosen Type I error rate of 0.05 (5%). This supports rejecting H0 (the mean degree of malignancy is the same in both classes) in favor of the two-sided alternative Ha that the means differ: the mean is higher in the recurrence group (2.43) than in the no-recurrence group (1.90).
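The numbers in the t-test output are linked by simple arithmetic, which makes them easy to sanity-check. The sketch below uses only values copied from the output above; the variable names are illustrative.

```r
# Values copied from the Welch two-sample t-test output above
mean_no_recurrence <- 1.903061
mean_recurrence    <- 2.432099
t_stat   <- -5.8131
ci_lower <- -0.7088626
ci_upper <- -0.3492125

# Estimated difference in means (group 0 minus group 1)
mean_diff <- mean_no_recurrence - mean_recurrence

# The confidence interval is symmetric around the estimated difference
ci_midpoint <- (ci_lower + ci_upper) / 2

# Standard error implied by t = difference / standard error
std_error <- mean_diff / t_stat

print(c(mean_diff = mean_diff, ci_midpoint = ci_midpoint, std_error = std_error))
```

The interval lies entirely below zero, which agrees with the small p-value: the recurrence group has the higher mean degree of malignancy.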
We want to predict an outcome variable that is categorical (binary), so we use logistic regression.
log_data <- data[c(1,7,10)]
#shuffling
set.seed(1000)
shuf_ind <- sample(1:nrow(log_data))
log_data <- log_data[shuf_ind, ]
#splitting data
set.seed(123)
split <-sample.split(log_data$Class,SplitRatio =0.8)
#training set
training_set <-subset(log_data,split==T)
#test set
test_set <-subset(log_data,split==F)
#generalised linear model
classifier <- glm(formula = Class~. ,family = binomial(),data=training_set)
summary(classifier)
##
## Call:
## glm(formula = Class ~ ., family = binomial(), data = training_set)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5177 -0.5991 -0.5991 0.8717 2.0011
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.8571 0.3870 -4.798 1.6e-06 ***
## deg.malig1 0.2305 0.4683 0.492 0.622508
## deg.malig2 1.7253 0.4610 3.743 0.000182 ***
## irradiat1 0.9035 0.3636 2.485 0.012962 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 268.46 on 221 degrees of freedom
## Residual deviance: 233.96 on 218 degrees of freedom
## AIC: 241.96
##
## Number of Fisher Scoring iterations: 4
pre = predict(classifier,type="response",newdata=test_set[-1])
y_pre <- ifelse(pre>0.5,"1","0")
confusionMatrix(factor(y_pre), factor(test_set$Class))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 37 13
## 1 2 3
##
## Accuracy : 0.7273
## 95% CI : (0.5904, 0.8386)
## No Information Rate : 0.7091
## P-Value [Acc > NIR] : 0.449190
##
## Kappa : 0.1709
##
## Mcnemar's Test P-Value : 0.009823
##
## Sensitivity : 0.9487
## Specificity : 0.1875
## Pos Pred Value : 0.7400
## Neg Pred Value : 0.6000
## Prevalence : 0.7091
## Detection Rate : 0.6727
## Detection Prevalence : 0.9091
## Balanced Accuracy : 0.5681
##
## 'Positive' Class : 0
##
From the output above, the coefficients table shows the beta coefficient estimates and their significance levels. The columns are:
+ Estimate: the intercept (b0) and the beta coefficient estimates associated with each predictor variable.
+ Std. Error: the standard error of the coefficient estimates. This represents the precision of the coefficients: the larger the standard error, the less confident we are about the estimate.
+ z value: the z-statistic, which is the coefficient estimate (column 2) divided by the standard error of the estimate (column 3).
+ Pr(>|z|): the p-value corresponding to the z-statistic. The smaller the p-value, the stronger the evidence that the coefficient differs from zero.
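The relationship between these columns can be reproduced by hand. The sketch below copies the estimates and standard errors from the coefficient table above into a data frame (the name `coef_table` is illustrative) and recomputes the z values and two-sided p-values. The odds-ratio column is an addition that the original output does not show; it is simply the exponential of each estimate.

```r
# Estimates and standard errors copied from the coefficient table above
coef_table <- data.frame(
  term      = c("(Intercept)", "deg.malig1", "deg.malig2", "irradiat1"),
  estimate  = c(-1.8571, 0.2305, 1.7253, 0.9035),
  std_error = c(0.3870, 0.4683, 0.4610, 0.3636)
)

# z value = estimate / standard error
coef_table$z_value <- coef_table$estimate / coef_table$std_error

# Two-sided p-value from the standard normal distribution
coef_table$p_value <- 2 * pnorm(-abs(coef_table$z_value))

# Odds ratio associated with each coefficient
coef_table$odds_ratio <- exp(coef_table$estimate)

print(coef_table)
```

Small differences from the printed table are expected, because the inputs here are rounded to four decimals.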
Next, we make predictions on the test data to evaluate the performance of our logistic regression model.
The procedure is as follows:
+ Predict the class membership probabilities of the observations based on the predictor variables.
+ Assign each observation to the class with the highest probability score (i.e. recurrence if the probability is above 0.5).
The R function predict() can be used to predict the probability of recurrence-events, given the predictor values. Use the option type = "response" to obtain the probabilities directly.
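To make the two steps concrete, the sketch below applies them by hand to every combination of the two predictors, using the rounded coefficients from the summary output above rather than the fitted `classifier` object. The names `grid` and `eta` are illustrative.

```r
# Coefficients copied from the summary(classifier) output above
b0     <- -1.8571
b_deg1 <-  0.2305
b_deg2 <-  1.7253
b_irr  <-  0.9035

# All combinations of the recoded predictors (deg.malig 0/1/2, irradiat 0/1)
grid <- expand.grid(deg.malig = 0:2, irradiat = 0:1)

# Linear predictor: intercept plus the coefficient of each active level
eta <- b0 +
  b_deg1 * (grid$deg.malig == 1) +
  b_deg2 * (grid$deg.malig == 2) +
  b_irr  * (grid$irradiat == 1)

# Step 1: probability of recurrence via the logistic function
grid$probability <- plogis(eta)

# Step 2: assign the class using the 0.5 threshold
grid$predicted <- ifelse(grid$probability > 0.5, "recurrence", "no_recurrence")

print(grid)
```

Under these coefficients only the combination of the highest degree of malignancy (deg.malig = 2) with irradiat = 1 crosses the 0.5 threshold, which is consistent with the small number of predicted recurrences.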
#probability predict
pre = predict(classifier,type="response",newdata=test_set[-1])
y_pre <- ifelse(pre>0.5,"recurrence","no_recurrence")
y_pre
## 29 154 96 140 135
## "no_recurrence" "no_recurrence" "no_recurrence" "no_recurrence" "no_recurrence"
## 136 163 109 269 40
## "no_recurrence" "no_recurrence" "no_recurrence" "recurrence" "no_recurrence"
## 56 108 265 113 66
## "no_recurrence" "no_recurrence" "no_recurrence" "no_recurrence" "no_recurrence"
## 212 105 95 190 125
## "no_recurrence" "no_recurrence" "no_recurrence" "no_recurrence" "no_recurrence"
## 210 240 197 179 2
## "no_recurrence" "recurrence" "no_recurrence" "recurrence" "no_recurrence"
## 116 204 189 104 63
## "no_recurrence" "no_recurrence" "no_recurrence" "no_recurrence" "no_recurrence"
## 251 120 55 166 106
## "no_recurrence" "no_recurrence" "no_recurrence" "no_recurrence" "no_recurrence"
## 215 244 86 187 48
## "no_recurrence" "no_recurrence" "no_recurrence" "no_recurrence" "no_recurrence"
## 19 80 202 245 205
## "no_recurrence" "no_recurrence" "no_recurrence" "no_recurrence" "no_recurrence"
## 257 236 184 142 12
## "no_recurrence" "recurrence" "no_recurrence" "no_recurrence" "no_recurrence"
## 183 100 235 178 141
## "no_recurrence" "no_recurrence" "no_recurrence" "recurrence" "no_recurrence"
set.seed(100)
split <- sample.split(data$Class, SplitRatio = 0.8)
y_train <- subset(data, split == TRUE)
y_train <- y_train[1]
y_train = as.matrix(y_train)
y_train <- as.numeric(y_train)
y_test <- subset(data, split == FALSE)
y_test <- y_test[1]
y_test = as.matrix(y_test)
y_test <- as.numeric(y_test)
data_col_1 <- colnames(data)
data_col_1 <- data_col_1[data_col_1 != "Class"]
data <- data[data_col_1]
x_train <- subset(data, split == TRUE)
x_train <- x_train[1:9]
x_train$age <- as.numeric(x_train$age)
x_train$menopause <- as.numeric(x_train$menopause)
x_train$tumor.size <- as.numeric(x_train$tumor.size)
x_train$inv.nodes <- as.numeric(x_train$inv.nodes)
x_train$node.caps <- as.numeric(x_train$node.caps)
x_train$deg.malig <- as.numeric(x_train$deg.malig)
x_train$breast <- as.numeric(x_train$breast)
x_train$breast.quad <- as.numeric(x_train$breast.quad)
x_train$irradiat <- as.numeric(x_train$irradiat)
x_train <- as.matrix(x_train)
x_train <- scale(x_train)
x_test <- subset(data, split == FALSE)
x_test <- x_test[1:9]
x_test$age <- as.numeric(x_test$age)
x_test$menopause <- as.numeric(x_test$menopause)
x_test$tumor.size <- as.numeric(x_test$tumor.size)
x_test$inv.nodes <- as.numeric(x_test$inv.nodes)
x_test$node.caps <- as.numeric(x_test$node.caps)
x_test$deg.malig <- as.numeric(x_test$deg.malig)
x_test$breast <- as.numeric(x_test$breast)
x_test$breast.quad <- as.numeric(x_test$breast.quad)
x_test$irradiat <- as.numeric(x_test$irradiat)
x_test <- as.matrix(x_test)
col_means_train <- attr(x_train, "scaled:center")
col_stddevs_train <- attr(x_train, "scaled:scale")
x_test <- scale(x_test, center = col_means_train, scale = col_stddevs_train)
x_test
## age menopause tumor.size inv.nodes node.caps deg.malig
## 7 0.3701586 -0.8858197 0.04825758 -0.4344863 -0.4677502 -0.07810125
## 12 0.3701586 0.9520493 0.04825758 -0.4344863 -0.4677502 -0.07810125
## 28 1.3369259 0.9520493 0.04825758 -0.4344863 -0.4677502 1.25562783
## 32 0.3701586 -0.8858197 -0.88332348 -0.4344863 -0.4677502 -0.07810125
## 34 0.3701586 0.9520493 -1.34911400 -0.4344863 -0.4677502 -1.41183033
## 36 -1.5633759 -0.8858197 0.51404810 -0.4344863 -0.4677502 -0.07810125
## 39 -0.5966086 -0.8858197 -1.34911400 -0.4344863 -0.4677502 -0.07810125
## 42 1.3369259 0.9520493 0.04825758 -0.4344863 -0.4677502 -0.07810125
## 44 -0.5966086 -0.8858197 -1.34911400 -0.4344863 -0.4677502 -0.07810125
## 48 0.3701586 -0.8858197 0.04825758 -0.4344863 -0.4677502 -1.41183033
## 63 0.3701586 0.9520493 -2.28069506 -0.4344863 -0.4677502 -1.41183033
## 74 0.3701586 -0.8858197 -0.88332348 -0.4344863 -0.4677502 -0.07810125
## 77 0.3701586 0.9520493 -1.34911400 -0.4344863 -0.4677502 -0.07810125
## 79 0.3701586 -0.8858197 0.04825758 -0.4344863 -0.4677502 -0.07810125
## 83 1.3369259 0.9520493 -0.88332348 -0.4344863 -0.4677502 -0.07810125
## 84 0.3701586 0.9520493 -0.88332348 -0.4344863 -0.4677502 -0.07810125
## 91 -1.5633759 -0.8858197 -2.28069506 -0.4344863 -0.4677502 -0.07810125
## 95 0.3701586 -0.8858197 -1.34911400 -0.4344863 -0.4677502 -1.41183033
## 107 -0.5966086 -0.8858197 -0.41753295 -0.4344863 -0.4677502 -0.07810125
## 111 0.3701586 0.9520493 0.04825758 -0.4344863 -0.4677502 -1.41183033
## 115 -0.5966086 -0.8858197 0.97983863 -0.4344863 -0.4677502 -0.07810125
## 117 -1.5633759 -0.8858197 -0.88332348 -0.4344863 -0.4677502 -1.41183033
## 120 1.3369259 0.9520493 -0.88332348 -0.4344863 -0.4677502 -1.41183033
## 127 -1.5633759 -0.8858197 0.51404810 1.4568070 2.1282633 -0.07810125
## 128 -1.5633759 -0.8858197 0.04825758 1.4568070 2.1282633 -0.07810125
## 132 -0.5966086 -0.8858197 1.44562916 0.5111604 2.1282633 1.25562783
## 144 -0.5966086 -0.8858197 1.91141969 -0.4344863 -0.4677502 -0.07810125
## 147 0.3701586 -0.8858197 0.51404810 0.5111604 2.1282633 -0.07810125
## 152 0.3701586 0.9520493 0.97983863 4.2937470 -0.4677502 1.25562783
## 156 0.3701586 0.9520493 0.04825758 0.5111604 2.1282633 1.25562783
## 157 0.3701586 -0.8858197 0.51404810 -0.4344863 -0.4677502 -1.41183033
## 161 -0.5966086 -0.8858197 0.51404810 0.5111604 2.1282633 -0.07810125
## 170 1.3369259 0.9520493 0.51404810 1.4568070 2.1282633 -0.07810125
## 174 -0.5966086 0.9520493 0.04825758 -0.4344863 -0.4677502 -0.07810125
## 177 -0.5966086 -0.8858197 -0.41753295 -0.4344863 -0.4677502 1.25562783
## 183 1.3369259 0.9520493 -0.88332348 -0.4344863 -0.4677502 -0.07810125
## 190 0.3701586 -0.8858197 2.37721021 -0.4344863 2.1282633 -0.07810125
## 191 0.3701586 0.9520493 0.97983863 -0.4344863 -0.4677502 -0.07810125
## 194 0.3701586 0.9520493 -0.88332348 -0.4344863 2.1282633 -0.07810125
## 202 0.3701586 -0.8858197 0.04825758 -0.4344863 -0.4677502 -0.07810125
## 203 0.3701586 -0.8858197 0.51404810 -0.4344863 -0.4677502 1.25562783
## 204 -0.5966086 -0.8858197 0.97983863 -0.4344863 -0.4677502 -1.41183033
## 206 0.3701586 0.9520493 -0.41753295 -0.4344863 -0.4677502 -0.07810125
## 207 -0.5966086 -0.8858197 0.51404810 -0.4344863 -0.4677502 1.25562783
## 217 1.3369259 0.9520493 0.04825758 -0.4344863 -0.4677502 1.25562783
## 224 1.3369259 0.9520493 1.91141969 -0.4344863 -0.4677502 -1.41183033
## 233 -0.5966086 -0.8858197 -0.41753295 0.5111604 2.1282633 -0.07810125
## 239 -1.5633759 -0.8858197 0.51404810 2.4024537 -0.4677502 -0.07810125
## 252 0.3701586 0.9520493 0.51404810 1.4568070 2.1282633 -0.07810125
## 255 -0.5966086 0.9520493 0.04825758 3.3481003 2.1282633 1.25562783
## 256 1.3369259 0.9520493 0.04825758 -0.4344863 -0.4677502 1.25562783
## 262 0.3701586 0.9520493 0.51404810 1.4568070 2.1282633 1.25562783
## 266 1.3369259 0.9520493 0.51404810 0.5111604 2.1282633 -0.07810125
## 269 1.3369259 0.9520493 -1.34911400 1.4568070 2.1282633 1.25562783
## 270 0.3701586 -0.8858197 0.97983863 4.2937470 2.1282633 1.25562783
## breast breast.quad irradiat
## 7 -0.9711339 -0.7102848 -0.5100846
## 12 -0.9711339 -0.7102848 -0.5100846
## 28 1.0250858 0.2283058 -0.5100846
## 32 1.0250858 -0.7102848 -0.5100846
## 34 1.0250858 0.2283058 -0.5100846
## 36 -0.9711339 0.2283058 -0.5100846
## 39 -0.9711339 -0.7102848 -0.5100846
## 42 -0.9711339 -0.7102848 -0.5100846
## 44 -0.9711339 0.2283058 -0.5100846
## 48 1.0250858 0.2283058 -0.5100846
## 63 -0.9711339 -0.7102848 -0.5100846
## 74 1.0250858 1.1668965 -0.5100846
## 77 -0.9711339 -0.7102848 -0.5100846
## 79 -0.9711339 -0.7102848 -0.5100846
## 83 1.0250858 -0.7102848 -0.5100846
## 84 1.0250858 -0.7102848 -0.5100846
## 91 1.0250858 -1.6488754 -0.5100846
## 95 -0.9711339 -0.7102848 -0.5100846
## 107 -0.9711339 0.2283058 -0.5100846
## 111 -0.9711339 -0.7102848 -0.5100846
## 115 1.0250858 2.1054871 -0.5100846
## 117 -0.9711339 -0.7102848 -0.5100846
## 120 -0.9711339 1.1668965 -0.5100846
## 127 1.0250858 2.1054871 -0.5100846
## 128 1.0250858 0.2283058 1.9516281
## 132 1.0250858 0.2283058 1.9516281
## 144 -0.9711339 -0.7102848 1.9516281
## 147 -0.9711339 -0.7102848 1.9516281
## 152 -0.9711339 -0.7102848 -0.5100846
## 156 1.0250858 0.2283058 -0.5100846
## 157 -0.9711339 -1.6488754 -0.5100846
## 161 1.0250858 -0.7102848 -0.5100846
## 170 1.0250858 2.1054871 -0.5100846
## 174 -0.9711339 -0.7102848 -0.5100846
## 177 1.0250858 -0.7102848 1.9516281
## 183 -0.9711339 0.2283058 1.9516281
## 190 1.0250858 0.2283058 1.9516281
## 191 -0.9711339 0.2283058 -0.5100846
## 194 -0.9711339 -1.6488754 1.9516281
## 202 -0.9711339 2.1054871 -0.5100846
## 203 -0.9711339 2.1054871 -0.5100846
## 204 1.0250858 0.2283058 -0.5100846
## 206 1.0250858 -1.6488754 -0.5100846
## 207 1.0250858 2.1054871 -0.5100846
## 217 -0.9711339 1.1668965 1.9516281
## 224 1.0250858 2.1054871 1.9516281
## 233 1.0250858 2.1054871 1.9516281
## 239 1.0250858 0.2283058 1.9516281
## 252 -0.9711339 1.1668965 1.9516281
## 255 -0.9711339 1.1668965 1.9516281
## 256 -0.9711339 0.2283058 -0.5100846
## 262 -0.9711339 1.1668965 -0.5100846
## 266 -0.9711339 -1.6488754 1.9516281
## 269 -0.9711339 0.2283058 1.9516281
## 270 1.0250858 2.1054871 -0.5100846
## attr(,"scaled:center")
## age menopause tumor.size inv.nodes node.caps deg.malig
## 3.617117 1.481982 5.896396 1.459459 1.180180 2.058559
## breast breast.quad irradiat
## 1.486486 2.756757 1.207207
## attr(,"scaled:scale")
## age menopause tumor.size inv.nodes node.caps deg.malig
## 1.0343751 0.5441084 2.1468878 1.0574774 0.3852060 0.7497775
## breast breast.quad irradiat
## 0.5009469 1.0654272 0.4062212
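The preparation above converts each factor to its integer code column by column and then scales the test set with the training-set centers and scales. A more compact sketch of the same idea, assuming toy data frames `train_demo` and `test_demo` invented for illustration, uses `data.matrix()`, which converts all factor columns to their integer codes in one call.

```r
# Hypothetical toy data with two factor predictors
train_demo <- data.frame(
  age      = factor(c("30-39", "40-49", "50-59", "40-49")),
  irradiat = factor(c("no", "yes", "no", "no"))
)
test_demo <- data.frame(
  age      = factor(c("40-49", "50-59"), levels = levels(train_demo$age)),
  irradiat = factor(c("yes", "no"), levels = levels(train_demo$irradiat))
)

# data.matrix() replaces every factor by its integer codes
x_train_demo <- scale(data.matrix(train_demo))

# Reuse the training centers and scales so that no test information leaks in
x_test_demo <- scale(
  data.matrix(test_demo),
  center = attr(x_train_demo, "scaled:center"),
  scale  = attr(x_train_demo, "scaled:scale")
)

print(x_test_demo)
```

The test factors are created with the training levels so that the integer codes mean the same thing in both sets.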
# Build neural network with Class
model <- keras_model_sequential()
model %>%
layer_dense(units = 32, activation = 'relu', input_shape = dim(x_train)[2]) %>%
layer_dense(units = 16, activation = "relu") %>%
layer_dense(units = 1, activation = "sigmoid")
model %>% compile(
optimizer = optimizer_adam(),
loss = 'mse',
metrics = c('accuracy')
)
model %>% fit(x_train, y_train, epochs = 200, batch_size = 64)
score <- model %>% evaluate(x_test, y_test)
p1 <- model %>% predict(x_test)
pred = as.factor(ifelse(p1 > 0.5 , '1','0'))
confusionMatrix(factor(pred),factor(y_test))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 31 8
## 1 8 8
##
## Accuracy : 0.7091
## 95% CI : (0.571, 0.8237)
## No Information Rate : 0.7091
## P-Value [Acc > NIR] : 0.5669
##
## Kappa : 0.2949
##
## Mcnemar's Test P-Value : 1.0000
##
## Sensitivity : 0.7949
## Specificity : 0.5000
## Pos Pred Value : 0.7949
## Neg Pred Value : 0.5000
## Prevalence : 0.7091
## Detection Rate : 0.5636
## Detection Prevalence : 0.7091
## Balanced Accuracy : 0.6474
##
## 'Positive' Class : 0
##
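The summary statistics reported by confusionMatrix() follow directly from the four cell counts. The sketch below recomputes them for both models from the matrices printed above; the helper `metrics()` is a hypothetical function written for this illustration, and class 0 is treated as the positive class, as in the output.

```r
# Hypothetical helper: rows = predicted class, columns = reference class (0 then 1)
metrics <- function(cm) {
  sensitivity <- cm[1, 1] / sum(cm[, 1])  # reference 0 correctly predicted as 0
  specificity <- cm[2, 2] / sum(cm[, 2])  # reference 1 correctly predicted as 1
  c(
    accuracy          = sum(diag(cm)) / sum(cm),
    sensitivity       = sensitivity,
    specificity       = specificity,
    balanced_accuracy = (sensitivity + specificity) / 2
  )
}

labels <- list(Prediction = c("0", "1"), Reference = c("0", "1"))

# Counts copied from the two confusion matrices above (filled column by column)
logistic_cm <- matrix(c(37, 2, 13, 3), nrow = 2, dimnames = labels)
network_cm  <- matrix(c(31, 8, 8, 8),  nrow = 2, dimnames = labels)

print(round(rbind(logistic = metrics(logistic_cm),
                  network  = metrics(network_cm)), 4))
```

Because class 0 is the positive class, the specificity column is the one that describes how well each model detects recurrence events.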
AIC provides a method for assessing the quality of a model through comparison with related models. It is based on the deviance, but penalizes you for making the model more complicated. Much like adjusted R-squared, its intent is to prevent you from including irrelevant predictors.
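For a logistic regression on a 0/1 outcome, the AIC equals the residual deviance plus twice the number of estimated coefficients, so the value in the summary output can be checked by hand. The sketch below uses only numbers from the glm summary; the likelihood-ratio comparison against the null model is an extra step that the original analysis does not perform.

```r
# Values copied from the summary(classifier) output
null_deviance     <- 268.46   # on 221 degrees of freedom
residual_deviance <- 233.96   # on 218 degrees of freedom
n_parameters      <- 4        # intercept, deg.malig1, deg.malig2, irradiat1

# AIC = deviance + 2 * number of parameters (binary outcome)
aic <- residual_deviance + 2 * n_parameters
print(aic)

# Likelihood-ratio test of the fitted model against the intercept-only model
lr_statistic <- null_deviance - residual_deviance
lr_df        <- 221 - 218
lr_p_value   <- pchisq(lr_statistic, df = lr_df, lower.tail = FALSE)
print(c(lr_statistic = lr_statistic, lr_df = lr_df, lr_p_value = lr_p_value))
```

The penalty term is what distinguishes AIC from the deviance: adding a predictor always lowers the deviance, but it lowers the AIC only if the deviance drops by more than 2.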
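Using the estimates from the coefficient table above, the fitted logistic equation can be written as \(p = \frac{e^{\eta}}{1 + e^{\eta}}\), where \(\eta = -1.8571 + 0.2305 \cdot \text{deg.malig1} + 1.7253 \cdot \text{deg.malig2} + 0.9035 \cdot \text{irradiat1}\) and each indicator is 1 when the observation has that recoded level and 0 otherwise.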
The training graph shows good performance on the training set (training loss: 0.0912, training accuracy: 0.9324). However, the test accuracy is only 0.7091, which illustrates a problem with the neural network model: overfitting. Overfitting is a modeling error that occurs when a function is fit too closely to a limited set of data points. It generally takes the form of an overly complex model that explains idiosyncrasies in the data under study.
Logistic regression and artificial neural networks are the models of choice in many medical data classification tasks. In this comparison, the logistic regression model was slightly more accurate on its test set than the neural network (accuracy 0.7273 versus 0.7091), although neither accuracy is significantly above the no-information rate of 0.7091. Therefore, logistic regression seems to be an appropriate method for classifying breast cancer recurrence in this dataset.