As reported by the American Cancer Society, as of 2024 breast cancer accounts for approximately 1 in 3 of all new female cancer cases in the United States annually. Despite the steadily declining rates in breast cancer mortality, breast cancer is still the second leading cause of cancer death in women. As early detection and diagnosis is pivotal to improved treatment outcome, increasing need has been identified for improved predictive models to assist in rapid diagnosis.
Tissue biopsy is currently the only diagnostic tool to determine cancer content in identified breast tumors. Majority of breast tissue biopsy is completed utilizing a core needle biopsy; a procedure that collects 1 or multiple small cylinders of tumor tissue using a fine gauge hollow needle. As illustrated in the image below, there is a chance that an incomplete tumor punch can be collected, leading to increased difficulty in diagnosis. Core needle biopsies are preferred by physicians and patients due to the increased tissue volume obtained compared to a fine needle aspirate biopsy and the ability to biopsy in an outpatient setting without surgery.
In order to improve breast cancer diagnostic capabilities through minimal tissue collection biopsies such as core needle punch biopsies, improved predictive models must be developed to assist in filling in the blanks in the event an incomplete punch biopsy. Increasing predictive ability with minimal tissue requirements will in effect reduce patient burden and increase the speed at which malignant tissue is diagnosed.
raw.data <- read.csv("https://nlepera.github.io/sta551/HW03/data/syntheticBreastCancerData.csv") %>%
subset( , select = c(2:11))
colnames(raw.data)[colnames(raw.data) == "Single_Epithelial_Cell_Size"] <- "eCellSize"
colnames(raw.data)[colnames(raw.data) == "Thickness_of_Clump"] <- "clumpThickness"
colnames(raw.data)[colnames(raw.data) == "Cell_Size_Uniformity"] <- "sizeUniform"
colnames(raw.data)[colnames(raw.data) == "Cell_Shape_Uniformity"] <- "shapeUniform"
colnames(raw.data)[colnames(raw.data) == "Marginal_Adhesion"] <- "margAdh"
colnames(raw.data)[colnames(raw.data) == "Bare_Nuclei"] <- "bareNuclei"
colnames(raw.data)[colnames(raw.data) == "Bland_Chromatin"] <- "blChrom"
colnames(raw.data)[colnames(raw.data) == "Normal_Nucleoli"] <- "normalNucleoli"
raw.data$Outcome <- as.factor(as.character(raw.data$Outcome))
Synthetic Breast Cancer Data was obtained from Applied Analytics
through Case Studies using SAS and R by Deepti Gupta [link].
The synthetic
data set initially contained 600 observations
and 11 variables. Initial variable names were converted for ease of
utilization and the first variable Sample_No
was dropped
for analysis purposes, as Sample_No
was an unrelated
identifier key rather than a true variable.
Variables retained are as follows:
clumpThickness
: Integer value measuring thickness
of tumor clump obtained by biopsy. Mono-layers (clumpThickness = 1)
indicate benign mass (no cancer) while higher values indicate malignancy
(cancer).
sizeUniform
: Integer value measuring variation in
cell size. Greater values indicate greater variation in cell size.
Increased cell size variation considered associated with
malignancy.
shapeUniform
: Integer value measuring variation in
cell shape. Greater values indicate greater variation in cell shape.
Increased cell shape variation considered associated with
malignancy.
margAdh
: Integer value measuring variation in
adhesion among neighboring cells within a clump. Greater values indicate
greater adhesion between neighboring cells. Decreased adhesion
considered associated with malignancy.
eCellSize
: Integer value measuring respective size
of single epithelial cells obtained from biopsy. Greater values indicate
greater epithelial cell size. Increased epithelial cell size considered
associated with malignancy.
bareNuclei
: Integer value measuring count of bare
nuclei observed in sample cells. Greater values indicate greater number
of cells with bare nuclei surrounded by cytoplasm. Increased bare nuclei
count considered associated with malignancy.
blChrom
: Integer value measuring coarseness of
chromatin. Greater values indicate greater chromatin coarseness.
Increased chromatin coarseness considered associated with
malignancy.
normalNucleoli
: Integer value measuring size normal
(mean) nucleoli size as measured in biopsy sample. Greater values
indicate mean nucleoli size. Increased mean nucleoli considered
associated with malignancy.
Mitoses
: Integer value measuring the number of
cells undergoing mitosis observed in the biopsy sample. Greater values
indicate greater rates of mitosis. Increased mitosis rates are
considered associated with malignancy.
Outcome
: Character value classifying final biopsy
diagnosis. “No” indicates tumor diagnosed as benign (no cancer). “Yes”
indicates tumor diagnosed as malignant (cancer).
Once the variable names were transformed it was confirmed the resultant data set contained 600 observations across 10 variables with 0 missing values. As no values are missing from the data set it was determined that no imputation was necessary.
kable(summary(raw.data), caption = "Structure of Synthetic Breast Cancer Data", align = "c", size = 6) %>%
kable_styling() %>%
row_spec(0, font_size = 12)
clumpThickness | sizeUniform | shapeUniform | margAdh | eCellSize | bareNuclei | blChrom | normalNucleoli | Mitoses | Outcome | |
---|---|---|---|---|---|---|---|---|---|---|
Min. : 1.00 | Min. : 1.000 | Min. : 1.000 | Min. : 1.000 | Min. : 1.000 | Min. : 1.0 | Min. : 1.000 | Min. : 1.000 | Min. : 1.000 | No :380 | |
1st Qu.: 3.00 | 1st Qu.: 2.000 | 1st Qu.: 2.000 | 1st Qu.: 2.000 | 1st Qu.: 3.000 | 1st Qu.: 2.0 | 1st Qu.: 3.000 | 1st Qu.: 2.000 | 1st Qu.: 1.000 | Yes:220 | |
Median : 5.00 | Median : 3.000 | Median : 3.000 | Median : 3.000 | Median : 4.000 | Median : 3.0 | Median : 4.000 | Median : 3.000 | Median : 2.000 | NA | |
Mean : 5.41 | Mean : 4.122 | Mean : 4.195 | Mean : 3.763 | Mean : 4.293 | Mean : 4.5 | Mean : 4.495 | Mean : 3.812 | Mean : 2.093 | NA | |
3rd Qu.: 7.25 | 3rd Qu.: 6.000 | 3rd Qu.: 6.000 | 3rd Qu.: 5.000 | 3rd Qu.: 5.000 | 3rd Qu.: 9.0 | 3rd Qu.: 6.000 | 3rd Qu.: 5.000 | 3rd Qu.: 2.000 | NA | |
Max. :10.00 | Max. :10.000 | Max. :10.000 | Max. :10.000 | Max. :10.000 | Max. :10.0 | Max. :10.000 | Max. :10.000 | Max. :10.000 | NA |
kable(sapply(raw.data, function (x) sum(is.na(x))), caption = "Count of Blank Values in for all Labels", col.names = c("Labels", "Count of Blank Values")) %>%
kable_styling()
Labels | Count of Blank Values |
---|---|
clumpThickness | 0 |
sizeUniform | 0 |
shapeUniform | 0 |
margAdh | 0 |
eCellSize | 0 |
bareNuclei | 0 |
blChrom | 0 |
normalNucleoli | 0 |
Mitoses | 0 |
Outcome | 0 |
One engineered variable was created to convert Outcome
from a character variable to a binary integer response variable:
out
: Engineered outcome variable converted to binary
integer response variable. 1 = “No” (biopsy diagnosed as benign) and 2 =
“Yes” (biopsy diagnosed as malignant).A brief summary of the variables with original variable values and additional supplimental details can be found at: [link].
Upon visualizing the pairwise distribution of all integer variables,
controlled for Outcome, it becomes evident that there are multiple
significant correlations across multiple variables. 16 combinations
demonstrate a correlation greater than 0.600 indicating a high value of
correlation between these variable pairs. The greatest correlation seen
was between shapeUniform
and
sizeUniform
indicating an increase in cell shape variation
is accompanied by an increase in cell size variation.
ggpairs(raw.data,
columns = c(1:10),
aes(alpha = 0.05, color = Outcome),
lower = list(continuous = wrap("smooth", method = "lm", se = FALSE, alpha = 0.25)),
upper = list(continuous = wrap("cor", size = 3))) +
theme (
strip.text = element_text(size = 7),
axis.text = element_text(size = 8),
axis.title = element_text(size = 6),
legend.text = element_text(size = 8),
legend.title = element_text(size = 8)
)
The high correlation values across nearly all variables in the data set underpin the need for a complex modeling procedure to create accurate predictions of the biopsy diagnosis.
Two logistic prediction models were created to examine the potential predictive value of a multiple variable logistic regression model in determining breast cancer biopsy outcomes. An initial model including all variables was created, and then trimmed down to create a secondary reduced model for more precise regression analysis and potential future predictions.
To start, a logistic regression model was created to compare the
logistic regression relationship between the Outcome
variable and all integer variables in the data set. This initial model
is referred to henceforth as full.model.logit
. The
full model
full.model.legit` is represented as follows:
Outcome ~ Clump Thickness + Cell Size Uniformity + Cell Shape Uniformity + Marginal Adhesion + Single Epithelial Cell Size + Bare Nuclei + Bland Chromatin + Normal Nucleoli + Mitoses
full.model.logit <- glm(Outcome ~ ., family = binomial, data = raw.data)
A summary of regression coefficients, z values, and probability of associated z values (p vlaue associated with listed z value) for each variable in the logistic regression is included below. Multiple variables illustrate a strong correlation as indicated by estimate values closer to 1.0, with strong statistic significance as indicated by Pr(|z|) values less than 0.050000. The stronger a correlation indicated in the Estimate value, the stronger that variable will serve as a predictive value during predictive model creation.
A good fit model will rely on both correlation and statistical signficance to prodcue a model that is sensitive, specific, and accurate.
kable(summary(full.model.logit)$coef, caption = "Logistic Regression Coefficients for Full Model", col.names = c("" , "Estimate", "Std. Error", "z Value", 'Pr (|>z|)'), format = "html") %>%
kable_styling()
Estimate | Std. Error | z Value | Pr (|>z|) | |
---|---|---|---|---|
(Intercept) | -11.3994839 | 1.2546649 | -9.0856797 | 0.0000000 |
clumpThickness | 0.4669781 | 0.1257561 | 3.7133626 | 0.0002045 |
sizeUniform | 0.0313879 | 0.1567402 | 0.2002543 | 0.8412817 |
shapeUniform | 0.3685873 | 0.1686559 | 2.1854394 | 0.0288566 |
margAdh | 0.1968390 | 0.1111111 | 1.7715519 | 0.0764690 |
eCellSize | 0.0760556 | 0.1391434 | 0.5465989 | 0.5846543 |
bareNuclei | 0.4282675 | 0.0871709 | 4.9129632 | 0.0000009 |
blChrom | 0.3640725 | 0.1365713 | 2.6658054 | 0.0076804 |
normalNucleoli | 0.1406214 | 0.1015091 | 1.3853083 | 0.1659582 |
Mitoses | 0.4012046 | 0.2317122 | 1.7314786 | 0.0833664 |
To obtain improved predictive values from a model, reducing the
number of variables for analysis and inclusion can improve model fitting
without accidental over fitting. In order to best select the variables
for retention in the reduced model, an automated step()
function was utilized to identify best fit reduced model. Trace was
included to illustrate the step-wise removal of variables based on
calculated Akaike information criterion (AIC) values until the final
best fit reduced model identified.
reduced.model.logit <- step(full.model.logit, directon = "both")
Start: AIC=138.62
Outcome ~ clumpThickness + sizeUniform + shapeUniform + margAdh +
eCellSize + bareNuclei + blChrom + normalNucleoli + Mitoses
Df Deviance AIC
- sizeUniform 1 118.66 136.66
- eCellSize 1 118.92 136.92
- normalNucleoli 1 120.53 138.53
<none> 118.62 138.62
- margAdh 1 121.87 139.87
- Mitoses 1 122.62 140.62
- shapeUniform 1 123.58 141.58
- blChrom 1 125.99 143.99
- clumpThickness 1 135.04 153.04
- bareNuclei 1 145.79 163.79
Step: AIC=136.66
Outcome ~ clumpThickness + shapeUniform + margAdh + eCellSize +
bareNuclei + blChrom + normalNucleoli + Mitoses
Df Deviance AIC
- eCellSize 1 119.02 135.02
<none> 118.66 136.66
- normalNucleoli 1 120.69 136.69
- margAdh 1 122.09 138.09
- Mitoses 1 122.93 138.93
- blChrom 1 126.26 142.26
- shapeUniform 1 128.26 144.26
- clumpThickness 1 135.58 151.58
- bareNuclei 1 146.31 162.31
Step: AIC=135.02
Outcome ~ clumpThickness + shapeUniform + margAdh + bareNuclei +
blChrom + normalNucleoli + Mitoses
Df Deviance AIC
<none> 119.02 135.02
- normalNucleoli 1 121.72 135.72
- margAdh 1 123.38 137.38
- Mitoses 1 123.81 137.81
- blChrom 1 127.13 141.13
- shapeUniform 1 129.50 143.50
- clumpThickness 1 136.26 150.26
- bareNuclei 1 146.97 160.97
The step function removed the variables sizeUniform
(Cell Size Uniformity) & eCellSize
(Single Epithelial
Cell Size) based on AIC to measure the reduced model’s maximum
likelihood estimation (log-likelihood) for model fit. The AIC equation
is given by:
\[AIC = -2*ln(L) + 2k\]
A summary of the model characteristics starting with the initial full model as well as at each step leading to creation of the reduced model is included below. To note the negative signs in the Step column indicate the removal of the listed variable from the logistic regression model.
kable(reduced.model.logit$anova) %>%
kable_styling()
Step | Df | Deviance | Resid. Df | Resid. Dev | AIC |
---|---|---|---|---|---|
NA | NA | 590 | 118.6189 | 138.6189 | |
|
1 | 0.0402921 | 591 | 118.6592 | 136.6592 |
|
1 | 0.3593399 | 592 | 119.0185 | 135.0185 |
reduced model - Outcome ~ clumpThickness + shapeUniform + margAdh + bareNuclei + blChrom + normalNucleoli + Mitoses
A summary of the reduced model’s regression coefficients is included
below. Unlike with the full.model.logit
(full model) all
but two remaining variables are found to have a statistical significance
of \(Pr(|z|) < 0.05\) along with
high correlation values obtained as evident by the regression
coefficients. The combination of all these findings indicate that the
reduced logistic regression model will likely serve as a good fit as a
predictor model for breast cancer biopsy diagnosis data.
kable(summary(reduced.model.logit)$coef, caption = "Logistic Regression Coefficients for Reduced.Model", col.names = c("" , "Estimate", "Std. Error", "z Value", 'Pr (|>z|)'), format = "html") %>%
kable_styling()
Estimate | Std. Error | z Value | Pr (|>z|) | |
---|---|---|---|---|
(Intercept) | -11.3645496 | 1.2250769 | -9.276601 | 0.0000000 |
clumpThickness | 0.4725699 | 0.1250565 | 3.778850 | 0.0001576 |
shapeUniform | 0.4054938 | 0.1346086 | 3.012391 | 0.0025920 |
margAdh | 0.2169802 | 0.1064015 | 2.039259 | 0.0414242 |
bareNuclei | 0.4324737 | 0.0873739 | 4.949691 | 0.0000007 |
blChrom | 0.3766827 | 0.1349834 | 2.790585 | 0.0052613 |
normalNucleoli | 0.1596089 | 0.0974167 | 1.638413 | 0.1013355 |
Mitoses | 0.4154991 | 0.2249179 | 1.847336 | 0.0646984 |
In order to formally compare both the full.model.logit
and reduced. model.logit
models to determine which will
serve as a better predictor for biopsy diagnosis, each model’s AIC and
ROC curves were compared to determine the best fit model.
When comparing AIC values, a lower value indicates a better model fit while balancing generalization. Meanwhile a greater AIC value indicates a poor fit and subsequently poor predictive values.
full.model.logit
(full model)kable(full.model.logit$aic, caption = "AIC value of Full Model", col.names = c("full.model.logit AIC ="), align = "c") %>%
kable_styling()
full.model.logit AIC = |
---|
138.6189 |
reduced.model.logit
(reduced model)kable(reduced.model.logit$aic, caption = "AIC value of Reduced Model", col.names = c("reduced.model.logit AIC ="), align = "c") %>%
kable_styling()
reduced.model.logit AIC = |
---|
135.0185 |
When comparing by AIC values, the reduced model returns a lower AIC value indicating a better fit to the initial data. Based on AIC values the reduced model will provide improved prediction values for patient’s outcome (accurate diagnosis of malignancies during biopsy).
When comparing ROC values, the area under the curve (AUC) values and
visible Receiver Operating Characteristics (ROC) curves can be used to
compare the fit of multiple logistic regression models as predictive
models for a binary output variable such as Outcome
. The
greater the AUC the better fit the model is to the original synthetic
data & the more accurate the model will be in providing predictions
on breast tissue biopsy diagnosis.
data.logit <- raw.data
no.outcome <- which(data.logit$Outcome == "No")
yes.outcome <- which(data.logit$Outcome == "Yes")
data.logit$out = 0
data.logit$out[yes.outcome] = 1
cutoff <- seq(0, 1, length = 100)
sensitiv = NULL
specif = NULL
sensitiv2 = NULL
specif2 = NULL
accur = NULL
accur2 = NULL
predict.full.logit <- predict.glm(full.model.logit, data.logit, type = "response")
predict.reduced.logit <- predict.glm(reduced.model.logit, data.logit, type = "response")
for (i in 1:100){
data.logit$out.train = as.numeric(predict.full.logit > cutoff[i])
TN = sum(data.logit$out.train == 0 & data.logit$out == 0) # true negative
FN = sum(data.logit$out.train == 0 & data.logit$out == 1) # false negative
FP = sum(data.logit$out.train == 1 & data.logit$out == 0) # false positive
TP = sum(data.logit$out.train == 1 & data.logit$out == 1) # true positive
sensitiv[i] = TP / (TP + FN)
specif[i] = TN / (TN + FP)
accur[i] = (TP + TN) / (TP + TN + FN + FP) #accuracy
}
for (i in 1:100){
data.logit$out.train = as.numeric(predict.reduced.logit > cutoff[i])
TN2 = sum(data.logit$out.train == 0 & data.logit$out == 0) # true negative
FN2 = sum(data.logit$out.train == 0 & data.logit$out == 1) # false negative
FP2 = sum(data.logit$out.train == 1 & data.logit$out == 0) # false positive
TP2 = sum(data.logit$out.train == 1 & data.logit$out == 1) # true positive
sensitiv2[i] = TP2 / (TP2 + FN2)
specif2[i] = TN2 / (TN2 + FP2)
accur2[i] = (TP2 + TN2)/(TP2 + TN2 + FN2 + FP2) #accuracy
}
specificity <- 1 - specif
sensitivity <- sensitiv
specificity2 <- 1 - specif2
sensitivity2 <- sensitiv2
prediction <- predict.full.logit
category <- data.logit$out == 1
ROCobj <- roc(category, prediction)
AUC.full <- round(auc(ROCobj), 5)
prediction2 <- predict.reduced.logit
category2 <- data.logit$out == 1
ROCobj2 <- roc(category2, prediction2)
AUC.reduced <- round(auc(ROCobj2), 5)
specif.ori <- ROCobj$specificities
specif.ori2 <- ROCobj2$specificities
sens.ori <- ROCobj$sensitivities
sens.ori2 <- ROCobj2$sensitivities
diff <- abs(sens.ori - specif.ori)
diff2 <- abs(sens.ori2 - specif.ori2)
min.full <- which(diff == min(diff))
min.reduced <- which(diff2 == min(diff2))
ROC <- plot(specificity, sensitivity,
type = "l",
main = "Logistic Regression Model Comparrison Using ROC and AUC",
xlab="1-specificity",
ylab="sensitivity") +
segments(0,0,1,1, lty=2, col = "red") +
lines(specificity, sensitivity, lwd = 1, col = "navy") +
lines(specificity2, sensitivity2, lwd = 1, col = "darkgreen") +
text(0.8, 0.2, paste("Full Model AUC = ", AUC.full), col = "navy", cex = 0.8) +
text(0.8, 0.25, paste("Reduced Model AUC = ", AUC.reduced), col = "darkgreen", cex = 0.8) +
points((1-specif.ori[min.full]), specif.ori[min.full], col = "navy", pch = 17) +
points((1-specif.ori2[min.reduced]), specif.ori2[min.reduced], col = "darkgreen", pch = 17)
text(0.8, 0.18, paste("Full Model Calculated Cutoff = (", round((1-specif.ori[min.full]),4), ", " , round(sens.ori[min.full],4),")"), col = "navy", cex = 0.8) +
text(0.8, 0.23, paste("Reduced Model Calculated Cutoff = (", round((1-specif.ori2[min.reduced]),4), ", " , round(sens.ori2[min.reduced],4),")"), col = "darkgreen", cex = 0.8)
integer(0)
While the ROC curves are near identical and the AUC values differ by only 0.00004 both models are near identical in terms of predictive power. While the reduced model has a marginally greater AUC value (0.0040% greater than the full model AUC), the reduced model also contains fewer variables and therefore should be selected as the predictive model of choice.
A machine learning algorithm known as the Perceptron model can be utilized to create predictive models for binary classifiers, capable of undergoing consistent supervised learning.
data.neural <- raw.data
In order to run a neural network as a predictive model, the following considerations must be met. The data was modified accordingly to meet these conditions:
\[ x_{scaled} = \frac{x - min(x)}{max(x) - min(x)}\]
data.neural$Outcome <- as.numeric(as.factor(data.neural$Outcome))
data.neural$clumpThickScale <- ((data.neural$clumpThickness - min(data.neural$clumpThickness))/(max(data.neural$clumpThickness)-min(data.neural$clumpThickness))) #scale clumpThickness
data.neural$sizeUniScale <- ((data.neural$sizeUniform - min(data.neural$sizeUniform))/(max(data.neural$sizeUniform)-min(data.neural$sizeUniform))) #scale sizeUniform
data.neural$shapeUniScale <- ((data.neural$shapeUniform - min(data.neural$shapeUniform))/(max(data.neural$shapeUniform)-min(data.neural$shapeUniform))) #scale shapeUniform
data.neural$margAdhScale <- ((data.neural$margAdh - min(data.neural$margAdh))/(max(data.neural$margAdh)-min(data.neural$margAdh))) #scale margAdh
data.neural$eCellSizeScale <- ((data.neural$eCellSize - min(data.neural$eCellSize))/(max(data.neural$eCellSize)-min(data.neural$eCellSize))) #scale eCellSize
data.neural$blChromScale <- ((data.neural$blChrom - min(data.neural$blChrom))/(max(data.neural$blChrom)-min(data.neural$blChrom))) #scale blCrom
data.neural$normalNuclScale <- ((data.neural$normalNucleoli - min(data.neural$normalNucleoli))/(max(data.neural$normalNucleoli)-min(data.neural$normalNucleoli))) #scale normalNucleoli
data.neural$MitosesScale <- ((data.neural$Mitoses - min(data.neural$Mitoses))/(max(data.neural$Mitoses)-min(data.neural$Mitoses))) #scale mitoses
data.neural$OutcomeScale <- ((data.neural$Outcome - min(data.neural$Outcome))/(max(data.neural$Outcome)-min(data.neural$Outcome))) #scale outcome
data.neural.scale <- subset(data.neural, select = c(clumpThickScale, sizeUniScale, shapeUniScale, margAdhScale, eCellSizeScale, blChromScale, normalNuclScale, MitosesScale, OutcomeScale))
A summary of the transformed (min-max scaled) synthetic breast cancer biopsy data is included below
kable(summary(data.neural.scale), caption = "Summary of Min-Max transformed Synthetic Breast Cancer Data", align = "c", size = 6) %>%
kable_styling() %>%
row_spec(0, font_size = 12)
clumpThickScale | sizeUniScale | shapeUniScale | margAdhScale | eCellSizeScale | blChromScale | normalNuclScale | MitosesScale | OutcomeScale | |
---|---|---|---|---|---|---|---|---|---|
Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | |
1st Qu.:0.2222 | 1st Qu.:0.1111 | 1st Qu.:0.1111 | 1st Qu.:0.1111 | 1st Qu.:0.2222 | 1st Qu.:0.2222 | 1st Qu.:0.1111 | 1st Qu.:0.0000 | 1st Qu.:0.0000 | |
Median :0.4444 | Median :0.2222 | Median :0.2222 | Median :0.2222 | Median :0.3333 | Median :0.3333 | Median :0.2222 | Median :0.1111 | Median :0.0000 | |
Mean :0.4900 | Mean :0.3469 | Mean :0.3550 | Mean :0.3070 | Mean :0.3659 | Mean :0.3883 | Mean :0.3124 | Mean :0.1215 | Mean :0.3667 | |
3rd Qu.:0.6944 | 3rd Qu.:0.5556 | 3rd Qu.:0.5556 | 3rd Qu.:0.4444 | 3rd Qu.:0.4444 | 3rd Qu.:0.5556 | 3rd Qu.:0.4444 | 3rd Qu.:0.1111 | 3rd Qu.:1.0000 | |
Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 |
The model was created implicitly utilizing the current variable names. Models do not have to be created implicitly but implicit creation ensures no data entry errors during model creation and prevents opportunity for human error.
All variable names were extracted in preparation of model and model formula creation.
neural.implic.design = model.matrix( ~., data = data.neural.scale)
colnames(neural.implic.design)
[1] "(Intercept)" "clumpThickScale" "sizeUniScale" "shapeUniScale"
[5] "margAdhScale" "eCellSizeScale" "blChromScale" "normalNuclScale"
[9] "MitosesScale" "OutcomeScale"
OutcomeScale
is response variable and requires proper
attention in ensemble model formula creation.
Once the variable names were extracted, a string function was utilized to design the model formula from the model.matrix. The model format must be in the following equation format:
response ~ var.1 + var.2 + ... + var.n
neural.formula.names <- colnames(neural.implic.design)
neural.response.name <- paste(neural.formula.names[10])
neural.var.names <- paste(neural.formula.names[-c(1,10)], collapse = "+")
neural.formula <- formula(paste(c(neural.response.name, "~", neural.var.names), collapse = ""))
neural.formula
OutcomeScale ~ clumpThickScale + sizeUniScale + shapeUniScale +
margAdhScale + eCellSizeScale + blChromScale + normalNuclScale +
MitosesScale
The above ensemble model formula was then utilized in the
neuralnet()
function to create a single layer neural
network predictive model.
neural.model <- neuralnet(neural.formula,
data = neural.implic.design, #design matrix
hidden = 1,
act.fct = "logistic", #sigmoid activation fxn
linear.output = FALSE #must be false for logistic
)
kable(neural.model$result.matrix)
error | 7.5079378 |
reached.threshold | 0.0098939 |
steps | 5317.0000000 |
Intercept.to.1layhid1 | -13.3255423 |
clumpThickScale.to.1layhid1 | 7.1910769 |
sizeUniScale.to.1layhid1 | 2.2113967 |
shapeUniScale.to.1layhid1 | 9.6031389 |
margAdhScale.to.1layhid1 | 3.9146535 |
eCellSizeScale.to.1layhid1 | -5.5270566 |
blChromScale.to.1layhid1 | 7.0063959 |
normalNuclScale.to.1layhid1 | 7.1026281 |
MitosesScale.to.1layhid1 | 8.8580648 |
Intercept.to.OutcomeScale | -27.5277181 |
1layhid1.to.OutcomeScale | 64.8763323 |
A visual representation of the neural network and the respective calculated weights for each variable (and hidden weights) is included below.
plot(neural.model, rep="best", title = "Single-layer back propogation neural network for breast tissue biopsy diagnosis outcome")
Values along black lines following variable names represent the calculated weight of each variable. All weight adjustment was completed through \(z = w_0 +w_1 x_1 + ... + w_n x_n\)
Blue lines, nodes, and associated weight values represent the hidden layers required for logistic regression (a.k.a. bias)
First node (left) represents the linear transfer function (z equation above) and bias
Second node (right) represents the sigmoid activation function. This function is calculated by:
\(g(\delta) = \frac{e^\delta}{1+e^\delta}\) where \(\delta = \sum_{i=1}^n z = \sum_{i=1}^{n} w_i x_i + w_0\)
Decision threshold (\(\delta_0\)) is not represented on the flowchart but is used to determine the final output value. If \(g(\delta) > \delta_0\) then \(\hat{y} = 1\) otherwise \(\hat{y} = 0\)
OutcomeScale represents the output prediction values \(\hat{y}\) (0 = no cancer / 1 = cancer)
predict.neural <- predict(neural.model, newdata = neural.implic.design, linear.output = FALSE)
cat.nn <- data.neural$OutcomeScale == 1
ROCobj.nn <- roc(cat.nn, predict.neural)
plot(1- ROCobj.nn$sensitivities, ROCobj.nn$specificities,
type = "l",
xlim = c(0,1),
ylim = c(0,1),
xlab = "1 - Specificity (False Positive)",
ylab = "Sensitivity (True Negative)",
col = "darkviolet",
lwd = 2,
lty = 1,
main = "ROC Curve of Neural Network Perceptron Model")
abline(0,1, lty = 2, col = "darkred", lwd = 2)
legend("bottomright", c(paste("Perceptron Model, AUC =", ROCobj.nn$auc)),
col = c("darkviolet", "darkorange", "darkgreen", "navy"),
lwd=2, lty=rep(1), cex = 0.8, bty = "n")
data.tree <- raw.data
Another common approach to creating predictive models is through the creation of decision trees. Two main approaches to decision tree splitting were utilized: Gini Index & Entropy
tree.model <- function (input.data, fp, fn, purity){
tree = rpart(Outcome ~ .,
data = input.data,
na.action = na.rpart,
method = "class",
model = FALSE,
x = FALSE,
y = TRUE,
parms = list(loss = matrix(c(0, fp, fn, 0), ncol = 2, byrow = TRUE),
split = purity), #parms creates ability to weight false positives and false negatives
control = rpart.control(
minsplit = 10, #min values before split
minbucket = 3, #min values for leaf nodes
cp = 0.01, #comlpexity parameter
xval = 5 #number of cross validation runs
)
)
}
gini.tree.unweight <- tree.model(input.data = data.tree, fp = 1, fn = 1, purity = "gini")
info.tree.unweight <- tree.model(input.data = data.tree, fp = 1, fn = 1, purity = "info")
gini.tree.neg <- tree.model(input.data = data.tree, fp = 10, fn = 10, purity = "gini")
info.tree.neg <- tree.model(input.data = data.tree, fp = 10, fn = 10, purity = "info")
A note on interpreting the decision trees below. Each node & leaf contains the following values in the following order: 1. Predicted class (Yes/No) 2. Probability of Predicted Class 3. Percentage of Observations in Node
The gini index utilized for calculating node split values is given by the following equation:
\[ Gini(D) = \sum_{i=1}^{n} p_i (1-p_i) = 1 - \sum_{i=1}^{n} p_i^2 \]
where \(p_i\) represents the probability of being classified to a particular class.
par(mfrow=c(1,2))
rpart.plot(gini.tree.unweight, main = "No Additional Weight", type = 1)
rpart.plot(gini.tree.neg, main = "False Values x10 Weight True Values", type = 1)
The Entropy utilized for calculating node split values is given by the following equation:
\[ E = \sum_{i=1}^{n} -p_i \log_2 (p_i) \] where \(p_i\) represents the probability of being classified to a particular class. Entropy can then be utilized to calculate the information gain from a model InfoGain = E(Parent Node) - E(Child Node)
Information gain is a measure of the information that would be gained by splitting a feature of interest. Higher info gain values indicate greater model benefit to splitting that feature.
par(mfrow=c(1,2))
rpart.plot(info.tree.unweight, main = "No Additional Weight", type = 1)
rpart.plot(info.tree.neg, main = "False Values x10 Weight True Values", type = 1)
To compare the four decision trees and determine the best fit model and subsequent best model to use for predictions, the respective ROC curves and AUC values were compared.
Both the Gini and Entropy models were compared with no weighting as well as with 10x weighing for false negatives compared to false positives. False positives of a core needle punch biopsy can be identified with further blood biomarker screening, surgical biopsy, and additional core needle punch biopsies. Alternatively, a false negative result will lead a patient to delay or ultimately forgo potentially life saving treatment. In terms of cancer, false negatives are costly, and thus were weighted as such in the prediction models.
#Creating a function allows for streamlined "plug and play" processing of various decision tree models for ROC & AUC comparison.
dec.matrix.sesp <- function (input.data, fp, fn, purity){
cutoff <- seq(0,1,length=100)
model = tree.model(input.data, fp, fn, purity)
pred.dec = predict(model, newdata = input.data, type = "prob") #creates 2 column matrix
matrix.sesp <- matrix(0, ncol = length(cutoff), nrow =3, byrow = FALSE)
for (i in 1:length(cutoff)){
pred.output <- ifelse(pred.dec[,"Yes"] >= cutoff[i], "Yes", "No") #[,"Yes"] forces selection of only "success" (yes) probs
TP1 = sum(pred.output == "Yes" & input.data$Outcome == "Yes")
TN1 = sum(pred.output == "No" & input.data$Outcome == "No")
FP1 = sum(pred.output == "Yes" & input.data$Outcome == "No")
FN1 = sum(pred.output == "No" & input.data$Outcome == "Yes")
matrix.sesp[1,i] = TP1/(TP1 + FN1)
matrix.sesp[2,i] = TN1/(TN1 + FP1)
matrix.sesp[3,i] = (TP1 + TN1)/(TP1 + TN1 + FN1 + FP1) #accuracy
}
pred.dec.final <- pred.dec[ , "Yes"]
cat.dec <- input.data$Outcome == "Yes"
ROCObj.dec <- roc(cat.dec, pred.dec.final)
AUC.dec <- auc(ROCObj.dec)
list(matrix.sesp = matrix.sesp, AUC = round(AUC.dec, 3))
}
gini.ROC.unweight <- dec.matrix.sesp(input.data = data.tree, fp = 1, fn = 1, purity = "gini")
info.ROC.unweight <- dec.matrix.sesp(input.data = data.tree, fp = 1, fn = 1, purity = "information")
gini.ROC.neg <- dec.matrix.sesp(input.data = data.tree, fp = 1, fn = 10, purity = "gini")
info.ROC.neg <- dec.matrix.sesp(input.data = data.tree, fp = 1, fn = 10, purity = "information")
plot(1 - gini.ROC.unweight$matrix.sesp[2,], gini.ROC.unweight$matrix.sesp[1,],
type = "l",
xlim = c(0,1),
ylim = c(0,1),
xlab = "1 - Specificity (False Positive)",
ylab = "Sensitivity (True Negative)",
col = "darkviolet",
lwd = 2,
lty = 3,
main = "ROC Curve Comparrison of Decision Tree Models")
abline(0,1, lty = 2, col = "darkred", lwd = 2)
lines(1 - info.ROC.unweight$matrix.sesp[2,], info.ROC.unweight$matrix.sesp[1,], col = "darkorange", lwd = 2, lty = 3)
lines(1 - gini.ROC.neg$matrix.sesp[2,], gini.ROC.neg$matrix.sesp[1,], col = "darkgreen", lwd = 2, lty = 3)
lines(1 - info.ROC.neg$matrix.sesp[2,], info.ROC.neg$matrix.sesp[1,], col = "navy", lwd = 2, lty = 3)
legend("bottomright", c(paste("Gini Unweighted, AUC =", gini.ROC.unweight$AUC),
paste("Info Unweighted, AUC =", info.ROC.unweight$AUC),
paste("Gini Neg. Weight, AUC =", gini.ROC.neg$AUC),
paste("Info Neg. Weight, AUC =", info.ROC.neg$AUC)),
col = c("darkviolet", "darkorange", "darkgreen", "navy"),
lwd=2, lty=rep(3), cex = 0.8, bty = "n")
Info.neg.weight
(Entropy model with false negative
weighted 10x) proves to be the best fit model with the AUC closest to
1.000 (\(AUC_{info.neg.weight} =
0.982\)). With the greatest AUC value, this model will provide
the most accurate predictions of the four decision tree models
compared.
Finally, a bootstrap prediction model with bagging was implemented to provide an additional potential predictive model for breast cancer biopsy outcomes.
Assumptions of the bootstrap model:
To note: the bootstrap model does not assume any specific distribution of the data prior to model fitting.
An initial bootstrap model was run to predict the biopsy outcome but all AUC values returned as 1.000 indicating model over-fitting of the model. In order to prevent over-fitting and thus inaccurate future predictions, a bagging ensemble was utilized.
The data was subset into test and train data, in a 70:30 random split, with no replacement. An entropy was utilized and false negatives were again weighted 10x of a false positive. Bagging was employed to force the model to aggregate the parallel models and thus reduce variance across models. Additionally, bagging does not allow a single observation to be repeated in a sample, further reducing variance.
The bagged model was first trained on the training data.
data.boot <- raw.data
#converts from 1/2 to 0/1 variable
no.out.boot <- which(data.boot$Outcome == "No")
yes.out.boot <- which(data.boot$Outcome == "Yes")
data.boot$out = 0
data.boot$out[yes.out.boot] = 1
data.boot <- data.boot[,-10] #remove original outcome variable to prevent overfitting
nbag = dim(data.boot)[1] #sample size
train.boot.id <- sample(1:nbag, round(0.7*nbag), replace = FALSE) #split with no replacement
train.boot <- data.boot[train.boot.id,] #test data
test.boot <- data.boot[-train.boot.id,] #train data
outcome.bag.train <- bagging(as.factor(out) ~ .,
data = train.boot,
nbagg = 150,
coob = TRUE,
parms = list(loss = matrix(c(0,1,10,0),
byrow = TRUE),
split = "info"),
control = rpart.control(minsplit = 10,
cp = 0.02))
bag.predict <- predict(outcome.bag.train, train.boot, type ="prob")
boot.cutoff <- seq(0,1,length = 20)
boot.snsp.matrix <- matrix(0, ncol = length(boot.cutoff), nrow = 3, byrow = FALSE)
for(i in 1:length(boot.cutoff)){
pred.boot.out <- ifelse(bag.predict[,2] >= boot.cutoff[i], 1, 0)
TNb = sum(pred.boot.out == 0 & train.boot$out == 0) # true negative
FNb = sum(pred.boot.out == 0 & train.boot$out == 1) # false negative
FPb = sum(pred.boot.out == 1 & train.boot$out == 0) # false positive
TPb = sum(pred.boot.out == 1 & train.boot$out == 1) # true positive
boot.snsp.matrix[1,i] = TPb/(TPb + FNb) #sensitivity
boot.snsp.matrix[2,i] = TNb/(TNb + FPb) #specificity
boot.snsp.matrix[3,i] = (TPb + TNb)/(TPb + TNb + FNb + FPb) #accuracy
}
prediction.boot <- bag.predict[,1]
category.boot <- train.boot$out == 1
ROCobj.boot <- roc(category.boot, prediction.boot)
bootAUC <- auc(ROCobj.boot)
Once trained, the bagged model was then re-run on the test data to ensure an equal or better fit to the utilized training data. This ensures the model was not over-fitted to the training data and is able to be utilized for predictions.
outcome.bag.test <- bagging(as.factor(out) ~ .,
data = test.boot,
nbagg = 150,
coob = TRUE,
parms = list(loss = matrix(c(0,1,10,0),
byrow = TRUE),
split = "info"),
control = rpart.control(minsplit = 10,
cp = 0.02))
bag.predict.test <- predict(outcome.bag.test, test.boot, type ="prob")
boot.cutoff.test <- seq(0,1,length = 20)
boot.snsp.matrix.test <- matrix(0, ncol = length(boot.cutoff.test), nrow = 3, byrow = FALSE)
for(i in 1:length(boot.cutoff.test)){
pred.boot.out.test <- ifelse(bag.predict.test[,2] >= boot.cutoff.test[i], 1, 0)
TNbt = sum(pred.boot.out.test == 0 & test.boot$out == 0) # true negative
FNbt = sum(pred.boot.out.test == 0 & test.boot$out == 1) # false negative
FPbt = sum(pred.boot.out.test == 1 & test.boot$out == 0) # false positive
TPbt = sum(pred.boot.out.test == 1 & test.boot$out == 1) # true positive
boot.snsp.matrix.test[1,i] = TPbt/(TPbt + FNbt) #sensitivity
boot.snsp.matrix.test[2,i] = TNbt/(TNbt + FPbt) #specificity
boot.snsp.matrix.test[3,i] = (TPbt + TNbt)/(TPbt + TNbt + FNbt + FPbt) #accuracy
}
prediction.boot.test <- bag.predict.test[,1]
category.boot.test <- test.boot$out == 1
ROCobj.boot.test <- roc(category.boot.test, prediction.boot.test)
bootAUC.test <- auc(ROCobj.boot.test)
nl.test <- length(boot.snsp.matrix.test[3,])
idx.test <- which(boot.snsp.matrix.test[3,] == max(boot.snsp.matrix[3,])) #max accuracy
tick.test <- as.character(round(boot.cutoff.test, 2))
Finally the bagged model was run a third time on the entire dataset to allow for comparrison across all models utilized as potential predictive models. Checking against the full dataset again also ensures no over-fitting took place during model construction.
outcome.bag.full <- bagging(as.factor(out) ~ .,
data = data.boot,
nbagg = 150,
coob = TRUE,
parms = list(loss = matrix(c(0,1,10,0),
byrow = TRUE),
split = "info"),
control = rpart.control(minsplit = 10,
cp = 0.02))
bag.predict.full <- predict(outcome.bag.full, data.boot, type ="prob")
boot.cutoff.full <- seq(0,1,length = 20)
boot.snsp.matrix.full <- matrix(0, ncol = length(boot.cutoff.full), nrow = 3, byrow = FALSE)
for(i in 1:length(boot.cutoff.full)){
pred.boot.out.full <- ifelse(bag.predict.full[,2] >= boot.cutoff.full[i], 1, 0)
TNbf = sum(pred.boot.out.full == 0 & data.boot$out == 0) # true negative
FNbf = sum(pred.boot.out.full == 0 & data.boot$out == 1) # false negative
FPbf = sum(pred.boot.out.full == 1 & data.boot$out == 0) # false positive
TPbf = sum(pred.boot.out.full == 1 & data.boot$out == 1) # true positive
boot.snsp.matrix.full[1,i] = TPbf/(TPbf + FNbf) #sensitivity
boot.snsp.matrix.full[2,i] = TNbf/(TNbf + FPbf) #specificity
boot.snsp.matrix.full[3,i] = (TPbf + TNbf)/(TPbf + TNbf + FNbf + FPbf) #accuracy
}
prediction.boot.full <- bag.predict.full[,1]
category.boot.full <- data.boot$out == 1
ROCobj.boot.full <- roc(category.boot.full, prediction.boot.full)
bootAUC.full <- auc(ROCobj.boot.full)
nl.full <- length(boot.snsp.matrix.full[3,])
idx.full <- which(boot.snsp.matrix.full[3,] == max(boot.snsp.matrix[3,])) #max accuracy
tick.full <- as.character(round(boot.cutoff.full, 2))
label <- "BAGGING Bootstrap Model:
Entropy Model Weighted 10x for False Negatives
(A missed diagnosis of malignancy 10x worse than
a false positive diagnosis of malignancy at biopsy)"
plot((1 - boot.snsp.matrix[2,]), boot.snsp.matrix[1,],
type = "l",
xlim = c(0,1),
ylim = c(0,1),
xlab = "1 - Specificity (False Positive)",
ylab = "Sensitivity (True Negative)",
col = "darkviolet",
lwd = 2,
lty = 1,
main = "ROC Curve of BAGGING Bootstrap Training Data")
abline(0,1, lty = 2, col = "darkred", lwd = 2)
lines((1 - boot.snsp.matrix.test[2,]), boot.snsp.matrix.test[1,], col = "darkorange", lwd = 2, lty = 1)
lines((1 - boot.snsp.matrix.full[2,]), boot.snsp.matrix.full[1,], col = "darkblue", lwd = 2, lty = 1)
text(0.75, 0.4, label)
legend("bottomright", c(paste("Info Weighted Train, AUC =", bootAUC),
paste("Info Weighted Test, AUC =", bootAUC.test),
paste("Info Weighted Full, AUC =", bootAUC.full)),
col = c("darkviolet", "darkorange", "darkblue"),
lwd=2, lty=rep(1), cex = 0.8, bty = "n")
Each model explored returned a high AUC value and overall demonstrated a good fit and high predictive possibility for the breast cancer biopsy data. In order to determine the true best fit model, all four ROC curves and AUC values were plotted for comparison.
Both the reduced logistic regression model and bagged model have near identical AUC values (0.994 and 0.993 respectively), and both can be utilized to accurately predict the outcome values for breast cancer biopsy data.
plot((1 - boot.snsp.matrix.full[2,]), boot.snsp.matrix.full[1,],
type = "l",
xlim = c(0,1),
ylim = c(0,1),
xlab = "1 - Specificity (False Positive)",
ylab = "Sensitivity (True Negative)",
col = "darkviolet",
lwd = 2,
lty = 1,
main = "ROC Curve of BAGGING Bootstrap Training Data")
abline(0,1, lty = 2, col = "darkred", lwd = 2)
lines((1 - info.ROC.neg$matrix.sesp[2,]), info.ROC.neg$matrix.sesp[1,], col = "darkorange", lwd = 2, lty = 1)
lines((1 - ROCobj.nn$sensitivities), ROCobj.nn$specificities, col = "darkblue", lwd = 2, lty = 1)
lines((1 - sensitiv), specif2, col = "darkgreen", lwd = 2, lty = 1)
legend("bottomright", c(paste("BAGGED Bootstrap Model, AUC =", round(bootAUC.full, 3)),
paste("Entropy Neg. Weighted Model, AUC =", info.ROC.neg$AUC),
paste("Neural Net Perceptron Model AUC =", round(ROCobj.nn$auc, 3)),
paste("Reduced Linear Regression Model AUC =", round(ROCobj2$auc, 3))),
col = c("darkviolet", "darkorange", "darkblue", "darkgreen"),
lwd=2, lty=rep(1), cex = 0.8, bty = "n")
plot((1 - boot.snsp.matrix.full[2,]), boot.snsp.matrix.full[1,],
type = "l",
xlim = c(0,1),
ylim = c(0,1),
xlab = "1 - Specificity (False Positive)",
ylab = "Sensitivity (True Negative)",
col = "darkviolet",
lwd = 3,
lty = 1,
main = "ROC Curves of BAGGED Bootstrap &
Reduced Logistic Regression Models")
abline(0,1, lty = 2, col = "darkred", lwd = 2)
lines((1 - sensitiv), specif2, col = "darkgreen", lwd = 3, lty = 1)
legend("bottomright", c(paste("BAGGED Bootstrap Model, AUC =", round(bootAUC.full, 3)),
paste("Reduced Linear Regression Model AUC =", round(ROCobj2$auc, 3))),
col = c("darkviolet", "darkgreen"),
lwd=2, lty=rep(1), cex = 0.8, bty = "n")
Data Source: Book - Applied Analytics through Case Studies Using SAS and R, Deepti Gupta by APress, ISBN - 978-1-4842-3525-6
Reference Materials: https://bookdown.org/gmcirco42/decision_trees/decision_trees.html
https://www.cancer.org/cancer/types/breast-cancer/about/how-common-is-breast-cancer.html