In this report, we use synthetic breast cancer data to build multiple logistic regression models. Using cross-validation and ROC analysis, we will decide which model has the best predictive performance.
# Load the synthetic breast cancer data from GitHub
BreastCancerData <- read.csv("https://raw.githubusercontent.com/emmalaughin/sta321/main/data/SyntheticBreastCancerData")
The data set we will be using is synthetic breast cancer data; it can be found in the book “Applied Analytics through Case Studies Using SAS and R” and is hosted on GitHub. The response variable (Outcome) is a binary categorical variable, and all of the predictor variables are coded as continuous. Before we begin with logistic regression on this data set, we need to make some of the predictor variables categorical. For this I decided to use the variables Bare_Nuclei and Mitoses. The variables are as follows:
Sample_No: Identification variable.
Thickness_of_Clump: Benign cells tend to form monolayers, while malignant cells form multilayers.
Cell_Size_Uniformity: Benign cells do not vary much in size, while malignant cells vary in size.
Cell_Shape_Uniformity: Benign cells do not vary much in shape, while malignant cells vary in shape.
Marginal_Adhesion: Benign cells tend to stick together, while cancer cells are loose and do not stick together.
Single_Epithelial_Cell_Size: In benign cells the epithelial cells are of normal size, while in malignant cells they are significantly enlarged.
Bare_Nuclei: In benign cells the bare nuclei are not surrounded by cytoplasm, while in cancer cells they are surrounded by cytoplasm.
** Made categorical (see the code sketch after this list):
Score < 3: coded as "Normal"
Score > 3: coded as "Surrounded"
Bland_Chromatin: Benign cells have uniform or fine chromatin, while cancer cells have coarse chromatin.
Normal_Nucleoli: In benign cells the nucleoli are very small, while in cancer cells they are more prominent.
Mitoses: In benign cells the cell growth is normal, while in cancer cells the growth is abnormal.
** Made categorical (see the code sketch after this list):
Score < 3: coded as "Normal"
3 < Score < 7: coded as "Abnormal"
Score > 7: coded as "Rapid"
Outcome (response): "No" denotes a benign tumor and "Yes" denotes malignant breast cancer.
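The recoding described above can be done as in the following sketch. The column names Bare_Nuclei.cat and Mitoses.cat match those used in the candidate models later in this report; how the boundary scores 3 and 7 are grouped is an assumption, since the list above gives only strict inequalities.

## Hedged sketch of the recoding described above; the handling of the
## boundary scores 3 and 7 is an assumption.
BreastCancerData$Bare_Nuclei.cat <- ifelse(BreastCancerData$Bare_Nuclei < 3,
                                           "Normal", "Surrounded")
BreastCancerData$Mitoses.cat <- cut(BreastCancerData$Mitoses,
                                    breaks = c(-Inf, 3, 7, Inf),
                                    labels = c("Normal", "Abnormal", "Rapid"),
                                    right = FALSE)  # intervals [., 3), [3, 7), [7, .)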
In this report, I will be building a logistic regression model to predict whether a breast cancer tumor is benign or malignant, using various risk factors measured for each patient.
We need our response variable to be binary (either 0 or 1) in order to use logistic regression. In this report, an outcome of “No” is coded as 0 and “Yes” is coded as 1. Since we are fitting multiple logistic regression models, we will use several explanatory variables.
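A minimal sketch of this coding: declaring the factor levels explicitly makes "No" the reference level, so glm() models the probability of "Yes" (that is, "No" = 0 and "Yes" = 1).

## Make "No" the reference level so glm() models P(Outcome = "Yes")
BreastCancerData$Outcome <- factor(BreastCancerData$Outcome,
                                   levels = c("No", "Yes"))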
We can start by making a pairwise scatterplot of the variables to examine correlations, spot patterns, and check for potential assumption violations.
library(psych)  # provides pairs.panels()
pairs.panels(BreastCancerData[, -c(7, 10)],  # drop Bare_Nuclei and Mitoses, which were recoded as categorical
             method = "pearson",  # correlation method
             hist.col = "#00AFBB",
             density = TRUE,      # show density plots
             ellipses = TRUE      # show correlation ellipses
             )
The predictor variables are all unimodal; if any were heavily skewed, we could discretize them.
Since this is a predictive model, we are less concerned with interpreting the individual coefficients. The objective is to identify the model with the best predictive performance, so we standardize the numerical predictor variables, as sketched below.
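A minimal sketch of the standardization, assuming the continuous predictors listed earlier; the sd. prefix matches the variable names used in the candidate models below.

## Standardize (center and scale) each numerical predictor, storing the
## result in a new column with the sd. prefix used by the models below.
num.vars <- c("Thickness_of_Clump", "Cell_Size_Uniformity",
              "Cell_Shape_Uniformity", "Marginal_Adhesion",
              "Single_Epithelial_Cell_Size", "Bland_Chromatin",
              "Normal_Nucleoli")
for (v in num.vars) {
  BreastCancerData[[paste0("sd.", v)]] <- as.vector(scale(BreastCancerData[[v]]))
}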
We will use cross-validation to create candidate models, validate them, and identify the final model.
To begin the cross-validation process, we split the data into two parts: 80% for the training set and 20% for the testing set. All candidate modeling is done on the training set.
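A sketch of the 80/20 split; the names train.dat and test are the ones used in the modeling code below, and the seed value is an arbitrary choice for reproducibility.

set.seed(123)                              # arbitrary seed for reproducibility
n <- nrow(BreastCancerData)
train.id <- sample(1:n, round(0.8 * n))    # 80% of rows for training
train.dat <- BreastCancerData[train.id, ]  # training set
test <- BreastCancerData[-train.id, ]      # testing set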
We use the full model, a reduced model, and the model obtained from step-wise variable selection as the three candidate models.
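The excerpt does not show how avg.pe is computed; the following is a hedged sketch using 5-fold cross-validation on the training set, re-fitting the three candidate models (defined in the ROC section below) on each fold and averaging the misclassification rate at a 0.5 cut-off. The fold count and cut-off are assumptions.

## Hedged sketch: 5-fold CV misclassification rates for the three candidates
k <- 5
fold <- sample(rep(1:k, length.out = nrow(train.dat)))  # random fold labels
pe <- matrix(NA, k, 3)
for (i in 1:k) {
  fit.dat <- train.dat[fold != i, ]   # folds used for fitting
  val.dat <- train.dat[fold == i, ]   # held-out fold for validation
  fits <- list(update(candidate01, data = fit.dat),
               update(candidate02, data = fit.dat),
               update(candidate03, data = fit.dat))
  for (j in 1:3) {
    p <- predict(fits[[j]], newdata = val.dat, type = "response")
    pe[i, j] <- mean((p > 0.5) != (val.dat$Outcome == "Yes"))
  }
}
avg.pe <- data.frame(PE1 = mean(pe[, 1]), PE2 = mean(pe[, 2]),
                     PE3 = mean(pe[, 3]))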
library(knitr)  # provides kable()
kable(avg.pe, caption = "Average of prediction errors of candidate models")
| PE1 | PE2 | PE3 |
|---|---|---|
| 0.0729167 | 0.0729167 | 0.0645833 |
The table below shows the accuracy rate of the final model on the test data, which we have held out until now.
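The object accuracy is not defined in the excerpt; a minimal sketch is given below. It scores the reduced model (candidate03, defined in the ROC section below, which had the smallest average prediction error) on the held-out test set at a 0.5 cut-off; both the choice of final model and the cut-off are assumptions.

## Hedged sketch: test-set accuracy of the final model at a 0.5 cut-off
pred.final <- predict(candidate03, newdata = test, type = "response")
accuracy <- data.frame(
  Accuracy = mean((pred.final > 0.5) == (test$Outcome == "Yes")))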
kable(accuracy, caption="The actual accuracy of the final model")
| Accuracy |
|---|
| 1 |
We first estimate the TPR (true positive rate, sensitivity) and FPR (false positive rate, 1 - specificity) at each cut-off probability for each of the three candidate models.
Then we plot the ROC curves of the three candidate models; they are given below.
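The helper TPR.FPR() is not defined in the excerpt. A minimal sketch consistent with its use below (column 1 = FPR, column 2 = TPR) is given here; it assumes test$Outcome holds the "No"/"Yes" labels and uses an evenly spaced grid of cut-offs.

## Hedged sketch: (FPR, TPR) pairs over a grid of cut-off probabilities
TPR.FPR <- function(pred) {
  cutoffs <- seq(0, 1, length.out = 50)   # grid of cut-off probabilities
  truth <- test$Outcome == "Yes"          # assumed test-set labels
  t(sapply(cutoffs, function(p) {
    yhat <- pred > p                      # predicted "Yes" at this cut-off
    c(FPR = sum(yhat & !truth) / sum(!truth),  # 1 - specificity
      TPR = sum(yhat & truth) / sum(truth))    # sensitivity
  }))
}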
## Candidate models
## Candidate 1: the full model
candidate01 = glm(factor(Outcome) ~ sd.Thickness_of_Clump + sd.Cell_Size_Uniformity +
                    sd.Cell_Shape_Uniformity + sd.Marginal_Adhesion +
                    sd.Single_Epithelial_Cell_Size + sd.Normal_Nucleoli +
                    sd.Bland_Chromatin + Bare_Nuclei.cat + Mitoses.cat,
                  family = binomial(link = "logit"),
                  data = train.dat)
## Candidate 3: the reduced model
candidate03 = glm(factor(Outcome) ~ sd.Thickness_of_Clump + sd.Cell_Shape_Uniformity +
                    sd.Marginal_Adhesion + Bare_Nuclei.cat + sd.Bland_Chromatin,
                  family = binomial(link = "logit"),
                  data = train.dat)
## Candidate 2: step-wise selection between the reduced and full models
library(MASS)  # provides stepAIC()
candidate02 = stepAIC(candidate01,
                      scope = list(lower = formula(candidate03),
                                   upper = formula(candidate01)),
                      direction = "forward",  # forward selection
                      trace = 0               # do not show the details
                      )
## Predicted probabilities on the test set for each candidate model
pred01 = predict(candidate01, newdata = test, type = "response")
pred02 = predict(candidate02, newdata = test, type = "response")
pred03 = predict(candidate03, newdata = test, type = "response")
## ROC curves of the three candidate models
plot(TPR.FPR(pred01)[, 1], TPR.FPR(pred01)[, 2],
     type = "b", col = 2, lty = 1, xlim = c(0, 1), ylim = c(0, 1),
     xlab = "FPR: 1 - specificity",
     ylab = "TPR: sensitivity",
     main = "ROC Curve",
     cex.main = 0.8,
     col.main = "navy")
lines(TPR.FPR(pred02)[, 1], TPR.FPR(pred02)[, 2], type = "b", col = 3, lty = 2)
lines(TPR.FPR(pred03)[, 1], TPR.FPR(pred03)[, 2], type = "b", col = 4, lty = 3)
legend("bottomright", c("Candidate #1", "Candidate #2", "Candidate #3"),
       col = 2:4, lty = 1:3, cex = 0.8, bty = "n")
From the ROC curves, we can see that the three candidate models have very similar sensitivity and specificity across cut-off probabilities. Based on this criterion alone, any of the three could serve as the final working model.
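To make "similar" concrete, an approximate area under each curve (AUC) can be computed from the same (FPR, TPR) pairs via the trapezoidal rule; this sketch assumes the TPR.FPR() helper sketched earlier.

## Hedged sketch: approximate AUC for each candidate (trapezoidal rule)
auc <- function(pred) {
  rc <- TPR.FPR(pred)
  rc <- rc[order(rc[, 1]), ]   # sort the points by FPR
  sum(diff(rc[, 1]) * (head(rc[, 2], -1) + tail(rc[, 2], -1)) / 2)
}
c(AUC1 = auc(pred01), AUC2 = auc(pred02), AUC3 = auc(pred03))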
This report focused on predicting whether a breast cancer tumor is benign or malignant. We built three candidate models and used both cross-validation and ROC curves to select the final working model. The ROC curves indicated that the three models perform similarly, while the cross-validated prediction errors slightly favored candidate 3, the reduced model, which had the smallest average prediction error.