In this report, we use synthetic breast cancer data to build multiple logistic regression models. Using global goodness-of-fit tests, we will decide which model is best for association analysis.
The data set we will be using is synthetic breast cancer data and can be found in the book “Applied Analytics through Case Studies Using SAS and R”. The response variable (Outcome) is a binary categorical variable, and all of the predictor variables are coded as continuous. Before we begin with logistic regression on this data set, we need to make some of the predictor variables categorical. For this I decided to use the variables Bare_Nuclei and Mitoses (a sketch of this recoding appears after the variable list below). The variables are as follows:
Sample_No: Identification Variable
Thickness_of_Clump: Benign cells tend to form monolayers, while malignant (cancerous) cells form multilayers
Cell_Size_Uniformity: Benign cells vary little in size, while malignant cells vary in size
Cell_Shape_Uniformity: Benign cells vary little in shape, while malignant cells vary in shape
Marginal_Adhesion: Benign cells tend to stick together, while cancer cells are loose and do not stick together
Single_Epithelial_Cell_Size: In benign tissue the epithelial cells are of normal size, while in malignant tissue they are significantly enlarged
Bare_Nuclei: In benign cells the bare nuclei are not surrounded by cytoplasm, while in cancer cells they are surrounded by cytoplasm
** Made categorical:
Score < 3: coded as "Normal"
Score > 3: coded as "Surrounded"
Bland_Chromatin: Benign cells have uniform or fine chromatin, while cancer cells have coarse chromatin
Normal_Nucleoli: In benign cells the nucleoli are very small, while in cancer cells the nucleoli are more prominent
Mitoses: In benign cells cell growth is normal, while in cancer cells cell growth is abnormal
** Made categorical:
Score < 3: coded as "Normal"
3 < Score < 7: coded as "Abnormal"
Score > 7: coded as "Rapid"
Outcome (response): "No" denotes a benign tumor and "Yes" denotes malignant breast cancer
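Below is a minimal sketch of how these two recodings could be carried out, assuming the raw scores are stored in the columns Bare_Nuclei and Mitoses of BreastCancerData; the handling of scores exactly equal to 3 or 7 is not stated in the text, so the cut points used here are an assumption.
# Hypothetical recoding sketch; the treatment of boundary scores (3 and 7) is assumed.
BreastCancerData$Bare_Nuclei.cat = cut(BreastCancerData$Bare_Nuclei,
                                       breaks = c(-Inf, 3, Inf),
                                       labels = c("Normal", "Surrounded"))
BreastCancerData$Mitoses.cat = cut(BreastCancerData$Mitoses,
                                   breaks = c(-Inf, 3, 7, Inf),
                                   labels = c("Normal", "Abnormal", "Rapid"))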
y0 = BreastCancerData$Outcome
outcome.01 = rep(0, length(y0))      # define a 0-1 response to check which probability glm() models
outcome.01[which(y0 == "Yes")] = 1   # "Yes" (malignant) is coded as 1
BreastCancerData$outcome.01 = outcome.01
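As a quick check (an illustrative addition, not part of the original report), we can cross-tabulate the original Outcome variable against the new 0-1 coding to confirm that "Yes" maps to 1:
# Cross-tabulation to verify the 0-1 coding of the response.
table(BreastCancerData$Outcome, BreastCancerData$outcome.01)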
In this report, I will be exploring the effect of the potential risk factors, such as bland chromatin, on whether a breast cancer tumor is benign or malignant. My practical and analytical questions are the same: which variables are most strongly associated with whether a woman has breast cancer? To answer this, we will need to fit the best model, which we will do using multiple logistic regression analysis.
We need our response variable to be binary (either 0 or 1) in order to use logistic regression. In this report, an outcome of “No” is coded as 0 and “Yes” is coded as 1. Since we are using multiple logistic regression, we can include several explanatory variables in the model.
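For reference, the multiple logistic regression model we fit has the form \[ \log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k, \qquad p = P(\text{Outcome} = 1 \mid x_1, \ldots, x_k), \] so each coefficient describes the change in the log-odds of a malignant outcome associated with a one-unit change in the corresponding predictor (or with membership in a category, for the categorical predictors).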
We can start by making a pairwise scatterplot of the variables to see the correlations, spot patterns, and assess potential assumption violations.
library(psych)                   # pairs.panels() comes from the psych package
pairs.panels(BreastCancerData[,-9],
             method = "pearson", # correlation method
             hist.col = "#00AFBB",
             density = TRUE,     # show density plots
             ellipses = TRUE     # show correlation ellipses
             )
Since none of the predictor distributions look close to normal, we will continue our model building with caution.
To make the first model, we will use all of the predictor variables. The regression coefficients are shown below.
full.model = glm(outcome.01 ~Thickness_of_Clump+Cell_Size_Uniformity+Cell_Shape_Uniformity+Marginal_Adhesion
+Single_Epithelial_Cell_Size+Bare_Nuclei.cat+Mitoses.cat+Bland_Chromatin+Normal_Nucleoli,
family = binomial(link = "logit"), # logit(p) = log(p/(1-p))!
data = BreastCancerData)
library(knitr)   # kable() for formatted tables
kable(summary(full.model)$coef,
      caption = "Summary of inferential statistics of the full model")
|  | Estimate | Std. Error | z value | Pr(>\|z\|) |
|---|---|---|---|---|
| (Intercept) | -11.9515676 | 1.3443777 | -8.8900369 | 0.0000000 |
| Thickness_of_Clump | 0.5469395 | 0.1157764 | 4.7241005 | 0.0000023 |
| Cell_Size_Uniformity | 0.0490749 | 0.1486768 | 0.3300777 | 0.7413413 |
| Cell_Shape_Uniformity | 0.4709384 | 0.1521813 | 3.0945872 | 0.0019709 |
| Marginal_Adhesion | 0.2378641 | 0.1005272 | 2.3661666 | 0.0179734 |
| Single_Epithelial_Cell_Size | 0.1444214 | 0.1315867 | 1.0975374 | 0.2724065 |
| Bare_Nuclei.catRapid | 2.0286567 | 0.6834313 | 2.9683402 | 0.0029941 |
| Mitoses.catAbnormal | 1.1440367 | 0.8251000 | 1.3865431 | 0.1655811 |
| Mitoses.catRapid | 2.7657898 | 2.7928979 | 0.9902939 | 0.3220305 |
| Bland_Chromatin | 0.4499208 | 0.1255988 | 3.5822052 | 0.0003407 |
| Normal_Nucleoli | 0.1129408 | 0.0960952 | 1.1753013 | 0.2398742 |
To manually select variables, we used a significance level of 0.05 and removed any variables with a p-value above it. It should be noted that we cannot be completely sure about the practical significance of the variables we are dropping based on p-values alone. Without consulting subject-matter experts, it is possible that some variables should not have been removed from the model.
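As a quick illustration (not part of the original analysis), the predictors kept by this rule can be listed directly from the full model's coefficient table:
# List the model terms (excluding the intercept) with Wald-test p-values below 0.05.
full.coefs = summary(full.model)$coef
keep = rownames(full.coefs)[full.coefs[, "Pr(>|z|)"] < 0.05]
setdiff(keep, "(Intercept)")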
reduced.model = glm(outcome.01 ~ Thickness_of_Clump+Cell_Shape_Uniformity+Marginal_Adhesion + Bare_Nuclei.cat+Bland_Chromatin,
family = binomial(link = "logit"),
data = BreastCancerData)
kable(summary(reduced.model)$coef,
caption="Summary of inferential statistics of the reduced model")
|  | Estimate | Std. Error | z value | Pr(>\|z\|) |
|---|---|---|---|---|
| (Intercept) | -12.2584056 | 1.2997209 | -9.431567 | 0.0000000 |
| Thickness_of_Clump | 0.6237005 | 0.1094352 | 5.699270 | 0.0000000 |
| Cell_Shape_Uniformity | 0.6108980 | 0.1164100 | 5.247814 | 0.0000002 |
| Marginal_Adhesion | 0.2912003 | 0.0921735 | 3.159264 | 0.0015817 |
| Bare_Nuclei.catRapid | 2.2576754 | 0.6727562 | 3.355860 | 0.0007912 |
| Bland_Chromatin | 0.5203022 | 0.1215298 | 4.281271 | 0.0000186 |
Another way to select variables is the stepAIC() function from the MASS package, which here performs forward stepwise selection: starting from the reduced model, it adds predictors (up to the full model) until no addition lowers the AIC.
library(MASS)    # stepAIC() comes from the MASS package
final.model.forward = stepAIC(reduced.model,
                      scope = list(lower = formula(reduced.model), upper = formula(full.model)),
                      direction = "forward",  # forward selection
                      trace = 0               # do not show the details
                      )
kable(summary(final.model.forward)$coef,
caption="Summary of inferential statistics of the final model")
|  | Estimate | Std. Error | z value | Pr(>\|z\|) |
|---|---|---|---|---|
| (Intercept) | -12.6489780 | 1.3505450 | -9.365832 | 0.0000000 |
| Thickness_of_Clump | 0.6190066 | 0.1101233 | 5.621030 | 0.0000000 |
| Cell_Shape_Uniformity | 0.5291962 | 0.1199551 | 4.411619 | 0.0000103 |
| Marginal_Adhesion | 0.2466034 | 0.0959302 | 2.570654 | 0.0101507 |
| Bare_Nuclei.catRapid | 2.3139921 | 0.6809299 | 3.398282 | 0.0006781 |
| Bland_Chromatin | 0.4839178 | 0.1222115 | 3.959673 | 0.0000751 |
| Single_Epithelial_Cell_Size | 0.2309150 | 0.1211191 | 1.906511 | 0.0565840 |
Next, we will use global goodness-of-fit measures to see which model is best. Here, we compare the residual deviance, the null deviance, and the AIC.
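For reference, the AIC reported below balances model fit against complexity, \[ \mathrm{AIC} = -2\log\hat{L} + 2k, \] where \(\hat{L}\) is the maximized likelihood of the model and \(k\) is the number of estimated parameters; smaller values indicate a better trade-off between fit and parsimony.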
global.measure = function(s.logit){
  dev.resid = s.logit$deviance          # residual deviance of the fitted model
  dev.0.resid = s.logit$null.deviance   # null deviance (intercept-only model)
  aic = s.logit$aic                     # Akaike information criterion
  goodness = cbind(Residual.Deviance = dev.resid, Null.Deviance = dev.0.resid,
                   AIC = aic)
  goodness
}
goodness=rbind(full.model = global.measure(full.model),
reduced.model=global.measure(reduced.model),
final.model=global.measure(final.model.forward))
row.names(goodness) = c("full.model", "reduced.model", "final.model")
kable(goodness, caption ="Comparison of global goodness-of-fit statistics")
|  | Residual.Deviance | Null.Deviance | AIC |
|---|---|---|---|
| full.model | 135.9872 | 788.5893 | 157.9872 |
| reduced.model | 145.2619 | 788.5893 | 157.2619 |
| final.model | 141.5054 | 788.5893 | 155.5054 |
The full model has the smallest residual deviance, but once model complexity is penalized, the final model from automatic variable selection has the lowest AIC, so we select it. On the log-odds scale, the fitted model is \[ \log\left(\frac{\hat p}{1-\hat p}\right) = -12.6489780 + 0.6190066 \times Thickness\_of\_Clump + 0.5291962 \times Cell\_Shape\_Uniformity \\ + 0.2466034 \times Marginal \_ Adhesion + 2.3139921 \times Bare\_Nuclei.catRapid \\ + 0.4839178 \times Bland\_Chromatin + 0.2309150 \times Single\_Epithelial\_Cell\_Size \] This is the model we will use to find the odds ratios.
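As a brief illustration (an addition for clarity, not part of the original output), the fitted log-odds can be converted to predicted probabilities of a malignant ("Yes") outcome with predict():
# Predicted probabilities of a malignant outcome for the first few observations.
head(predict(final.model.forward, newdata = BreastCancerData, type = "response"))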
In the table below, we add odds ratios obtained by exponentiating the regression coefficients; these are easier to interpret in practical terms.
model.coef.stats = summary(final.model.forward)$coef
odds.ratio = exp(coef(final.model.forward))
out.stats = cbind(model.coef.stats, odds.ratio = odds.ratio)
kable(out.stats,caption = "Summary Stats with Odds Ratios")
|  | Estimate | Std. Error | z value | Pr(>\|z\|) | odds.ratio |
|---|---|---|---|---|---|
| (Intercept) | -12.6489780 | 1.3505450 | -9.365832 | 0.0000000 | 0.0000032 |
| Thickness_of_Clump | 0.6190066 | 0.1101233 | 5.621030 | 0.0000000 | 1.8570823 |
| Cell_Shape_Uniformity | 0.5291962 | 0.1199551 | 4.411619 | 0.0000103 | 1.6975672 |
| Marginal_Adhesion | 0.2466034 | 0.0959302 | 2.570654 | 0.0101507 | 1.2796715 |
| Bare_Nuclei.catRapid | 2.3139921 | 0.6809299 | 3.398282 | 0.0006781 | 10.1147234 |
| Bland_Chromatin | 0.4839178 | 0.1222115 | 3.959673 | 0.0000751 | 1.6224183 |
| Single_Epithelial_Cell_Size | 0.2309150 | 0.1211191 | 1.906511 | 0.0565840 | 1.2597521 |
The odds ratio associated with Bare_Nuclei.catRapid is about 10.11, meaning that, holding the other predictors fixed, the odds of a “Yes” (malignant) outcome for observations in that category are roughly 10 times the odds for the reference category. This is the most significant risk factor for breast cancer, both statistically and practically. The odds ratios for the remaining predictors can be read from the table above; for each continuous predictor, the odds ratio gives the multiplicative change in the odds of a malignant outcome associated with a one-unit increase in that predictor.
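As a possible extension (an illustrative addition, not part of the original report), approximate 95% Wald confidence intervals for these odds ratios can be obtained by exponentiating the confidence limits of the regression coefficients:
# 95% Wald confidence intervals for the odds ratios of the final model.
ci = confint.default(final.model.forward)   # Wald intervals on the log-odds scale
exp(cbind(odds.ratio = coef(final.model.forward), ci))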
This report focused on the association analysis of the potential risk factors of breast cancer. We started with 9 predictor variables, 2 of which we recoded as categorical, and our final model retained 6 of them: Thickness_of_Clump, Cell_Shape_Uniformity, Marginal_Adhesion, Bare_Nuclei.cat, Bland_Chromatin, and Single_Epithelial_Cell_Size.
We cannot be sure whether the variables we dropped had practical significance, as we did not consult breast cancer experts for this report. Nonetheless, this is the best model we have built for association analysis.