Introduction

In this report, we will be using synthetic breast cancer data to make a multiple logistic regression models. Using global goodness of fit tests, we will decide which is the best model for association analysis.

Data Description

The data set we will be using is synthetic breast cancer data and can be found in the book “Applied Analytics through Case Studies Using SAS and R”. The response variable (Outcome) is a binary categorical variable, and all of the predictor variable are coded as continuous. Before we begin with logistic regression on this data set, we need to make some of the predictor variables categorical. For this I decided to use the variables Bare_Nuceli and Mitosis. The variables are as follows:

y0=BreastCancerData$Outcome
outcome.01 = rep(0, length(y0))      # define a 0-1 to test which probability is used in glm()
outcome.01[which(y0=="Yes")] = 1
BreastCancerData$outcome.01 = outcome.01

In this report, I will be exploring the effect of bland chromatin on the outcome of if a breast cancer tumor is benign or malignant. My practical and analytical questions are both the same: Which variables are most significant in if a woman has breast cancer or not? To do this, we will need to fit the best prediction model. We will be testing this using multiple logistic regression analysis.

We need our response variable to be binary (either 0 or 1) in order to use logistic regression. In this report, an outcome of “No” is coded as 0 and “Yes” is coded as 1. Since we are using simple logistic regression, we only need one explanatory variable.

Data Exploration

We can start by making a pairwise scatterplot of the variables to see the correlation, spot patterns, and assess potential violations.

pairs.panels(BreastCancerData[,-9], 
             method = "pearson", # correlation method
             hist.col = "#00AFBB",
             density = TRUE,  # show density plots
             ellipses = TRUE # show correlation ellipses
             )

Since none of them look completely normal, we will continue our model building with caution.

Multiple Logistic Regression Models

Candidate Model

To make the first model, we will use all of the predictor variables. The regression coefficients are shown below.

full.model = glm(outcome.01 ~Thickness_of_Clump+Cell_Size_Uniformity+Cell_Shape_Uniformity+Marginal_Adhesion
                 +Single_Epithelial_Cell_Size+Bare_Nuclei.cat+Mitoses.cat+Bland_Chromatin+Normal_Nucleoli, 
          family = binomial(link = "logit"),  #  logit(p) = log(p/(1-p))!
          data = BreastCancerData)  
kable(summary(full.model)$coef, 
      caption="Summary of inferential statistics of the full model")
Summary of inferential statistics of the full model
Estimate Std. Error z value Pr(>|z|)
(Intercept) -11.9515676 1.3443777 -8.8900369 0.0000000
Thickness_of_Clump 0.5469395 0.1157764 4.7241005 0.0000023
Cell_Size_Uniformity 0.0490749 0.1486768 0.3300777 0.7413413
Cell_Shape_Uniformity 0.4709384 0.1521813 3.0945872 0.0019709
Marginal_Adhesion 0.2378641 0.1005272 2.3661666 0.0179734
Single_Epithelial_Cell_Size 0.1444214 0.1315867 1.0975374 0.2724065
Bare_Nuclei.catRapid 2.0286567 0.6834313 2.9683402 0.0029941
Mitoses.catAbnormal 1.1440367 0.8251000 1.3865431 0.1655811
Mitoses.catRapid 2.7657898 2.7928979 0.9902939 0.3220305
Bland_Chromatin 0.4499208 0.1255988 3.5822052 0.0003407
Normal_Nucleoli 0.1129408 0.0960952 1.1753013 0.2398742

Manually Selected Variables

In order to manually select variables, we used an alpha level of .05 and remove any variables with a p-value over it. It is to be noted that we can not be completely sure about the practical significance of the variables that we are dropping based off of p-value. After consulting experts, it is possible that there may be variables that we should not have removed from the model.

reduced.model = glm(outcome.01 ~ Thickness_of_Clump+Cell_Shape_Uniformity+Marginal_Adhesion + Bare_Nuclei.cat+Bland_Chromatin, 
          family = binomial(link = "logit"),  
          data = BreastCancerData) 
kable(summary(reduced.model)$coef, 
      caption="Summary of inferential statistics of the reduced model")
Summary of inferential statistics of the reduced model
Estimate Std. Error z value Pr(>|z|)
(Intercept) -12.2584056 1.2997209 -9.431567 0.0000000
Thickness_of_Clump 0.6237005 0.1094352 5.699270 0.0000000
Cell_Shape_Uniformity 0.6108980 0.1164100 5.247814 0.0000002
Marginal_Adhesion 0.2912003 0.0921735 3.159264 0.0015817
Bare_Nuclei.catRapid 2.2576754 0.6727562 3.355860 0.0007912
Bland_Chromatin 0.5203022 0.1215298 4.281271 0.0000186

Automatic Variable Selection

Another way we can eliminate variables from the model is the stepAIC function, which uses a forward stepwise procedure to add in the model until it finds the model with the lowest AIC value.

final.model.forward = stepAIC(reduced.model, 
                      scope = list(lower=formula(reduced.model),upper=formula(full.model)),
                      direction = "forward",   # forward selection
                      trace = 0   # do not show the details
                      )
kable(summary(final.model.forward)$coef, 
      caption="Summary of inferential statistics of the final model")
Summary of inferential statistics of the final model
Estimate Std. Error z value Pr(>|z|)
(Intercept) -12.6489780 1.3505450 -9.365832 0.0000000
Thickness_of_Clump 0.6190066 0.1101233 5.621030 0.0000000
Cell_Shape_Uniformity 0.5291962 0.1199551 4.411619 0.0000103
Marginal_Adhesion 0.2466034 0.0959302 2.570654 0.0101507
Bare_Nuclei.catRapid 2.3139921 0.6809299 3.398282 0.0006781
Bland_Chromatin 0.4839178 0.1222115 3.959673 0.0000751
Single_Epithelial_Cell_Size 0.2309150 0.1211191 1.906511 0.0565840

Final Model Selection

First, we will use global goodness of fit tests to see which model is best. Here, we are looking at the deviance, null deviance, and aic score.

global.measure=function(s.logit){
dev.resid = s.logit$deviance
dev.0.resid = s.logit$null.deviance
aic = s.logit$aic
goodness = cbind(Deviance.residual =dev.resid, Null.Deviance.Residual = dev.0.resid,
      AIC = aic)
goodness
}
goodness=rbind(full.model = global.measure(full.model),
      reduced.model=global.measure(reduced.model),
      final.model=global.measure(final.model.forward))
row.names(goodness) = c("full.model", "reduced.model", "final.model")
kable(goodness, caption ="Comparison of global goodness-of-fit statistics")
Comparison of global goodness-of-fit statistics
Deviance.residual Null.Deviance.Residual AIC
full.model 135.9872 788.5893 157.9872
reduced.model 145.2619 788.5893 157.2619
final.model 141.5054 788.5893 155.5054

We want the model with the lowest AIC and deviance, which is the final model that we made using automatic variable selection, \[ y=-12.6489780 + 0.6190066 \times Thickness\_Of\_Clump + 0.5291962 \times Cell\_Shape\_Uniformity \\ + 0.2466034 \times Marginal \_ Adhesion + 2.3139921 \times Bare\_Nuclei.catRapid \\ + 0.4839178 \times Bland\_Chromatin+ 0.2309150 \times Single\_Epithelial\_Cell\_Size \] This is the model we will use to find the odds ratios.

Odds Ratio

In the table below, we added an odds ratio that we converted from the regression coefficients. These make more practical sense to read.

model.coef.stats = summary(final.model.forward)$coef
odds.ratio = exp(coef(final.model.forward))
out.stats = cbind(model.coef.stats, odds.ratio = odds.ratio)                 
kable(out.stats,caption = "Summary Stats with Odds Ratios")
Summary Stats with Odds Ratios
Estimate Std. Error z value Pr(>|z|) odds.ratio
(Intercept) -12.6489780 1.3505450 -9.365832 0.0000000 0.0000032
Thickness_of_Clump 0.6190066 0.1101233 5.621030 0.0000000 1.8570823
Cell_Shape_Uniformity 0.5291962 0.1199551 4.411619 0.0000103 1.6975672
Marginal_Adhesion 0.2466034 0.0959302 2.570654 0.0101507 1.2796715
Bare_Nuclei.catRapid 2.3139921 0.6809299 3.398282 0.0006781 10.1147234
Bland_Chromatin 0.4839178 0.1222115 3.959673 0.0000751 1.6224183
Single_Epithelial_Cell_Size 0.2309150 0.1211191 1.906511 0.0565840 1.2597521

The odds ratio associated with Bare_Nuclei.catRapid is 10.1147 meaning that when Bare nuclei increases rapidly and goes up by one unit, the odds of having a “Yes” outcome increase by about \(10.115\%\). This is the most significant risk factor for breast cancer, statistically and practically. The odds ratio can be found in the table above and interpreted the same way for each response variable.

Discussion/Conclusion

This report focuses on the association analysis of the potential risk factors of breast cancer. We started with 9 variables, 2 of which we made categorical and in our final model, we ended up with 6. These 6 variables are Thickness_of_Clump, Cell_Shape_Uniformity, Marginal_Adhesion, Bare_Nuclei.catRapid, Bland_Chromatin, and Single_Epithelial_Cell_Size.

We can not be sure if the variables we dropped had practical significance, as we did not consult breast cancer experts for this report. This is the best model we have made for association analysis.