Introduction

In this report, we will be using synthetic breast cancer data to make a multiple logistic regression models. Using global goodness of fit tests, we will decide which is the best model for association analysis.

Data Description

The data set we will be using is synthetic breast cancer data and can be found in the book “Applied Analytics through Case Studies Using SAS and R”. The response variable (Outcome) is a binary categorical variable, and all of the predictor variable are coded as continuous. Before we begin with logistic regression on this data set, we need to make some of the predictor variables categorical. For this I decided to use the variables Bare_Nuceli and Mitosis. The variables are as follows:

Sample_No: Identification Variable
Thickness_of_Clump: Benign cells are more likely monolayers and malignant or cancerous cells are multilayer
Cell_Size_Uniformity: Benign cells does not vary in size and malignant or cancer cell vary in size
Cell_Shape_Uniformity: Benign cells does not vary in shape and malignant or cancer cell vary in shape
Marginal_Adhesion: Benign cells are more likely stick together and cancer cells are loose or does not stick together
Single_Epithelial_Cell_Size: In benign cells epithelial cells are normal and malignant or cancer cells are significantly enlarged

Bare_Nuclei: In benign cells the bare nuclei is not surrounded by cytoplasm and in cancer cells it is surrounded by cytoplasm

      ** Made categorical:
          Score given < 3, variable is coded as "Normal",
          Score > 3, variable is coded as "Surrounded".

Bland_Chromatin: Benign cells have uniform or fine chromatin and cancer cells have coarse chromatin
Normal_Nucleoli: In Benign cells nucleoli is very small and in cancer cells nucleoli is more prominent

Mitoses: In benign cells the cell growth is normal and in cancer cells there is abnormal cell growth

      ** Made categorical:
          Score < 3, variable is coded as "Normal",
          3 < Score < 7, variable is codded as "Abnormal"
          Score > 7, variable is coded as "Rapid"

Outcome(response): No denotes the presence of benign and Yes denotes the presence of malignant breast cancer

y0=BreastCancerData$Outcome
outcome.01 = rep(0, length(y0))      # define a 0-1 to test which probability is used in glm()
outcome.01[which(y0=="Yes")] = 1
BreastCancerData$outcome.01 = outcome.01

In this report, I will be exploring the effect of bland chromatin on the outcome of if a breast cancer tumor is benign or malignant. My practical and analytical questions are both the same: Which variables are most significant in if a woman has breast cancer or not? To do this, we will need to fit the best prediction model. We will be testing this using multiple logistic regression analysis.

We need our response variable to be binary (either 0 or 1) in order to use logistic regression. In this report, an outcome of “No” is coded as 0 and “Yes” is coded as 1. Since we are using simple logistic regression, we only need one explanatory variable.

Data Exploration

We can start by making a pairwise scatterplot of the variables to see the correlation, spot patterns, and assess potential violations.

pairs.panels(BreastCancerData[,-9], 
             method = "pearson", # correlation method
             hist.col = "#00AFBB",
             density = TRUE,  # show density plots
             ellipses = TRUE # show correlation ellipses
             )

Since none of them look completely normal, we will continue our model building with caution.

Multiple Logistic Regression Models

Candidate Model

To make the first model, we will use all of the predictor variables. The regression coefficients are shown below.

full.model = glm(outcome.01 ~Thickness_of_Clump+Cell_Size_Uniformity+Cell_Shape_Uniformity+Marginal_Adhesion
                 +Single_Epithelial_Cell_Size+Bare_Nuclei.cat+Mitoses.cat+Bland_Chromatin+Normal_Nucleoli, 
          family = binomial(link = "logit"),  #  logit(p) = log(p/(1-p))!
          data = BreastCancerData)  
kable(summary(full.model)$coef, 
      caption="Summary of inferential statistics of the full model")

Summary of inferential statistics of the full model
	Estimate	Std. Error	z value	Pr(>\|z\|)
(Intercept)	-11.9515676	1.3443777	-8.8900369	0.0000000
Thickness_of_Clump	0.5469395	0.1157764	4.7241005	0.0000023
Cell_Size_Uniformity	0.0490749	0.1486768	0.3300777	0.7413413
Cell_Shape_Uniformity	0.4709384	0.1521813	3.0945872	0.0019709
Marginal_Adhesion	0.2378641	0.1005272	2.3661666	0.0179734
Single_Epithelial_Cell_Size	0.1444214	0.1315867	1.0975374	0.2724065
Bare_Nuclei.catRapid	2.0286567	0.6834313	2.9683402	0.0029941
Mitoses.catAbnormal	1.1440367	0.8251000	1.3865431	0.1655811
Mitoses.catRapid	2.7657898	2.7928979	0.9902939	0.3220305
Bland_Chromatin	0.4499208	0.1255988	3.5822052	0.0003407
Normal_Nucleoli	0.1129408	0.0960952	1.1753013	0.2398742

Manually Selected Variables

In order to manually select variables, we used an alpha level of .05 and remove any variables with a p-value over it. It is to be noted that we can not be completely sure about the practical significance of the variables that we are dropping based off of p-value. After consulting experts, it is possible that there may be variables that we should not have removed from the model.

reduced.model = glm(outcome.01 ~ Thickness_of_Clump+Cell_Shape_Uniformity+Marginal_Adhesion + Bare_Nuclei.cat+Bland_Chromatin, 
          family = binomial(link = "logit"),  
          data = BreastCancerData) 
kable(summary(reduced.model)$coef, 
      caption="Summary of inferential statistics of the reduced model")

Summary of inferential statistics of the reduced model
	Estimate	Std. Error	z value	Pr(>\|z\|)
(Intercept)	-12.2584056	1.2997209	-9.431567	0.0000000
Thickness_of_Clump	0.6237005	0.1094352	5.699270	0.0000000
Cell_Shape_Uniformity	0.6108980	0.1164100	5.247814	0.0000002
Marginal_Adhesion	0.2912003	0.0921735	3.159264	0.0015817
Bare_Nuclei.catRapid	2.2576754	0.6727562	3.355860	0.0007912
Bland_Chromatin	0.5203022	0.1215298	4.281271	0.0000186

Automatic Variable Selection

Another way we can eliminate variables from the model is the stepAIC function, which uses a forward stepwise procedure to add in the model until it finds the model with the lowest AIC value.

final.model.forward = stepAIC(reduced.model, 
                      scope = list(lower=formula(reduced.model),upper=formula(full.model)),
                      direction = "forward",   # forward selection
                      trace = 0   # do not show the details
                      )
kable(summary(final.model.forward)$coef, 
      caption="Summary of inferential statistics of the final model")

Summary of inferential statistics of the final model
	Estimate	Std. Error	z value	Pr(>\|z\|)
(Intercept)	-12.6489780	1.3505450	-9.365832	0.0000000
Thickness_of_Clump	0.6190066	0.1101233	5.621030	0.0000000
Cell_Shape_Uniformity	0.5291962	0.1199551	4.411619	0.0000103
Marginal_Adhesion	0.2466034	0.0959302	2.570654	0.0101507
Bare_Nuclei.catRapid	2.3139921	0.6809299	3.398282	0.0006781
Bland_Chromatin	0.4839178	0.1222115	3.959673	0.0000751
Single_Epithelial_Cell_Size	0.2309150	0.1211191	1.906511	0.0565840

Final Model Selection

First, we will use global goodness of fit tests to see which model is best. Here, we are looking at the deviance, null deviance, and aic score.

global.measure=function(s.logit){
dev.resid = s.logit$deviance
dev.0.resid = s.logit$null.deviance
aic = s.logit$aic
goodness = cbind(Deviance.residual =dev.resid, Null.Deviance.Residual = dev.0.resid,
      AIC = aic)
goodness
}
goodness=rbind(full.model = global.measure(full.model),
      reduced.model=global.measure(reduced.model),
      final.model=global.measure(final.model.forward))
row.names(goodness) = c("full.model", "reduced.model", "final.model")
kable(goodness, caption ="Comparison of global goodness-of-fit statistics")

Comparison of global goodness-of-fit statistics
	Deviance.residual	Null.Deviance.Residual	AIC
full.model	135.9872	788.5893	157.9872
reduced.model	145.2619	788.5893	157.2619
final.model	141.5054	788.5893	155.5054

We want the model with the lowest AIC and deviance, which is the final model that we made using automatic variable selection, \[ y=-12.6489780 + 0.6190066 \times Thickness\_Of\_Clump + 0.5291962 \times Cell\_Shape\_Uniformity \\ + 0.2466034 \times Marginal \_ Adhesion + 2.3139921 \times Bare\_Nuclei.catRapid \\ + 0.4839178 \times Bland\_Chromatin+ 0.2309150 \times Single\_Epithelial\_Cell\_Size \] This is the model we will use to find the odds ratios.

Odds Ratio

In the table below, we added an odds ratio that we converted from the regression coefficients. These make more practical sense to read.

model.coef.stats = summary(final.model.forward)$coef
odds.ratio = exp(coef(final.model.forward))
out.stats = cbind(model.coef.stats, odds.ratio = odds.ratio)                 
kable(out.stats,caption = "Summary Stats with Odds Ratios")

Summary Stats with Odds Ratios
	Estimate	Std. Error	z value	Pr(>\|z\|)	odds.ratio
(Intercept)	-12.6489780	1.3505450	-9.365832	0.0000000	0.0000032
Thickness_of_Clump	0.6190066	0.1101233	5.621030	0.0000000	1.8570823
Cell_Shape_Uniformity	0.5291962	0.1199551	4.411619	0.0000103	1.6975672
Marginal_Adhesion	0.2466034	0.0959302	2.570654	0.0101507	1.2796715
Bare_Nuclei.catRapid	2.3139921	0.6809299	3.398282	0.0006781	10.1147234
Bland_Chromatin	0.4839178	0.1222115	3.959673	0.0000751	1.6224183
Single_Epithelial_Cell_Size	0.2309150	0.1211191	1.906511	0.0565840	1.2597521

The odds ratio associated with Bare_Nuclei.catRapid is 10.1147 meaning that when Bare nuclei increases rapidly and goes up by one unit, the odds of having a “Yes” outcome increase by about \(10.115\%\). This is the most significant risk factor for breast cancer, statistically and practically. The odds ratio can be found in the table above and interpreted the same way for each response variable.

Discussion/Conclusion

This report focuses on the association analysis of the potential risk factors of breast cancer. We started with 9 variables, 2 of which we made categorical and in our final model, we ended up with 6. These 6 variables are Thickness_of_Clump, Cell_Shape_Uniformity, Marginal_Adhesion, Bare_Nuclei.catRapid, Bland_Chromatin, and Single_Epithelial_Cell_Size.

We can not be sure if the variables we dropped had practical significance, as we did not consult breast cancer experts for this report. This is the best model we have made for association analysis.

Multiple Logistic Regression

Emma Laughlin