Introduction

In this report, we use synthetic breast cancer data to fit a simple logistic regression model.

Data Description

The data set is synthetic breast cancer data from the book “Applied Analytics through Case Studies Using SAS and R”. The response variable (Outcome) is a binary categorical variable, and all of the predictor variables are coded as continuous. Before we begin with logistic regression on this data set, we need to make some of the predictor variables categorical. For this I decided to use the variables Bare_Nuceli and Mitosis. The variables are as follows:

y0 = BreastCancerData$Outcome
outcome.01 = rep(0, length(y0))      # 0-1 response so that glm() models P(Outcome = "Yes")
outcome.01[which(y0 == "Yes")] = 1   # "Yes" is coded as 1, everything else stays 0
BreastCancerData$outcome.01 = outcome.01
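
The recoding above can also be written in a single step. The following is a minimal sketch on a small made-up vector (the values in y0 are hypothetical stand-ins, not the real Outcome column), cross-checked against the rep()/which() approach.

```r
# Toy stand-in for BreastCancerData$Outcome (hypothetical values, not the real data)
y0 = c("Yes", "No", "No", "Yes", "No")

# One-step recoding: the logical comparison is converted to 1/0
outcome.01 = as.integer(y0 == "Yes")

# Cross-check against the rep()/which() approach
check = rep(0, length(y0))
check[which(y0 == "Yes")] = 1
stopifnot(all(outcome.01 == check))

# Cross-tabulate the original labels against the 0-1 coding
print(table(y0, outcome.01))
```

One behavioural difference to keep in mind: a missing Outcome value stays NA with as.integer(), whereas the rep()/which() version codes it as 0.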

In this report, I explore the effect of bland chromatin on whether a breast tumor is benign or malignant. My practical and analytical questions are the same: How much effect does bland chromatin have on whether a woman has breast cancer? Typically, benign cells have uniform or fine chromatin and cancer cells have coarse chromatin, so we test this relationship using logistic regression analysis.

Logistic regression requires a binary response variable (either 0 or 1). In this report, an outcome of “No” is coded as 0 and “Yes” is coded as 1. Since we are using simple logistic regression, we only need one explanatory variable, Bland_Chromatin. We can start by making a histogram of this variable to examine its distribution.

hist(BreastCancerData$Bland_Chromatin, probability = TRUE,
     main = "Bland Chromatin Distribution", xlab = "",
     col = "azure1", border = "lightseagreen")
lines(density(BreastCancerData$Bland_Chromatin, adjust = 2), col = "blue")   # smoothed density overlay

The distribution is not normal, but logistic regression makes no normality assumption about the explanatory variable, so we do not need to transform Bland_Chromatin before fitting the model.

Simple Logistic Regression

We will use the glm() function to fit a simple logistic regression model.

library(knitr)                                 # provides kable()
s.logit = glm(outcome.01 ~ Bland_Chromatin, 
          family = binomial(link = "logit"),  
          data = BreastCancerData)  
model.coef.stats = summary(s.logit)$coef       # output stats of coefficients
conf.ci = confint(s.logit)                     # confidence intervals of betas
## Waiting for profiling to be done...
sum.stats = cbind(model.coef.stats, conf.ci.95=conf.ci)   # append the 95% CIs to the coefficient stats
kable(sum.stats,caption = "Summary Stats of Regression Coefficients")  
Summary Stats of Regression Coefficients

                   Estimate  Std. Error    z value  Pr(>|z|)       2.5 %     97.5 %
(Intercept)      -5.1057489   0.3867279  -13.20243         0  -5.9062506  -4.386375
Bland_Chromatin   0.9790174   0.0803681   12.18167         0   0.8302015   1.145992

Bland Chromatin is positively associated with breast cancer status, since \(\beta_1 = 0.9790174\) with a p-value close to 0. The 95% confidence interval for \(\beta_1\) is [0.8302015, 1.145992], which does not contain 0. These values support the expectation that coarser chromatin is associated with malignancy.
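
To make the coefficients more concrete, the sketch below plugs the estimates from the table into the inverse logit and evaluates the fitted probability of a “Yes” outcome at a few Bland_Chromatin scores. The scores 1, 5, and 10 are illustrative choices assumed to lie within the observed range, and the coefficients are typed in by hand rather than taken from s.logit so that the snippet runs on its own.

```r
# Estimates copied from the coefficient table
b0 = -5.1057489   # intercept
b1 =  0.9790174   # slope for Bland_Chromatin

# Illustrative scores (assumed to lie within the observed range)
x.new = c(1, 5, 10)

# Linear predictor (log-odds) and fitted probability;
# plogis() is the inverse logit: exp(eta) / (1 + exp(eta))
eta = b0 + b1 * x.new
p.hat = plogis(eta)

print(round(data.frame(Bland_Chromatin = x.new, log.odds = eta, prob = p.hat), 4))
```

With the fitted object available, predict(s.logit, newdata = data.frame(Bland_Chromatin = x.new), type = "response") should return the same probabilities.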

Odds Ratio

In the table below, we add the odds ratios obtained by exponentiating the regression coefficients. Odds ratios are easier to interpret in practical terms than coefficients on the log-odds scale.

model.coef.stats = summary(s.logit)$coef
odds.ratio = exp(coef(s.logit))
out.stats = cbind(model.coef.stats, odds.ratio = odds.ratio)                 
kable(out.stats,caption = "Summary Stats with Odds Ratios")
Summary Stats with Odds Ratios

                   Estimate  Std. Error    z value  Pr(>|z|)  odds.ratio
(Intercept)      -5.1057489   0.3867279  -13.20243         0   0.0060618
Bland_Chromatin   0.9790174   0.0803681   12.18167         0   2.6618395

The odds ratio associated with Bland Chromatin is 2.66, meaning that as Bland Chromatin increases by one unit, the odds of a “Yes” outcome are multiplied by about 2.66, an increase of about \(166\%\). This is a practically significant risk factor for breast cancer.
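
The conversion from a log-odds coefficient to an odds ratio, and from an odds ratio to a percent change in odds, can be sketched as follows. The slope and its confidence limits are copied from the tables in this report. The 3-unit increase (k) is a hypothetical value chosen only to illustrate that odds ratios multiply across units.

```r
# Slope and its 95% confidence limits, copied from the tables in this report
b1    = 0.9790174
ci.b1 = c(0.8302015, 1.145992)

# Odds ratio for a one-unit increase in Bland_Chromatin
odds.ratio = exp(b1)

# Percent change in the odds per one-unit increase: (OR - 1) * 100
pct.change = (odds.ratio - 1) * 100

# Confidence interval on the odds-ratio scale: exponentiate the limits
or.ci = exp(ci.b1)

# Odds ratio for a k-unit increase (k = 3 is a hypothetical example)
k = 3
or.k = exp(k * b1)

print(round(c(odds.ratio = odds.ratio, pct.change = pct.change,
              ci.lower = or.ci[1], ci.upper = or.ci[2], or.k = or.k), 3))
```

Exponentiating the confidence limits gives an interval for the odds ratio that excludes 1 exactly when the interval for the coefficient excludes 0.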

Other goodness-of-fit measures are based on the likelihood function and are mainly useful for comparing candidate models. Since we fit only one model, we do not interpret them here.

Success Probability Curve

Bland_Chromatin.range = range(BreastCancerData$Bland_Chromatin)
x = seq(Bland_Chromatin.range[1], Bland_Chromatin.range[2], length = 200)
beta.x = coef(s.logit)[1] + coef(s.logit)[2]*x
success.prob = exp(beta.x)/(1+exp(beta.x))
failure.prob = 1/(1+exp(beta.x))
ylimit = max(success.prob, failure.prob)

beta1 = coef(s.logit)[2]
success.prob.rate = beta1*exp(beta.x)/(1+exp(beta.x))^2

par(mfrow = c(1,2))
plot(x, success.prob, type = "l", lwd = 2, col = "navy",
     main = "The probability of being \n  tested positive in Breast Cancer", 
     ylim=c(0, 1),
     xlab = "Bland Chromatin",
     ylab = "probability",
     axes = FALSE,
     col.main = "navy",
     cex.main = 0.8)

axis(1, pos = 0)
axis(2)

y.rate = max(success.prob.rate)
plot(x, success.prob.rate, type = "l", lwd = 2, col = "navy",
     main = "The rate of change in the probability \n  of being tested positive in Breast Cancer", 
     xlab = "Bland_Chromatin",
     ylab = "Rate of Change",
     ylim=c(0,1.1*y.rate),
     axes = FALSE,
     col.main = "navy",
     cex.main = 0.8)
axis(1, pos = 0)
axis(2)

The success probability curve is S-shaped: the estimated probability of a positive breast cancer result rises as the Bland_Chromatin score increases. The rate-of-change plot peaks at a Bland_Chromatin value of about 5, which is where the probability curve is steepest.
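
The location of the turning point can also be worked out from the fitted coefficients. As a short derivation using the same formulas as the plotting code (success.prob and success.prob.rate), and assuming the estimates \(\beta_0 = -5.1057\) and \(\beta_1 = 0.9790\) from the coefficient table: the rate of change equals \(\beta_1 p(1-p)\), which is largest when \(p = 1/2\), that is, when \(\beta_0 + \beta_1 x = 0\).

```latex
\[
\begin{aligned}
p(x) &= \frac{e^{\beta_0+\beta_1 x}}{1+e^{\beta_0+\beta_1 x}}, \\
\frac{dp}{dx} &= \beta_1\,\frac{e^{\beta_0+\beta_1 x}}{\left(1+e^{\beta_0+\beta_1 x}\right)^{2}}
              = \beta_1\,p(x)\,\bigl(1-p(x)\bigr), \\
x^{*} &= -\frac{\beta_0}{\beta_1} \approx \frac{5.1057}{0.9790} \approx 5.22,
\qquad
\left.\frac{dp}{dx}\right|_{x = x^{*}} = \frac{\beta_1}{4} \approx 0.245 .
\end{aligned}
\]
```

So the steepest increase occurs at a Bland_Chromatin value of roughly 5.2, where each additional unit raises the estimated probability by at most about 0.24.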

Conclusion

In conclusion, our logistic model shows that Bland Chromatin is positively associated with a positive breast cancer result: higher Bland_Chromatin scores correspond to higher odds of a “Yes” outcome.