In this report, we use synthetic breast cancer data to build a simple logistic regression model.
The data set can be found in the book “Applied Analytics through Case Studies Using SAS and R”. The response variable (Outcome) is a binary categorical variable, and all of the predictor variables are coded as continuous. Before we begin with logistic regression on this data set, we need to make some of the predictor variables categorical. For this I decided to use the variables Bare_Nuclei and Mitoses. The variables are as follows:
Sample_No: Identification variable
Thickness_of_Clump: Benign cells tend to form monolayers, while malignant (cancerous) cells tend to be multilayered
Cell_Size_Uniformity: Benign cells do not vary in size, while malignant cells vary in size
Cell_Shape_Uniformity: Benign cells do not vary in shape, while malignant cells vary in shape
Marginal_Adhesion: Benign cells are more likely to stick together, while cancer cells are loose and do not stick together
Single_Epithelial_Cell_Size: In benign cells the epithelial cells are normal in size; in malignant cells they are significantly enlarged
Bare_Nuclei: In benign cells the bare nuclei are not surrounded by cytoplasm; in cancer cells they are surrounded by cytoplasm
** Made categorical:
Score < 3, variable is coded as "Normal",
Score > 3, variable is coded as "Surrounded"
Bland_Chromatin: Benign cells have uniform or fine chromatin and cancer cells have coarse chromatin
Normal_Nucleoli: In benign cells the nucleoli are very small; in cancer cells they are more prominent
Mitoses: In benign cells cell growth is normal; in cancer cells there is abnormal cell growth
** Made categorical:
Score < 3, variable is coded as "Normal",
3 < Score < 7, variable is coded as "Abnormal",
Score > 7, variable is coded as "Rapid"
Outcome (response): No denotes a benign tumor and Yes denotes malignant breast cancer
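The rules above do not say where scores of exactly 3 or 7 belong, so the following is only a sketch of how the recoding could be done with cut(). It assumes that boundary scores fall in the upper category, and it uses a small made-up data frame named toy in place of BreastCancerData.

```r
# Hypothetical stand-in for BreastCancerData (scores on a 1-10 scale)
toy = data.frame(Bare_Nuclei = c(1, 2, 3, 5, 10),
                 Mitoses     = c(1, 3, 5, 7, 9))

# right = FALSE makes the intervals [a, b), so a score of exactly 3
# lands in the upper group (an assumption, not stated in the report)
toy$Bare_Nuclei.cat = cut(toy$Bare_Nuclei,
                          breaks = c(-Inf, 3, Inf),
                          labels = c("Normal", "Surrounded"),
                          right  = FALSE)

# Same convention for Mitoses: 3 goes to "Abnormal", 7 goes to "Rapid"
toy$Mitoses.cat = cut(toy$Mitoses,
                      breaks = c(-Inf, 3, 7, Inf),
                      labels = c("Normal", "Abnormal", "Rapid"),
                      right  = FALSE)

print(toy)
```

If the boundary scores should instead fall in the lower category, dropping right = FALSE (the default is right = TRUE) switches the intervals to (a, b].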
y0 = BreastCancerData$Outcome
outcome.01 = rep(0, length(y0)) # 0-1 response so that glm() models the probability of "Yes"
outcome.01[which(y0 == "Yes")] = 1 # "Yes" (malignant) is coded as 1, "No" stays 0
BreastCancerData$outcome.01 = outcome.01
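As a possible shortcut, the same 0-1 recoding can be written in one line by converting a logical comparison to integers. The sketch below uses a short made-up vector rather than the real Outcome column, and cross-tabulates the result as a quick check that every "Yes" maps to 1.

```r
# Hypothetical Outcome values, for illustration only
y0 = c("No", "Yes", "Yes", "No")

# TRUE/FALSE becomes 1/0
outcome.01 = as.integer(y0 == "Yes")

# Every "Yes" should sit in the column for 1, every "No" in the column for 0
table(y0, outcome.01)
```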
In this report, I explore the effect of bland chromatin on whether a breast tumor is benign or malignant. My practical and analytical questions are the same: how strongly is the Bland_Chromatin score associated with a malignant outcome? Typically, benign cells have uniform or fine chromatin and cancer cells have coarse chromatin, so we test this association using logistic regression analysis.
We need our response variable to be binary (either 0 or 1) in order to use logistic regression. In this report, an outcome of “No” is coded as 0 and “Yes” is coded as 1. Since we are using simple logistic regression, we only need one explanatory variable. We can start by making a histogram of this variable to look at its distribution.
ylimit = max(density(BreastCancerData$Bland_Chromatin)$y)
hist(BreastCancerData$Bland_Chromatin, probability = TRUE, main = "Bland Chromatin Distribution", xlab="",
col = "azure1", border="lightseagreen")
lines(density(BreastCancerData$Bland_Chromatin, adjust=2), col="blue")
The distribution is not normal, but logistic regression makes no normality assumption about the explanatory variable, so we do not need to transform Bland_Chromatin before fitting the model.
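For reference, the simple logistic regression model used in this report can be written as follows, where \(p(x)\) denotes the probability of a “Yes” outcome at Bland_Chromatin score \(x\) (the symbol \(p(x)\) is notation introduced here):

\[
\log\left(\frac{p(x)}{1-p(x)}\right) = \beta_0 + \beta_1 x,
\qquad
p(x) = \frac{e^{\beta_0+\beta_1 x}}{1+e^{\beta_0+\beta_1 x}}
\]

The model places a distributional assumption only on the response given \(x\), not on \(x\) itself.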
We use the glm() function to fit a simple logistic regression model.
s.logit = glm(outcome.01 ~Bland_Chromatin,
family = binomial(link = "logit"),
data = BreastCancerData)
model.coef.stats = summary(s.logit)$coef # output stats of coefficients
conf.ci = confint(s.logit) # confidence intervals of betas
## Waiting for profiling to be done...
sum.stats = cbind(model.coef.stats, conf.ci.95=conf.ci) # combine coefficient stats with the 95% CIs
library(knitr) # provides kable()
kable(sum.stats,caption = "Summary Stats of Regression Coefficients")
| | Estimate | Std. Error | z value | Pr(>\|z\|) | 2.5 % | 97.5 % |
|---|---|---|---|---|---|---|
| (Intercept) | -5.1057489 | 0.3867279 | -13.20243 | 0 | -5.9062506 | -4.386375 |
| Bland_Chromatin | 0.9790174 | 0.0803681 | 12.18167 | 0 | 0.8302015 | 1.145992 |
Bland_Chromatin is positively associated with a malignant outcome, since \(\beta_1 = 0.9790174\) with a p-value close to 0. The 95% confidence interval for \(\beta_1\) is [0.8302015, 1.145992], which does not contain 0. These values support the expectation that coarser chromatin goes with malignancy.
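The interval reported by confint() for a glm fit is based on profiling the likelihood, so it differs slightly from the simpler Wald interval (estimate plus or minus about 1.96 standard errors). As a rough cross-check, the sketch below recomputes the z value and a Wald interval from the numbers in the table above; est and se are names introduced here for illustration.

```r
est = 0.9790174   # Estimate for Bland_Chromatin, copied from the table
se  = 0.0803681   # Std. Error for Bland_Chromatin, copied from the table

z.value = est / se                             # Wald z statistic
wald.ci = est + c(-1, 1) * qnorm(0.975) * se   # Wald 95% interval

round(c(z = z.value, lower = wald.ci[1], upper = wald.ci[2]), 4)
```

The Wald interval comes out to roughly [0.82, 1.14], close to but not identical to the profile-likelihood interval in the table. Both exclude 0.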
In the table below, we add the odds ratios obtained by exponentiating the regression coefficients. These are easier to interpret in practice.
model.coef.stats = summary(s.logit)$coef
odds.ratio = exp(coef(s.logit))
out.stats = cbind(model.coef.stats, odds.ratio = odds.ratio)
kable(out.stats,caption = "Summary Stats with Odds Ratios")
| | Estimate | Std. Error | z value | Pr(>\|z\|) | odds.ratio |
|---|---|---|---|---|---|
| (Intercept) | -5.1057489 | 0.3867279 | -13.20243 | 0 | 0.0060618 |
| Bland_Chromatin | 0.9790174 | 0.0803681 | 12.18167 | 0 | 2.6618395 |
The odds ratio associated with Bland_Chromatin is 2.66, meaning that as Bland_Chromatin increases by one unit, the odds of a “Yes” outcome are multiplied by about 2.66, which is an increase of about \(166\%\). This is a practically significant risk factor for breast cancer.
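The arithmetic behind this interpretation can be sketched as follows, using the coefficient estimate and confidence limits from the earlier table. The names ci.beta1, pct.increase, and or.ci are introduced here for illustration.

```r
beta1    = 0.9790174               # Bland_Chromatin estimate from the table
ci.beta1 = c(0.8302015, 1.145992)  # 95% CI for beta1 from the table

odds.ratio   = exp(beta1)              # multiplicative change in odds per unit
pct.increase = (odds.ratio - 1) * 100  # the same change expressed in percent
or.ci        = exp(ci.beta1)           # CI limits on the odds-ratio scale

round(c(odds.ratio = odds.ratio, pct.increase = pct.increase), 2)
round(or.ci, 2)
```

Exponentiating the confidence limits gives an interval of roughly [2.29, 3.15] for the odds ratio, which lies entirely above 1.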
Since we fit only one model, there is nothing to compare it against, so we do not report the likelihood-based goodness-of-fit measures that are mainly used for comparing candidate models.
Bland_Chromatin.range = range(BreastCancerData$Bland_Chromatin)
x = seq(Bland_Chromatin.range[1], Bland_Chromatin.range[2], length = 200)
beta.x = coef(s.logit)[1] + coef(s.logit)[2]*x
success.prob = exp(beta.x)/(1+exp(beta.x))
failure.prob = 1/(1+exp(beta.x))
ylimit = max(success.prob, failure.prob)
beta1 = coef(s.logit)[2]
success.prob.rate = beta1*exp(beta.x)/(1+exp(beta.x))^2
par(mfrow = c(1,2))
plot(x, success.prob, type = "l", lwd = 2, col = "navy",
main = "The probability of being \n tested positive in Breast Cancer",
ylim=c(0, 1),
xlab = "Bland Chromatin",
ylab = "probability",
axes = FALSE,
col.main = "navy",
cex.main = 0.8)
axis(1, pos = 0)
axis(2)
y.rate = max(success.prob.rate)
plot(x, success.prob.rate, type = "l", lwd = 2, col = "navy",
main = "The rate of change in the probability \n of being tested positive in Breast Cancer",
xlab = "Bland_Chromatin",
ylab = "Rate of Change",
ylim=c(0,1.1*y.rate),
axes = FALSE,
col.main = "navy",
cex.main = 0.8)
axis(1, pos = 0)
axis(2)
The success probability plot shows an S-shaped curve: the probability of a positive breast cancer result increases as the Bland_Chromatin score increases. The rate of change plot shows that the curve is steepest when Bland_Chromatin is around 5.
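For a logistic curve, the rate of change \(\beta_1 p(1-p)\) is largest where the probability equals 0.5, which happens at \(x = -\beta_0/\beta_1\). The sketch below checks this using the coefficient estimates from the table; beta0, x.star, p.star, and max.rate are names introduced here for illustration.

```r
beta0 = -5.1057489   # intercept estimate from the table
beta1 =  0.9790174   # Bland_Chromatin estimate from the table

x.star   = -beta0 / beta1                   # score where the curve is steepest
p.star   = plogis(beta0 + beta1 * x.star)   # fitted probability at that score
max.rate = beta1 * p.star * (1 - p.star)    # peak rate of change, beta1 / 4

round(c(x.star = x.star, p.star = p.star, max.rate = max.rate), 3)
```

The steepest point falls at a score of about 5.2, where the fitted probability is 0.5 and the probability rises by roughly 0.24 per unit of Bland_Chromatin.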
In conclusion, our logistic model shows that the Bland_Chromatin score is positively associated with a malignant outcome: patients with higher scores are more likely to have a positive breast cancer result.