## Building the Simple Logistic Regression

strokekaggle = read.csv("https://raw.githubusercontent.com/connormckee1323/STA-321/main/stroke.csv", header = TRUE) #reading in csv file and naming it
#kable(head(strokekaggle))  #clean table of the first 5 observations of the data


bmi =as.numeric(strokekaggle$bmi)

## Warning: NAs introduced by coercion

strokekaggle$bmi = bmi

stroke.0 = strokekaggle    # make a copy of the data for data cleansing 
stroke = na.omit(stroke.0)       # Delete all records with missing components
y0=stroke$heart_disease

Exploratory Analysis using a Histogram

We are performing simple logistic regression so we will only be using one explantory variable. We will create a histogram to check for normality.

ylimit = max(density(stroke$bmi)$y)
hist(stroke$bmi, probability = TRUE, main = "BMI Distribution", xlab="", 
       col = "azure1", border="lightseagreen")
  lines(density(stroke$bmi, adjust=2), col="blue")

Build a Simple Logistic Regression

s.logit = glm(heart_disease ~ bmi,  
          family = binomial(link = "logit"),  # family is the binomial, logit(p) = log(p/(1-p))!
          data = stroke)                    # the data frame is a subset of the original iris data
# summary(s.logit)

The summary of major statistics is given below.

model.coef.stats = summary(s.logit)$coef       # output stats of coefficients
conf.ci = confint(s.logit)                     # confidence intervals of betas

## Waiting for profiling to be done...

sum.stats = cbind(model.coef.stats, conf.ci.95=conf.ci)   # rounding off decimals
kable(sum.stats,caption = "The summary stats of regression coefficients")

The summary stats of regression coefficients
	Estimate	Std. Error	z value	Pr(>\|z\|)	2.5 %	97.5 %
(Intercept)	-3.6106616	0.2405196	-15.011922	0.0000000	-4.0805124	-3.1371403
bmi	0.0221954	0.0076556	2.899255	0.0037405	0.0068676	0.0369035

# Odds ratio
model.coef.stats = summary(s.logit)$coef
odds.ratio = exp(coef(s.logit))
out.stats = cbind(model.coef.stats, odds.ratio = odds.ratio)                 
kable(out.stats,caption = "Summary Stats with Odds Ratios")

Summary Stats with Odds Ratios
	Estimate	Std. Error	z value	Pr(>\|z\|)	odds.ratio
(Intercept)	-3.6106616	0.2405196	-15.011922	0.0000000	0.027034
bmi	0.0221954	0.0076556	2.899255	0.0037405	1.022444

The odds ratio associated with BMI is 1.09 meaning that as the BMI increases by one unit, the odds of being tested positive for diabetes increase by about \(2\%\). This is a practically significant risk factor for diabetes.

Some global goodness-of-fit measures are summarized in the following table

## Other global goodness-of-fit
dev.resid = s.logit$deviance
dev.0.resid = s.logit$null.deviance
aic = s.logit$aic
goodness = cbind(Deviance.residual =dev.resid, Null.Deviance.Residual = dev.0.resid,
      AIC = aic)
pander(goodness)

Deviance.residual	Null.Deviance.Residual	AIC
1927	1935	1931

We cannot compare goodness-of-fit measures because we do not have another model to compare our AIC value to.

bmi.range = range(stroke$bmi)
x = seq(bmi.range[1], bmi.range[2], length = 100)
beta.x = coef(s.logit)[1] + coef(s.logit)[2]*x
success.prob = exp(beta.x)/(1+exp(beta.x))
failure.prob = 1/(1+exp(beta.x))
ylimit = max(success.prob, failure.prob)
##
beta1 = coef(s.logit)[2]
success.prob.rate = beta1*exp(beta.x)/(1+exp(beta.x))^2
##
##
par(mfrow = c(1,2))
plot(x, success.prob, type = "l", lwd = 2, col = "navy",
     main = "The probability of being \n  having a stroke", 
     ylim=c(0, 1.1*ylimit),
     xlab = "BMI",
     ylab = "probability",
     axes = FALSE,
     col.main = "navy",
     cex.main = 0.8)
# lines(x, failure.prob,lwd = 2, col = "darkred")
axis(1, pos = 0)
axis(2)
# legend(30, 1, c("Success Probability", "Failure Probability"), lwd = rep(2,2), 
#       col = c("navy", "darkred"), cex = 0.7, bty = "n")
##
y.rate = max(success.prob.rate)
plot(x, success.prob.rate, type = "l", lwd = 2, col = "navy",
     main = "The rate of change in the probability \n  of having a stroke", 
     xlab = "BMI",
     ylab = "Rate of Change",
     ylim=c(0,1.1*y.rate),
     axes = FALSE,
     col.main = "navy",
     cex.main = 0.8
     )
axis(1, pos = 0)
axis(2)

Both graphs show an expoentential curve that shows as BMI increases, the probabilty of having a stroke increases and the rate of change increases.

Week 6 Assignment

Connor McKee

2023-03-03

Introduction

Data and Variable Descriptions

Clinical Question

## Building the Simple Logistic Regression

Exploratory Analysis using a Histogram

Build a Simple Logistic Regression