The stroke data set contains 5110 observations on 12 variables. This data set is free to the public on kaggle.com. This data set contains missing values and an ID variable. The missing values and ID variable will be deleted from the data set.
Many studies show that body mass index (BMI) is a leading factor linked to strokes. I will be analyzing this data to see the association between BMI and strokes.
The general interpretation of BMI for adults is given below:
strokekaggle = read.csv("https://raw.githubusercontent.com/connormckee1323/STA-321/main/stroke.csv", header = TRUE) #reading in csv file and naming it
#kable(head(strokekaggle)) #clean table of the first 5 observations of the data
bmi =as.numeric(strokekaggle$bmi)
## Warning: NAs introduced by coercion
strokekaggle$bmi = bmi
stroke.0 = strokekaggle # make a copy of the data for data cleansing
stroke = na.omit(stroke.0) # Delete all records with missing components
y0=stroke$heart_disease
We are performing simple logistic regression so we will only be using one explantory variable. We will create a histogram to check for normality.
ylimit = max(density(stroke$bmi)$y)
hist(stroke$bmi, probability = TRUE, main = "BMI Distribution", xlab="",
col = "azure1", border="lightseagreen")
lines(density(stroke$bmi, adjust=2), col="blue")
s.logit = glm(heart_disease ~ bmi,
family = binomial(link = "logit"), # family is the binomial, logit(p) = log(p/(1-p))!
data = stroke) # the data frame is a subset of the original iris data
# summary(s.logit)
The summary of major statistics is given below.
model.coef.stats = summary(s.logit)$coef # output stats of coefficients
conf.ci = confint(s.logit) # confidence intervals of betas
## Waiting for profiling to be done...
sum.stats = cbind(model.coef.stats, conf.ci.95=conf.ci) # rounding off decimals
kable(sum.stats,caption = "The summary stats of regression coefficients")
| Estimate | Std. Error | z value | Pr(>|z|) | 2.5 % | 97.5 % | |
|---|---|---|---|---|---|---|
| (Intercept) | -3.6106616 | 0.2405196 | -15.011922 | 0.0000000 | -4.0805124 | -3.1371403 |
| bmi | 0.0221954 | 0.0076556 | 2.899255 | 0.0037405 | 0.0068676 | 0.0369035 |
# Odds ratio
model.coef.stats = summary(s.logit)$coef
odds.ratio = exp(coef(s.logit))
out.stats = cbind(model.coef.stats, odds.ratio = odds.ratio)
kable(out.stats,caption = "Summary Stats with Odds Ratios")
| Estimate | Std. Error | z value | Pr(>|z|) | odds.ratio | |
|---|---|---|---|---|---|
| (Intercept) | -3.6106616 | 0.2405196 | -15.011922 | 0.0000000 | 0.027034 |
| bmi | 0.0221954 | 0.0076556 | 2.899255 | 0.0037405 | 1.022444 |
The odds ratio associated with BMI is 1.09 meaning that as the BMI increases by one unit, the odds of being tested positive for diabetes increase by about \(2\%\). This is a practically significant risk factor for diabetes.
Some global goodness-of-fit measures are summarized in the following table
## Other global goodness-of-fit
dev.resid = s.logit$deviance
dev.0.resid = s.logit$null.deviance
aic = s.logit$aic
goodness = cbind(Deviance.residual =dev.resid, Null.Deviance.Residual = dev.0.resid,
AIC = aic)
pander(goodness)
| Deviance.residual | Null.Deviance.Residual | AIC |
|---|---|---|
| 1927 | 1935 | 1931 |
We cannot compare goodness-of-fit measures because we do not have another model to compare our AIC value to.
bmi.range = range(stroke$bmi)
x = seq(bmi.range[1], bmi.range[2], length = 100)
beta.x = coef(s.logit)[1] + coef(s.logit)[2]*x
success.prob = exp(beta.x)/(1+exp(beta.x))
failure.prob = 1/(1+exp(beta.x))
ylimit = max(success.prob, failure.prob)
##
beta1 = coef(s.logit)[2]
success.prob.rate = beta1*exp(beta.x)/(1+exp(beta.x))^2
##
##
par(mfrow = c(1,2))
plot(x, success.prob, type = "l", lwd = 2, col = "navy",
main = "The probability of being \n having a stroke",
ylim=c(0, 1.1*ylimit),
xlab = "BMI",
ylab = "probability",
axes = FALSE,
col.main = "navy",
cex.main = 0.8)
# lines(x, failure.prob,lwd = 2, col = "darkred")
axis(1, pos = 0)
axis(2)
# legend(30, 1, c("Success Probability", "Failure Probability"), lwd = rep(2,2),
# col = c("navy", "darkred"), cex = 0.7, bty = "n")
##
y.rate = max(success.prob.rate)
plot(x, success.prob.rate, type = "l", lwd = 2, col = "navy",
main = "The rate of change in the probability \n of having a stroke",
xlab = "BMI",
ylab = "Rate of Change",
ylim=c(0,1.1*y.rate),
axes = FALSE,
col.main = "navy",
cex.main = 0.8
)
axis(1, pos = 0)
axis(2)
Both graphs show an expoentential curve that shows as BMI increases, the probabilty of having a stroke increases and the rate of change increases.