library(knitr) library(MASS) library(nleqslv) library("scales") library(pander)

Introduction

The stroke data set contains 5110 observations on 12 variables. This data set is free to the public on kaggle.com. This data set contains missing values and an ID variable. The missing values and ID variable will be deleted from the data set.

Data and Variable Descriptions

  1. Gender: Male or Female
  2. Age: age in years
  3. Hypertension: Yes or No
  4. Ever_Married: Yes or No
  5. Work_type: Private, unemployed, self-employed, children, govt_job
  6. Residence_type: Rural or Urban
  7. avg_gluclose_level: Level of gluclose
  8. bmi: body mass index ((weight in kg/(height in m)^2))
  9. smoking_status: formerly smoked, never smoked, smokes
  10. heart_disease: Yes or No
  11. stroke: Yes or no

Clinical Question

Many studies show that body mass index (BMI) is a leading factor linked to strokes. I will be analyzing this data to see the association between BMI and strokes.

The general interpretation of BMI for adults is given below:

  • underweight: < 18.5
  • Normal and Healthy Weight: [18.5, 24.9]
  • Overweight: [25.0, 29.9]
  • Obese: > 30.0

## Building the Simple Logistic Regression

strokekaggle = read.csv("https://raw.githubusercontent.com/connormckee1323/STA-321/main/stroke.csv", header = TRUE) #reading in csv file and naming it
#kable(head(strokekaggle))  #clean table of the first 5 observations of the data


bmi =as.numeric(strokekaggle$bmi)
## Warning: NAs introduced by coercion
strokekaggle$bmi = bmi
stroke.0 = strokekaggle    # make a copy of the data for data cleansing 
stroke = na.omit(stroke.0)       # Delete all records with missing components
y0=stroke$heart_disease

Exploratory Analysis using a Histogram

We are performing simple logistic regression so we will only be using one explantory variable. We will create a histogram to check for normality.

ylimit = max(density(stroke$bmi)$y)
hist(stroke$bmi, probability = TRUE, main = "BMI Distribution", xlab="", 
       col = "azure1", border="lightseagreen")
  lines(density(stroke$bmi, adjust=2), col="blue") 

Build a Simple Logistic Regression

s.logit = glm(heart_disease ~ bmi,  
          family = binomial(link = "logit"),  # family is the binomial, logit(p) = log(p/(1-p))!
          data = stroke)                    # the data frame is a subset of the original iris data
# summary(s.logit)

The summary of major statistics is given below.

model.coef.stats = summary(s.logit)$coef       # output stats of coefficients
conf.ci = confint(s.logit)                     # confidence intervals of betas
## Waiting for profiling to be done...
sum.stats = cbind(model.coef.stats, conf.ci.95=conf.ci)   # rounding off decimals
kable(sum.stats,caption = "The summary stats of regression coefficients")  
The summary stats of regression coefficients
Estimate Std. Error z value Pr(>|z|) 2.5 % 97.5 %
(Intercept) -3.6106616 0.2405196 -15.011922 0.0000000 -4.0805124 -3.1371403
bmi 0.0221954 0.0076556 2.899255 0.0037405 0.0068676 0.0369035
# Odds ratio
model.coef.stats = summary(s.logit)$coef
odds.ratio = exp(coef(s.logit))
out.stats = cbind(model.coef.stats, odds.ratio = odds.ratio)                 
kable(out.stats,caption = "Summary Stats with Odds Ratios")
Summary Stats with Odds Ratios
Estimate Std. Error z value Pr(>|z|) odds.ratio
(Intercept) -3.6106616 0.2405196 -15.011922 0.0000000 0.027034
bmi 0.0221954 0.0076556 2.899255 0.0037405 1.022444

The odds ratio associated with BMI is 1.09 meaning that as the BMI increases by one unit, the odds of being tested positive for diabetes increase by about \(2\%\). This is a practically significant risk factor for diabetes.

Some global goodness-of-fit measures are summarized in the following table

## Other global goodness-of-fit
dev.resid = s.logit$deviance
dev.0.resid = s.logit$null.deviance
aic = s.logit$aic
goodness = cbind(Deviance.residual =dev.resid, Null.Deviance.Residual = dev.0.resid,
      AIC = aic)
pander(goodness)
Deviance.residual Null.Deviance.Residual AIC
1927 1935 1931

We cannot compare goodness-of-fit measures because we do not have another model to compare our AIC value to.

bmi.range = range(stroke$bmi)
x = seq(bmi.range[1], bmi.range[2], length = 100)
beta.x = coef(s.logit)[1] + coef(s.logit)[2]*x
success.prob = exp(beta.x)/(1+exp(beta.x))
failure.prob = 1/(1+exp(beta.x))
ylimit = max(success.prob, failure.prob)
##
beta1 = coef(s.logit)[2]
success.prob.rate = beta1*exp(beta.x)/(1+exp(beta.x))^2
##
##
par(mfrow = c(1,2))
plot(x, success.prob, type = "l", lwd = 2, col = "navy",
     main = "The probability of being \n  having a stroke", 
     ylim=c(0, 1.1*ylimit),
     xlab = "BMI",
     ylab = "probability",
     axes = FALSE,
     col.main = "navy",
     cex.main = 0.8)
# lines(x, failure.prob,lwd = 2, col = "darkred")
axis(1, pos = 0)
axis(2)
# legend(30, 1, c("Success Probability", "Failure Probability"), lwd = rep(2,2), 
#       col = c("navy", "darkred"), cex = 0.7, bty = "n")
##
y.rate = max(success.prob.rate)
plot(x, success.prob.rate, type = "l", lwd = 2, col = "navy",
     main = "The rate of change in the probability \n  of having a stroke", 
     xlab = "BMI",
     ylab = "Rate of Change",
     ylim=c(0,1.1*y.rate),
     axes = FALSE,
     col.main = "navy",
     cex.main = 0.8
     )
axis(1, pos = 0)
axis(2)

Both graphs show an expoentential curve that shows as BMI increases, the probabilty of having a stroke increases and the rate of change increases.