This assignment analyzes the association between BMI and the probability of having a stroke using simple logistic regression. The relationship is of interest because the World Health Organization identifies strokes as the cause of approximately 11% of deaths around the world, and the World Stroke Organization identifies being overweight, classified as a high BMI, as one of the top ten risk factors for stroke. In this assignment, we examine whether the results support the literature's claim of an association between BMI and the probability of having a stroke.
The World Health Organization (WHO) identifies stroke as the second leading cause of death across the globe. According to the WHO, strokes are responsible for approximately 11% of total deaths worldwide. This data set was created to predict a patient's likelihood of having a stroke from patient characteristics such as age, gender, disease status, and lifestyle status.
url="https://ChloeWinters79.github.io/STA321/Data/healthcare-dataset-stroke-data.csv"
stroke = read.csv(url, header = TRUE)
The data set, called stroke prediction, has 12 variables. The two used in this analysis are stroke, the binary response, and bmi, the predictor.
The BMI variable has some missing values that were manually entered as “N/A” in the data set, which made BMI a character variable. To correct this, all observations with a BMI of “N/A” were removed, which dropped approximately 4% of the observations. BMI was then converted into a numerical variable.
stroke.clean = subset(stroke, bmi != "N/A")
stroke.clean$bmi = as.numeric(stroke.clean$bmi)
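The same cleaning step can also be done at read time. The sketch below is only an illustration on a small made-up CSV (the `csv.text` values and the names `demo`, `demo.clean`, and `pct.removed` are hypothetical, not the real data). It relies on the `na.strings` argument of `read.csv`, which marks the literal text "N/A" as missing so that `bmi` is parsed as numeric from the start.

```r
# Toy CSV text standing in for the real file (values are made up for illustration).
csv.text = "id,bmi,stroke
1,36.6,1
2,N/A,1
3,32.5,0
4,N/A,0
5,24.0,0"

# na.strings tells read.csv to treat the literal text "N/A" as a missing value,
# so bmi is read as a numeric column instead of a character column.
demo = read.csv(text = csv.text, header = TRUE, na.strings = "N/A")

# Keep only rows with an observed BMI and compute the share of rows dropped.
demo.clean = demo[!is.na(demo$bmi), ]
pct.removed = 100 * (1 - nrow(demo.clean) / nrow(demo))
```

Applied to the real file, the same `na.strings = "N/A"` argument in the `read.csv(url, ...)` call should make the later `as.numeric()` conversion unnecessary, and `pct.removed` gives a direct check on the share of observations dropped.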
The World Stroke Organization (WSO) identifies being overweight as one of the top ten risk factors for stroke, with almost 20% of strokes being associated with the patient being overweight. Since strokes are a leading cause of death around the world, it is important to identify any factors associated with stroke, in the hope that once these factors are properly identified people can attempt to limit their likelihood of having a stroke.
Simple logistic regression is going to be used to explore the potential association between BMI and the probability of having a stroke.
Since this assignment will only be studying the simple logistic regression model, only one predictor variable, in this case BMI, will be included in the model. It is important to conduct exploratory data analysis on BMI to make sure it is not extremely skewed.
ylimit = max(density(stroke.clean$bmi)$y)
hist(stroke.clean$bmi, probability = TRUE, main = "Body Mass Index Distribution", xlab="BMI")
lines(density(stroke.clean$bmi, adjust=2))
While the range of the graph gives the impression of potential skewness, the axis extends to 100 only to accommodate a few outliers that account for less than 1% of the data set. The distribution itself looks to be approximately normal, so there does not seem to be any concerning skewness.
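The visual impression can be backed up with a number. The sketch below defines a small moment-based skewness function; `sample.skewness` is a hypothetical helper written for this illustration, not a base R function, and the two short vectors are made-up examples rather than the BMI data.

```r
# Moment-based sample skewness: third central moment divided by the
# second central moment raised to the power 3/2.
# Values near 0 suggest symmetry; positive values indicate a longer right tail.
sample.skewness = function(x) {
  x = x[!is.na(x)]                 # ignore missing values
  m = mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Made-up vectors for illustration only.
symmetric.demo = c(1, 2, 3, 4, 5)    # perfectly symmetric around 3
right.tail.demo = c(1, 1, 1, 10)     # one large value stretches the right tail

skew.symmetric = sample.skewness(symmetric.demo)
skew.right = sample.skewness(right.tail.demo)
```

Calling `sample.skewness(stroke.clean$bmi)` would give the corresponding value for BMI, and `mean(stroke.clean$bmi > 60)` would give the share of observations in the upper tail, which can be compared against the "less than 1%" statement.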
Since stroke is a binary categorical variable and BMI is a continuous variable, there is no issue of potential imbalance. This means that when fitting a logistic regression model directly to the data, BMI will not need to undergo any transformation.
stroke.logit = glm(stroke ~ bmi, family = binomial(link = "logit"), data = stroke.clean)
result = summary(stroke.logit)
result
##
## Call:
## glm(formula = stroke ~ bmi, family = binomial(link = "logit"),
## data = stroke.clean)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.828443 0.256842 -14.906 < 2e-16 ***
## bmi 0.024160 0.008129 2.972 0.00296 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1728.4 on 4908 degrees of freedom
## Residual deviance: 1720.1 on 4907 degrees of freedom
## AIC: 1724.1
##
## Number of Fisher Scoring iterations: 6
As noted earlier, the response variable stroke is a binary variable in which 0 means the patient has not had a stroke and 1 means that the patient has had a stroke. The fitted simple logistic regression model is linear on the log-odds scale, \[ \log\left(\frac{P(\text{stroke} = 1)}{1 - P(\text{stroke} = 1)}\right) = -3.828443 + 0.024160 \times \text{bmi} \].
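Solving the fitted model for the probability gives a direct way to read off predicted values. As a worked illustration, using the coefficients reported in the output above and a BMI of 30 (a value chosen here only as an example):

\[ \hat{P}(\text{stroke} = 1 \mid \text{bmi}) = \frac{\exp(-3.828443 + 0.024160 \times \text{bmi})}{1 + \exp(-3.828443 + 0.024160 \times \text{bmi})} \]

\[ \hat{P}(\text{stroke} = 1 \mid \text{bmi} = 30) = \frac{\exp(-3.103643)}{1 + \exp(-3.103643)} \approx 0.043 \]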
Below are the summary statistics.
library(knitr)   # provides kable()
model.coef.stats = summary(stroke.logit)$coef
conf.ci = confint(stroke.logit)
## Waiting for profiling to be done...
sum.stats = cbind(model.coef.stats, conf.ci.95 = conf.ci)
kable(sum.stats, caption = "The Summary Statistics of Regression Coefficients")
| | Estimate | Std. Error | z value | Pr(>\|z\|) | 2.5 % | 97.5 % |
|---|---|---|---|---|---|---|
| (Intercept) | -3.8284434 | 0.2568421 | -14.905828 | 0.0000000 | -4.3297856 | -3.3223855 |
| bmi | 0.0241596 | 0.0081288 | 2.972084 | 0.0029579 | 0.0078435 | 0.0397394 |
The above table depicts a positive association between BMI and a patient's stroke status, since \[ \hat{\beta}_1 = 0.0241596 \] and the p-value is less than 0.05. The 95% confidence interval, [0.0078435, 0.0397394], lies entirely above zero and therefore also supports this positive association. This is consistent with the statement from the World Stroke Organization.
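The "Waiting for profiling to be done..." message suggests that `confint()` built this interval from the profile likelihood. As a rough cross-check, a Wald interval can be computed by hand from the reported estimate and standard error. This is a sketch using the numbers copied from the table above; the variable names are illustrative.

```r
# Estimate and standard error of the bmi coefficient, copied from the table above.
est = 0.0241596
se = 0.0081288

# Wald 95% interval: estimate +/- z * standard error, with z from the normal distribution.
z = qnorm(0.975)
wald.ci = c(lower = est - z * se, upper = est + z * se)

# The interval excludes 0 when both limits share the same sign.
excludes.zero = prod(sign(wald.ci)) > 0
```

The Wald limits come out close to, but not identical to, the profile-likelihood limits in the table, and both intervals exclude zero. On the fitted object, `confint.default(stroke.logit)` should return the same Wald interval without the manual arithmetic.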
A more common and practical way of interpreting the association between BMI and strokes is the odds ratio. Below, the estimated regression coefficients have been converted to odds ratios. The result supports the statement from the WSO that being overweight is a significant risk factor for having a stroke.
model.coef.stats = summary(stroke.logit)$coef
odds.ratio = exp(coef(stroke.logit))
out.stats = cbind(model.coef.stats, odds.ratio = odds.ratio)
kable(out.stats,caption = "Summary Stats with Odds Ratios")
| | Estimate | Std. Error | z value | Pr(>\|z\|) | odds.ratio |
|---|---|---|---|---|---|
| (Intercept) | -3.8284434 | 0.2568421 | -14.905828 | 0.0000000 | 0.0217434 |
| bmi | 0.0241596 | 0.0081288 | 2.972084 | 0.0029579 | 1.0244538 |
The odds ratio associated with BMI is approximately 1.02, which means that as BMI increases by one unit, the odds that the patient has had a stroke increase by about 2%.
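The same conversion extends to the confidence limits and to changes of more than one BMI unit. The sketch below uses the estimate and interval copied from the earlier table; the five-unit step is an arbitrary choice for illustration.

```r
# Coefficient and 95% confidence limits for bmi, copied from the earlier table.
beta.1 = 0.0241596
ci.beta = c(0.0078435, 0.0397394)

# Exponentiating moves everything from the log-odds scale to the odds-ratio scale.
or.1 = exp(beta.1)        # odds ratio for a one-unit increase in BMI
or.ci = exp(ci.beta)      # 95% confidence limits for that odds ratio

# For a k-unit increase the odds ratio is exp(k * beta.1); k = 5 is illustrative.
or.5 = exp(5 * beta.1)
pct.5 = 100 * (or.5 - 1)  # percent increase in the odds for a five-unit increase
```

Because the effect is multiplicative on the odds scale, a five-unit increase in BMI corresponds to roughly a 13% increase in the odds rather than five times 2%.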
The below table summarizes some global goodness-of-fit measures.
library(pander)   # provides pander()
dev.resid = stroke.logit$deviance
dev.0.resid = stroke.logit$null.deviance
aic = stroke.logit$aic
goodness = cbind(Deviance.residual =dev.resid, Null.Deviance.Residual = dev.0.resid,
AIC = aic)
pander(goodness)
| Deviance.residual | Null.Deviance.Residual | AIC |
|---|---|---|
| 1720 | 1728 | 1724 |
The global goodness-of-fit measures are based on the likelihood function of the fitted model. Since there are no other candidate models with corresponding likelihoods to compare against, interpreting these measures on their own would not yield meaningful information.
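One comparison that is available from the reported output is against the intercept-only (null) model, since the null deviance is printed alongside the residual deviance. The sketch below is a supplementary likelihood ratio check using the deviances and degrees of freedom copied from the `summary()` output; it is not part of the original analysis.

```r
# Deviances and degrees of freedom copied from the summary() output.
null.dev = 1728.4
resid.dev = 1720.1
df.diff = 4908 - 4907

# Likelihood ratio statistic: the drop in deviance from adding bmi to the null model.
lr.stat = null.dev - resid.dev

# Under the null hypothesis the statistic is approximately chi-squared with df.diff
# degrees of freedom, so the upper tail probability gives the p-value.
lr.p.value = pchisq(lr.stat, df = df.diff, lower.tail = FALSE)
```

On the fitted object, `anova(stroke.logit, test = "Chisq")` should report the same comparison. A small p-value here would agree with the Wald test on the BMI coefficient.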
The success probability curve and the rate of change curve are displayed below.
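For reference, the rate of change curve is the derivative of the success probability with respect to BMI. Writing \(x\) for BMI and \(\beta_0, \beta_1\) for the fitted coefficients, the standard logistic derivative is

\[ P(x) = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}}, \qquad \frac{dP}{dx} = \frac{\beta_1 e^{\beta_0 + \beta_1 x}}{\left(1 + e^{\beta_0 + \beta_1 x}\right)^2} = \beta_1 P(x)\,[1 - P(x)] \]

The middle expression is what `success.prob.rate` computes in the code that follows.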
bmi.range = range(stroke.clean$bmi)
x = seq(bmi.range[1], bmi.range[2], length = 200)
beta.x = coef(stroke.logit)[1] + coef(stroke.logit)[2]*x
success.prob = exp(beta.x)/(1+exp(beta.x))
failure.prob = 1/(1+exp(beta.x))
ylimit = max(success.prob, failure.prob)
beta.1 = coef(stroke.logit)[2]
success.prob.rate = beta.1*exp(beta.x)/(1+exp(beta.x))^2
par(mfrow = c(1,2))
plot(x, success.prob, type = "l", lwd = 2, col = "darkred",
main = "The probability of having a stroke",
ylim = c(0, 1.1*ylimit),
xlab = "BMI",
ylab = "Probability of Stroke",
axes = FALSE,
col.main = "black",
cex.main = 0.8)
axis(1, pos = 0)
axis(2)
y.rate = max(success.prob.rate)
plot(x, success.prob.rate, type="l", lwd =2, col = "darkred",
main = "The rate of change in the probability \n of having a stroke",
xlab = "BMI",
ylab = "Rate of Change",
ylim = c(0,1.1*y.rate),
axes = FALSE,
col.main = "black",
cex.main = 0.8
)
axis(1, pos = 0)
axis(2)
The success probability curve, displayed on the left, is not the standard S curve. While it looks like it could be the start of one, it is cut off as BMI approaches 100 because there are no observations with a BMI above 100. Since the curve could turn out to be either a straight line or an S curve, the rate of change curve was also created. It shows that the rate of change in the probability of having a stroke increases steadily as BMI increases. However, the range of the rate of change is very small, from 0 to about 0.004. Additionally, the success probability does not rise above approximately 0.2.
While the curves do not depict the typical S curve that would be expected in a simple logistic regression, a straight line is also not depicted. The success probability curve also does not indicate that the predicted probability would leave the 0 to 1 range. Since the few values that exceed a BMI of 61 in this data set are perceived to be outliers, seeing any values significantly higher than 100 is unlikely. Additionally, since the graph depicts a predicted probability of approximately 0.2 when BMI is near 100, approaching a predicted probability of 1 would require an incredibly high BMI. Since the predicted probability stays within the 0 to 1 range for any likely value of BMI, it can be concluded that the curve is a logistic regression S curve.
Delving deeper into the determination of the S curve, the context of BMI is important. While theoretically an observation with a BMI of over 600 could occur and show that the curve is actually a straight line, making the model linear instead of logistic, the practical applications need to be considered. Given the real-world understanding of BMI, a patient with a BMI even over 300 is most likely not humanly possible. In this case, the real-world restrictions on a variable are incredibly important for model building and results analysis: they allow us to come to the proper conclusions when the graphs alone are unable to provide satisfactory information.
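The shape question can also be examined numerically from the fitted coefficients. The sketch below uses the coefficients copied from the model output; `p.at` is a hypothetical helper defined here for illustration. It locates the BMI at which the fitted probability reaches 0.5, which is the inflection point of a logistic curve and the place where the rate of change peaks at one quarter of the slope coefficient.

```r
# Fitted coefficients copied from the model output.
b0 = -3.828443
b1 = 0.024160

# Fitted probability of stroke at a given BMI; plogis() is the logistic function.
p.at = function(bmi) plogis(b0 + b1 * bmi)

# The curve crosses 0.5 where the linear predictor equals 0, i.e. at -b0 / b1.
bmi.half = -b0 / b1

# The steepest slope of a logistic curve occurs at that point and equals b1 / 4.
max.rate = b1 / 4

# Fitted probabilities at the edge of the plotted range and at an extreme value.
p.100 = p.at(100)
p.600 = p.at(600)
```

Under these coefficients the midpoint of the S curve falls at a BMI of roughly 158, well beyond the observed data, which is why only the lower, gradually steepening part of the curve is visible in the plots. The fitted probability also remains below 1 even at a BMI of 600, since the logistic function is bounded between 0 and 1.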
Additionally, it is important to note BMI’s positive association with the probability of a patient having a stroke. While a 2% increase in the odds per one-unit BMI increase may not seem substantial, for some people a one-unit BMI increase means gaining just a few extra pounds. That 2% can quickly add up and substantially increase someone’s odds of having a stroke if they are not careful. With strokes being the cause of approximately 11% of deaths across the world, a small increase in BMI could turn out to be much more serious than going up a pants size.
World Stroke Organization. Stroke Risk Factors: Weight. Understanding Weight and Stroke. https://www.world-stroke.org/assets/downloads/WSO_Don tBeTheOne_PI_Leaflets_-_WEIGHT.pdf