This data set includes 12 variables, I will be deleting the variable ID.

  1. Gender: Male or Female

  2. Age: Age (years)

  3. Hypertension: Yes or No (High blood pressure)

  4. Heart_disease: Yes or No

  5. Ever_married: Yes or No

  6. Work_type: Private, Self-employed, Govt_job, Children, or Never_worked

  7. Residence_type: Urban or Royal

  8. Avg_glucose_level: Average Glucose Level

  9. BMI: Body Mass Index

  10. Smoking_status: Formerly smoked, never smoked, smokes, unknown

  11. Stroke: Class variable(test for a stroke)

In the following code, I delete all observations with missing records, I delete the variable ID, and I delete records where age is less than 2. This is because there are records with decimal places so we do not know the true age in years.

The data used for analysis will include 11 variables and 4795 observations.

strokekaggle = read.csv("https://raw.githubusercontent.com/connormckee1323/STA-321/main/stroke.csv", header = TRUE) #reading in csv file and naming it
strokekaggle$bmi =as.numeric(strokekaggle$bmi) #numeric to delete missing records

stroke.0 = strokekaggle    # make a copy of the data for data cleansing 
stroke = na.omit(stroke.0)       # Delete all records with missing components

stroke <- subset(stroke, select = -(id)) #delete ID variable
stroke <- subset(stroke, age>=2)

Research Question

I am looking for factors that increase the chance of having a stroke.

Exploratory Analysis

I am looking at pairwise scatterplots to look for potential issues with predictor variables, and correlation between variables.

library(psych)
pairs.panels(stroke[, -11], 
             method = "pearson", # correlation method
             hist.col = "#00AFBB",
             density = TRUE,  # show density plots
             ellipses = TRUE # show correlation ellipses
             )

We do not see a linear assocation between any variables. Most variables are bimodal, or multimodal because they are categorial variables. We see that age is skewed to the left, so we build a histogram to take a closer look below.

par(mfrow=c(1,1))
hist(stroke$age, xlab="age", main = "")

Based on the histogram, we discretize age.

age = stroke$age

grp.age = age
grp.age[age %in% c(0.00:20.00)] = "0-20"
grp.age[age %in% c(21:24)] = "21-24"
grp.age[age %in% c(25:30)] = "25-30"
grp.age[age %in% c(31:40)] = "31-40"
grp.age[age %in% c(41:50)] = "41-50"
grp.age[age %in% c(51:99)] = "50 +"
## added to the stroke data set
stroke$grp.age = grp.age

stroke$heart_disease = as.factor(stroke$heart_disease)
stroke$hypertension = as.factor(stroke$hypertension)

Model Building Candidates

According to WebMD, main factors associated with having a stroke is hypertension and heart disease. I am going to include these variables in my final model for practical reasons, regardless of their statistical significance.

I am going to build a model with all variables, and a small model with only hypertension and heart disease being the predictor variables. These two models will help me build my last model, which is an automatic selection model using forward step-wise regression which will start with all variables from my small model, and choose variables from my full model based on statistical significance.

Full Model

full.model = glm(stroke~gender+grp.age+hypertension+heart_disease+ever_married+work_type+Residence_type+avg_glucose_level+bmi+smoking_status, 
          family = binomial(link = "logit"),  #  logit(p) = log(p/(1-p))!
          data = stroke)  
kable(summary(full.model)$coef, 
      caption="Summary of inferential statistics of the full model")
Summary of inferential statistics of the full model
Estimate Std. Error z value Pr(>|z|)
(Intercept) -6.2178587 1.061345e+00 -5.8584728 0.0000000
genderMale -0.0728082 1.516328e-01 -0.4801611 0.6311128
genderOther -0.5942210 1.077026e+04 -0.0000552 0.9999560
grp.age21-24 0.0104641 9.558103e+02 0.0000109 0.9999913
grp.age25-30 0.1043019 8.486763e+02 0.0001229 0.9999019
grp.age31-40 14.9125675 6.087398e+02 0.0244974 0.9804558
grp.age41-50 15.9398410 6.087397e+02 0.0261850 0.9791098
grp.age50 + 17.2267749 6.087397e+02 0.0282991 0.9774236
hypertension1 0.6272358 1.718666e-01 3.6495493 0.0002627
heart_disease1 0.6452723 2.012008e-01 3.2071066 0.0013408
ever_marriedYes -0.3057168 2.445262e-01 -1.2502415 0.2112114
work_typeGovt_job -13.4838709 6.087405e+02 -0.0221504 0.9823279
work_typeNever_worked -13.2717808 2.286195e+03 -0.0058052 0.9953682
work_typePrivate -13.2311130 6.087404e+02 -0.0217352 0.9826592
work_typeSelf-employed -13.3618875 6.087405e+02 -0.0219501 0.9824878
Residence_typeUrban 0.0410236 1.473891e-01 0.2783355 0.7807548
avg_glucose_level 0.0055324 1.285500e-03 4.3038432 0.0000168
bmi -0.0187560 1.160620e-02 -1.6160276 0.1060883
smoking_statusnever smoked -0.0665951 1.851887e-01 -0.3596067 0.7191413
smoking_statussmokes 0.0844136 2.234039e-01 0.3778517 0.7055407
smoking_statusUnknown -0.2635420 2.439421e-01 -1.0803465 0.2799879

Manual Model

reduced.model = glm(stroke ~ hypertension + heart_disease, 
          family = binomial(link = "logit"),  # logit(p) = log(p/(1-p))!
          data = stroke) 
kable(summary(reduced.model)$coef, 
      caption="Summary of inferential statistics of the reduced model")
Summary of inferential statistics of the reduced model
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.451502 0.0876580 -39.374652 0
hypertension1 1.307380 0.1666566 7.844756 0
heart_disease1 1.407844 0.1970324 7.145241 0

Automatic Variable Selection

library(MASS)
final.model.forward = stepAIC(reduced.model, 
                      scope = list(lower=formula(reduced.model),upper=formula(full.model)),
                      direction = "forward",   # forward selection
                      trace = 0   # do not show the details
                      )
kable(summary(final.model.forward)$coef, 
      caption="Summary of inferential statistics of the final model")
Summary of inferential statistics of the final model
Estimate Std. Error z value Pr(>|z|)
(Intercept) -6.8909795 1.0339416 -6.6647667 0.0000000
hypertension1 0.6515714 0.1705278 3.8209100 0.0001330
heart_disease1 0.6680755 0.1980582 3.3731278 0.0007432
grp.age21-24 -12.6870260 741.5158532 -0.0171096 0.9863492
grp.age25-30 -12.6779784 592.3335413 -0.0214034 0.9829238
grp.age31-40 2.0147358 1.1009074 1.8300685 0.0672397
grp.age41-50 2.9866882 1.0392570 2.8738687 0.0040548
grp.age50 + 4.2528703 1.0095812 4.2125095 0.0000253
avg_glucose_level 0.0054563 0.0012784 4.2680899 0.0000197
bmi -0.0187445 0.0115840 -1.6181364 0.1056332

Goodness of Fit Measures

## Other global goodness-of-fit
global.measure=function(s.logit){
dev.resid = s.logit$deviance
dev.0.resid = s.logit$null.deviance
aic = s.logit$aic
goodness = cbind(Deviance.residual =dev.resid, Null.Deviance.Residual = dev.0.resid,
      AIC = aic)
goodness
}
goodness=rbind(full.model = global.measure(full.model),
      reduced.model=global.measure(reduced.model),
      final.model=global.measure(final.model.forward))
row.names(goodness) = c("full.model", "reduced.model", "final.model")
kable(goodness, caption ="Comparison of global goodness-of-fit statistics")
Comparison of global goodness-of-fit statistics
Deviance.residual Null.Deviance.Residual AIC
full.model 1409.708 1718.347 1451.708
reduced.model 1610.339 1718.347 1616.339
final.model 1415.833 1718.347 1435.833

Based on our AIC values above, we will be using the final.model as our final model. In practice, it is always best to use the model with the lowest AIC value.

Final Model

# Odds ratio
model.coef.stats = summary(final.model.forward)$coef
odds.ratio = exp(coef(final.model.forward))
out.stats = cbind(model.coef.stats, odds.ratio = odds.ratio)                 
kable(out.stats,caption = "Summary Stats with Odds Ratios")
Summary Stats with Odds Ratios
Estimate Std. Error z value Pr(>|z|) odds.ratio
(Intercept) -6.8909795 1.0339416 -6.6647667 0.0000000 0.0010169
hypertension1 0.6515714 0.1705278 3.8209100 0.0001330 1.9185533
heart_disease1 0.6680755 0.1980582 3.3731278 0.0007432 1.9504800
grp.age21-24 -12.6870260 741.5158532 -0.0171096 0.9863492 0.0000031
grp.age25-30 -12.6779784 592.3335413 -0.0214034 0.9829238 0.0000031
grp.age31-40 2.0147358 1.1009074 1.8300685 0.0672397 7.4987462
grp.age41-50 2.9866882 1.0392570 2.8738687 0.0040548 19.8199344
grp.age50 + 4.2528703 1.0095812 4.2125095 0.0000253 70.3069241
avg_glucose_level 0.0054563 0.0012784 4.2680899 0.0000197 1.0054712
bmi -0.0187445 0.0115840 -1.6181364 0.1056332 0.9814301

The group-age variable grp.age has six categories. The baseline category is ages 0-20. We can see from the above table inferential table that the odds of having a stroke increase as age increases. For example, the odds ratio association with the age group 41-50 is 19.82. This means if you have high blood pressure, heart disease, and the same level of glucose and bmi, your odds of having a stroke is nearly 20 times higher than the base lined group age 0-20. The same ratio increases to 70 times higher than the base line group 0-20 when you are looking at those who are 50+ years old.

Summary and Conclusion

The case study focused on the association analysis between a set of potential risk factors for having a stroke. The initial data set has 12 numerical and categorical variables.

After exploratory analysis, we decide to re-group the discrete variable age, and then define dummy variables for the associated variable. This new variable was used in the analysis.

Since hypertension and heart disease are considered to be major contributors to the risk of having a stroke, we include these factors in the final model regardless of the statistical significance, although our analysis shows that are statistically significant.

After automatic variable selection, we obtain the final model with 5 factors, hypertension (1 dummy variable), heart disease (1 dummy variable), age (with 5 dummy variables), glucose, and BMI.

Our final conclusion from our data is that health factors are what lead to strokes. The odds ratio shows that having heart disease, high blood pressure(hypertension), and higher glucose lead to a stroke.