This data set includes 12 variables, I will be deleting the variable ID.
Gender: Male or Female
Age: Age (years)
Hypertension: Yes or No (High blood pressure)
Heart_disease: Yes or No
Ever_married: Yes or No
Work_type: Private, Self-employed, Govt_job, Children, or Never_worked
Residence_type: Urban or Royal
Avg_glucose_level: Average Glucose Level
BMI: Body Mass Index
Smoking_status: Formerly smoked, never smoked, smokes, unknown
Stroke: Class variable(test for a stroke)
In the following code, I delete all observations with missing records, I delete the variable ID, and I delete records where age is less than 2. This is because there are records with decimal places so we do not know the true age in years.
The data used for analysis will include 11 variables and 4795 observations.
strokekaggle = read.csv("https://raw.githubusercontent.com/connormckee1323/STA-321/main/stroke.csv", header = TRUE) #reading in csv file and naming it
strokekaggle$bmi =as.numeric(strokekaggle$bmi) #numeric to delete missing records
stroke.0 = strokekaggle # make a copy of the data for data cleansing
stroke = na.omit(stroke.0) # Delete all records with missing components
stroke <- subset(stroke, select = -(id)) #delete ID variable
stroke <- subset(stroke, age>=2)
I am looking for factors that increase the chance of having a stroke.
I am looking at pairwise scatterplots to look for potential issues with predictor variables, and correlation between variables.
library(psych)
pairs.panels(stroke[, -11],
method = "pearson", # correlation method
hist.col = "#00AFBB",
density = TRUE, # show density plots
ellipses = TRUE # show correlation ellipses
)
We do not see a linear assocation between any variables. Most variables are bimodal, or multimodal because they are categorial variables. We see that age is skewed to the left, so we build a histogram to take a closer look below.
par(mfrow=c(1,1))
hist(stroke$age, xlab="age", main = "")
Based on the histogram, we discretize age.
age = stroke$age
grp.age = age
grp.age[age %in% c(0.00:20.00)] = "0-20"
grp.age[age %in% c(21:24)] = "21-24"
grp.age[age %in% c(25:30)] = "25-30"
grp.age[age %in% c(31:40)] = "31-40"
grp.age[age %in% c(41:50)] = "41-50"
grp.age[age %in% c(51:99)] = "50 +"
## added to the stroke data set
stroke$grp.age = grp.age
stroke$heart_disease = as.factor(stroke$heart_disease)
stroke$hypertension = as.factor(stroke$hypertension)
According to WebMD, main factors associated with having a stroke is hypertension and heart disease. I am going to include these variables in my final model for practical reasons, regardless of their statistical significance.
I am going to build a model with all variables, and a small model with only hypertension and heart disease being the predictor variables. These two models will help me build my last model, which is an automatic selection model using forward step-wise regression which will start with all variables from my small model, and choose variables from my full model based on statistical significance.
full.model = glm(stroke~gender+grp.age+hypertension+heart_disease+ever_married+work_type+Residence_type+avg_glucose_level+bmi+smoking_status,
family = binomial(link = "logit"), # logit(p) = log(p/(1-p))!
data = stroke)
kable(summary(full.model)$coef,
caption="Summary of inferential statistics of the full model")
| Estimate | Std. Error | z value | Pr(>|z|) | |
|---|---|---|---|---|
| (Intercept) | -6.2178587 | 1.061345e+00 | -5.8584728 | 0.0000000 |
| genderMale | -0.0728082 | 1.516328e-01 | -0.4801611 | 0.6311128 |
| genderOther | -0.5942210 | 1.077026e+04 | -0.0000552 | 0.9999560 |
| grp.age21-24 | 0.0104641 | 9.558103e+02 | 0.0000109 | 0.9999913 |
| grp.age25-30 | 0.1043019 | 8.486763e+02 | 0.0001229 | 0.9999019 |
| grp.age31-40 | 14.9125675 | 6.087398e+02 | 0.0244974 | 0.9804558 |
| grp.age41-50 | 15.9398410 | 6.087397e+02 | 0.0261850 | 0.9791098 |
| grp.age50 + | 17.2267749 | 6.087397e+02 | 0.0282991 | 0.9774236 |
| hypertension1 | 0.6272358 | 1.718666e-01 | 3.6495493 | 0.0002627 |
| heart_disease1 | 0.6452723 | 2.012008e-01 | 3.2071066 | 0.0013408 |
| ever_marriedYes | -0.3057168 | 2.445262e-01 | -1.2502415 | 0.2112114 |
| work_typeGovt_job | -13.4838709 | 6.087405e+02 | -0.0221504 | 0.9823279 |
| work_typeNever_worked | -13.2717808 | 2.286195e+03 | -0.0058052 | 0.9953682 |
| work_typePrivate | -13.2311130 | 6.087404e+02 | -0.0217352 | 0.9826592 |
| work_typeSelf-employed | -13.3618875 | 6.087405e+02 | -0.0219501 | 0.9824878 |
| Residence_typeUrban | 0.0410236 | 1.473891e-01 | 0.2783355 | 0.7807548 |
| avg_glucose_level | 0.0055324 | 1.285500e-03 | 4.3038432 | 0.0000168 |
| bmi | -0.0187560 | 1.160620e-02 | -1.6160276 | 0.1060883 |
| smoking_statusnever smoked | -0.0665951 | 1.851887e-01 | -0.3596067 | 0.7191413 |
| smoking_statussmokes | 0.0844136 | 2.234039e-01 | 0.3778517 | 0.7055407 |
| smoking_statusUnknown | -0.2635420 | 2.439421e-01 | -1.0803465 | 0.2799879 |
reduced.model = glm(stroke ~ hypertension + heart_disease,
family = binomial(link = "logit"), # logit(p) = log(p/(1-p))!
data = stroke)
kable(summary(reduced.model)$coef,
caption="Summary of inferential statistics of the reduced model")
| Estimate | Std. Error | z value | Pr(>|z|) | |
|---|---|---|---|---|
| (Intercept) | -3.451502 | 0.0876580 | -39.374652 | 0 |
| hypertension1 | 1.307380 | 0.1666566 | 7.844756 | 0 |
| heart_disease1 | 1.407844 | 0.1970324 | 7.145241 | 0 |
library(MASS)
final.model.forward = stepAIC(reduced.model,
scope = list(lower=formula(reduced.model),upper=formula(full.model)),
direction = "forward", # forward selection
trace = 0 # do not show the details
)
kable(summary(final.model.forward)$coef,
caption="Summary of inferential statistics of the final model")
| Estimate | Std. Error | z value | Pr(>|z|) | |
|---|---|---|---|---|
| (Intercept) | -6.8909795 | 1.0339416 | -6.6647667 | 0.0000000 |
| hypertension1 | 0.6515714 | 0.1705278 | 3.8209100 | 0.0001330 |
| heart_disease1 | 0.6680755 | 0.1980582 | 3.3731278 | 0.0007432 |
| grp.age21-24 | -12.6870260 | 741.5158532 | -0.0171096 | 0.9863492 |
| grp.age25-30 | -12.6779784 | 592.3335413 | -0.0214034 | 0.9829238 |
| grp.age31-40 | 2.0147358 | 1.1009074 | 1.8300685 | 0.0672397 |
| grp.age41-50 | 2.9866882 | 1.0392570 | 2.8738687 | 0.0040548 |
| grp.age50 + | 4.2528703 | 1.0095812 | 4.2125095 | 0.0000253 |
| avg_glucose_level | 0.0054563 | 0.0012784 | 4.2680899 | 0.0000197 |
| bmi | -0.0187445 | 0.0115840 | -1.6181364 | 0.1056332 |
## Other global goodness-of-fit
global.measure=function(s.logit){
dev.resid = s.logit$deviance
dev.0.resid = s.logit$null.deviance
aic = s.logit$aic
goodness = cbind(Deviance.residual =dev.resid, Null.Deviance.Residual = dev.0.resid,
AIC = aic)
goodness
}
goodness=rbind(full.model = global.measure(full.model),
reduced.model=global.measure(reduced.model),
final.model=global.measure(final.model.forward))
row.names(goodness) = c("full.model", "reduced.model", "final.model")
kable(goodness, caption ="Comparison of global goodness-of-fit statistics")
| Deviance.residual | Null.Deviance.Residual | AIC | |
|---|---|---|---|
| full.model | 1409.708 | 1718.347 | 1451.708 |
| reduced.model | 1610.339 | 1718.347 | 1616.339 |
| final.model | 1415.833 | 1718.347 | 1435.833 |
Based on our AIC values above, we will be using the final.model as our final model. In practice, it is always best to use the model with the lowest AIC value.
# Odds ratio
model.coef.stats = summary(final.model.forward)$coef
odds.ratio = exp(coef(final.model.forward))
out.stats = cbind(model.coef.stats, odds.ratio = odds.ratio)
kable(out.stats,caption = "Summary Stats with Odds Ratios")
| Estimate | Std. Error | z value | Pr(>|z|) | odds.ratio | |
|---|---|---|---|---|---|
| (Intercept) | -6.8909795 | 1.0339416 | -6.6647667 | 0.0000000 | 0.0010169 |
| hypertension1 | 0.6515714 | 0.1705278 | 3.8209100 | 0.0001330 | 1.9185533 |
| heart_disease1 | 0.6680755 | 0.1980582 | 3.3731278 | 0.0007432 | 1.9504800 |
| grp.age21-24 | -12.6870260 | 741.5158532 | -0.0171096 | 0.9863492 | 0.0000031 |
| grp.age25-30 | -12.6779784 | 592.3335413 | -0.0214034 | 0.9829238 | 0.0000031 |
| grp.age31-40 | 2.0147358 | 1.1009074 | 1.8300685 | 0.0672397 | 7.4987462 |
| grp.age41-50 | 2.9866882 | 1.0392570 | 2.8738687 | 0.0040548 | 19.8199344 |
| grp.age50 + | 4.2528703 | 1.0095812 | 4.2125095 | 0.0000253 | 70.3069241 |
| avg_glucose_level | 0.0054563 | 0.0012784 | 4.2680899 | 0.0000197 | 1.0054712 |
| bmi | -0.0187445 | 0.0115840 | -1.6181364 | 0.1056332 | 0.9814301 |
The group-age variable grp.age has six categories. The baseline category is ages 0-20. We can see from the above table inferential table that the odds of having a stroke increase as age increases. For example, the odds ratio association with the age group 41-50 is 19.82. This means if you have high blood pressure, heart disease, and the same level of glucose and bmi, your odds of having a stroke is nearly 20 times higher than the base lined group age 0-20. The same ratio increases to 70 times higher than the base line group 0-20 when you are looking at those who are 50+ years old.
The case study focused on the association analysis between a set of potential risk factors for having a stroke. The initial data set has 12 numerical and categorical variables.
After exploratory analysis, we decide to re-group the discrete variable age, and then define dummy variables for the associated variable. This new variable was used in the analysis.
Since hypertension and heart disease are considered to be major contributors to the risk of having a stroke, we include these factors in the final model regardless of the statistical significance, although our analysis shows that are statistically significant.
After automatic variable selection, we obtain the final model with 5 factors, hypertension (1 dummy variable), heart disease (1 dummy variable), age (with 5 dummy variables), glucose, and BMI.
Our final conclusion from our data is that health factors are what lead to strokes. The odds ratio shows that having heart disease, high blood pressure(hypertension), and higher glucose lead to a stroke.