Stroke Prediction

Purpose

The purpose of this document is to:

Develop standard operating procedures (SOPs) for data analysis related projects and to create reproducible reporting. To that end the following steps were completed:

Cleaning and visualizing publicly available dataset.
Providing detailed distribution tables of variables of interest.
Developing logistic regression model to predict the occurrence of stroke in the current dataset and assess it.

Pakcages

library(tidyverse)
library(janitor)
library(gtsummary)
library(knitr)
library(kableExtra)
library(skimr)
library(survey)
library(tidymodels)
library(caret)
library(pROC)
library(ggrepel)

Data Input

set.seed(42)
df_input<- 
  read.csv("data/stroke.csv") %>% 
  clean_names()

Data Cleaning

Variable and data type

This tells me what kind of variables I have and what kind of classes they are. Usually for analysis I would like to see each categorical variables as a factor and numerical variables as a double.

glimpse(df_input)

## Rows: 5,110
## Columns: 12
## $ id                <int> 9046, 51676, 31112, 60182, 1665, 56669, 53882, 10434…
## $ gender            <chr> "Male", "Female", "Male", "Female", "Female", "Male"…
## $ age               <dbl> 67, 61, 80, 49, 79, 81, 74, 69, 59, 78, 81, 61, 54, …
## $ hypertension      <int> 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1…
## $ heart_disease     <int> 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0…
## $ ever_married      <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No…
## $ work_type         <chr> "Private", "Self-employed", "Private", "Private", "S…
## $ residence_type    <chr> "Urban", "Rural", "Rural", "Urban", "Rural", "Urban"…
## $ avg_glucose_level <dbl> 228.69, 202.21, 105.92, 171.23, 174.12, 186.21, 70.0…
## $ bmi               <chr> "36.6", "N/A", "32.5", "34.4", "24", "29", "27.4", "…
## $ smoking_status    <chr> "formerly smoked", "never smoked", "never smoked", "…
## $ stroke            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…

Exploratory data analysis (EDA)

I then would like to individually examine each of the variables. Additionally, I would usually like to categorize continuous variables when possible. For example for the current dataset I can break down age by decade, BMI as normal weight, over weight, obesity class I-III and glucose levels as normal, pre-diabetic and diabetic. Such categories would allow for more flexibility when performing EDA and data visualization.

attach(df_input)

df_clean<-
  df_input %>% 
  mutate(hypertension = 
           case_when( 
             hypertension == 1 ~ "Yes", 
             hypertension == 0 ~ "No"
             ),
         heart_disease = 
           case_when(
           heart_disease == 1 ~ "Yes", 
           heart_disease == 0 ~ "No"),
         work_type = 
           case_when(
             work_type %in% "children" ~ "Children", 
             work_type %in% "Govt_job" ~ "Government Job", 
             work_type %in% "Never_worked" ~ "Never Worked",
             TRUE ~ as.factor(work_type)
             ),
         stroke = 
           case_when(
             stroke== 1 ~ "Yes", 
             stroke == 0 ~ "No"),
         bmi = as.double(bmi), 
         smoking_status = str_to_title(smoking_status),
         age_group_decade = 
           case_when(
             age>=0 & age<10 ~ "0-10",
             age>=10 & age<20 ~ "10-20", 
             age>=20 & age<30 ~ "20-30", 
             age>=30 & age<40 ~ "30-40", 
             age>=40 & age<50 ~ "40-50", 
             age>=50 & age<60 ~ "50-60", 
             age>=60 & age<70 ~ "60-70",
             age>=70 & age<80 ~ "70-80", 
             age>=80 ~ "> 80"
           ),
         age_group_decade = factor(age_group_decade, levels = c("0-10","10-20","20-30", "30-40", "40-50", "50-60", "60-70", "70-80", "> 80")),
         age_group_50 = 
           case_when(age<=50 ~ "<= 50", 
                     age>50 ~ "> 50"), 
         bmi_class = case_when(
           bmi <18.5 ~ "Underweight",
           bmi >=18.5 & bmi <= 24.9 ~ "Normal",
           bmi >= 25.0 & bmi <= 29.9 ~ "Overweight",
           bmi >= 30.0 & bmi <= 34.9 ~ "Class I obese",
           bmi >=35 ~ "Class II-III obese"
         ),
         bmi_class = factor(bmi_class,levels=c("Underweight","Normal",
                                                 "Overweight",
                                               "Class I obese",
                                               "Class II-III obese")), 
         diabetes_status = 
           case_when(
             avg_glucose_level<= 99 ~ "Normal", 
             avg_glucose_level>99 & avg_glucose_level<= 125 ~ "Pre-diabetic", 
             avg_glucose_level> 125 ~ "Diabetic"), 
         diabetes_status = factor(diabetes_status, level = c("Normal", "Pre-diabetic", "Diabetic"))
         ) %>% 
  mutate_if(is.character, as.factor) %>% 
  mutate_if(is.numeric, as.double) 



df_clean %>% 
  skim_without_charts() %>%
  yank("numeric") %>% 
  adorn_rounding(digits = 2) %>%
  as.data.frame() %>% 
  kable(align = "c", caption = "Summary statistics of numerical variables") %>% 
  kable_classic(html_font = "Cambria") %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), )

Summary statistics of numerical variables
skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100
id	0	1.00	36517.83	21161.72	67.00	17741.25	36932.00	54682.00	72940.00
age	0	1.00	43.23	22.61	0.08	25.00	45.00	61.00	82.00
avg_glucose_level	0	1.00	106.15	45.28	55.12	77.24	91.88	114.09	271.74
bmi	201	0.96	28.89	7.85	10.30	23.50	28.10	33.10	97.60

df_clean %>% 
  skim() %>%
  yank("factor") %>% 
  adorn_rounding(digits = 2) %>%
  as.data.frame() %>% 
  kable(align = "c", caption = "Summary statistics of categorical variables") %>% 
  kable_classic(html_font = "Cambria") %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

Summary statistics of categorical variables
skim_variable	n_missing	complete_rate	ordered	n_unique	top_counts
gender	0	1.00	FALSE	3	Fem: 2994, Mal: 2115, Oth: 1
hypertension	0	1.00	FALSE	2	No: 4612, Yes: 498
heart_disease	0	1.00	FALSE	2	No: 4834, Yes: 276
ever_married	0	1.00	FALSE	2	Yes: 3353, No: 1757
work_type	0	1.00	FALSE	5	Pri: 2925, Sel: 819, Chi: 687, Gov: 657
residence_type	0	1.00	FALSE	2	Urb: 2596, Rur: 2514
smoking_status	0	1.00	FALSE	4	Nev: 1892, Unk: 1544, For: 885, Smo: 789
stroke	0	1.00	FALSE	2	No: 4861, Yes: 249
age_group_decade	0	1.00	FALSE	9	50-: 834, 40-: 730, 30-: 655, 60-: 621
age_group_50	0	1.00	FALSE	2	<= : 2983, > 5: 2127
bmi_class	201	0.96	FALSE	5	Ove: 1409, Nor: 1243, Cla: 1000, Cla: 920
diabetes_status	0	1.00	FALSE	3	Nor: 3071, Pre: 1039, Dia: 1000

Distribution table

This is going to help me figure out what are the characteristics of those patients who have had a history of stroke versus those who did not. I usually use this table as a reference. I additionally provide some visuals after for better understanding of the data.

tb1<-
  df_clean %>% 
  select(!c(id, age_group_decade)) %>%
  tbl_summary(by = stroke, 
              type = list(all_continuous() ~ "continuous2",
                         c(hypertension, heart_disease, ever_married) ~ "categorical"), 
               statistic = all_continuous() ~ c("{mean} ({sd})","{median} ({p25}, {p75})", "{min}, {max}"), 
              label = list(age ~ "Age (years)", 
                           gender ~ "Gender",
                           bmi ~ "BMI (kg/m2)", 
                           avg_glucose_level ~ "Glucose level (mg/dL)", 
                           hypertension ~ "Hypertension", 
                           heart_disease ~ "Heart disease", 
                           ever_married ~ "Ever married", 
                           work_type ~ "Work type", 
                           smoking_status ~ "Smoking status", 
                           residence_type ~ "Residence type", 
                           bmi_class ~ "BMI class", 
                           age_group_50 ~ "Age Group", 
                           diabetes_status ~ "Diabetes Status"
                           )
              )%>% 
  bold_labels() %>% 
  add_overall(last = T) %>% 
  modify_footnote(everything() ~ NA) %>%  
  modify_caption("Table of patient's Characteristics") %>% 
  modify_header(all_stat_cols() ~ "**{level}**\n N = {n} \n({style_percent(p)}%)") %>%
     as_kable_extra(
     booktabs = TRUE,
     longtable = TRUE) %>%
  add_header_above(c(" "= 1,"Stroke History" = 2, " "= 1), bold = T) %>%
  kable_classic(html_font = "Cambria") %>% 
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed")) %>% 
  row_spec(c(4, 8, 21,35,40, 45), bold = T, color = "white", background = "#D7261E") %>%
  scroll_box(height = "600px", width = "1000px") 
  
tb1

Table of patient’s Characteristics
	Stroke History
Characteristic	No N = 4861 (95%)	Yes N = 249 (4.9%)	Overall N = 5110 (100%)
Gender
Female	2,853 (59%)	141 (57%)	2,994 (59%)
Male	2,007 (41%)	108 (43%)	2,115 (41%)
Other	1 (<0.1%)	0 (0%)	1 (<0.1%)
Age (years)
Mean (SD)	42 (22)	68 (13)	43 (23)
Median (IQR)	43 (24, 59)	71 (59, 78)	45 (25, 61)
Range	0, 82	1, 82	0, 82
Hypertension
No	4,429 (91%)	183 (73%)	4,612 (90%)
Yes	432 (8.9%)	66 (27%)	498 (9.7%)
Heart disease
No	4,632 (95%)	202 (81%)	4,834 (95%)
Yes	229 (4.7%)	47 (19%)	276 (5.4%)
Ever married
No	1,728 (36%)	29 (12%)	1,757 (34%)
Yes	3,133 (64%)	220 (88%)	3,353 (66%)
Work type
Children	685 (14%)	2 (0.8%)	687 (13%)
Government Job	624 (13%)	33 (13%)	657 (13%)
Never Worked	22 (0.5%)	0 (0%)	22 (0.4%)
Private	2,776 (57%)	149 (60%)	2,925 (57%)
Self-employed	754 (16%)	65 (26%)	819 (16%)
Residence type
Rural	2,400 (49%)	114 (46%)	2,514 (49%)
Urban	2,461 (51%)	135 (54%)	2,596 (51%)
Glucose level (mg/dL)
Mean (SD)	105 (44)	133 (62)	106 (45)
Median (IQR)	91 (77, 113)	105 (80, 197)	92 (77, 114)
Range	55, 268	56, 272	55, 272
BMI (kg/m2)
Mean (SD)	29 (8)	30 (6)	29 (8)
Median (IQR)	28 (23, 33)	30 (26, 34)	28 (24, 33)
Range	10, 98	17, 57	10, 98
Unknown	161	40	201
Smoking status
Formerly Smoked	815 (17%)	70 (28%)	885 (17%)
Never Smoked	1,802 (37%)	90 (36%)	1,892 (37%)
Smokes	747 (15%)	42 (17%)	789 (15%)
Unknown	1,497 (31%)	47 (19%)	1,544 (30%)
Age Group
<= 50	2,960 (61%)	23 (9.2%)	2,983 (58%)
> 50	1,901 (39%)	226 (91%)	2,127 (42%)
BMI class
Underweight	336 (7.1%)	1 (0.5%)	337 (6.9%)
Normal	1,208 (26%)	35 (17%)	1,243 (25%)
Overweight	1,334 (28%)	75 (36%)	1,409 (29%)
Class I obese	944 (20%)	56 (27%)	1,000 (20%)
Class II-III obese	878 (19%)	42 (20%)	920 (19%)
Unknown	161	40	201
Diabetes Status
Normal	2,960 (61%)	111 (45%)	3,071 (60%)
Pre-diabetic	1,001 (21%)	38 (15%)	1,039 (20%)
Diabetic	900 (19%)	100 (40%)	1,000 (20%)

Having looked at the data, it suggest that there are several variables that may require further cleaning (denoted by red). For example:

We have only 1 person who classified their gender as “Other”
The age range is wide (1-82 years old)
There is only 22 people who responded work type as “Never Worked”
There is 201 missing values in BMI and 30% of folks reported “unknown” as their smoking status
One person who had stroke was underweight (it’s likely one of the infants in the dataset)

It is critical to recognize these and really narrow down our research question to have more powerful models. In this particular case I am going narrow my question down to assessing the risk of stroke in the following population:

Adults age 18 and older
BMI of normal range and higher
Those who are biologic male and female

Such changes would yield the following table:

df_main<-
df_clean %>% 
  filter(
    age>=18, 
    !bmi_class %in% "Underweight",
    !gender %in% "Other", 
    
  )

tb2<-
 df_main %>% 
  select(!c(id, age_group_decade)) %>%
  tbl_summary(by = stroke, 
              type = list(all_continuous() ~ "continuous2",
                         c(hypertension, heart_disease, ever_married) ~ "categorical"), 
               statistic = all_continuous() ~ c("{mean} ({sd})","{median} ({p25}, {p75})", "{min}, {max}"), 
              label = list(age ~ "Age (years)", 
                           gender ~ "Gender",
                           bmi ~ "BMI (kg/m2)", 
                           avg_glucose_level ~ "Glucose level (mg/dL)", 
                           hypertension ~ "Hypertension", 
                           heart_disease ~ "Heart disease", 
                           ever_married ~ "Ever married", 
                           work_type ~ "Work type", 
                           smoking_status ~ "Smoking status", 
                           residence_type ~ "Residence type", 
                           bmi_class ~ "BMI class", 
                           age_group_50 ~ "Age Group", 
                           diabetes_status ~ "Diabetes Status"
                           )
              )%>% 
  bold_labels() %>% 
  add_overall(last = T) %>% 
  modify_footnote(everything() ~ NA) %>%  
  modify_caption("Table of patient's Characteristics") %>% 
   modify_header(all_stat_cols() ~ "**{level}**\n N = {n} \n({style_percent(p)}%)") %>%
  as_kable_extra(
    booktabs = TRUE,
    longtable = TRUE) %>%
  add_header_above(c(" "= 1,"Stroke History" = 2, " "= 1), bold = T) %>%
 kable_classic(html_font = "Cambria") %>% 
   kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed")) %>% 
  row_spec(c(4, 8, 21,35,40, 45), bold = T, color = "white", background = "#D7261E") %>% 
  scroll_box(height = "600px", width = "1000px") 

tb2

Table of patient’s Characteristics
	Stroke History
Characteristic	No N = 3969 (94%)	Yes N = 246 (5.8%)	Overall N = 4215 (100%)
Gender
Female	2,406 (61%)	138 (56%)	2,544 (60%)
Male	1,563 (39%)	108 (44%)	1,671 (40%)
Other	0 (0%)	0 (0%)	0 (0%)
Age (years)
Mean (SD)	49 (18)	68 (12)	50 (18)
Median (IQR)	49 (35, 62)	71 (59, 78)	50 (36, 64)
Range	18, 82	32, 82	18, 82
Hypertension
No	3,540 (89%)	180 (73%)	3,720 (88%)
Yes	429 (11%)	66 (27%)	495 (12%)
Heart disease
No	3,741 (94%)	199 (81%)	3,940 (93%)
Yes	228 (5.7%)	47 (19%)	275 (6.5%)
Ever married
No	861 (22%)	27 (11%)	888 (21%)
Yes	3,108 (78%)	219 (89%)	3,327 (79%)
Work type
Children	0 (0%)	0 (0%)	0 (0%)
Government Job	615 (15%)	33 (13%)	648 (15%)
Never Worked	5 (0.1%)	0 (0%)	5 (0.1%)
Private	2,613 (66%)	149 (61%)	2,762 (66%)
Self-employed	736 (19%)	64 (26%)	800 (19%)
Residence type
Rural	1,953 (49%)	112 (46%)	2,065 (49%)
Urban	2,016 (51%)	134 (54%)	2,150 (51%)
Glucose level (mg/dL)
Mean (SD)	107 (46)	133 (62)	109 (48)
Median (IQR)	92 (77, 114)	106 (80, 197)	93 (77, 116)
Range	55, 268	56, 272	55, 272
BMI (kg/m2)
Mean (SD)	31 (7)	31 (6)	31 (7)
Median (IQR)	29 (26, 34)	30 (26, 34)	29 (26, 34)
Range	18, 92	19, 57	18, 92
Unknown	142	39	181
Smoking status
Formerly Smoked	781 (20%)	70 (28%)	851 (20%)
Never Smoked	1,644 (41%)	89 (36%)	1,733 (41%)
Smokes	734 (18%)	42 (17%)	776 (18%)
Unknown	810 (20%)	45 (18%)	855 (20%)
Age Group
<= 50	2,088 (53%)	21 (8.5%)	2,109 (50%)
> 50	1,881 (47%)	225 (91%)	2,106 (50%)
BMI class
Underweight	0 (0%)	0 (0%)	0 (0%)
Normal	829 (22%)	35 (17%)	864 (21%)
Overweight	1,244 (33%)	75 (36%)	1,319 (33%)
Class I obese	904 (24%)	55 (27%)	959 (24%)
Class II-III obese	850 (22%)	42 (20%)	892 (22%)
Unknown	142	39	181
Diabetes Status
Normal	2,384 (60%)	108 (44%)	2,492 (59%)
Pre-diabetic	784 (20%)	38 (15%)	822 (20%)
Diabetic	801 (20%)	100 (41%)	901 (21%)

After removing all variables we are in a good position except that there is only 5 people in the “Never Worked” category. This probably will not be powered enough for the model. Therefore I will remove the category entirely for better accuracy.

df_main_wo_everworked<-
df_main %>% 
  filter(!work_type %in% "Never Worked")

tb3<-
 df_main_wo_everworked %>% 
  select(!c(id, age_group_decade)) %>%
  tbl_summary(by = stroke, 
              type = list(all_continuous() ~ "continuous2",
                         c(hypertension, heart_disease, ever_married) ~ "categorical"), 
               statistic = all_continuous() ~ c("{mean} ({sd})","{median} ({p25}, {p75})", "{min}, {max}"), 
              label = list(age ~ "Age (years)", 
                           gender ~ "Gender",
                           bmi ~ "BMI (kg/m2)", 
                           avg_glucose_level ~ "Glucose level (mg/dL)", 
                           hypertension ~ "Hypertension", 
                           heart_disease ~ "Heart disease", 
                           ever_married ~ "Ever married", 
                           work_type ~ "Work type", 
                           smoking_status ~ "Smoking status", 
                           residence_type ~ "Residence type", 
                           bmi_class ~ "BMI class", 
                           age_group_50 ~ "Age Group", 
                           diabetes_status ~ "Diabetes Status"
                           )
              )%>% 
  bold_labels() %>% 
  add_overall(last = T) %>% 
  modify_footnote(everything() ~ NA) %>%  
  modify_caption("Table of patient's Characteristics") %>% 
   modify_header(all_stat_cols() ~ "**{level}**\n N = {n} \n({style_percent(p)}%)") %>%
  as_kable_extra(
    booktabs = TRUE,
    longtable = TRUE) %>%
    add_header_above(c(" "= 1,"Stroke History" = 2, " "= 1), bold = T) %>%
 kable_classic(html_font = "Cambria") %>% 
   kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed")) %>% 
  row_spec(c(4, 8, 21,35,40, 45), bold = T, color = "white", background = "#D7261E") %>% 
  scroll_box(height = "600px", width = "1000px") 

tb3

Table of patient’s Characteristics
	Stroke History
Characteristic	No N = 3964 (94%)	Yes N = 246 (5.8%)	Overall N = 4210 (100%)
Gender
Female	2,402 (61%)	138 (56%)	2,540 (60%)
Male	1,562 (39%)	108 (44%)	1,670 (40%)
Other	0 (0%)	0 (0%)	0 (0%)
Age (years)
Mean (SD)	49 (17)	68 (12)	50 (18)
Median (IQR)	49 (35, 62)	71 (59, 78)	51 (36, 64)
Range	18, 82	32, 82	18, 82
Hypertension
No	3,535 (89%)	180 (73%)	3,715 (88%)
Yes	429 (11%)	66 (27%)	495 (12%)
Heart disease
No	3,736 (94%)	199 (81%)	3,935 (93%)
Yes	228 (5.8%)	47 (19%)	275 (6.5%)
Ever married
No	856 (22%)	27 (11%)	883 (21%)
Yes	3,108 (78%)	219 (89%)	3,327 (79%)
Work type
Children	0 (0%)	0 (0%)	0 (0%)
Government Job	615 (16%)	33 (13%)	648 (15%)
Never Worked	0 (0%)	0 (0%)	0 (0%)
Private	2,613 (66%)	149 (61%)	2,762 (66%)
Self-employed	736 (19%)	64 (26%)	800 (19%)
Residence type
Rural	1,953 (49%)	112 (46%)	2,065 (49%)
Urban	2,011 (51%)	134 (54%)	2,145 (51%)
Glucose level (mg/dL)
Mean (SD)	107 (46)	133 (62)	109 (48)
Median (IQR)	92 (77, 114)	106 (80, 197)	93 (77, 116)
Range	55, 268	56, 272	55, 272
BMI (kg/m2)
Mean (SD)	31 (7)	31 (6)	31 (7)
Median (IQR)	29 (26, 34)	30 (26, 34)	29 (26, 34)
Range	18, 92	19, 57	18, 92
Unknown	142	39	181
Smoking status
Formerly Smoked	781 (20%)	70 (28%)	851 (20%)
Never Smoked	1,641 (41%)	89 (36%)	1,730 (41%)
Smokes	734 (19%)	42 (17%)	776 (18%)
Unknown	808 (20%)	45 (18%)	853 (20%)
Age Group
<= 50	2,083 (53%)	21 (8.5%)	2,104 (50%)
> 50	1,881 (47%)	225 (91%)	2,106 (50%)
BMI class
Underweight	0 (0%)	0 (0%)	0 (0%)
Normal	825 (22%)	35 (17%)	860 (21%)
Overweight	1,243 (33%)	75 (36%)	1,318 (33%)
Class I obese	904 (24%)	55 (27%)	959 (24%)
Class II-III obese	850 (22%)	42 (20%)	892 (22%)
Unknown	142	39	181
Diabetes Status
Normal	2,380 (60%)	108 (44%)	2,488 (59%)
Pre-diabetic	784 (20%)	38 (15%)	822 (20%)
Diabetic	800 (20%)	100 (41%)	900 (21%)

Imputing NA values

In the current dataset, I have found that 201 patients did not have a bmi calculated. There are indeed many ways in which one could impute the missing variables. In this dataset, I decided to keep it simple and calculate the mean and median of the rest of population and to use the mean bmi as the imputed value.

attach(df_main_wo_everworked)

# mean and bmi are so similar that we would used the mean in our data for imputation
bmi_mean_median<-
  df_main_wo_everworked %>% 
  filter(!is.na(bmi)) %>% 
  summarise(mean = mean(bmi), median = median(bmi)) %>% 
  round(2)

bmi_mean_median %>% 
  kable(align = "c", caption = "Mean and Median BMI of the non-missing values") %>% 
  kable_classic(html_font = "Cambria") %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

Mean and Median BMI of the non-missing values
mean	median
30.57	29.3

df_main_bmi_mean<-
  df_main_wo_everworked %>% 
  mutate(bmi = 
           case_when(
             bmi = is.na(bmi) ~ bmi_mean_median$mean,
             TRUE ~ as.double(bmi)
           ))

Data Visualization

Age & Gender vs Stroke

It appears that in those male and females who have had stroke, the bulk of the data trending toward the older ages in both males and females.

attach(df_main_bmi_mean)


df_main_bmi_mean %>% 
  ggplot(aes(x = gender , y = age, fill = gender))+ 
  geom_violin(trim = FALSE, show.legend = F, color = "black") + 
  geom_boxplot(width = 0.07, show.legend = F) + 
  scale_y_continuous(breaks = seq(0,100,10))+ 
  scale_fill_brewer()+
  facet_wrap(~stroke) +
  theme_bw(base_size = 12)+ 
  labs(title = "Relationship between history of stroke, age and gender", 
       x = "Gender", y = "Age (years)")+
  theme(panel.grid = element_blank(),
        axis.text = element_text(colour = "black"), 
        strip.background = element_rect(fill = "white"),
        strip.background.x = element_rect(size = 1)
        )

Comorbidities vs Stroke

Looking at the frequencies of the stroke it appears that the prevalence of stroke is higher in those with diabetes and heart disease. Also in those without diabetes the prevalence of stroke seem to be higher if they had hypertension.

attach(df_main_bmi_mean)


round(prop.table(table(stroke, heart_disease, hypertension, diabetes_status), c(4,3,2)),3) %>% 
  as.data.frame() %>% 
  rename("Stroke" = stroke, 
         "Hypertension" = hypertension) %>% 
  mutate(Freq = Freq*100) %>% 
  ggplot(aes(x = heart_disease, y = Freq, fill = Stroke)) + 
  geom_col(position = position_dodge(), color = "black", size = .75) + 
  geom_label(aes(label = Freq), show.legend = F, size = 3, nudge_y = -2 )+
  scale_fill_brewer()+ 
  facet_grid(Hypertension ~ diabetes_status, labeller = labeller(Hypertension = label_both, diabetes_status = label_value)) +
  labs(y = "Frequency (%)", 
       x = "Heart Disease", 
       title = "Relation between stroke, heart disease, hypertension and diabetes") +
  theme_minimal(base_size = 12) + 
  theme(
    axis.text = element_text(color = "black"), 
    strip.background = element_rect(colour = "black"), 
    strip.text = element_text(face = "bold"),
    panel.grid.major.x = element_blank(), 
    plot.title = element_text(hjust = 0.5)
  )

Social Factors vs Stroke

The distribution of social factors of those who experienced stroke are presented here. It appears that in those who have experienced stroke the majority of the population were working in a private setting, living urban area, married, and never smoked.

attach(df_main_bmi_mean)

stroke_yes <- 
  df_main_bmi_mean %>% 
  filter(stroke %in% "Yes") %>% 
  nrow()


fig_worktype<-
df_main_bmi_mean %>% 
  filter(stroke %in% "Yes") %>% 
  group_by(work_type) %>% 
  summarise(n = n(), 
            freq = n/stroke_yes
            ) %>% 
  mutate(ymax = cumsum(freq), 
         ymin = c(0, head(ymax, n = -1)),
         freq_perc = paste0(round(100*freq, 0), "%"),
         label = paste0(work_type, "\n ", n, " ", "(", freq_perc, ")"), 
         label_position = sort((ymax + ymin)/2)) %>% 
  ggplot(aes(ymin=ymin, ymax=ymax, xmax=3, xmin=2, fill=work_type)) +
  geom_rect() +
  coord_polar(theta="y") +
  geom_label(x=3.1, aes(y=label_position, label=label), color = "white",  size=3
             ) +
  scale_fill_manual(values = c("#3B6895", "#6495BF", "#A5CFE9")) + 
  xlim(c(1,3)) + 
  theme_void(base_size = 12) +
  labs(title = "Work Type and Storke")+ 
  theme(legend.position = "none", 
        plot.title = element_text(face = "bold", hjust = .5))

fig_residence<-
df_main_bmi_mean %>% 
  filter(stroke %in% "Yes") %>% 
  group_by(residence_type) %>% 
  summarise(n = n(), 
            freq = n/stroke_yes
            ) %>% 
  mutate(ymax = cumsum(freq), 
         ymin = c(0, head(ymax, n = -1)),
         freq_perc = paste0(round(100*freq, 0), "%"),
         label = paste0(residence_type, "\n ", n, " ", "(", freq_perc, ")"), 
         label_position = sort((ymax + ymin)/2)) %>% 
  ggplot(aes(ymin=ymin, ymax=ymax, xmax=3, xmin=2, fill=residence_type)) +
  geom_rect() +
  coord_polar(theta="y") +
  geom_label(x=2.5, aes(y=label_position, label=label), color = "white",  size=3
             ) +
  scale_fill_manual(values = c("#3B6895", "#6495BF", "#A5CFE9")) + 
  xlim(c(1,3)) + 
  theme_void(base_size = 12) +
  labs(title = "Residence Type and Storke")+ 
  theme(legend.position = "none", 
        plot.title = element_text(face = "bold", hjust = .5))



fig_marriage<-
df_main_bmi_mean %>% 
  filter(stroke %in% "Yes") %>% 
  group_by(ever_married ) %>% 
  summarise(n = n(), 
            freq = n/stroke_yes
            ) %>% 
  mutate(ymax = cumsum(freq), 
         ymin = c(0, head(ymax, n = -1)),
         freq_perc = paste0(round(100*freq, 0), "%"),
         label = paste0(ever_married , "\n ", n, " ", "(", freq_perc, ")"), 
         label_position = sort((ymax + ymin)/2)) %>% 
  ggplot(aes(ymin=ymin, ymax=ymax, xmax=3, xmin=2, fill=ever_married )) +
  geom_rect() +
  coord_polar(theta="y") +
  geom_label(x=2.5, aes(y=label_position, label=label), color = "white",  size=3
             ) +
  scale_fill_manual(values = c("#3B6895", "#6495BF", "#A5CFE9")) + 
  xlim(c(1,3)) + 
  theme_void(base_size = 12) +
  labs(title = "Marriage and Storke")+ 
  theme(legend.position = "none", 
        plot.title = element_text(face = "bold", hjust = .5))


fig_smoking<-
df_main_bmi_mean %>% 
  filter(stroke %in% "Yes") %>% 
  group_by(smoking_status  ) %>% 
  summarise(n = n(), 
            freq = n/stroke_yes
            ) %>% 
  mutate(ymax = cumsum(freq), 
         ymin = c(0, head(ymax, n = -1)),
         freq_perc = paste0(round(100*freq, 0), "%"),
         label = paste0(smoking_status  , "\n ", n, " ", "(", freq_perc, ")"), 
         label_position = sort((ymax + ymin)/2)) %>% 
  ggplot(aes(ymin=ymin, ymax=ymax, xmax=3, xmin=2, fill=smoking_status  )) +
  geom_rect() +
  coord_polar(theta="y") +
  geom_label(x=2.5, aes(y=label_position, label=label), color = "white",  size=3
             ) +
  scale_fill_manual(values = c("#3B6895", "#6495BF", "#7CADD2", "#B9DDF1")) + 
  xlim(c(1,3)) + 
  theme_void(base_size = 12) +
  labs(title = "Smoking and Storke")+ 
  theme(legend.position = "none", 
        plot.title = element_text(face = "bold", hjust = .5))


ggpubr::ggarrange(fig_worktype, fig_residence, fig_marriage, fig_smoking)

Models

Logistic Regression

Logistic Regression has 6 assumption in which we will check for here:

1. Having a response variable which is binary

In this case our binary dependent variable is stroke yes or no.

2. The observations being independent

This means that the observations should not come from repeated measures of the same individuals. We have checked for this already as we have unique ids with no repeated measures.

3. No Multicollineratity among the independent variables

When two or more explanatory variables are highly correlated to each other, it could create problems. When we think about regression coefficients, we often interpret them as while keeping all other coefficients constant. So when two or more highly correlated variables are present, when you change one, it is very likely that that the other would change as well. To find that one way is to find variance inflation factor VIF. Generally we would like to see a VIF of < 5 which suggest there is maybe a correlation but is not severe enough to cause problems.

attach(df_main_bmi_mean)



model_log<- glm(formula = stroke ~ gender + age + bmi + ever_married + work_type + residence_type +
                  smoking_status + avg_glucose_level + hypertension + heart_disease, 
                  data = df_main_bmi_mean , family = "binomial")

car::vif(model_log) %>%
  as.data.frame() %>% 
  rownames_to_column(var = "variable") %>% 
  clean_names() %>% 
  ggplot(aes(x = variable, y= gvif)) +
  geom_col(color = "black", fill = "#325F8C", width = 0.75, linewidth= .75) + 
  coord_flip() + 
  labs(title = "Variance infaltion factor", 
       x = "Variable", 
       y = "Variance inflaction factor") + 
  theme_bw(base_size = 12) + 
  theme(
    axis.text = element_text(color = "black"), 
    plot.title = element_text(hjust = 0.5)
  )

4. No extreme outliers

Outlier also known as influential observation can impact the model.

df_main_bmi_mean %>% 
  ggplot(aes(x = age)) + 
  geom_histogram(aes(y = ..density..), fill = "white", color = "black", size = 1, binwidth = 5) +
  geom_density(size = 1) + 
   theme_bw(base_size = 12) + 
  labs(x = "Age (years)",
       y = "Density", 
       title = "Histogram of distribution of age") +
  scale_x_continuous(breaks = seq(15,100,5)) +
  theme(
    axis.text = element_text(color = "black"), 
    panel.grid = element_blank(), 
    plot.title = element_text(hjust = 0.5)
  )

df_main_bmi_mean %>% 
  ggplot(aes(x = avg_glucose_level)) + 
  geom_histogram(aes(y = ..density..), fill = "white", color = "black", size = 1, binwidth =10) +
  geom_density(size = 1) + 
   theme_bw(base_size = 12) + 
  labs(x = "Average glucose level (mg/dL)", 
       y = "Density", 
       title = "Histogram of distribution of glucose levels"
       ) +
  scale_x_continuous(breaks = seq(55,300,10)) + 
  theme(
    axis.text = element_text(color = "black"), 
    panel.grid = element_blank(),
    plot.title = element_text(hjust = 0.5)
  )

df_main_bmi_mean %>% 
  ggplot(aes(x = bmi)) + 
  geom_histogram(aes(y = ..density..), fill = "white", color = "black", size = 1, binwidth = 4) +
  geom_density(size = 1) + 
  theme_minimal() +
   theme_bw(base_size = 12) + 
  labs(x = "BMI (kg/m2)", 
       y = "Density", 
       title = "Histogram of distribution of BMI") + 
  scale_x_continuous(breaks = seq(18,100,4)) +
  theme(
    axis.text = element_text(color = "black"), 
    panel.grid = element_blank() , 
    plot.title = element_text(hjust = 0.5)
  )

5. There is linear relationship between continuous explanatory variables and the logit of dependent variable

The relationship appear to be linear.

attach(df_main_bmi_mean)

model_logit <- glm(stroke ~ age + bmi + avg_glucose_level, family = "binomial", data = df_main_bmi_mean)

prob_logit <- predict(model_logit, type = "response")

logit<- log(prob_logit/(1-prob_logit))

df_main_bmi_mean %>% 
ggplot(aes(age, logit)) +
  geom_point() + 
  geom_smooth() +
  labs(x = "Age (years)", 
       y = "Logit", 
       title = "Logit of stroke vs age") +
   theme_minimal(base_size = 12) + 
  theme(
    axis.text = element_text(color = "black")
  )

df_main_bmi_mean %>% 
  ggplot(aes(bmi, logit)) + 
  geom_point() + 
  geom_smooth() + 
  labs(x = "BMI (kg/m2)", 
       y = "Logit", 
       title = "Logit of stroke vs BMI") +
   theme_minimal(base_size = 12) + 
  theme(
    axis.text = element_text(color = "black")
  )

df_main_bmi_mean %>% 
  ggplot(aes(avg_glucose_level, logit)) + 
  geom_point() + 
  geom_smooth() + 
  labs(x = "Average glucose leve (mg/dL)", 
       y = "Logit", 
       title = "Logit of stroke vs average glucose level") +
   theme_minimal(base_size = 12) + 
  theme(
    axis.text = element_text(color = "black")
  )

attach(df_main_bmi_mean)

df_main_bmi_mean$bmi_class <- factor(df_main_bmi_mean$bmi_class,levels=c("Normal","Underweight",
                                                 "Overweight",
                                               "Class I obese",
                                               "Class II-III obese"))

df_main_bmi_mean$smoking_status <- factor(df_main_bmi_mean$smoking_status, levels = c("Never Smoked", "Formerly Smoked", "Smokes", "Unknown"))

Overal model

attach(df_main_bmi_mean)

model_log_summary<- summary(model_log)

model_log_table<-
model_log_summary$coefficients %>%
  as.data.frame() %>% 
  bind_cols(confint(model_log)) %>%
  rownames_to_column(var = "variable") %>% 
  clean_names() %>% 
  select(variable, estimate, pr_z, 6,7) %>% 
  filter(!variable %in% "(Intercept)") %>% 
  mutate(across(.fns = exp, c(estimate, x2_5_percent, x97_5_percent)), 
          labels = c("Male", "Age", "BMI", "Married", "Work type private", "Work type self-emplyed", "Residence type urban",
                      "Smoking status formerly smoked", "Smoking status urrently smokes", "Smoking status unknown", "Glucose level", "Has hypertension",
                      "Has heart disease"
                     )
         ) %>% 
  adorn_rounding(digits = 2)

model_log_table %>% 
  ggplot(aes(y = labels)) + 
  geom_point(aes(x= estimate) , pch =22 , fill = "#5182AF", size = 3) + 
  geom_vline(xintercept = 1) + 
  geom_errorbarh(aes(xmin = x2_5_percent, xmax = x97_5_percent), width=0.4, alpha=0.9, size=.8) + 
  theme_minimal(base_size = 12) + 
  labs(x = "Expo (OR)", 
       y = "Variable")+
  theme(
    axis.text = element_text(color = "black")
  )

model_log_table %>% 
  mutate(ci = paste0("(",x2_5_percent, "-", x97_5_percent, ")")) %>% 
  select(labels, estimate, pr_z, ci) %>% 
  rename("Expo(OR)" = estimate, "p-value" = pr_z, "95% CI"= ci ) %>% 
  kable(align = "c", caption = "Logistic regression exponentiated odds ratios predicting stroke") %>%
  kable_classic(html_font = "Cambria") %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

Logistic regression exponentiated odds ratios predicting stroke
labels	Expo(OR)	p-value	95% CI
Male	1.03	0.83	(0.78-1.36)
Age	1.08	0.00	(1.07-1.09)
BMI	1.00	0.80	(0.98-1.03)
Married	0.82	0.37	(0.53-1.3)
Work type private	1.17	0.45	(0.79-1.77)
Work type self-emplyed	0.79	0.30	(0.5-1.25)
Residence type urban	1.10	0.51	(0.83-1.44)
Smoking status formerly smoked	0.81	0.23	(0.57-1.15)
Smoking status urrently smokes	1.12	0.60	(0.73-1.7)
Smoking status unknown	0.93	0.72	(0.61-1.39)
Glucose level	1.00	0.00	(1-1.01)
Has hypertension	1.50	0.01	(1.08-2.06)
Has heart disease	1.31	0.16	(0.89-1.89)

Model performance

Overall age appear to have highest AUC and better predictor.

attach(df_main_bmi_mean)

roc_age<- 
roc(stroke, as.numeric(age)) 

roc_gluc<- 
  roc(stroke, avg_glucose_level)

roc_htn<- 
  roc(stroke, as.numeric(hypertension))


roc_age %>% 
ggroc(size = 1, legacy.axes = TRUE) +
  theme_bw(base_size = 12) + 
  geom_segment(aes(x = 0, xend = 1, y = 0, yend = 1),
                 color="darkgrey", linetype="dashed") +
  labs(title = "ROC curve for age") + 
  annotate("text", x = .5, y = .3, label = paste0("AUC:", " ", round(roc_age$auc,3)), size =6) +
  theme(
    axis.text = element_text(color = "black"), 
    plot.title =element_text(hjust = 0.5)
  )

roc_gluc %>% 
ggroc(size = 1, legacy.axes = TRUE) +
  theme_bw(base_size = 12) + 
  geom_segment(aes(x = 0, xend = 1, y = 0, yend = 1),
                 color="darkgrey", linetype="dashed") +
  labs(title = "ROC curve for glucose levels") + 
  annotate("text", x = .5, y = .3, label = paste0("AUC:", " ", round(roc_gluc$auc,3)), size =6) +
  theme(
    axis.text = element_text(color = "black"), 
    plot.title =element_text(hjust = 0.5)
  )

roc_htn %>% 
ggroc(size = 1, legacy.axes = TRUE) +
  theme_bw(base_size = 12) + 
  geom_segment(aes(x = 0, xend = 1, y = 0, yend = 1),
                 color="darkgrey", linetype="dashed") +
  annotate("text", x = .5, y = .3, label = paste0("AUC:", " ", round(roc_htn$auc,3)), size =6) +
  labs(title = "ROC curve for hypertension") +
  theme(
    axis.text = element_text(color = "black"), 
     plot.title =element_text(hjust = 0.5)
  )

Summary

The current data suggests that age, history of hypertension and average glucose levels are the best predictors of strokes in adults. Particularly age appears to be the better predictor of stroke with AUC 0.806.

The clinical interpretation of such results would depend on the setting in which data was collected and subsequently would be applied.