Purpose

The purpose of this document is to:

Develop standard operating procedures (SOPs) for data analysis related projects and to create reproducible reporting. To that end the following steps were completed:

  1. Cleaning and visualizing publicly available dataset.

  2. Providing detailed distribution tables of variables of interest.

  3. Developing logistic regression model to predict the occurrence of stroke in the current dataset and assess it.

Pakcages

library(tidyverse)
library(janitor)
library(gtsummary)
library(knitr)
library(kableExtra)
library(skimr)
library(survey)
library(tidymodels)
library(caret)
library(pROC)
library(ggrepel)

Data Input

set.seed(42)
df_input<- 
  read.csv("data/stroke.csv") %>% 
  clean_names()

Data Cleaning

Variable and data type

This tells me what kind of variables I have and what kind of classes they are. Usually for analysis I would like to see each categorical variables as a factor and numerical variables as a double.

glimpse(df_input) 
## Rows: 5,110
## Columns: 12
## $ id                <int> 9046, 51676, 31112, 60182, 1665, 56669, 53882, 10434…
## $ gender            <chr> "Male", "Female", "Male", "Female", "Female", "Male"…
## $ age               <dbl> 67, 61, 80, 49, 79, 81, 74, 69, 59, 78, 81, 61, 54, …
## $ hypertension      <int> 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1…
## $ heart_disease     <int> 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0…
## $ ever_married      <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No…
## $ work_type         <chr> "Private", "Self-employed", "Private", "Private", "S…
## $ residence_type    <chr> "Urban", "Rural", "Rural", "Urban", "Rural", "Urban"…
## $ avg_glucose_level <dbl> 228.69, 202.21, 105.92, 171.23, 174.12, 186.21, 70.0…
## $ bmi               <chr> "36.6", "N/A", "32.5", "34.4", "24", "29", "27.4", "…
## $ smoking_status    <chr> "formerly smoked", "never smoked", "never smoked", "…
## $ stroke            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…

Exploratory data analysis (EDA)

I then would like to individually examine each of the variables. Additionally, I would usually like to categorize continuous variables when possible. For example for the current dataset I can break down age by decade, BMI as normal weight, over weight, obesity class I-III and glucose levels as normal, pre-diabetic and diabetic. Such categories would allow for more flexibility when performing EDA and data visualization.

attach(df_input)

df_clean<-
  df_input %>% 
  mutate(hypertension = 
           case_when( 
             hypertension == 1 ~ "Yes", 
             hypertension == 0 ~ "No"
             ),
         heart_disease = 
           case_when(
           heart_disease == 1 ~ "Yes", 
           heart_disease == 0 ~ "No"),
         work_type = 
           case_when(
             work_type %in% "children" ~ "Children", 
             work_type %in% "Govt_job" ~ "Government Job", 
             work_type %in% "Never_worked" ~ "Never Worked",
             TRUE ~ as.factor(work_type)
             ),
         stroke = 
           case_when(
             stroke== 1 ~ "Yes", 
             stroke == 0 ~ "No"),
         bmi = as.double(bmi), 
         smoking_status = str_to_title(smoking_status),
         age_group_decade = 
           case_when(
             age>=0 & age<10 ~ "0-10",
             age>=10 & age<20 ~ "10-20", 
             age>=20 & age<30 ~ "20-30", 
             age>=30 & age<40 ~ "30-40", 
             age>=40 & age<50 ~ "40-50", 
             age>=50 & age<60 ~ "50-60", 
             age>=60 & age<70 ~ "60-70",
             age>=70 & age<80 ~ "70-80", 
             age>=80 ~ "> 80"
           ),
         age_group_decade = factor(age_group_decade, levels = c("0-10","10-20","20-30", "30-40", "40-50", "50-60", "60-70", "70-80", "> 80")),
         age_group_50 = 
           case_when(age<=50 ~ "<= 50", 
                     age>50 ~ "> 50"), 
         bmi_class = case_when(
           bmi <18.5 ~ "Underweight",
           bmi >=18.5 & bmi <= 24.9 ~ "Normal",
           bmi >= 25.0 & bmi <= 29.9 ~ "Overweight",
           bmi >= 30.0 & bmi <= 34.9 ~ "Class I obese",
           bmi >=35 ~ "Class II-III obese"
         ),
         bmi_class = factor(bmi_class,levels=c("Underweight","Normal",
                                                 "Overweight",
                                               "Class I obese",
                                               "Class II-III obese")), 
         diabetes_status = 
           case_when(
             avg_glucose_level<= 99 ~ "Normal", 
             avg_glucose_level>99 & avg_glucose_level<= 125 ~ "Pre-diabetic", 
             avg_glucose_level> 125 ~ "Diabetic"), 
         diabetes_status = factor(diabetes_status, level = c("Normal", "Pre-diabetic", "Diabetic"))
         ) %>% 
  mutate_if(is.character, as.factor) %>% 
  mutate_if(is.numeric, as.double) 



df_clean %>% 
  skim_without_charts() %>%
  yank("numeric") %>% 
  adorn_rounding(digits = 2) %>%
  as.data.frame() %>% 
  kable(align = "c", caption = "Summary statistics of numerical variables") %>% 
  kable_classic(html_font = "Cambria") %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), ) 
Summary statistics of numerical variables
skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100
id 0 1.00 36517.83 21161.72 67.00 17741.25 36932.00 54682.00 72940.00
age 0 1.00 43.23 22.61 0.08 25.00 45.00 61.00 82.00
avg_glucose_level 0 1.00 106.15 45.28 55.12 77.24 91.88 114.09 271.74
bmi 201 0.96 28.89 7.85 10.30 23.50 28.10 33.10 97.60
df_clean %>% 
  skim() %>%
  yank("factor") %>% 
  adorn_rounding(digits = 2) %>%
  as.data.frame() %>% 
  kable(align = "c", caption = "Summary statistics of categorical variables") %>% 
  kable_classic(html_font = "Cambria") %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed")) 
Summary statistics of categorical variables
skim_variable n_missing complete_rate ordered n_unique top_counts
gender 0 1.00 FALSE 3 Fem: 2994, Mal: 2115, Oth: 1
hypertension 0 1.00 FALSE 2 No: 4612, Yes: 498
heart_disease 0 1.00 FALSE 2 No: 4834, Yes: 276
ever_married 0 1.00 FALSE 2 Yes: 3353, No: 1757
work_type 0 1.00 FALSE 5 Pri: 2925, Sel: 819, Chi: 687, Gov: 657
residence_type 0 1.00 FALSE 2 Urb: 2596, Rur: 2514
smoking_status 0 1.00 FALSE 4 Nev: 1892, Unk: 1544, For: 885, Smo: 789
stroke 0 1.00 FALSE 2 No: 4861, Yes: 249
age_group_decade 0 1.00 FALSE 9 50-: 834, 40-: 730, 30-: 655, 60-: 621
age_group_50 0 1.00 FALSE 2 <= : 2983, > 5: 2127
bmi_class 201 0.96 FALSE 5 Ove: 1409, Nor: 1243, Cla: 1000, Cla: 920
diabetes_status 0 1.00 FALSE 3 Nor: 3071, Pre: 1039, Dia: 1000

Distribution table

This is going to help me figure out what are the characteristics of those patients who have had a history of stroke versus those who did not. I usually use this table as a reference. I additionally provide some visuals after for better understanding of the data.

tb1<-
  df_clean %>% 
  select(!c(id, age_group_decade)) %>%
  tbl_summary(by = stroke, 
              type = list(all_continuous() ~ "continuous2",
                         c(hypertension, heart_disease, ever_married) ~ "categorical"), 
               statistic = all_continuous() ~ c("{mean} ({sd})","{median} ({p25}, {p75})", "{min}, {max}"), 
              label = list(age ~ "Age (years)", 
                           gender ~ "Gender",
                           bmi ~ "BMI (kg/m2)", 
                           avg_glucose_level ~ "Glucose level (mg/dL)", 
                           hypertension ~ "Hypertension", 
                           heart_disease ~ "Heart disease", 
                           ever_married ~ "Ever married", 
                           work_type ~ "Work type", 
                           smoking_status ~ "Smoking status", 
                           residence_type ~ "Residence type", 
                           bmi_class ~ "BMI class", 
                           age_group_50 ~ "Age Group", 
                           diabetes_status ~ "Diabetes Status"
                           )
              )%>% 
  bold_labels() %>% 
  add_overall(last = T) %>% 
  modify_footnote(everything() ~ NA) %>%  
  modify_caption("Table of patient's Characteristics") %>% 
  modify_header(all_stat_cols() ~ "**{level}**\n N = {n} \n({style_percent(p)}%)") %>%
     as_kable_extra(
     booktabs = TRUE,
     longtable = TRUE) %>%
  add_header_above(c(" "= 1,"Stroke History" = 2, " "= 1), bold = T) %>%
  kable_classic(html_font = "Cambria") %>% 
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed")) %>% 
  row_spec(c(4, 8, 21,35,40, 45), bold = T, color = "white", background = "#D7261E") %>%
  scroll_box(height = "600px", width = "1000px") 
  
tb1 
Table of patient’s Characteristics
Stroke History
Characteristic No
N = 4861
(95%)
Yes
N = 249
(4.9%)
Overall
N = 5110
(100%)
Gender
Female 2,853 (59%) 141 (57%) 2,994 (59%)
Male 2,007 (41%) 108 (43%) 2,115 (41%)
Other 1 (<0.1%) 0 (0%) 1 (<0.1%)
Age (years)
Mean (SD) 42 (22) 68 (13) 43 (23)
Median (IQR) 43 (24, 59) 71 (59, 78) 45 (25, 61)
Range 0, 82 1, 82 0, 82
Hypertension
No 4,429 (91%) 183 (73%) 4,612 (90%)
Yes 432 (8.9%) 66 (27%) 498 (9.7%)
Heart disease
No 4,632 (95%) 202 (81%) 4,834 (95%)
Yes 229 (4.7%) 47 (19%) 276 (5.4%)
Ever married
No 1,728 (36%) 29 (12%) 1,757 (34%)
Yes 3,133 (64%) 220 (88%) 3,353 (66%)
Work type
Children 685 (14%) 2 (0.8%) 687 (13%)
Government Job 624 (13%) 33 (13%) 657 (13%)
Never Worked 22 (0.5%) 0 (0%) 22 (0.4%)
Private 2,776 (57%) 149 (60%) 2,925 (57%)
Self-employed 754 (16%) 65 (26%) 819 (16%)
Residence type
Rural 2,400 (49%) 114 (46%) 2,514 (49%)
Urban 2,461 (51%) 135 (54%) 2,596 (51%)
Glucose level (mg/dL)
Mean (SD) 105 (44) 133 (62) 106 (45)
Median (IQR) 91 (77, 113) 105 (80, 197) 92 (77, 114)
Range 55, 268 56, 272 55, 272
BMI (kg/m2)
Mean (SD) 29 (8) 30 (6) 29 (8)
Median (IQR) 28 (23, 33) 30 (26, 34) 28 (24, 33)
Range 10, 98 17, 57 10, 98
Unknown 161 40 201
Smoking status
Formerly Smoked 815 (17%) 70 (28%) 885 (17%)
Never Smoked 1,802 (37%) 90 (36%) 1,892 (37%)
Smokes 747 (15%) 42 (17%) 789 (15%)
Unknown 1,497 (31%) 47 (19%) 1,544 (30%)
Age Group
<= 50 2,960 (61%) 23 (9.2%) 2,983 (58%)
> 50 1,901 (39%) 226 (91%) 2,127 (42%)
BMI class
Underweight 336 (7.1%) 1 (0.5%) 337 (6.9%)
Normal 1,208 (26%) 35 (17%) 1,243 (25%)
Overweight 1,334 (28%) 75 (36%) 1,409 (29%)
Class I obese 944 (20%) 56 (27%) 1,000 (20%)
Class II-III obese 878 (19%) 42 (20%) 920 (19%)
Unknown 161 40 201
Diabetes Status
Normal 2,960 (61%) 111 (45%) 3,071 (60%)
Pre-diabetic 1,001 (21%) 38 (15%) 1,039 (20%)
Diabetic 900 (19%) 100 (40%) 1,000 (20%)

 

Having looked at the data, it suggest that there are several variables that may require further cleaning (denoted by red). For example:

  • We have only 1 person who classified their gender as “Other”

  • The age range is wide (1-82 years old)

  • There is only 22 people who responded work type as “Never Worked”

  • There is 201 missing values in BMI and 30% of folks reported “unknown” as their smoking status

  • One person who had stroke was underweight (it’s likely one of the infants in the dataset)

It is critical to recognize these and really narrow down our research question to have more powerful models. In this particular case I am going narrow my question down to assessing the risk of stroke in the following population:

  • Adults age 18 and older

  • BMI of normal range and higher

  • Those who are biologic male and female

Such changes would yield the following table:

df_main<-
df_clean %>% 
  filter(
    age>=18, 
    !bmi_class %in% "Underweight",
    !gender %in% "Other", 
    
  )

tb2<-
 df_main %>% 
  select(!c(id, age_group_decade)) %>%
  tbl_summary(by = stroke, 
              type = list(all_continuous() ~ "continuous2",
                         c(hypertension, heart_disease, ever_married) ~ "categorical"), 
               statistic = all_continuous() ~ c("{mean} ({sd})","{median} ({p25}, {p75})", "{min}, {max}"), 
              label = list(age ~ "Age (years)", 
                           gender ~ "Gender",
                           bmi ~ "BMI (kg/m2)", 
                           avg_glucose_level ~ "Glucose level (mg/dL)", 
                           hypertension ~ "Hypertension", 
                           heart_disease ~ "Heart disease", 
                           ever_married ~ "Ever married", 
                           work_type ~ "Work type", 
                           smoking_status ~ "Smoking status", 
                           residence_type ~ "Residence type", 
                           bmi_class ~ "BMI class", 
                           age_group_50 ~ "Age Group", 
                           diabetes_status ~ "Diabetes Status"
                           )
              )%>% 
  bold_labels() %>% 
  add_overall(last = T) %>% 
  modify_footnote(everything() ~ NA) %>%  
  modify_caption("Table of patient's Characteristics") %>% 
   modify_header(all_stat_cols() ~ "**{level}**\n N = {n} \n({style_percent(p)}%)") %>%
  as_kable_extra(
    booktabs = TRUE,
    longtable = TRUE) %>%
  add_header_above(c(" "= 1,"Stroke History" = 2, " "= 1), bold = T) %>%
 kable_classic(html_font = "Cambria") %>% 
   kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed")) %>% 
  row_spec(c(4, 8, 21,35,40, 45), bold = T, color = "white", background = "#D7261E") %>% 
  scroll_box(height = "600px", width = "1000px") 

tb2
Table of patient’s Characteristics
Stroke History
Characteristic No
N = 3969
(94%)
Yes
N = 246
(5.8%)
Overall
N = 4215
(100%)
Gender
Female 2,406 (61%) 138 (56%) 2,544 (60%)
Male 1,563 (39%) 108 (44%) 1,671 (40%)
Other 0 (0%) 0 (0%) 0 (0%)
Age (years)
Mean (SD) 49 (18) 68 (12) 50 (18)
Median (IQR) 49 (35, 62) 71 (59, 78) 50 (36, 64)
Range 18, 82 32, 82 18, 82
Hypertension
No 3,540 (89%) 180 (73%) 3,720 (88%)
Yes 429 (11%) 66 (27%) 495 (12%)
Heart disease
No 3,741 (94%) 199 (81%) 3,940 (93%)
Yes 228 (5.7%) 47 (19%) 275 (6.5%)
Ever married
No 861 (22%) 27 (11%) 888 (21%)
Yes 3,108 (78%) 219 (89%) 3,327 (79%)
Work type
Children 0 (0%) 0 (0%) 0 (0%)
Government Job 615 (15%) 33 (13%) 648 (15%)
Never Worked 5 (0.1%) 0 (0%) 5 (0.1%)
Private 2,613 (66%) 149 (61%) 2,762 (66%)
Self-employed 736 (19%) 64 (26%) 800 (19%)
Residence type
Rural 1,953 (49%) 112 (46%) 2,065 (49%)
Urban 2,016 (51%) 134 (54%) 2,150 (51%)
Glucose level (mg/dL)
Mean (SD) 107 (46) 133 (62) 109 (48)
Median (IQR) 92 (77, 114) 106 (80, 197) 93 (77, 116)
Range 55, 268 56, 272 55, 272
BMI (kg/m2)
Mean (SD) 31 (7) 31 (6) 31 (7)
Median (IQR) 29 (26, 34) 30 (26, 34) 29 (26, 34)
Range 18, 92 19, 57 18, 92
Unknown 142 39 181
Smoking status
Formerly Smoked 781 (20%) 70 (28%) 851 (20%)
Never Smoked 1,644 (41%) 89 (36%) 1,733 (41%)
Smokes 734 (18%) 42 (17%) 776 (18%)
Unknown 810 (20%) 45 (18%) 855 (20%)
Age Group
<= 50 2,088 (53%) 21 (8.5%) 2,109 (50%)
> 50 1,881 (47%) 225 (91%) 2,106 (50%)
BMI class
Underweight 0 (0%) 0 (0%) 0 (0%)
Normal 829 (22%) 35 (17%) 864 (21%)
Overweight 1,244 (33%) 75 (36%) 1,319 (33%)
Class I obese 904 (24%) 55 (27%) 959 (24%)
Class II-III obese 850 (22%) 42 (20%) 892 (22%)
Unknown 142 39 181
Diabetes Status
Normal 2,384 (60%) 108 (44%) 2,492 (59%)
Pre-diabetic 784 (20%) 38 (15%) 822 (20%)
Diabetic 801 (20%) 100 (41%) 901 (21%)

 

After removing all variables we are in a good position except that there is only 5 people in the “Never Worked” category. This probably will not be powered enough for the model. Therefore I will remove the category entirely for better accuracy.

df_main_wo_everworked<-
df_main %>% 
  filter(!work_type %in% "Never Worked")

tb3<-
 df_main_wo_everworked %>% 
  select(!c(id, age_group_decade)) %>%
  tbl_summary(by = stroke, 
              type = list(all_continuous() ~ "continuous2",
                         c(hypertension, heart_disease, ever_married) ~ "categorical"), 
               statistic = all_continuous() ~ c("{mean} ({sd})","{median} ({p25}, {p75})", "{min}, {max}"), 
              label = list(age ~ "Age (years)", 
                           gender ~ "Gender",
                           bmi ~ "BMI (kg/m2)", 
                           avg_glucose_level ~ "Glucose level (mg/dL)", 
                           hypertension ~ "Hypertension", 
                           heart_disease ~ "Heart disease", 
                           ever_married ~ "Ever married", 
                           work_type ~ "Work type", 
                           smoking_status ~ "Smoking status", 
                           residence_type ~ "Residence type", 
                           bmi_class ~ "BMI class", 
                           age_group_50 ~ "Age Group", 
                           diabetes_status ~ "Diabetes Status"
                           )
              )%>% 
  bold_labels() %>% 
  add_overall(last = T) %>% 
  modify_footnote(everything() ~ NA) %>%  
  modify_caption("Table of patient's Characteristics") %>% 
   modify_header(all_stat_cols() ~ "**{level}**\n N = {n} \n({style_percent(p)}%)") %>%
  as_kable_extra(
    booktabs = TRUE,
    longtable = TRUE) %>%
    add_header_above(c(" "= 1,"Stroke History" = 2, " "= 1), bold = T) %>%
 kable_classic(html_font = "Cambria") %>% 
   kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed")) %>% 
  row_spec(c(4, 8, 21,35,40, 45), bold = T, color = "white", background = "#D7261E") %>% 
  scroll_box(height = "600px", width = "1000px") 

tb3
Table of patient’s Characteristics
Stroke History
Characteristic No
N = 3964
(94%)
Yes
N = 246
(5.8%)
Overall
N = 4210
(100%)
Gender
Female 2,402 (61%) 138 (56%) 2,540 (60%)
Male 1,562 (39%) 108 (44%) 1,670 (40%)
Other 0 (0%) 0 (0%) 0 (0%)
Age (years)
Mean (SD) 49 (17) 68 (12) 50 (18)
Median (IQR) 49 (35, 62) 71 (59, 78) 51 (36, 64)
Range 18, 82 32, 82 18, 82
Hypertension
No 3,535 (89%) 180 (73%) 3,715 (88%)
Yes 429 (11%) 66 (27%) 495 (12%)
Heart disease
No 3,736 (94%) 199 (81%) 3,935 (93%)
Yes 228 (5.8%) 47 (19%) 275 (6.5%)
Ever married
No 856 (22%) 27 (11%) 883 (21%)
Yes 3,108 (78%) 219 (89%) 3,327 (79%)
Work type
Children 0 (0%) 0 (0%) 0 (0%)
Government Job 615 (16%) 33 (13%) 648 (15%)
Never Worked 0 (0%) 0 (0%) 0 (0%)
Private 2,613 (66%) 149 (61%) 2,762 (66%)
Self-employed 736 (19%) 64 (26%) 800 (19%)
Residence type
Rural 1,953 (49%) 112 (46%) 2,065 (49%)
Urban 2,011 (51%) 134 (54%) 2,145 (51%)
Glucose level (mg/dL)
Mean (SD) 107 (46) 133 (62) 109 (48)
Median (IQR) 92 (77, 114) 106 (80, 197) 93 (77, 116)
Range 55, 268 56, 272 55, 272
BMI (kg/m2)
Mean (SD) 31 (7) 31 (6) 31 (7)
Median (IQR) 29 (26, 34) 30 (26, 34) 29 (26, 34)
Range 18, 92 19, 57 18, 92
Unknown 142 39 181
Smoking status
Formerly Smoked 781 (20%) 70 (28%) 851 (20%)
Never Smoked 1,641 (41%) 89 (36%) 1,730 (41%)
Smokes 734 (19%) 42 (17%) 776 (18%)
Unknown 808 (20%) 45 (18%) 853 (20%)
Age Group
<= 50 2,083 (53%) 21 (8.5%) 2,104 (50%)
> 50 1,881 (47%) 225 (91%) 2,106 (50%)
BMI class
Underweight 0 (0%) 0 (0%) 0 (0%)
Normal 825 (22%) 35 (17%) 860 (21%)
Overweight 1,243 (33%) 75 (36%) 1,318 (33%)
Class I obese 904 (24%) 55 (27%) 959 (24%)
Class II-III obese 850 (22%) 42 (20%) 892 (22%)
Unknown 142 39 181
Diabetes Status
Normal 2,380 (60%) 108 (44%) 2,488 (59%)
Pre-diabetic 784 (20%) 38 (15%) 822 (20%)
Diabetic 800 (20%) 100 (41%) 900 (21%)

Imputing NA values

In the current dataset, I have found that 201 patients did not have a bmi calculated. There are indeed many ways in which one could impute the missing variables. In this dataset, I decided to keep it simple and calculate the mean and median of the rest of population and to use the mean bmi as the imputed value.

attach(df_main_wo_everworked)

# mean and bmi are so similar that we would used the mean in our data for imputation
bmi_mean_median<-
  df_main_wo_everworked %>% 
  filter(!is.na(bmi)) %>% 
  summarise(mean = mean(bmi), median = median(bmi)) %>% 
  round(2)

bmi_mean_median %>% 
  kable(align = "c", caption = "Mean and Median BMI of the non-missing values") %>% 
  kable_classic(html_font = "Cambria") %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Mean and Median BMI of the non-missing values
mean median
30.57 29.3
df_main_bmi_mean<-
  df_main_wo_everworked %>% 
  mutate(bmi = 
           case_when(
             bmi = is.na(bmi) ~ bmi_mean_median$mean,
             TRUE ~ as.double(bmi)
           )) 

Data Visualization

Age & Gender vs Stroke

It appears that in those male and females who have had stroke, the bulk of the data trending toward the older ages in both males and females.

attach(df_main_bmi_mean)


df_main_bmi_mean %>% 
  ggplot(aes(x = gender , y = age, fill = gender))+ 
  geom_violin(trim = FALSE, show.legend = F, color = "black") + 
  geom_boxplot(width = 0.07, show.legend = F) + 
  scale_y_continuous(breaks = seq(0,100,10))+ 
  scale_fill_brewer()+
  facet_wrap(~stroke) +
  theme_bw(base_size = 12)+ 
  labs(title = "Relationship between history of stroke, age and gender", 
       x = "Gender", y = "Age (years)")+
  theme(panel.grid = element_blank(),
        axis.text = element_text(colour = "black"), 
        strip.background = element_rect(fill = "white"),
        strip.background.x = element_rect(size = 1)
        )

Comorbidities vs Stroke

Looking at the frequencies of the stroke it appears that the prevalence of stroke is higher in those with diabetes and heart disease. Also in those without diabetes the prevalence of stroke seem to be higher if they had hypertension.

attach(df_main_bmi_mean)


round(prop.table(table(stroke, heart_disease, hypertension, diabetes_status), c(4,3,2)),3) %>% 
  as.data.frame() %>% 
  rename("Stroke" = stroke, 
         "Hypertension" = hypertension) %>% 
  mutate(Freq = Freq*100) %>% 
  ggplot(aes(x = heart_disease, y = Freq, fill = Stroke)) + 
  geom_col(position = position_dodge(), color = "black", size = .75) + 
  geom_label(aes(label = Freq), show.legend = F, size = 3, nudge_y = -2 )+
  scale_fill_brewer()+ 
  facet_grid(Hypertension ~ diabetes_status, labeller = labeller(Hypertension = label_both, diabetes_status = label_value)) +
  labs(y = "Frequency (%)", 
       x = "Heart Disease", 
       title = "Relation between stroke, heart disease, hypertension and diabetes") +
  theme_minimal(base_size = 12) + 
  theme(
    axis.text = element_text(color = "black"), 
    strip.background = element_rect(colour = "black"), 
    strip.text = element_text(face = "bold"),
    panel.grid.major.x = element_blank(), 
    plot.title = element_text(hjust = 0.5)
  )

Social Factors vs Stroke

The distribution of social factors of those who experienced stroke are presented here. It appears that in those who have experienced stroke the majority of the population were working in a private setting, living urban area, married, and never smoked.

attach(df_main_bmi_mean)

stroke_yes <- 
  df_main_bmi_mean %>% 
  filter(stroke %in% "Yes") %>% 
  nrow()


fig_worktype<-
df_main_bmi_mean %>% 
  filter(stroke %in% "Yes") %>% 
  group_by(work_type) %>% 
  summarise(n = n(), 
            freq = n/stroke_yes
            ) %>% 
  mutate(ymax = cumsum(freq), 
         ymin = c(0, head(ymax, n = -1)),
         freq_perc = paste0(round(100*freq, 0), "%"),
         label = paste0(work_type, "\n ", n, " ", "(", freq_perc, ")"), 
         label_position = sort((ymax + ymin)/2)) %>% 
  ggplot(aes(ymin=ymin, ymax=ymax, xmax=3, xmin=2, fill=work_type)) +
  geom_rect() +
  coord_polar(theta="y") +
  geom_label(x=3.1, aes(y=label_position, label=label), color = "white",  size=3
             ) +
  scale_fill_manual(values = c("#3B6895", "#6495BF", "#A5CFE9")) + 
  xlim(c(1,3)) + 
  theme_void(base_size = 12) +
  labs(title = "Work Type and Storke")+ 
  theme(legend.position = "none", 
        plot.title = element_text(face = "bold", hjust = .5))

fig_residence<-
df_main_bmi_mean %>% 
  filter(stroke %in% "Yes") %>% 
  group_by(residence_type) %>% 
  summarise(n = n(), 
            freq = n/stroke_yes
            ) %>% 
  mutate(ymax = cumsum(freq), 
         ymin = c(0, head(ymax, n = -1)),
         freq_perc = paste0(round(100*freq, 0), "%"),
         label = paste0(residence_type, "\n ", n, " ", "(", freq_perc, ")"), 
         label_position = sort((ymax + ymin)/2)) %>% 
  ggplot(aes(ymin=ymin, ymax=ymax, xmax=3, xmin=2, fill=residence_type)) +
  geom_rect() +
  coord_polar(theta="y") +
  geom_label(x=2.5, aes(y=label_position, label=label), color = "white",  size=3
             ) +
  scale_fill_manual(values = c("#3B6895", "#6495BF", "#A5CFE9")) + 
  xlim(c(1,3)) + 
  theme_void(base_size = 12) +
  labs(title = "Residence Type and Storke")+ 
  theme(legend.position = "none", 
        plot.title = element_text(face = "bold", hjust = .5))



fig_marriage<-
df_main_bmi_mean %>% 
  filter(stroke %in% "Yes") %>% 
  group_by(ever_married ) %>% 
  summarise(n = n(), 
            freq = n/stroke_yes
            ) %>% 
  mutate(ymax = cumsum(freq), 
         ymin = c(0, head(ymax, n = -1)),
         freq_perc = paste0(round(100*freq, 0), "%"),
         label = paste0(ever_married , "\n ", n, " ", "(", freq_perc, ")"), 
         label_position = sort((ymax + ymin)/2)) %>% 
  ggplot(aes(ymin=ymin, ymax=ymax, xmax=3, xmin=2, fill=ever_married )) +
  geom_rect() +
  coord_polar(theta="y") +
  geom_label(x=2.5, aes(y=label_position, label=label), color = "white",  size=3
             ) +
  scale_fill_manual(values = c("#3B6895", "#6495BF", "#A5CFE9")) + 
  xlim(c(1,3)) + 
  theme_void(base_size = 12) +
  labs(title = "Marriage and Storke")+ 
  theme(legend.position = "none", 
        plot.title = element_text(face = "bold", hjust = .5))


fig_smoking<-
df_main_bmi_mean %>% 
  filter(stroke %in% "Yes") %>% 
  group_by(smoking_status  ) %>% 
  summarise(n = n(), 
            freq = n/stroke_yes
            ) %>% 
  mutate(ymax = cumsum(freq), 
         ymin = c(0, head(ymax, n = -1)),
         freq_perc = paste0(round(100*freq, 0), "%"),
         label = paste0(smoking_status  , "\n ", n, " ", "(", freq_perc, ")"), 
         label_position = sort((ymax + ymin)/2)) %>% 
  ggplot(aes(ymin=ymin, ymax=ymax, xmax=3, xmin=2, fill=smoking_status  )) +
  geom_rect() +
  coord_polar(theta="y") +
  geom_label(x=2.5, aes(y=label_position, label=label), color = "white",  size=3
             ) +
  scale_fill_manual(values = c("#3B6895", "#6495BF", "#7CADD2", "#B9DDF1")) + 
  xlim(c(1,3)) + 
  theme_void(base_size = 12) +
  labs(title = "Smoking and Storke")+ 
  theme(legend.position = "none", 
        plot.title = element_text(face = "bold", hjust = .5))


ggpubr::ggarrange(fig_worktype, fig_residence, fig_marriage, fig_smoking)

Models

Logistic Regression

Logistic Regression has 6 assumption in which we will check for here:

1. Having a response variable which is binary

In this case our binary dependent variable is stroke yes or no.

2. The observations being independent

This means that the observations should not come from repeated measures of the same individuals. We have checked for this already as we have unique ids with no repeated measures.

3. No Multicollineratity among the independent variables

When two or more explanatory variables are highly correlated to each other, it could create problems. When we think about regression coefficients, we often interpret them as while keeping all other coefficients constant. So when two or more highly correlated variables are present, when you change one, it is very likely that that the other would change as well. To find that one way is to find variance inflation factor VIF. Generally we would like to see a VIF of < 5 which suggest there is maybe a correlation but is not severe enough to cause problems.

attach(df_main_bmi_mean)



model_log<- glm(formula = stroke ~ gender + age + bmi + ever_married + work_type + residence_type +
                  smoking_status + avg_glucose_level + hypertension + heart_disease, 
                  data = df_main_bmi_mean , family = "binomial")

car::vif(model_log) %>%
  as.data.frame() %>% 
  rownames_to_column(var = "variable") %>% 
  clean_names() %>% 
  ggplot(aes(x = variable, y= gvif)) +
  geom_col(color = "black", fill = "#325F8C", width = 0.75, linewidth= .75) + 
  coord_flip() + 
  labs(title = "Variance infaltion factor", 
       x = "Variable", 
       y = "Variance inflaction factor") + 
  theme_bw(base_size = 12) + 
  theme(
    axis.text = element_text(color = "black"), 
    plot.title = element_text(hjust = 0.5)
  ) 

4. No extreme outliers

Outlier also known as influential observation can impact the model.

df_main_bmi_mean %>% 
  ggplot(aes(x = age)) + 
  geom_histogram(aes(y = ..density..), fill = "white", color = "black", size = 1, binwidth = 5) +
  geom_density(size = 1) + 
   theme_bw(base_size = 12) + 
  labs(x = "Age (years)",
       y = "Density", 
       title = "Histogram of distribution of age") +
  scale_x_continuous(breaks = seq(15,100,5)) +
  theme(
    axis.text = element_text(color = "black"), 
    panel.grid = element_blank(), 
    plot.title = element_text(hjust = 0.5)
  ) 

df_main_bmi_mean %>% 
  ggplot(aes(x = avg_glucose_level)) + 
  geom_histogram(aes(y = ..density..), fill = "white", color = "black", size = 1, binwidth =10) +
  geom_density(size = 1) + 
   theme_bw(base_size = 12) + 
  labs(x = "Average glucose level (mg/dL)", 
       y = "Density", 
       title = "Histogram of distribution of glucose levels"
       ) +
  scale_x_continuous(breaks = seq(55,300,10)) + 
  theme(
    axis.text = element_text(color = "black"), 
    panel.grid = element_blank(),
    plot.title = element_text(hjust = 0.5)
  ) 

df_main_bmi_mean %>% 
  ggplot(aes(x = bmi)) + 
  geom_histogram(aes(y = ..density..), fill = "white", color = "black", size = 1, binwidth = 4) +
  geom_density(size = 1) + 
  theme_minimal() +
   theme_bw(base_size = 12) + 
  labs(x = "BMI (kg/m2)", 
       y = "Density", 
       title = "Histogram of distribution of BMI") + 
  scale_x_continuous(breaks = seq(18,100,4)) +
  theme(
    axis.text = element_text(color = "black"), 
    panel.grid = element_blank() , 
    plot.title = element_text(hjust = 0.5)
  ) 

5. There is linear relationship between continuous explanatory variables and the logit of dependent variable

The relationship appear to be linear.

attach(df_main_bmi_mean)

model_logit <- glm(stroke ~ age + bmi + avg_glucose_level, family = "binomial", data = df_main_bmi_mean)

prob_logit <- predict(model_logit, type = "response")

logit<- log(prob_logit/(1-prob_logit))

df_main_bmi_mean %>% 
ggplot(aes(age, logit)) +
  geom_point() + 
  geom_smooth() +
  labs(x = "Age (years)", 
       y = "Logit", 
       title = "Logit of stroke vs age") +
   theme_minimal(base_size = 12) + 
  theme(
    axis.text = element_text(color = "black")
  ) 

df_main_bmi_mean %>% 
  ggplot(aes(bmi, logit)) + 
  geom_point() + 
  geom_smooth() + 
  labs(x = "BMI (kg/m2)", 
       y = "Logit", 
       title = "Logit of stroke vs BMI") +
   theme_minimal(base_size = 12) + 
  theme(
    axis.text = element_text(color = "black")
  ) 

df_main_bmi_mean %>% 
  ggplot(aes(avg_glucose_level, logit)) + 
  geom_point() + 
  geom_smooth() + 
  labs(x = "Average glucose leve (mg/dL)", 
       y = "Logit", 
       title = "Logit of stroke vs average glucose level") +
   theme_minimal(base_size = 12) + 
  theme(
    axis.text = element_text(color = "black")
  ) 

attach(df_main_bmi_mean)

df_main_bmi_mean$bmi_class <- factor(df_main_bmi_mean$bmi_class,levels=c("Normal","Underweight",
                                                 "Overweight",
                                               "Class I obese",
                                               "Class II-III obese"))

df_main_bmi_mean$smoking_status <- factor(df_main_bmi_mean$smoking_status, levels = c("Never Smoked", "Formerly Smoked", "Smokes", "Unknown"))

Overal model

attach(df_main_bmi_mean)

model_log_summary<- summary(model_log)

model_log_table<-
model_log_summary$coefficients %>%
  as.data.frame() %>% 
  bind_cols(confint(model_log)) %>%
  rownames_to_column(var = "variable") %>% 
  clean_names() %>% 
  select(variable, estimate, pr_z, 6,7) %>% 
  filter(!variable %in% "(Intercept)") %>% 
  mutate(across(.fns = exp, c(estimate, x2_5_percent, x97_5_percent)), 
          labels = c("Male", "Age", "BMI", "Married", "Work type private", "Work type self-emplyed", "Residence type urban",
                      "Smoking status formerly smoked", "Smoking status urrently smokes", "Smoking status unknown", "Glucose level", "Has hypertension",
                      "Has heart disease"
                     )
         ) %>% 
  adorn_rounding(digits = 2)

model_log_table %>% 
  ggplot(aes(y = labels)) + 
  geom_point(aes(x= estimate) , pch =22 , fill = "#5182AF", size = 3) + 
  geom_vline(xintercept = 1) + 
  geom_errorbarh(aes(xmin = x2_5_percent, xmax = x97_5_percent), width=0.4, alpha=0.9, size=.8) + 
  theme_minimal(base_size = 12) + 
  labs(x = "Expo (OR)", 
       y = "Variable")+
  theme(
    axis.text = element_text(color = "black")
  ) 

model_log_table %>% 
  mutate(ci = paste0("(",x2_5_percent, "-", x97_5_percent, ")")) %>% 
  select(labels, estimate, pr_z, ci) %>% 
  rename("Expo(OR)" = estimate, "p-value" = pr_z, "95% CI"= ci ) %>% 
  kable(align = "c", caption = "Logistic regression exponentiated odds ratios predicting stroke") %>%
  kable_classic(html_font = "Cambria") %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed")) 
Logistic regression exponentiated odds ratios predicting stroke
labels Expo(OR) p-value 95% CI
Male 1.03 0.83 (0.78-1.36)
Age 1.08 0.00 (1.07-1.09)
BMI 1.00 0.80 (0.98-1.03)
Married 0.82 0.37 (0.53-1.3)
Work type private 1.17 0.45 (0.79-1.77)
Work type self-emplyed 0.79 0.30 (0.5-1.25)
Residence type urban 1.10 0.51 (0.83-1.44)
Smoking status formerly smoked 0.81 0.23 (0.57-1.15)
Smoking status urrently smokes 1.12 0.60 (0.73-1.7)
Smoking status unknown 0.93 0.72 (0.61-1.39)
Glucose level 1.00 0.00 (1-1.01)
Has hypertension 1.50 0.01 (1.08-2.06)
Has heart disease 1.31 0.16 (0.89-1.89)

Model performance

Overall age appear to have highest AUC and better predictor.

attach(df_main_bmi_mean)

roc_age<- 
roc(stroke, as.numeric(age)) 

roc_gluc<- 
  roc(stroke, avg_glucose_level)

roc_htn<- 
  roc(stroke, as.numeric(hypertension))


roc_age %>% 
ggroc(size = 1, legacy.axes = TRUE) +
  theme_bw(base_size = 12) + 
  geom_segment(aes(x = 0, xend = 1, y = 0, yend = 1),
                 color="darkgrey", linetype="dashed") +
  labs(title = "ROC curve for age") + 
  annotate("text", x = .5, y = .3, label = paste0("AUC:", " ", round(roc_age$auc,3)), size =6) +
  theme(
    axis.text = element_text(color = "black"), 
    plot.title =element_text(hjust = 0.5)
  ) 

roc_gluc %>% 
ggroc(size = 1, legacy.axes = TRUE) +
  theme_bw(base_size = 12) + 
  geom_segment(aes(x = 0, xend = 1, y = 0, yend = 1),
                 color="darkgrey", linetype="dashed") +
  labs(title = "ROC curve for glucose levels") + 
  annotate("text", x = .5, y = .3, label = paste0("AUC:", " ", round(roc_gluc$auc,3)), size =6) +
  theme(
    axis.text = element_text(color = "black"), 
    plot.title =element_text(hjust = 0.5)
  )

roc_htn %>% 
ggroc(size = 1, legacy.axes = TRUE) +
  theme_bw(base_size = 12) + 
  geom_segment(aes(x = 0, xend = 1, y = 0, yend = 1),
                 color="darkgrey", linetype="dashed") +
  annotate("text", x = .5, y = .3, label = paste0("AUC:", " ", round(roc_htn$auc,3)), size =6) +
  labs(title = "ROC curve for hypertension") +
  theme(
    axis.text = element_text(color = "black"), 
     plot.title =element_text(hjust = 0.5)
  )

Summary

The current data suggests that age, history of hypertension and average glucose levels are the best predictors of strokes in adults. Particularly age appears to be the better predictor of stroke with AUC 0.806.

The clinical interpretation of such results would depend on the setting in which data was collected and subsequently would be applied.