The purpose of this document is to:
Develop standard operating procedures (SOPs) for data analysis related projects and to create reproducible reporting. To that end the following steps were completed:
Cleaning and visualizing publicly available dataset.
Providing detailed distribution tables of variables of interest.
Developing logistic regression model to predict the occurrence of stroke in the current dataset and assess it.
library(tidyverse)
library(janitor)
library(gtsummary)
library(knitr)
library(kableExtra)
library(skimr)
library(survey)
library(tidymodels)
library(caret)
library(pROC)
library(ggrepel)
set.seed(42)
df_input<-
read.csv("data/stroke.csv") %>%
clean_names()
This tells me what kind of variables I have and what kind of classes they are. Usually for analysis I would like to see each categorical variables as a factor and numerical variables as a double.
glimpse(df_input)
## Rows: 5,110
## Columns: 12
## $ id <int> 9046, 51676, 31112, 60182, 1665, 56669, 53882, 10434…
## $ gender <chr> "Male", "Female", "Male", "Female", "Female", "Male"…
## $ age <dbl> 67, 61, 80, 49, 79, 81, 74, 69, 59, 78, 81, 61, 54, …
## $ hypertension <int> 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1…
## $ heart_disease <int> 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0…
## $ ever_married <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No…
## $ work_type <chr> "Private", "Self-employed", "Private", "Private", "S…
## $ residence_type <chr> "Urban", "Rural", "Rural", "Urban", "Rural", "Urban"…
## $ avg_glucose_level <dbl> 228.69, 202.21, 105.92, 171.23, 174.12, 186.21, 70.0…
## $ bmi <chr> "36.6", "N/A", "32.5", "34.4", "24", "29", "27.4", "…
## $ smoking_status <chr> "formerly smoked", "never smoked", "never smoked", "…
## $ stroke <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
I then would like to individually examine each of the variables. Additionally, I would usually like to categorize continuous variables when possible. For example for the current dataset I can break down age by decade, BMI as normal weight, over weight, obesity class I-III and glucose levels as normal, pre-diabetic and diabetic. Such categories would allow for more flexibility when performing EDA and data visualization.
attach(df_input)
df_clean<-
df_input %>%
mutate(hypertension =
case_when(
hypertension == 1 ~ "Yes",
hypertension == 0 ~ "No"
),
heart_disease =
case_when(
heart_disease == 1 ~ "Yes",
heart_disease == 0 ~ "No"),
work_type =
case_when(
work_type %in% "children" ~ "Children",
work_type %in% "Govt_job" ~ "Government Job",
work_type %in% "Never_worked" ~ "Never Worked",
TRUE ~ as.factor(work_type)
),
stroke =
case_when(
stroke== 1 ~ "Yes",
stroke == 0 ~ "No"),
bmi = as.double(bmi),
smoking_status = str_to_title(smoking_status),
age_group_decade =
case_when(
age>=0 & age<10 ~ "0-10",
age>=10 & age<20 ~ "10-20",
age>=20 & age<30 ~ "20-30",
age>=30 & age<40 ~ "30-40",
age>=40 & age<50 ~ "40-50",
age>=50 & age<60 ~ "50-60",
age>=60 & age<70 ~ "60-70",
age>=70 & age<80 ~ "70-80",
age>=80 ~ "> 80"
),
age_group_decade = factor(age_group_decade, levels = c("0-10","10-20","20-30", "30-40", "40-50", "50-60", "60-70", "70-80", "> 80")),
age_group_50 =
case_when(age<=50 ~ "<= 50",
age>50 ~ "> 50"),
bmi_class = case_when(
bmi <18.5 ~ "Underweight",
bmi >=18.5 & bmi <= 24.9 ~ "Normal",
bmi >= 25.0 & bmi <= 29.9 ~ "Overweight",
bmi >= 30.0 & bmi <= 34.9 ~ "Class I obese",
bmi >=35 ~ "Class II-III obese"
),
bmi_class = factor(bmi_class,levels=c("Underweight","Normal",
"Overweight",
"Class I obese",
"Class II-III obese")),
diabetes_status =
case_when(
avg_glucose_level<= 99 ~ "Normal",
avg_glucose_level>99 & avg_glucose_level<= 125 ~ "Pre-diabetic",
avg_glucose_level> 125 ~ "Diabetic"),
diabetes_status = factor(diabetes_status, level = c("Normal", "Pre-diabetic", "Diabetic"))
) %>%
mutate_if(is.character, as.factor) %>%
mutate_if(is.numeric, as.double)
df_clean %>%
skim_without_charts() %>%
yank("numeric") %>%
adorn_rounding(digits = 2) %>%
as.data.frame() %>%
kable(align = "c", caption = "Summary statistics of numerical variables") %>%
kable_classic(html_font = "Cambria") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"), )
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
|---|---|---|---|---|---|---|---|---|---|
| id | 0 | 1.00 | 36517.83 | 21161.72 | 67.00 | 17741.25 | 36932.00 | 54682.00 | 72940.00 |
| age | 0 | 1.00 | 43.23 | 22.61 | 0.08 | 25.00 | 45.00 | 61.00 | 82.00 |
| avg_glucose_level | 0 | 1.00 | 106.15 | 45.28 | 55.12 | 77.24 | 91.88 | 114.09 | 271.74 |
| bmi | 201 | 0.96 | 28.89 | 7.85 | 10.30 | 23.50 | 28.10 | 33.10 | 97.60 |
df_clean %>%
skim() %>%
yank("factor") %>%
adorn_rounding(digits = 2) %>%
as.data.frame() %>%
kable(align = "c", caption = "Summary statistics of categorical variables") %>%
kable_classic(html_font = "Cambria") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| gender | 0 | 1.00 | FALSE | 3 | Fem: 2994, Mal: 2115, Oth: 1 |
| hypertension | 0 | 1.00 | FALSE | 2 | No: 4612, Yes: 498 |
| heart_disease | 0 | 1.00 | FALSE | 2 | No: 4834, Yes: 276 |
| ever_married | 0 | 1.00 | FALSE | 2 | Yes: 3353, No: 1757 |
| work_type | 0 | 1.00 | FALSE | 5 | Pri: 2925, Sel: 819, Chi: 687, Gov: 657 |
| residence_type | 0 | 1.00 | FALSE | 2 | Urb: 2596, Rur: 2514 |
| smoking_status | 0 | 1.00 | FALSE | 4 | Nev: 1892, Unk: 1544, For: 885, Smo: 789 |
| stroke | 0 | 1.00 | FALSE | 2 | No: 4861, Yes: 249 |
| age_group_decade | 0 | 1.00 | FALSE | 9 | 50-: 834, 40-: 730, 30-: 655, 60-: 621 |
| age_group_50 | 0 | 1.00 | FALSE | 2 | <= : 2983, > 5: 2127 |
| bmi_class | 201 | 0.96 | FALSE | 5 | Ove: 1409, Nor: 1243, Cla: 1000, Cla: 920 |
| diabetes_status | 0 | 1.00 | FALSE | 3 | Nor: 3071, Pre: 1039, Dia: 1000 |
This is going to help me figure out what are the characteristics of those patients who have had a history of stroke versus those who did not. I usually use this table as a reference. I additionally provide some visuals after for better understanding of the data.
tb1<-
df_clean %>%
select(!c(id, age_group_decade)) %>%
tbl_summary(by = stroke,
type = list(all_continuous() ~ "continuous2",
c(hypertension, heart_disease, ever_married) ~ "categorical"),
statistic = all_continuous() ~ c("{mean} ({sd})","{median} ({p25}, {p75})", "{min}, {max}"),
label = list(age ~ "Age (years)",
gender ~ "Gender",
bmi ~ "BMI (kg/m2)",
avg_glucose_level ~ "Glucose level (mg/dL)",
hypertension ~ "Hypertension",
heart_disease ~ "Heart disease",
ever_married ~ "Ever married",
work_type ~ "Work type",
smoking_status ~ "Smoking status",
residence_type ~ "Residence type",
bmi_class ~ "BMI class",
age_group_50 ~ "Age Group",
diabetes_status ~ "Diabetes Status"
)
)%>%
bold_labels() %>%
add_overall(last = T) %>%
modify_footnote(everything() ~ NA) %>%
modify_caption("Table of patient's Characteristics") %>%
modify_header(all_stat_cols() ~ "**{level}**\n N = {n} \n({style_percent(p)}%)") %>%
as_kable_extra(
booktabs = TRUE,
longtable = TRUE) %>%
add_header_above(c(" "= 1,"Stroke History" = 2, " "= 1), bold = T) %>%
kable_classic(html_font = "Cambria") %>%
kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed")) %>%
row_spec(c(4, 8, 21,35,40, 45), bold = T, color = "white", background = "#D7261E") %>%
scroll_box(height = "600px", width = "1000px")
tb1
| Characteristic |
No N = 4861 (95%) |
Yes N = 249 (4.9%) |
Overall N = 5110 (100%) |
|---|---|---|---|
| Gender | |||
| Female | 2,853 (59%) | 141 (57%) | 2,994 (59%) |
| Male | 2,007 (41%) | 108 (43%) | 2,115 (41%) |
| Other | 1 (<0.1%) | 0 (0%) | 1 (<0.1%) |
| Age (years) | |||
| Mean (SD) | 42 (22) | 68 (13) | 43 (23) |
| Median (IQR) | 43 (24, 59) | 71 (59, 78) | 45 (25, 61) |
| Range | 0, 82 | 1, 82 | 0, 82 |
| Hypertension | |||
| No | 4,429 (91%) | 183 (73%) | 4,612 (90%) |
| Yes | 432 (8.9%) | 66 (27%) | 498 (9.7%) |
| Heart disease | |||
| No | 4,632 (95%) | 202 (81%) | 4,834 (95%) |
| Yes | 229 (4.7%) | 47 (19%) | 276 (5.4%) |
| Ever married | |||
| No | 1,728 (36%) | 29 (12%) | 1,757 (34%) |
| Yes | 3,133 (64%) | 220 (88%) | 3,353 (66%) |
| Work type | |||
| Children | 685 (14%) | 2 (0.8%) | 687 (13%) |
| Government Job | 624 (13%) | 33 (13%) | 657 (13%) |
| Never Worked | 22 (0.5%) | 0 (0%) | 22 (0.4%) |
| Private | 2,776 (57%) | 149 (60%) | 2,925 (57%) |
| Self-employed | 754 (16%) | 65 (26%) | 819 (16%) |
| Residence type | |||
| Rural | 2,400 (49%) | 114 (46%) | 2,514 (49%) |
| Urban | 2,461 (51%) | 135 (54%) | 2,596 (51%) |
| Glucose level (mg/dL) | |||
| Mean (SD) | 105 (44) | 133 (62) | 106 (45) |
| Median (IQR) | 91 (77, 113) | 105 (80, 197) | 92 (77, 114) |
| Range | 55, 268 | 56, 272 | 55, 272 |
| BMI (kg/m2) | |||
| Mean (SD) | 29 (8) | 30 (6) | 29 (8) |
| Median (IQR) | 28 (23, 33) | 30 (26, 34) | 28 (24, 33) |
| Range | 10, 98 | 17, 57 | 10, 98 |
| Unknown | 161 | 40 | 201 |
| Smoking status | |||
| Formerly Smoked | 815 (17%) | 70 (28%) | 885 (17%) |
| Never Smoked | 1,802 (37%) | 90 (36%) | 1,892 (37%) |
| Smokes | 747 (15%) | 42 (17%) | 789 (15%) |
| Unknown | 1,497 (31%) | 47 (19%) | 1,544 (30%) |
| Age Group | |||
| <= 50 | 2,960 (61%) | 23 (9.2%) | 2,983 (58%) |
| > 50 | 1,901 (39%) | 226 (91%) | 2,127 (42%) |
| BMI class | |||
| Underweight | 336 (7.1%) | 1 (0.5%) | 337 (6.9%) |
| Normal | 1,208 (26%) | 35 (17%) | 1,243 (25%) |
| Overweight | 1,334 (28%) | 75 (36%) | 1,409 (29%) |
| Class I obese | 944 (20%) | 56 (27%) | 1,000 (20%) |
| Class II-III obese | 878 (19%) | 42 (20%) | 920 (19%) |
| Unknown | 161 | 40 | 201 |
| Diabetes Status | |||
| Normal | 2,960 (61%) | 111 (45%) | 3,071 (60%) |
| Pre-diabetic | 1,001 (21%) | 38 (15%) | 1,039 (20%) |
| Diabetic | 900 (19%) | 100 (40%) | 1,000 (20%) |
Having looked at the data, it suggest that there are several variables that may require further cleaning (denoted by red). For example:
We have only 1 person who classified their gender as “Other”
The age range is wide (1-82 years old)
There is only 22 people who responded work type as “Never Worked”
There is 201 missing values in BMI and 30% of folks reported “unknown” as their smoking status
One person who had stroke was underweight (it’s likely one of the infants in the dataset)
It is critical to recognize these and really narrow down our research question to have more powerful models. In this particular case I am going narrow my question down to assessing the risk of stroke in the following population:
Adults age 18 and older
BMI of normal range and higher
Those who are biologic male and female
Such changes would yield the following table:
df_main<-
df_clean %>%
filter(
age>=18,
!bmi_class %in% "Underweight",
!gender %in% "Other",
)
tb2<-
df_main %>%
select(!c(id, age_group_decade)) %>%
tbl_summary(by = stroke,
type = list(all_continuous() ~ "continuous2",
c(hypertension, heart_disease, ever_married) ~ "categorical"),
statistic = all_continuous() ~ c("{mean} ({sd})","{median} ({p25}, {p75})", "{min}, {max}"),
label = list(age ~ "Age (years)",
gender ~ "Gender",
bmi ~ "BMI (kg/m2)",
avg_glucose_level ~ "Glucose level (mg/dL)",
hypertension ~ "Hypertension",
heart_disease ~ "Heart disease",
ever_married ~ "Ever married",
work_type ~ "Work type",
smoking_status ~ "Smoking status",
residence_type ~ "Residence type",
bmi_class ~ "BMI class",
age_group_50 ~ "Age Group",
diabetes_status ~ "Diabetes Status"
)
)%>%
bold_labels() %>%
add_overall(last = T) %>%
modify_footnote(everything() ~ NA) %>%
modify_caption("Table of patient's Characteristics") %>%
modify_header(all_stat_cols() ~ "**{level}**\n N = {n} \n({style_percent(p)}%)") %>%
as_kable_extra(
booktabs = TRUE,
longtable = TRUE) %>%
add_header_above(c(" "= 1,"Stroke History" = 2, " "= 1), bold = T) %>%
kable_classic(html_font = "Cambria") %>%
kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed")) %>%
row_spec(c(4, 8, 21,35,40, 45), bold = T, color = "white", background = "#D7261E") %>%
scroll_box(height = "600px", width = "1000px")
tb2
| Characteristic |
No N = 3969 (94%) |
Yes N = 246 (5.8%) |
Overall N = 4215 (100%) |
|---|---|---|---|
| Gender | |||
| Female | 2,406 (61%) | 138 (56%) | 2,544 (60%) |
| Male | 1,563 (39%) | 108 (44%) | 1,671 (40%) |
| Other | 0 (0%) | 0 (0%) | 0 (0%) |
| Age (years) | |||
| Mean (SD) | 49 (18) | 68 (12) | 50 (18) |
| Median (IQR) | 49 (35, 62) | 71 (59, 78) | 50 (36, 64) |
| Range | 18, 82 | 32, 82 | 18, 82 |
| Hypertension | |||
| No | 3,540 (89%) | 180 (73%) | 3,720 (88%) |
| Yes | 429 (11%) | 66 (27%) | 495 (12%) |
| Heart disease | |||
| No | 3,741 (94%) | 199 (81%) | 3,940 (93%) |
| Yes | 228 (5.7%) | 47 (19%) | 275 (6.5%) |
| Ever married | |||
| No | 861 (22%) | 27 (11%) | 888 (21%) |
| Yes | 3,108 (78%) | 219 (89%) | 3,327 (79%) |
| Work type | |||
| Children | 0 (0%) | 0 (0%) | 0 (0%) |
| Government Job | 615 (15%) | 33 (13%) | 648 (15%) |
| Never Worked | 5 (0.1%) | 0 (0%) | 5 (0.1%) |
| Private | 2,613 (66%) | 149 (61%) | 2,762 (66%) |
| Self-employed | 736 (19%) | 64 (26%) | 800 (19%) |
| Residence type | |||
| Rural | 1,953 (49%) | 112 (46%) | 2,065 (49%) |
| Urban | 2,016 (51%) | 134 (54%) | 2,150 (51%) |
| Glucose level (mg/dL) | |||
| Mean (SD) | 107 (46) | 133 (62) | 109 (48) |
| Median (IQR) | 92 (77, 114) | 106 (80, 197) | 93 (77, 116) |
| Range | 55, 268 | 56, 272 | 55, 272 |
| BMI (kg/m2) | |||
| Mean (SD) | 31 (7) | 31 (6) | 31 (7) |
| Median (IQR) | 29 (26, 34) | 30 (26, 34) | 29 (26, 34) |
| Range | 18, 92 | 19, 57 | 18, 92 |
| Unknown | 142 | 39 | 181 |
| Smoking status | |||
| Formerly Smoked | 781 (20%) | 70 (28%) | 851 (20%) |
| Never Smoked | 1,644 (41%) | 89 (36%) | 1,733 (41%) |
| Smokes | 734 (18%) | 42 (17%) | 776 (18%) |
| Unknown | 810 (20%) | 45 (18%) | 855 (20%) |
| Age Group | |||
| <= 50 | 2,088 (53%) | 21 (8.5%) | 2,109 (50%) |
| > 50 | 1,881 (47%) | 225 (91%) | 2,106 (50%) |
| BMI class | |||
| Underweight | 0 (0%) | 0 (0%) | 0 (0%) |
| Normal | 829 (22%) | 35 (17%) | 864 (21%) |
| Overweight | 1,244 (33%) | 75 (36%) | 1,319 (33%) |
| Class I obese | 904 (24%) | 55 (27%) | 959 (24%) |
| Class II-III obese | 850 (22%) | 42 (20%) | 892 (22%) |
| Unknown | 142 | 39 | 181 |
| Diabetes Status | |||
| Normal | 2,384 (60%) | 108 (44%) | 2,492 (59%) |
| Pre-diabetic | 784 (20%) | 38 (15%) | 822 (20%) |
| Diabetic | 801 (20%) | 100 (41%) | 901 (21%) |
After removing all variables we are in a good position except that there is only 5 people in the “Never Worked” category. This probably will not be powered enough for the model. Therefore I will remove the category entirely for better accuracy.
df_main_wo_everworked<-
df_main %>%
filter(!work_type %in% "Never Worked")
tb3<-
df_main_wo_everworked %>%
select(!c(id, age_group_decade)) %>%
tbl_summary(by = stroke,
type = list(all_continuous() ~ "continuous2",
c(hypertension, heart_disease, ever_married) ~ "categorical"),
statistic = all_continuous() ~ c("{mean} ({sd})","{median} ({p25}, {p75})", "{min}, {max}"),
label = list(age ~ "Age (years)",
gender ~ "Gender",
bmi ~ "BMI (kg/m2)",
avg_glucose_level ~ "Glucose level (mg/dL)",
hypertension ~ "Hypertension",
heart_disease ~ "Heart disease",
ever_married ~ "Ever married",
work_type ~ "Work type",
smoking_status ~ "Smoking status",
residence_type ~ "Residence type",
bmi_class ~ "BMI class",
age_group_50 ~ "Age Group",
diabetes_status ~ "Diabetes Status"
)
)%>%
bold_labels() %>%
add_overall(last = T) %>%
modify_footnote(everything() ~ NA) %>%
modify_caption("Table of patient's Characteristics") %>%
modify_header(all_stat_cols() ~ "**{level}**\n N = {n} \n({style_percent(p)}%)") %>%
as_kable_extra(
booktabs = TRUE,
longtable = TRUE) %>%
add_header_above(c(" "= 1,"Stroke History" = 2, " "= 1), bold = T) %>%
kable_classic(html_font = "Cambria") %>%
kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed")) %>%
row_spec(c(4, 8, 21,35,40, 45), bold = T, color = "white", background = "#D7261E") %>%
scroll_box(height = "600px", width = "1000px")
tb3
| Characteristic |
No N = 3964 (94%) |
Yes N = 246 (5.8%) |
Overall N = 4210 (100%) |
|---|---|---|---|
| Gender | |||
| Female | 2,402 (61%) | 138 (56%) | 2,540 (60%) |
| Male | 1,562 (39%) | 108 (44%) | 1,670 (40%) |
| Other | 0 (0%) | 0 (0%) | 0 (0%) |
| Age (years) | |||
| Mean (SD) | 49 (17) | 68 (12) | 50 (18) |
| Median (IQR) | 49 (35, 62) | 71 (59, 78) | 51 (36, 64) |
| Range | 18, 82 | 32, 82 | 18, 82 |
| Hypertension | |||
| No | 3,535 (89%) | 180 (73%) | 3,715 (88%) |
| Yes | 429 (11%) | 66 (27%) | 495 (12%) |
| Heart disease | |||
| No | 3,736 (94%) | 199 (81%) | 3,935 (93%) |
| Yes | 228 (5.8%) | 47 (19%) | 275 (6.5%) |
| Ever married | |||
| No | 856 (22%) | 27 (11%) | 883 (21%) |
| Yes | 3,108 (78%) | 219 (89%) | 3,327 (79%) |
| Work type | |||
| Children | 0 (0%) | 0 (0%) | 0 (0%) |
| Government Job | 615 (16%) | 33 (13%) | 648 (15%) |
| Never Worked | 0 (0%) | 0 (0%) | 0 (0%) |
| Private | 2,613 (66%) | 149 (61%) | 2,762 (66%) |
| Self-employed | 736 (19%) | 64 (26%) | 800 (19%) |
| Residence type | |||
| Rural | 1,953 (49%) | 112 (46%) | 2,065 (49%) |
| Urban | 2,011 (51%) | 134 (54%) | 2,145 (51%) |
| Glucose level (mg/dL) | |||
| Mean (SD) | 107 (46) | 133 (62) | 109 (48) |
| Median (IQR) | 92 (77, 114) | 106 (80, 197) | 93 (77, 116) |
| Range | 55, 268 | 56, 272 | 55, 272 |
| BMI (kg/m2) | |||
| Mean (SD) | 31 (7) | 31 (6) | 31 (7) |
| Median (IQR) | 29 (26, 34) | 30 (26, 34) | 29 (26, 34) |
| Range | 18, 92 | 19, 57 | 18, 92 |
| Unknown | 142 | 39 | 181 |
| Smoking status | |||
| Formerly Smoked | 781 (20%) | 70 (28%) | 851 (20%) |
| Never Smoked | 1,641 (41%) | 89 (36%) | 1,730 (41%) |
| Smokes | 734 (19%) | 42 (17%) | 776 (18%) |
| Unknown | 808 (20%) | 45 (18%) | 853 (20%) |
| Age Group | |||
| <= 50 | 2,083 (53%) | 21 (8.5%) | 2,104 (50%) |
| > 50 | 1,881 (47%) | 225 (91%) | 2,106 (50%) |
| BMI class | |||
| Underweight | 0 (0%) | 0 (0%) | 0 (0%) |
| Normal | 825 (22%) | 35 (17%) | 860 (21%) |
| Overweight | 1,243 (33%) | 75 (36%) | 1,318 (33%) |
| Class I obese | 904 (24%) | 55 (27%) | 959 (24%) |
| Class II-III obese | 850 (22%) | 42 (20%) | 892 (22%) |
| Unknown | 142 | 39 | 181 |
| Diabetes Status | |||
| Normal | 2,380 (60%) | 108 (44%) | 2,488 (59%) |
| Pre-diabetic | 784 (20%) | 38 (15%) | 822 (20%) |
| Diabetic | 800 (20%) | 100 (41%) | 900 (21%) |
In the current dataset, I have found that 201 patients did not have a bmi calculated. There are indeed many ways in which one could impute the missing variables. In this dataset, I decided to keep it simple and calculate the mean and median of the rest of population and to use the mean bmi as the imputed value.
attach(df_main_wo_everworked)
# mean and bmi are so similar that we would used the mean in our data for imputation
bmi_mean_median<-
df_main_wo_everworked %>%
filter(!is.na(bmi)) %>%
summarise(mean = mean(bmi), median = median(bmi)) %>%
round(2)
bmi_mean_median %>%
kable(align = "c", caption = "Mean and Median BMI of the non-missing values") %>%
kable_classic(html_font = "Cambria") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| mean | median |
|---|---|
| 30.57 | 29.3 |
df_main_bmi_mean<-
df_main_wo_everworked %>%
mutate(bmi =
case_when(
bmi = is.na(bmi) ~ bmi_mean_median$mean,
TRUE ~ as.double(bmi)
))
It appears that in those male and females who have had stroke, the bulk of the data trending toward the older ages in both males and females.
attach(df_main_bmi_mean)
df_main_bmi_mean %>%
ggplot(aes(x = gender , y = age, fill = gender))+
geom_violin(trim = FALSE, show.legend = F, color = "black") +
geom_boxplot(width = 0.07, show.legend = F) +
scale_y_continuous(breaks = seq(0,100,10))+
scale_fill_brewer()+
facet_wrap(~stroke) +
theme_bw(base_size = 12)+
labs(title = "Relationship between history of stroke, age and gender",
x = "Gender", y = "Age (years)")+
theme(panel.grid = element_blank(),
axis.text = element_text(colour = "black"),
strip.background = element_rect(fill = "white"),
strip.background.x = element_rect(size = 1)
)
Looking at the frequencies of the stroke it appears that the prevalence of stroke is higher in those with diabetes and heart disease. Also in those without diabetes the prevalence of stroke seem to be higher if they had hypertension.
attach(df_main_bmi_mean)
round(prop.table(table(stroke, heart_disease, hypertension, diabetes_status), c(4,3,2)),3) %>%
as.data.frame() %>%
rename("Stroke" = stroke,
"Hypertension" = hypertension) %>%
mutate(Freq = Freq*100) %>%
ggplot(aes(x = heart_disease, y = Freq, fill = Stroke)) +
geom_col(position = position_dodge(), color = "black", size = .75) +
geom_label(aes(label = Freq), show.legend = F, size = 3, nudge_y = -2 )+
scale_fill_brewer()+
facet_grid(Hypertension ~ diabetes_status, labeller = labeller(Hypertension = label_both, diabetes_status = label_value)) +
labs(y = "Frequency (%)",
x = "Heart Disease",
title = "Relation between stroke, heart disease, hypertension and diabetes") +
theme_minimal(base_size = 12) +
theme(
axis.text = element_text(color = "black"),
strip.background = element_rect(colour = "black"),
strip.text = element_text(face = "bold"),
panel.grid.major.x = element_blank(),
plot.title = element_text(hjust = 0.5)
)
Logistic Regression has 6 assumption in which we will check for here:
1. Having a response variable which is binary
In this case our binary dependent variable is stroke yes or no.
2. The observations being independent
This means that the observations should not come from repeated measures of the same individuals. We have checked for this already as we have unique ids with no repeated measures.
3. No Multicollineratity among the independent variables
When two or more explanatory variables are highly correlated to each other, it could create problems. When we think about regression coefficients, we often interpret them as while keeping all other coefficients constant. So when two or more highly correlated variables are present, when you change one, it is very likely that that the other would change as well. To find that one way is to find variance inflation factor VIF. Generally we would like to see a VIF of < 5 which suggest there is maybe a correlation but is not severe enough to cause problems.
attach(df_main_bmi_mean)
model_log<- glm(formula = stroke ~ gender + age + bmi + ever_married + work_type + residence_type +
smoking_status + avg_glucose_level + hypertension + heart_disease,
data = df_main_bmi_mean , family = "binomial")
car::vif(model_log) %>%
as.data.frame() %>%
rownames_to_column(var = "variable") %>%
clean_names() %>%
ggplot(aes(x = variable, y= gvif)) +
geom_col(color = "black", fill = "#325F8C", width = 0.75, linewidth= .75) +
coord_flip() +
labs(title = "Variance infaltion factor",
x = "Variable",
y = "Variance inflaction factor") +
theme_bw(base_size = 12) +
theme(
axis.text = element_text(color = "black"),
plot.title = element_text(hjust = 0.5)
)
4. No extreme outliers
Outlier also known as influential observation can impact the model.
df_main_bmi_mean %>%
ggplot(aes(x = age)) +
geom_histogram(aes(y = ..density..), fill = "white", color = "black", size = 1, binwidth = 5) +
geom_density(size = 1) +
theme_bw(base_size = 12) +
labs(x = "Age (years)",
y = "Density",
title = "Histogram of distribution of age") +
scale_x_continuous(breaks = seq(15,100,5)) +
theme(
axis.text = element_text(color = "black"),
panel.grid = element_blank(),
plot.title = element_text(hjust = 0.5)
)
df_main_bmi_mean %>%
ggplot(aes(x = avg_glucose_level)) +
geom_histogram(aes(y = ..density..), fill = "white", color = "black", size = 1, binwidth =10) +
geom_density(size = 1) +
theme_bw(base_size = 12) +
labs(x = "Average glucose level (mg/dL)",
y = "Density",
title = "Histogram of distribution of glucose levels"
) +
scale_x_continuous(breaks = seq(55,300,10)) +
theme(
axis.text = element_text(color = "black"),
panel.grid = element_blank(),
plot.title = element_text(hjust = 0.5)
)
df_main_bmi_mean %>%
ggplot(aes(x = bmi)) +
geom_histogram(aes(y = ..density..), fill = "white", color = "black", size = 1, binwidth = 4) +
geom_density(size = 1) +
theme_minimal() +
theme_bw(base_size = 12) +
labs(x = "BMI (kg/m2)",
y = "Density",
title = "Histogram of distribution of BMI") +
scale_x_continuous(breaks = seq(18,100,4)) +
theme(
axis.text = element_text(color = "black"),
panel.grid = element_blank() ,
plot.title = element_text(hjust = 0.5)
)
5. There is linear relationship between continuous explanatory variables and the logit of dependent variable
The relationship appear to be linear.
attach(df_main_bmi_mean)
model_logit <- glm(stroke ~ age + bmi + avg_glucose_level, family = "binomial", data = df_main_bmi_mean)
prob_logit <- predict(model_logit, type = "response")
logit<- log(prob_logit/(1-prob_logit))
df_main_bmi_mean %>%
ggplot(aes(age, logit)) +
geom_point() +
geom_smooth() +
labs(x = "Age (years)",
y = "Logit",
title = "Logit of stroke vs age") +
theme_minimal(base_size = 12) +
theme(
axis.text = element_text(color = "black")
)
df_main_bmi_mean %>%
ggplot(aes(bmi, logit)) +
geom_point() +
geom_smooth() +
labs(x = "BMI (kg/m2)",
y = "Logit",
title = "Logit of stroke vs BMI") +
theme_minimal(base_size = 12) +
theme(
axis.text = element_text(color = "black")
)
df_main_bmi_mean %>%
ggplot(aes(avg_glucose_level, logit)) +
geom_point() +
geom_smooth() +
labs(x = "Average glucose leve (mg/dL)",
y = "Logit",
title = "Logit of stroke vs average glucose level") +
theme_minimal(base_size = 12) +
theme(
axis.text = element_text(color = "black")
)
attach(df_main_bmi_mean)
df_main_bmi_mean$bmi_class <- factor(df_main_bmi_mean$bmi_class,levels=c("Normal","Underweight",
"Overweight",
"Class I obese",
"Class II-III obese"))
df_main_bmi_mean$smoking_status <- factor(df_main_bmi_mean$smoking_status, levels = c("Never Smoked", "Formerly Smoked", "Smokes", "Unknown"))
attach(df_main_bmi_mean)
model_log_summary<- summary(model_log)
model_log_table<-
model_log_summary$coefficients %>%
as.data.frame() %>%
bind_cols(confint(model_log)) %>%
rownames_to_column(var = "variable") %>%
clean_names() %>%
select(variable, estimate, pr_z, 6,7) %>%
filter(!variable %in% "(Intercept)") %>%
mutate(across(.fns = exp, c(estimate, x2_5_percent, x97_5_percent)),
labels = c("Male", "Age", "BMI", "Married", "Work type private", "Work type self-emplyed", "Residence type urban",
"Smoking status formerly smoked", "Smoking status urrently smokes", "Smoking status unknown", "Glucose level", "Has hypertension",
"Has heart disease"
)
) %>%
adorn_rounding(digits = 2)
model_log_table %>%
ggplot(aes(y = labels)) +
geom_point(aes(x= estimate) , pch =22 , fill = "#5182AF", size = 3) +
geom_vline(xintercept = 1) +
geom_errorbarh(aes(xmin = x2_5_percent, xmax = x97_5_percent), width=0.4, alpha=0.9, size=.8) +
theme_minimal(base_size = 12) +
labs(x = "Expo (OR)",
y = "Variable")+
theme(
axis.text = element_text(color = "black")
)
model_log_table %>%
mutate(ci = paste0("(",x2_5_percent, "-", x97_5_percent, ")")) %>%
select(labels, estimate, pr_z, ci) %>%
rename("Expo(OR)" = estimate, "p-value" = pr_z, "95% CI"= ci ) %>%
kable(align = "c", caption = "Logistic regression exponentiated odds ratios predicting stroke") %>%
kable_classic(html_font = "Cambria") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| labels | Expo(OR) | p-value | 95% CI |
|---|---|---|---|
| Male | 1.03 | 0.83 | (0.78-1.36) |
| Age | 1.08 | 0.00 | (1.07-1.09) |
| BMI | 1.00 | 0.80 | (0.98-1.03) |
| Married | 0.82 | 0.37 | (0.53-1.3) |
| Work type private | 1.17 | 0.45 | (0.79-1.77) |
| Work type self-emplyed | 0.79 | 0.30 | (0.5-1.25) |
| Residence type urban | 1.10 | 0.51 | (0.83-1.44) |
| Smoking status formerly smoked | 0.81 | 0.23 | (0.57-1.15) |
| Smoking status urrently smokes | 1.12 | 0.60 | (0.73-1.7) |
| Smoking status unknown | 0.93 | 0.72 | (0.61-1.39) |
| Glucose level | 1.00 | 0.00 | (1-1.01) |
| Has hypertension | 1.50 | 0.01 | (1.08-2.06) |
| Has heart disease | 1.31 | 0.16 | (0.89-1.89) |
Overall age appear to have highest AUC and better predictor.
attach(df_main_bmi_mean)
roc_age<-
roc(stroke, as.numeric(age))
roc_gluc<-
roc(stroke, avg_glucose_level)
roc_htn<-
roc(stroke, as.numeric(hypertension))
roc_age %>%
ggroc(size = 1, legacy.axes = TRUE) +
theme_bw(base_size = 12) +
geom_segment(aes(x = 0, xend = 1, y = 0, yend = 1),
color="darkgrey", linetype="dashed") +
labs(title = "ROC curve for age") +
annotate("text", x = .5, y = .3, label = paste0("AUC:", " ", round(roc_age$auc,3)), size =6) +
theme(
axis.text = element_text(color = "black"),
plot.title =element_text(hjust = 0.5)
)
roc_gluc %>%
ggroc(size = 1, legacy.axes = TRUE) +
theme_bw(base_size = 12) +
geom_segment(aes(x = 0, xend = 1, y = 0, yend = 1),
color="darkgrey", linetype="dashed") +
labs(title = "ROC curve for glucose levels") +
annotate("text", x = .5, y = .3, label = paste0("AUC:", " ", round(roc_gluc$auc,3)), size =6) +
theme(
axis.text = element_text(color = "black"),
plot.title =element_text(hjust = 0.5)
)
roc_htn %>%
ggroc(size = 1, legacy.axes = TRUE) +
theme_bw(base_size = 12) +
geom_segment(aes(x = 0, xend = 1, y = 0, yend = 1),
color="darkgrey", linetype="dashed") +
annotate("text", x = .5, y = .3, label = paste0("AUC:", " ", round(roc_htn$auc,3)), size =6) +
labs(title = "ROC curve for hypertension") +
theme(
axis.text = element_text(color = "black"),
plot.title =element_text(hjust = 0.5)
)
The current data suggests that age, history of hypertension and average glucose levels are the best predictors of strokes in adults. Particularly age appears to be the better predictor of stroke with AUC 0.806.
The clinical interpretation of such results would depend on the setting in which data was collected and subsequently would be applied.
Social Factors vs Stroke
The distribution of social factors of those who experienced stroke are presented here. It appears that in those who have experienced stroke the majority of the population were working in a private setting, living urban area, married, and never smoked.