Understanding Stroke Risk Factors

Author

Lauren Bolden & Ciara Walker

QR Code

#install.packages('qrcode')
library(qrcode)

qr <- qr_code("https://rpubs.com/bolden/1429157")

plot(qr)

Introduction

Background: What is stroke? A stroke is a medical condition where part of the brain is damaged because blood supply has been disrupted.

Why? Stroke is a leading cause of death and long term disability. Understanding key risk factors can help identify high risk individuals early.

This project will use a healthcare data to analyze demographics and health related factors to identify patterns associated with stroke risk.

Data

Dataset: Stroke Prediction Data

The dataset included:

  • 5,110 observations

  • 12 variables related to patient health and demographics.

Key Variables:

  • Age

  • Average glucose level

  • BMI

  • Hypertension

  • Heart Disease

  • Smoking Status

  • Stroke (Target variable)

Research Question: What demographic and health- related factors are more associated with the likelihood of experiencing a stroke?

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.5.2
✔ ggplot2   4.0.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readr)
library(ggplot2)
library(plotly)

Attaching package: 'plotly'

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout
library(cluster)
library(dplyr)
library(DT)
library(patchwork)
library(scales)

Attaching package: 'scales'

The following object is masked from 'package:purrr':

    discard

The following object is masked from 'package:readr':

    col_factor
df <- read_csv("healthcare-dataset-stroke-data.csv")
Rows: 5110 Columns: 12
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): gender, ever_married, work_type, Residence_type, bmi, smoking_status
dbl (6): id, age, hypertension, heart_disease, avg_glucose_level, stroke

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(df)
Rows: 5,110
Columns: 12
$ id                <dbl> 9046, 51676, 31112, 60182, 1665, 56669, 53882, 10434…
$ gender            <chr> "Male", "Female", "Male", "Female", "Female", "Male"…
$ age               <dbl> 67, 61, 80, 49, 79, 81, 74, 69, 59, 78, 81, 61, 54, …
$ hypertension      <dbl> 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1…
$ heart_disease     <dbl> 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0…
$ ever_married      <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No…
$ work_type         <chr> "Private", "Self-employed", "Private", "Private", "S…
$ Residence_type    <chr> "Urban", "Rural", "Rural", "Urban", "Rural", "Urban"…
$ avg_glucose_level <dbl> 228.69, 202.21, 105.92, 171.23, 174.12, 186.21, 70.0…
$ bmi               <chr> "36.6", "N/A", "32.5", "34.4", "24", "29", "27.4", "…
$ smoking_status    <chr> "formerly smoked", "never smoked", "never smoked", "…
$ stroke            <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…

Data Wrangling

Cleaned the dataset and prepared for analysis using the following:

  • Removed missing BMI values using ‘filter()’

  • Converted categorical variables into factors using ‘mutate()’

  • Created new variables (e.g. stroke_label, age_group) using ‘case_when()’

  • Aggregated data using ‘count()’ and ‘summarize()’

These steps allowed for the dataset to be structured properly for analysis and visualization.

# DATA WRANGLING / CLEANING

# Remove missing BMI
df <- df %>%
  mutate(bmi = as.numeric(bmi)) %>%
  filter(!is.na(bmi))
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `bmi = as.numeric(bmi)`.
Caused by warning:
! NAs introduced by coercion
# Convert categorical variables
df <- df %>%
  mutate(
    stroke = as.factor(stroke),
    hypertension = as.factor(hypertension),
    heart_disease = as.factor(heart_disease),
    gender = as.factor(gender),
    ever_married = as.factor(ever_married),
    work_type = as.factor(work_type),
    Residence_type = as.factor(Residence_type),
    smoking_status = as.factor(smoking_status)
  )

############################################################
# LABEL VARIABLES 
############################################################

df <- df %>%
  mutate(
    stroke_num = as.numeric(as.character(stroke)),
    
    stroke_label = case_when(
      stroke == "0" ~ "No Stroke",
      stroke == "1" ~ "Stroke"
    ),
    
    hypertension_label = case_when(
      hypertension == "0" ~ "No Hypertension",
      hypertension == "1" ~ "Hypertension"
    ),
    
    heart_label = case_when(
      heart_disease == "0" ~ "No Heart Disease",
      heart_disease == "1" ~ "Heart Disease"
    ),
    
    age_group = case_when(
      age < 30 ~ "Under 30",
      age >= 30 & age < 45 ~ "30-44",
      age >= 45 & age < 60 ~ "45-59",
      age >= 60 ~ "60+"
    ),
    
    glucose_group = case_when(
      avg_glucose_level < 100 ~ "Normal",
      avg_glucose_level >= 100 & avg_glucose_level < 126 ~ "Moderate",
      avg_glucose_level >= 126 ~ "High"
    ),
    
    bmi_group = case_when(
      bmi < 18.5 ~ "Underweight",
      bmi >= 18.5 & bmi < 25 ~ "Normal",
      bmi >= 25 & bmi < 30 ~ "Overweight",
      bmi >= 30 ~ "Obese"
    )
  )

Summary Statistics

Dataset is highly imbalanced:

  • 95.7% of individuals did not experience a stroke

  • Only 4.26% experienced a stroke

This indicates that stroke is a relatively rare event in the dataset.

Interactive Stroke Dataset Table

This interactive table allows users to filter and explore key variables in the stroke dataset, helping identify patterns across demographic and health factors.

  • This table showcases:

    • Gender

    • Age

    • Hypertension

    • Heart disease

    • Glucose levels

  • You can filter and explore patterns manually. Which helps identify:

    • Older individuals tend to appear more frequently with risk factors.

    • Higher glucose values appear in some older individuals

  • Not for conclusions directly, but for data exploration.

############################################################
# SUMMARY TABLE - Interactive 
############################################################
library(DT)

datatable(
  df[, c("gender", "age", "hypertension", "heart_disease",
         "avg_glucose_level", "bmi", "smoking_status", "stroke")],
  filter = "top",
  options = list(pageLength = 15, scrollX = TRUE),
  caption = "Interactive Stroke Dataset Table"
)

Summary Statistics by Stroke Status

This interactive table compares key health indicators between individuals who did and did not experience a stroke. It summarizes average age, average glucose level, average BMI, and total count by stroke group, helping identify differences between the two groups.

  • Stroke = 0 (No stroke)

    • Average age = 41.76 (42 years old)

    • Average glucose = 104

    • Average BMI = 28.82

  • Stroke = 1 (Stroke)

    • Average age = 67.71 (68 years old)

    • Average glucose = 134.57

    • Average BMI = 30.47

  • Overall the stroke group is much older and has higher glucose levels and slightly higher BMI. There’s strong evidence that age and glucose are key factors, which directly supports our research question.

############################################################
# SUMMARY TABLE - Interactive 
############################################################
stroke_summary <- df |>
  group_by(stroke) |>
  summarise(
    count = n(),
    avg_age = mean(age, na.rm = TRUE),
    avg_glucose = mean(avg_glucose_level, na.rm = TRUE),
    avg_bmi = mean(bmi, na.rm = TRUE)
  )

datatable(
  stroke_summary,
  options = list(pageLength = 5),
  caption = "Summary Statistics by Stroke Status"
) |>
  formatRound(c("avg_age", "avg_glucose", "avg_bmi"), digits = 2)

Mean Differences

Individuals who experienced stroke:

  • Are significantly older on average (67.7yrs vs 41.8yrs)

  • Have higher average glucose levels (135 vs 104)

  • Have slightly higher BMI (30.5 vs 28.8).

This suggest that age and glucose levels are strong indicators of stroke risk.

Analysis

Overall, categorical results show that medical conditions such as hypertension and heart disease have the strongest association with stroke, while other variables like gender, work type and residence type show minimal impact. Supporting earlier findings, medical conditions are more important predictors than demographics factors.

library(knitr)
#Hypertension Table
hypertension_table <- df %>%
  group_by(hypertension_label, stroke_label) %>%
  summarise(Count = n(), .groups = "drop") %>%
  group_by(hypertension_label) %>%
  mutate(Proportion = round(Count / sum(Count), 3))


kable(
  hypertension_table,
  caption = "Stroke Counts and Proportions by Hypertension Status"
)
Stroke Counts and Proportions by Hypertension Status
hypertension_label stroke_label Count Proportion
Hypertension No Stroke 391 0.867
Hypertension Stroke 60 0.133
No Hypertension No Stroke 4309 0.967
No Hypertension Stroke 149 0.033
  • Individuals with hypertension have a much higher stroke rate (13.3%) compared to those without hypertension (3.3%), indicating a strong relationship between hypertension and stroke.
#Heart Disease Table

heart_table <- df %>%
  group_by(heart_label, stroke_label) %>%
  summarise(Count = n(), .groups = "drop") %>%
  group_by(heart_label) %>%
  mutate(Proportion = round(Count / sum(Count), 3))


kable(
  heart_table,
  caption = "Stroke Counts and Proportions by Heart Disease Status"
)
Stroke Counts and Proportions by Heart Disease Status
heart_label stroke_label Count Proportion
Heart Disease No Stroke 203 0.835
Heart Disease Stroke 40 0.165
No Heart Disease No Stroke 4497 0.964
No Heart Disease Stroke 169 0.036
  • Stroke is more common among individuals with heart disease (16.5%) compared to those without heart disease (3.6%). Heart disease is one of the strongest risk factors in the dataset.
#GENDER Table
gender_table <- df %>%
  group_by(gender, stroke_label) %>%
  summarise(Count = n(), .groups = "drop") %>%
  group_by(gender) %>%
  mutate(Proportion = round(Count / sum(Count), 3))

kable(
  gender_table,
  caption = "Stroke Counts and Proportions by Gender"
)
Stroke Counts and Proportions by Gender
gender stroke_label Count Proportion
Female No Stroke 2777 0.959
Female Stroke 120 0.041
Male No Stroke 1922 0.956
Male Stroke 89 0.044
Other No Stroke 1 1.000
  • Stroke rates between males and females are very similar. This suggest that gender is not a strong predictor of stroke in the dataset.
#Smoking Status Table
smoking_table <- df %>%
  group_by(smoking_status, stroke_label) %>%
  summarise(Count = n(), .groups = "drop") %>%
  group_by(smoking_status) %>%
  mutate(Proportion = round(Count / sum(Count), 3))

kable(
  smoking_table,
  caption = "Stroke Counts and Proportions by Smoking Status"
)
Stroke Counts and Proportions by Smoking Status
smoking_status stroke_label Count Proportion
formerly smoked No Stroke 780 0.932
formerly smoked Stroke 57 0.068
never smoked No Stroke 1768 0.955
never smoked Stroke 84 0.045
smokes No Stroke 698 0.947
smokes Stroke 39 0.053
Unknown No Stroke 1454 0.980
Unknown Stroke 29 0.020
  • Former smokers has the highest stroke rate (6.8%), followed by current smokers (5.3%) and never smokers (4.5%). Suggesting that smoking history may be associated with increased stroke risk.
#Work Type Table
work_table <- df %>%
  group_by(work_type, stroke_label) %>%
  summarise(Count = n(), .groups = "drop") %>%
  group_by(work_type) %>%
  mutate(Proportion = round(Count / sum(Count), 3))

kable(
  work_table,
  caption = "Stroke Counts and Proportions by Work Type"
)
Stroke Counts and Proportions by Work Type
work_type stroke_label Count Proportion
children No Stroke 670 0.999
children Stroke 1 0.001
Govt_job No Stroke 602 0.956
Govt_job Stroke 28 0.044
Never_worked No Stroke 22 1.000
Private No Stroke 2684 0.955
Private Stroke 127 0.045
Self-employed No Stroke 722 0.932
Self-employed Stroke 53 0.068
  • Self employed individuals have the highest stroke rate (6.8%), while children and those who never worked have almost no cases. This reflects age differences, as younger individuals have lower stroke risk.
#Residence Type Table
residence_table <- df %>%
  group_by(Residence_type, stroke_label) %>%
  summarise(Count = n(), .groups = "drop") %>%
  group_by(Residence_type) %>%
  mutate(Proportion = round(Count / sum(Count), 3))

kable(
  residence_table,
  caption = "Stroke Counts and Proportions by Residence Type"
)
Stroke Counts and Proportions by Residence Type
Residence_type stroke_label Count Proportion
Rural No Stroke 2319 0.959
Rural Stroke 100 0.041
Urban No Stroke 2381 0.956
Urban Stroke 109 0.044
  • Stroke rates are very similar between rural (4.1%) and urban (4.4%) areas. This indicates that residence type does not have a strong relationship with stroke.

Visualizations

Visualizations provides insight into relationships between variables and stroke.

###Figure 1: Stroke Count Bar Chart 

stroke_count <- df %>%
  count(stroke_label) %>%
  mutate(percent = round(n / sum(n) * 100, 1))

ggplot(stroke_count, aes(x = stroke_label, y = n, fill = stroke_label)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = paste(n, "(", percent, "%)")), vjust = -0.5) +
  labs(
    title = "Stroke Cases in the Dataset",
    subtitle = "Most individuals in the dataset did not experience a stroke",
    x = "Stroke Status",
    y = "Count",
    fill = "Stroke Status"
  ) +
  theme_minimal()

  • This chart shows an highly imbalanced dataset, with most individuals not experiencing a stroke.
###Figure 2: Age Distribution Histogram 
ggplot(df, aes(x = age, fill = stroke_label)) +
  geom_histogram(binwidth = 5) +
  labs(
    title = "Age Distribution by Stroke Status",
    subtitle = "Stroke cases appear more common among older individuals",
    x = "Age",
    y = "Count",
    fill = "Stroke Status"
  ) +
  scale_x_continuous(breaks = seq(0, 90, by = 10)) +
  theme_minimal()

  • Age histogram shows that stroke cases are more concentrated among older individuals.
#Figure 3:Glucose Boxplot by Stroke Status
ggplot(df, aes(x = stroke_label, y = avg_glucose_level, fill = stroke_label)) +
  geom_boxplot() +
  labs(
    title = "Average Glucose Level by Stroke Status",
    subtitle = "Stroke cases tend to have higher glucose levels",
    x = "Stroke Status",
    y = "Average Glucose Level",
    fill = "Stroke Status"
  ) +
  theme_minimal()

Individuals with stroke tend to have higher glucose levels, indicating a strong relationship with stroke risk.

#Figure 4:BMI Boxplot by Stroke Status
ggplot(df, aes(x = stroke_label, y = bmi, fill = stroke_label)) +
  geom_boxplot() +
  labs(
    title = "BMI by Stroke Status",
    subtitle = "BMI shows a smaller difference compared with age and glucose",
    x = "Stroke Status",
    y = "BMI",
    fill = "Stroke Status"
  ) +
  theme_minimal()

  • Boxplot for BMI only show small differences between groups, suggesting its not a strong independent predictor.
#Figure 5: Age vs Glucose Scatterplot with Trendline 
ggplot(df, aes(x = age, y = avg_glucose_level, color = stroke_label)) +
  geom_point() +
  geom_smooth(se = FALSE) +
  labs(
    title = "Age vs Average Glucose Level",
    subtitle = "Scatterplot with trend line comparing age and glucose",
    x = "Age",
    y = "Average Glucose Level",
    color = "Stroke Status"
  ) +
  scale_x_continuous(breaks = seq(0, 90, by = 10)) +
  theme_minimal()
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

  • Scatterplot shows stroke cases are more concentrated among older individuals with higher glucose levels.
#Figure 6:  Stroke Rate by Age Group
age_rate <- df %>%
  group_by(age_group) %>%
  summarize(stroke_rate = mean(stroke_num)) %>%
  mutate(percent = round(stroke_rate * 100, 1))

ggplot(age_rate, aes(x = age_group, y = stroke_rate, fill = age_group)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = paste(percent, "%")), vjust = -0.5) +
  labs(
    title = "Stroke Rate by Age Group",
    subtitle = "Stroke rate increases in older age groups",
    x = "Age Group",
    y = "Stroke Rate",
    fill = "Age Group"
  ) +
  theme_minimal()

  • Stroke rates increases significantly with age, especially in older groups.
#Figure 7: Stroke Rate by Hypertension Status
hyper_rate <- df %>%
  group_by(hypertension_label) %>%
  summarize(stroke_rate = mean(stroke_num)) %>%
  mutate(percent = round(stroke_rate * 100, 1))

ggplot(hyper_rate, aes(x = hypertension_label, y = stroke_rate, fill = hypertension_label)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = paste(percent, "%")), vjust = -0.5) +
  labs(
    title = "Stroke Rate by Hypertension Status",
    subtitle = "Hypertension is associated with a higher stroke rate",
    x = "Hypertension Status",
    y = "Stroke Rate",
    fill = "Hypertension Status"
  ) +
  theme_minimal()

  • Individuals with hypertension have a much higher stroke rate compared to those without.
#Figure 8: Stroke Rate by Heart Disease Status
heart_rate <- df %>%
  group_by(heart_label) %>%
  summarize(stroke_rate = mean(stroke_num)) %>%
  mutate(percent = round(stroke_rate * 100, 1))

ggplot(heart_rate, aes(x = heart_label, y = stroke_rate, fill = heart_label)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = paste(percent, "%")), vjust = -0.5) +
  labs(
    title = "Stroke Rate by Heart Disease Status",
    subtitle = "Heart disease is associated with a higher stroke rate",
    x = "Heart Disease Status",
    y = "Stroke Rate",
    fill = "Heart Disease Status"
  ) +
  theme_minimal()

  • Heart disease is strongly associated with stroke, demonstrating one of the highest differences in stroke rates.
#Figure 9: Stroke Rate by Smoking Status
smoking_rate <- df %>%
  group_by(smoking_status) %>%
  summarize(stroke_rate = mean(stroke_num)) %>%
  mutate(percent = round(stroke_rate * 100, 1))

ggplot(smoking_rate, aes(x = reorder(smoking_status, stroke_rate), y = stroke_rate, fill = smoking_status)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = paste(percent, "%")), hjust = -0.1) +
  coord_flip() +
  labs(
    title = "Stroke Rate by Smoking Status",
    subtitle = "Former smokers show the highest stroke rate",
    x = "Smoking Status",
    y = "Stroke Rate",
    fill = "Smoking Status"
  ) +
  theme_minimal()

  • Former Smokers have the highest stroke rates, suggesting smoking history may increase risk.
#Figure 10: Stroke Rate by Glucose Group
glucose_rate <- df %>%
  group_by(glucose_group) %>%
  summarize(stroke_rate = mean(stroke_num)) %>%
  mutate(percent = round(stroke_rate * 100, 1))

ggplot(glucose_rate, aes(x = glucose_group, y = stroke_rate, fill = glucose_group)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = paste(percent, "%")), vjust = -0.5) +
  labs(
    title = "Stroke Rate by Glucose Group",
    subtitle = "Higher glucose groups show higher stroke rates",
    x = "Glucose Group",
    y = "Stroke Rate",
    fill = "Glucose Group"
  ) +
  theme_minimal()

  • Higher glucose groups show higher stroke rates; this shows its importance as a risk factor.
#Figure 11: Stroke Rate by BMI Group
bmi_rate <- df %>%
  group_by(bmi_group) %>%
  summarize(stroke_rate = mean(stroke_num)) %>%
  mutate(percent = round(stroke_rate * 100, 1))

ggplot(bmi_rate, aes(x = bmi_group, y = stroke_rate, fill = bmi_group)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = paste(percent, "%")), vjust = -0.5) +
  labs(
    title = "Stroke Rate by BMI Group",
    subtitle = "BMI groups show smaller differences than age and glucose",
    x = "BMI Group",
    y = "Stroke Rate",
    fill = "BMI Group"
  ) +
  theme_minimal()

  • Small differences are shown between BMI groups, indicating a weaker influence compared to age and glucose.
#Figure 12: Stroke Rate by Work Type
work_rate <- df %>%
  group_by(work_type) %>%
  summarize(stroke_rate = mean(stroke_num)) %>%
  mutate(percent = round(stroke_rate * 100, 1))

ggplot(work_rate, aes(x = reorder(work_type, stroke_rate), y = stroke_rate, fill = work_type)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = paste(percent, "%")), hjust = -0.1) +
  coord_flip() +
  labs(
    title = "Stroke Rate by Work Type",
    subtitle = "Self-employed individuals show the highest stroke rate",
    x = "Work Type",
    y = "Stroke Rate",
    fill = "Work Type"
  ) +
  theme_minimal()

  • Self employed individuals show higher stroke rates, reflecting age differences.
#Figure 13:Residence Type and Stroke Proportion
ggplot(df, aes(x = Residence_type, fill = stroke_label)) +
  geom_bar(position = "fill") +
  labs(
    title = "Stroke Proportion by Residence Type",
    subtitle = "Urban and rural groups show similar stroke proportions",
    x = "Residence Type",
    y = "Proportion",
    fill = "Stroke Status"
  ) +
  theme_minimal()

  • There is little difference between urban and rural populations, suggesting location is not a major factor.
#Figure 14: Mean age difference by Stroke 
age_means <- df %>%
  group_by(stroke_label) %>%
  summarize(mean_age = mean(age))

ggplot(age_means, aes(x = stroke_label, y = mean_age, color = stroke_label)) +
  geom_point(size = 4) +
  geom_segment(
    aes(
      x = 1,
      xend = 2,
      y = age_means$mean_age[age_means$stroke_label == "No Stroke"],
      yend = age_means$mean_age[age_means$stroke_label == "Stroke"]
    )
  ) +
  geom_text(aes(label = round(mean_age, 1)), vjust = -1) +
  labs(
    title = "Mean Age Difference by Stroke Status",
    subtitle = "The average age is higher in the stroke group",
    x = "Stroke Status",
    y = "Mean Age",
    color = "Stroke Status"
  ) +
  theme_minimal()
Warning: Use of `age_means$mean_age` is discouraged.
ℹ Use `mean_age` instead.
Warning: Use of `age_means$stroke_label` is discouraged.
ℹ Use `stroke_label` instead.
Warning: Use of `age_means$mean_age` is discouraged.
ℹ Use `mean_age` instead.
Warning: Use of `age_means$stroke_label` is discouraged.
ℹ Use `stroke_label` instead.

  • Average age is significantly higher for individuals who experienced a stroke.
#Figure 15: Combined Comparison Plot 

p1 <- ggplot(df, aes(x = stroke_label, y = age, fill = stroke_label)) +
  geom_boxplot() +
  labs(title = "Age", x = "", y = "Age") +
  theme_minimal()

p2 <- ggplot(df, aes(x = stroke_label, y = avg_glucose_level, fill = stroke_label)) +
  geom_boxplot() +
  labs(title = "Glucose", x = "", y = "Average Glucose") +
  theme_minimal()

p3 <- ggplot(df, aes(x = stroke_label, y = bmi, fill = stroke_label)) +
  geom_boxplot() +
  labs(title = "BMI", x = "", y = "BMI") +
  theme_minimal()

(p1 + p2 + p3)

  • This combined plot compares age, glucose, and BMI; demonstrating clear differences between stroke groups compared to BMI.

Overall, the plots consistently show that age, glucose level, hypertension, heart disease are the most important variables associated with stroke.

Modeling

Multiple Linear Regression Model

Regression model was used to estimate how age, glucose, BMI, hypertension and heart disease are associated with the likelihood of stroke.

# Convert stroke to numeric for regression
df <- df %>%
  mutate(stroke_num = as.numeric(as.character(stroke)))

# Linear model
model <- lm(stroke_num ~ age + avg_glucose_level + bmi + hypertension + heart_disease, data = df)

summary(model)

Call:
lm(formula = stroke_num ~ age + avg_glucose_level + bmi + hypertension + 
    heart_disease, data = df)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.26448 -0.06629 -0.02773  0.00551  1.03461 

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)       -3.869e-02  1.187e-02  -3.261 0.001119 ** 
age                1.720e-03  1.396e-04  12.320  < 2e-16 ***
avg_glucose_level  3.548e-04  6.542e-05   5.423 6.13e-08 ***
bmi               -1.312e-03  3.786e-04  -3.465 0.000534 ***
hypertension1      5.325e-02  1.010e-02   5.270 1.42e-07 ***
heart_disease1     6.493e-02  1.334e-02   4.867 1.17e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1944 on 4903 degrees of freedom
Multiple R-squared:  0.07379,   Adjusted R-squared:  0.07285 
F-statistic: 78.12 on 5 and 4903 DF,  p-value: < 2.2e-16

Key Findings:

  • Age (Strongest Predictor)

    • As age increases, the likelihood of stroke increases.
  • Glucose Level (significant positive effect):

    • Higher glucose levels are significantly associated with higher stroke risk.
  • BMI (Weak/slight negative effect):

    • BMI has a small negative effect and is not a strong independent predictor
  • Hypertension:

    • Individuals with hypertension have a higher likelihood of stroke compared individuals without.
  • Heart Disease:

    • Significant predictor of stroke, with individuals showing a higher probability of stroke.

Model Strength:

  • R² = 0.0738 (7%)

Model explains only a small portion of the variation in stroke, indicating that stroke is influenced by many additional factors such as lifestyle and health conditions. These results reinforces earlier findings that medical conditions along with age and glucose level are strong predictors of stroke risk.

Clustering

Clustering was used to group individuals based on age, glucose level and BMI without using the stroke variable. Data was standardized so that variables with larger scales would not dominate results. Pam clustering grouped individuals into two clusters and the summary table compares the average characteristics of each group.

############################################################
# CLUSTERING (REMOVE TARGET VARIABLE)
############################################################

cluster_df <- df %>%
  select(age, avg_glucose_level, bmi)


cluster_scaled <- scale(cluster_df)

set.seed(123)
cluster_model <- pam(cluster_scaled, k = 2)

#medoids
cluster_model$medoids
           age avg_glucose_level        bmi
[1,]  0.493663        -0.1128919  0.1663804
[2,] -1.146763        -0.3539760 -0.7758066
#cluster assignments
cluster_model$clustering
   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
  [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
  [75] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1
 [112] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 [149] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 [186] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 2 1 2 1 2 1 1 1 1 1 1 1 1
 [223] 2 1 1 2 2 1 1 1 1 1 1 1 2 1 1 1 1 2 1 1 2 1 2 2 2 1 2 1 2 2 2 1 2 1 1 1 1
 [260] 1 1 1 1 1 2 1 2 1 1 1 1 1 2 1 1 1 1 1 1 2 2 2 2 1 2 1 1 2 2 1 1 1 1 1 1 2
 [297] 1 1 1 1 2 1 1 1 2 1 1 2 1 1 1 2 1 2 1 1 2 1 2 2 1 2 1 1 1 1 1 2 1 2 1 2 1
 [334] 1 2 2 2 1 2 1 1 2 2 1 1 1 1 1 1 2 1 1 2 1 1 1 1 1 2 2 1 1 1 1 1 1 2 1 2 1
 [371] 1 1 2 1 2 1 2 2 1 2 1 1 1 2 1 1 1 1 1 2 1 1 1 1 2 1 1 2 2 1 2 1 2 2 1 1 1
 [408] 1 2 1 1 1 2 1 1 2 2 2 2 1 1 2 1 1 1 1 1 2 2 1 1 2 1 2 1 2 1 1 2 1 1 2 2 1
 [445] 1 1 1 1 1 1 1 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 2 1 1 1 1 2 1
 [482] 1 1 2 1 2 2 1 2 2 1 1 2 1 2 2 1 1 1 2 2 2 1 1 2 1 1 1 2 1 2 1 2 1 2 1 2 1
 [519] 2 1 1 1 1 1 1 1 1 1 2 1 2 1 1 1 1 2 2 1 2 1 1 1 2 1 1 1 1 2 1 1 1 2 2 1 2
 [556] 1 2 2 2 2 1 2 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 2 1 1 1
 [593] 1 1 2 2 1 1 2 1 1 1 1 1 1 1 1 1 1 2 2 2 1 1 1 1 1 2 1 1 1 2 1 1 1 1 2 1 2
 [630] 1 1 2 2 2 2 2 2 1 2 1 1 1 1 2 1 1 2 2 1 2 1 1 1 1 1 1 1 1 1 1 2 1 2 1 1 1
 [667] 1 1 2 2 1 1 2 2 1 1 1 1 1 2 2 1 1 2 1 1 2 1 1 1 2 1 2 1 1 1 1 1 1 2 1 1 2
 [704] 1 1 2 2 1 1 1 1 2 1 2 1 1 1 1 1 2 2 1 1 1 2 1 1 1 2 2 1 1 2 1 1 1 1 1 1 2
 [741] 2 1 2 1 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 2 2 1 1 1 1 1 1 1 1 2
 [778] 1 1 2 1 1 2 1 1 1 2 1 2 2 1 2 2 2 1 1 1 1 1 2 1 2 1 1 2 2 1 1 1 1 1 2 1 1
 [815] 2 1 1 1 1 1 1 1 1 2 2 1 1 1 2 1 1 2 1 1 1 1 1 2 2 1 1 1 1 1 1 1 1 1 2 1 1
 [852] 1 1 1 1 1 2 2 1 2 1 1 1 2 1 2 2 1 1 2 1 1 1 1 1 1 2 1 1 1 2 2 2 1 1 1 1 1
 [889] 1 1 2 1 1 1 2 2 2 1 1 2 2 1 2 1 2 2 2 1 2 2 1 2 1 1 2 1 2 1 1 2 1 2 2 1 2
 [926] 1 2 2 2 2 2 1 1 1 1 1 1 2 2 1 2 2 1 1 1 1 1 2 2 1 2 2 1 1 2 1 1 1 1 1 1 1
 [963] 2 1 2 1 1 1 1 1 1 1 1 1 2 1 1 1 1 2 1 2 2 1 1 1 1 2 1 2 1 2 1 2 1 1 1 1 2
[1000] 1 1 1 2 1 1 2 1 1 2 1 1 2 1 1 1 1 2 1 2 1 1 1 1 1 1 1 1 1 1 2 2 1 2 2 1 2
[1037] 2 2 1 1 1 2 1 2 1 1 1 2 2 1 1 2 1 1 1 1 1 1 1 2 1 2 2 2 1 1 2 1 1 1 1 2 1
[1074] 2 2 1 2 1 2 1 1 2 2 1 2 1 1 1 1 2 1 1 1 2 2 1 1 1 2 1 1 1 1 2 1 1 1 1 2 1
[1111] 1 1 1 2 1 1 2 1 1 1 2 2 1 2 1 1 1 1 1 1 2 1 1 1 2 1 1 1 1 1 2 2 2 2 1 1 1
[1148] 1 1 2 1 1 1 1 2 2 2 2 2 2 1 1 1 1 1 1 1 2 1 1 1 1 2 2 1 1 2 2 2 1 2 1 2 2
[1185] 1 2 2 2 1 2 1 1 1 1 1 1 1 1 1 2 1 1 1 1 2 1 2 1 1 2 1 2 1 1 2 1 1 1 1 1 2
[1222] 1 1 1 1 2 1 1 1 1 2 1 2 1 1 1 1 2 1 1 2 1 2 1 1 1 2 2 1 1 1 2 1 1 1 1 1 1
[1259] 2 1 2 1 1 2 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 1 1 2 1 2 1
[1296] 1 1 2 2 1 1 1 2 1 2 1 1 2 1 1 1 1 1 2 1 1 2 2 1 2 1 2 1 2 1 1 1 1 1 1 1 1
[1333] 2 2 1 2 1 1 1 1 1 1 1 1 2 1 1 1 2 2 1 2 1 2 1 2 2 1 2 1 1 1 1 1 1 1 1 2 2
[1370] 1 2 2 1 1 1 2 2 1 1 1 1 2 1 1 1 2 2 2 2 2 2 2 2 1 1 1 1 2 2 1 1 1 1 1 1 2
[1407] 2 1 2 1 2 1 1 1 2 2 2 1 1 1 1 2 1 1 2 1 1 1 2 1 1 2 1 2 2 2 1 1 1 2 2 1 1
[1444] 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 2 1 1 2 1 2 1 1 1 1 1 1 2 2 1 2 1 2 1
[1481] 1 1 1 2 1 2 1 1 2 1 2 2 1 1 1 1 1 2 2 1 1 1 1 1 1 2 1 2 2 2 1 2 1 1 2 1 1
[1518] 2 2 1 1 1 1 1 1 1 2 2 1 2 1 2 1 1 1 1 2 1 1 2 2 1 1 2 1 1 2 1 2 2 2 1 2 2
[1555] 1 1 1 1 1 1 1 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 2 1 1 2 1 1 1 1 1 1
[1592] 1 1 2 2 1 2 1 1 2 1 1 1 2 1 1 1 1 1 1 2 2 2 2 1 1 1 1 2 1 1 1 2 2 2 1 2 1
[1629] 2 1 1 1 2 2 2 1 2 1 2 1 1 1 1 2 1 1 2 1 1 1 2 2 1 2 1 1 1 1 1 1 1 1 2 2 1
[1666] 1 1 2 2 1 2 2 2 2 1 1 1 1 2 2 1 1 2 1 1 1 1 2 1 1 1 1 1 2 1 2 2 1 2 1 1 2
[1703] 1 2 2 1 2 2 1 1 2 2 2 2 1 2 2 1 1 1 1 1 1 2 2 1 1 2 1 2 1 1 1 1 2 1 2 2 1
[1740] 1 1 1 2 1 1 2 2 2 1 2 2 1 2 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 2 1 1 1 1 1 1
[1777] 1 2 1 1 1 1 2 1 1 2 2 1 1 1 1 1 1 1 1 1 2 1 2 2 2 2 1 1 1 2 1 1 2 1 1 1 1
[1814] 1 2 1 2 1 1 2 1 1 1 1 2 2 2 1 2 2 1 1 1 1 2 1 2 2 1 2 1 1 1 1 1 2 1 2 1 1
[1851] 1 1 1 2 2 2 2 2 1 1 2 2 2 1 1 1 1 1 1 1 2 1 2 2 2 1 1 2 2 1 1 2 2 2 1 1 2
[1888] 2 1 2 1 2 2 1 2 1 1 1 1 1 2 2 1 2 2 1 1 2 2 1 2 1 1 1 2 2 1 1 2 2 1 1 1 1
[1925] 1 1 1 1 1 1 2 1 1 1 1 1 2 2 1 2 1 1 1 1 2 2 2 1 2 1 2 1 1 1 1 1 2 1 1 2 1
[1962] 1 1 1 1 1 1 2 2 1 2 1 1 2 2 1 1 1 1 1 1 1 1 1 2 2 2 1 1 1 1 2 2 1 1 2 1 1
[1999] 2 1 1 1 1 2 2 2 1 2 1 1 1 1 1 1 2 1 2 1 2 1 1 1 2 1 2 1 1 1 2 1 1 1 2 2 1
[2036] 1 1 2 2 2 1 2 2 2 1 2 1 1 2 1 1 1 1 1 2 1 2 2 1 1 2 1 2 1 1 1 2 1 2 1 1 1
[2073] 2 1 2 1 1 2 2 1 1 1 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 2 1 2 2 1 2
[2110] 1 1 2 1 1 1 1 1 1 1 2 1 1 1 2 1 2 1 1 1 2 2 2 1 1 2 1 2 2 2 1 1 1 1 2 1 1
[2147] 1 2 2 1 1 1 1 1 1 1 2 1 1 1 1 1 2 1 2 1 1 2 1 1 1 1 1 1 1 1 2 1 2 1 1 1 1
[2184] 1 1 1 1 2 1 1 1 2 2 1 1 2 1 2 2 1 2 1 1 1 1 1 1 1 2 1 1 2 1 1 2 1 1 1 2 1
[2221] 1 1 2 1 2 1 1 2 2 2 2 2 1 1 1 1 2 2 1 1 1 2 1 1 2 1 1 2 1 1 2 2 1 2 2 2 1
[2258] 1 1 1 2 1 2 2 2 2 1 2 2 2 2 2 2 1 2 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1
[2295] 1 1 1 1 2 1 1 1 2 1 1 1 2 1 2 2 2 1 2 1 2 2 1 1 1 2 2 1 1 1 1 1 2 1 1 1 1
[2332] 2 1 2 1 2 1 1 2 1 2 1 1 1 1 1 1 1 2 1 1 1 1 1 2 2 1 2 2 1 2 2 2 2 2 2 1 2
[2369] 1 1 2 2 2 1 1 1 1 1 1 2 1 1 1 1 1 1 2 1 1 1 1 1 1 2 2 2 1 1 1 1 1 1 1 1 2
[2406] 2 1 2 2 2 1 1 2 1 1 2 2 1 2 2 1 1 1 1 1 1 1 2 1 1 2 1 1 2 2 1 2 1 1 1 1 1
[2443] 2 1 1 1 1 1 1 2 2 1 1 1 1 2 1 2 1 1 1 2 1 2 1 2 1 1 2 1 2 1 1 1 1 1 1 1 1
[2480] 1 2 1 2 1 1 1 2 1 2 1 2 1 1 2 2 2 1 1 1 1 1 1 1 2 1 2 1 1 1 2 1 1 1 1 1 1
[2517] 1 1 1 2 1 1 2 1 2 1 1 2 2 1 1 1 1 1 2 2 1 2 2 1 2 1 2 1 1 1 2 2 1 1 1 1 1
[2554] 1 2 1 2 1 1 2 1 1 2 1 2 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 2 1
[2591] 2 2 1 1 1 2 1 1 1 2 1 1 2 2 2 1 1 2 1 1 1 1 1 2 2 1 1 2 2 1 2 1 1 1 1 2 1
[2628] 2 1 1 1 1 2 1 2 1 1 1 1 2 1 1 1 1 2 1 1 2 1 2 2 2 2 1 2 1 1 1 1 1 2 2 1 1
[2665] 2 1 1 1 1 1 1 2 1 2 1 2 1 1 2 1 1 2 2 1 1 1 1 1 2 2 1 1 1 1 1 1 2 1 1 1 1
[2702] 1 2 1 1 2 1 1 1 1 2 2 2 2 1 1 1 1 1 1 2 1 2 2 1 2 1 1 2 1 1 1 1 1 1 1 1 2
[2739] 1 1 2 2 1 1 1 1 1 1 1 1 2 1 2 1 2 1 1 2 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 2 1
[2776] 2 2 1 2 1 2 1 2 2 2 1 2 2 2 2 2 2 2 1 2 2 1 1 1 2 1 1 1 1 2 2 2 1 1 2 1 1
[2813] 1 2 1 2 1 2 1 2 1 2 1 1 1 1 1 1 1 1 1 1 2 2 2 2 1 1 1 1 1 1 1 1 1 2 2 1 1
[2850] 2 1 1 1 1 1 1 1 1 1 1 2 2 1 1 2 2 1 1 1 2 1 1 1 1 1 1 1 1 2 1 2 1 2 2 1 2
[2887] 1 1 1 1 2 1 1 2 1 1 2 2 2 2 1 2 1 2 1 1 1 1 1 2 1 1 1 1 1 1 2 1 1 1 2 2 1
[2924] 1 2 1 1 1 1 2 1 1 2 1 1 2 1 1 1 1 1 1 2 2 1 1 1 1 2 1 2 2 2 2 1 2 1 1 1 1
[2961] 1 1 2 1 1 2 1 1 1 1 2 2 2 2 2 2 1 1 1 1 1 1 1 2 1 2 2 2 1 1 2 1 1 1 2 1 2
[2998] 1 2 1 1 1 1 1 1 1 1 1 1 2 1 2 1 2 2 1 1 1 1 1 2 2 1 2 1 2 1 2 2 2 1 1 2 1
[3035] 2 1 2 2 2 1 2 1 1 1 1 1 1 1 2 1 1 1 1 2 1 1 2 1 1 2 1 1 1 1 1 1 1 1 1 1 1
[3072] 2 1 1 2 1 1 1 2 1 1 1 1 1 2 2 1 1 1 1 1 1 1 1 1 2 2 2 1 1 2 2 1 1 1 1 2 1
[3109] 1 1 1 2 1 2 1 1 1 1 1 1 1 1 1 1 2 2 2 2 1 2 1 2 1 1 2 1 1 2 1 1 2 1 1 1 1
[3146] 2 2 1 1 2 1 1 1 1 1 2 2 1 2 1 2 1 1 1 2 1 2 1 2 2 2 1 1 1 1 1 2 2 1 2 1 1
[3183] 1 2 1 2 1 1 2 1 1 1 1 2 1 1 1 1 1 2 2 1 2 1 2 1 1 1 1 1 1 2 1 1 1 2 1 1 1
[3220] 2 1 1 1 1 1 1 1 1 2 2 1 1 1 2 1 2 1 2 2 1 2 2 1 1 2 2 2 1 1 1 1 1 1 1 2 2
[3257] 2 2 2 1 1 2 1 2 1 1 2 1 2 1 1 1 1 1 1 2 2 1 1 2 1 2 1 1 1 2 1 2 2 1 1 1 1
[3294] 1 2 1 1 1 2 2 1 2 1 1 1 2 1 1 2 2 1 1 1 2 2 1 1 1 2 1 1 1 1 2 2 2 1 2 1 2
[3331] 1 1 1 1 1 1 1 2 2 1 1 1 2 1 1 1 1 1 1 1 2 2 1 2 1 1 1 1 2 2 2 1 1 1 1 1 1
[3368] 2 1 1 2 1 2 2 1 1 1 1 2 2 2 1 1 2 1 2 2 1 1 1 2 1 2 1 1 2 1 1 1 1 2 1 1 1
[3405] 1 1 1 1 1 1 1 2 2 2 2 1 2 1 1 1 1 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2
[3442] 1 1 1 1 1 1 2 2 1 2 1 1 2 2 1 2 2 1 2 1 2 2 1 2 1 1 1 1 1 1 1 1 2 1 1 1 1
[3479] 1 1 1 2 2 1 2 1 1 1 2 1 2 2 2 1 2 1 1 1 2 2 1 1 1 1 1 1 1 1 2 1 1 1 1 2 1
[3516] 2 1 1 1 1 2 2 1 1 1 1 1 1 2 1 2 2 1 1 2 1 1 1 2 2 2 1 1 1 1 1 2 2 2 2 1 2
[3553] 1 2 1 2 1 1 1 1 1 2 1 1 2 1 1 1 1 1 1 2 2 1 1 1 2 2 2 2 1 2 1 1 1 1 2 2 1
[3590] 1 2 1 1 1 2 2 1 2 2 2 1 2 1 1 1 2 2 1 2 2 2 1 1 1 1 1 1 2 2 2 1 2 1 2 1 2
[3627] 1 1 2 2 2 1 2 2 1 1 1 2 2 2 1 1 1 1 1 1 1 1 1 2 1 1 1 2 1 1 1 1 1 2 1 1 2
[3664] 1 1 2 2 2 1 1 2 1 1 2 2 1 2 1 1 2 1 2 1 1 1 1 1 1 2 2 1 1 1 1 2 1 1 1 2 1
[3701] 2 1 2 2 1 2 2 1 1 1 1 1 2 1 2 1 2 2 1 1 1 1 1 2 2 2 2 1 1 1 1 1 1 2 1 1 1
[3738] 2 1 2 1 1 1 1 1 2 1 1 1 2 1 1 2 1 1 1 2 2 1 1 2 1 2 1 2 2 2 1 2 2 1 1 2 1
[3775] 1 1 2 2 1 1 1 1 1 2 1 1 1 1 1 2 1 1 1 2 1 2 1 2 2 1 1 1 1 2 1 1 1 1 1 1 1
[3812] 2 1 1 1 2 1 1 1 1 1 2 1 1 1 2 2 2 2 2 1 2 2 1 2 2 1 1 1 1 1 2 1 2 2 2 2 1
[3849] 1 1 2 1 1 1 1 2 1 1 1 1 1 1 1 1 2 2 2 2 2 1 1 1 1 1 2 1 1 2 1 1 2 2 2 2 1
[3886] 1 1 1 2 2 1 2 2 1 1 1 1 1 2 1 1 1 2 1 1 2 1 2 1 1 2 1 1 1 2 1 1 1 1 1 1 1
[3923] 1 2 1 1 2 1 1 1 1 2 1 1 1 2 1 2 2 1 2 1 1 1 1 1 1 2 1 1 1 2 1 2 2 1 2 1 1
[3960] 1 2 1 1 1 1 1 1 1 1 1 1 2 1 2 2 1 1 1 1 2 2 2 2 2 1 1 1 1 1 2 1 1 2 1 2 2
[3997] 2 2 1 1 1 1 1 2 2 1 2 1 2 1 1 1 2 2 1 2 2 2 2 1 1 1 1 1 2 1 1 2 1 1 1 2 1
[4034] 1 2 1 2 2 1 2 1 2 1 2 1 1 1 1 2 1 2 1 2 2 2 2 2 1 1 1 1 1 1 2 2 2 2 2 2 1
[4071] 1 2 2 1 2 2 2 1 1 1 2 2 1 1 1 1 1 2 2 1 1 2 2 1 2 2 2 1 1 2 1 1 1 1 2 1 1
[4108] 1 1 1 2 2 1 1 1 2 1 1 1 2 2 1 1 2 1 1 1 1 1 1 1 1 1 1 2 1 2 1 1 2 2 1 1 1
[4145] 1 2 1 1 1 2 2 1 1 1 1 1 1 1 2 2 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 2 1 1 2 1 2
[4182] 2 1 1 1 1 1 2 1 2 2 1 1 1 1 2 2 1 1 2 2 2 1 2 1 1 1 1 2 2 2 2 1 2 1 2 2 1
[4219] 1 1 2 1 1 1 1 1 2 2 1 2 2 2 1 2 1 2 1 2 2 2 2 1 1 1 2 1 1 1 2 2 1 2 1 1 1
[4256] 2 1 1 1 1 2 1 1 1 1 1 1 1 1 2 2 1 1 2 1 1 1 2 1 1 2 2 1 1 1 2 1 2 1 2 1 1
[4293] 1 1 1 1 1 1 1 2 1 2 2 1 2 2 1 2 1 1 2 1 1 1 1 1 2 2 2 1 1 1 2 2 2 2 1 1 1
[4330] 1 1 1 1 1 1 2 1 1 1 2 1 2 1 1 1 1 1 1 1 2 1 1 2 1 2 1 1 1 2 2 2 1 1 2 2 1
[4367] 1 2 1 1 2 1 1 1 1 1 1 1 1 1 2 2 2 2 1 2 1 2 1 1 1 1 1 1 2 1 1 2 2 1 2 1 1
[4404] 1 2 2 2 2 1 2 2 1 1 2 2 2 2 1 1 2 1 1 1 2 1 2 2 1 1 1 2 1 1 1 2 2 1 1 1 1
[4441] 1 1 2 1 1 1 1 1 1 1 1 1 2 1 1 1 1 2 1 1 1 2 1 1 1 1 1 2 2 2 1 2 1 1 2 1 2
[4478] 1 1 2 1 1 1 2 1 1 2 2 2 1 1 1 1 1 1 1 1 1 1 2 2 2 1 1 1 2 1 1 1 2 1 1 1 1
[4515] 1 1 2 1 1 1 2 2 1 1 2 1 1 2 1 1 2 1 1 2 1 1 2 1 1 2 2 2 1 1 1 2 1 1 1 2 1
[4552] 1 1 1 2 1 1 2 2 1 1 2 1 1 1 2 1 1 1 1 2 1 1 2 1 2 2 2 1 2 2 2 1 2 1 2 1 1
[4589] 1 1 2 2 2 1 2 2 2 2 1 1 2 2 2 1 1 1 2 2 1 1 2 1 1 1 2 2 1 1 1 1 1 2 2 2 1
[4626] 1 1 1 1 1 1 1 1 2 2 1 2 1 2 2 2 1 1 1 1 2 1 2 1 2 2 1 2 1 1 1 2 1 2 1 1 2
[4663] 1 1 2 1 1 1 2 1 1 2 2 2 1 1 1 1 2 2 2 1 1 1 1 1 2 2 1 1 1 1 1 2 2 2 1 1 1
[4700] 1 1 1 1 1 1 1 2 1 2 2 2 2 2 1 1 1 1 1 2 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 2 2
[4737] 2 2 1 1 1 2 2 1 1 1 2 1 1 2 2 2 1 2 1 1 2 1 1 1 1 1 2 1 1 1 2 1 1 2 2 1 1
[4774] 2 1 2 1 1 1 1 1 1 1 2 1 2 2 2 1 1 1 1 1 2 2 1 1 1 2 1 2 2 2 2 2 1 1 1 2 2
[4811] 1 1 1 1 1 1 1 1 2 1 1 1 1 1 2 2 1 1 1 1 1 2 2 1 2 2 2 2 1 1 1 1 1 2 1 2 1
[4848] 2 1 1 1 1 2 1 2 1 1 2 1 1 2 1 1 1 1 1 1 1 1 2 1 2 2 1 1 2 1 1 2 2 2 2 2 1
[4885] 1 2 1 1 1 1 1 2 2 1 1 2 2 1 1 2 1 1 1 1 2 1 1 1 1
df <- df %>%
  mutate(cluster = cluster_model$clustering)


############################################################
# CLUSTER PLOT 
############################################################

plot(cluster_model)

############################################################
# CLUSTER SUMMARY 
############################################################

cluster_summary <- df %>%
  group_by(cluster) %>%
  summarize(
    Mean_Age = round(mean(age), 2),
    Mean_Glucose = round(mean(avg_glucose_level), 2),
    Mean_BMI = round(mean(bmi), 2),
    Stroke_Rate = round(mean(stroke_num), 3),
    .groups = "drop"
  )

kable(
  cluster_summary,
  caption = "Cluster Summary Statistics"
)
Cluster Summary Statistics
cluster Mean_Age Mean_Glucose Mean_BMI Stroke_Rate
1 55.18 111.34 31.87 0.063
2 17.52 92.89 22.77 0.001

Results:

Two clusters were identified:

Cluster 1 (Higher Risk):

  • Older individuals

  • Higher glucose levels

  • More normal BMI

Cluster 2(Lower Risk):

  • Younger Individuals

  • Lower glucose levels

  • Slightly higher BMI

Clustering results aligns with regression findings, reinforcing that age and glucose are the most important factors associated with stroke risk.

Interactive Visualizations

Age Distribution by Stroke Status

This interactive boxplot compares the age distribution of individuals with and without experiencing a stroke. It helps show whether stroke cases tend to occur among older individuals.

  • Individuals with stroke (1) are significantly older.

  • Median age for stroke group = late 60s-70s

  • Median age for no stroke = early 40s

  • Stroke group is more concentrated in older ages.

  • Strong evidence that age is associated with stroke risk.

age_interactive <- ggplot(df, aes(x = factor(stroke), y = age, fill = factor(stroke))) +
  geom_boxplot() +
  labs(
    title = "Age Distribution by Stroke Status",
    x = "Stroke (0 = No, 1 = Yes)",
    y = "Age"
  ) +
  scale_fill_manual(values = c("red", "royalblue")) +
  theme_minimal()

ggplotly(age_interactive)


Stroke Proportion by Hypertension Status

This interactive stacked bar chart shows the proportion of stroke cases among individuals with and without hypertension. It helps examine whether stroke occurrence appears more common among those with hypertension.

  • People with hypertension (1) have a higher proportion of stroke.

  • The stroke portion (pink) is visibly larger in hypertension group.

  • People without hypertension have very low stroke proportion.

  • Indicates hypertension is associated with stroke.

hypertension_interactive <- ggplot(df, aes(x = factor(hypertension), fill = factor(stroke))) +
  geom_bar(position = "fill") +
  labs(
    title = "Stroke Proportion by Hypertension Status",
    x = "Hypertension (0 = No, 1 = Yes)",
    y = "Proportion",
    fill = "Stroke (0 = No, 1 = Yes)"
  ) +
  scale_fill_manual(values = c("orange", "pink")) +
  theme_minimal()

ggplotly(hypertension_interactive)

Age vs Average Glucose Level by Stroke Status

This interactive scatter-plot shows the relationship between age and average glucose level, colored by stroke status. It allows us to explore whether higher age or glucose levels are associated with stroke cases.

  • Stroke cases (pink) are mostly:

    • Older individuals

    • Often with higher glucose levels

  • Higher glucose values are more common among stroke group

  • There is a pattern clustering:

    • Stroke points appear more in upper age and higher glucose areas.
  • Suggests combined effect of age and glucose.

age_glucose <- ggplot(df, aes(
  x = age,
  y = avg_glucose_level,
  color = factor(stroke)
)) +
  geom_point() +
  labs(
    title = "Age vs Glucose Level by Stroke Status",
    x = "Age",
    y = "Average Glucose Level",
    color = "Stroke (0 = No, 1 = Yes)"
  ) +
  scale_color_manual(values = c("lightgreen", "magenta")) +
  theme_minimal()

ggplotly(age_glucose)

Conclusion

In conclusion, this analysis addressed the research question of which demographic and health-related factors are associated with stroke risk. The results suggest that stroke occurrence is more commonly observed among individuals who are older and those with higher average glucose levels. Additionally, a greater proportion of stroke cases was observed among individuals with hypertension and heart disease, indicating that these health conditions are also associated with increased stroke risk. While these findings do not imply causation, they highlight important patterns that can support early detection, prevention efforts, and data-driven decision-making in healthcare.

Contact Information

  • lbolden4@students.kennesaw.edu

  • cwalk158@students.kennesaw.edu

#install.packages("leaflet")
library(leaflet)

leaflet() %>%
  addTiles() %>%   # base map
  addMarkers(
   lng = -84.5197,
    lat = 33.9391,
    popup = "Kennesaw State University"
  )

Acknowledgment

The findings presented in this project are exclusive to this course and were not in this or previous semesters and will not be presented in any other courses during the semester.