#install.packages('qrcode')
library(qrcode)
qr <- qr_code("https://rpubs.com/bolden/1429157")
plot(qr)Understanding Stroke Risk Factors
QR Code
Introduction
Background: What is stroke? A stroke is a medical condition where part of the brain is damaged because blood supply has been disrupted.
Why? Stroke is a leading cause of death and long term disability. Understanding key risk factors can help identify high risk individuals early.
This project will use a healthcare data to analyze demographics and health related factors to identify patterns associated with stroke risk.
Data
Dataset: Stroke Prediction Data
The dataset included:
5,110 observations
12 variables related to patient health and demographics.
Key Variables:
Age
Average glucose level
BMI
Hypertension
Heart Disease
Smoking Status
Stroke (Target variable)
Research Question: What demographic and health- related factors are more associated with the likelihood of experiencing a stroke?
library(tidyverse)── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.6
✔ forcats 1.0.1 ✔ stringr 1.5.2
✔ ggplot2 4.0.2 ✔ tibble 3.3.0
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.1.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readr)
library(ggplot2)
library(plotly)
Attaching package: 'plotly'
The following object is masked from 'package:ggplot2':
last_plot
The following object is masked from 'package:stats':
filter
The following object is masked from 'package:graphics':
layout
library(cluster)
library(dplyr)
library(DT)
library(patchwork)
library(scales)
Attaching package: 'scales'
The following object is masked from 'package:purrr':
discard
The following object is masked from 'package:readr':
col_factor
df <- read_csv("healthcare-dataset-stroke-data.csv")Rows: 5110 Columns: 12
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): gender, ever_married, work_type, Residence_type, bmi, smoking_status
dbl (6): id, age, hypertension, heart_disease, avg_glucose_level, stroke
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(df)Rows: 5,110
Columns: 12
$ id <dbl> 9046, 51676, 31112, 60182, 1665, 56669, 53882, 10434…
$ gender <chr> "Male", "Female", "Male", "Female", "Female", "Male"…
$ age <dbl> 67, 61, 80, 49, 79, 81, 74, 69, 59, 78, 81, 61, 54, …
$ hypertension <dbl> 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1…
$ heart_disease <dbl> 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0…
$ ever_married <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No…
$ work_type <chr> "Private", "Self-employed", "Private", "Private", "S…
$ Residence_type <chr> "Urban", "Rural", "Rural", "Urban", "Rural", "Urban"…
$ avg_glucose_level <dbl> 228.69, 202.21, 105.92, 171.23, 174.12, 186.21, 70.0…
$ bmi <chr> "36.6", "N/A", "32.5", "34.4", "24", "29", "27.4", "…
$ smoking_status <chr> "formerly smoked", "never smoked", "never smoked", "…
$ stroke <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
Data Wrangling
Cleaned the dataset and prepared for analysis using the following:
Removed missing BMI values using ‘filter()’
Converted categorical variables into factors using ‘mutate()’
Created new variables (e.g. stroke_label, age_group) using ‘case_when()’
Aggregated data using ‘count()’ and ‘summarize()’
These steps allowed for the dataset to be structured properly for analysis and visualization.
# DATA WRANGLING / CLEANING
# Remove missing BMI
df <- df %>%
mutate(bmi = as.numeric(bmi)) %>%
filter(!is.na(bmi))Warning: There was 1 warning in `mutate()`.
ℹ In argument: `bmi = as.numeric(bmi)`.
Caused by warning:
! NAs introduced by coercion
# Convert categorical variables
df <- df %>%
mutate(
stroke = as.factor(stroke),
hypertension = as.factor(hypertension),
heart_disease = as.factor(heart_disease),
gender = as.factor(gender),
ever_married = as.factor(ever_married),
work_type = as.factor(work_type),
Residence_type = as.factor(Residence_type),
smoking_status = as.factor(smoking_status)
)
############################################################
# LABEL VARIABLES
############################################################
df <- df %>%
mutate(
stroke_num = as.numeric(as.character(stroke)),
stroke_label = case_when(
stroke == "0" ~ "No Stroke",
stroke == "1" ~ "Stroke"
),
hypertension_label = case_when(
hypertension == "0" ~ "No Hypertension",
hypertension == "1" ~ "Hypertension"
),
heart_label = case_when(
heart_disease == "0" ~ "No Heart Disease",
heart_disease == "1" ~ "Heart Disease"
),
age_group = case_when(
age < 30 ~ "Under 30",
age >= 30 & age < 45 ~ "30-44",
age >= 45 & age < 60 ~ "45-59",
age >= 60 ~ "60+"
),
glucose_group = case_when(
avg_glucose_level < 100 ~ "Normal",
avg_glucose_level >= 100 & avg_glucose_level < 126 ~ "Moderate",
avg_glucose_level >= 126 ~ "High"
),
bmi_group = case_when(
bmi < 18.5 ~ "Underweight",
bmi >= 18.5 & bmi < 25 ~ "Normal",
bmi >= 25 & bmi < 30 ~ "Overweight",
bmi >= 30 ~ "Obese"
)
)Summary Statistics
Dataset is highly imbalanced:
95.7% of individuals did not experience a stroke
Only 4.26% experienced a stroke
This indicates that stroke is a relatively rare event in the dataset.
Interactive Stroke Dataset Table
This interactive table allows users to filter and explore key variables in the stroke dataset, helping identify patterns across demographic and health factors.
This table showcases:
Gender
Age
Hypertension
Heart disease
Glucose levels
You can filter and explore patterns manually. Which helps identify:
Older individuals tend to appear more frequently with risk factors.
Higher glucose values appear in some older individuals
Not for conclusions directly, but for data exploration.
############################################################
# SUMMARY TABLE - Interactive
############################################################
library(DT)
datatable(
df[, c("gender", "age", "hypertension", "heart_disease",
"avg_glucose_level", "bmi", "smoking_status", "stroke")],
filter = "top",
options = list(pageLength = 15, scrollX = TRUE),
caption = "Interactive Stroke Dataset Table"
)Summary Statistics by Stroke Status
This interactive table compares key health indicators between individuals who did and did not experience a stroke. It summarizes average age, average glucose level, average BMI, and total count by stroke group, helping identify differences between the two groups.
Stroke = 0 (No stroke)
Average age = 41.76 (42 years old)
Average glucose = 104
Average BMI = 28.82
Stroke = 1 (Stroke)
Average age = 67.71 (68 years old)
Average glucose = 134.57
Average BMI = 30.47
Overall the stroke group is much older and has higher glucose levels and slightly higher BMI. There’s strong evidence that age and glucose are key factors, which directly supports our research question.
############################################################
# SUMMARY TABLE - Interactive
############################################################
stroke_summary <- df |>
group_by(stroke) |>
summarise(
count = n(),
avg_age = mean(age, na.rm = TRUE),
avg_glucose = mean(avg_glucose_level, na.rm = TRUE),
avg_bmi = mean(bmi, na.rm = TRUE)
)
datatable(
stroke_summary,
options = list(pageLength = 5),
caption = "Summary Statistics by Stroke Status"
) |>
formatRound(c("avg_age", "avg_glucose", "avg_bmi"), digits = 2)Mean Differences
Individuals who experienced stroke:
Are significantly older on average (67.7yrs vs 41.8yrs)
Have higher average glucose levels (135 vs 104)
Have slightly higher BMI (30.5 vs 28.8).
This suggest that age and glucose levels are strong indicators of stroke risk.
Analysis
Overall, categorical results show that medical conditions such as hypertension and heart disease have the strongest association with stroke, while other variables like gender, work type and residence type show minimal impact. Supporting earlier findings, medical conditions are more important predictors than demographics factors.
library(knitr)
#Hypertension Table
hypertension_table <- df %>%
group_by(hypertension_label, stroke_label) %>%
summarise(Count = n(), .groups = "drop") %>%
group_by(hypertension_label) %>%
mutate(Proportion = round(Count / sum(Count), 3))
kable(
hypertension_table,
caption = "Stroke Counts and Proportions by Hypertension Status"
)| hypertension_label | stroke_label | Count | Proportion |
|---|---|---|---|
| Hypertension | No Stroke | 391 | 0.867 |
| Hypertension | Stroke | 60 | 0.133 |
| No Hypertension | No Stroke | 4309 | 0.967 |
| No Hypertension | Stroke | 149 | 0.033 |
- Individuals with hypertension have a much higher stroke rate (13.3%) compared to those without hypertension (3.3%), indicating a strong relationship between hypertension and stroke.
#Heart Disease Table
heart_table <- df %>%
group_by(heart_label, stroke_label) %>%
summarise(Count = n(), .groups = "drop") %>%
group_by(heart_label) %>%
mutate(Proportion = round(Count / sum(Count), 3))
kable(
heart_table,
caption = "Stroke Counts and Proportions by Heart Disease Status"
)| heart_label | stroke_label | Count | Proportion |
|---|---|---|---|
| Heart Disease | No Stroke | 203 | 0.835 |
| Heart Disease | Stroke | 40 | 0.165 |
| No Heart Disease | No Stroke | 4497 | 0.964 |
| No Heart Disease | Stroke | 169 | 0.036 |
- Stroke is more common among individuals with heart disease (16.5%) compared to those without heart disease (3.6%). Heart disease is one of the strongest risk factors in the dataset.
#GENDER Table
gender_table <- df %>%
group_by(gender, stroke_label) %>%
summarise(Count = n(), .groups = "drop") %>%
group_by(gender) %>%
mutate(Proportion = round(Count / sum(Count), 3))
kable(
gender_table,
caption = "Stroke Counts and Proportions by Gender"
)| gender | stroke_label | Count | Proportion |
|---|---|---|---|
| Female | No Stroke | 2777 | 0.959 |
| Female | Stroke | 120 | 0.041 |
| Male | No Stroke | 1922 | 0.956 |
| Male | Stroke | 89 | 0.044 |
| Other | No Stroke | 1 | 1.000 |
- Stroke rates between males and females are very similar. This suggest that gender is not a strong predictor of stroke in the dataset.
#Smoking Status Table
smoking_table <- df %>%
group_by(smoking_status, stroke_label) %>%
summarise(Count = n(), .groups = "drop") %>%
group_by(smoking_status) %>%
mutate(Proportion = round(Count / sum(Count), 3))
kable(
smoking_table,
caption = "Stroke Counts and Proportions by Smoking Status"
)| smoking_status | stroke_label | Count | Proportion |
|---|---|---|---|
| formerly smoked | No Stroke | 780 | 0.932 |
| formerly smoked | Stroke | 57 | 0.068 |
| never smoked | No Stroke | 1768 | 0.955 |
| never smoked | Stroke | 84 | 0.045 |
| smokes | No Stroke | 698 | 0.947 |
| smokes | Stroke | 39 | 0.053 |
| Unknown | No Stroke | 1454 | 0.980 |
| Unknown | Stroke | 29 | 0.020 |
- Former smokers has the highest stroke rate (6.8%), followed by current smokers (5.3%) and never smokers (4.5%). Suggesting that smoking history may be associated with increased stroke risk.
#Work Type Table
work_table <- df %>%
group_by(work_type, stroke_label) %>%
summarise(Count = n(), .groups = "drop") %>%
group_by(work_type) %>%
mutate(Proportion = round(Count / sum(Count), 3))
kable(
work_table,
caption = "Stroke Counts and Proportions by Work Type"
)| work_type | stroke_label | Count | Proportion |
|---|---|---|---|
| children | No Stroke | 670 | 0.999 |
| children | Stroke | 1 | 0.001 |
| Govt_job | No Stroke | 602 | 0.956 |
| Govt_job | Stroke | 28 | 0.044 |
| Never_worked | No Stroke | 22 | 1.000 |
| Private | No Stroke | 2684 | 0.955 |
| Private | Stroke | 127 | 0.045 |
| Self-employed | No Stroke | 722 | 0.932 |
| Self-employed | Stroke | 53 | 0.068 |
- Self employed individuals have the highest stroke rate (6.8%), while children and those who never worked have almost no cases. This reflects age differences, as younger individuals have lower stroke risk.
#Residence Type Table
residence_table <- df %>%
group_by(Residence_type, stroke_label) %>%
summarise(Count = n(), .groups = "drop") %>%
group_by(Residence_type) %>%
mutate(Proportion = round(Count / sum(Count), 3))
kable(
residence_table,
caption = "Stroke Counts and Proportions by Residence Type"
)| Residence_type | stroke_label | Count | Proportion |
|---|---|---|---|
| Rural | No Stroke | 2319 | 0.959 |
| Rural | Stroke | 100 | 0.041 |
| Urban | No Stroke | 2381 | 0.956 |
| Urban | Stroke | 109 | 0.044 |
- Stroke rates are very similar between rural (4.1%) and urban (4.4%) areas. This indicates that residence type does not have a strong relationship with stroke.
Visualizations
Visualizations provides insight into relationships between variables and stroke.
###Figure 1: Stroke Count Bar Chart
stroke_count <- df %>%
count(stroke_label) %>%
mutate(percent = round(n / sum(n) * 100, 1))
ggplot(stroke_count, aes(x = stroke_label, y = n, fill = stroke_label)) +
geom_bar(stat = "identity") +
geom_text(aes(label = paste(n, "(", percent, "%)")), vjust = -0.5) +
labs(
title = "Stroke Cases in the Dataset",
subtitle = "Most individuals in the dataset did not experience a stroke",
x = "Stroke Status",
y = "Count",
fill = "Stroke Status"
) +
theme_minimal()- This chart shows an highly imbalanced dataset, with most individuals not experiencing a stroke.
###Figure 2: Age Distribution Histogram
ggplot(df, aes(x = age, fill = stroke_label)) +
geom_histogram(binwidth = 5) +
labs(
title = "Age Distribution by Stroke Status",
subtitle = "Stroke cases appear more common among older individuals",
x = "Age",
y = "Count",
fill = "Stroke Status"
) +
scale_x_continuous(breaks = seq(0, 90, by = 10)) +
theme_minimal()- Age histogram shows that stroke cases are more concentrated among older individuals.
#Figure 3:Glucose Boxplot by Stroke Status
ggplot(df, aes(x = stroke_label, y = avg_glucose_level, fill = stroke_label)) +
geom_boxplot() +
labs(
title = "Average Glucose Level by Stroke Status",
subtitle = "Stroke cases tend to have higher glucose levels",
x = "Stroke Status",
y = "Average Glucose Level",
fill = "Stroke Status"
) +
theme_minimal()Individuals with stroke tend to have higher glucose levels, indicating a strong relationship with stroke risk.
#Figure 4:BMI Boxplot by Stroke Status
ggplot(df, aes(x = stroke_label, y = bmi, fill = stroke_label)) +
geom_boxplot() +
labs(
title = "BMI by Stroke Status",
subtitle = "BMI shows a smaller difference compared with age and glucose",
x = "Stroke Status",
y = "BMI",
fill = "Stroke Status"
) +
theme_minimal()- Boxplot for BMI only show small differences between groups, suggesting its not a strong independent predictor.
#Figure 5: Age vs Glucose Scatterplot with Trendline
ggplot(df, aes(x = age, y = avg_glucose_level, color = stroke_label)) +
geom_point() +
geom_smooth(se = FALSE) +
labs(
title = "Age vs Average Glucose Level",
subtitle = "Scatterplot with trend line comparing age and glucose",
x = "Age",
y = "Average Glucose Level",
color = "Stroke Status"
) +
scale_x_continuous(breaks = seq(0, 90, by = 10)) +
theme_minimal()`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
- Scatterplot shows stroke cases are more concentrated among older individuals with higher glucose levels.
#Figure 6: Stroke Rate by Age Group
age_rate <- df %>%
group_by(age_group) %>%
summarize(stroke_rate = mean(stroke_num)) %>%
mutate(percent = round(stroke_rate * 100, 1))
ggplot(age_rate, aes(x = age_group, y = stroke_rate, fill = age_group)) +
geom_bar(stat = "identity") +
geom_text(aes(label = paste(percent, "%")), vjust = -0.5) +
labs(
title = "Stroke Rate by Age Group",
subtitle = "Stroke rate increases in older age groups",
x = "Age Group",
y = "Stroke Rate",
fill = "Age Group"
) +
theme_minimal()- Stroke rates increases significantly with age, especially in older groups.
#Figure 7: Stroke Rate by Hypertension Status
hyper_rate <- df %>%
group_by(hypertension_label) %>%
summarize(stroke_rate = mean(stroke_num)) %>%
mutate(percent = round(stroke_rate * 100, 1))
ggplot(hyper_rate, aes(x = hypertension_label, y = stroke_rate, fill = hypertension_label)) +
geom_bar(stat = "identity") +
geom_text(aes(label = paste(percent, "%")), vjust = -0.5) +
labs(
title = "Stroke Rate by Hypertension Status",
subtitle = "Hypertension is associated with a higher stroke rate",
x = "Hypertension Status",
y = "Stroke Rate",
fill = "Hypertension Status"
) +
theme_minimal()- Individuals with hypertension have a much higher stroke rate compared to those without.
#Figure 8: Stroke Rate by Heart Disease Status
heart_rate <- df %>%
group_by(heart_label) %>%
summarize(stroke_rate = mean(stroke_num)) %>%
mutate(percent = round(stroke_rate * 100, 1))
ggplot(heart_rate, aes(x = heart_label, y = stroke_rate, fill = heart_label)) +
geom_bar(stat = "identity") +
geom_text(aes(label = paste(percent, "%")), vjust = -0.5) +
labs(
title = "Stroke Rate by Heart Disease Status",
subtitle = "Heart disease is associated with a higher stroke rate",
x = "Heart Disease Status",
y = "Stroke Rate",
fill = "Heart Disease Status"
) +
theme_minimal()- Heart disease is strongly associated with stroke, demonstrating one of the highest differences in stroke rates.
#Figure 9: Stroke Rate by Smoking Status
smoking_rate <- df %>%
group_by(smoking_status) %>%
summarize(stroke_rate = mean(stroke_num)) %>%
mutate(percent = round(stroke_rate * 100, 1))
ggplot(smoking_rate, aes(x = reorder(smoking_status, stroke_rate), y = stroke_rate, fill = smoking_status)) +
geom_bar(stat = "identity") +
geom_text(aes(label = paste(percent, "%")), hjust = -0.1) +
coord_flip() +
labs(
title = "Stroke Rate by Smoking Status",
subtitle = "Former smokers show the highest stroke rate",
x = "Smoking Status",
y = "Stroke Rate",
fill = "Smoking Status"
) +
theme_minimal()- Former Smokers have the highest stroke rates, suggesting smoking history may increase risk.
#Figure 10: Stroke Rate by Glucose Group
glucose_rate <- df %>%
group_by(glucose_group) %>%
summarize(stroke_rate = mean(stroke_num)) %>%
mutate(percent = round(stroke_rate * 100, 1))
ggplot(glucose_rate, aes(x = glucose_group, y = stroke_rate, fill = glucose_group)) +
geom_bar(stat = "identity") +
geom_text(aes(label = paste(percent, "%")), vjust = -0.5) +
labs(
title = "Stroke Rate by Glucose Group",
subtitle = "Higher glucose groups show higher stroke rates",
x = "Glucose Group",
y = "Stroke Rate",
fill = "Glucose Group"
) +
theme_minimal()- Higher glucose groups show higher stroke rates; this shows its importance as a risk factor.
#Figure 11: Stroke Rate by BMI Group
bmi_rate <- df %>%
group_by(bmi_group) %>%
summarize(stroke_rate = mean(stroke_num)) %>%
mutate(percent = round(stroke_rate * 100, 1))
ggplot(bmi_rate, aes(x = bmi_group, y = stroke_rate, fill = bmi_group)) +
geom_bar(stat = "identity") +
geom_text(aes(label = paste(percent, "%")), vjust = -0.5) +
labs(
title = "Stroke Rate by BMI Group",
subtitle = "BMI groups show smaller differences than age and glucose",
x = "BMI Group",
y = "Stroke Rate",
fill = "BMI Group"
) +
theme_minimal()- Small differences are shown between BMI groups, indicating a weaker influence compared to age and glucose.
#Figure 12: Stroke Rate by Work Type
work_rate <- df %>%
group_by(work_type) %>%
summarize(stroke_rate = mean(stroke_num)) %>%
mutate(percent = round(stroke_rate * 100, 1))
ggplot(work_rate, aes(x = reorder(work_type, stroke_rate), y = stroke_rate, fill = work_type)) +
geom_bar(stat = "identity") +
geom_text(aes(label = paste(percent, "%")), hjust = -0.1) +
coord_flip() +
labs(
title = "Stroke Rate by Work Type",
subtitle = "Self-employed individuals show the highest stroke rate",
x = "Work Type",
y = "Stroke Rate",
fill = "Work Type"
) +
theme_minimal()- Self employed individuals show higher stroke rates, reflecting age differences.
#Figure 13:Residence Type and Stroke Proportion
ggplot(df, aes(x = Residence_type, fill = stroke_label)) +
geom_bar(position = "fill") +
labs(
title = "Stroke Proportion by Residence Type",
subtitle = "Urban and rural groups show similar stroke proportions",
x = "Residence Type",
y = "Proportion",
fill = "Stroke Status"
) +
theme_minimal()- There is little difference between urban and rural populations, suggesting location is not a major factor.
#Figure 14: Mean age difference by Stroke
age_means <- df %>%
group_by(stroke_label) %>%
summarize(mean_age = mean(age))
ggplot(age_means, aes(x = stroke_label, y = mean_age, color = stroke_label)) +
geom_point(size = 4) +
geom_segment(
aes(
x = 1,
xend = 2,
y = age_means$mean_age[age_means$stroke_label == "No Stroke"],
yend = age_means$mean_age[age_means$stroke_label == "Stroke"]
)
) +
geom_text(aes(label = round(mean_age, 1)), vjust = -1) +
labs(
title = "Mean Age Difference by Stroke Status",
subtitle = "The average age is higher in the stroke group",
x = "Stroke Status",
y = "Mean Age",
color = "Stroke Status"
) +
theme_minimal()Warning: Use of `age_means$mean_age` is discouraged.
ℹ Use `mean_age` instead.
Warning: Use of `age_means$stroke_label` is discouraged.
ℹ Use `stroke_label` instead.
Warning: Use of `age_means$mean_age` is discouraged.
ℹ Use `mean_age` instead.
Warning: Use of `age_means$stroke_label` is discouraged.
ℹ Use `stroke_label` instead.
- Average age is significantly higher for individuals who experienced a stroke.
#Figure 15: Combined Comparison Plot
p1 <- ggplot(df, aes(x = stroke_label, y = age, fill = stroke_label)) +
geom_boxplot() +
labs(title = "Age", x = "", y = "Age") +
theme_minimal()
p2 <- ggplot(df, aes(x = stroke_label, y = avg_glucose_level, fill = stroke_label)) +
geom_boxplot() +
labs(title = "Glucose", x = "", y = "Average Glucose") +
theme_minimal()
p3 <- ggplot(df, aes(x = stroke_label, y = bmi, fill = stroke_label)) +
geom_boxplot() +
labs(title = "BMI", x = "", y = "BMI") +
theme_minimal()
(p1 + p2 + p3)- This combined plot compares age, glucose, and BMI; demonstrating clear differences between stroke groups compared to BMI.
Overall, the plots consistently show that age, glucose level, hypertension, heart disease are the most important variables associated with stroke.
Modeling
Multiple Linear Regression Model
Regression model was used to estimate how age, glucose, BMI, hypertension and heart disease are associated with the likelihood of stroke.
# Convert stroke to numeric for regression
df <- df %>%
mutate(stroke_num = as.numeric(as.character(stroke)))
# Linear model
model <- lm(stroke_num ~ age + avg_glucose_level + bmi + hypertension + heart_disease, data = df)
summary(model)
Call:
lm(formula = stroke_num ~ age + avg_glucose_level + bmi + hypertension +
heart_disease, data = df)
Residuals:
Min 1Q Median 3Q Max
-0.26448 -0.06629 -0.02773 0.00551 1.03461
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.869e-02 1.187e-02 -3.261 0.001119 **
age 1.720e-03 1.396e-04 12.320 < 2e-16 ***
avg_glucose_level 3.548e-04 6.542e-05 5.423 6.13e-08 ***
bmi -1.312e-03 3.786e-04 -3.465 0.000534 ***
hypertension1 5.325e-02 1.010e-02 5.270 1.42e-07 ***
heart_disease1 6.493e-02 1.334e-02 4.867 1.17e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1944 on 4903 degrees of freedom
Multiple R-squared: 0.07379, Adjusted R-squared: 0.07285
F-statistic: 78.12 on 5 and 4903 DF, p-value: < 2.2e-16
Key Findings:
Age (Strongest Predictor)
- As age increases, the likelihood of stroke increases.
Glucose Level (significant positive effect):
- Higher glucose levels are significantly associated with higher stroke risk.
BMI (Weak/slight negative effect):
- BMI has a small negative effect and is not a strong independent predictor
Hypertension:
- Individuals with hypertension have a higher likelihood of stroke compared individuals without.
Heart Disease:
- Significant predictor of stroke, with individuals showing a higher probability of stroke.
Model Strength:
- R² = 0.0738 (7%)
Model explains only a small portion of the variation in stroke, indicating that stroke is influenced by many additional factors such as lifestyle and health conditions. These results reinforces earlier findings that medical conditions along with age and glucose level are strong predictors of stroke risk.
Clustering
Clustering was used to group individuals based on age, glucose level and BMI without using the stroke variable. Data was standardized so that variables with larger scales would not dominate results. Pam clustering grouped individuals into two clusters and the summary table compares the average characteristics of each group.
############################################################
# CLUSTERING (REMOVE TARGET VARIABLE)
############################################################
cluster_df <- df %>%
select(age, avg_glucose_level, bmi)
cluster_scaled <- scale(cluster_df)
set.seed(123)
cluster_model <- pam(cluster_scaled, k = 2)
#medoids
cluster_model$medoids age avg_glucose_level bmi
[1,] 0.493663 -0.1128919 0.1663804
[2,] -1.146763 -0.3539760 -0.7758066
#cluster assignments
cluster_model$clustering [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[38] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[75] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1
[112] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[149] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[186] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 2 1 2 1 2 1 1 1 1 1 1 1 1
[223] 2 1 1 2 2 1 1 1 1 1 1 1 2 1 1 1 1 2 1 1 2 1 2 2 2 1 2 1 2 2 2 1 2 1 1 1 1
[260] 1 1 1 1 1 2 1 2 1 1 1 1 1 2 1 1 1 1 1 1 2 2 2 2 1 2 1 1 2 2 1 1 1 1 1 1 2
[297] 1 1 1 1 2 1 1 1 2 1 1 2 1 1 1 2 1 2 1 1 2 1 2 2 1 2 1 1 1 1 1 2 1 2 1 2 1
[334] 1 2 2 2 1 2 1 1 2 2 1 1 1 1 1 1 2 1 1 2 1 1 1 1 1 2 2 1 1 1 1 1 1 2 1 2 1
[371] 1 1 2 1 2 1 2 2 1 2 1 1 1 2 1 1 1 1 1 2 1 1 1 1 2 1 1 2 2 1 2 1 2 2 1 1 1
[408] 1 2 1 1 1 2 1 1 2 2 2 2 1 1 2 1 1 1 1 1 2 2 1 1 2 1 2 1 2 1 1 2 1 1 2 2 1
[445] 1 1 1 1 1 1 1 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 2 1 1 1 1 2 1
[482] 1 1 2 1 2 2 1 2 2 1 1 2 1 2 2 1 1 1 2 2 2 1 1 2 1 1 1 2 1 2 1 2 1 2 1 2 1
[519] 2 1 1 1 1 1 1 1 1 1 2 1 2 1 1 1 1 2 2 1 2 1 1 1 2 1 1 1 1 2 1 1 1 2 2 1 2
[556] 1 2 2 2 2 1 2 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 2 1 1 1
[593] 1 1 2 2 1 1 2 1 1 1 1 1 1 1 1 1 1 2 2 2 1 1 1 1 1 2 1 1 1 2 1 1 1 1 2 1 2
[630] 1 1 2 2 2 2 2 2 1 2 1 1 1 1 2 1 1 2 2 1 2 1 1 1 1 1 1 1 1 1 1 2 1 2 1 1 1
[667] 1 1 2 2 1 1 2 2 1 1 1 1 1 2 2 1 1 2 1 1 2 1 1 1 2 1 2 1 1 1 1 1 1 2 1 1 2
[704] 1 1 2 2 1 1 1 1 2 1 2 1 1 1 1 1 2 2 1 1 1 2 1 1 1 2 2 1 1 2 1 1 1 1 1 1 2
[741] 2 1 2 1 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 2 2 1 1 1 1 1 1 1 1 2
[778] 1 1 2 1 1 2 1 1 1 2 1 2 2 1 2 2 2 1 1 1 1 1 2 1 2 1 1 2 2 1 1 1 1 1 2 1 1
[815] 2 1 1 1 1 1 1 1 1 2 2 1 1 1 2 1 1 2 1 1 1 1 1 2 2 1 1 1 1 1 1 1 1 1 2 1 1
[852] 1 1 1 1 1 2 2 1 2 1 1 1 2 1 2 2 1 1 2 1 1 1 1 1 1 2 1 1 1 2 2 2 1 1 1 1 1
[889] 1 1 2 1 1 1 2 2 2 1 1 2 2 1 2 1 2 2 2 1 2 2 1 2 1 1 2 1 2 1 1 2 1 2 2 1 2
[926] 1 2 2 2 2 2 1 1 1 1 1 1 2 2 1 2 2 1 1 1 1 1 2 2 1 2 2 1 1 2 1 1 1 1 1 1 1
[963] 2 1 2 1 1 1 1 1 1 1 1 1 2 1 1 1 1 2 1 2 2 1 1 1 1 2 1 2 1 2 1 2 1 1 1 1 2
[1000] 1 1 1 2 1 1 2 1 1 2 1 1 2 1 1 1 1 2 1 2 1 1 1 1 1 1 1 1 1 1 2 2 1 2 2 1 2
[1037] 2 2 1 1 1 2 1 2 1 1 1 2 2 1 1 2 1 1 1 1 1 1 1 2 1 2 2 2 1 1 2 1 1 1 1 2 1
[1074] 2 2 1 2 1 2 1 1 2 2 1 2 1 1 1 1 2 1 1 1 2 2 1 1 1 2 1 1 1 1 2 1 1 1 1 2 1
[1111] 1 1 1 2 1 1 2 1 1 1 2 2 1 2 1 1 1 1 1 1 2 1 1 1 2 1 1 1 1 1 2 2 2 2 1 1 1
[1148] 1 1 2 1 1 1 1 2 2 2 2 2 2 1 1 1 1 1 1 1 2 1 1 1 1 2 2 1 1 2 2 2 1 2 1 2 2
[1185] 1 2 2 2 1 2 1 1 1 1 1 1 1 1 1 2 1 1 1 1 2 1 2 1 1 2 1 2 1 1 2 1 1 1 1 1 2
[1222] 1 1 1 1 2 1 1 1 1 2 1 2 1 1 1 1 2 1 1 2 1 2 1 1 1 2 2 1 1 1 2 1 1 1 1 1 1
[1259] 2 1 2 1 1 2 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 1 1 2 1 2 1
[1296] 1 1 2 2 1 1 1 2 1 2 1 1 2 1 1 1 1 1 2 1 1 2 2 1 2 1 2 1 2 1 1 1 1 1 1 1 1
[1333] 2 2 1 2 1 1 1 1 1 1 1 1 2 1 1 1 2 2 1 2 1 2 1 2 2 1 2 1 1 1 1 1 1 1 1 2 2
[1370] 1 2 2 1 1 1 2 2 1 1 1 1 2 1 1 1 2 2 2 2 2 2 2 2 1 1 1 1 2 2 1 1 1 1 1 1 2
[1407] 2 1 2 1 2 1 1 1 2 2 2 1 1 1 1 2 1 1 2 1 1 1 2 1 1 2 1 2 2 2 1 1 1 2 2 1 1
[1444] 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 2 1 1 2 1 2 1 1 1 1 1 1 2 2 1 2 1 2 1
[1481] 1 1 1 2 1 2 1 1 2 1 2 2 1 1 1 1 1 2 2 1 1 1 1 1 1 2 1 2 2 2 1 2 1 1 2 1 1
[1518] 2 2 1 1 1 1 1 1 1 2 2 1 2 1 2 1 1 1 1 2 1 1 2 2 1 1 2 1 1 2 1 2 2 2 1 2 2
[1555] 1 1 1 1 1 1 1 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 2 1 1 2 1 1 1 1 1 1
[1592] 1 1 2 2 1 2 1 1 2 1 1 1 2 1 1 1 1 1 1 2 2 2 2 1 1 1 1 2 1 1 1 2 2 2 1 2 1
[1629] 2 1 1 1 2 2 2 1 2 1 2 1 1 1 1 2 1 1 2 1 1 1 2 2 1 2 1 1 1 1 1 1 1 1 2 2 1
[1666] 1 1 2 2 1 2 2 2 2 1 1 1 1 2 2 1 1 2 1 1 1 1 2 1 1 1 1 1 2 1 2 2 1 2 1 1 2
[1703] 1 2 2 1 2 2 1 1 2 2 2 2 1 2 2 1 1 1 1 1 1 2 2 1 1 2 1 2 1 1 1 1 2 1 2 2 1
[1740] 1 1 1 2 1 1 2 2 2 1 2 2 1 2 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 2 1 1 1 1 1 1
[1777] 1 2 1 1 1 1 2 1 1 2 2 1 1 1 1 1 1 1 1 1 2 1 2 2 2 2 1 1 1 2 1 1 2 1 1 1 1
[1814] 1 2 1 2 1 1 2 1 1 1 1 2 2 2 1 2 2 1 1 1 1 2 1 2 2 1 2 1 1 1 1 1 2 1 2 1 1
[1851] 1 1 1 2 2 2 2 2 1 1 2 2 2 1 1 1 1 1 1 1 2 1 2 2 2 1 1 2 2 1 1 2 2 2 1 1 2
[1888] 2 1 2 1 2 2 1 2 1 1 1 1 1 2 2 1 2 2 1 1 2 2 1 2 1 1 1 2 2 1 1 2 2 1 1 1 1
[1925] 1 1 1 1 1 1 2 1 1 1 1 1 2 2 1 2 1 1 1 1 2 2 2 1 2 1 2 1 1 1 1 1 2 1 1 2 1
[1962] 1 1 1 1 1 1 2 2 1 2 1 1 2 2 1 1 1 1 1 1 1 1 1 2 2 2 1 1 1 1 2 2 1 1 2 1 1
[1999] 2 1 1 1 1 2 2 2 1 2 1 1 1 1 1 1 2 1 2 1 2 1 1 1 2 1 2 1 1 1 2 1 1 1 2 2 1
[2036] 1 1 2 2 2 1 2 2 2 1 2 1 1 2 1 1 1 1 1 2 1 2 2 1 1 2 1 2 1 1 1 2 1 2 1 1 1
[2073] 2 1 2 1 1 2 2 1 1 1 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 2 1 2 2 1 2
[2110] 1 1 2 1 1 1 1 1 1 1 2 1 1 1 2 1 2 1 1 1 2 2 2 1 1 2 1 2 2 2 1 1 1 1 2 1 1
[2147] 1 2 2 1 1 1 1 1 1 1 2 1 1 1 1 1 2 1 2 1 1 2 1 1 1 1 1 1 1 1 2 1 2 1 1 1 1
[2184] 1 1 1 1 2 1 1 1 2 2 1 1 2 1 2 2 1 2 1 1 1 1 1 1 1 2 1 1 2 1 1 2 1 1 1 2 1
[2221] 1 1 2 1 2 1 1 2 2 2 2 2 1 1 1 1 2 2 1 1 1 2 1 1 2 1 1 2 1 1 2 2 1 2 2 2 1
[2258] 1 1 1 2 1 2 2 2 2 1 2 2 2 2 2 2 1 2 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1
[2295] 1 1 1 1 2 1 1 1 2 1 1 1 2 1 2 2 2 1 2 1 2 2 1 1 1 2 2 1 1 1 1 1 2 1 1 1 1
[2332] 2 1 2 1 2 1 1 2 1 2 1 1 1 1 1 1 1 2 1 1 1 1 1 2 2 1 2 2 1 2 2 2 2 2 2 1 2
[2369] 1 1 2 2 2 1 1 1 1 1 1 2 1 1 1 1 1 1 2 1 1 1 1 1 1 2 2 2 1 1 1 1 1 1 1 1 2
[2406] 2 1 2 2 2 1 1 2 1 1 2 2 1 2 2 1 1 1 1 1 1 1 2 1 1 2 1 1 2 2 1 2 1 1 1 1 1
[2443] 2 1 1 1 1 1 1 2 2 1 1 1 1 2 1 2 1 1 1 2 1 2 1 2 1 1 2 1 2 1 1 1 1 1 1 1 1
[2480] 1 2 1 2 1 1 1 2 1 2 1 2 1 1 2 2 2 1 1 1 1 1 1 1 2 1 2 1 1 1 2 1 1 1 1 1 1
[2517] 1 1 1 2 1 1 2 1 2 1 1 2 2 1 1 1 1 1 2 2 1 2 2 1 2 1 2 1 1 1 2 2 1 1 1 1 1
[2554] 1 2 1 2 1 1 2 1 1 2 1 2 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 2 1
[2591] 2 2 1 1 1 2 1 1 1 2 1 1 2 2 2 1 1 2 1 1 1 1 1 2 2 1 1 2 2 1 2 1 1 1 1 2 1
[2628] 2 1 1 1 1 2 1 2 1 1 1 1 2 1 1 1 1 2 1 1 2 1 2 2 2 2 1 2 1 1 1 1 1 2 2 1 1
[2665] 2 1 1 1 1 1 1 2 1 2 1 2 1 1 2 1 1 2 2 1 1 1 1 1 2 2 1 1 1 1 1 1 2 1 1 1 1
[2702] 1 2 1 1 2 1 1 1 1 2 2 2 2 1 1 1 1 1 1 2 1 2 2 1 2 1 1 2 1 1 1 1 1 1 1 1 2
[2739] 1 1 2 2 1 1 1 1 1 1 1 1 2 1 2 1 2 1 1 2 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 2 1
[2776] 2 2 1 2 1 2 1 2 2 2 1 2 2 2 2 2 2 2 1 2 2 1 1 1 2 1 1 1 1 2 2 2 1 1 2 1 1
[2813] 1 2 1 2 1 2 1 2 1 2 1 1 1 1 1 1 1 1 1 1 2 2 2 2 1 1 1 1 1 1 1 1 1 2 2 1 1
[2850] 2 1 1 1 1 1 1 1 1 1 1 2 2 1 1 2 2 1 1 1 2 1 1 1 1 1 1 1 1 2 1 2 1 2 2 1 2
[2887] 1 1 1 1 2 1 1 2 1 1 2 2 2 2 1 2 1 2 1 1 1 1 1 2 1 1 1 1 1 1 2 1 1 1 2 2 1
[2924] 1 2 1 1 1 1 2 1 1 2 1 1 2 1 1 1 1 1 1 2 2 1 1 1 1 2 1 2 2 2 2 1 2 1 1 1 1
[2961] 1 1 2 1 1 2 1 1 1 1 2 2 2 2 2 2 1 1 1 1 1 1 1 2 1 2 2 2 1 1 2 1 1 1 2 1 2
[2998] 1 2 1 1 1 1 1 1 1 1 1 1 2 1 2 1 2 2 1 1 1 1 1 2 2 1 2 1 2 1 2 2 2 1 1 2 1
[3035] 2 1 2 2 2 1 2 1 1 1 1 1 1 1 2 1 1 1 1 2 1 1 2 1 1 2 1 1 1 1 1 1 1 1 1 1 1
[3072] 2 1 1 2 1 1 1 2 1 1 1 1 1 2 2 1 1 1 1 1 1 1 1 1 2 2 2 1 1 2 2 1 1 1 1 2 1
[3109] 1 1 1 2 1 2 1 1 1 1 1 1 1 1 1 1 2 2 2 2 1 2 1 2 1 1 2 1 1 2 1 1 2 1 1 1 1
[3146] 2 2 1 1 2 1 1 1 1 1 2 2 1 2 1 2 1 1 1 2 1 2 1 2 2 2 1 1 1 1 1 2 2 1 2 1 1
[3183] 1 2 1 2 1 1 2 1 1 1 1 2 1 1 1 1 1 2 2 1 2 1 2 1 1 1 1 1 1 2 1 1 1 2 1 1 1
[3220] 2 1 1 1 1 1 1 1 1 2 2 1 1 1 2 1 2 1 2 2 1 2 2 1 1 2 2 2 1 1 1 1 1 1 1 2 2
[3257] 2 2 2 1 1 2 1 2 1 1 2 1 2 1 1 1 1 1 1 2 2 1 1 2 1 2 1 1 1 2 1 2 2 1 1 1 1
[3294] 1 2 1 1 1 2 2 1 2 1 1 1 2 1 1 2 2 1 1 1 2 2 1 1 1 2 1 1 1 1 2 2 2 1 2 1 2
[3331] 1 1 1 1 1 1 1 2 2 1 1 1 2 1 1 1 1 1 1 1 2 2 1 2 1 1 1 1 2 2 2 1 1 1 1 1 1
[3368] 2 1 1 2 1 2 2 1 1 1 1 2 2 2 1 1 2 1 2 2 1 1 1 2 1 2 1 1 2 1 1 1 1 2 1 1 1
[3405] 1 1 1 1 1 1 1 2 2 2 2 1 2 1 1 1 1 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2
[3442] 1 1 1 1 1 1 2 2 1 2 1 1 2 2 1 2 2 1 2 1 2 2 1 2 1 1 1 1 1 1 1 1 2 1 1 1 1
[3479] 1 1 1 2 2 1 2 1 1 1 2 1 2 2 2 1 2 1 1 1 2 2 1 1 1 1 1 1 1 1 2 1 1 1 1 2 1
[3516] 2 1 1 1 1 2 2 1 1 1 1 1 1 2 1 2 2 1 1 2 1 1 1 2 2 2 1 1 1 1 1 2 2 2 2 1 2
[3553] 1 2 1 2 1 1 1 1 1 2 1 1 2 1 1 1 1 1 1 2 2 1 1 1 2 2 2 2 1 2 1 1 1 1 2 2 1
[3590] 1 2 1 1 1 2 2 1 2 2 2 1 2 1 1 1 2 2 1 2 2 2 1 1 1 1 1 1 2 2 2 1 2 1 2 1 2
[3627] 1 1 2 2 2 1 2 2 1 1 1 2 2 2 1 1 1 1 1 1 1 1 1 2 1 1 1 2 1 1 1 1 1 2 1 1 2
[3664] 1 1 2 2 2 1 1 2 1 1 2 2 1 2 1 1 2 1 2 1 1 1 1 1 1 2 2 1 1 1 1 2 1 1 1 2 1
[3701] 2 1 2 2 1 2 2 1 1 1 1 1 2 1 2 1 2 2 1 1 1 1 1 2 2 2 2 1 1 1 1 1 1 2 1 1 1
[3738] 2 1 2 1 1 1 1 1 2 1 1 1 2 1 1 2 1 1 1 2 2 1 1 2 1 2 1 2 2 2 1 2 2 1 1 2 1
[3775] 1 1 2 2 1 1 1 1 1 2 1 1 1 1 1 2 1 1 1 2 1 2 1 2 2 1 1 1 1 2 1 1 1 1 1 1 1
[3812] 2 1 1 1 2 1 1 1 1 1 2 1 1 1 2 2 2 2 2 1 2 2 1 2 2 1 1 1 1 1 2 1 2 2 2 2 1
[3849] 1 1 2 1 1 1 1 2 1 1 1 1 1 1 1 1 2 2 2 2 2 1 1 1 1 1 2 1 1 2 1 1 2 2 2 2 1
[3886] 1 1 1 2 2 1 2 2 1 1 1 1 1 2 1 1 1 2 1 1 2 1 2 1 1 2 1 1 1 2 1 1 1 1 1 1 1
[3923] 1 2 1 1 2 1 1 1 1 2 1 1 1 2 1 2 2 1 2 1 1 1 1 1 1 2 1 1 1 2 1 2 2 1 2 1 1
[3960] 1 2 1 1 1 1 1 1 1 1 1 1 2 1 2 2 1 1 1 1 2 2 2 2 2 1 1 1 1 1 2 1 1 2 1 2 2
[3997] 2 2 1 1 1 1 1 2 2 1 2 1 2 1 1 1 2 2 1 2 2 2 2 1 1 1 1 1 2 1 1 2 1 1 1 2 1
[4034] 1 2 1 2 2 1 2 1 2 1 2 1 1 1 1 2 1 2 1 2 2 2 2 2 1 1 1 1 1 1 2 2 2 2 2 2 1
[4071] 1 2 2 1 2 2 2 1 1 1 2 2 1 1 1 1 1 2 2 1 1 2 2 1 2 2 2 1 1 2 1 1 1 1 2 1 1
[4108] 1 1 1 2 2 1 1 1 2 1 1 1 2 2 1 1 2 1 1 1 1 1 1 1 1 1 1 2 1 2 1 1 2 2 1 1 1
[4145] 1 2 1 1 1 2 2 1 1 1 1 1 1 1 2 2 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 2 1 1 2 1 2
[4182] 2 1 1 1 1 1 2 1 2 2 1 1 1 1 2 2 1 1 2 2 2 1 2 1 1 1 1 2 2 2 2 1 2 1 2 2 1
[4219] 1 1 2 1 1 1 1 1 2 2 1 2 2 2 1 2 1 2 1 2 2 2 2 1 1 1 2 1 1 1 2 2 1 2 1 1 1
[4256] 2 1 1 1 1 2 1 1 1 1 1 1 1 1 2 2 1 1 2 1 1 1 2 1 1 2 2 1 1 1 2 1 2 1 2 1 1
[4293] 1 1 1 1 1 1 1 2 1 2 2 1 2 2 1 2 1 1 2 1 1 1 1 1 2 2 2 1 1 1 2 2 2 2 1 1 1
[4330] 1 1 1 1 1 1 2 1 1 1 2 1 2 1 1 1 1 1 1 1 2 1 1 2 1 2 1 1 1 2 2 2 1 1 2 2 1
[4367] 1 2 1 1 2 1 1 1 1 1 1 1 1 1 2 2 2 2 1 2 1 2 1 1 1 1 1 1 2 1 1 2 2 1 2 1 1
[4404] 1 2 2 2 2 1 2 2 1 1 2 2 2 2 1 1 2 1 1 1 2 1 2 2 1 1 1 2 1 1 1 2 2 1 1 1 1
[4441] 1 1 2 1 1 1 1 1 1 1 1 1 2 1 1 1 1 2 1 1 1 2 1 1 1 1 1 2 2 2 1 2 1 1 2 1 2
[4478] 1 1 2 1 1 1 2 1 1 2 2 2 1 1 1 1 1 1 1 1 1 1 2 2 2 1 1 1 2 1 1 1 2 1 1 1 1
[4515] 1 1 2 1 1 1 2 2 1 1 2 1 1 2 1 1 2 1 1 2 1 1 2 1 1 2 2 2 1 1 1 2 1 1 1 2 1
[4552] 1 1 1 2 1 1 2 2 1 1 2 1 1 1 2 1 1 1 1 2 1 1 2 1 2 2 2 1 2 2 2 1 2 1 2 1 1
[4589] 1 1 2 2 2 1 2 2 2 2 1 1 2 2 2 1 1 1 2 2 1 1 2 1 1 1 2 2 1 1 1 1 1 2 2 2 1
[4626] 1 1 1 1 1 1 1 1 2 2 1 2 1 2 2 2 1 1 1 1 2 1 2 1 2 2 1 2 1 1 1 2 1 2 1 1 2
[4663] 1 1 2 1 1 1 2 1 1 2 2 2 1 1 1 1 2 2 2 1 1 1 1 1 2 2 1 1 1 1 1 2 2 2 1 1 1
[4700] 1 1 1 1 1 1 1 2 1 2 2 2 2 2 1 1 1 1 1 2 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 2 2
[4737] 2 2 1 1 1 2 2 1 1 1 2 1 1 2 2 2 1 2 1 1 2 1 1 1 1 1 2 1 1 1 2 1 1 2 2 1 1
[4774] 2 1 2 1 1 1 1 1 1 1 2 1 2 2 2 1 1 1 1 1 2 2 1 1 1 2 1 2 2 2 2 2 1 1 1 2 2
[4811] 1 1 1 1 1 1 1 1 2 1 1 1 1 1 2 2 1 1 1 1 1 2 2 1 2 2 2 2 1 1 1 1 1 2 1 2 1
[4848] 2 1 1 1 1 2 1 2 1 1 2 1 1 2 1 1 1 1 1 1 1 1 2 1 2 2 1 1 2 1 1 2 2 2 2 2 1
[4885] 1 2 1 1 1 1 1 2 2 1 1 2 2 1 1 2 1 1 1 1 2 1 1 1 1
df <- df %>%
mutate(cluster = cluster_model$clustering)
############################################################
# CLUSTER PLOT
############################################################
plot(cluster_model)############################################################
# CLUSTER SUMMARY
############################################################
cluster_summary <- df %>%
group_by(cluster) %>%
summarize(
Mean_Age = round(mean(age), 2),
Mean_Glucose = round(mean(avg_glucose_level), 2),
Mean_BMI = round(mean(bmi), 2),
Stroke_Rate = round(mean(stroke_num), 3),
.groups = "drop"
)
kable(
cluster_summary,
caption = "Cluster Summary Statistics"
)| cluster | Mean_Age | Mean_Glucose | Mean_BMI | Stroke_Rate |
|---|---|---|---|---|
| 1 | 55.18 | 111.34 | 31.87 | 0.063 |
| 2 | 17.52 | 92.89 | 22.77 | 0.001 |
Results:
Two clusters were identified:
Cluster 1 (Higher Risk):
Older individuals
Higher glucose levels
More normal BMI
Cluster 2(Lower Risk):
Younger Individuals
Lower glucose levels
Slightly higher BMI
Clustering results aligns with regression findings, reinforcing that age and glucose are the most important factors associated with stroke risk.
Interactive Visualizations
Age Distribution by Stroke Status
This interactive boxplot compares the age distribution of individuals with and without experiencing a stroke. It helps show whether stroke cases tend to occur among older individuals.
Individuals with stroke (1) are significantly older.
Median age for stroke group = late 60s-70s
Median age for no stroke = early 40s
Stroke group is more concentrated in older ages.
Strong evidence that age is associated with stroke risk.
age_interactive <- ggplot(df, aes(x = factor(stroke), y = age, fill = factor(stroke))) +
geom_boxplot() +
labs(
title = "Age Distribution by Stroke Status",
x = "Stroke (0 = No, 1 = Yes)",
y = "Age"
) +
scale_fill_manual(values = c("red", "royalblue")) +
theme_minimal()
ggplotly(age_interactive)
Stroke Proportion by Hypertension Status
This interactive stacked bar chart shows the proportion of stroke cases among individuals with and without hypertension. It helps examine whether stroke occurrence appears more common among those with hypertension.
People with hypertension (1) have a higher proportion of stroke.
The stroke portion (pink) is visibly larger in hypertension group.
People without hypertension have very low stroke proportion.
Indicates hypertension is associated with stroke.
hypertension_interactive <- ggplot(df, aes(x = factor(hypertension), fill = factor(stroke))) +
geom_bar(position = "fill") +
labs(
title = "Stroke Proportion by Hypertension Status",
x = "Hypertension (0 = No, 1 = Yes)",
y = "Proportion",
fill = "Stroke (0 = No, 1 = Yes)"
) +
scale_fill_manual(values = c("orange", "pink")) +
theme_minimal()
ggplotly(hypertension_interactive)Age vs Average Glucose Level by Stroke Status
This interactive scatter-plot shows the relationship between age and average glucose level, colored by stroke status. It allows us to explore whether higher age or glucose levels are associated with stroke cases.
Stroke cases (pink) are mostly:
Older individuals
Often with higher glucose levels
Higher glucose values are more common among stroke group
There is a pattern clustering:
- Stroke points appear more in upper age and higher glucose areas.
Suggests combined effect of age and glucose.
age_glucose <- ggplot(df, aes(
x = age,
y = avg_glucose_level,
color = factor(stroke)
)) +
geom_point() +
labs(
title = "Age vs Glucose Level by Stroke Status",
x = "Age",
y = "Average Glucose Level",
color = "Stroke (0 = No, 1 = Yes)"
) +
scale_color_manual(values = c("lightgreen", "magenta")) +
theme_minimal()
ggplotly(age_glucose)Conclusion
In conclusion, this analysis addressed the research question of which demographic and health-related factors are associated with stroke risk. The results suggest that stroke occurrence is more commonly observed among individuals who are older and those with higher average glucose levels. Additionally, a greater proportion of stroke cases was observed among individuals with hypertension and heart disease, indicating that these health conditions are also associated with increased stroke risk. While these findings do not imply causation, they highlight important patterns that can support early detection, prevention efforts, and data-driven decision-making in healthcare.
Contact Information
lbolden4@students.kennesaw.edu
cwalk158@students.kennesaw.edu
#install.packages("leaflet")
library(leaflet)
leaflet() %>%
addTiles() %>% # base map
addMarkers(
lng = -84.5197,
lat = 33.9391,
popup = "Kennesaw State University"
)Acknowledgment
The findings presented in this project are exclusive to this course and were not in this or previous semesters and will not be presented in any other courses during the semester.