Stroke Dataset Analysis

Author

Leyla Cuenca

Dataset

# Load the library and the dataset
library(tidyverse)
Warning: package 'tidyverse' was built under R version 4.5.2
Warning: package 'ggplot2' was built under R version 4.5.2
Warning: package 'readr' was built under R version 4.5.2
Warning: package 'dplyr' was built under R version 4.5.2
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.0     ✔ readr     2.2.0
✔ forcats   1.0.1     ✔ stringr   1.5.2
✔ ggplot2   4.0.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
stroke_data <- read_csv("Stroke_dataset.csv")
Rows: 5110 Columns: 12
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): gender, ever_married, work_type, Residence_type, bmi, smoking_status
dbl (6): id, age, hypertension, heart_disease, avg_glucose_level, stroke

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(stroke_data)
# A tibble: 6 × 12
     id gender   age hypertension heart_disease ever_married work_type    
  <dbl> <chr>  <dbl>        <dbl>         <dbl> <chr>        <chr>        
1  9046 Male      67            0             1 Yes          Private      
2 51676 Female    61            0             0 Yes          Self-employed
3 31112 Male      80            0             1 Yes          Private      
4 60182 Female    49            0             0 Yes          Private      
5  1665 Female    79            1             0 Yes          Self-employed
6 56669 Male      81            0             0 Yes          Private      
# ℹ 5 more variables: Residence_type <chr>, avg_glucose_level <dbl>, bmi <chr>,
#   smoking_status <chr>, stroke <dbl>

1. Introduction

Stroke is the second leading cause of death worldwide according to the World Health Organization (WHO). Understanding the factors that influence the likelihood of having a stroke is important for public health planning and prevention.

The variables I will use in this project are stroke as my dependent variable (Y) and BMI, age, and avg_glucose_level as my independent variables (X). The gender variable will also be used in my visualization code.

The purpose of this project is to explore how age, BMI, and average glucose level relate to the likelihood of having a stroke. Using multiple linear regression, I will analyze the relationship between these variables and visualize patterns to better understand how stroke outcomes vary across the dataset.

Source:

The dataset used in this project was compiled and shared by the user fedesoriano. The original name of the csv file is “healthcare-dataset-stroke-data” and it is published on Kaggle: https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset

2. Cleaning the Dataset

A. First, I will use the select() function to keep only average glucose level, age, BMI, stroke, and gender for the analysis.

stroke_df <- stroke_data |>
  select(stroke, age, bmi, gender, avg_glucose_level)
stroke_df
# A tibble: 5,110 × 5
   stroke   age bmi   gender avg_glucose_level
    <dbl> <dbl> <chr> <chr>              <dbl>
 1      1    67 36.6  Male               229. 
 2      1    61 N/A   Female             202. 
 3      1    80 32.5  Male               106. 
 4      1    49 34.4  Female             171. 
 5      1    79 24    Female             174. 
 6      1    81 29    Male               186. 
 7      1    74 27.4  Male                70.1
 8      1    69 22.8  Female              94.4
 9      1    59 N/A   Female              76.2
10      1    78 24.2  Female              58.6
# ℹ 5,100 more rows

Something to keep in mind is that the stroke variable is coded as 0 or 1 instead of words: a 1 means the patient had a stroke, and a 0 means they did not.

B. Next, I will remove the rows with missing values. The filter checks both BMI and average glucose level, although BMI is the only column with NAs.

stroke_df$bmi <- as.numeric(stroke_df$bmi)
Warning: NAs introduced by coercion
stroke_df <- stroke_df |>
  filter(!is.na(bmi), !is.na(avg_glucose_level))

The BMI variable contained values such as “N/A”, which are stored as text and cannot be treated as numbers. When the column was converted to numeric, these values were automatically converted to NA, representing missing data.
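As a small illustration of the coercion described above, the sketch below uses a made-up character vector (not rows from the dataset) to show how “N/A” strings become NA and how to count them. It also shows readr's parse_double() with an explicit na argument as one possible way to do the same conversion without the coercion warning; that alternative is my assumption for illustration, not the method used in this project.

```r
library(readr)

# Toy values for illustration only; these are not rows from the stroke dataset
bmi_text <- c("36.6", "N/A", "32.5", "N/A", "24")

# Base R coercion: non-numeric strings become NA (normally with a warning)
bmi_base <- suppressWarnings(as.numeric(bmi_text))

# readr alternative: declare "N/A" as a missing-value marker up front
bmi_readr <- parse_double(bmi_text, na = c("", "NA", "N/A"))

# Count the missing values and check both approaches flag the same positions
n_missing <- sum(is.na(bmi_base))
same_positions <- identical(is.na(bmi_base), is.na(bmi_readr))
print(n_missing)
print(same_positions)
```

Counting the NAs before filtering makes it easy to report how many rows the cleaning step will drop.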

C. Inspecting the cleaned data.

str(stroke_df)
tibble [4,909 × 5] (S3: tbl_df/tbl/data.frame)
 $ stroke           : num [1:4909] 1 1 1 1 1 1 1 1 1 1 ...
 $ age              : num [1:4909] 67 80 49 79 81 74 69 78 81 61 ...
 $ bmi              : num [1:4909] 36.6 32.5 34.4 24 29 27.4 22.8 24.2 29.7 36.8 ...
 $ gender           : chr [1:4909] "Male" "Male" "Female" "Female" ...
 $ avg_glucose_level: num [1:4909] 229 106 171 174 186 ...

3. Multiple Linear Regression Model

#Fitting the model
stroke_lm <- lm(stroke ~ age + bmi + avg_glucose_level, data = stroke_df)
summary(stroke_lm)

Call:
lm(formula = stroke ~ age + bmi + avg_glucose_level, data = stroke_df)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.19032 -0.07265 -0.03326  0.00542  1.03886 

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)       -5.256e-02  1.175e-02  -4.471 7.95e-06 ***
age                2.029e-03  1.337e-04  15.184  < 2e-16 ***
bmi               -1.279e-03  3.789e-04  -3.375 0.000744 ***
avg_glucose_level  4.282e-04  6.499e-05   6.589 4.91e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1954 on 4905 degrees of freedom
Multiple R-squared:  0.06365,   Adjusted R-squared:  0.06308 
F-statistic: 111.1 on 3 and 4905 DF,  p-value: < 2.2e-16

Regression equation:

stroke = -0.0526 + 0.00203(age) − 0.00128(bmi) + 0.000428(avg_glucose_level)

The results of the multiple linear regression show that all three predictors are statistically significant, as all of their p-values are below 0.05. Age and average glucose level are highly significant and have positive coefficients, so both increase the predicted probability of stroke, as expected. However, BMI has a small negative coefficient, which does not match real-world expectations because it suggests that a higher BMI decreases the probability of stroke.

The adjusted R-squared value is 0.063, meaning that the model explains about 6.3% of the variation in stroke outcomes. While the model is statistically significant overall (F = 111.1, p < 2.2e-16), it explains only a small portion of the variation, indicating that other important factors are not included in the analysis.
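To make the regression equation concrete, the sketch below plugs two hypothetical patients (values I made up for illustration, not rows from the dataset) into the coefficients reported in the summary above. The helper function name is also my own. For a young patient with low glucose the predicted value can fall below 0, which is not a valid probability; this is a known limitation of using linear regression for a 0/1 outcome.

```r
# Coefficients copied from the lm() summary above
coefs <- c(intercept = -5.256e-02, age = 2.029e-03,
           bmi = -1.279e-03, avg_glucose_level = 4.282e-04)

# Hypothetical helper (for illustration) that evaluates the fitted equation
predict_stroke_lm <- function(age, bmi, glucose) {
  unname(coefs["intercept"] + coefs["age"] * age +
           coefs["bmi"] * bmi + coefs["avg_glucose_level"] * glucose)
}

# Two made-up patients
pred_young <- predict_stroke_lm(age = 20, bmi = 30, glucose = 80)
pred_old   <- predict_stroke_lm(age = 80, bmi = 25, glucose = 200)

print(c(young = pred_young, old = pred_old))
```

The residual summary is consistent with this: the largest residual, 1.03886, can only occur for a patient with stroke = 1 and a fitted value of about -0.039.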

4. Diagnostic Plots

In this part I will create and arrange 4 plots in one window for my model.

par(mfrow = c(2, 2))
plot(stroke_lm)

par(mfrow = c(1,1))

To see if the model is doing a good job, I analyze these plots. In the Residuals vs Fitted plot, the points should be scattered randomly around zero, which mostly happens here, but there is a small curve. This means the model might be missing some details in how age, BMI, and glucose relate to stroke. The Q-Q plot shows whether the residuals (the differences between the real stroke values and the model’s predictions) follow a normal distribution. Most points follow the line well, so this is mostly true, but some points at the ends don’t fit perfectly. The Scale-Location plot gets a little wider as predictions increase, which means the spread of the residuals varies a bit depending on the patient’s predicted risk. Finally, the Residuals vs Leverage plot helps find out whether any single data point has too much influence on the model. Here, all points are within safe limits, so no single patient is distorting the results. Overall, the model works okay but could be improved.
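One reason these diagnostic plots can look unusual for this model is that stroke only takes the values 0 and 1, so every residual equals either minus the fitted value or one minus the fitted value. The sketch below illustrates this on simulated data (random numbers generated here, not the stroke dataset); the variable names and simulation settings are assumptions chosen only for illustration.

```r
set.seed(123)

# Simulated data for illustration only; not the stroke dataset
n <- 500
sim_age <- runif(n, min = 20, max = 80)
# Assumed relationship: event probability rises with age along a logistic curve
sim_prob <- plogis(-6 + 0.07 * sim_age)
sim_y <- rbinom(n, size = 1, prob = sim_prob)

# Linear regression on a 0/1 outcome, as in the model above
sim_lm <- lm(sim_y ~ sim_age)

# residual + fitted reconstructs the outcome, so it is always 0 or 1.
# In a Residuals vs Fitted plot this appears as two parallel bands.
reconstructed <- round(resid(sim_lm) + fitted(sim_lm), 8)
two_bands <- all(reconstructed %in% c(0, 1))
print(two_bands)

# Fitted values from lm() are not restricted to the 0-1 range
print(range(fitted(sim_lm)))
```

Because of this two-band structure, the residuals of a linear model on a binary outcome cannot be truly normal or have constant variance, which is worth keeping in mind when reading the four plots.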

5. Visualization plot

I will order the levels manually based on counts, from largest to smallest. This way I can plot a histogram that shows the categories clearly, without the overlapping groups being too hard to differentiate.

stroke_df <- stroke_df |>
  mutate( stroke_gender = factor( interaction(stroke, gender),
                                  levels = c("0.Female", "0.Male", "1.Female"
                                             , "1.Male", "0.Other", "1.Other"
                                             )))

The dataset includes a small number of individuals labeled “Other” for gender. To represent the data accurately I keep this category in the code, even though it is a very small number of cases.
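For readers unfamiliar with interaction(), the short sketch below shows on made-up vectors (not the dataset) how it joins the two variables with a dot, which is where labels such as “0.Female” come from, and how the levels argument of factor() then sets the order.

```r
# Toy values for illustration only
toy_stroke <- c(0, 1, 0, 1, 0)
toy_gender <- c("Female", "Male", "Male", "Female", "Other")

# interaction() joins the two variables with "." by default
combo <- interaction(toy_stroke, toy_gender)
print(as.character(combo))

# factor() with explicit levels fixes the order used in the legend and stacking
ordered_combo <- factor(combo,
                        levels = c("0.Female", "0.Male", "1.Female",
                                   "1.Male", "0.Other", "1.Other"))
print(levels(ordered_combo))

# Levels with no observations (here "1.Other") simply get a count of zero
print(table(ordered_combo))
```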

ggplot(stroke_df, aes(x = age, fill = stroke_gender)) +
  geom_histogram(bins = 30, color = "black", alpha = 0.6) +
  labs(
    title = "Age Distribution by Stroke and Gender",
    x = "Age (years)",
    y = "Count",
    fill = "Stroke & Gender",
    caption = "Source: Stroke Prediction Dataset (original source unknown)"
  ) +
  # Named colors keep each fill tied to its level regardless of level order
  scale_fill_manual(values = c("0.Female" = "skyblue", "0.Male" = "darkgreen",
                               "1.Female" = "yellow", "1.Male" = "red",
                               "0.Other" = "darkblue")) +
  theme_minimal()

This graph shows the distribution of ages in the dataset, separated by stroke outcome and gender. It is clear that most people observed did not have a stroke, although stroke cases increase with age, especially after age 50. This pattern aligns with the regression model, which identifies age as a significant predictor of stroke. In this graph, the red and yellow bars show the males and females with outcome 1, which are the cases positive for stroke.

6. Final Reflection

To prepare the dataset for analysis, I first selected only the relevant variables: stroke status, age, BMI, gender, and average glucose level. I noticed that the BMI variable was recorded as a character type and included some missing values represented as “N/A”. I converted this variable to numeric, which automatically turned these “N/A” entries into NA. Then, I filtered out all rows with missing values in BMI or average glucose level to ensure the regression and visualization were based on complete cases. This step removed 201 of the 5,110 observations, leaving 4,909 complete cases for the analysis. I also inspected the structure of the cleaned dataset to confirm that all variables were of the appropriate types for modeling.

The visualization displays the age distribution of individuals grouped by stroke occurrence and gender. It clearly shows that stroke cases increase with age, especially after age 50, which supports the regression model’s finding that age is a significant predictor of stroke risk. The plot also reflects that the dataset contains more females than males, with stroke present in both genders. Including the small “Other” gender category maintains transparency and completeness in representing the data. This visualization helped to highlight the age-related risk patterns in an intuitive way, making it easier to grasp how stroke occurrence varies with age and gender. One interesting observation is the relatively low stroke incidence in younger age groups, emphasizing the strong age dependence of stroke.

One challenge was handling missing BMI data; filtering out those rows kept the analysis to complete cases but reduced the dataset size. Scatterplots for age, BMI, and glucose were too cluttered, so I used a histogram for clarity. While the project required a multiple linear regression, I also explored logistic regression because stroke is a binary outcome. Logistic regression models probabilities between 0 and 1, making the results easier to interpret in terms of stroke risk. This comparison of models provided additional insight and shows how different statistical methods can complement the main analysis.

Logistic Regression Model

# Logistic regression
log_model <- glm(stroke ~ age + bmi + avg_glucose_level,
                 data = stroke_df,
                 family = "binomial")

summary(log_model)

Call:
glm(formula = stroke ~ age + bmi + avg_glucose_level, family = "binomial", 
    data = stroke_df)

Coefficients:
                   Estimate Std. Error z value Pr(>|z|)    
(Intercept)       -8.043403   0.538824 -14.928  < 2e-16 ***
age                0.072271   0.005531  13.066  < 2e-16 ***
bmi                0.005563   0.011558   0.481     0.63    
avg_glucose_level  0.005455   0.001263   4.320 1.56e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1728.4  on 4908  degrees of freedom
Residual deviance: 1387.7  on 4905  degrees of freedom
AIC: 1395.7

Number of Fisher Scoring iterations: 7
exp(coef(log_model))
      (Intercept)               age               bmi avg_glucose_level 
      0.000321214       1.074946140       1.005578648       1.005469667 
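The exponentiated coefficients above are odds ratios for a one-unit change in each predictor. As a worked example using only the age coefficient reported in the summary, the sketch below converts it into a percentage change in odds per year and an odds ratio for a 10-year difference; the 10-year step is my own choice for illustration.

```r
# Age coefficient (log-odds per year) copied from the glm() summary above
beta_age <- 0.072271

# Odds ratio for one additional year of age
or_per_year <- exp(beta_age)

# Percentage increase in the odds of stroke per year
pct_per_year <- (or_per_year - 1) * 100

# Odds ratio for a 10-year difference: scale the coefficient, then exponentiate
or_per_decade <- exp(10 * beta_age)

print(round(c(or_per_year = or_per_year,
              pct_per_year = pct_per_year,
              or_per_decade = or_per_decade), 3))
```

With these numbers, each extra year of age multiplies the odds of stroke by about 1.075 (roughly a 7.5% increase), holding BMI and glucose constant, and ten extra years roughly doubles the odds.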

Comparing the multiple linear regression and the logistic regression shows how each model treats age, BMI, and average glucose level. In the linear regression, BMI is statistically significant (coefficient = -0.00128, p = 0.000744). In the logistic regression, however, BMI is not significant (p = 0.63), showing no clear effect on the likelihood of stroke, which is more realistic. Both models agree that age and average glucose level are significant predictors that increase stroke risk, as expected. The linear model gives an adjusted R² of 0.063, meaning it explains about 6.3% of the variation in stroke outcomes, while the logistic model does not provide a traditional R² value. Logistic regression is the better fit here because stroke is a binary outcome: it models probabilities between 0 and 1, giving results that are more realistic and can be interpreted in terms of the odds of having a stroke.
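Although glm() does not report an R², one commonly used substitute is McFadden's pseudo-R², which compares the residual deviance to the null deviance. The sketch below computes it from the two deviances printed in the summary above; it is not directly comparable to the adjusted R² of the linear model. It also evaluates the logistic equation for one hypothetical patient (made-up values, not a row from the dataset) to show that the predicted probability stays between 0 and 1.

```r
# Deviances copied from the glm() summary above
null_deviance     <- 1728.4
residual_deviance <- 1387.7

# McFadden's pseudo-R-squared: 1 - (residual deviance / null deviance)
mcfadden_r2 <- 1 - residual_deviance / null_deviance
print(round(mcfadden_r2, 3))

# Logistic coefficients copied from the same summary
b <- c(intercept = -8.043403, age = 0.072271,
       bmi = 0.005563, avg_glucose_level = 0.005455)

# Hypothetical patient: age 80, BMI 25, average glucose 200
log_odds <- b["intercept"] + b["age"] * 80 + b["bmi"] * 25 +
  b["avg_glucose_level"] * 200

# plogis() is the inverse logit: it maps log-odds to a probability
prob <- unname(plogis(log_odds))
print(round(prob, 3))
```

Unlike the linear equation, the inverse logit can never return a value below 0 or above 1, whatever patient values are plugged in.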