── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.2.0 ✔ readr 2.2.0
✔ forcats 1.0.1 ✔ stringr 1.5.2
✔ ggplot2 4.0.2 ✔ tibble 3.3.0
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.1.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
stroke_data <- read_csv("Stroke_dataset.csv")
Rows: 5110 Columns: 12
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): gender, ever_married, work_type, Residence_type, bmi, smoking_status
dbl (6): id, age, hypertension, heart_disease, avg_glucose_level, stroke
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Stroke is the second leading cause of death worldwide according to the World Health Organization (WHO). Understanding the factors that influence the likelihood of having a stroke is important for public health planning and prevention.
The variables that I will use in this project are stroke as my dependent variable (Y) and BMI, age, and average glucose level (avg_glucose_level) as my independent variables (X). The gender variable will also be used in my visualization code.
The purpose of this project is to explore how age, BMI, and average glucose level relate to the likelihood of having a stroke. Using multiple linear regression, I will analyze the relationship between these variables and visualize patterns to better understand how stroke outcomes vary across the dataset.
Something to keep in mind is that the stroke variable is coded as 0 or 1 instead of words. A 1 means that the patient had a stroke, and a 0 means that they did not.
B. Next, I will remove the missing values from BMI and average glucose level; BMI is the only column that actually contains NAs.
The BMI variable contained values such as “N/A”, which are stored as text and cannot be treated as numbers. When the column is converted to numeric, these values are automatically converted to NA, which represents missing data.
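The cleaning step described above could look roughly like the following. This is a minimal sketch on a small made-up data frame, not the report's actual code: the object names `raw` and `cleaned` and all the values are assumptions for illustration, and only the column names come from the dataset.

```r
library(dplyr)

# Toy stand-in for the raw data (made-up values): bmi arrives as text with "N/A".
raw <- data.frame(
  stroke = c(0, 1, 0, 1),
  age = c(34, 67, 51, 80),
  bmi = c("27.1", "N/A", "31.4", "24.9"),
  avg_glucose_level = c(88.2, 210.5, 101.3, 95.0)
)

cleaned <- raw |>
  # Turn the "N/A" text into a real NA first, so as.numeric() does not warn.
  mutate(bmi = as.numeric(na_if(bmi, "N/A"))) |>
  # Keep only rows that are complete for the two columns of interest.
  filter(!is.na(bmi), !is.na(avg_glucose_level))

str(cleaned)   # bmi is now numeric
nrow(cleaned)  # one row (the "N/A" one) has been dropped
```

If re-reading the file is an option, another possibility is to declare the marker at import time, for example `read_csv("Stroke_dataset.csv", na = c("", "NA", "N/A"))`, so that readr parses the column as numeric from the start.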
The results of the multiple linear regression show that all three predictors are statistically significant, as all of their p-values are below 0.05. Age and average glucose level are highly significant and are meaningful predictors of stroke, both increasing the probability of stroke as expected. However, BMI has a small negative coefficient, which does not match real-world expectations because it suggests that a higher BMI decreases the probability of stroke.
The adjusted R-squared value is 0.063, meaning that the model explains about 6.3% of the variation in stroke outcomes. While the model is statistically significant overall, it explains only a small portion of the variation, indicating that other important factors are not included in the analysis.
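For readers who want to see how such a fit and its adjusted R-squared are obtained, here is a hedged sketch on simulated data. The data frame `toy`, the object `toy_lm`, and the simulation parameters are all assumptions made for illustration; the formula mirrors the one shown in the logistic regression output later in this report, and the numbers it produces are not the report's results.

```r
# Simulated stand-in data (NOT the stroke dataset); column names mirror the report.
set.seed(1)
n <- 500
toy <- data.frame(
  age = runif(n, 20, 85),
  bmi = rnorm(n, 28, 5),
  avg_glucose_level = rnorm(n, 105, 30)
)
# Binary outcome whose probability rises with age (arbitrary illustrative values).
toy$stroke <- rbinom(n, 1, plogis(-6 + 0.06 * toy$age))

# Linear probability model: ordinary least squares on a 0/1 outcome.
toy_lm <- lm(stroke ~ age + bmi + avg_glucose_level, data = toy)

coef(summary(toy_lm))          # estimates, standard errors, t values, p-values
summary(toy_lm)$adj.r.squared  # adjusted R-squared
range(fitted(toy_lm))          # fitted values are not restricted to [0, 1]
```

Because the outcome is binary, each coefficient is read as a change in the probability of stroke per one-unit change in the predictor, and nothing keeps the fitted values inside the 0 to 1 range.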
4. Diagnostic Plots
In this part I will create and arrange 4 plots in one window for my model.
par(mfrow = c(2, 2))
plot(stroke_lm)
par(mfrow = c(1, 1))
To check how well the model fits, I examine the four diagnostic plots. In the Residuals vs Fitted plot, the points should be scattered randomly around zero; this mostly happens here, but there is a slight curve, which suggests the model may be missing some of the structure in how age, BMI, and glucose relate to stroke. The Q-Q plot shows whether the residuals (the differences between the observed stroke values and the model’s predictions) follow a normal distribution. Most points follow the line well, but some points at the ends deviate from it. The Scale-Location plot widens slightly as the fitted values increase, which means the spread of the residuals is not constant across the range of predictions. Finally, the Residuals vs Leverage plot helps identify whether any single data point has too much influence on the model. Here, all points are within safe limits, so no single patient is distorting the results. Overall, the model works acceptably but could be improved.
5. Visualization plot
I will order the levels manually based on their counts, from smallest to largest. This way the histogram shows the categories clearly, without the overlap making them too hard to tell apart.
The dataset includes a small number of individuals labeled as “Other” for gender. To represent the data accurately I keep this category in the code, even though it contains very few cases.
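One way to build the combined stroke and gender variable and order its levels by count is sketched below. The toy data and the label text ("Stroke", "No stroke") are assumptions for illustration; only the column names `stroke`, `gender`, and `stroke_gender` appear in this report.

```r
# Toy data (made-up values) with the same column names as the report.
toy <- data.frame(
  stroke = c(0, 0, 0, 1, 0, 1, 0, 0),
  gender = c("Female", "Female", "Male", "Female", "Male", "Male", "Female", "Other")
)

# Combine stroke status and gender into one label per row (hypothetical labels).
toy$stroke_gender <- paste(
  ifelse(toy$stroke == 1, "Stroke", "No stroke"),
  toy$gender,
  sep = ", "
)

# Count each category and sort the counts from smallest to largest.
counts <- sort(table(toy$stroke_gender))

# Use that order as the factor levels, so ggplot2 follows it for fill and legend.
toy$stroke_gender <- factor(toy$stroke_gender, levels = names(counts))

levels(toy$stroke_gender)
```

With the levels set this way, the colors passed to `scale_fill_manual()` are matched to the categories in the same smallest-to-largest order.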
ggplot(stroke_df, aes(x = age, fill = stroke_gender)) +
  geom_histogram(bins = 30, color = "black", alpha = 0.6) +
  labs(
    title = "Age Distribution by Stroke and Gender",
    x = "Age (years)",
    y = "Count",
    fill = "Stroke & Gender",
    caption = "Source: Stroke Prediction Dataset (original source unknown)"
  ) +
  scale_fill_manual(values = c("skyblue", "darkgreen", "yellow", "red", "darkblue")) +
  theme_minimal()
This graph shows the distribution of ages in the dataset, separated by stroke status and gender. The graph makes it clear that most people observed did not have a stroke, although the number of stroke cases increases with age, especially after age 50. This pattern aligns with the regression model, which identifies age as a significant factor in increasing stroke occurrence. The red and yellow bars show the males and females with an outcome of 1, that is, the cases positive for stroke.
6. Final Reflection
To prepare the dataset for analysis, I first selected only the relevant variables: stroke status, age, BMI, gender, and average glucose level. I noticed that the BMI variable was recorded as a character type and included some missing values represented as “N/A”. I converted this variable to numeric, which automatically turned these “N/A” entries into NA. Then, I filtered out all rows with missing values in BMI or average glucose level to ensure the regression and visualization were based on complete cases. This step removed some observations but improved the accuracy and reliability of the analysis. I also inspected the structure of the cleaned dataset to confirm that all variables were of the appropriate types for modeling.
The visualization displays the age distribution of individuals grouped by stroke occurrence and gender. It clearly shows that stroke cases increase with age, especially after age 50, which supports the regression model’s finding that age is a significant predictor of stroke risk. The plot also reflects that the dataset contains more females than males, with stroke present in both genders. Including the small “Other” gender category maintains transparency and completeness in representing the data. This visualization helped to highlight the age-related risk patterns in an intuitive way, making it easier to grasp the impact of age and gender on stroke occurrence. One interesting observation is the relatively low stroke incidence in younger age groups, emphasizing the strong age dependence of stroke.
One challenge was handling missing BMI data; filtering them improved accuracy but reduced the dataset size. Scatterplots for age, BMI, and glucose were too cluttered, so I used a histogram for clarity. While the project required a multiple linear regression, I also explored logistic regression because stroke is a binary outcome. Logistic regression models probabilities between 0 and 1, making the results easier to interpret in terms of stroke risk. This comparison of models provided additional insight and shows how different statistical methods can complement the main analysis.
Call:
glm(formula = stroke ~ age + bmi + avg_glucose_level, family = "binomial",
data = stroke_df)
Coefficients:
                   Estimate Std. Error z value Pr(>|z|)
(Intercept)       -8.043403   0.538824 -14.928  < 2e-16 ***
age                0.072271   0.005531  13.066  < 2e-16 ***
bmi                0.005563   0.011558   0.481     0.63
avg_glucose_level  0.005455   0.001263   4.320 1.56e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1728.4 on 4908 degrees of freedom
Residual deviance: 1387.7 on 4905 degrees of freedom
AIC: 1395.7
Number of Fisher Scoring iterations: 7
exp(coef(log_model))
      (Intercept)               age               bmi avg_glucose_level
      0.000321214       1.074946140       1.005578648       1.005469667
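The exponentiated coefficients above can be read as odds ratios. The short sketch below reuses the estimates printed in the logistic regression output to show the arithmetic; the vector `b` and the example patient (age 70, BMI 28, glucose 100) are assumptions chosen only for illustration.

```r
# Coefficients copied from the logistic regression output above (log-odds scale).
b <- c(intercept = -8.043403, age = 0.072271, bmi = 0.005563,
       avg_glucose_level = 0.005455)

# Odds ratio for a one-year increase in age, holding the other predictors fixed.
or_age_1 <- exp(b[["age"]])

# Odds ratios multiply, so ten years corresponds to exp(10 * coefficient).
or_age_10 <- exp(10 * b[["age"]])

# Predicted probability for a hypothetical patient (illustrative values only).
eta <- b[["intercept"]] + b[["age"]] * 70 + b[["bmi"]] * 28 +
  b[["avg_glucose_level"]] * 100
p_hat <- plogis(eta)  # inverse logit: 1 / (1 + exp(-eta))

round(c(or_age_1 = or_age_1, or_age_10 = or_age_10, p_hat = p_hat), 3)
```

With these estimates, each additional year of age multiplies the odds of stroke by about 1.075, ten additional years roughly double them (about 2.06), and the hypothetical patient has a predicted stroke probability of about 9%.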
Comparing multiple linear regression and logistic regression shows how each model describes the effect of age, BMI, and average glucose level on stroke risk. In the linear regression, BMI is statistically significant (coefficient = -0.00128, p = 0.000744). In the logistic regression, however, BMI is not significant (p = 0.63), showing no clear effect on the likelihood of stroke, which is more realistic. Both models agree that age and average glucose level are significant predictors that increase stroke risk, as expected. The linear model gives an adjusted R² of 0.063, meaning it explains about 6.3% of the variation in stroke outcomes, while the logistic model does not provide a traditional R² value. We use logistic regression because stroke is a binary outcome, and logistic regression models probabilities between 0 and 1, giving results that are more realistic and easier to interpret in terms of the odds of having a stroke.
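Although the logistic model has no traditional R², its printed deviances allow a rough analogue. The sketch below computes McFadden's pseudo-R² and a likelihood-ratio test from the null and residual deviance reported in the output; this is a supplementary calculation rather than part of the original analysis, and the variable names are assumptions.

```r
# Deviances and degrees of freedom copied from the glm output above.
null_dev  <- 1728.4  # intercept-only model, 4908 degrees of freedom
resid_dev <- 1387.7  # model with age, bmi, avg_glucose_level, 4905 degrees of freedom

# McFadden's pseudo-R-squared: proportional reduction in deviance.
mcfadden_r2 <- 1 - resid_dev / null_dev

# Likelihood-ratio test of the full model against the intercept-only model.
lr_stat <- null_dev - resid_dev
df_diff <- 4908 - 4905
p_value <- pchisq(lr_stat, df = df_diff, lower.tail = FALSE)

round(mcfadden_r2, 3)
p_value
```

The pseudo-R² comes out at about 0.197. It is not on the same scale as the linear model's adjusted R² of 0.063, so the two numbers should not be compared directly.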