knitr::opts_chunk$set(echo = FALSE) # False when reporting
library(readr)
library(ggplot2)
library(car)
## Loading required package: carData
library(leaps)
stroke_risk_dataset <- read_csv("~/STAT 840 Projects/FINAL PROJECT/stroke_risk_dataset_v2.csv")
## Rows: 35000 Columns: 19
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): gender
## dbl (18): age, chest_pain, high_blood_pressure, irregular_heartbeat, shortne...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
I. Introduction
A. Study Design This study investigates whether stroke-related symptoms are associated with age among patients in the stroke risk data set. Understanding how these symptoms relate to age may provide valuable insight into stroke risk and prevention. The response variable Yᵢ is the age of patient i. The predictors Xᵢⱼ are binary stroke-related symptom indicators: chest pain, high blood pressure, shortness of breath, irregular heartbeat, fatigue/weakness, dizziness, swelling, neck/jaw pain, excessive sweating, persistent cough, nausea and vomiting, chest discomfort, cold hands/feet, snoring or sleep apnea, and anxiety. The data set contains n = 35,000 patient observations. Each symptom variable is coded 0 for absence of the symptom and 1 for presence of the symptom.
Model Assumptions The multiple linear regression model assumes that the relationship between age and the stroke-related symptoms is linear, that observations are independent, that the residuals are normally distributed with constant variance (homoscedasticity) across fitted values, and that the predictors are not multicollinear.
Regression Equation Yᵢ = age of individual i; β₀ = intercept; βⱼ = coefficient for predictor j; Xᵢⱼ = value of predictor j for individual i; εᵢ = random error.
Yᵢ = β₀ + β₁(Chest Pain)ᵢ + β₂(High Blood Pressure)ᵢ + β₃(Shortness of Breath)ᵢ + β₄(Irregular Heartbeat)ᵢ + β₅(Fatigue/Weakness)ᵢ + β₆(Dizziness)ᵢ + β₇(Swelling/Edema)ᵢ + β₈(Neck/Jaw Pain)ᵢ + β₉(Excessive Sweating)ᵢ + β₁₀(Persistent Cough)ᵢ + β₁₁(Nausea/Vomiting)ᵢ + β₁₂(Chest Discomfort)ᵢ + β₁₃(Cold Hands/Feet)ᵢ + β₁₄(Sleep Apnea)ᵢ + β₁₅(Anxiety Doom)ᵢ + εᵢ
B. Aims The purpose of this study is to ask whether stroke-related symptoms are associated with age. The analysis evaluates whether individuals reporting stroke-related symptoms tend to have different predicted ages than those who do not report those symptoms.
A. Preliminary Model A preliminary multiple linear regression model was fit using all stroke-related symptoms considered clinically relevant to age prediction. Diagnostic procedures and predictor screening methods were used to evaluate the model assumptions, the significance of each variable, and multicollinearity.
##
## Call:
## lm(formula = age ~ chest_pain + high_blood_pressure + shortness_of_breath +
## irregular_heartbeat + fatigue_weakness + dizziness + swelling_edema +
## neck_jaw_pain + excessive_sweating + persistent_cough + nausea_vomiting +
## chest_discomfort + cold_hands_feet + snoring_sleep_apnea +
## anxiety_doom, data = stroke_risk_dataset)
##
## Coefficients:
##          (Intercept)           chest_pain  high_blood_pressure
##               30.591                3.871                4.533
##  shortness_of_breath  irregular_heartbeat     fatigue_weakness
##                2.870                4.861                2.729
##            dizziness       swelling_edema        neck_jaw_pain
##                3.163                3.578                5.028
##   excessive_sweating     persistent_cough      nausea_vomiting
##                2.936                3.775                2.923
##     chest_discomfort      cold_hands_feet  snoring_sleep_apnea
##                3.833                3.056                4.139
##         anxiety_doom
##                2.934
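The call shown in the output above can be sketched on a small scale as follows. Because the original CSV is not reproduced here, the block uses simulated stand-in data: the column names follow the report, but the sample size, symptom prevalences, and generating coefficients are assumptions chosen only for illustration, and only three of the fifteen symptoms are included.

```r
# Simulated stand-in data; the generating values below are hypothetical.
set.seed(840)
n <- 1000
sim <- data.frame(
  chest_pain          = rbinom(n, 1, 0.15),
  high_blood_pressure = rbinom(n, 1, 0.25),
  dizziness           = rbinom(n, 1, 0.19)
)

# Hypothetical data-generating model: baseline age plus a shift per symptom
sim$age <- 30 + 4 * sim$chest_pain + 4.5 * sim$high_blood_pressure +
  3 * sim$dizziness + rnorm(n, sd = 10)

# Multiple linear regression of age on the binary symptom indicators
fit <- lm(age ~ chest_pain + high_blood_pressure + dizziness, data = sim)

# The intercept estimates mean age with no symptoms; each slope estimates
# the difference in mean age between symptom present (1) and absent (0)
print(coef(fit))
```

With the real data, the same pattern would use `stroke_risk_dataset` and all fifteen symptom columns, as in the `Call:` line of the output.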
B. Final Model The final model retained all stroke-related predictors because each was clinically relevant to the study aim and contributed to evaluating the overall relationship between age and stroke symptoms. Predictor screening showed no evidence of severe multicollinearity, and every predictor was statistically significant. The final fitted model was therefore the same as the preliminary model.
The overall model was statistically significant, F(15, 34984) = 857.1873, p < 0.001, indicating that at least one stroke-related symptom was significantly associated with age. The model explained approximately 26.88% of the variability in age, with an adjusted R^2 of 26.84%. This suggests that the symptom predictors provide meaningful explanatory information about variation in age, although a substantial amount of age variability remains unexplained by symptoms alone.
The hypotheses for the overall model were:
H0: No stroke-related symptom is associated with age (β₁ = β₂ = … = β₁₅ = 0).
Ha: At least one stroke-related symptom is associated with age (at least one βⱼ ≠ 0).
Because the overall F-test was significant, the null hypothesis was rejected. This provides evidence that age is associated with at least one stroke-related symptom predictor.
Summary/Conclusion: The multiple linear regression analysis identified a statistically significant association between stroke-related symptoms and age. All symptom variables had positive coefficients, with high blood pressure showing one of the strongest associations in the final model (largest t-statistic, 36.73). Although the adjusted R^2 indicated that the model explained only part of the variability in age, the diagnostics showed no multicollinearity concerns and no unduly influential observations, which supports the stability of the estimates. The residual plots did suggest some heteroscedasticity and heavier tails than a normal distribution. Because the data set was synthetic and age is influenced by many additional factors not included in the analysis, the results should be interpreted as evidence of association rather than causation. Future research using real-world clinical data and additional predictors may improve predictive performance and model generalization.
##
## Call:
## lm(formula = age ~ chest_pain + high_blood_pressure + shortness_of_breath +
## irregular_heartbeat + fatigue_weakness + dizziness + swelling_edema +
## neck_jaw_pain + excessive_sweating + persistent_cough + nausea_vomiting +
## chest_discomfort + cold_hands_feet + snoring_sleep_apnea +
## anxiety_doom, data = stroke_risk_dataset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -31.879 -7.423 -1.535 6.576 48.483
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30.59130 0.09034 338.62 <2e-16 ***
## chest_pain 3.87145 0.15059 25.71 <2e-16 ***
## high_blood_pressure 4.53283 0.12342 36.73 <2e-16 ***
## shortness_of_breath 2.86985 0.13555 21.17 <2e-16 ***
## irregular_heartbeat 4.86091 0.17869 27.20 <2e-16 ***
## fatigue_weakness 2.72947 0.12354 22.09 <2e-16 ***
## dizziness 3.16339 0.13515 23.41 <2e-16 ***
## swelling_edema 3.57820 0.15069 23.75 <2e-16 ***
## neck_jaw_pain 5.02797 0.17762 28.31 <2e-16 ***
## excessive_sweating 2.93633 0.17857 16.44 <2e-16 ***
## persistent_cough 3.77514 0.17257 21.88 <2e-16 ***
## nausea_vomiting 2.92280 0.17855 16.37 <2e-16 ***
## chest_discomfort 3.83257 0.15148 25.30 <2e-16 ***
## cold_hands_feet 3.05599 0.13425 22.76 <2e-16 ***
## snoring_sleep_apnea 4.13918 0.15039 27.52 <2e-16 ***
## anxiety_doom 2.93380 0.17773 16.51 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.891 on 34984 degrees of freedom
## Multiple R-squared: 0.2688, Adjusted R-squared: 0.2684
## F-statistic: 857.2 on 15 and 34984 DF, p-value: < 2.2e-16
## Analysis of Variance Table
##
## Response: age
## Df Sum Sq Mean Sq F value Pr(>F)
## chest_pain 1 132450 132450 1353.88 < 2.2e-16 ***
## high_blood_pressure 1 259708 259708 2654.68 < 2.2e-16 ***
## shortness_of_breath 1 81173 81173 829.74 < 2.2e-16 ***
## irregular_heartbeat 1 114592 114592 1171.33 < 2.2e-16 ***
## fatigue_weakness 1 71927 71927 735.22 < 2.2e-16 ***
## dizziness 1 74553 74553 762.07 < 2.2e-16 ***
## swelling_edema 1 78582 78582 803.25 < 2.2e-16 ***
## neck_jaw_pain 1 95252 95252 973.64 < 2.2e-16 ***
## excessive_sweating 1 32588 32588 333.11 < 2.2e-16 ***
## persistent_cough 1 58611 58611 599.11 < 2.2e-16 ***
## nausea_vomiting 1 30156 30156 308.25 < 2.2e-16 ***
## chest_discomfort 1 71916 71916 735.11 < 2.2e-16 ***
## cold_hands_feet 1 54403 54403 556.10 < 2.2e-16 ***
## snoring_sleep_apnea 1 75316 75316 769.87 < 2.2e-16 ***
## anxiety_doom 1 26656 26656 272.48 < 2.2e-16 ***
## Residuals 34984 3422498 98
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 2.5 % 97.5 %
## (Intercept) 30.414224 30.768368
## chest_pain 3.576289 4.166605
## high_blood_pressure 4.290922 4.774736
## shortness_of_breath 2.604181 3.135528
## irregular_heartbeat 4.510660 5.211154
## fatigue_weakness 2.487337 2.971607
## dizziness 2.898480 3.428293
## swelling_edema 3.282854 3.873551
## neck_jaw_pain 4.679825 5.376121
## excessive_sweating 2.586325 3.286330
## persistent_cough 3.436894 4.113393
## nausea_vomiting 2.572828 3.272768
## chest_discomfort 3.535664 4.129478
## cold_hands_feet 2.792851 3.319130
## snoring_sleep_apnea 3.844418 4.433948
## anxiety_doom 2.585441 3.282164
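As a check on how the t-values and confidence limits above are related, the following sketch recomputes them for `high_blood_pressure` from the reported estimate and standard error. The variable names are illustrative; the numbers are copied from the summary output, so the results should agree with the reported t-value and interval up to rounding.

```r
# Values copied from the summary output for high_blood_pressure
estimate  <- 4.53283
std_error <- 0.12342
df_resid  <- 34984                      # residual degrees of freedom

# t statistic for H0: coefficient = 0
t_value <- estimate / std_error

# Two-sided 95% interval: estimate +/- t critical value * standard error
t_crit <- qt(0.975, df = df_resid)
ci     <- estimate + c(-1, 1) * t_crit * std_error

print(round(t_value, 2))
print(round(ci, 3))
```

With nearly 35,000 residual degrees of freedom the t critical value is essentially the normal value of about 1.96, so each interval is roughly the estimate plus or minus two standard errors.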
This analysis used a synthetic stroke-related dataset created by Mahatir Ahmed Tusher using medical literature from the American Stroke Association, WHO Global Stroke Reports, Harrison’s Principles of Internal Medicine (20th Edition), and Stroke Prevention, Treatment, and Rehabilitation (Oxford, 2021). Because the dataset was synthetic rather than collected from real patients, the results may not fully represent real-world clinical variability and could introduce simulation-related bias or measurement error[3].
This analysis found evidence of an association between stroke-related symptoms and age. Positive coefficient estimates suggested that individuals reporting symptoms tended to have higher predicted ages than individuals not reporting symptoms, while holding other symptoms constant. High blood pressure demonstrated one of the strongest associations with age based on the coefficient and t-statistic output.
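To make the "holding other symptoms constant" reading concrete, the sketch below computes predicted ages from three of the reported coefficients for hypothetical individuals. The three symptom profiles are illustrative and assume all symptoms not shown are absent (coded 0).

```r
# Coefficients copied from the fitted-model output (subset shown)
b <- c(intercept = 30.59130, high_blood_pressure = 4.53283, chest_pain = 3.87145)

# Hypothetical individuals: first entry multiplies the intercept,
# remaining entries are symptom indicators (1 = present, 0 = absent)
no_symptoms <- c(1, 0, 0)
hbp_only    <- c(1, 1, 0)
hbp_and_cp  <- c(1, 1, 1)

pred <- c(none = sum(b * no_symptoms),
          hbp  = sum(b * hbp_only),
          both = sum(b * hbp_and_cp))
print(pred)
```

The difference between the first two predictions equals the `high_blood_pressure` coefficient, which is the sense in which each coefficient is a shift in predicted age for one symptom with the others held fixed.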
The overall model was statistically significant. The adjusted R^2 indicated that symptoms explained only part of the variation in age. This was expected because age is influenced by many additional factors not included in the model, and binary symptom predictors limit the precision of predicting a continuous outcome such as age.
## # A tibble: 6 × 19
## age gender chest_pain high_blood_pressure irregular_heartbeat
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 22 Male 1 0 0
## 2 52 Male 0 1 1
## 3 63 Female 0 1 0
## 4 41 Male 0 0 1
## 5 53 Male 0 0 0
## 6 28 Female 0 0 0
## # ℹ 14 more variables: shortness_of_breath <dbl>, fatigue_weakness <dbl>,
## # dizziness <dbl>, swelling_edema <dbl>, neck_jaw_pain <dbl>,
## # excessive_sweating <dbl>, persistent_cough <dbl>, nausea_vomiting <dbl>,
## # chest_discomfort <dbl>, cold_hands_feet <dbl>, snoring_sleep_apnea <dbl>,
## # anxiety_doom <dbl>, stroke_risk_percentage <dbl>, at_risk <dbl>
## Total Missing Values: 0
V. Appendix
A. Diagnostics for predictors
The histogram of age shows moderate variability and a slightly right-skewed distribution. The majority of patients are between 25 and 45 years old. The box plots of age by each symptom indicator suggest that individuals reporting a symptom have a higher median age than those not reporting it. Several upper-age outliers were observed in both groups, with a greater concentration among individuals not reporting symptoms.
The correlation matrix showed weak correlations among the predictor variables, suggesting low multicollinearity. All pairwise correlations between symptoms were 0.07 or lower, and the correlations between each symptom and age ranged from 0.11 to 0.24, indicating that the symptom variables provided relatively independent information in the regression model.
## age gender chest_pain high_blood_pressure
## Min. :18.00 Length:35000 Min. :0.0000 Min. :0.0000
## 1st Qu.:30.00 Class :character 1st Qu.:0.0000 1st Qu.:0.0000
## Median :37.00 Mode :character Median :0.0000 Median :0.0000
## Mean :38.63 Mean :0.1459 Mean :0.2519
## 3rd Qu.:46.00 3rd Qu.:0.0000 3rd Qu.:1.0000
## Max. :86.00 Max. :1.0000 Max. :1.0000
## irregular_heartbeat shortness_of_breath fatigue_weakness dizziness
## Min. :0.00000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.00000 Median :0.0000 Median :0.0000 Median :0.0000
## Mean :0.09846 Mean :0.1901 Mean :0.2445 Mean :0.1907
## 3rd Qu.:0.00000 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.0000
## Max. :1.00000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## swelling_edema neck_jaw_pain excessive_sweating persistent_cough
## Min. :0.0000 Min. :0.00000 Min. :0.00000 Min. :0.000
## 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.000
## Median :0.0000 Median :0.00000 Median :0.00000 Median :0.000
## Mean :0.1459 Mean :0.09951 Mean :0.09751 Mean :0.106
## 3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.000
## Max. :1.0000 Max. :1.00000 Max. :1.00000 Max. :1.000
## nausea_vomiting chest_discomfort cold_hands_feet snoring_sleep_apnea
## Min. :0.00000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.00000 Median :0.0000 Median :0.0000 Median :0.0000
## Mean :0.09754 Mean :0.1438 Mean :0.1946 Mean :0.1471
## 3rd Qu.:0.00000 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.0000
## Max. :1.00000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## anxiety_doom stroke_risk_percentage at_risk
## Min. :0.00000 Min. : 1.50 Min. :0.0000
## 1st Qu.:0.00000 1st Qu.: 19.90 1st Qu.:0.0000
## Median :0.00000 Median : 38.70 Median :0.0000
## Mean :0.09854 Mean : 44.48 Mean :0.3682
## 3rd Qu.:0.00000 3rd Qu.: 64.50 3rd Qu.:1.0000
## Max. :1.00000 Max. :100.00 Max. :1.0000
## age chest_pain high_blood_pressure shortness_of_breath
## age 1.00 0.17 0.24 0.15
## chest_pain 0.17 1.00 0.05 0.03
## high_blood_pressure 0.24 0.05 1.00 0.06
## shortness_of_breath 0.15 0.03 0.06 1.00
## irregular_heartbeat 0.18 0.04 0.07 0.03
## fatigue_weakness 0.15 0.02 0.04 0.03
## dizziness 0.15 0.04 0.05 0.03
## swelling_edema 0.16 0.03 0.06 0.03
## neck_jaw_pain 0.18 0.04 0.06 0.04
## excessive_sweating 0.11 0.02 0.03 0.03
## persistent_cough 0.15 0.03 0.05 0.03
## nausea_vomiting 0.11 0.01 0.02 0.02
## chest_discomfort 0.17 0.03 0.05 0.03
## cold_hands_feet 0.15 0.03 0.05 0.04
## snoring_sleep_apnea 0.18 0.04 0.06 0.03
## anxiety_doom 0.11 0.03 0.03 0.02
## irregular_heartbeat fatigue_weakness dizziness
## age 0.18 0.15 0.15
## chest_pain 0.04 0.02 0.04
## high_blood_pressure 0.07 0.04 0.05
## shortness_of_breath 0.03 0.03 0.03
## irregular_heartbeat 1.00 0.04 0.03
## fatigue_weakness 0.04 1.00 0.02
## dizziness 0.03 0.02 1.00
## swelling_edema 0.04 0.03 0.03
## neck_jaw_pain 0.03 0.03 0.03
## excessive_sweating 0.02 0.02 0.01
## persistent_cough 0.03 0.04 0.03
## nausea_vomiting 0.02 0.03 0.01
## chest_discomfort 0.03 0.03 0.02
## cold_hands_feet 0.03 0.03 0.02
## snoring_sleep_apnea 0.05 0.02 0.04
## anxiety_doom 0.03 0.02 0.01
## swelling_edema neck_jaw_pain excessive_sweating
## age 0.16 0.18 0.11
## chest_pain 0.03 0.04 0.02
## high_blood_pressure 0.06 0.06 0.03
## shortness_of_breath 0.03 0.04 0.03
## irregular_heartbeat 0.04 0.03 0.02
## fatigue_weakness 0.03 0.03 0.02
## dizziness 0.03 0.03 0.01
## swelling_edema 1.00 0.04 0.02
## neck_jaw_pain 0.04 1.00 0.02
## excessive_sweating 0.02 0.02 1.00
## persistent_cough 0.02 0.02 0.02
## nausea_vomiting 0.02 0.03 0.01
## chest_discomfort 0.03 0.03 0.02
## cold_hands_feet 0.03 0.03 0.03
## snoring_sleep_apnea 0.06 0.04 0.02
## anxiety_doom 0.02 0.01 0.01
## persistent_cough nausea_vomiting chest_discomfort
## age 0.15 0.11 0.17
## chest_pain 0.03 0.01 0.03
## high_blood_pressure 0.05 0.02 0.05
## shortness_of_breath 0.03 0.02 0.03
## irregular_heartbeat 0.03 0.02 0.03
## fatigue_weakness 0.04 0.03 0.03
## dizziness 0.03 0.01 0.02
## swelling_edema 0.02 0.02 0.03
## neck_jaw_pain 0.02 0.03 0.03
## excessive_sweating 0.02 0.01 0.02
## persistent_cough 1.00 0.02 0.03
## nausea_vomiting 0.02 1.00 0.02
## chest_discomfort 0.03 0.02 1.00
## cold_hands_feet 0.04 0.02 0.04
## snoring_sleep_apnea 0.03 0.02 0.04
## anxiety_doom 0.02 0.01 0.02
## cold_hands_feet snoring_sleep_apnea anxiety_doom
## age 0.15 0.18 0.11
## chest_pain 0.03 0.04 0.03
## high_blood_pressure 0.05 0.06 0.03
## shortness_of_breath 0.04 0.03 0.02
## irregular_heartbeat 0.03 0.05 0.03
## fatigue_weakness 0.03 0.02 0.02
## dizziness 0.02 0.04 0.01
## swelling_edema 0.03 0.06 0.02
## neck_jaw_pain 0.03 0.04 0.01
## excessive_sweating 0.03 0.02 0.01
## persistent_cough 0.04 0.03 0.02
## nausea_vomiting 0.02 0.02 0.01
## chest_discomfort 0.04 0.04 0.02
## cold_hands_feet 1.00 0.03 0.02
## snoring_sleep_apnea 0.03 1.00 0.02
## anxiety_doom 0.02 0.02 1.00
B. Screening for Predictors All stroke-related symptoms were retained in the regression model due to their relevance to the primary research question.
Variance Inflation Factor (VIF) VIF was used to assess multicollinearity. This is important because highly correlated predictors can destabilize the coefficient estimates, inflate their standard errors, and make it harder to isolate the effect of each predictor [2]. As a common rule of thumb, VIF values between 1 and 5 are acceptable, values above 5 raise concern, and values of 10 or more indicate a serious multicollinearity problem. All predictors had VIF values between 1.00 and 1.03, indicating no concern for multicollinearity. This was expected given the low correlations reviewed earlier.
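The VIF for predictor j equals 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on the remaining predictors. A minimal sketch of that definition in base R is given below. It uses simulated, independently generated stand-in symptoms (the prevalences and the choice of three columns are assumptions), so it illustrates the calculation rather than reproducing the reported values. For a fitted `lm` with single-degree-of-freedom terms, `car::vif()` from the car package loaded at the top of the report returns the same quantity.

```r
set.seed(840)
n <- 2000

# Hypothetical, independently generated binary symptoms (stand-in data)
X <- data.frame(
  chest_pain          = rbinom(n, 1, 0.15),
  high_blood_pressure = rbinom(n, 1, 0.25),
  dizziness           = rbinom(n, 1, 0.19)
)

# VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing predictor j
# on all remaining predictors
vif_by_hand <- sapply(names(X), function(v) {
  others <- setdiff(names(X), v)
  r2 <- summary(lm(reformulate(others, response = v), data = X))$r.squared
  1 / (1 - r2)
})
print(round(vif_by_hand, 3))
```

Values close to 1, as in the report, mean that almost none of a predictor's variance is explained by the other predictors.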
Adjusted R Squared When all predictors were included in the model, approximately 26.84% of the variability in age was explained by the stroke-related symptom variables after adjusting for the number of predictors. The unadjusted R-squared value was 26.88%. The closeness of the two values suggests that the predictors contributed explanatory information rather than serving as unnecessary variables, since adjusted R-squared penalizes predictors that add little. This addresses the main question of how age tends to differ across stroke-related symptoms [2].
Coefficient Significance All predictor variables had large positive t-values (16.37 to 36.73), providing evidence against H0: βⱼ = 0 for each predictor. The large t-values, together with small standard errors, support a relationship between age and each stroke-related symptom, and high blood pressure had the largest t-value (36.73). This suggests the relationships are unlikely to be due to chance.
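The relationship between the two R-squared values can be verified with a short calculation. The sketch below applies the standard adjustment formula to the reported R-squared, sample size, and number of predictors. The variable names are illustrative.

```r
r2 <- 0.2687568   # multiple R-squared from the model output
n  <- 35000       # number of observations
p  <- 15          # number of symptom predictors

# Adjusted R-squared rescales the unexplained share by (n - 1) / (n - p - 1)
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(round(adj_r2, 4))
```

Because n is large relative to p, the rescaling factor is very close to 1, which is why the adjusted and unadjusted values differ only in the fourth decimal place.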
F-Value The F-statistic evaluates the null hypothesis that all regression coefficients associated with the predictor variables are simultaneously equal to zero, compared with the alternative hypothesis that at least one predictor contributes meaningful explanatory information regarding the response variable [2]. With F(15, 34984) = 857.1873, p < 0.001, the regression model explains more variation in age than a model containing only the intercept. Therefore, at least one stroke-related symptom predictor is significantly associated with age.
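The overall F-statistic can also be recovered from R-squared, which shows how the two summaries are linked. The sketch below uses the reported R-squared and degrees of freedom. The variable names are illustrative, and the result should match the reported F value up to rounding of the inputs.

```r
r2 <- 0.2687568   # multiple R-squared from the model output
n  <- 35000       # number of observations
p  <- 15          # number of symptom predictors

# F = (explained share per predictor) / (unexplained share per residual df)
f_stat <- (r2 / p) / ((1 - r2) / (n - p - 1))

# Upper-tail probability under H0 with (p, n - p - 1) degrees of freedom
p_value <- pf(f_stat, df1 = p, df2 = n - p - 1, lower.tail = FALSE)

print(round(f_stat, 2))
print(p_value < 0.001)
```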
##           chest_pain  high_blood_pressure  shortness_of_breath  irregular_heartbeat
##             1.010731             1.026998             1.011818             1.014029
##     fatigue_weakness            dizziness       swelling_edema        neck_jaw_pain
##             1.008667             1.008530             1.012529             1.011481
##   excessive_sweating     persistent_cough      nausea_vomiting     chest_discomfort
##             1.003967             1.009443             1.004041             1.010581
##      cold_hands_feet  snoring_sleep_apnea         anxiety_doom
##             1.010524             1.015069             1.003918
## [1] 0.2684432
## [1] 0.2687568
##          (Intercept)           chest_pain  high_blood_pressure  shortness_of_breath
##            338.61879             25.70880             36.72685             21.17264
##  irregular_heartbeat     fatigue_weakness            dizziness       swelling_edema
##             27.20233             22.09451             23.40575             23.74615
##        neck_jaw_pain   excessive_sweating     persistent_cough      nausea_vomiting
##             28.30689             16.44357             21.87559             16.36934
##     chest_discomfort      cold_hands_feet  snoring_sleep_apnea         anxiety_doom
##             25.30072             22.76297             27.52338             16.50684
## value numdf dendf
## 857.1873 15.0000 34984.0000
C. Model Validation Model validation was performed using a 70/30 train-test split. The model was trained on 70% of the observations and evaluated on the remaining 30%. Mean squared prediction error (MSPR) was used to assess predictive performance in the testing sample[2]. The square root of MSPR gives the prediction error in years.
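The model had an MSPR of 98.18 and a root mean squared prediction error of approximately 9.9 years. This is close to the residual standard error of 9.891 from the full-data fit, so there is no sign that the model predicts noticeably worse on held-out observations than on the data used to fit it. Given that age ranges from 18 to 86 years and the predictors are binary symptoms, this suggests moderate predictive performance for unseen observations.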
## [1] 98.17695
## [1] 9.908428
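The validation procedure described above can be sketched as follows. The block uses simulated stand-in data with two symptoms (the prevalences, coefficients, error standard deviation, and seed are all assumptions), so its numbers will not match the reported MSPR. It only illustrates the 70/30 split and the error calculation.

```r
set.seed(840)
n <- 2000

# Simulated stand-in data; generating values are hypothetical
sim <- data.frame(
  chest_pain          = rbinom(n, 1, 0.15),
  high_blood_pressure = rbinom(n, 1, 0.25)
)
sim$age <- 30 + 4 * sim$chest_pain + 4.5 * sim$high_blood_pressure +
  rnorm(n, sd = 10)

# 70/30 split: sample row indices for training, hold out the rest
train_idx <- sample(seq_len(n), size = floor(0.7 * n))
train <- sim[train_idx, ]
test  <- sim[-train_idx, ]

# Fit on the training rows only, then predict the held-out rows
fit  <- lm(age ~ chest_pain + high_blood_pressure, data = train)
pred <- predict(fit, newdata = test)

mspr  <- mean((test$age - pred)^2)  # mean squared prediction error
rmspe <- sqrt(mspr)                 # same units as age (years)
print(c(MSPR = mspr, RMSPE = rmspe))
```

Comparing `mspr` with the training mean squared error, `summary(fit)$sigma^2`, gives a quick indication of whether the model predicts noticeably worse out of sample.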
D. Residual Diagnostics Residuals vs Fitted The residuals vs fitted plot was examined to assess model form and constant variance. The residuals were generally centered around zero, which suggests that the model was not systematically overpredicting or underpredicting age overall. However, the spread of residuals changed across fitted values, suggesting heteroscedasticity. This indicates that the variability of age around its predicted value is not the same across all fitted age values.
QQ Plot The Q-Q plot was used to assess the normality assumption[2]. The points followed the reference line reasonably well in the center of the distribution, but deviations were present in the tails. This suggests approximate normality for most observations, with some evidence of extreme residuals or heavy tails.
Cook’s Distance / Residual vs Leverage The residuals-versus-leverage plot and Cook’s distance were examined to identify observations with excessive influence on the fitted model[2]. Most observations had low leverage and clustered near zero residuals. Although a few observations had relatively higher influence values, the Cook’s distance values remained small, suggesting that no single observation excessively influenced the estimated relationship between age and symptom predictors.
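The influence measures discussed above are available in base R. The sketch below computes Cook's distance and leverage on simulated stand-in data (the generating values are hypothetical) and flags observations using two common screening cutoffs, 4/n for Cook's distance and 2p/n for leverage. These cutoffs are conventions assumed here for illustration and are not taken from the report.

```r
set.seed(840)
n <- 2000

# Simulated stand-in data; generating values are hypothetical
sim <- data.frame(
  chest_pain          = rbinom(n, 1, 0.15),
  high_blood_pressure = rbinom(n, 1, 0.25)
)
sim$age <- 30 + 4 * sim$chest_pain + 4.5 * sim$high_blood_pressure +
  rnorm(n, sd = 10)

fit <- lm(age ~ chest_pain + high_blood_pressure, data = sim)

cd     <- cooks.distance(fit)   # influence of each observation on the fit
lev    <- hatvalues(fit)        # leverage (diagonal of the hat matrix)
p_coef <- length(coef(fit))     # number of estimated coefficients

# Screening cutoffs (rules of thumb, assumed for illustration)
flag_cook <- which(cd > 4 / n)
flag_lev  <- which(lev > 2 * p_coef / n)

print(max(cd))
print(c(flagged_cook = length(flag_cook), flagged_leverage = length(flag_lev)))
```

The four standard diagnostic plots, including residuals vs fitted, Q-Q, and residuals vs leverage, can be drawn with `plot(fit)`.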
References:
1. WHO | Disease burden and mortality estimates. Accessed May 21, 2020, at: http://www.who.int/healthinfo/global_burden_disease/estimates/en
2. Kutner, M. H., Nachtsheim, C. J., Neter, J., & Li, W. (2005). Applied linear statistical models (5th ed.). McGraw-Hill Irwin.