Stroke Risk Prediction for Patients: A Multiple Linear Regression Analysis
Step 1: Load the packages into R
library(ggpubr)
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 4.3.1
library(tidyverse)
## Warning: package 'lubridate' was built under R version 4.3.1
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ lubridate 1.9.3 ✔ tibble 3.2.1
## ✔ purrr 1.0.1 ✔ tidyr 1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library("readxl")
theme_set(theme_pubr())
Step 2: Import the data into RStudio
df <- read_excel("Wk 13_Stroke.xlsx")
head(df)
## # A tibble: 6 × 4
##    Risk   Age Pressure Smoker
##   <dbl> <dbl>    <dbl> <chr>
## 1    12    57      152 No
## 2    24    67      163 No
## 3    13    58      155 No
## 4    56    86      177 Yes
## 5    28    59      196 No
## 6    51    76      189 Yes
# Alternatively, use df <- read_excel(file.choose()) to select the data file interactively
Step 3: Data summary & visualization
summary(df)
##       Risk            Age           Pressure        Smoker
##  Min.   : 3.00   Min.   :56.00   Min.   : 98.0   Length:20
##  1st Qu.:15.00   1st Qu.:59.75   1st Qu.:132.5   Class :character
##  Median :26.00   Median :68.50   Median :155.0   Mode  :character
##  Mean   :26.95   Mean   :69.45   Mean   :157.1
##  3rd Qu.:36.25   3rd Qu.:78.00   3rd Qu.:180.0
##  Max.   :56.00   Max.   :86.00   Max.   :209.0
The Smoker variable is stored as text (class character), so it must be pre-processed before numeric functions such as cor() can use it. Convert Smoker into a dummy variable (0 = nonsmoker, 1 = smoker) with the "ifelse" function, which returns one value where a condition is true and another where it is false.
Step 3a: Convert the categorical variable
df$Smoker <- ifelse(df$Smoker == 'Yes', 1, 0)
head(df) # Shows the first six rows, with Smoker now recoded as 0/1
## # A tibble: 6 × 4
##    Risk   Age Pressure Smoker
##   <dbl> <dbl>    <dbl>  <dbl>
## 1    12    57      152      0
## 2    24    67      163      0
## 3    13    58      155      0
## 4    56    86      177      1
## 5    28    59      196      0
## 6    51    76      189      1
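The recoding can be checked on a small stand-alone vector. The sketch below uses only the six Smoker values shown in the first head(df) output; smoker_text and smoker_dummy are illustrative names, not objects from the analysis.

```r
# Six Smoker values copied from the head(df) output (illustration only)
smoker_text <- c("No", "No", "No", "Yes", "No", "Yes")

# ifelse(test, yes, no): returns 1 where the test is TRUE and 0 where it is FALSE
smoker_dummy <- ifelse(smoker_text == "Yes", 1, 0)

# Cross-tabulate the original and recoded values to confirm the mapping
print(table(smoker_text, smoker_dummy))
```

Any value other than the exact string "Yes" (for example "yes" or a misspelling) is coded 0, so a cross-tabulation like this is a cheap way to catch unexpected labels.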
Step 3b: Scatterplot matrix to visualize the data
pairs(~Risk + Age + Pressure + Smoker, data = df)

Step 3c: Correlation matrix
cor(df)
##               Risk        Age   Pressure    Smoker
## Risk     1.0000000  0.6502396  0.3881635 0.6804481
## Age      0.6502396  1.0000000 -0.3089517 0.4107675
## Pressure 0.3881635 -0.3089517  1.0000000 0.1666461
## Smoker   0.6804481  0.4107675  0.1666461 1.0000000
Step 4: Build the regression model
model <- lm(Risk ~ Age + Pressure + Smoker, data = df)
summary(model)
##
## Call:
## lm(formula = Risk ~ Age + Pressure + Smoker, data = df)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -13.1064  -1.5715   0.4225   3.4855   8.5561
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) -91.75950   15.22276  -6.028 1.76e-05 ***
## Age           1.07674    0.16596   6.488 7.49e-06 ***
## Pressure      0.25181    0.04523   5.568 4.24e-05 ***
## Smoker        8.73987    3.00082   2.912   0.0102 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.757 on 16 degrees of freedom
## Multiple R-squared: 0.8735, Adjusted R-squared: 0.8498
## F-statistic: 36.82 on 3 and 16 DF, p-value: 2.064e-07
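The estimated regression equation is: y_hat = -91.76 + 1.08(Age) + 0.25(Pressure) + 8.74(Smoker), where y_hat is the predicted Risk.
Holding the other variables constant, each additional year of age is associated with an increase of 1.08 in predicted Risk, each additional unit of blood pressure with an increase of 0.25, and being a smoker with an increase of 8.74.
All three independent variables are statistically significant at the 0.05 level (Age and Pressure at p < 0.001; Smoker at p = 0.0102), and the model explains about 87% of the variation in Risk (Multiple R-squared = 0.8735).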
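Some of the numbers in the summary output can be reproduced by hand, which helps when reading the table. The sketch below recomputes the adjusted R-squared and the t value for Age from values copied from the output above. It uses the standard textbook formulas and does not need the fitted model object; r2, n, p, adj_r2, t_age and p_age are illustrative names.

```r
# Values copied from the summary(df) and summary(model) output above
r2 <- 0.8735   # Multiple R-squared
n  <- 20       # number of observations
p  <- 3        # number of predictors: Age, Pressure, Smoker

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(round(adj_r2, 4))   # 0.8498, matching the output

# t value = Estimate / Std. Error (shown here for Age)
t_age <- 1.07674 / 0.16596
print(round(t_age, 3))    # 6.488, matching the output

# Two-sided p-value from the t distribution with n - p - 1 = 16 degrees of freedom
p_age <- 2 * pt(-abs(t_age), df = n - p - 1)
print(p_age)
```

Because the inputs are rounded, the recomputed p-value should be close to, but not necessarily identical to, the reported 7.49e-06.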
Step 5: Predict the risk of stroke for a new patient
new_data <- data.frame(
  Age = 68, Pressure = 175, Smoker = 1, check.names = FALSE)
# Note: check.names = FALSE is an argument to data.frame() that keeps the column names exactly as typed. The names must match the predictor names used in the model.
# Predict on new data
round(predict(model, newdata=new_data),0)
## 1
## 34
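As a check, the same prediction can be worked out by plugging the new patient's values into the estimated equation. The sketch below uses the coefficients copied from the summary output rather than the model object, so it runs on its own; b and risk_hat are illustrative names.

```r
# Coefficients copied from the summary(model) output
b <- c(intercept = -91.75950, Age = 1.07674, Pressure = 0.25181, Smoker = 8.73987)

# New patient: 68 years old, blood pressure 175, smoker (Smoker = 1)
risk_hat <- b[["intercept"]] + b[["Age"]] * 68 + b[["Pressure"]] * 175 + b[["Smoker"]] * 1

print(round(risk_hat, 2))  # 34.27
print(round(risk_hat, 0))  # 34, matching the predict() result
```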