Stroke Risk Prediction for Patients: A Multiple Linear Regression Analysis

Step 1: Load the packages into R

library(ggpubr)
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 4.3.1
library(tidyverse)
## Warning: package 'lubridate' was built under R version 4.3.1
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ lubridate 1.9.3     ✔ tibble    3.2.1
## ✔ purrr     1.0.1     ✔ tidyr     1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library("readxl")
theme_set(theme_pubr())

Step 2: Import the data into RStudio

df <- read_excel("Wk 13_Stroke.xlsx")
head(df)
## # A tibble: 6 × 4
##    Risk   Age Pressure Smoker
##   <dbl> <dbl>    <dbl> <chr> 
## 1    12    57      152 No    
## 2    24    67      163 No    
## 3    13    58      155 No    
## 4    56    86      177 Yes   
## 5    28    59      196 No    
## 6    51    76      189 Yes
# Alternatively, use df <- read_excel(file.choose()) to select the data file interactively

Step 3: Data summary & visualization

summary(df)
##       Risk            Age           Pressure        Smoker         
##  Min.   : 3.00   Min.   :56.00   Min.   : 98.0   Length:20         
##  1st Qu.:15.00   1st Qu.:59.75   1st Qu.:132.5   Class :character  
##  Median :26.00   Median :68.50   Median :155.0   Mode  :character  
##  Mean   :26.95   Mean   :69.45   Mean   :157.1                     
##  3rd Qu.:36.25   3rd Qu.:78.00   3rd Qu.:180.0                     
##  Max.   :56.00   Max.   :86.00   Max.   :209.0
The Smoker variable is stored as text (class character), so it needs pre-processing before analysis; for example, cor() in Step 3c requires numeric columns. Convert Smoker into a dummy variable (0 = nonsmoker, 1 = smoker) with the "ifelse" function.

Step 3a: Convert the categorical variable

df$Smoker <- ifelse(df$Smoker == 'Yes', 1, 0)
head(df) # Preview the data; Smoker is now recoded in place as 0/1
## # A tibble: 6 × 4
##    Risk   Age Pressure Smoker
##   <dbl> <dbl>    <dbl>  <dbl>
## 1    12    57      152      0
## 2    24    67      163      0
## 3    13    58      155      0
## 4    56    86      177      1
## 5    28    59      196      0
## 6    51    76      189      1
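
As a quick sanity check on the recoding, the original labels can be cross-tabulated against the new values. The sketch below is only illustrative: it uses the six Smoker values shown in the head(df) output rather than the full data set, and the names smoker_raw, smoker_dummy and check are hypothetical.

```r
# First six Smoker values, copied from the head(df) output in Step 2
smoker_raw <- c("No", "No", "No", "Yes", "No", "Yes")

# Same recoding rule as in Step 3a: 'Yes' becomes 1, anything else becomes 0
smoker_dummy <- ifelse(smoker_raw == "Yes", 1, 0)

# Cross-tabulate original labels against recoded values;
# every 'No' should fall under 0 and every 'Yes' under 1
check <- table(original = smoker_raw, recoded = smoker_dummy)
print(check)
```

Note that this rule sends any value other than exactly 'Yes' (for example 'yes') to 0, and missing values stay NA, so a check of this kind is worth running on the full column.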

Step 3b: Scatterplot matrix to visualize the data

pairs(~Risk + Age + Pressure + Smoker, data = df)

Step 3c: Correlation matrix

cor(df)
##               Risk        Age   Pressure    Smoker
## Risk     1.0000000  0.6502396  0.3881635 0.6804481
## Age      0.6502396  1.0000000 -0.3089517 0.4107675
## Pressure 0.3881635 -0.3089517  1.0000000 0.1666461
## Smoker   0.6804481  0.4107675  0.1666461 1.0000000

Step 4: Build the regression model

model <- lm(Risk ~ Age + Pressure + Smoker, data = df)
summary(model)
## 
## Call:
## lm(formula = Risk ~ Age + Pressure + Smoker, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.1064  -1.5715   0.4225   3.4855   8.5561 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -91.75950   15.22276  -6.028 1.76e-05 ***
## Age           1.07674    0.16596   6.488 7.49e-06 ***
## Pressure      0.25181    0.04523   5.568 4.24e-05 ***
## Smoker        8.73987    3.00082   2.912   0.0102 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.757 on 16 degrees of freedom
## Multiple R-squared:  0.8735, Adjusted R-squared:  0.8498 
## F-statistic: 36.82 on 3 and 16 DF,  p-value: 2.064e-07
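The estimated regression equation: y_hat = -91.76 + 1.08(Age) + 0.25(Pressure) + 8.74(Smoker), where y_hat is the predicted Risk.
All three independent variables are significant at the 0.05 level (each p < 0.05), and the model explains about 87% of the variation in Risk (Multiple R-squared = 0.8735). Holding Age and Pressure constant, the predicted Risk for a smoker is 8.74 points higher than for a nonsmoker.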
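
To see how the numbers in the summary relate to one another, the sketch below recomputes two of them by hand from values copied out of the output above: the t value for Age (estimate divided by standard error) and the adjusted R-squared, using the standard formula 1 - (1 - R^2)(n - 1)/(n - p - 1) with n = 20 observations and p = 3 predictors. The variable names are illustrative, and small differences from the printed output come from rounding.

```r
# Values copied from the summary(model) output
est_age <- 1.07674   # Estimate for Age
se_age  <- 0.16596   # Std. Error for Age
r2      <- 0.8735    # Multiple R-squared
n       <- 20        # observations (16 residual df + 4 estimated coefficients)
p       <- 3         # predictors: Age, Pressure, Smoker

# t value = estimate / standard error
t_age <- est_age / se_age

# Two-sided p-value from the t distribution with n - p - 1 = 16 df
p_age <- 2 * pt(-abs(t_age), df = n - p - 1)

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(round(t_age, 3))
print(signif(p_age, 3))
print(round(adj_r2, 4))
```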

Step 5: Predict the risk of stroke for a new patient

new_data <- data.frame(
  Age = 68, Pressure = 175, Smoker = 1, check.names = FALSE)

# Note: check.names = FALSE is an argument of data.frame() (not a function). It keeps the column names exactly as given, so they match the variable names used in the model.

# Predict on new data
round(predict(model, newdata=new_data),0)
##  1 
## 34
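
The same prediction can be reproduced by hand, which makes the role of each coefficient explicit. This is a minimal sketch using the coefficients as printed in the Step 4 output (so it does not need the fitted model object); the names b, x and risk_hat are illustrative.

```r
# Coefficients copied from the summary(model) output in Step 4
b <- c(Intercept = -91.75950, Age = 1.07674, Pressure = 0.25181, Smoker = 8.73987)

# New patient from Step 5; the leading 1 multiplies the intercept
x <- c(Intercept = 1, Age = 68, Pressure = 175, Smoker = 1)

# Predicted risk = sum of coefficient * value over all terms
risk_hat <- sum(b * x)

print(round(risk_hat, 2))  # unrounded prediction, about 34.27
print(round(risk_hat, 0))  # matches the rounded predict() result of 34
```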