About Data Analysis Report

This RMarkdown file contains the report of the data analysis done for the project on building and deploying a stroke prediction model in R. It contains analysis such as data exploration, summary statistics and building the prediction models. The final report was completed on Mon Dec 2 15:46:09 2024.

Data Description:

According to the World Health Organization (WHO) stroke is the 2nd leading cause of death globally, responsible for approximately 11% of total deaths.

This data set is used to predict whether a patient is likely to get stroke based on the input parameters like gender, age, various diseases, and smoking status. Each row in the data provides relevant information about the patient.

Task One: Import data and data preprocessing

packages <- c("tidyverse", "lubridate", "ggplot2", "dplyr", "caret", "randomForest", "skimr", "gridExtra", "caTools", "corrplot", "ggcorrplot", "naniar")

for (pkg in packages) {
  if (!require(pkg, character.only = TRUE)) {
    install.packages(pkg)
    library(pkg, character.only = TRUE)
  }
}
## Loading required package: tidyverse
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## Loading required package: caret
## 
## Loading required package: lattice
## 
## 
## Attaching package: 'caret'
## 
## 
## The following object is masked from 'package:purrr':
## 
##     lift
## 
## 
## Loading required package: randomForest
## 
## randomForest 4.7-1.1
## 
## Type rfNews() to see new features/changes/bug fixes.
## 
## 
## Attaching package: 'randomForest'
## 
## 
## The following object is masked from 'package:dplyr':
## 
##     combine
## 
## 
## The following object is masked from 'package:ggplot2':
## 
##     margin
## 
## 
## Loading required package: skimr
## 
## Loading required package: gridExtra
## 
## 
## Attaching package: 'gridExtra'
## 
## 
## The following object is masked from 'package:randomForest':
## 
##     combine
## 
## 
## The following object is masked from 'package:dplyr':
## 
##     combine
## 
## 
## Loading required package: caTools
## 
## Loading required package: corrplot
## 
## corrplot 0.92 loaded
## 
## Loading required package: ggcorrplot
## 
## Loading required package: naniar
## 
## 
## Attaching package: 'naniar'
## 
## 
## The following object is masked from 'package:skimr':
## 
##     n_complete
stroke_data <- read_csv("healthcare-dataset-stroke-data.csv", show_col_types = FALSE)
head(stroke_data)
## # A tibble: 6 × 12
##      id gender   age hypertension heart_disease ever_married work_type    
##   <dbl> <chr>  <dbl>        <dbl>         <dbl> <chr>        <chr>        
## 1  9046 Male      67            0             1 Yes          Private      
## 2 51676 Female    61            0             0 Yes          Self-employed
## 3 31112 Male      80            0             1 Yes          Private      
## 4 60182 Female    49            0             0 Yes          Private      
## 5  1665 Female    79            1             0 Yes          Self-employed
## 6 56669 Male      81            0             0 Yes          Private      
## # ℹ 5 more variables: Residence_type <chr>, avg_glucose_level <dbl>, bmi <chr>,
## #   smoking_status <chr>, stroke <dbl>

Describe and explore the data

summary(stroke_data)
##        id           gender               age         hypertension    
##  Min.   :   67   Length:5110        Min.   : 0.08   Min.   :0.00000  
##  1st Qu.:17741   Class :character   1st Qu.:25.00   1st Qu.:0.00000  
##  Median :36932   Mode  :character   Median :45.00   Median :0.00000  
##  Mean   :36518                      Mean   :43.23   Mean   :0.09746  
##  3rd Qu.:54682                      3rd Qu.:61.00   3rd Qu.:0.00000  
##  Max.   :72940                      Max.   :82.00   Max.   :1.00000  
##  heart_disease     ever_married        work_type         Residence_type    
##  Min.   :0.00000   Length:5110        Length:5110        Length:5110       
##  1st Qu.:0.00000   Class :character   Class :character   Class :character  
##  Median :0.00000   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :0.05401                                                           
##  3rd Qu.:0.00000                                                           
##  Max.   :1.00000                                                           
##  avg_glucose_level     bmi            smoking_status         stroke       
##  Min.   : 55.12    Length:5110        Length:5110        Min.   :0.00000  
##  1st Qu.: 77.25    Class :character   Class :character   1st Qu.:0.00000  
##  Median : 91.89    Mode  :character   Mode  :character   Median :0.00000  
##  Mean   :106.15                                          Mean   :0.04873  
##  3rd Qu.:114.09                                          3rd Qu.:0.00000  
##  Max.   :271.74                                          Max.   :1.00000
glimpse(stroke_data)
## Rows: 5,110
## Columns: 12
## $ id                <dbl> 9046, 51676, 31112, 60182, 1665, 56669, 53882, 10434…
## $ gender            <chr> "Male", "Female", "Male", "Female", "Female", "Male"…
## $ age               <dbl> 67, 61, 80, 49, 79, 81, 74, 69, 59, 78, 81, 61, 54, …
## $ hypertension      <dbl> 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1…
## $ heart_disease     <dbl> 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0…
## $ ever_married      <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No…
## $ work_type         <chr> "Private", "Self-employed", "Private", "Private", "S…
## $ Residence_type    <chr> "Urban", "Rural", "Rural", "Urban", "Rural", "Urban"…
## $ avg_glucose_level <dbl> 228.69, 202.21, 105.92, 171.23, 174.12, 186.21, 70.0…
## $ bmi               <chr> "36.6", "N/A", "32.5", "34.4", "24", "29", "27.4", "…
## $ smoking_status    <chr> "formerly smoked", "never smoked", "never smoked", "…
## $ stroke            <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
skim(stroke_data)
Data summary
Name stroke_data
Number of rows 5110
Number of columns 12
_______________________
Column type frequency:
character 6
numeric 6
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
gender 0 1 4 6 0 3 0
ever_married 0 1 2 3 0 2 0
work_type 0 1 7 13 0 5 0
Residence_type 0 1 5 5 0 2 0
bmi 0 1 2 4 0 419 0
smoking_status 0 1 6 15 0 4 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
id 0 1 36517.83 21161.72 67.00 17741.25 36932.00 54682.00 72940.00 ▇▇▇▇▇
age 0 1 43.23 22.61 0.08 25.00 45.00 61.00 82.00 ▅▆▇▇▆
hypertension 0 1 0.10 0.30 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
heart_disease 0 1 0.05 0.23 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
avg_glucose_level 0 1 106.15 45.28 55.12 77.24 91.88 114.09 271.74 ▇▃▁▁▁
stroke 0 1 0.05 0.22 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
miss_scan_count(data = stroke_data, search = list("Unknown","N/A","Other"))
## # A tibble: 12 × 2
##    Variable              n
##    <chr>             <int>
##  1 id                    0
##  2 gender                1
##  3 age                   0
##  4 hypertension          0
##  5 heart_disease         0
##  6 ever_married          0
##  7 work_type             0
##  8 Residence_type        0
##  9 avg_glucose_level     0
## 10 bmi                 201
## 11 smoking_status     1544
## 12 stroke                0
stroke_data$bmi <- as.numeric(stroke_data$bmi)
## Warning: NAs introduced by coercion
idx <- complete.cases(stroke_data)
bmi_idx <- is.na(stroke_data$bmi)
median_bmi <- median(stroke_data$bmi, na.rm = TRUE)

stroke_data[bmi_idx,]$bmi <- median_bmi
colSums(is.na(stroke_data))
##                id            gender               age      hypertension 
##                 0                 0                 0                 0 
##     heart_disease      ever_married         work_type    Residence_type 
##                 0                 0                 0                 0 
## avg_glucose_level               bmi    smoking_status            stroke 
##                 0                 0                 0                 0
stroke_data %>%
  ggplot(aes(x = gender, fill = factor(stroke))) +
  geom_bar(position = "fill") +
  scale_fill_manual(values = c("maroon", "blue")) +
  ggtitle("Gender vs. Stroke")

stroke_data %>%
  ggplot(aes(x = factor(heart_disease), fill = factor(stroke))) +
  geom_bar(position = "fill") +
  scale_fill_manual(values = c("maroon", "blue")) +
  ggtitle("Heart disease vs. Stroke")

# Residence Type vs. Stroke
stroke_data %>%
  ggplot(aes(x = Residence_type, fill = factor(stroke))) +
  geom_bar(position = "fill") +
  scale_fill_manual(values = c("maroon", "blue")) +
  ggtitle("Residence Type vs. Stroke")

# Smoking Status vs. Stroke
stroke_data %>%
  ggplot(aes(x = smoking_status, fill = factor(stroke))) +
  geom_bar(position = "fill") +
  scale_fill_manual(values = c("maroon", "blue")) +
  ggtitle("Smoking Status vs. Stroke")

# Average Glucose Level vs. Stroke
stroke_data %>%
  ggplot(aes(x = avg_glucose_level, fill = factor(stroke))) +
  geom_density(alpha = 0.7) +
  scale_fill_manual(values = c("maroon", "blue")) +
  ggtitle("Average Glucose Level vs. Stroke")

# Body Mass Index vs. Stroke (Filtered for valid BMI range)
stroke_data %>%
  filter(between(bmi, 0, 60)) %>%
  ggplot(aes(x = bmi, fill = factor(stroke))) +
  geom_density(alpha = 0.7) +
  scale_fill_manual(values = c("maroon", "blue")) +
  ggtitle("Body Mass Index vs. Stroke")

##Check duplicates

sum(duplicated(stroke_data))
## [1] 0
colSums(stroke_data == 'N/A')
##                id            gender               age      hypertension 
##                 0                 0                 0                 0 
##     heart_disease      ever_married         work_type    Residence_type 
##                 0                 0                 0                 0 
## avg_glucose_level               bmi    smoking_status            stroke 
##                 0                 0                 0                 0
colSums(stroke_data == '')
##                id            gender               age      hypertension 
##                 0                 0                 0                 0 
##     heart_disease      ever_married         work_type    Residence_type 
##                 0                 0                 0                 0 
## avg_glucose_level               bmi    smoking_status            stroke 
##                 0                 0                 0                 0
stroke_data %>% count(gender)
## # A tibble: 3 × 2
##   gender     n
##   <chr>  <int>
## 1 Female  2994
## 2 Male    2115
## 3 Other      1

##Remove ID and filter out ‘Other’ values in Gender

stroke_data <- stroke_data %>% 
  select(-c(id)) %>% 
  filter(gender != "Other")
str(stroke_data)
## tibble [5,109 × 11] (S3: tbl_df/tbl/data.frame)
##  $ gender           : chr [1:5109] "Male" "Female" "Male" "Female" ...
##  $ age              : num [1:5109] 67 61 80 49 79 81 74 69 59 78 ...
##  $ hypertension     : num [1:5109] 0 0 0 0 1 0 1 0 0 0 ...
##  $ heart_disease    : num [1:5109] 1 0 1 0 0 0 1 0 0 0 ...
##  $ ever_married     : chr [1:5109] "Yes" "Yes" "Yes" "Yes" ...
##  $ work_type        : chr [1:5109] "Private" "Self-employed" "Private" "Private" ...
##  $ Residence_type   : chr [1:5109] "Urban" "Rural" "Rural" "Urban" ...
##  $ avg_glucose_level: num [1:5109] 229 202 106 171 174 ...
##  $ bmi              : num [1:5109] 36.6 28.1 32.5 34.4 24 29 27.4 22.8 28.1 24.2 ...
##  $ smoking_status   : chr [1:5109] "formerly smoked" "never smoked" "never smoked" "smokes" ...
##  $ stroke           : num [1:5109] 1 1 1 1 1 1 1 1 1 1 ...

##Convert non-numeric variables to factors

stroke_data$stroke <- factor(stroke_data$stroke, levels = c(0,1), labels = c("No", "Yes"))
stroke_data$hypertension <- factor(stroke_data$hypertension, levels = c(0,1), labels = c("No", "Yes"))
stroke_data$heart_disease <- factor(stroke_data$heart_disease, levels = c(0,1), labels = c("No", "Yes"))

Task Two: Build prediction models

d1 <- stroke_data %>%
  ggplot(aes(x = gender, fill = gender)) +
  geom_bar(fill = c("red", "blue")) +
  ggtitle("Gender Distribution") +
  geom_text(aes(label=..count..), stat = "Count", vjust = 1.0)
  
d2 <- stroke_data %>%
  ggplot(aes(x = hypertension, fill = hypertension)) +
  geom_bar(fill = c("red", "blue")) +
  ggtitle("Hypertenstion Distribution") +
  geom_text(aes(label=..count..), stat = "Count", vjust = 1.0)
  

d3 <- stroke_data %>%
  ggplot(aes(x = heart_disease, fill = heart_disease)) +
  geom_bar(fill = c("red", "blue")) +
  ggtitle("Heart Disease Distribution") +
  geom_text(aes(label=..count..), stat = "Count", vjust = 1.0)

d4 <- stroke_data %>%
  ggplot(aes(x = ever_married, fill = ever_married)) +
  geom_bar(fill = c("red","blue")) +
  ggtitle("Married distribution") +
  geom_text(aes(label=..count..), stat = "Count", vjust = 1.0)

d5 <- stroke_data %>%
  ggplot(aes(x = work_type, fill = work_type)) +
  geom_bar(fill = c("red", "blue","green","orange","aquamarine")) +
  ggtitle("Work type distribution") +
  geom_text(aes(label=..count..), stat = "Count", vjust = 1.0)

d6 <- stroke_data %>%
  ggplot(aes(x = stroke, fill = stroke)) +
  geom_bar(fill = c("red", "blue")) +
  ggtitle("Stroke distribution") +
  geom_text(aes(label=..count..), stat = "Count", vjust = 1.0)

d7 <- stroke_data %>%
  ggplot(aes(x = Residence_type, fill = Residence_type)) +
  geom_bar(fill = c("red", "blue")) +
  ggtitle("Residence distribution") +
  geom_text(aes(label=..count..), stat = "Count", vjust = 1.0)


grid.arrange(d1,d2,d3,d4,d5,d6,d7, ncol=2)
## Warning: The dot-dot notation (`..count..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(count)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Task Three: Evaluate and select prediction models

sample.split(stroke_data$stroke,SplitRatio = 0.8)->split_tag
train<-subset(stroke_data,split_tag==TRUE)
test<-subset(stroke_data,split_tag==FALSE)
dim(train)
## [1] 4087   11

Task Four: Deploy the prediction model

set.seed(123)
rf <- randomForest(formula = stroke~.,data = train)
rf
## 
## Call:
##  randomForest(formula = stroke ~ ., data = train) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 4.99%
## Confusion matrix:
##       No Yes class.error
## No  3883   5 0.001286008
## Yes  199   0 1.000000000

Task Five: Findings and Conclusions

confusionMatrix(predict(rf,test),test$stroke)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  972  50
##        Yes   0   0
##                                          
##                Accuracy : 0.9511         
##                  95% CI : (0.936, 0.9635)
##     No Information Rate : 0.9511         
##     P-Value [Acc > NIR] : 0.5375         
##                                          
##                   Kappa : 0              
##                                          
##  Mcnemar's Test P-Value : 4.219e-12      
##                                          
##             Sensitivity : 1.0000         
##             Specificity : 0.0000         
##          Pos Pred Value : 0.9511         
##          Neg Pred Value :    NaN         
##              Prevalence : 0.9511         
##          Detection Rate : 0.9511         
##    Detection Prevalence : 1.0000         
##       Balanced Accuracy : 0.5000         
##                                          
##        'Positive' Class : No             
## 

Our model achieves an impressive accuracy rate of over 95%, demonstrating the effectiveness of its training process.