This RMarkdown file contains the report of the data analysis done for the project on building and deploying a stroke prediction model in R. It contains analysis such as data exploration, summary statistics and building the prediction models. The final report was completed on Mon Dec 2 15:46:09 2024.
Data Description:
According to the World Health Organization (WHO) stroke is the 2nd leading cause of death globally, responsible for approximately 11% of total deaths.
This data set is used to predict whether a patient is likely to get stroke based on the input parameters like gender, age, various diseases, and smoking status. Each row in the data provides relevant information about the patient.
packages <- c("tidyverse", "lubridate", "ggplot2", "dplyr", "caret", "randomForest", "skimr", "gridExtra", "caTools", "corrplot", "ggcorrplot", "naniar")
for (pkg in packages) {
if (!require(pkg, character.only = TRUE)) {
install.packages(pkg)
library(pkg, character.only = TRUE)
}
}
## Loading required package: tidyverse
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## Loading required package: caret
##
## Loading required package: lattice
##
##
## Attaching package: 'caret'
##
##
## The following object is masked from 'package:purrr':
##
## lift
##
##
## Loading required package: randomForest
##
## randomForest 4.7-1.1
##
## Type rfNews() to see new features/changes/bug fixes.
##
##
## Attaching package: 'randomForest'
##
##
## The following object is masked from 'package:dplyr':
##
## combine
##
##
## The following object is masked from 'package:ggplot2':
##
## margin
##
##
## Loading required package: skimr
##
## Loading required package: gridExtra
##
##
## Attaching package: 'gridExtra'
##
##
## The following object is masked from 'package:randomForest':
##
## combine
##
##
## The following object is masked from 'package:dplyr':
##
## combine
##
##
## Loading required package: caTools
##
## Loading required package: corrplot
##
## corrplot 0.92 loaded
##
## Loading required package: ggcorrplot
##
## Loading required package: naniar
##
##
## Attaching package: 'naniar'
##
##
## The following object is masked from 'package:skimr':
##
## n_complete
stroke_data <- read_csv("healthcare-dataset-stroke-data.csv", show_col_types = FALSE)
head(stroke_data)
## # A tibble: 6 × 12
## id gender age hypertension heart_disease ever_married work_type
## <dbl> <chr> <dbl> <dbl> <dbl> <chr> <chr>
## 1 9046 Male 67 0 1 Yes Private
## 2 51676 Female 61 0 0 Yes Self-employed
## 3 31112 Male 80 0 1 Yes Private
## 4 60182 Female 49 0 0 Yes Private
## 5 1665 Female 79 1 0 Yes Self-employed
## 6 56669 Male 81 0 0 Yes Private
## # ℹ 5 more variables: Residence_type <chr>, avg_glucose_level <dbl>, bmi <chr>,
## # smoking_status <chr>, stroke <dbl>
summary(stroke_data)
## id gender age hypertension
## Min. : 67 Length:5110 Min. : 0.08 Min. :0.00000
## 1st Qu.:17741 Class :character 1st Qu.:25.00 1st Qu.:0.00000
## Median :36932 Mode :character Median :45.00 Median :0.00000
## Mean :36518 Mean :43.23 Mean :0.09746
## 3rd Qu.:54682 3rd Qu.:61.00 3rd Qu.:0.00000
## Max. :72940 Max. :82.00 Max. :1.00000
## heart_disease ever_married work_type Residence_type
## Min. :0.00000 Length:5110 Length:5110 Length:5110
## 1st Qu.:0.00000 Class :character Class :character Class :character
## Median :0.00000 Mode :character Mode :character Mode :character
## Mean :0.05401
## 3rd Qu.:0.00000
## Max. :1.00000
## avg_glucose_level bmi smoking_status stroke
## Min. : 55.12 Length:5110 Length:5110 Min. :0.00000
## 1st Qu.: 77.25 Class :character Class :character 1st Qu.:0.00000
## Median : 91.89 Mode :character Mode :character Median :0.00000
## Mean :106.15 Mean :0.04873
## 3rd Qu.:114.09 3rd Qu.:0.00000
## Max. :271.74 Max. :1.00000
glimpse(stroke_data)
## Rows: 5,110
## Columns: 12
## $ id <dbl> 9046, 51676, 31112, 60182, 1665, 56669, 53882, 10434…
## $ gender <chr> "Male", "Female", "Male", "Female", "Female", "Male"…
## $ age <dbl> 67, 61, 80, 49, 79, 81, 74, 69, 59, 78, 81, 61, 54, …
## $ hypertension <dbl> 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1…
## $ heart_disease <dbl> 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0…
## $ ever_married <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No…
## $ work_type <chr> "Private", "Self-employed", "Private", "Private", "S…
## $ Residence_type <chr> "Urban", "Rural", "Rural", "Urban", "Rural", "Urban"…
## $ avg_glucose_level <dbl> 228.69, 202.21, 105.92, 171.23, 174.12, 186.21, 70.0…
## $ bmi <chr> "36.6", "N/A", "32.5", "34.4", "24", "29", "27.4", "…
## $ smoking_status <chr> "formerly smoked", "never smoked", "never smoked", "…
## $ stroke <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
skim(stroke_data)
| Name | stroke_data |
| Number of rows | 5110 |
| Number of columns | 12 |
| _______________________ | |
| Column type frequency: | |
| character | 6 |
| numeric | 6 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| gender | 0 | 1 | 4 | 6 | 0 | 3 | 0 |
| ever_married | 0 | 1 | 2 | 3 | 0 | 2 | 0 |
| work_type | 0 | 1 | 7 | 13 | 0 | 5 | 0 |
| Residence_type | 0 | 1 | 5 | 5 | 0 | 2 | 0 |
| bmi | 0 | 1 | 2 | 4 | 0 | 419 | 0 |
| smoking_status | 0 | 1 | 6 | 15 | 0 | 4 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| id | 0 | 1 | 36517.83 | 21161.72 | 67.00 | 17741.25 | 36932.00 | 54682.00 | 72940.00 | ▇▇▇▇▇ |
| age | 0 | 1 | 43.23 | 22.61 | 0.08 | 25.00 | 45.00 | 61.00 | 82.00 | ▅▆▇▇▆ |
| hypertension | 0 | 1 | 0.10 | 0.30 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| heart_disease | 0 | 1 | 0.05 | 0.23 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| avg_glucose_level | 0 | 1 | 106.15 | 45.28 | 55.12 | 77.24 | 91.88 | 114.09 | 271.74 | ▇▃▁▁▁ |
| stroke | 0 | 1 | 0.05 | 0.22 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
miss_scan_count(data = stroke_data, search = list("Unknown","N/A","Other"))
## # A tibble: 12 × 2
## Variable n
## <chr> <int>
## 1 id 0
## 2 gender 1
## 3 age 0
## 4 hypertension 0
## 5 heart_disease 0
## 6 ever_married 0
## 7 work_type 0
## 8 Residence_type 0
## 9 avg_glucose_level 0
## 10 bmi 201
## 11 smoking_status 1544
## 12 stroke 0
stroke_data$bmi <- as.numeric(stroke_data$bmi)
## Warning: NAs introduced by coercion
idx <- complete.cases(stroke_data)
bmi_idx <- is.na(stroke_data$bmi)
median_bmi <- median(stroke_data$bmi, na.rm = TRUE)
stroke_data[bmi_idx,]$bmi <- median_bmi
colSums(is.na(stroke_data))
## id gender age hypertension
## 0 0 0 0
## heart_disease ever_married work_type Residence_type
## 0 0 0 0
## avg_glucose_level bmi smoking_status stroke
## 0 0 0 0
stroke_data %>%
ggplot(aes(x = gender, fill = factor(stroke))) +
geom_bar(position = "fill") +
scale_fill_manual(values = c("maroon", "blue")) +
ggtitle("Gender vs. Stroke")
stroke_data %>%
ggplot(aes(x = factor(heart_disease), fill = factor(stroke))) +
geom_bar(position = "fill") +
scale_fill_manual(values = c("maroon", "blue")) +
ggtitle("Heart disease vs. Stroke")
# Residence Type vs. Stroke
stroke_data %>%
ggplot(aes(x = Residence_type, fill = factor(stroke))) +
geom_bar(position = "fill") +
scale_fill_manual(values = c("maroon", "blue")) +
ggtitle("Residence Type vs. Stroke")
# Smoking Status vs. Stroke
stroke_data %>%
ggplot(aes(x = smoking_status, fill = factor(stroke))) +
geom_bar(position = "fill") +
scale_fill_manual(values = c("maroon", "blue")) +
ggtitle("Smoking Status vs. Stroke")
# Average Glucose Level vs. Stroke
stroke_data %>%
ggplot(aes(x = avg_glucose_level, fill = factor(stroke))) +
geom_density(alpha = 0.7) +
scale_fill_manual(values = c("maroon", "blue")) +
ggtitle("Average Glucose Level vs. Stroke")
# Body Mass Index vs. Stroke (Filtered for valid BMI range)
stroke_data %>%
filter(between(bmi, 0, 60)) %>%
ggplot(aes(x = bmi, fill = factor(stroke))) +
geom_density(alpha = 0.7) +
scale_fill_manual(values = c("maroon", "blue")) +
ggtitle("Body Mass Index vs. Stroke")
##Check duplicates
sum(duplicated(stroke_data))
## [1] 0
colSums(stroke_data == 'N/A')
## id gender age hypertension
## 0 0 0 0
## heart_disease ever_married work_type Residence_type
## 0 0 0 0
## avg_glucose_level bmi smoking_status stroke
## 0 0 0 0
colSums(stroke_data == '')
## id gender age hypertension
## 0 0 0 0
## heart_disease ever_married work_type Residence_type
## 0 0 0 0
## avg_glucose_level bmi smoking_status stroke
## 0 0 0 0
stroke_data %>% count(gender)
## # A tibble: 3 × 2
## gender n
## <chr> <int>
## 1 Female 2994
## 2 Male 2115
## 3 Other 1
##Remove ID and filter out ‘Other’ values in Gender
stroke_data <- stroke_data %>%
select(-c(id)) %>%
filter(gender != "Other")
str(stroke_data)
## tibble [5,109 × 11] (S3: tbl_df/tbl/data.frame)
## $ gender : chr [1:5109] "Male" "Female" "Male" "Female" ...
## $ age : num [1:5109] 67 61 80 49 79 81 74 69 59 78 ...
## $ hypertension : num [1:5109] 0 0 0 0 1 0 1 0 0 0 ...
## $ heart_disease : num [1:5109] 1 0 1 0 0 0 1 0 0 0 ...
## $ ever_married : chr [1:5109] "Yes" "Yes" "Yes" "Yes" ...
## $ work_type : chr [1:5109] "Private" "Self-employed" "Private" "Private" ...
## $ Residence_type : chr [1:5109] "Urban" "Rural" "Rural" "Urban" ...
## $ avg_glucose_level: num [1:5109] 229 202 106 171 174 ...
## $ bmi : num [1:5109] 36.6 28.1 32.5 34.4 24 29 27.4 22.8 28.1 24.2 ...
## $ smoking_status : chr [1:5109] "formerly smoked" "never smoked" "never smoked" "smokes" ...
## $ stroke : num [1:5109] 1 1 1 1 1 1 1 1 1 1 ...
##Convert non-numeric variables to factors
stroke_data$stroke <- factor(stroke_data$stroke, levels = c(0,1), labels = c("No", "Yes"))
stroke_data$hypertension <- factor(stroke_data$hypertension, levels = c(0,1), labels = c("No", "Yes"))
stroke_data$heart_disease <- factor(stroke_data$heart_disease, levels = c(0,1), labels = c("No", "Yes"))
d1 <- stroke_data %>%
ggplot(aes(x = gender, fill = gender)) +
geom_bar(fill = c("red", "blue")) +
ggtitle("Gender Distribution") +
geom_text(aes(label=..count..), stat = "Count", vjust = 1.0)
d2 <- stroke_data %>%
ggplot(aes(x = hypertension, fill = hypertension)) +
geom_bar(fill = c("red", "blue")) +
ggtitle("Hypertenstion Distribution") +
geom_text(aes(label=..count..), stat = "Count", vjust = 1.0)
d3 <- stroke_data %>%
ggplot(aes(x = heart_disease, fill = heart_disease)) +
geom_bar(fill = c("red", "blue")) +
ggtitle("Heart Disease Distribution") +
geom_text(aes(label=..count..), stat = "Count", vjust = 1.0)
d4 <- stroke_data %>%
ggplot(aes(x = ever_married, fill = ever_married)) +
geom_bar(fill = c("red","blue")) +
ggtitle("Married distribution") +
geom_text(aes(label=..count..), stat = "Count", vjust = 1.0)
d5 <- stroke_data %>%
ggplot(aes(x = work_type, fill = work_type)) +
geom_bar(fill = c("red", "blue","green","orange","aquamarine")) +
ggtitle("Work type distribution") +
geom_text(aes(label=..count..), stat = "Count", vjust = 1.0)
d6 <- stroke_data %>%
ggplot(aes(x = stroke, fill = stroke)) +
geom_bar(fill = c("red", "blue")) +
ggtitle("Stroke distribution") +
geom_text(aes(label=..count..), stat = "Count", vjust = 1.0)
d7 <- stroke_data %>%
ggplot(aes(x = Residence_type, fill = Residence_type)) +
geom_bar(fill = c("red", "blue")) +
ggtitle("Residence distribution") +
geom_text(aes(label=..count..), stat = "Count", vjust = 1.0)
grid.arrange(d1,d2,d3,d4,d5,d6,d7, ncol=2)
## Warning: The dot-dot notation (`..count..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(count)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
sample.split(stroke_data$stroke,SplitRatio = 0.8)->split_tag
train<-subset(stroke_data,split_tag==TRUE)
test<-subset(stroke_data,split_tag==FALSE)
dim(train)
## [1] 4087 11
set.seed(123)
rf <- randomForest(formula = stroke~.,data = train)
rf
##
## Call:
## randomForest(formula = stroke ~ ., data = train)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 4.99%
## Confusion matrix:
## No Yes class.error
## No 3883 5 0.001286008
## Yes 199 0 1.000000000
confusionMatrix(predict(rf,test),test$stroke)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 972 50
## Yes 0 0
##
## Accuracy : 0.9511
## 95% CI : (0.936, 0.9635)
## No Information Rate : 0.9511
## P-Value [Acc > NIR] : 0.5375
##
## Kappa : 0
##
## Mcnemar's Test P-Value : 4.219e-12
##
## Sensitivity : 1.0000
## Specificity : 0.0000
## Pos Pred Value : 0.9511
## Neg Pred Value : NaN
## Prevalence : 0.9511
## Detection Rate : 0.9511
## Detection Prevalence : 1.0000
## Balanced Accuracy : 0.5000
##
## 'Positive' Class : No
##
Our model achieves an impressive accuracy rate of over 95%, demonstrating the effectiveness of its training process.