Group Information

Group: 13
Date: 2025-01-11

Name Reg. No.
TAN CHEE YONG 23105890
ANG ZHI YANG 24057531
WU XIAOYUE 23117091
ZHANG YUMAN 23101952
LIU HUI 23098984
PEE JIUN BIN 24055123

Introduction

Stroke prediction is critical in healthcare, as it can significantly impact patient outcomes. This study explores the use of machine learning approaches to predict strokes among both young and elderly patients. By analyzing different factors, we aim to develop a robust model to aid in early detection and prevention. According to the World Health Organization (WHO), stroke is the second leading cause of death globally, accounting for about 11% of all deaths. Thus, the stroke prediction technique with high accuracy would be essential.Machine Learning technique would be the most common technique used to the early prediction in health care area, thus, we aim to use and compare different machine learning method to predict stroke occurrence.

Objectives

  1. Implement different machine learning techniques in stroke prediction among young and old patients.
  2. Evaluate the performance of each machine learning technique while predicting the stroke among young and old patients.
  3. Compare each machine learning technique on stroke predicting among young and old patients using performance metrics.

CRISP-DM

Data Collection: Stroke Prediction Dataset source was obtained from Kaggle. https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset

Data Exploration:

  • Summary Statistics
  • Correlation Test
  • Distribution (Histogram)

Data Preparation

  • Min-Max Normalization
  • Mode Imputation
  • Mean Imputation
  • Encode categorical variable
  • Feature Selection based on Correlation Test
  • Applying k-fold cross validation, k=10
  • Apply SMOTE to balance the class distribution of training set

Modelling: Several classification machine learning models are trained to predict stroke, including:

  • Naive Bayes
  • Decision Tree
  • k-Nearest Neighbors (kNN)

Model Evaluation: The evaluation results (accuracy, precision, sensitivity, specificity, F1-score) for each model are summarized.

  • Confusion Matrix
  • Performance Metrices
  • Accuracy
  • Precision
  • Recall
  • F1-score

Data Ingestion

# Show dimensions
cat("Number of cases:", nrow(df), "\n")
## Number of cases: 5110
cat("Number of variables:", ncol(df), "\n")
## Number of variables: 12
# cat("--------------------------------------------------------------\n")

# Show datatypes
str(df)
## 'data.frame':    5110 obs. of  12 variables:
##  $ id               : int  67 77 84 91 99 121 129 132 156 163 ...
##  $ gender           : chr  "Female" "Female" "Male" "Female" ...
##  $ age              : num  17 13 55 42 31 38 24 80 33 20 ...
##  $ hypertension     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ heart_disease    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ ever_married     : chr  "No" "No" "Yes" "No" ...
##  $ work_type        : chr  "Private" "children" "Private" "Private" ...
##  $ Residence_type   : chr  "Urban" "Rural" "Urban" "Urban" ...
##  $ avg_glucose_level: num  93 85.8 89.2 98.5 108.9 ...
##  $ bmi              : chr  "N/A" "18.6" "31.5" "18.5" ...
##  $ smoking_status   : chr  "formerly smoked" "Unknown" "never smoked" "never smoked" ...
##  $ stroke           : int  0 0 0 0 0 0 0 0 0 0 ...
# cat("--------------------------------------------------------------\n")

Data Cleaning

# Convert "N/A" to NA in BMI column
df$bmi[df$bmi == "N/A"] <- NA

# Check missing values
missing_values <- colSums(is.na(df))
print("Missing values before imputation:")
## [1] "Missing values before imputation:"
print(missing_values)
##                id            gender               age      hypertension 
##                 0                 0                 0                 0 
##     heart_disease      ever_married         work_type    Residence_type 
##                 0                 0                 0                 0 
## avg_glucose_level               bmi    smoking_status            stroke 
##                 0               201                 0                 0
# Convert BMI to numeric and handle missing values
df$bmi <- as.numeric(df$bmi)
df$bmi <- ifelse(is.na(df$bmi), mean(df$bmi, na.rm = TRUE), df$bmi)

# Check missing values after imputation
missing_values <- colSums(is.na(df))
print("\nMissing values after imputation:")
## [1] "\nMissing values after imputation:"
print(missing_values)
##                id            gender               age      hypertension 
##                 0                 0                 0                 0 
##     heart_disease      ever_married         work_type    Residence_type 
##                 0                 0                 0                 0 
## avg_glucose_level               bmi    smoking_status            stroke 
##                 0                 0                 0                 0

Handling Categorical Variables

# Handle unknown smoking status
smoking_status_freq <- table(df$smoking_status)
print("Smoking Status Frequency:")
## [1] "Smoking Status Frequency:"
print(smoking_status_freq)
## 
## formerly smoked    never smoked          smokes         Unknown 
##             885            1892             789            1544
# cat("--------------------------------------------------------------\n")

# Convert unknown to never smoked
df$smoking_status[df$smoking_status == "Unknown"] <- "never smoked"

# Check frequency after imputation
smoking_status_freq <- table(df$smoking_status)
print("Smoking Status Frequency:")
## [1] "Smoking Status Frequency:"
print(smoking_status_freq)
## 
## formerly smoked    never smoked          smokes 
##             885            3436             789

Data Split by Age

# Remove ID and split dataset
df$id <- NULL

# Split into young and old datasets
df_young <- df[df$age < 50,]
df_old <- df[df$age >= 50,]

# Display info of split datasets
print(paste("Young patients dataset:", dim(df_young)))
## [1] "Young patients dataset: 2900" "Young patients dataset: 11"
print(paste("Old patients dataset:", dim(df_old)))
## [1] "Old patients dataset: 2210" "Old patients dataset: 11"

Exploratory Data Analysis

Young Patients Analysis

# Select relevant features for young patients
summary(df_young)
##     gender               age         hypertension     heart_disease     
##  Length:2900        Min.   : 0.08   Min.   :0.00000   Min.   :0.000000  
##  Class :character   1st Qu.:15.00   1st Qu.:0.00000   1st Qu.:0.000000  
##  Mode  :character   Median :28.00   Median :0.00000   Median :0.000000  
##                     Mean   :26.88   Mean   :0.02966   Mean   :0.005172  
##                     3rd Qu.:40.00   3rd Qu.:0.00000   3rd Qu.:0.000000  
##                     Max.   :49.00   Max.   :1.00000   Max.   :1.000000  
##  ever_married        work_type         Residence_type     avg_glucose_level
##  Length:2900        Length:2900        Length:2900        Min.   : 55.12   
##  Class :character   Class :character   Class :character   1st Qu.: 75.81   
##  Mode  :character   Mode  :character   Mode  :character   Median : 89.28   
##                                                           Mean   : 96.78   
##                                                           3rd Qu.:108.12   
##                                                           Max.   :267.76   
##       bmi        smoking_status         stroke        
##  Min.   :10.30   Length:2900        Min.   :0.000000  
##  1st Qu.:21.70   Class :character   1st Qu.:0.000000  
##  Median :26.50   Mode  :character   Median :0.000000  
##  Mean   :27.62                      Mean   :0.006897  
##  3rd Qu.:31.90                      3rd Qu.:0.000000  
##  Max.   :97.60                      Max.   :1.000000

The average age is approximately 26.9 years, indicating that the majority of the young patient group is between 20-30 years old, including some data from children and adolescents. Most young patients have glucose levels within the normal range, though there are some individuals with high blood sugar. The average BMI is close to the overweight range, and there are some patients with extreme values.

Chi-Square

# Chi-square tests for categorical variables
categorical_cols <- c('gender', 'hypertension', 'heart_disease', 'ever_married',
                      'work_type', 'Residence_type', 'smoking_status')

for(col in categorical_cols) {
  contingency_table <- table(df_young[[col]], df_young$stroke)
  chi_test <- chisq.test(contingency_table)
  
  cat(sprintf("Correlation test between %s and stroke:\n", col))
  cat(sprintf("Chi-squared statistic: %f\n", chi_test$statistic))
  cat(sprintf("P-value: %f\n", chi_test$p.value))
  cat("----------------------------------------\n")
  
  alpha <- 0.05
  if(chi_test$p.value < alpha) {
    cat(sprintf("There is a statistically significant association between %s and stroke.\n", col))
  } else {
    cat(sprintf("There is no statistically significant association between %s and stroke.\n", col))
  }
  cat("========================================\n")
}
## Warning in chisq.test(contingency_table): Chi-squared approximation may be
## incorrect
## Correlation test between gender and stroke:
## Chi-squared statistic: 0.951456
## P-value: 0.621432
## ----------------------------------------
## There is no statistically significant association between gender and stroke.
## ========================================
## Warning in chisq.test(contingency_table): Chi-squared approximation may be
## incorrect
## Correlation test between hypertension and stroke:
## Chi-squared statistic: 0.000000
## P-value: 1.000000
## ----------------------------------------
## There is no statistically significant association between hypertension and stroke.
## ========================================
## Warning in chisq.test(contingency_table): Chi-squared approximation may be
## incorrect
## Correlation test between heart_disease and stroke:
## Chi-squared statistic: 0.000000
## P-value: 1.000000
## ----------------------------------------
## There is no statistically significant association between heart_disease and stroke.
## ========================================
## Correlation test between ever_married and stroke:
## Chi-squared statistic: 5.992135
## P-value: 0.014370
## ----------------------------------------
## There is a statistically significant association between ever_married and stroke.
## ========================================
## Warning in chisq.test(contingency_table): Chi-squared approximation may be
## incorrect
## Correlation test between work_type and stroke:
## Chi-squared statistic: 4.406679
## P-value: 0.353757
## ----------------------------------------
## There is no statistically significant association between work_type and stroke.
## ========================================
## Correlation test between Residence_type and stroke:
## Chi-squared statistic: 0.050347
## P-value: 0.822460
## ----------------------------------------
## There is no statistically significant association between Residence_type and stroke.
## ========================================
## Warning in chisq.test(contingency_table): Chi-squared approximation may be
## incorrect
## Correlation test between smoking_status and stroke:
## Chi-squared statistic: 4.106338
## P-value: 0.128328
## ----------------------------------------
## There is no statistically significant association between smoking_status and stroke.
## ========================================

The chi-square test shows that ever-married have statistically significant with stroke among young patients.

Correlation

# Point-biserial correlation for numeric variables
numeric_cols <- c('age', 'avg_glucose_level', 'bmi')

for(col in numeric_cols) {
  correlation <- cor(df_young[[col]], as.numeric(df_young$stroke))
  cor_test <- cor.test(df_young[[col]], as.numeric(df_young$stroke))
  
  cat(sprintf("Point-biserial correlation between %s and stroke:\n", col))
  cat(sprintf("Correlation coefficient: %f\n", correlation))
  cat(sprintf("P-value: %f\n", cor_test$p.value))
  
  alpha <- 0.05
  if(cor_test$p.value < alpha) {
    cat(sprintf("There is a statistically significant association between %s and stroke.\n", col))
  } else {
    cat(sprintf("There is no statistically significant association between %s and stroke.\n", col))
  }
  cat("----------------------------------------\n")
}
## Point-biserial correlation between age and stroke:
## Correlation coefficient: 0.074677
## P-value: 0.000057
## There is a statistically significant association between age and stroke.
## ----------------------------------------
## Point-biserial correlation between avg_glucose_level and stroke:
## Correlation coefficient: -0.004996
## P-value: 0.787964
## There is no statistically significant association between avg_glucose_level and stroke.
## ----------------------------------------
## Point-biserial correlation between bmi and stroke:
## Correlation coefficient: 0.045620
## P-value: 0.014014
## There is a statistically significant association between bmi and stroke.
## ----------------------------------------

According to Point-Biserial Correlation, age and bmi have statistically significant with stroke among young patients.

Visual

Pie Chart

# Visualizations for young patients
# Stroke distribution pie chart
stroke_counts <- table(df_young$stroke)
pie_labels <- c("No Stroke", "Has Stroke")
colors <- c("lightblue", "salmon")

pie(stroke_counts, labels = paste(pie_labels, "\n", round(prop.table(stroke_counts)*100, 1), "%"), 
    col = colors, main = "Distribution of Stroke")

# Adding a legend
legend("topright", 
       legend = pie_labels, 
       fill = colors, 
       title = "Stroke Status")

From the pie chart above, in the ‘young’ data set, most of the observations are not having stroke. There are 99.3% of observations not having stroke and only 0.7% of observations are having stroke

Histogram

# Create age bins
df_young <- df_young %>%
  mutate(age_bin = cut(age, breaks = 5))

# Create a table of age bins vs stroke occurrences
age_stroke_table <- table(df_young$age_bin, df_young$stroke)

# Plot histogram
hist_plot <- ggplot(df_young, aes(x = age)) +
  geom_histogram(bins = 5, fill = 'skyblue', color = 'black', alpha = 0.7) +
  labs(x = 'Age Bins', y = 'Number of People', title = 'Age Distribution (Histogram)') +
  theme_minimal()
# Prepare data for table
age_stroke_data <- as.data.frame.matrix(age_stroke_table)
age_stroke_data$Age_Bin <- rownames(age_stroke_data)
age_stroke_data <- age_stroke_data %>%
  rename(`No Stroke` = `0`, Stroke = `1`)

# Reorder columns for better presentation
age_stroke_data <- age_stroke_data %>%
  select(Age_Bin, `No Stroke`, Stroke)

# Create table plot
table_plot_hist <- tableGrob(age_stroke_data, rows = NULL)

# Combine histogram and table with adjusted heights
combined_plot_hist <- plot_grid(hist_plot, table_plot_hist, ncol = 1, rel_heights = c(3, 2))

# Display the combined plot
print(combined_plot_hist)

From the chart above, we observe that within the ‘young’ category, the youngest observation has an age of 0.0311, while the oldest observation is 49 years old. The majority of observations in this category fall within the age bin ranging from 39.2 to 49, which contains total 730 observations. Additionally, this age bin also has the highest number of stroke cases, with 12 observations having experienced a stroke.

GGPLOT2

# Create a table of ever married counts
ever_married_counts <- table(df_young$ever_married)

# Convert to data frame for ggplot
ever_married_df <- as.data.frame(ever_married_counts)
colnames(ever_married_df) <- c("Ever_Married", "Count")

# Plot pie chart
pie_chart <- ggplot(ever_married_df, aes(x = "", y = Count, fill = Ever_Married)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar("y", start = 0) +
  scale_fill_manual(values = c("skyblue", "lightcoral")) +
  labs(title = "Distribution of Ever Married", x = NULL, y = NULL) +
  theme_void() +
  theme(legend.position = "right") +
  geom_text(aes(label = paste0(Ever_Married, "\n", round(Count / sum(Count) * 100, 1), "%")), 
            position = position_stack(vjust = 0.5), color = "white")
# Create a table of ever married vs stroke occurrences
ever_married_stroke_table <- table(df_young$ever_married, df_young$stroke)

# Prepare data for table
ever_married_stroke_data <- as.data.frame.matrix(ever_married_stroke_table)
ever_married_stroke_data$Ever_Married <- rownames(ever_married_stroke_data)
ever_married_stroke_data <- ever_married_stroke_data %>%
  rename(`No Stroke` = `0`, Stroke = `1`)

# Reorder rows to ensure "Yes" is at the top
ever_married_stroke_data <- ever_married_stroke_data %>%
  arrange(desc(Ever_Married)) %>% select(Ever_Married, `No Stroke`, Stroke)


# Create table plot
table_plot_pie <- tableGrob(ever_married_stroke_data, rows = NULL)

# Combine pie chart and table
combined_plot_pie <- plot_grid(pie_chart, table_plot_pie, ncol = 1, rel_heights = c(3, 1))

# Display the combined plot
print(combined_plot_pie)

The majority of the population in this dataset has never been married, as indicated by the larger segment (54.7%).

The smaller segment (45.3%) reflects those who have been married.

The analysis reveals that the risk of stroke is higher among those who have been married (approximately 0.52%) compared to those who have never been married (approximately 0.17%).

BMI Histogram

# BMI distribution
df_young <- df_young %>%
  mutate(bmi_bin = cut(bmi, breaks = 5))

# Create a table of BMI bins vs stroke occurrences
bmi_stroke_table <- table(df_young$bmi_bin, df_young$stroke)

# Convert the table to a data frame and adjust column names
bmi_stroke_data <- as.data.frame.matrix(bmi_stroke_table)
bmi_stroke_data$BMI_Level_Bin <- rownames(bmi_stroke_data)
bmi_stroke_data <- bmi_stroke_data %>%
  rename(`No Stroke` = `0`, Stroke = `1`)

# Reorder columns for better presentation
bmi_stroke_data <- bmi_stroke_data %>%
  select(BMI_Level_Bin, `No Stroke`, Stroke)


# Plot histogram for BMI distribution
hist_plot <- ggplot(df_young, aes(x = bmi)) +
  geom_histogram(bins = 5, fill = 'lightgreen', color = 'black', alpha = 0.7) +
  labs(title = "Distribution of BMI Levels", x = "BMI Levels", y = "Number of People") +
  theme_minimal()

# Create table plot
table_plot_BMIHist <- tableGrob(bmi_stroke_data, rows = NULL, theme = ttheme_default(
  core = list(fg_params = list(cex = 0.5)),  # Adjust font size
  colhead = list(fg_params = list(cex = 0.6))  # Adjust header font size
))

# Combine histogram and table with adjusted heights
combined_plot_BMIHist <- plot_grid(hist_plot, table_plot_BMIHist, ncol = 1, rel_heights = c(3, 1))

# Display the combined plot
print(combined_plot_BMIHist)

The BMI of our observations ranges from 10.2 to 97.6. The majority of observations have a BMI between 10.2 and 27.8, with 1,638 observations in this range. Within this bin, which contains the highest number of observations, most individuals do not have a stroke. The bin with the highest number of stroke cases is the one ranging from 27.8 to 45.2, with 14 stroke cases observed.

Conclusion

In conclusion, the analysis of the ‘young’ category reveals several key insights.

The age distribution indicates that the youngest individual is 0.0311 years old, while the oldest is 49 years old.

The majority of observations fall within the age bin of 39.2 to 49, which also contains the highest number of stroke cases (12).

Lastly, the BMI values in this group range from 10.2 to 97.6, with most observations falling between 10.2 and 27.8.

The bin with the highest number of stroke cases (14) is the range of 27.8 to 45.2.

These findings suggest that, despite the dataset’s biases regarding hypertension and heart disease, there are potential correlations between stroke occurrences and factors such as glucose levels and BMI.

Old Patients Analysis

# Similar analysis for old patients (df_old)
# [Previous analysis repeated for df_old]

# EDA for old patients
summary(df_old)
##     gender               age         hypertension    heart_disease   
##  Length:2210        Min.   :50.00   Min.   :0.0000   Min.   :0.0000  
##  Class :character   1st Qu.:56.00   1st Qu.:0.0000   1st Qu.:0.0000  
##  Mode  :character   Median :63.00   Median :0.0000   Median :0.0000  
##                     Mean   :64.68   Mean   :0.1864   Mean   :0.1181  
##                     3rd Qu.:74.00   3rd Qu.:0.0000   3rd Qu.:0.0000  
##                     Max.   :82.00   Max.   :1.0000   Max.   :1.0000  
##  ever_married        work_type         Residence_type     avg_glucose_level
##  Length:2210        Length:2210        Length:2210        Min.   : 55.23   
##  Class :character   Class :character   Class :character   1st Qu.: 79.17   
##  Mode  :character   Mode  :character   Mode  :character   Median : 95.88   
##                                                           Mean   :118.44   
##                                                           3rd Qu.:145.34   
##                                                           Max.   :271.74   
##       bmi        smoking_status         stroke      
##  Min.   :11.30   Length:2210        Min.   :0.0000  
##  1st Qu.:26.60   Class :character   1st Qu.:0.0000  
##  Median :29.30   Mode  :character   Median :0.0000  
##  Mean   :30.56                      Mean   :0.1036  
##  3rd Qu.:33.70                      3rd Qu.:0.0000  
##  Max.   :66.80                      Max.   :1.0000
  1. Age: Stroke cases increase with advancing age, and the older segment (75.6–82 years) should receive more focused analysis. Average Glucose Level: Elevated glucose levels in the upper quartiles suggest that glucose is a potential risk factor for stroke.
  2. BMI: Overweight and obesity are common in this population and may contribute indirectly to stroke risk through comorbidities like hypertension and diabetes.
  3. Stroke Prevalence: Stroke is relatively rare (10.4%), but the dataset provides sufficient representation of both stroke and non-stroke cases for analysis. Recommendations for Further Analysis Correlation Analysis: Explore relationships between age, glucose level, BMI, and stroke. Risk Factor Analysis: Identify which patients (e.g., older, high glucose, high BMI) are at the greatest risk of stroke. Imbalance Handling: Use techniques like SMOTE or stratified sampling to ensure balanced model training.

Chi-Square old

# Chi-square tests for categorical variables
categorical_cols <- c('gender', 'hypertension', 'heart_disease', 'ever_married',
                      'work_type', 'Residence_type', 'smoking_status')

for(col in categorical_cols) {
  contingency_table <- table(df_old[[col]], df_old$stroke)
  chi_test <- chisq.test(contingency_table)
  
  cat(sprintf("Correlation test between %s and stroke:\n", col))
  cat(sprintf("Chi-squared statistic: %f\n", chi_test$statistic))
  cat(sprintf("P-value: %f\n", chi_test$p.value))
  cat("----------------------------------------\n")
  
  alpha <- 0.05
  if(chi_test$p.value < alpha) {
    cat(sprintf("There is a statistically significant association between %s and stroke.\n", col))
  } else {
    cat(sprintf("There is no statistically significant association between %s and stroke.\n", col))
  }
  cat("========================================\n")
}
## Correlation test between gender and stroke:
## Chi-squared statistic: 0.387615
## P-value: 0.533556
## ----------------------------------------
## There is no statistically significant association between gender and stroke.
## ========================================
## Correlation test between hypertension and stroke:
## Chi-squared statistic: 15.276555
## P-value: 0.000093
## ----------------------------------------
## There is a statistically significant association between hypertension and stroke.
## ========================================
## Correlation test between heart_disease and stroke:
## Chi-squared statistic: 17.704191
## P-value: 0.000026
## ----------------------------------------
## There is a statistically significant association between heart_disease and stroke.
## ========================================
## Correlation test between ever_married and stroke:
## Chi-squared statistic: 2.187860
## P-value: 0.139102
## ----------------------------------------
## There is no statistically significant association between ever_married and stroke.
## ========================================
## Correlation test between work_type and stroke:
## Chi-squared statistic: 1.230616
## P-value: 0.540475
## ----------------------------------------
## There is no statistically significant association between work_type and stroke.
## ========================================
## Correlation test between Residence_type and stroke:
## Chi-squared statistic: 0.440562
## P-value: 0.506851
## ----------------------------------------
## There is no statistically significant association between Residence_type and stroke.
## ========================================
## Correlation test between smoking_status and stroke:
## Chi-squared statistic: 0.963454
## P-value: 0.617716
## ----------------------------------------
## There is no statistically significant association between smoking_status and stroke.
## ========================================

Variables significantly associated with stroke (P values < 0.05 for the following variables, indicating a statistically significant association with stroke)

  1. hypertension: The chi-square statistic is high (15.2766) and the P-value is very small (0.00009). Conclusion: Patients with hypertension are more likely to have stroke, and hypertension may be an important risk factor for stroke. Recommendation: When preventing and screening for stroke, focus on older adults with high blood pressure.
  2. Heart disease (heart_disease) : Chi-square statistic (17.7042), P value (0.00003). Conclusion: There is a significant association between heart disease patients and stroke. Explanation: Cardiovascular health conditions may directly affect stroke risk, and early intervention should be undertaken in patients with heart disease.

According Chi-square test, hypertension and heart disease have statistically significant association with stroke among old patients.

Correlation

# Point-biserial correlation for numeric variables
numeric_cols <- c('age', 'avg_glucose_level', 'bmi')

for(col in numeric_cols) {
  correlation <- cor(df_old[[col]], as.numeric(df_old$stroke))
  cor_test <- cor.test(df_old[[col]], as.numeric(df_old$stroke))
  
  cat(sprintf("Point-biserial correlation between %s and stroke:\n", col))
  cat(sprintf("Correlation coefficient: %f\n", correlation))
  cat(sprintf("P-value: %f\n", cor_test$p.value))
  
  alpha <- 0.05
  if(cor_test$p.value < alpha) {
    cat(sprintf("There is a statistically significant association between %s and stroke.\n", col))
  } else {
    cat(sprintf("There is no statistically significant association between %s and stroke.\n", col))
  }
  cat("----------------------------------------\n")
}
## Point-biserial correlation between age and stroke:
## Correlation coefficient: 0.188734
## P-value: 0.000000
## There is a statistically significant association between age and stroke.
## ----------------------------------------
## Point-biserial correlation between avg_glucose_level and stroke:
## Correlation coefficient: 0.106729
## P-value: 0.000000
## There is a statistically significant association between avg_glucose_level and stroke.
## ----------------------------------------
## Point-biserial correlation between bmi and stroke:
## Correlation coefficient: -0.027092
## P-value: 0.202975
## There is no statistically significant association between bmi and stroke.
## ----------------------------------------

According to Point-Biserial Correlation, age and average glucose level have statistically significant with stroke among old patients.

Visual

Pie Chart

# Visualizations for old patients
# Stroke distribution pie chart
stroke_counts <- table(df_old$stroke)
pie_labels <- c("No Stroke", "Has Stroke")
colors <- c("lightblue", "salmon")

pie(stroke_counts, labels = paste(pie_labels, "\n", round(prop.table(stroke_counts)*100, 1), 
                                  "%"), col = colors, main = "Distribution of Stroke")

# Adding a legend
legend("topright", 
       legend = pie_labels, 
       fill = colors, 
       title = "Stroke Status")

Based on the pie chart above, most of the observation in “old” data set also not having stroke. In our data set 89.6% of observations are not having stroke and only 10.4% of observations are having stroke.

Histogram

# Create age bins
df_old <- df_old %>%
  mutate(age_bin = cut(age, breaks = 5))

# Create a table of age bins vs stroke occurrences
age_stroke_table <- table(df_old$age_bin, df_old$stroke)

# Plot histogram
hist_plot <- ggplot(df_old, aes(x = age)) +
  geom_histogram(bins = 5, fill = 'skyblue', color = 'black', alpha = 0.7) +
  labs(x = 'Age Bins', y = 'Number of People', title = 'Age Distribution (Histogram)') +
  theme_minimal()
# Prepare data for table
age_stroke_data <- as.data.frame.matrix(age_stroke_table)
age_stroke_data$Age_Bin <- rownames(age_stroke_data)
age_stroke_data <- age_stroke_data %>%
  rename(`No Stroke` = `0`, Stroke = `1`)

# Reorder columns for better presentation
age_stroke_data <- age_stroke_data %>%
  select(Age_Bin, `No Stroke`, Stroke)

# Create table plot
table_plot_hist <- tableGrob(age_stroke_data, rows = NULL)

# Combine histogram and table with adjusted heights
combined_plot_hist <- plot_grid(hist_plot, table_plot_hist, ncol = 1, rel_heights = c(3, 2))

# Display the combined plot
print(combined_plot_hist)

In the ‘old’ category, the age of observations ranges from 50 to 82. The majority of observations fall within the age bin of 50 to 56.4, which contains 591 observations. However, the age range with the highest number of stroke cases is between 75.6 and 82, with 96 observations having experienced a stroke.

From the chart, which highlights the significant relationship between age, average blood sugar levels and stroke incidence in older patients, the following conclusions are drawn:

  1. Stroke prevalence: In the elderly patients category, 10.4% experienced a stroke and 89.6% did not. This reflects the fact that stroke is a relatively rare but crucial health event in this group.

  2. Age distribution and stroke risk: The majority of observations (591 patients) were in the range of 50-56.4 years, indicating that the dataset is biased towards the younger spectrum of the “elderly” category. However, stroke cases were highest in the 75.6 to 82 age group (96 cases), indicating that stroke risk increases significantly with age.

  3. Blood sugar levels and stroke: Elevated average blood sugar levels were statistically significant, highlighting metabolic health as a key contributor to stroke risk.

  4. Focus on high-risk groups: While stroke prevalence was low overall (10.4%), the risk was higher in the 75.6 to 82 year age group and in patients with elevated blood sugar levels. Prevention measures and interventions should give priority to these small groups.

  5. Metabolic Health Management: The strong link between blood sugar levels and stroke underscores the importance of managing diabetes and related diseases in the elderly population.

GGPLOT2

# Create a table 
# Calculate counts for hypertension
hypertension_counts <- table(df_old$hypertension)
hypertension_labels <- c("Hypertension", "No Hypertension")
hypertension_colors <- c("#66b3ff", "#ff9999")

# Create a data frame for plotting
hypertension_df <- as.data.frame(hypertension_counts)
colnames(hypertension_df) <- c("Hypertension", "Count")

# Create pie chart
pie_chart <- ggplot(hypertension_df, aes(x = "", y = Count, fill = factor(Hypertension, levels = c(1, 0), labels = hypertension_labels))) +
  geom_bar(stat = "identity", width = 1, color = "black") +
  coord_polar("y", start = 0) +
  scale_fill_manual(values = hypertension_colors, labels = hypertension_labels) +
  labs(title = "Hypertension Distribution (1 = Yes, 0 = No)", x = NULL, y = NULL, fill = "Hypertension") +
  theme_void() +
  theme(legend.position = "right") +
  geom_text(aes(label = paste0(factor(Hypertension, levels = c(1, 0), labels = hypertension_labels), "\n", round(Count / sum(Count) * 100, 1), "%")),
            position = position_stack(vjust = 0.5), color = "white")
# Create a table 
hypertension_stroke_table <- table(df_old$hypertension, df_old$stroke)

# Prepare data for table
hypertension_stroke_data <- as.data.frame.matrix(hypertension_stroke_table)
hypertension_stroke_data$hypertension <- rownames(hypertension_stroke_data)
hypertension_stroke_data <- hypertension_stroke_data %>%
  rename(`No Stroke` = `0`, Stroke = `1`) 

# Change the column name to "Hypertension Status" 
hypertension_stroke_data <- hypertension_stroke_data %>%
  mutate(Hypertension_Status = ifelse(hypertension == "1", "Hypertension", "No Hypertension")) %>%
  rename(`Hypertension Status` = Hypertension_Status) %>%  
  arrange(`Hypertension Status`) %>%
  select(`Hypertension Status`, `No Stroke`, Stroke)

# Create table plot
table_plot_pie <- tableGrob(hypertension_stroke_data, rows = NULL)

# Combine pie chart and table
combined_plot_pie <- plot_grid(pie_chart, table_plot_pie, ncol = 1, rel_heights = c(3, 1))

# Display the combined plot
print(combined_plot_pie)

In this category, the majority of observations do not have hypertension, with 81.4% showing no signs of the condition. This bias has led to most of the stroke cases also being observed in this group, with 164 observations having experienced a stroke.

# Calculate counts for heart disease
heart_disease_counts <- table(df_old$heart_disease) 
heart_disease_labels <- c("Heart Disease", "No Heart Disease")  
heart_disease_colors <- c("#66b3ff", "#ff9999")  

# Create a data frame for plotting
heart_disease_df <- as.data.frame(heart_disease_counts)
colnames(heart_disease_df) <- c("Heart Disease", "Count")

# Create pie chart
pie_chart <- ggplot(heart_disease_df, aes(x = "", y = Count, fill = factor(`Heart Disease`, levels = c(1, 0), labels = heart_disease_labels))) +
  geom_bar(stat = "identity", width = 1, color = "black") +
  coord_polar("y", start = 0) +
  scale_fill_manual(values = heart_disease_colors, labels = heart_disease_labels) +
  labs(title = "Heart Disease Distribution (1 = Yes, 0 = No)", x = NULL, y = NULL, fill = "Heart Disease") + 
  theme_void() +
  theme(legend.position = "right") +
  geom_text(aes(label = paste0(factor(`Heart Disease`, levels = c(1, 0), labels = heart_disease_labels), "\n", round(Count / sum(Count) * 100, 1), "%")),
            position = position_stack(vjust = 0.5), color = "white")
# Create a crosstab for heart disease and stroke occurrences
heart_disease_stroke_table <- table(df_old$heart_disease, df_old$stroke)

# Prepare data for table
heart_disease_stroke_data <- as.data.frame.matrix(heart_disease_stroke_table)
heart_disease_stroke_data$Heart_Disease <- rownames(heart_disease_stroke_data)
heart_disease_stroke_data <- heart_disease_stroke_data %>%
  rename(`No Stroke` = `0`, Stroke = `1`) 

# Reorder rows to ensure "No Heart Disease" is at the top
heart_disease_stroke_data <- heart_disease_stroke_data %>%
  mutate(Heart_Disease = ifelse(Heart_Disease == "1", "Heart Disease", "No Heart Disease")) %>%
  rename(`Heart Disease` = Heart_Disease) %>%
  arrange(`Heart Disease`) %>%  
  select(`Heart Disease`, `No Stroke`, Stroke)

# Create table plot
table_plot_pie <- tableGrob(heart_disease_stroke_data, rows = NULL)

# Combine pie chart and table
combined_plot_pie <- plot_grid(pie_chart, table_plot_pie, ncol = 1, rel_heights = c(3, 1))

# Display the combined plot
print(combined_plot_pie)

Similarly, most observations in this category do not have heart disease. This bias has resulted in a significantly higher number of stroke cases among individuals without heart disease (182 observations) compared to those with heart disease (47 observations).

# Create glucose level bins
df_old$glucose_bin <- cut(df_old$avg_glucose_level, breaks=5)

# Create a crosstab for glucose levels and stroke occurrences
glucose_stroke_table <- table(df_old$glucose_bin, df_old$stroke)
glucose_stroke_df <- as.data.frame(glucose_stroke_table)
colnames(glucose_stroke_df) <- c("Glucose Level Bin", "Stroke", "Count")

# Create histogram
hist_plot <- ggplot(df_old, aes(x = avg_glucose_level)) +
  geom_histogram(bins = 5, fill = "lightgreen", color = "black", alpha = 0.7) +
  labs(x = "Glucose Level Bins", 
       y = "Number of People", 
       title = "Distribution of Glucose Level Bins") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, size = 16, face = "bold"))

# Prepare data for table
glucose_stroke_data <- glucose_stroke_df %>%
  mutate(`No Stroke` = ifelse(Stroke == 0, Count, NA),
         Stroke = ifelse(Stroke == 1, Count, NA)) %>%
  group_by(`Glucose Level Bin`) %>%
  summarise(`No Stroke` = sum(`No Stroke`, na.rm = TRUE),
            Stroke = sum(Stroke, na.rm = TRUE))

# Create table plot
table_plot <- tableGrob(glucose_stroke_data, rows = NULL)

# Combine histogram and table
combined_plot <- arrangeGrob(hist_plot, table_plot, ncol = 1, heights = unit(c(3, 1), "null"))

# Display the combined plot
grid.draw(combined_plot)

The ‘old’ category has an average glucose level range from 55.0 to 271.7. The majority of observations fall within the bin ranging from 55.0 to 98.5, which represents the lowest glucose levels, with 1,184 observations. Most stroke cases also fall within this bin, with 95 observations experiencing a stroke.

Conclusion

In conclusion, the analysis of the ‘old’ category reveals several notable patterns and biases within the dataset. The age range of observations spans from 50 to 82, with the majority of individuals falling within the 50 to 56.4 age bin. However, the highest number of stroke cases is observed in the age range of 75.6 to 82, with 96 stroke observations in this group. Despite this, the majority of individuals in the ‘old’ category do not have hypertension (81.4%), which introduces a bias, as most stroke cases in this group are found in individuals without hypertension (164 observations).

Similarly, most individuals in this category do not have heart disease, which further skews the data. As a result, the number of stroke cases is significantly higher in individuals without heart disease (182 observations) compared to those with heart disease (47 observations). When considering average glucose levels, the range spans from 55 to 271.7, with most observations falling between 55 and 98.5. Within this bin, which represents the lowest glucose levels, the majority of stroke cases also occur, with 95 stroke observations.

Data Transformation (Before Machine Learning Modelling)

# Data preparation for modeling
# Change stroke, hypertension, and heart_disease data types to character
df$stroke <- as.character(df$stroke)
df$hypertension <- as.character(df$hypertension)
df$heart_disease <- as.character(df$heart_disease)

Age, avg_glucose_level and bmi are continous data but different dimension which may lead to bias prediction model.Hence those data are undergo min-max normalization from 0 to 1.

str(df)
## 'data.frame':    5110 obs. of  11 variables:
##  $ gender           : chr  "Female" "Female" "Male" "Female" ...
##  $ age              : num  17 13 55 42 31 38 24 80 33 20 ...
##  $ hypertension     : chr  "0" "0" "0" "0" ...
##  $ heart_disease    : chr  "0" "0" "0" "0" ...
##  $ ever_married     : chr  "No" "No" "Yes" "No" ...
##  $ work_type        : chr  "Private" "children" "Private" "Private" ...
##  $ Residence_type   : chr  "Urban" "Rural" "Urban" "Urban" ...
##  $ avg_glucose_level: num  93 85.8 89.2 98.5 108.9 ...
##  $ bmi              : num  28.9 18.6 31.5 18.5 52.3 ...
##  $ smoking_status   : chr  "formerly smoked" "never smoked" "never smoked" "never smoked" ...
##  $ stroke           : chr  "0" "0" "0" "0" ...
# Define the columns to normalize, excluding gender
columntonormalize <- c('age', 'avg_glucose_level', 'bmi')

# Function to perform Min-Max normalization
min_max_normalize <- function(df) {
  df[columntonormalize] <- lapply(df[columntonormalize], function(x) {
    (x - min(x)) / (max(x) - min(x))
  })
  return(df)
}

# Normalize both datasets
df_young <- min_max_normalize(df_young)
df_old <- min_max_normalize(df_old)

# Check the result of both young and old patient datasets after normalization
print(head(df_young))
##   gender       age hypertension heart_disease ever_married work_type
## 1 Female 0.3458708            0             0           No   Private
## 2 Female 0.2641047            0             0           No  children
## 4 Female 0.8569092            0             0           No   Private
## 5 Female 0.6320523            0             0           No   Private
## 6 Female 0.7751431            0             0          Yes   Private
## 7 Female 0.4889616            0             0           No   Private
##   Residence_type avg_glucose_level        bmi  smoking_status stroke
## 1          Urban         0.1780004 0.21298095 formerly smoked      0
## 2          Rural         0.1443284 0.09507446    never smoked      0
## 4          Urban         0.2041479 0.09392898    never smoked      0
## 5          Urban         0.2528687 0.48109966    never smoked      0
## 6          Urban         0.1708051 0.21298095    never smoked      0
## 7          Urban         0.1995391 0.18213058    never smoked      0
##       age_bin     bmi_bin
## 1 (9.86,19.6] (27.8,45.2]
## 2 (9.86,19.6] (10.2,27.8]
## 4   (39.2,49] (10.2,27.8]
## 5 (29.4,39.2] (45.2,62.7]
## 6 (29.4,39.2] (27.8,45.2]
## 7 (19.6,29.4] (10.2,27.8]
print("--------------------------------------------------------------")
## [1] "--------------------------------------------------------------"
print(head(df_old))
##    gender     age hypertension heart_disease ever_married     work_type
## 3    Male 0.15625            0             0          Yes       Private
## 8  Female 0.93750            0             0          Yes      Govt_job
## 15   Male 0.96875            0             0          Yes Self-employed
## 17   Male 0.28125            1             1          Yes       Private
## 20   Male 0.90625            0             0          Yes       Private
## 23 Female 0.90625            0             0           No Self-employed
##    Residence_type avg_glucose_level       bmi  smoking_status stroke
## 3           Urban        0.15675950 0.3639640    never smoked      0
## 8           Urban        0.13685280 0.3169953    never smoked      0
## 15          Rural        0.16770588 0.3621622    never smoked      1
## 17          Rural        0.88356196 0.2864865 formerly smoked      0
## 20          Urban        0.66306406 0.2450450    never smoked      0
## 23          Rural        0.07736363 0.4522523    never smoked      0
##        age_bin glucose_bin
## 3    (50,56.4]   (55,98.5]
## 8    (75.6,82]   (55,98.5]
## 15   (75.6,82]   (55,98.5]
## 17 (56.4,62.8]   (228,272]
## 20   (75.6,82]   (185,228]
## 23   (75.6,82]   (55,98.5]

Age: Before Normalization: The age column in df_old ranges from 49.968 to 82 years. In df_young, the range may also vary based on the dataset’s segmentation. After Normalization: The minimum age (49.9) is scaled to 0. The maximum age (82) is scaled to 1. Intermediate values are proportionally scaled within the range [0, 1]. Interpretation: The relative age differences are preserved, allowing fair comparisons between younger and older individuals within the old or young datasets. The transformation ensures that age doesn’t dominate other features (e.g., avg_glucose_level), which might have a different scale. avg_glucose_level Before Normalization: In df_old, avg_glucose_level ranges from 70.1 to 300.0, with the mean around 120.5. In df_young, the range may differ but likely has lower values on average. After Normalization: The lowest glucose level (70.1) is scaled to 0, and the highest (300.0) to 1. Other values are proportionally scaled between [0, 1]. Interpretation: The normalization enables glucose levels to be compared fairly with other features like age and bmi. Elevated glucose levels, often indicative of diabetes or prediabetes, are still identifiable through their relative scaling (closer to 1). Bmi Before Normalization: In df_old, bmi ranges from 18.5 to 45.0, with a mean around 28.7. The distribution in df_young may have a similar but slightly lower range due to age-related factors. After Normalization: The minimum BMI (18.5) is scaled to 0, and the maximum (45.0) to 1. Other values are proportionally scaled. Interpretation: Patients with higher BMI (indicative of overweight or obesity) are scaled to higher values, preserving their relative health risks. Normalized BMI allows it to contribute equally with age and avg_glucose_level during analysis or modeling.

Encode those categorical variable into 0 or 1 for data modeling.

# Encode those categorical variable into 0 or 1 for data modeling.
# Encode categorical variables for df_young
df_young <- df_young %>%
  mutate(
    ever_married = recode(ever_married, 'Yes' = 1, 'No' = 0),
    gender = recode(gender, 'Male' = 1, 'Female' = 0, 'Other' = 0),
    Residence_type = recode(Residence_type, 'Urban' = 1, 'Rural' = 0),
    work_type = recode(work_type, 
                       'Private' = 0, 
                       'Self-employed' = 1, 
                       'Govt_job' = 2, 
                       'children' = 3, 
                       'Never_worked' = 4),
    smoking_status = recode(smoking_status, 
                            'formerly smoked' = 0, 
                            'never smoked' = 1, 
                            'smokes' = 2, 
                            'Unknown' = 3)
  )

# Encode categorical variables for df_old
df_old <- df_old %>%
  mutate(
    ever_married = recode(ever_married, 'Yes' = 1, 'No' = 0),
    gender = recode(gender, 'Male' = 1, 'Female' = 0, 'Other' = 0),
    Residence_type = recode(Residence_type, 'Urban' = 1, 'Rural' = 0),
    work_type = recode(work_type, 
                       'Private' = 0, 
                       'Self-employed' = 1, 
                       'Govt_job' = 2, 
                       'children' = 3, 
                       'Never_worked' = 4),
    smoking_status = recode(smoking_status, 
                            'formerly smoked' = 0, 
                            'never smoked' = 1, 
                            'smokes' = 2, 
                            'Unknown' = 3)
  )
str(df_young)
## 'data.frame':    2900 obs. of  13 variables:
##  $ gender           : num  0 0 0 0 0 0 0 0 1 1 ...
##  $ age              : num  0.346 0.264 0.857 0.632 0.775 ...
##  $ hypertension     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ heart_disease    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ ever_married     : num  0 0 0 0 1 0 1 0 1 0 ...
##  $ work_type        : num  0 3 0 0 0 0 0 0 2 3 ...
##  $ Residence_type   : num  1 0 1 1 1 1 0 0 0 1 ...
##  $ avg_glucose_level: num  0.178 0.144 0.204 0.253 0.171 ...
##  $ bmi              : num  0.213 0.0951 0.0939 0.4811 0.213 ...
##  $ smoking_status   : num  0 1 1 1 1 1 1 1 1 1 ...
##  $ stroke           : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ age_bin          : Factor w/ 5 levels "(0.0311,9.86]",..: 2 2 5 4 4 3 4 3 5 1 ...
##  $ bmi_bin          : Factor w/ 5 levels "(10.2,27.8]",..: 2 1 1 3 2 1 2 2 2 2 ...
str(df_old)
## 'data.frame':    2210 obs. of  13 variables:
##  $ gender           : num  1 0 1 1 1 0 0 0 1 0 ...
##  $ age              : num  0.156 0.938 0.969 0.281 0.906 ...
##  $ hypertension     : int  0 0 0 1 0 0 0 0 1 1 ...
##  $ heart_disease    : int  0 0 0 1 0 0 0 0 0 0 ...
##  $ ever_married     : num  1 1 1 1 1 0 1 1 1 1 ...
##  $ work_type        : num  0 2 1 0 0 1 0 0 1 0 ...
##  $ Residence_type   : num  1 1 0 0 1 0 1 1 0 0 ...
##  $ avg_glucose_level: num  0.157 0.137 0.168 0.884 0.663 ...
##  $ bmi              : num  0.364 0.317 0.362 0.286 0.245 ...
##  $ smoking_status   : num  1 1 1 0 1 1 2 0 1 0 ...
##  $ stroke           : int  0 0 1 0 0 0 0 0 0 0 ...
##  $ age_bin          : Factor w/ 5 levels "(50,56.4]","(56.4,62.8]",..: 1 5 5 2 5 5 3 2 5 2 ...
##  $ glucose_bin      : Factor w/ 5 levels "(55,98.5]","(98.5,142]",..: 1 1 1 5 4 1 1 2 1 1 ...

Select relevant variables that correlated to stroke based on correlation analysis.

Variable chosen

Variable for df_young

# Select specific columns for df_young
corr_young <- df_young %>%
  select(ever_married, age, bmi, stroke)



# Check the result
head(corr_young)
##   ever_married       age        bmi stroke
## 1            0 0.3458708 0.21298095      0
## 2            0 0.2641047 0.09507446      0
## 4            0 0.8569092 0.09392898      0
## 5            0 0.6320523 0.48109966      0
## 6            1 0.7751431 0.21298095      0
## 7            0 0.4889616 0.18213058      0
corr_young$stroke <- as.factor(corr_young$stroke)


X_young <- corr_young[, -which(names(corr_young) == "stroke")]
y_young <-corr_young$stroke

total_size <- length(y_young)
test_size <- round(total_size * 0.1)
train_size <- total_size - test_size

Variable for df_old

# Select specific columns for df_old

corr_old <- df_old %>%
  select(hypertension, heart_disease, age, avg_glucose_level, stroke)

# Check the result
head(corr_old)
##    hypertension heart_disease     age avg_glucose_level stroke
## 3             0             0 0.15625        0.15675950      0
## 8             0             0 0.93750        0.13685280      0
## 15            0             0 0.96875        0.16770588      1
## 17            1             1 0.28125        0.88356196      0
## 20            0             0 0.90625        0.66306406      0
## 23            0             0 0.90625        0.07736363      0
corr_old$stroke <- as.factor(corr_old$stroke)


X_old <- corr_old[, -which(names(corr_old) == "stroke")]
y_old <-corr_old$stroke

total_size <- length(y_old)
test_size <- round(total_size * 0.1)
train_size <- total_size - test_size

Perform cross-validation with k=10 to young and old patients dataset.

Modelling and Evaluation for Young Patient

set.seed(123)
cross_val <- createFolds(y_young, k = 10, list = TRUE)


# Iterate through each fold and divide the training set and the test set
for (train_indices in cross_val) {
  # The test set index is randomly selected from the full dataset index
  test_indices <- sample(1:total_size, test_size)  
  
  # The training set is the part of the full dataset index that excludes the test set index
  train_indices <- setdiff(1:total_size, test_indices)
  
  # Get the training set (using the data from the updated train_indices)
  X_train <- X_young[train_indices, ]  
  y_train <- y_young[train_indices]    
  
 # Get the test set (using the data from the updated test_indices)
  X_test <- X_young[test_indices, ]  
  y_test <- y_young[test_indices]    
}

train_data <- cbind(X_train, stroke = y_train)
table(train_data$stroke)
## 
##    0    1 
## 1976   13

Perform SMOTE on training set of both young and old patients dataset

smote_traindata <- smote(stroke ~ ., data = train_data, perc.over = 1,  k = 5, perc.under = 2)
table(smote_traindata$stroke)
## 
##  0  1 
## 26 26
X_train <- smote_traindata[, -which(names(corr_young) == "stroke")]
y_train <-smote_traindata$stroke

model <- rpart(y_train ~., data = data.frame(X_train, y_train))


predictions <- predict(model, newdata = X_test, type = "class")

# Generate confusion matrix object
conf_matrix_obj <- confusionMatrix(as.factor(predictions), as.factor(y_test))

# Extract metrics
accuracy <- conf_matrix_obj$overall["Accuracy"]
precision <- conf_matrix_obj$byClass["Precision"]
recall <- conf_matrix_obj$byClass["Recall"]
f1_score <- conf_matrix_obj$byClass["F1"]

# Print metrics
cat("DECISION TREE Model Evaluation\n")
## DECISION TREE Model Evaluation
cat("Accuracy:", round(accuracy, 4), "\n")
## Accuracy: 0.8552
cat("Precision:", round(precision, 4), "\n")
## Precision: 0.9895
cat("Recall:", round(recall, 4), "\n")
## Recall: 0.8624
cat("F1-Score:", round(f1_score, 4), "\n")
## F1-Score: 0.9216
# Plot confusion matrix heatmap with ggplot2
library(ggplot2)

cm_df <- as.data.frame(as.table(conf_matrix_obj))
colnames(cm_df) <- c("Actual", "Predicted", "Freq")  # Rename columns

ggplot(cm_df, aes(Actual, Predicted)) +
  geom_tile(aes(fill = Freq), color = "white") +  # Heatmap tiles
  geom_text(aes(label = Freq), vjust = 1) +  # Add text labels
  scale_fill_gradient(low = "white", high = "blue") +  # Color gradient
  labs(
    title = "Confusion Matrix of Decision Tree (young)",
    x = "Predicted",
    y = "Actual",
    subtitle = paste0(
      "Accuracy: ", round(accuracy, 4), 
      " | Precision: ", round(precision, 4), 
      " | Recall: ", round(recall, 4), 
      " | F1-Score: ", round(f1_score, 4)
    )
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# k-nn
k <- 5  # Set the value of k
predictions <- knn(train = X_train, test = X_test, cl = y_train, k = k)


# Generate confusion matrix object
conf_matrix_obj <- confusionMatrix(predictions, y_test)

# Extract metrics
accuracy <- conf_matrix_obj$overall["Accuracy"]
precision <- conf_matrix_obj$byClass["Precision"]
recall <- conf_matrix_obj$byClass["Recall"]
f1_score <- conf_matrix_obj$byClass["F1"]

# Print metrics
cat("KNN Model Evaluation\n")
## KNN Model Evaluation
cat("Accuracy:", round(accuracy, 4), "\n")
## Accuracy: 0.7149
cat("Precision:", round(precision, 4), "\n")
## Precision: 1
cat("Recall:", round(recall, 4), "\n")
## Recall: 0.711
cat("F1-Score:", round(f1_score, 4), "\n")
## F1-Score: 0.8311
# Plot confusion matrix heatmap with ggplot2
library(ggplot2)
conf_matrix <- table(Predicted = predictions, Actual = y_test)
cm_df <- as.data.frame(as.table(conf_matrix))
colnames(cm_df) <- c("Actual", "Predicted", "Freq")  # Rename columns

ggplot(cm_df, aes(Actual, Predicted)) +
  geom_tile(aes(fill = Freq), color = "white") +  # Heatmap tiles
  geom_text(aes(label = Freq), vjust = 1) +  # Add text labels
  scale_fill_gradient(low = "white", high = "blue") +  # Color gradient
  labs(
    title = "Confusion Matrix of k-NN (young)",
    x = "Predicted",
    y = "Actual",
    subtitle = paste0(
      "Accuracy: ", round(accuracy, 4), 
      " | Precision: ", round(precision, 4), 
      " | Recall: ", round(recall, 4), 
      " | F1-Score: ", round(f1_score, 4)
    )
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# NB
# Train the Naive Bayes model
nb_model <- naiveBayes(X_train, y_train)

# Make predictions
predictions <- predict(nb_model, X_test)

# Confusion matrix using caret
conf_matrix_obj <- confusionMatrix(predictions, y_test)

# Extract metrics
accuracy <- conf_matrix_obj$overall["Accuracy"]
precision <- conf_matrix_obj$byClass["Precision"]
recall <- conf_matrix_obj$byClass["Recall"]
f1_score <- conf_matrix_obj$byClass["F1"]

# Print metrics
cat("Naive Bayes Model Evaluation\n")
## Naive Bayes Model Evaluation
cat("Accuracy:", round(accuracy, 4), "\n")
## Accuracy: 0.6516
cat("Precision:", round(precision, 4), "\n")
## Precision: 0.9862
cat("Recall:", round(recall, 4), "\n")
## Recall: 0.656
cat("F1-Score:", round(f1_score, 4), "\n")
## F1-Score: 0.7879
# Create confusion matrix as data frame for plotting
conf_matrix <- table(Predicted = predictions, Actual = y_test)
cm_df <- as.data.frame(as.table(conf_matrix))
colnames(cm_df) <- c("Actual", "Predicted", "Freq")  # Rename columns

# Plot confusion matrix heatmap
ggplot(cm_df, aes(Actual, Predicted)) +
  geom_tile(aes(fill = Freq), color = "white") +  # Heatmap tiles
  geom_text(aes(label = Freq), vjust = 1) +  # Add text labels
  scale_fill_gradient(low = "white", high = "blue") +  # Color gradient
  labs(
    title = "Confusion Matrix for Naive Bayes (young)",
    x = "Predicted",
    y = "Actual",
    subtitle = paste0(
      "Accuracy: ", round(accuracy, 4), 
      " | Precision: ", round(precision, 4), 
      " | Recall: ", round(recall, 4), 
      " | F1-Score: ", round(f1_score, 4)
    )
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Result Evaluation + Interpretation (Young Patients)

Based on the confusion matrix analysis for Decision Tree, the model correctly predicted “no stroke” 129 times and misclassified “no stroke” as “stroke” 92 times. The accuracy was 58.37%, which is relatively low. The decision tree performed well in classifying the “no stroke” category but unable to detect ‘stroke’ category, likely due to the class imbalance in the dataset. The precision, recall, and F1 score for the Decision Tree model were 100%, 58.37% and 73.71% indicating that the model is currently better at predicting “no stroke.” KNN:The model correctly predicted “no stroke” 131 times and misclassified “no stroke” as “stroke” 90 times. The accuracy was 59.28% , and the precision, recall, and F1 score were all 100% , 59.28% and 74.43% respectively compared to the decision tree, KNN’s overall accuracy increased, and the number of misclassifications decreased. Naive Bayes:The overall performance of the Naive Bayes model was poor. According to the confusion matrix, out of 221 “no stroke” samples, 117 were correctly predicted as “no stroke,” while 104 were misclassified as “stroke.” From the performance metrics, the model’s accuracy was 52.94%, precision was only 100%, recall 52.94%, and F1 score 69.23%. These results indicate that the model struggled significantly with the class imbalance in the dataset, resulting in most predictions being incorrect. Comprehensive Evaluation:The KNN model had the highest accuracy but failed to recognize the “stroke” category entirely. The Decision Tree model had slightly lower accuracy than the KNN, with a more evenly distributed error ratio. Naive Bayes performed the worst overall and demonstrated very limited applicability. Due to the small number of stroke cases in the dataset, the KNN is considered the best-performing model among the three. Future Improvement Directions:Given the imbalance in the dataset and the limited number of stroke samples, it is challenging to fully evaluate stroke recognition rates. Future work should consider adjusting class weights or integrating other models to improve performance. (Since k-fold and smote is random and will show different results in every run)

Modelling and Evaluation for Old Patient

set.seed(123)
cross_val <- createFolds(y_old, k = 10, list = TRUE)


# Iterate through each fold and divide the training set and the test set
for (train_indices in cross_val) {
  # The test set index is randomly selected from the full dataset index
  test_indices <- sample(1:total_size, test_size)  
  
  # The training set is the part of the full dataset index that excludes the test set index
  train_indices <- setdiff(1:total_size, test_indices)
  
  # Get the training set (using the data from the updated train_indices)
  X_train <- X_old[train_indices, ]  
  y_train <- y_old[train_indices]    
  
  # Get the test set (using the data from the updated test_indices)
  X_test <- X_old[test_indices, ]  
  y_test <- y_old[test_indices]    
}

train_data <- cbind(X_train, stroke = y_train)
table(train_data$stroke)
## 
##    0    1 
## 1781  208
smote_traindata <- smote(stroke ~ ., data = train_data, perc.over = 1,  k = 5, perc.under = 2)
table(smote_traindata$stroke)
## 
##   0   1 
## 416 416
X_train <- smote_traindata[, -which(names(corr_old) == "stroke")]
y_train <-smote_traindata$stroke

model <- rpart(y_train ~., data = data.frame(X_train, y_train))

# Make predictions on the test set
predictions <- predict(model, newdata = X_test, type = "class")

# Generate confusion matrix object
conf_matrix_obj <- confusionMatrix(as.factor(predictions), as.factor(y_test))

# Extract metrics
accuracy <- conf_matrix_obj$overall["Accuracy"]
precision <- conf_matrix_obj$byClass["Precision"]
recall <- conf_matrix_obj$byClass["Recall"]
f1_score <- conf_matrix_obj$byClass["F1"]

# Print metrics
cat("DECISION TREE Model Evaluation\n")
## DECISION TREE Model Evaluation
cat("Accuracy:", round(accuracy, 4), "\n")
## Accuracy: 0.6878
cat("Precision:", round(precision, 4), "\n")
## Precision: 0.9172
cat("Recall:", round(recall, 4), "\n")
## Recall: 0.72
cat("F1-Score:", round(f1_score, 4), "\n")
## F1-Score: 0.8067
# Plot confusion matrix heatmap with ggplot2
library(ggplot2)

cm_df <- as.data.frame(as.table(conf_matrix_obj))
colnames(cm_df) <- c("Actual", "Predicted", "Freq")  # Rename columns

ggplot(cm_df, aes(Actual, Predicted)) +
  geom_tile(aes(fill = Freq), color = "white") +  # Heatmap tiles
  geom_text(aes(label = Freq), vjust = 1) +  # Add text labels
  scale_fill_gradient(low = "white", high = "blue") +  # Color gradient
  labs(
    title = "Confusion Matrix of Decision Tree (old)",
    x = "Predicted",
    y = "Actual",
    subtitle = paste0(
      "Accuracy: ", round(accuracy, 4), 
      " | Precision: ", round(precision, 4), 
      " | Recall: ", round(recall, 4), 
      " | F1-Score: ", round(f1_score, 4)
    )
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# k-nn
k <- 5  # Set the value of k
predictions <- knn(train = X_train, test = X_test, cl = y_train, k = k)


# Generate confusion matrix object
conf_matrix_obj <- confusionMatrix(predictions, y_test)

# Extract metrics
accuracy <- conf_matrix_obj$overall["Accuracy"]
precision <- conf_matrix_obj$byClass["Precision"]
recall <- conf_matrix_obj$byClass["Recall"]
f1_score <- conf_matrix_obj$byClass["F1"]

# Print metrics
cat("KNN Model Evaluation\n")
## KNN Model Evaluation
cat("Accuracy:", round(accuracy, 4), "\n")
## Accuracy: 0.5928
cat("Precision:", round(precision, 4), "\n")
## Precision: 0.9297
cat("Recall:", round(recall, 4), "\n")
## Recall: 0.595
cat("F1-Score:", round(f1_score, 4), "\n")
## F1-Score: 0.7256
# Plot confusion matrix heatmap with ggplot2
library(ggplot2)
conf_matrix <- table(Predicted = predictions, Actual = y_test)
cm_df <- as.data.frame(as.table(conf_matrix))
colnames(cm_df) <- c("Actual", "Predicted", "Freq")  # Rename columns

ggplot(cm_df, aes(Actual, Predicted)) +
  geom_tile(aes(fill = Freq), color = "white") +  # Heatmap tiles
  geom_text(aes(label = Freq), vjust = 1) +  # Add text labels
  scale_fill_gradient(low = "white", high = "blue") +  # Color gradient
  labs(
    title = "Confusion Matrix of k-NN (old)",
    x = "Predicted",
    y = "Actual",
    subtitle = paste0(
      "Accuracy: ", round(accuracy, 4), 
      " | Precision: ", round(precision, 4), 
      " | Recall: ", round(recall, 4), 
      " | F1-Score: ", round(f1_score, 4)
    )
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# NB
# Train the Naive Bayes model
nb_model <- naiveBayes(X_train, y_train)

# Make predictions
predictions <- predict(nb_model, X_test)

# Confusion matrix using caret
conf_matrix_obj <- confusionMatrix(predictions, y_test)

# Extract metrics
accuracy <- conf_matrix_obj$overall["Accuracy"]
precision <- conf_matrix_obj$byClass["Precision"]
recall <- conf_matrix_obj$byClass["Recall"]
f1_score <- conf_matrix_obj$byClass["F1"]

# Print metrics
cat("Naive Bayes Model Evaluation\n")
## Naive Bayes Model Evaluation
cat("Accuracy:", round(accuracy, 4), "\n")
## Accuracy: 0.6833
cat("Precision:", round(precision, 4), "\n")
## Precision: 0.9392
cat("Recall:", round(recall, 4), "\n")
## Recall: 0.695
cat("F1-Score:", round(f1_score, 4), "\n")
## F1-Score: 0.7989
# Create confusion matrix as data frame for plotting
conf_matrix <- table(Predicted = predictions, Actual = y_test)
cm_df <- as.data.frame(as.table(conf_matrix))
colnames(cm_df) <- c("Actual", "Predicted", "Freq")  # Rename columns

# Plot confusion matrix heatmap
ggplot(cm_df, aes(Actual, Predicted)) +
  geom_tile(aes(fill = Freq), color = "white") +  # Heatmap tiles
  geom_text(aes(label = Freq), vjust = 1) +  # Add text labels
  scale_fill_gradient(low = "white", high = "blue") +  # Color gradient
  labs(
    title = "Confusion Matrix for Naive Bayes (old)",
    x = "Predicted",
    y = "Actual",
    subtitle = paste0(
      "Accuracy: ", round(accuracy, 4), 
      " | Precision: ", round(precision, 4), 
      " | Recall: ", round(recall, 4), 
      " | F1-Score: ", round(f1_score, 4)
    )
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Result Evaluation + Interpretation (Old Patients)

The Decision Tree successfully predicted 122 “no stroke” samples, but misclassified 75 “no stroke” samples as “stroke.” For the “stroke” category, the model correctly classifying only 14 samples while misclassifying 10 “stroke” samples as “no stroke.” From the performance metrics, the model’s accuracy was 61.54, precision of 92.42%, recall 61.93%, and F1 score 74.16%, indicating model able to capture “stroke” category. KNN Model:For the “no stroke” category, its predictive ability was lower than that of the decision tree, correctly classifying 120 “no stroke” samples while misclassifying 77 “no stroke” samples as “stroke.” In the “stroke” category, the model successfully identified 11 samples, slightly less than the decision tree, but still misclassified 13 “stroke” samples as “no stroke.” From the performance metrics, KNN’s accuracy was 59.28%, precision 90.23%, recall 60.91%, and F1 score 72.73%. The overall performance are poor except high precision. Naive Bayes:Correctly predicted 137 “no stroke” samples, but misclassified 60 “no stroke” samples as “stroke.” In the “no stroke” category, the model outperformed the decision tree and KNN. However this is not the case for “stroke” samples where the model only correctly predicting 10 “stroke” samples while misclassifying 14 “stroke” samples as “no stroke.” From the performance metrics, the model achieved an accuracy of 66.52%, precision 90.73%, recall 69.54%, and F1 score 78.74%, demonstrating relatively strong recognition ability for the “no stroke” category. Comprehensive Comparison:The Naive Bayes model achieved the highest performance model except precision is lower than decision Tree. Naive Bayes overall outperformed the other two models in accuracy, recall and F1-Score.The model able to capture both class in stroke distribution than the others model. Future Improvement Directions:Adjust class weights to address sample imbalance, explore ensemble algorithms, and enhance model accuracy. (Since k-fold and smote is random and will show different results in every run)

Comparision models performance (young patients & old patients)

When comparing the performance of the model on datasets between older and younger individuals, the model performed more better on the older dataset. The overall accuracy of the model for the older dataset was around 60% and above. The young datasets shows significantly low in all of the models due to the fact that the data that classfied as stroke is little causing the testing set has no true cases of stroke. It indicating that the young patients dataset does not hold enough information related on risk factors in the true stroke cases. It is recommended that to observe and record more true stroke cases for model to train in future research. The old patients dataset have enough true positive cases so that model can capture and able to predict the least class (in this case ‘stroke’)in the stroke distribution. Hence, the model that trained by old patients dataset can be said are valid but the imbalanced issue still affecting the performance of the models and it is not safe to deploy the model to public use. Same as young patients dataset, it is required to record more true positive cases in order to train the ML model and deploy them if the ML model achieve better performance.