
1 | Dataset Overview

This dataset is intended for predicting whether a patient is likely to suffer a stroke, based on several input parameters. These parameters include demographic information (such as age, gender, and marital status), medical history (hypertension and heart disease), lifestyle factors (smoking status and work type), and physiological measurements (BMI and average glucose level).

Each row in the dataset represents a single patient and provides a snapshot of relevant details that can help predict the likelihood of experiencing a stroke. The goal is to use these features to train machine learning models that can identify patterns and make predictions about stroke occurrence.

Below is a detailed description of the available variables in the dataset, outlining each attribute and its corresponding meaning:

Dataset Attribute Information

Attribute	Description
id	Unique identifier for each patient
gender	“Male”, “Female” or “Other”
age	Age of the patient in years
hypertension	0 if the patient doesn’t have hypertension, 1 if the patient has hypertension
heart_disease	0 if the patient doesn’t have any heart diseases, 1 if the patient has a heart disease
ever_married	“No” or “Yes”, indicating marital status
work_type	Type of employment: “children”, “Govt_job”, “Never_worked”, “Private”, “Self-employed”
Residence_type	“Rural” or “Urban”, indicating the residence type
avg_glucose_level	Average glucose level in the blood
bmi	Body mass index (BMI)
smoking_status	Smoking status: “formerly smoked”, “never smoked”, “smokes”, or “Unknown”
stroke	1 if the patient had a stroke, 0 if not (target variable)

The dataset was retrieved from Kaggle, where it is publicly available, and is intended for educational and research purposes.

2 | Exploratory Data Analysis (EDA)

Introduction

This report presents an Exploratory Data Analysis (EDA) of a stroke dataset, containing various attributes of patients that may be related to their likelihood of experiencing a stroke. This analysis aims to provide insights into the key characteristics of the data and relationships among different variables, which will aid in understanding factors contributing to strokes.

Data Loading and Cleaning

The dataset used in this analysis is named stroke.csv. Below, we load the necessary libraries, load the dataset, and perform data cleaning to ensure all columns are in appropriate data types.

library(ggplot2)
library(dplyr)
library(knitr)
library(kableExtra)
library(plotly)
library(RColorBrewer)
library(gridExtra)

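# Read "N/A" strings as missing values (assumption: missing bmi entries are
# encoded this way in the CSV; the option is harmless if they are not)
stroke_data <- read.csv("stroke.csv", na.strings = c("NA", "N/A"))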

We label the variables for better readability and map categorical variables to human-readable levels.

variable_labels <- list(
  "id" = "ID",
  "gender" = "Gender",
  "age" = "Age",
  "hypertension" = "Hypertension",
  "heart_disease" = "Heart Disease",
  "ever_married" = "Ever Married",
  "work_type" = "Work Type",
  "Residence_type" = "Residence Type",
  "avg_glucose_level" = "Average Glucose Level",
  "bmi" = "Body Mass Index (BMI)",
  "smoking_status" = "Smoking Status",
  "stroke" = "Stroke"
)

# Data Cleaning: Convert columns to appropriate data types
stroke_data$id <- as.factor(stroke_data$id)
stroke_data$gender <- as.factor(stroke_data$gender)
stroke_data$ever_married <- as.factor(stroke_data$ever_married)
stroke_data$work_type <- as.factor(stroke_data$work_type)
stroke_data$Residence_type <- as.factor(stroke_data$Residence_type)
stroke_data$smoking_status <- as.factor(stroke_data$smoking_status)
stroke_data$stroke <- as.factor(stroke_data$stroke)
stroke_data$hypertension <- as.factor(stroke_data$hypertension)
stroke_data$heart_disease <- as.factor(stroke_data$heart_disease)

# Ensure numerical columns are numeric
stroke_data$age <- as.numeric(stroke_data$age)
stroke_data$avg_glucose_level <- as.numeric(stroke_data$avg_glucose_level)
stroke_data$bmi <- as.numeric(stroke_data$bmi)
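
The column-by-column conversions above can also be written more compactly. The following is a minimal sketch, assuming dplyr 1.0 or later (for across()); the toy data frame and its values are hypothetical and only mirror a few of the column names in stroke.csv.

```r
library(dplyr)

# Toy data frame (hypothetical values) with a few of the stroke.csv column names
toy <- data.frame(
  gender = c("Male", "Female", "Female"),
  hypertension = c(0, 1, 0),
  stroke = c(0, 0, 1),
  age = c("67", "49", "80"),
  stringsAsFactors = FALSE
)

factor_cols  <- c("gender", "hypertension", "stroke")
numeric_cols <- c("age")

# Convert each group of columns in one step
toy_clean <- toy %>%
  mutate(
    across(all_of(factor_cols), as.factor),
    across(all_of(numeric_cols), as.numeric)
  )

# Inspect the resulting column types
str(toy_clean)
```

Listing the column names in two vectors keeps the intended type of every column visible in one place, which may be easier to maintain than one assignment per column.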

Handling Missing Values

The bmi column contains missing values. We impute them with the median of the observed bmi values.

stroke_data$bmi[is.na(stroke_data$bmi)] <- median(stroke_data$bmi, na.rm = TRUE)
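
Before imputing, it can help to check how many values are affected. The sketch below shows the same median imputation on a toy vector; the values are hypothetical and are not taken from stroke.csv.

```r
# Toy BMI vector (hypothetical values) with two missing entries
bmi <- c(22.5, NA, 31.0, 27.4, NA, 35.1)

# Count the missing entries before touching the data
n_missing <- sum(is.na(bmi))

# Median of the observed values only
bmi_median <- median(bmi, na.rm = TRUE)

# Replace the missing entries with that median
bmi_imputed <- bmi
bmi_imputed[is.na(bmi_imputed)] <- bmi_median

print(n_missing)
print(bmi_imputed)
```

Median imputation leaves the median unchanged but tends to reduce the spread of the variable, so summary statistics computed afterwards are best read with that in mind.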

Univariate Analysis

Numerical Variables

The dataset contains numerical variables such as age, avg_glucose_level, and bmi. Below, we visualize the distribution of these variables using density plots.

numerical_columns <- c("age", "avg_glucose_level", "bmi")

for (col in numerical_columns) {
  label <- variable_labels[[col]]
  print(
    # .data[[col]] replaces the deprecated aes_string()
    ggplot(stroke_data, aes(x = .data[[col]])) +
      geom_density(fill = "steelblue", alpha = 0.7) +
      theme_minimal() +
      labs(title = paste("Density Plot of", label), x = label, y = "Density")
  )
}

Summary Statistics

Summary statistics provide insights into the central tendencies and spread of the numerical variables.

numerical_summary <- stroke_data %>%
  summarise(
    Age_Min = min(age, na.rm = TRUE), Age_Median = median(age, na.rm = TRUE), Age_Mean = mean(age, na.rm = TRUE), Age_Max = max(age, na.rm = TRUE), Age_SD = sd(age, na.rm = TRUE),
    Glucose_Min = min(avg_glucose_level, na.rm = TRUE), Glucose_Median = median(avg_glucose_level, na.rm = TRUE), Glucose_Mean = mean(avg_glucose_level, na.rm = TRUE), Glucose_Max = max(avg_glucose_level, na.rm = TRUE), Glucose_SD = sd(avg_glucose_level, na.rm = TRUE),
    BMI_Min = min(bmi, na.rm = TRUE), BMI_Median = median(bmi, na.rm = TRUE), BMI_Mean = mean(bmi, na.rm = TRUE), BMI_Max = max(bmi, na.rm = TRUE), BMI_SD = sd(bmi, na.rm = TRUE)
  )

kable(numerical_summary, "html") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover", "condensed"))

Age_Min	Age_Median	Age_Mean	Age_Max	Age_SD	Glucose_Min	Glucose_Median	Glucose_Mean	Glucose_Max	Glucose_SD	BMI_Min	BMI_Median	BMI_Mean	BMI_Max	BMI_SD
0.08	45	43.22661	82	22.61265	55.12	91.885	106.1477	271.74	45.28356	10.3	28.1	28.86204	97.6	7.699562
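
A table with fifteen columns is hard to read. One alternative is a layout with one row per variable, sketched below in base R. The toy data frame and the name summary_long are hypothetical; on the real data, the same function could be applied to the columns listed in numerical_columns.

```r
# Toy data (hypothetical values) standing in for the numerical columns
toy <- data.frame(
  age = c(20, 40, 60),
  bmi = c(22, 26, 30)
)

# Build one summary row per variable, then stack the rows
summary_long <- do.call(rbind, lapply(names(toy), function(v) {
  x <- toy[[v]]
  data.frame(
    Variable = v,
    Min = min(x, na.rm = TRUE),
    Median = median(x, na.rm = TRUE),
    Mean = mean(x, na.rm = TRUE),
    Max = max(x, na.rm = TRUE),
    SD = sd(x, na.rm = TRUE)
  )
}))

print(summary_long)
```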

Categorical Variables

The frequency distributions of categorical variables provide insights into the proportion of patients in different categories.

categorical_columns <- c("gender", "hypertension", "heart_disease", "ever_married", "work_type", "Residence_type", "smoking_status", "stroke")

for (col in categorical_columns) {
  label <- variable_labels[[col]]
  print(
    # .data[[col]] and after_stat(count) replace the deprecated
    # aes_string() and ..count.. notations
    ggplot(stroke_data, aes(x = .data[[col]], fill = .data[[col]])) +
      geom_bar(alpha = 0.8) +
      geom_text(stat = "count", aes(label = after_stat(count)), vjust = -0.5, size = 3) +
      theme_minimal() +
      labs(title = paste("Frequency Distribution of", label), x = label, y = "Count", fill = label) +
      theme(axis.text.x = element_text(angle = 45, hjust = 1))
  )
}
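
Bar charts show counts, but relative frequencies are often easier to compare, especially for an imbalanced target such as stroke. The following sketch uses a toy factor with hypothetical values; the proportions it prints say nothing about the real dataset.

```r
# Toy target variable (hypothetical values): 8 patients without stroke, 2 with
stroke <- factor(c(0, 0, 0, 0, 1, 0, 0, 1, 0, 0))

# Absolute counts per level
counts <- table(stroke)

# Share of each level, as a proportion of all observations
shares <- prop.table(counts)

print(counts)
print(round(100 * shares, 1))
```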

Bivariate Analysis

Scatter Plots for Numerical Variables

The relationships between pairs of numerical variables are explored using scatter plots.

scatter_pairs <- combn(numerical_columns, 2)

for (i in seq_len(ncol(scatter_pairs))) {
  var1 <- scatter_pairs[1, i]
  var2 <- scatter_pairs[2, i]
  label1 <- variable_labels[[var1]]
  label2 <- variable_labels[[var2]]
  print(
    # .data[[...]] replaces the deprecated aes_string()
    ggplot(stroke_data, aes(x = .data[[var1]], y = .data[[var2]])) +
      geom_point(alpha = 0.6, color = "darkgray") +
      # Linear trend line; the explicit formula avoids a message from geom_smooth()
      geom_smooth(method = "lm", formula = y ~ x, se = FALSE, color = "blue") +
      labs(title = paste("Scatter Plot of", label1, "vs", label2), x = label1, y = label2) +
      theme_minimal()
  )
}

Multivariate Analysis

Box Plots to Compare Categorical and Numerical Variables

To further explore the data, we compare numerical variables across different categories using box plots.

for (num_col in numerical_columns) {
  num_label <- variable_labels[[num_col]]
  for (cat_col in categorical_columns) {
    cat_label <- variable_labels[[cat_col]]
    print(
      # .data[[...]] replaces the deprecated aes_string()
      ggplot(stroke_data, aes(x = .data[[cat_col]], y = .data[[num_col]], fill = .data[[cat_col]])) +
        geom_boxplot(alpha = 0.7) +
        theme_minimal() +
        labs(title = paste("Box Plot of", num_label, "by", cat_label), x = cat_label, y = num_label, fill = cat_label) +
        theme(axis.text.x = element_text(angle = 45, hjust = 1))
    )
  }
}
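
The quantities that a box plot draws (quartiles and median per group) can also be tabulated directly. This is a minimal sketch assuming dplyr; the toy data frame and the name by_group are hypothetical.

```r
library(dplyr)

# Toy data (hypothetical values): age for patients without (0) and with (1) a stroke
toy <- data.frame(
  stroke = factor(c(0, 0, 0, 1, 1, 1)),
  age = c(30, 40, 50, 60, 70, 80)
)

# Per-group sample size, quartiles, and median
by_group <- toy %>%
  group_by(stroke) %>%
  summarise(
    n = n(),
    q1 = unname(quantile(age, 0.25)),
    median = median(age),
    q3 = unname(quantile(age, 0.75)),
    .groups = "drop"
  )

print(by_group)
```

A table like this complements the plots when exact group medians need to be quoted in the text.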

Correlation Analysis

A correlation matrix is computed to explore the linear relationships between numerical variables. Below, we list the pairwise Pearson correlations, sorted by absolute value. No significance test is applied at this stage.

library(reshape2)
num_data <- stroke_data %>% select(all_of(numerical_columns))

# Ensure that num_data only contains numeric columns
num_data <- num_data %>% mutate_if(is.factor, as.numeric)

cor_matrix <- cor(num_data, use = "complete.obs")

upper_triangle <- cor_matrix
upper_triangle[lower.tri(upper_triangle, diag = TRUE)] <- NA

cor_data <- melt(upper_triangle, na.rm = TRUE)
cor_data <- cor_data %>%
  filter(!is.na(value)) %>%
  arrange(desc(abs(value)))

colnames(cor_data) <- c("Variable 1", "Variable 2", "Correlation")

kable(cor_data, "html") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover", "condensed"))

Variable 1	Variable 2	Correlation
age	bmi	0.3242957
age	avg_glucose_level	0.2381711
avg_glucose_level	bmi	0.1668757
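
The table reports correlation coefficients only. If a statement about statistical significance is needed, one option is base R's cor.test(), sketched here on toy vectors with hypothetical values. It is not part of the analysis above.

```r
# Toy vectors (hypothetical values) with a strong positive association
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
y <- c(2, 1, 4, 3, 6, 5, 8, 7)

# Pearson correlation with a two-sided test of the null hypothesis rho = 0
result <- cor.test(x, y, method = "pearson")

print(result$estimate)  # correlation coefficient
print(result$p.value)   # p-value of the test
print(result$conf.int)  # 95% confidence interval for the coefficient
```

With several thousand rows, even weak correlations tend to produce small p-values, so the size of the coefficient usually remains the more informative quantity.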

Conclusion

This EDA provides an overview of the stroke dataset, covering both the distributions of individual variables and the relationships between them. Among the numerical variables, the pairwise correlations are positive but weak to moderate: age and BMI (0.32), age and average glucose level (0.24), and average glucose level and BMI (0.17). The visualizations and summaries point to factors that may be associated with stroke incidence and serve as the foundation for further analysis and predictive modeling.