1 | Dataset Overview
This dataset is intended for predicting whether a patient is likely to suffer a stroke, based on several input parameters. These parameters include demographic information (such as age, gender, and marital status), medical history (hypertension and heart disease), lifestyle factors (smoking status, work type), and physiological measurements (BMI, average glucose level).
Each row in the dataset represents a single patient and provides a snapshot of relevant details that can help predict the likelihood of experiencing a stroke. The goal is to use these features to train machine learning models that can identify patterns and make predictions about stroke occurrence.
Below is a detailed description of the available variables in the dataset, outlining each attribute and its corresponding meaning:
Dataset Attribute Information
| Attribute | Description |
|---|---|
| id | Unique identifier for each patient |
| gender | “Male”, “Female” or “Other” |
| age | Age of the patient in years |
| hypertension | 0 if the patient doesn’t have hypertension, 1 if the patient has hypertension |
| heart_disease | 0 if the patient doesn’t have any heart diseases, 1 if the patient has a heart disease |
| ever_married | “No” or “Yes”, indicating whether the patient has ever been married |
| work_type | Type of employment: “children”, “Govt_job”, “Never_worked”, “Private”, “Self-employed” |
| Residence_type | “Rural” or “Urban”, indicating the residence type |
| avg_glucose_level | Average glucose level in the blood |
| bmi | Body mass index (BMI) |
| smoking_status | Smoking status: “formerly smoked”, “never smoked”, “smokes”, or “Unknown” |
| stroke | 1 if the patient had a stroke, 0 if not (target variable) |
The dataset was retrieved from Kaggle, where it is publicly available, and is intended for educational and research purposes.
2 | Exploratory Data Analysis (EDA)
This report presents an Exploratory Data Analysis (EDA) of a stroke dataset, containing various attributes of patients that may be related to their likelihood of experiencing a stroke. This analysis aims to provide insights into the key characteristics of the data and relationships among different variables, which will aid in understanding factors contributing to strokes.
The dataset used in this analysis is named stroke.csv.
Below, we load the necessary libraries, load the dataset, and perform data cleaning to ensure all columns are in appropriate data types.
library(ggplot2)
library(dplyr)
library(knitr)
library(kableExtra)
library(plotly)
library(RColorBrewer)
library(gridExtra)
stroke_data <- read.csv("stroke.csv")

We label the variables for better readability and map categorical variables to human-readable levels.
variable_labels <- list(
"id" = "ID",
"gender" = "Gender",
"age" = "Age",
"hypertension" = "Hypertension",
"heart_disease" = "Heart Disease",
"ever_married" = "Ever Married",
"work_type" = "Work Type",
"Residence_type" = "Residence Type",
"avg_glucose_level" = "Average Glucose Level",
"bmi" = "Body Mass Index (BMI)",
"smoking_status" = "Smoking Status",
"stroke" = "Stroke"
)
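
The cleaning step that follows converts the 0/1 indicator columns to factors but keeps their levels as "0" and "1". If more readable levels are wanted in legends and axis ticks, one option is to recode them at conversion time. The sketch below is only an illustration on toy data: `toy` and the helper `label_binary` are hypothetical names, and the "No"/"Yes" labels are an assumption, not something the dataset defines.

```r
# Toy data standing in for stroke_data (values are illustrative only)
toy <- data.frame(
  hypertension  = c(0, 1, 0),
  heart_disease = c(1, 0, 0),
  stroke        = c(0, 0, 1)
)

# Hypothetical helper: recode a 0/1 indicator as a factor with readable levels
label_binary <- function(x) {
  factor(x, levels = c(0, 1), labels = c("No", "Yes"))
}

# Apply the helper to every binary indicator column
binary_columns <- c("hypertension", "heart_disease", "stroke")
toy[binary_columns] <- lapply(toy[binary_columns], label_binary)

str(toy)
```

Fixing `levels = c(0, 1)` explicitly keeps the level order stable even if one of the two values happens to be absent from a subset of the data.
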
# Data Cleaning: Convert columns to appropriate data types
stroke_data$id <- as.factor(stroke_data$id)
stroke_data$gender <- as.factor(stroke_data$gender)
stroke_data$ever_married <- as.factor(stroke_data$ever_married)
stroke_data$work_type <- as.factor(stroke_data$work_type)
stroke_data$Residence_type <- as.factor(stroke_data$Residence_type)
stroke_data$smoking_status <- as.factor(stroke_data$smoking_status)
stroke_data$stroke <- as.factor(stroke_data$stroke)
stroke_data$hypertension <- as.factor(stroke_data$hypertension)
stroke_data$heart_disease <- as.factor(stroke_data$heart_disease)
# Ensure numerical columns are numeric
stroke_data$age <- as.numeric(stroke_data$age)
stroke_data$avg_glucose_level <- as.numeric(stroke_data$avg_glucose_level)
stroke_data$bmi <- as.numeric(stroke_data$bmi)

To handle missing values, we imputed missing bmi values using the median.
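
The imputation code itself is not shown in the report. A minimal sketch of median imputation follows, assuming that missing entries arrive as the string "N/A" and therefore become `NA` once the column is coerced with `as.numeric()`. The vector `bmi_raw` is toy data, not values from `stroke.csv`.

```r
# Toy column standing in for stroke_data$bmi; "N/A" is an assumed missing-value marker
bmi_raw <- c("36.6", "N/A", "32.5", "34.4", "N/A", "24.0")

# as.numeric() turns non-numeric strings into NA (and emits a coercion warning)
bmi <- suppressWarnings(as.numeric(bmi_raw))

# Compute the median of the observed values only
bmi_median <- median(bmi, na.rm = TRUE)

# Replace every NA with that median
bmi[is.na(bmi)] <- bmi_median

bmi
```

Applied to the real data, the same two steps would operate on `stroke_data$bmi`. Under the same assumption about the marker, passing `na.strings = "N/A"` to `read.csv()` would let the column be read as numeric from the start.
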
The dataset contains numerical variables such as age, avg_glucose_level, and bmi. Below, we visualize the distribution of these variables using density plots.
numerical_columns <- c("age", "avg_glucose_level", "bmi")
for (col in numerical_columns) {
  label <- variable_labels[[gsub("\\.", "_", col)]]
  print(
    # .data[[col]] replaces the deprecated aes_string()
    ggplot(stroke_data, aes(x = .data[[col]])) +
      geom_density(fill = "steelblue", alpha = 0.7) +
      theme_minimal() +
      labs(title = paste("Density Plot of", label), x = label, y = "Density")
  )
}

Summary statistics describe the central tendency and spread of the numerical variables.
numerical_summary <- stroke_data %>%
summarise(
Age_Min = min(age, na.rm = TRUE), Age_Median = median(age, na.rm = TRUE), Age_Mean = mean(age, na.rm = TRUE), Age_Max = max(age, na.rm = TRUE), Age_SD = sd(age, na.rm = TRUE),
Glucose_Min = min(avg_glucose_level, na.rm = TRUE), Glucose_Median = median(avg_glucose_level, na.rm = TRUE), Glucose_Mean = mean(avg_glucose_level, na.rm = TRUE), Glucose_Max = max(avg_glucose_level, na.rm = TRUE), Glucose_SD = sd(avg_glucose_level, na.rm = TRUE),
BMI_Min = min(bmi, na.rm = TRUE), BMI_Median = median(bmi, na.rm = TRUE), BMI_Mean = mean(bmi, na.rm = TRUE), BMI_Max = max(bmi, na.rm = TRUE), BMI_SD = sd(bmi, na.rm = TRUE)
)
kable(numerical_summary, "html") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover", "condensed"))

| Age_Min | Age_Median | Age_Mean | Age_Max | Age_SD | Glucose_Min | Glucose_Median | Glucose_Mean | Glucose_Max | Glucose_SD | BMI_Min | BMI_Median | BMI_Mean | BMI_Max | BMI_SD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.08 | 45 | 43.22661 | 82 | 22.61265 | 55.12 | 91.885 | 106.1477 | 271.74 | 45.28356 | 10.3 | 28.1 | 28.86204 | 97.6 | 7.699562 |
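
A fifteen-column table is hard to scan. As an alternative layout, the same five statistics can be arranged with one row per variable. The sketch below uses base R on toy data; `toy` and the helper `summarise_numeric` are hypothetical names introduced only for illustration.

```r
# Toy data standing in for the numeric columns of stroke_data
toy <- data.frame(
  age = c(20, 40, 60, 80),
  bmi = c(22, 25, 31, NA)
)

# Hypothetical helper: five summary statistics for one numeric vector
summarise_numeric <- function(x) {
  c(
    Min    = min(x, na.rm = TRUE),
    Median = median(x, na.rm = TRUE),
    Mean   = mean(x, na.rm = TRUE),
    Max    = max(x, na.rm = TRUE),
    SD     = sd(x, na.rm = TRUE)
  )
}

# sapply() returns a statistics-by-variable matrix; t() puts variables in rows
summary_long <- t(sapply(toy, summarise_numeric))

round(summary_long, 2)
```

In this layout a gap between mean and median is easy to spot. In the table above, for example, the mean glucose level (106.15) is well above the median (91.89), which is what a right-skewed distribution would produce.
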
The frequency distributions of categorical variables provide insights into the proportion of patients in different categories.
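
Proportions can also be tabulated directly rather than read off a chart. The following is a minimal sketch with base R's `table()` and `prop.table()`; the vector `stroke` is toy data and does not reflect the class balance of the actual dataset.

```r
# Toy column standing in for stroke_data$stroke (values are illustrative only)
stroke <- factor(c(0, 0, 0, 1, 0, 0, 0, 0, 1, 0))

# table() counts the observations in each level
counts <- table(stroke)

# prop.table() divides each count by the total, giving proportions that sum to 1
proportions <- prop.table(counts)

# Express the proportions as percentages
round(100 * proportions, 1)
```

For the target variable this kind of table makes any class imbalance explicit, which matters when choosing evaluation metrics for a predictive model. The bar charts produced by the code that follows show the same information as raw counts.
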
categorical_columns <- c("gender", "hypertension", "heart_disease", "ever_married", "work_type", "Residence_type", "smoking_status", "stroke")
for (col in categorical_columns) {
  label <- variable_labels[[gsub("\\.", "_", col)]]
  print(
    ggplot(stroke_data, aes(x = .data[[col]], fill = .data[[col]])) +
      geom_bar(alpha = 0.8) +
      # after_stat(count) replaces the deprecated ..count.. notation
      geom_text(stat = "count", aes(label = after_stat(count)), vjust = -0.5, size = 3) +
      theme_minimal() +
      labs(title = paste("Frequency Distribution of", label), x = label, y = "Count", fill = label) +
      theme(axis.text.x = element_text(angle = 45, hjust = 1))
  )
}

The relationships between pairs of numerical variables are explored using scatter plots, each with a fitted linear trend line.
scatter_pairs <- combn(numerical_columns, 2)
for (i in seq_len(ncol(scatter_pairs))) {
  var1 <- scatter_pairs[1, i]
  var2 <- scatter_pairs[2, i]
  label1 <- variable_labels[[gsub("\\.", "_", var1)]]
  label2 <- variable_labels[[gsub("\\.", "_", var2)]]
  print(
    # .data[[...]] replaces the deprecated aes_string()
    ggplot(stroke_data, aes(x = .data[[var1]], y = .data[[var2]])) +
      geom_point(alpha = 0.6, color = "darkgray") +
      geom_smooth(method = "lm", se = FALSE, color = "blue") +
      labs(title = paste("Scatter Plot of", label1, "vs", label2), x = label1, y = label2) +
      theme_minimal()
  )
}

To further explore the data, we compare numerical variables across different categories using box plots.
for (num_col in numerical_columns) {
  num_label <- variable_labels[[gsub("\\.", "_", num_col)]]
  for (cat_col in categorical_columns) {
    cat_label <- variable_labels[[gsub("\\.", "_", cat_col)]]
    print(
      # .data[[...]] replaces the deprecated aes_string()
      ggplot(stroke_data, aes(x = .data[[cat_col]], y = .data[[num_col]], fill = .data[[cat_col]])) +
        geom_boxplot(alpha = 0.7) +
        theme_minimal() +
        labs(title = paste("Box Plot of", num_label, "by", cat_label), x = cat_label, y = num_label, fill = cat_label) +
        theme(axis.text.x = element_text(angle = 45, hjust = 1))
    )
  }
}

A correlation matrix is computed to explore the linear relationships between numerical variables. Below, we list the pairwise correlations, sorted by absolute magnitude.
library(reshape2)
num_data <- stroke_data %>% select(all_of(numerical_columns))
# Ensure that num_data only contains numeric columns
num_data <- num_data %>% mutate_if(is.factor, as.numeric)
cor_matrix <- cor(num_data, use = "complete.obs")
upper_triangle <- cor_matrix
upper_triangle[lower.tri(upper_triangle, diag = TRUE)] <- NA
cor_data <- melt(upper_triangle, na.rm = TRUE)
cor_data <- cor_data %>%
filter(!is.na(value)) %>%
arrange(desc(abs(value)))
colnames(cor_data) <- c("Variable 1", "Variable 2", "Correlation")
kable(cor_data, "html") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover", "condensed"))

| Variable 1 | Variable 2 | Correlation |
|---|---|---|
| age | bmi | 0.3242957 |
| age | avg_glucose_level | 0.2381711 |
| avg_glucose_level | bmi | 0.1668757 |
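
The table above ranks the correlations by magnitude, but it does not by itself test whether each one differs from zero. One way to check that is `cor.test()` from base R's stats package, sketched below on simulated data. The vectors `x` and `y` and the slope of 0.3 are arbitrary choices for illustration and are not derived from the stroke dataset.

```r
set.seed(1)  # make the simulated data reproducible

# Simulated pair of variables with a modest linear association
x <- rnorm(200)
y <- 0.3 * x + rnorm(200)

# Pearson correlation test of the null hypothesis that the true correlation is zero
result <- cor.test(x, y, method = "pearson")

result$estimate  # sample correlation coefficient
result$p.value   # p-value of the test
result$conf.int  # 95% confidence interval for the correlation
```

With a large sample, even weak correlations can come out statistically significant, so the size of the coefficient usually says more about practical relevance than the p-value does.
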
This EDA provides a detailed overview of the stroke dataset, focusing on both the individual characteristics of each variable and the relationships between variables. The visualizations and analyses highlight key factors that may contribute to stroke incidence and will serve as the foundation for further analysis and predictive modeling.