Heart attacks among young individuals are becoming a significant health concern. This analysis aims to identify risk factors and behavioral patterns associated with heart attack likelihood using association rule mining.
library(arules)
library(arulesViz)
library(readxl)
library(vcd)
library(ggplot2)
library(reshape2)
The data were downloaded from Kaggle and then transformed. The dataset contains many variables, both categorical and numerical, describing the lives of fairly young adults from India.
df <- read.csv("heart_attack_youngsters_india.csv")
# rename columns for better visibility
colnames(df) <- c(
"age", "gender", "region", "urban_rural", "socioeconomic_status",
"smoking_status", "alcohol_consumption", "diet_type",
"physical_activity_level", "screen_time_hours_per_day",
"sleep_duration_hours_per_day", "family_history_heart_disease",
"diabetes", "hypertension", "cholesterol_levels_mg_dl", "bmi_kg_m2",
"stress_level", "blood_pressure_sys_dia_mmhg", "resting_heart_rate_bpm",
"ecg_results", "chest_pain_type", "max_heart_rate_achieved",
"exercise_induced_angina", "blood_oxygen_levels_spo2",
"triglyceride_levels_mg_dl", "heart_attack_likelihood"
)
summary(df)
## age gender region urban_rural
## Min. :18.0 Length:10000 Length:10000 Length:10000
## 1st Qu.:22.0 Class :character Class :character Class :character
## Median :27.0 Mode :character Mode :character Mode :character
## Mean :26.6
## 3rd Qu.:31.0
## Max. :35.0
## socioeconomic_status smoking_status alcohol_consumption diet_type
## Length:10000 Length:10000 Length:10000 Length:10000
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## physical_activity_level screen_time_hours_per_day sleep_duration_hours_per_day
## Length:10000 Min. : 0.000 Min. : 3.00
## Class :character 1st Qu.: 4.000 1st Qu.: 4.00
## Mode :character Median : 8.000 Median : 6.00
## Mean : 7.511 Mean : 6.49
## 3rd Qu.:12.000 3rd Qu.: 8.00
## Max. :15.000 Max. :10.00
## family_history_heart_disease diabetes hypertension
## Length:10000 Length:10000 Length:10000
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## cholesterol_levels_mg_dl bmi_kg_m2 stress_level
## Min. :100.0 Min. :15.00 Length:10000
## 1st Qu.:150.0 1st Qu.:21.20 Class :character
## Median :199.0 Median :27.50 Mode :character
## Mean :199.6 Mean :27.44
## 3rd Qu.:249.0 3rd Qu.:33.70
## Max. :300.0 Max. :40.00
## blood_pressure_sys_dia_mmhg resting_heart_rate_bpm ecg_results
## Length:10000 Min. : 60.00 Length:10000
## Class :character 1st Qu.: 74.00 Class :character
## Mode :character Median : 90.00 Mode :character
## Mean : 89.49
## 3rd Qu.:104.00
## Max. :119.00
## chest_pain_type max_heart_rate_achieved exercise_induced_angina
## Length:10000 Min. :100.0 Length:10000
## Class :character 1st Qu.:129.0 Class :character
## Mode :character Median :160.0 Mode :character
## Mean :159.7
## 3rd Qu.:190.0
## Max. :220.0
## blood_oxygen_levels_spo2 triglyceride_levels_mg_dl heart_attack_likelihood
## Min. : 90.00 Min. : 50 Length:10000
## 1st Qu.: 92.40 1st Qu.:164 Class :character
## Median : 94.90 Median :277 Mode :character
## Mean : 94.94 Mean :275
## 3rd Qu.: 97.40 3rd Qu.:385
## Max. :100.00 Max. :500
Screen time and sleep duration were converted to categorical variables by splitting them at their quartiles and coding the resulting bins as “Low”, “Medium”, “High”, and “Very High”.
A healthy resting heart rate was assumed to be at most 100 bpm: values up to and including 100 were classified as “normal”, and values above 100 as “high”.
Age was categorized as “under_20”, “20_29”, and “over_30”; the last group covers ages above 29, so it includes age 30.
df$screen_time_hours_per_day <- as.numeric(df$screen_time_hours_per_day)
df$sleep_duration_hours_per_day <- as.numeric(df$sleep_duration_hours_per_day)
df$heart_attack_likelihood <- as.factor(df$heart_attack_likelihood)
# transform resting_heart_rate
df$resting_heart_rate_category <- cut(df$resting_heart_rate_bpm,
breaks = c(-Inf, 100, Inf),
labels = c("normal", "high"))
# transform screen_time
screen_time_quartiles <- quantile(df$screen_time_hours_per_day, probs = c(0.25, 0.5, 0.75), na.rm = TRUE)
df$screen_time_category <- cut(df$screen_time_hours_per_day,
breaks = c(min(df$screen_time_hours_per_day, na.rm = TRUE),
screen_time_quartiles[1],
screen_time_quartiles[2],
screen_time_quartiles[3],
max(df$screen_time_hours_per_day, na.rm = TRUE)),
labels = c("Low", "Medium", "High", "Very High"),
include.lowest = TRUE)
# transform sleep_duration
sleep_duration_quartiles <- quantile(df$sleep_duration_hours_per_day, probs = c(0.25, 0.5, 0.75), na.rm = TRUE)
df$sleep_duration_category <- cut(df$sleep_duration_hours_per_day,
breaks = c(min(df$sleep_duration_hours_per_day, na.rm = TRUE),
sleep_duration_quartiles[1],
sleep_duration_quartiles[2],
sleep_duration_quartiles[3],
max(df$sleep_duration_hours_per_day, na.rm = TRUE)),
labels = c("Low", "Medium", "High", "Very High"),
include.lowest = TRUE)
# transform age
df$age_category <- cut(df$age,
breaks = c(-Inf, 19, 29, Inf),
labels = c("under_20", "20_29", "over_30"),
include.lowest = TRUE)
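The quartile binning above is written out twice, once per variable. As a hedged sketch, the same logic can be wrapped in a small helper; `quartile_bins` is a hypothetical name introduced here for illustration, and the input vector is made up rather than taken from the dataset.

```r
# Hypothetical helper (not part of the original analysis): bin a numeric
# vector into four quartile-based categories, as done above for screen time
# and sleep duration.
quartile_bins <- function(x, labels = c("Low", "Medium", "High", "Very High")) {
  # probs 0, 0.25, 0.5, 0.75, 1 give min, the three quartiles, and max
  breaks <- quantile(x, probs = seq(0, 1, by = 0.25), na.rm = TRUE)
  # cut() cannot work with repeated break points, which happens when two
  # quartiles coincide (for example in heavily tied data)
  if (anyDuplicated(breaks) > 0) {
    stop("quartile breaks are not unique; choose different cut points")
  }
  # include.lowest = TRUE keeps the minimum value inside the first bin
  cut(x, breaks = breaks, labels = labels, include.lowest = TRUE)
}

# Made-up example: whole hours from 0 to 15
screen_time_example <- 0:15
bins <- quartile_bins(screen_time_example)
print(table(bins))
```

Because `cut()` uses right-closed intervals by default, a value that equals a quartile falls into the lower of the two neighbouring bins.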
Only categorical variables were selected for the analysis. Region was omitted because the analyst lacks a deep understanding of India's geography and sociology, so the variable would not have added interpretable information.
# select categorical variables for analysis
categorical_vars <- c("age_category", "gender", "urban_rural", "socioeconomic_status",
"smoking_status", "alcohol_consumption", "diet_type",
"physical_activity_level", "stress_level",
"exercise_induced_angina", "screen_time_category",
"sleep_duration_category", "resting_heart_rate_category", "heart_attack_likelihood")
df_selected <- df[, categorical_vars]
df_selected[] <- lapply(df_selected, as.factor)
transacs <- as(df_selected, "transactions")
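Coercing a data frame of factors to transactions turns every factor level into one item named variable=level, which is the form visible in the rules further down (for example gender=Male). The following base R sketch only mimics that naming on a made-up two-column data frame; `toy` and `item_labels` are illustrative names, not objects from the analysis.

```r
# Made-up data frame with two factor columns
toy <- data.frame(
  gender       = factor(c("Male", "Female", "Male")),
  stress_level = factor(c("Low", "High", "Medium"))
)

# Build "variable=level" labels column by column; each row of the result
# corresponds to one transaction (one person)
item_labels <- mapply(
  function(column, name) paste(name, column, sep = "="),
  toy, names(toy)
)
print(item_labels)

# The number of distinct items is the total number of factor levels
n_items <- sum(sapply(toy, nlevels))
print(n_items)
```

The same reasoning explains the item count reported in the apriori log: it is the sum of the numbers of levels across all selected variables.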
The Apriori algorithm was chosen because it is well-suited for analyzing frequent item sets in categorical data. The following parameter choices were made:
- Minimum support (supp = 0.005): This threshold is the minimum fraction of transactions that must contain an itemset for it to be considered frequent. A lower support value helps to capture rarer patterns, but too low a value may introduce noise. With 10,000 transactions, 0.5% corresponds to at least 50 records, which seems reasonable.
- Minimum confidence (conf = 0.3): Confidence is the conditional probability that the consequent occurs given that the antecedent occurs. A 30% threshold keeps rules with a reasonable likelihood of being meaningful while still allowing less frequent associations.
- Minimum rule length (minlen = 4): Each rule must contain at least four items (antecedent and consequent combined), which filters out overly simple rules that may not provide deep insights.
- Minimum lift (lift > 1.5): Lift measures how much more likely the consequent is given the antecedent, compared to its baseline probability. A lift above 1.5 means the consequent is at least 50% more likely than its baseline when the antecedent holds. Lift is not an argument of the apriori() call below, so this threshold has to be applied to the mined rules afterwards.
rules <- apriori(transacs, parameter = list(supp = 0.005, conf = 0.3, minlen = 4),
appearance = list(rhs = c("heart_attack_likelihood=Yes", "heart_attack_likelihood=No"), default="lhs"))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.3 0.1 1 none FALSE TRUE 5 0.005 4
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 50
##
## set item appearances ...[2 item(s)] done [0.00s].
## set transactions ...[40 item(s), 10000 transaction(s)] done [0.00s].
## sorting and recoding items ... [40 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 7 8 9 10
## done [0.36s].
## writing ... [182815 rule(s)] done [0.01s].
## creating S4 object ... done [0.03s].
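The rule listing that follows is shown without the command that produced it. One plausible way to obtain such a listing, assuming the rules were filtered by lift and sorted in decreasing order of lift, uses the arules functions `subset()`, `sort()`, `head()` and `inspect()`. The sketch below runs on made-up random data (`toy`, `toy_rules` and `strong_rules` are illustrative names), so it is not a reconstruction of the author's exact code.

```r
library(arules)

# Made-up data for illustration only; not the heart attack dataset
set.seed(42)
n <- 500
toy <- data.frame(
  sleep   = factor(sample(c("Low", "High"), n, replace = TRUE)),
  stress  = factor(sample(c("Low", "High"), n, replace = TRUE)),
  outcome = factor(sample(c("Yes", "No"), n, replace = TRUE, prob = c(0.2, 0.8)))
)
toy_trans <- as(toy, "transactions")

# Mine rules whose right-hand side is restricted to the outcome items
toy_rules <- apriori(
  toy_trans,
  parameter  = list(supp = 0.01, conf = 0.1, minlen = 2),
  appearance = list(rhs = c("outcome=Yes", "outcome=No"), default = "lhs"),
  control    = list(verbose = FALSE)
)

# Keep rules above a lift threshold (1.0 here because the toy data is random),
# then print up to ten rules, strongest lift first
strong_rules <- subset(toy_rules, subset = lift > 1.0)
inspect(head(sort(strong_rules, by = "lift"), 10))
```

Applied to the `rules` object from the analysis, the same calls with `lift > 1.5` would give a listing of the kind shown next.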
## lhs rhs support confidence coverage lift count
## [1] {gender=Male,
## urban_rural=Rural,
## physical_activity_level=Moderate,
## sleep_duration_category=Low} => {heart_attack_likelihood=Yes} 0.0061 0.3426966 0.0178 1.681534 61
## [2] {gender=Male,
## diet_type=Vegetarian,
## screen_time_category=Low,
## resting_heart_rate_category=high} => {heart_attack_likelihood=Yes} 0.0061 0.3297297 0.0185 1.617908 61
## [3] {gender=Male,
## stress_level=Medium,
## screen_time_category=Low,
## sleep_duration_category=Low} => {heart_attack_likelihood=Yes} 0.0050 0.3246753 0.0154 1.593108 50
## [4] {urban_rural=Urban,
## alcohol_consumption=Occasionally,
## stress_level=Medium,
## exercise_induced_angina=No,
## sleep_duration_category=Medium} => {heart_attack_likelihood=Yes} 0.0055 0.3216374 0.0171 1.578201 55
## [5] {gender=Male,
## diet_type=Vegetarian,
## exercise_induced_angina=No,
## screen_time_category=Low,
## resting_heart_rate_category=high} => {heart_attack_likelihood=Yes} 0.0056 0.3163842 0.0177 1.552425 56
## [6] {gender=Male,
## diet_type=Vegetarian,
## sleep_duration_category=Low,
## resting_heart_rate_category=high} => {heart_attack_likelihood=Yes} 0.0052 0.3151515 0.0165 1.546376 52
## [7] {gender=Male,
## urban_rural=Rural,
## physical_activity_level=Moderate,
## exercise_induced_angina=No,
## sleep_duration_category=Low} => {heart_attack_likelihood=Yes} 0.0051 0.3148148 0.0162 1.544724 51
## [8] {urban_rural=Rural,
## physical_activity_level=Moderate,
## stress_level=Medium,
## screen_time_category=Low,
## resting_heart_rate_category=normal} => {heart_attack_likelihood=Yes} 0.0050 0.3144654 0.0159 1.543010 50
## [9] {urban_rural=Rural,
## smoking_status=Never,
## stress_level=Medium,
## exercise_induced_angina=No,
## screen_time_category=High} => {heart_attack_likelihood=Yes} 0.0056 0.3128492 0.0179 1.535079 56
## [10] {alcohol_consumption=Occasionally,
## stress_level=Medium,
## exercise_induced_angina=No,
## sleep_duration_category=Medium,
## resting_heart_rate_category=normal} => {heart_attack_likelihood=Yes} 0.0065 0.3125000 0.0208 1.533366 65
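To make the columns of the listing concrete, the metrics of rule [1] can be recomputed by hand from its count and coverage. In the sketch below, the number of "Yes" cases (2038) is not reported anywhere in the output; it is an assumption inferred from the ratio of confidence to lift, and the variable names are illustrative.

```r
n_transactions <- 10000   # number of transactions, from the apriori log
count_rule     <- 61      # transactions matching both sides of rule [1]
coverage_rule  <- 0.0178  # share of transactions matching the left-hand side

# Assumption: 2038 transactions have heart_attack_likelihood=Yes
# (inferred from confidence / lift, not reported directly in the output)
count_yes <- 2038

support    <- count_rule / n_transactions              # P(LHS and RHS)
confidence <- support / coverage_rule                  # P(RHS | LHS)
lift       <- confidence / (count_yes / n_transactions) # P(RHS | LHS) / P(RHS)

print(round(c(support = support, confidence = confidence, lift = lift), 6))
```

Read this way, rule [1] says that about 34% of the people matching its left-hand side are labeled "Yes", against a baseline of roughly 20% in the whole dataset.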
Certain variables appeared more frequently in the association rules, which suggests a stronger association with heart attack likelihood. These include:
- Stress level: stress frequently co-occurs with other risk factors; in the ten rules listed above it appears five times, always as stress_level=Medium.
- Physical activity: physical activity level appears in high-lift rules; in the listed rules it is physical_activity_level=Moderate.
- Sleep duration: short sleep duration is associated with increased heart attack likelihood; sleep_duration_category=Low appears in four of the ten listed rules.
- Urban vs. rural differences: some distinctions between urban and rural populations were noted, though they were less pronounced than the mainstream portrayal of rural versus urban life would suggest.
Interestingly, some variables that were expected to appear, such as diet type and alcohol consumption, did not appear frequently in strong association rules, which suggests a weaker association than for other factors, at least in the data obtained.
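A quick way to check which variables dominate is to count how often each one occurs on the left-hand side of the ten listed rules. The sketch below transcribes those left-hand sides by hand from the output above (variable names only, without their levels); `top_rules_lhs` and `variable_counts` are illustrative names.

```r
# Left-hand-side variables of the ten listed rules, transcribed by hand
top_rules_lhs <- list(
  c("gender", "urban_rural", "physical_activity_level", "sleep_duration_category"),
  c("gender", "diet_type", "screen_time_category", "resting_heart_rate_category"),
  c("gender", "stress_level", "screen_time_category", "sleep_duration_category"),
  c("urban_rural", "alcohol_consumption", "stress_level",
    "exercise_induced_angina", "sleep_duration_category"),
  c("gender", "diet_type", "exercise_induced_angina",
    "screen_time_category", "resting_heart_rate_category"),
  c("gender", "diet_type", "sleep_duration_category", "resting_heart_rate_category"),
  c("gender", "urban_rural", "physical_activity_level",
    "exercise_induced_angina", "sleep_duration_category"),
  c("urban_rural", "physical_activity_level", "stress_level",
    "screen_time_category", "resting_heart_rate_category"),
  c("urban_rural", "smoking_status", "stress_level",
    "exercise_induced_angina", "screen_time_category"),
  c("alcohol_consumption", "stress_level", "exercise_induced_angina",
    "sleep_duration_category", "resting_heart_rate_category")
)

# Count occurrences per variable, most frequent first
variable_counts <- sort(table(unlist(top_rules_lhs)), decreasing = TRUE)
print(variable_counts)
```

In this small sample, gender and sleep duration each occur in six rules, while smoking status occurs in one. Ten rules are too few to generalize from, so the same count over the full filtered rule set would be more informative.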
One would expect a rule consisting solely of the alcohol, smoking, stress, and sleep variables at their worst possible values, but no such rule was found. Further analysis should be performed, preferably with additional methods, to broaden the understanding of this data and of the consequences of lifestyle choices.