Introduction

Heart attacks among young individuals are becoming a significant health concern. This analysis aims to identify risk factors and behavioral patterns associated with heart attack likelihood using association rule mining.

Load necessary libraries

library(arules)
library(arulesViz)
library(readxl)
library(vcd)
library(ggplot2)
library(reshape2)

Load the dataset

The data was downloaded from Kaggle and then transformed. It contains many variables, both categorical and numerical, describing the lives of fairly young adults from India.

df <- read.csv("heart_attack_youngsters_india.csv")

# rename columns for better visibility
colnames(df) <- c(
    "age", "gender", "region", "urban_rural", "socioeconomic_status",
    "smoking_status", "alcohol_consumption", "diet_type",
    "physical_activity_level", "screen_time_hours_per_day",
    "sleep_duration_hours_per_day", "family_history_heart_disease",
    "diabetes", "hypertension", "cholesterol_levels_mg_dl", "bmi_kg_m2",
    "stress_level", "blood_pressure_sys_dia_mmhg", "resting_heart_rate_bpm",
    "ecg_results", "chest_pain_type", "max_heart_rate_achieved",
    "exercise_induced_angina", "blood_oxygen_levels_spo2",
    "triglyceride_levels_mg_dl", "heart_attack_likelihood"
)

summary(df)
##       age          gender             region          urban_rural       
##  Min.   :18.0   Length:10000       Length:10000       Length:10000      
##  1st Qu.:22.0   Class :character   Class :character   Class :character  
##  Median :27.0   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :26.6                                                           
##  3rd Qu.:31.0                                                           
##  Max.   :35.0                                                           
##  socioeconomic_status smoking_status     alcohol_consumption  diet_type        
##  Length:10000         Length:10000       Length:10000        Length:10000      
##  Class :character     Class :character   Class :character    Class :character  
##  Mode  :character     Mode  :character   Mode  :character    Mode  :character  
##                                                                                
##                                                                                
##                                                                                
##  physical_activity_level screen_time_hours_per_day sleep_duration_hours_per_day
##  Length:10000            Min.   : 0.000            Min.   : 3.00               
##  Class :character        1st Qu.: 4.000            1st Qu.: 4.00               
##  Mode  :character        Median : 8.000            Median : 6.00               
##                          Mean   : 7.511            Mean   : 6.49               
##                          3rd Qu.:12.000            3rd Qu.: 8.00               
##                          Max.   :15.000            Max.   :10.00               
##  family_history_heart_disease   diabetes         hypertension      
##  Length:10000                 Length:10000       Length:10000      
##  Class :character             Class :character   Class :character  
##  Mode  :character             Mode  :character   Mode  :character  
##                                                                    
##                                                                    
##                                                                    
##  cholesterol_levels_mg_dl   bmi_kg_m2     stress_level      
##  Min.   :100.0            Min.   :15.00   Length:10000      
##  1st Qu.:150.0            1st Qu.:21.20   Class :character  
##  Median :199.0            Median :27.50   Mode  :character  
##  Mean   :199.6            Mean   :27.44                     
##  3rd Qu.:249.0            3rd Qu.:33.70                     
##  Max.   :300.0            Max.   :40.00                     
##  blood_pressure_sys_dia_mmhg resting_heart_rate_bpm ecg_results       
##  Length:10000                Min.   : 60.00         Length:10000      
##  Class :character            1st Qu.: 74.00         Class :character  
##  Mode  :character            Median : 90.00         Mode  :character  
##                              Mean   : 89.49                           
##                              3rd Qu.:104.00                           
##                              Max.   :119.00                           
##  chest_pain_type    max_heart_rate_achieved exercise_induced_angina
##  Length:10000       Min.   :100.0           Length:10000           
##  Class :character   1st Qu.:129.0           Class :character       
##  Mode  :character   Median :160.0           Mode  :character       
##                     Mean   :159.7                                  
##                     3rd Qu.:190.0                                  
##                     Max.   :220.0                                  
##  blood_oxygen_levels_spo2 triglyceride_levels_mg_dl heart_attack_likelihood
##  Min.   : 90.00           Min.   : 50               Length:10000           
##  1st Qu.: 92.40           1st Qu.:164               Class :character       
##  Median : 94.90           Median :277               Mode  :character       
##  Mean   : 94.94           Mean   :275                                      
##  3rd Qu.: 97.40           3rd Qu.:385                                      
##  Max.   :100.00           Max.   :500

Data Transformation

Screen time and sleep duration were converted into categorical variables by splitting them at their quartiles and coding the resulting groups as “Low”, “Medium”, “High” and “Very High”.

A healthy resting heart rate was assumed to be at most 100 bpm: values up to and including 100 were classified as “normal”, and values above 100 as “high”.

Age was categorized as “under_20” (19 and younger), “20_29”, and “over_30” (30 and older).

df$screen_time_hours_per_day <- as.numeric(df$screen_time_hours_per_day)
df$sleep_duration_hours_per_day <- as.numeric(df$sleep_duration_hours_per_day)
df$heart_attack_likelihood <- as.factor(df$heart_attack_likelihood)

# transform resting_heart_rate
df$resting_heart_rate_category <- cut(df$resting_heart_rate_bpm,
                                      breaks = c(-Inf, 100, Inf),
                                      labels = c("normal", "high"))

# transform screen_time
screen_time_quartiles <- quantile(df$screen_time_hours_per_day, probs = c(0.25, 0.5, 0.75), na.rm = TRUE)
df$screen_time_category <- cut(df$screen_time_hours_per_day, 
                               breaks = c(min(df$screen_time_hours_per_day, na.rm = TRUE), 
                                          screen_time_quartiles[1], 
                                          screen_time_quartiles[2], 
                                          screen_time_quartiles[3], 
                                          max(df$screen_time_hours_per_day, na.rm = TRUE)), 
                               labels = c("Low", "Medium", "High", "Very High"),
                               include.lowest = TRUE)

# transform sleep_duration
sleep_duration_quartiles <- quantile(df$sleep_duration_hours_per_day, probs = c(0.25, 0.5, 0.75), na.rm = TRUE)
df$sleep_duration_category <- cut(df$sleep_duration_hours_per_day, 
                                  breaks = c(min(df$sleep_duration_hours_per_day, na.rm = TRUE), 
                                             sleep_duration_quartiles[1], 
                                             sleep_duration_quartiles[2], 
                                             sleep_duration_quartiles[3], 
                                             max(df$sleep_duration_hours_per_day, na.rm = TRUE)), 
                                  labels = c("Low", "Medium", "High", "Very High"),
                                  include.lowest = TRUE)

# transform age
df$age_category <- cut(df$age,
                       breaks = c(-Inf, 19, 29, Inf),
                       labels = c("under_20", "20_29", "over_30"),
                       include.lowest = TRUE)
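The bin boundaries are easy to misread, so here is a minimal sketch on made-up values. The vectors hr, age and x are hypothetical and not taken from the dataset. It relies on cut() using right-closed intervals by default, so a value equal to a break falls into the lower bin.

```r
# Hypothetical values chosen to sit on and around the break points
hr <- c(60, 99, 100, 101, 119)
hr_cat <- cut(hr, breaks = c(-Inf, 100, Inf), labels = c("normal", "high"))
print(data.frame(hr, hr_cat))   # 100 is "normal", 101 is "high"

age <- c(18, 19, 20, 29, 30, 35)
age_cat <- cut(age, breaks = c(-Inf, 19, 29, Inf),
               labels = c("under_20", "20_29", "over_30"))
print(data.frame(age, age_cat)) # 19 is "under_20", 30 is "over_30"

# Quartile-based binning, same pattern as for screen time and sleep duration
x <- 0:15
q <- quantile(x, probs = c(0.25, 0.5, 0.75))
x_cat <- cut(x,
             breaks = c(min(x), q, max(x)),
             labels = c("Low", "Medium", "High", "Very High"),
             include.lowest = TRUE)  # keeps min(x) inside the first bin
print(table(x_cat))
```

Without include.lowest = TRUE, the minimum value would fall outside the first interval and become NA. Note also that cut() stops with an error if two quartiles coincide, which can happen with heavily tied data.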

Association Rules Mining

Only categorical variables were selected for the analysis. Region was omitted because the analyst lacks a deep understanding of the geography and sociology of India, so it would not have added interpretable information.

# select categorical variables for analysis
categorical_vars <- c("age_category", "gender", "urban_rural", "socioeconomic_status",
                      "smoking_status", "alcohol_consumption", "diet_type",
                      "physical_activity_level", "stress_level",
                      "exercise_induced_angina", "screen_time_category",
                      "sleep_duration_category", "resting_heart_rate_category", "heart_attack_likelihood")

df_selected <- df[, categorical_vars]
df_selected[] <- lapply(df_selected, as.factor)
transacs <- as(df_selected, "transactions")
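To illustrate what the coercion to transactions does, here is a small sketch on a made-up data frame. The columns smoking and stress are hypothetical. Each factor level becomes one item named "variable=level", and each row becomes one transaction.

```r
library(arules)

# Hypothetical toy data, not taken from the dataset
toy <- data.frame(
  smoking = factor(c("Yes", "No", "No")),
  stress  = factor(c("High", "High", "Low"))
)

toy_trans <- as(toy, "transactions")

print(itemLabels(toy_trans))  # one item per variable/level combination
inspect(toy_trans)            # one transaction per row of the data frame
```

This is why the apriori() output for the real data reports 40 items: the count equals the total number of factor levels across the selected columns.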

Generating Association Rules

The Apriori algorithm was chosen because it is well-suited for analyzing frequent item sets in categorical data. The following parameter choices were made:

  • Minimum support (supp = 0.005): This threshold is the minimum fraction of transactions that must contain an itemset for it to be considered frequent. A lower support value helps capture rarer patterns, but too low a value may introduce noise. With 10,000 records, 0.5% means an itemset must occur in at least 50 transactions, which seems reasonable.
  • Minimum confidence (conf = 0.3): Confidence measures the conditional probability that if the antecedent occurs, the consequent also occurs. A 30% threshold ensures that the discovered rules have a reasonable likelihood of being meaningful while allowing for less frequent associations.
  • Minimum length (minlen = 4): This ensures that each rule contains at least four items (antecedents and consequents combined), filtering out overly simplistic rules that may not provide deep insights.
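  • Lift filter (lift > 1.5): Lift measures how much more likely the consequent is to occur given the antecedent, compared to its baseline probability. A lift greater than 1.5 means the consequent occurs at least 50% more often with the antecedent than in the data overall. The apriori() call below sets no lift threshold, so this filter is applied to the mined rules afterwards.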
rules <- apriori(transacs, parameter = list(supp = 0.005, conf = 0.3, minlen = 4),
        appearance = list(rhs = c("heart_attack_likelihood=Yes", "heart_attack_likelihood=No"), default="lhs"))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.3    0.1    1 none FALSE            TRUE       5   0.005      4
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 50 
## 
## set item appearances ...[2 item(s)] done [0.00s].
## set transactions ...[40 item(s), 10000 transaction(s)] done [0.00s].
## sorting and recoding items ... [40 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 7 8 9 10
##  done [0.36s].
## writing ... [182815 rule(s)] done [0.01s].
## creating S4 object  ... done [0.03s].
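The code that filters and ranks the mined rules is not shown in this report. The sketch below is one possible way to do it, using subset(), sort() and inspect() from arules on a made-up data set. The toy data, the variable names sleep and outcome, and the thresholds passed to apriori() are all assumptions chosen for illustration.

```r
library(arules)

# Hypothetical toy data: short sleep co-occurs with outcome "Yes" in 8 of 10 rows
toy <- data.frame(
  sleep   = factor(rep(c("Low", "High"), each = 10)),
  outcome = factor(c(rep("Yes", 8), rep("No", 2), rep("Yes", 2), rep("No", 8)))
)
toy_trans <- as(toy, "transactions")

# Same pattern as the call above: the outcome is only allowed on the right-hand side
toy_rules <- apriori(
  toy_trans,
  parameter  = list(supp = 0.1, conf = 0.3, minlen = 2),
  appearance = list(rhs = c("outcome=Yes", "outcome=No"), default = "lhs"),
  control    = list(verbose = FALSE)
)

# Keep rules with lift above 1.5, then list the strongest ones first
strong_rules <- subset(toy_rules, lift > 1.5)
inspect(head(sort(strong_rules, by = "lift"), 10))
```

Applied to the rules object from the real data, the same subset() and sort() pattern would reduce the 182,815 mined rules to those exceeding the lift threshold.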

Results Interpretation

Top Rules

##      lhs                                     rhs                           support confidence coverage     lift count
## [1]  {gender=Male,                                                                                                   
##       urban_rural=Rural,                                                                                             
##       physical_activity_level=Moderate,                                                                              
##       sleep_duration_category=Low}        => {heart_attack_likelihood=Yes}  0.0061  0.3426966   0.0178 1.681534    61
## [2]  {gender=Male,                                                                                                   
##       diet_type=Vegetarian,                                                                                          
##       screen_time_category=Low,                                                                                      
##       resting_heart_rate_category=high}   => {heart_attack_likelihood=Yes}  0.0061  0.3297297   0.0185 1.617908    61
## [3]  {gender=Male,                                                                                                   
##       stress_level=Medium,                                                                                           
##       screen_time_category=Low,                                                                                      
##       sleep_duration_category=Low}        => {heart_attack_likelihood=Yes}  0.0050  0.3246753   0.0154 1.593108    50
## [4]  {urban_rural=Urban,                                                                                             
##       alcohol_consumption=Occasionally,                                                                              
##       stress_level=Medium,                                                                                           
##       exercise_induced_angina=No,                                                                                    
##       sleep_duration_category=Medium}     => {heart_attack_likelihood=Yes}  0.0055  0.3216374   0.0171 1.578201    55
## [5]  {gender=Male,                                                                                                   
##       diet_type=Vegetarian,                                                                                          
##       exercise_induced_angina=No,                                                                                    
##       screen_time_category=Low,                                                                                      
##       resting_heart_rate_category=high}   => {heart_attack_likelihood=Yes}  0.0056  0.3163842   0.0177 1.552425    56
## [6]  {gender=Male,                                                                                                   
##       diet_type=Vegetarian,                                                                                          
##       sleep_duration_category=Low,                                                                                   
##       resting_heart_rate_category=high}   => {heart_attack_likelihood=Yes}  0.0052  0.3151515   0.0165 1.546376    52
## [7]  {gender=Male,                                                                                                   
##       urban_rural=Rural,                                                                                             
##       physical_activity_level=Moderate,                                                                              
##       exercise_induced_angina=No,                                                                                    
##       sleep_duration_category=Low}        => {heart_attack_likelihood=Yes}  0.0051  0.3148148   0.0162 1.544724    51
## [8]  {urban_rural=Rural,                                                                                             
##       physical_activity_level=Moderate,                                                                              
##       stress_level=Medium,                                                                                           
##       screen_time_category=Low,                                                                                      
##       resting_heart_rate_category=normal} => {heart_attack_likelihood=Yes}  0.0050  0.3144654   0.0159 1.543010    50
## [9]  {urban_rural=Rural,                                                                                             
##       smoking_status=Never,                                                                                          
##       stress_level=Medium,                                                                                           
##       exercise_induced_angina=No,                                                                                    
##       screen_time_category=High}          => {heart_attack_likelihood=Yes}  0.0056  0.3128492   0.0179 1.535079    56
## [10] {alcohol_consumption=Occasionally,                                                                              
##       stress_level=Medium,                                                                                           
##       exercise_induced_angina=No,                                                                                    
##       sleep_duration_category=Medium,                                                                                
##       resting_heart_rate_category=normal} => {heart_attack_likelihood=Yes}  0.0065  0.3125000   0.0208 1.533366    65
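The columns of this table are related by simple arithmetic. As a worked check, the sketch below recomputes the measures for rule [1] from its reported count, coverage and lift. The variable names are illustrative only.

```r
n_transactions <- 10000     # from the apriori() log
count          <- 61        # transactions matching both lhs and rhs of rule [1]
coverage       <- 0.0178    # fraction of transactions matching the lhs
lift_reported  <- 1.681534  # lift of rule [1] as printed in the table

support    <- count / n_transactions  # P(lhs and rhs)
confidence <- support / coverage      # P(rhs | lhs)

# lift = confidence / P(rhs), so the baseline rate of the rhs can be recovered
baseline_yes <- confidence / lift_reported

print(round(c(support = support,
              confidence = confidence,
              baseline_yes = baseline_yes), 4))
```

The recovered baseline of roughly 0.204 means about one record in five is labeled heart_attack_likelihood=Yes, so a confidence of about 0.34 is a clear increase over the baseline but still far from a reliable prediction.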

Certain variables appeared more frequently in the association rules, which suggests a link with heart attack likelihood. These include:

  • Stress level: high stress levels frequently co-occur with other risk factors.
  • Sedentary lifestyle: lack of physical activity appears frequently in high-lift rules.
  • Sleep duration: short sleep duration is strongly associated with increased heart attack risk.
  • Urban vs. rural differences: some distinctions between urban and rural populations were noted, though they were less pronounced than the mainstream portrayal of rural versus urban life would suggest.

Interestingly, some variables that were expected to appear, such as diet type and alcohol consumption, did not appear frequently in strong association rules, which suggests they matter less than other factors, at least in the obtained data.

One would expect a rule consisting solely of the alcohol, smoking, stress and sleep variables at their worst values, but no such rule was found. Further analysis should be performed, preferably with other methods as well, to broaden the understanding of this data and of the consequences of lifestyle choices.

Strongest association rules