Introduction

Cardiovascular diseases (CVDs) are the leading cause of death globally, responsible for an estimated 17.9 million deaths each year, equivalent to about 31% of all deaths worldwide. Heart failure is a common consequence of CVDs, and the dataset examined in this notebook contains eleven features, plus a target variable, associated with the presence of heart disease.

For those who already have cardiovascular disease, or who are at elevated risk due to factors such as hypertension, diabetes, and hyperlipidemia, early detection and management are essential. In this regard, both unsupervised and supervised learning tools can be of great help.

In this work I present an association rule mining approach to evaluate sets of factor values (itemsets), such as “Male, Aged 18-25, Low Cholesterol, Oldpeak normal, …”, that are associated with a higher chance of heart disease, as reflected by co-occurring values of the target variable (“HeartDisease”). The presented set of interactive visualisations of the rules allows for detailed, context-related investigation.

The association algorithm used was the standard Apriori algorithm, with the positive value of the target variable serving as the right-hand side (consequent) of the examined rules. Apriori works iteratively, analysing itemsets from one-item sets up to the most complex ones. In each iteration it keeps the rules that match predefined constraints and then grows the candidate itemsets by one item (1->2, 2->3, 3->4, etc.).

In this study the following constraints were imposed on the interestingness measures of the rules:
Support>=0.1, Confidence>=0.4 (these measures are explained in the section titled “Description of Association Rule Mining method”)
The dataset comes from an open Kaggle database and consists of features describing different risk factors associated with cardiovascular health and an increased chance of developing heart disease.
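
In the arules R package, these thresholds translate directly into the parameter list of apriori(); a minimal sketch of such a call (using a transactions object like the one created later in this notebook) could look as follows:

# sketch: mine rules with minimum support 0.1 and minimum confidence 0.4,
# fixing the consequent (rhs) to the positive value of the target variable
library(arules)
rules_sketch <- apriori(transactions,
                        parameter  = list(supp = 0.1, conf = 0.4),
                        appearance = list(default = "lhs", rhs = "heart disease"))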


Dataset: 918 observations of both healthy and unhealthy patients


Dataset Source: https://www.kaggle.com/datasets/fedesoriano/heart-failure-prediction?resource=download


Description of Features:

-Age : Age of the patient

-Sex : Sex of the patient

-ExerciseAngina: exercise induced angina (1 = yes; 0 = no)

-ChestPainType : Chest pain type:

Value 1: typical angina
Value 2: atypical angina
Value 3: non-anginal pain
Value 4: asymptomatic

-RestingBP : Resting blood pressure (in mm Hg)

-Cholesterol : Serum cholesterol (in mg/dl)

-FastingBS : (Fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)

-RestingECG : Resting electrocardiographic results:

Value 0: normal
Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
Value 2: showing probable or definite left ventricular hypertrophy by Estes’ criteria

-Max Heart Rate : Maximum heart rate achieved

-Oldpeak : ST depression induced by exercise relative to rest (‘ST’ relates to positions on the ECG plot). Values after discretisation: “Oldpeak normal”, “Oldpeak risk”, “Oldpeak terrible”

-ST slope : The slope of the peak exercise ST segment (0: downsloping; 1: flat; 2: upsloping)

-HeartDisease (target variable) : 0= normal, 1= heart disease
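
Before any categorisation, the raw column names and types can be checked straight from the CSV file, for example:

# quick look at the raw columns prior to discretisation
str(read.csv("heart.csv"))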

Description of Association Rule Mining method:

Let D = {t1 , t2, . . . , tm} be a set of transactions called the database, and let I = {i1, i2, . . . , in} be the set of all items considered in the database

Each transaction in D has a unique transaction ID and contains a subset of the items in I (a combination of factor values, for example: “Male, Aged 18-25, Low Cholesterol, Oldpeak normal, Heart Disease”).

A rule is defined as an expression X ⇒ Y where X,Y ⊆ I and X ∩ Y = ∅

Measures of rules’ importance:

Support of an itemset X (supp(X)) is defined as the proportion of transactions in the dataset that contain the itemset X
Confidence - interpreted as the proportion of transactions containing the antecedent X that also contain the consequent Y
conf(X ⇒ Y) = supp(X ∪ Y)/supp(X)
Lift – compares the frequency with which the antecedent and the consequent appear together against the benchmark frequency expected if the consequent occurred independently of the antecedent
lift(X ⇒ Y) = supp(X ∪ Y)/ (supp(X)*supp(Y))
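
As a toy numeric illustration of these definitions (the support values below are hypothetical, not taken from the dataset):

# hypothetical supports for a rule X => Y
supp_X  <- 0.40                      # supp(X), the antecedent
supp_Y  <- 0.55                      # supp(Y), the consequent
supp_XY <- 0.30                      # supp(X ∪ Y), both together
conf <- supp_XY / supp_X             # confidence = 0.75
lift <- supp_XY / (supp_X * supp_Y)  # lift ~ 1.36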


Loading Kaggle database:

# Load the data for the purpose of categorisation
df <- read.csv("heart.csv")


Discretising/Categorising variables in compliance with medical standards/demographic standards:

#Age
df$Age <- cut(df$Age, breaks = c(19, 35, 50, 65, Inf), labels = c("Age 19-35", "Age 36-50", "Age 51-65", "Age 66-80"))

#Sex
df$Sex <- ifelse(df$Sex=="M", "Male",
ifelse(df$Sex=="F", "Female", NA))

#ExerciseAngina
df$ExerciseAngina <- ifelse(df$ExerciseAngina=="Y", "Exercise induced angina present",
ifelse(df$ExerciseAngina=="N", "Exercise induced angina not present", NA))

#ChestPainType
df$ChestPainType <- ifelse(df$ChestPainType=="TA", "ChestPainType = typical angina",
ifelse(df$ChestPainType=="ATA", "ChestPainType = atypical angina",
ifelse(df$ChestPainType=="NAP", "ChestPainType = non-anginal pain",
ifelse(df$ChestPainType=="ASY", "ChestPainType = asymptomatic", NA))))

#RestingBP
df$RestingBP <- cut(df$RestingBP, breaks = c(0, 90, 120, 140, 160, Inf), labels = c("RestingBP 0-90", "RestingBP 91-120", "RestingBP 121-140", "RestingBP 141-160", "RestingBP 161+"))

#Cholesterol
df$Cholesterol <- cut(df$Cholesterol, breaks = c(0, 200, 240, 280, Inf), labels = c("Cholesterol 0-200", "Cholesterol 201-240", "Cholesterol 241-280", "Cholesterol 281+"))

#FastingBS
df$FastingBS <- ifelse(df$FastingBS=="0", "FastingBS <= 120 mg/dl",
ifelse(df$FastingBS=="1", "FastingBS > 120 mg/dl", NA))

#RestingECG
df$RestingECG <- ifelse(df$RestingECG=="Normal", "RestingECG normal",
ifelse(df$RestingECG=="ST", "RestingECG ST-T wave abnormality",
ifelse(df$RestingECG=="LVH", "RestingECG left ventricular hypertrophy", NA)))
       
#Max Heart Rate
df$MaxHR <- cut(df$MaxHR, breaks = c(0, 100, 120, 140, 160, Inf), labels = c("MaxHR 0-100", "MaxHR 101-120", "MaxHR 121-140", "MaxHR 141-160", "MaxHR 161+"))

#Oldpeak
df$Oldpeak <- cut(df$Oldpeak, breaks = c(-2.6,2,2.55,Inf), labels = c("Oldpeak normal", "Oldpeak risk", "Oldpeak terrible"))

#ST slope 
df$ST_Slope  <- ifelse(df$ST_Slope=="Up", "ST slope UP",
ifelse(df$ST_Slope=="Flat", "ST slope Flat",
ifelse(df$ST_Slope =="Down", "ST slope Down", NA)))

#target variable - Heart Disease 
df$HeartDisease <- ifelse(df$HeartDisease==0, "normal",
ifelse(df$HeartDisease==1, "heart disease", NA))

write.csv(df, file="heart_categorised.csv")
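
Because cut() returns NA for values falling outside the specified breaks (for example, a cholesterol reading of 0 falls outside the (0, 200] bin), a quick sanity check of the discretised data frame can be useful:

# count NA values introduced by the discretisation in each column
sapply(df, function(x) sum(is.na(x)))

# overview of the resulting columns
summary(df)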


Creating dataset with transactions from prepared data:

# loading necessary packages
packages <- c("arules", "arulesViz", "arulesCBA","dplyr","stringr","tidyr")                                   
not_installed <- packages[!(packages %in% installed.packages()[ , "Package"])]    
if(length(not_installed)) install.packages(not_installed) 

library(arules)
library(arulesViz)
library(arulesCBA)
library(dplyr)
library(stringr)
library(tidyr)

# reading the file as transactions
transactions<-read.transactions("heart_categorised.csv", format="basket", sep=",", skip=0) 

# keeping only items that appear in more than 5% of transactions; this also removes
# rare artefacts of the CSV export (row numbers and header fields read in as items)
transactions<-transactions[, itemFrequency(transactions)>0.05]
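
As a side note, the CSV round trip could be avoided altogether: a data frame whose columns are all factors can be coerced to transactions directly (the item labels then follow arules' variable=level convention rather than the bare labels used above):

# direct coercion to transactions, assuming df is the categorised data frame from before
df_factors <- as.data.frame(lapply(df, factor))   # every column must be a factor (or logical)
transactions_alt <- as(df_factors, "transactions")
summary(transactions_alt)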


Sorting to find frequencies of distinct values of factors in the population:

#eliminating characteristics appearing in fewer than 5% of the cases
transactions<-transactions[, itemFrequency(transactions)>0.05]
sort(itemFrequency(transactions), decreasing = TRUE)
##                          Oldpeak normal                                    Male 
##                              0.88900979                              0.78890098 
##                  FastingBS <= 120 mg/dl                       RestingECG normal 
##                              0.76605005                              0.60065288 
##     Exercise induced angina not present                               Age 51-65 
##                              0.59521219                              0.56583243 
##                           heart disease            ChestPainType = asymptomatic 
##                              0.55277476                              0.53971708 
##                           ST slope Flat                                  normal 
##                              0.50054407                              0.44613711 
##                       RestingBP 121-140                             ST slope UP 
##                              0.44069641                              0.42981502 
##         Exercise induced angina present                        RestingBP 91-120 
##                              0.40369967                              0.31664853 
##                               Age 36-50                           MaxHR 121-140 
##                              0.30903156                              0.27312296 
##                     Cholesterol 201-240                           MaxHR 141-160 
##                              0.26224157                              0.25027203 
##                   FastingBS > 120 mg/dl        ChestPainType = non-anginal pain 
##                              0.23286181                              0.22089227 
##                                  Female RestingECG left ventricular hypertrophy 
##                              0.21001088                              0.20457018 
##                     Cholesterol 241-280        RestingECG ST-T wave abnormality 
##                              0.19912949                              0.19368879 
##                       RestingBP 141-160                           MaxHR 101-120 
##                              0.19260065                              0.19151251 
##                              MaxHR 161+         ChestPainType = atypical angina 
##                              0.19151251                              0.18824810 
##                        Cholesterol 281+                       Cholesterol 0-200 
##                              0.18715996                              0.16322089 
##                             MaxHR 0-100                               Age 66-80 
##                              0.09249184                              0.08922742 
##                        Oldpeak terrible                           ST slope Down 
##                              0.07725789                              0.06855277 
##          ChestPainType = typical angina 
##                              0.05005441


From the output of the cell above we can read the main characteristics of the studied population, which make for an interesting study. Firstly, the high frequencies of males (about 79%) and of people aged 51-65 (about 57%) reflect the well-known association of the target variable with these characteristics. Looking at the three most frequent medically related items in the transaction dataset, a large share of subjects appear to be in relatively good physical condition: exercise-induced angina is absent in about 60% of cases, fasting blood sugar is below the risky level for about 77%, and resting ECG results are normal for about 60%. These characteristics promise interesting results when explaining the causes of heart disease in a population in which, despite these favourable indicators, more than half of the subjects (about 55%) were labelled as suffering from heart disease.
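
For a quick visual check, the item frequencies can also be plotted with arules' built-in frequency plot, for example:

# bar chart of the 15 most frequent items in the transaction set
itemFrequencyPlot(transactions, topN = 15, cex.names = 0.7)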


In the analysis two sets of rules were considered: “clean” rules, which satisfied the conditions of significance, non-redundancy and maximality, and, separately, simple rules with one-item antecedents.


Finding clean rules and inspecting them in an interactive table:

#rules
rules <- apriori(data=transactions, parameter=list(supp=0.1, conf=0.4), appearance=list(default="lhs", rhs="heart disease"), control=list(verbose=F)) 
summary(rules) #see how many rules and what are the parameters of support and confidence
## set of 331 rules
## 
## rule length distribution (lhs + rhs):sizes
##   1   2   3   4   5   6   7 
##   1  20  72 113  92  31   2 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   3.000   4.000   4.136   5.000   7.000 
## 
## summary of quality measures:
##     support         confidence        coverage           lift       
##  Min.   :0.1001   Min.   :0.4049   Min.   :0.1012   Min.   :0.7325  
##  1st Qu.:0.1153   1st Qu.:0.6953   1st Qu.:0.1431   1st Qu.:1.2578  
##  Median :0.1371   Median :0.8281   Median :0.1872   Median :1.4981  
##  Mean   :0.1648   Mean   :0.7831   Mean   :0.2228   Mean   :1.4166  
##  3rd Qu.:0.1893   3rd Qu.:0.8998   3rd Qu.:0.2568   3rd Qu.:1.6278  
##  Max.   :0.5528   Max.   :1.0000   Max.   :1.0000   Max.   :1.8091  
##      count      
##  Min.   : 92.0  
##  1st Qu.:106.0  
##  Median :126.0  
##  Mean   :151.5  
##  3rd Qu.:174.0  
##  Max.   :508.0  
## 
## mining info:
##          data ntransactions support confidence
##  transactions           919     0.1        0.4
##                                                                                                                                                            call
##  apriori(data = transactions, parameter = list(supp = 0.1, conf = 0.4), appearance = list(default = "lhs", rhs = "heart disease"), control = list(verbose = F))
#POSTPROCESSING: 

#eliminating rules for which there is a more general rule with the same or higher confidence value
rules.clean<-rules[!is.redundant(rules)] 


# Significance of rules is tested with Fisher's exact test with correction for multiple comparisons to test the null hypothesis that the LHS and the RHS of the rule are independent

rules.clean<-rules.clean[is.significant(rules.clean, transactions)] 

# A rule is maximal if the rule set contains no proper superset of it
# Maximal associations allow the discovery of associations pertaining to items that most often do not appear alone, but rather together with closely related items, and hence associations relevant only to these items tend to obtain low confidence.

rules.clean<-rules.clean[is.maximal(rules.clean)] 
rules.clean<-sort(rules.clean, by="confidence", decreasing=TRUE)

#interactive table
inspectDT(rules.clean)
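
To see how strongly each post-processing step reduces the rule set, the remaining rules can simply be counted, for example:

# number of rules before and after post-processing
length(rules)        # all mined rules
length(rules.clean)  # rules surviving the redundancy, significance and maximality filters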


Scatterplot for clean rules:

#Scatterplot for clean rules 
plot(rules.clean, method = NULL, measure = "support", shading = "lift", engine = "htmlwidget",
col = c("#00008B","#EEEEEEFF"))


Map of clean rules:

#graph allowing exploring clean rules: 
plot(rules.clean, method = "graph", engine = "htmlwidget", nodeCol = c("#ADD8E6"))

Analysis of “clean” rules


As indicated by the summary, of the 331 mined rules only 24 survived post-processing with all of the imposed conditions. Sorting by confidence turned out to be equivalent to sorting by lift. The three rules with the highest confidence (above 90%) and lift values (1.7 to 1.8, the latter for the first rule) contained the following antecedents:

1.{ChestPainType = asymptomatic,FastingBS > 120 mg/dl,ST slope Flat}
2.{FastingBS > 120 mg/dl,Oldpeak normal,ST slope Flat}
3.{ChestPainType = asymptomatic,Exercise induced angina present,Male,Oldpeak normal,ST slope Flat}


It is interesting to note that, despite containing two factors with unalarming values (Oldpeak normal and ChestPainType = asymptomatic), these antecedents were the most accurate predictors of heart disease. A flat or downsloping ST segment on the ECG usually suggests coronary ischemia, and flatness in particular can be considered the riskier pattern. Another alarming feature was fasting blood sugar, whose high levels can be interpreted as a highly precise predictor of heart disease and its associated risks. The high frequency of the item “ChestPainType = asymptomatic” can be explained by the fact that almost half of all heart attacks present no symptoms at all, which does not make them any less deadly than heart attacks with symptoms. As expected, “Female” and the younger age cohorts did not appear among the top 10 rules.


Now let’s first retrieve, and then analyse, the one-item rules, in order to compare the risk associated with individual items:

#Retrieving the vector of indices of one-item rules

#Coercing rules into data frame
rules_df <- as(rules, "data.frame")

  
rules_df$lhs <- rules_df$rules %>% str_sub(.,-nchar(.),-20)
rules_df$rhs <- rules_df$rules %>% str_sub(.,-15,-1) 

rules_df <- rules_df[,!names(rules_df) %in% c("rules")]


rules_df <- rules_df[-c(1),] # the only rule with empty lhs

rules_oneitem_lhs <- rules_df[0,]

#Adding only one-item rules to the df
for (i in seq(1,length(rules_df$lhs))){
  if (length(strsplit(rules_df[["lhs"]][i],",")[[1]])==1){
    rules_oneitem_lhs[i,] <- rules_df[i,]
  }
}

#retrieving indices of rows with 1-item rules
one_item_rules_indices <- rownames(rules_oneitem_lhs) 
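
The same subset could be obtained more directly with arules' accessor functions, avoiding the string manipulation above, for example:

# alternative: keep only rules whose left-hand side consists of exactly one item
rules_oneitem_alt <- rules[size(lhs(rules)) == 1]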


Table with one-item rules:

#Only one-item rules
rules_oneitem_lhs <- rules[one_item_rules_indices]
rules_oneitem_lhs <-sort(rules_oneitem_lhs , by="confidence", decreasing=TRUE)

#table with rules sorted by confidence and support 
inspectDT(rules_oneitem_lhs)


Scatterplot for one-item rules

#scatterplot for one-item
plot(rules[one_item_rules_indices], method = NULL, measure = "support", shading = "lift", engine = "htmlwidget",col = c("#00008B","#EEEEEEFF"))


Grouped matrix for one-item rules

#other plots to consider for visualisation of the rules
plot(rules[one_item_rules_indices], method = "grouped matrix", engine = "default", col = c("#00008B","#EEEEEEFF"))

Analysis of one-item rules

After visualising the most important/interesting one-item rules, it should be noted that the second, third and fourth most important antecedents of these rules, {ST slope Flat}, {FastingBS > 120 mg/dl} and {ChestPainType = asymptomatic}, correspond to the factors most commonly found among the three most important “clean” rules. However, the factor {Exercise induced angina present} does not belong to that group, and yet it is the one characterised by the highest confidence (0.852) and lift (1.541) values. Its high lift value suggests that subjects experiencing angina after exercise suffer from heart disease 54% more often than would be expected if the two events were independent (in which case the lift would equal 1).
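
This relation between lift, confidence and the frequency of the consequent can be verified directly from the numbers reported above:

# lift = confidence / supp(consequent), using values from the outputs above
conf_angina  <- 0.852       # confidence of {Exercise induced angina present} => {heart disease}
supp_disease <- 0.5528      # item frequency of "heart disease"
conf_angina / supp_disease  # about 1.54, matching the reported lift value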

Additionally, it is interesting to note how low a lift value high cholesterol reached (only 1.052), despite often being portrayed in advertisements as the leading cause of poor cardiovascular health.

Acknowledgements

Davide Chicco, Giuseppe Jurman: Machine learning can predict survival of patients with heart failure from serum creatinine and ejection fraction alone. BMC Medical Informatics and Decision Making 20, 16 (2020).

Rules visualisations:
https://journal.r-project.org/archive/2017/RJ-2017-047/RJ-2017-047.pdf