Background

Heart disease is the leading cause of morbidity and mortality globally: it accounts for more deaths annually than any other cause. According to the WHO, an estimated 17.9 million people died from cardiovascular diseases in 2016, representing 31% of all global deaths.

The silver lining is that heart attacks are highly preventable: simple lifestyle modifications coupled with early treatment greatly improve the prognosis. It is, however, difficult to identify high-risk patients, because several risk factors, such as diabetes, high blood pressure and high cholesterol, contribute jointly. Doctors and scientists alike have therefore turned to machine learning techniques to develop screening tools. One challenge on Kaggle aims to predict heart attacks by machine learning (https://www.kaggle.com/nareshbhat/health-care-data-set-on-heart-attack-possibility).

The goal of this project is to explore the dataset and find important features that might help with this prediction.

Methods

Goal: Explore the database and find important features which show differences between high and low risk heart disease subjects.

Dataset: This dataset contains 303 subjects, each described by 13 features, which are used to classify the subjects into two categories:

no/less chance of heart attack (0)
more chance of heart attack (1)

heart <- read.csv('heart.csv')
names(heart)

##  [1] "age"      "sex"      "cp"       "trestbps" "chol"     "fbs"     
##  [7] "restecg"  "thalach"  "exang"    "oldpeak"  "slope"    "ca"      
## [13] "thal"     "target"
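As a quick sanity check on the description above, the sketch below counts subjects, features and class sizes. The helper name `describe_dataset` is hypothetical and not part of the original analysis; it only assumes that the outcome column is called `target`, as in the `names(heart)` output.

```r
# Hypothetical helper (not from the original analysis): summarise the shape of
# a data frame whose outcome column is named by `target_col`.
describe_dataset <- function(df, target_col = "target") {
  stopifnot(target_col %in% names(df))
  list(
    n_subjects   = nrow(df),
    n_features   = ncol(df) - 1,            # every column except the outcome
    class_counts = table(df[[target_col]])  # subjects per category
  )
}

# Usage, assuming `heart` was loaded as above:
# describe_dataset(heart)
```

Returning a list keeps the three quantities separate, so each can be compared directly with the numbers quoted in the text.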

Methods

These features can be divided into two categories: those with discrete values and those with continuous values.

Features with Discrete values: sex, chest pain type, fasting blood sugar, resting electrocardiographic results, exercise induced angina, the slope of the peak exercise ST segment, number of major vessels, and thal.

summary(heart[,c(2,3,6,7,9,11,12,13)])

##       sex               cp             fbs            restecg      
##  Min.   :0.0000   Min.   :0.000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :1.0000   Median :1.000   Median :0.0000   Median :1.0000  
##  Mean   :0.6832   Mean   :0.967   Mean   :0.1485   Mean   :0.5281  
##  3rd Qu.:1.0000   3rd Qu.:2.000   3rd Qu.:0.0000   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :3.000   Max.   :1.0000   Max.   :2.0000  
##      exang            slope             ca              thal      
##  Min.   :0.0000   Min.   :0.000   Min.   :0.0000   Min.   :0.000  
##  1st Qu.:0.0000   1st Qu.:1.000   1st Qu.:0.0000   1st Qu.:2.000  
##  Median :0.0000   Median :1.000   Median :0.0000   Median :2.000  
##  Mean   :0.3267   Mean   :1.399   Mean   :0.7294   Mean   :2.314  
##  3rd Qu.:1.0000   3rd Qu.:2.000   3rd Qu.:1.0000   3rd Qu.:3.000  
##  Max.   :1.0000   Max.   :2.000   Max.   :4.0000   Max.   :3.000

Methods

Features with Continuous values: age, resting blood pressure, serum cholesterol, maximum heart rate achieved, oldpeak.

summary(heart[,c(1,4,5,8,10)])

##       age           trestbps          chol          thalach         oldpeak    
##  Min.   :29.00   Min.   : 94.0   Min.   :126.0   Min.   : 71.0   Min.   :0.00  
##  1st Qu.:47.50   1st Qu.:120.0   1st Qu.:211.0   1st Qu.:133.5   1st Qu.:0.00  
##  Median :55.00   Median :130.0   Median :240.0   Median :153.0   Median :0.80  
##  Mean   :54.37   Mean   :131.6   Mean   :246.3   Mean   :149.6   Mean   :1.04  
##  3rd Qu.:61.00   3rd Qu.:140.0   3rd Qu.:274.5   3rd Qu.:166.0   3rd Qu.:1.60  
##  Max.   :77.00   Max.   :200.0   Max.   :564.0   Max.   :202.0   Max.   :6.20

Methods

Data cleansing: The summaries show no ‘NA’ entries and no implausibly large placeholder values (e.g. 9999) in this dataset, which suggests that there is no junk information.
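The check described above can also be made explicit instead of being read off the summary tables. The following is a minimal sketch; the helper name `check_quality` and the choice of 9999 as the only placeholder value are assumptions made for illustration.

```r
# Hypothetical helper: count missing values and one chosen placeholder value
# (9999 here, matching the example in the text) in every numeric column.
check_quality <- function(df, sentinel = 9999) {
  numeric_cols <- names(df)[vapply(df, is.numeric, logical(1))]
  data.frame(
    column     = numeric_cols,
    n_missing  = vapply(numeric_cols, function(col) sum(is.na(df[[col]])),
                        integer(1), USE.NAMES = FALSE),
    n_sentinel = vapply(numeric_cols,
                        function(col) sum(df[[col]] == sentinel, na.rm = TRUE),
                        integer(1), USE.NAMES = FALSE)
  )
}

# Usage, assuming `heart` was loaded as above:
# check_quality(heart)
```

A data set passes this particular check when both count columns are zero in every row; it does not detect other kinds of invalid codes.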

All values are numerical. To make the output easier to read, I converted the target into a factor with the levels ‘low’ and ‘high’ chance of heart attack.

heart$target <- factor(heart$target, labels = c('low','high'))

Visualization methods: Two different visualization methods were used to explore the distributions of the features, depending on whether their values are discrete or continuous. For features with discrete values, bar plots were used to compare high-risk and low-risk subjects. For features with continuous values, boxplots were used for the same comparison.

Results

Bar plots of the features with discrete values (gray bars: no/less chance of heart attack, dark bars: high chance of heart attack). Each bar shows the proportion of subjects within its risk group.

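# Arrange the eight bar plots in a 2 x 4 grid; par() returns the old settings
charts <- par(mfrow = c(2,4))
for (i in c(2,3,6,7,9,11,12,13))
{
  # Proportion of each feature value within the low-risk and high-risk groups
  counts <- prop.table(table(heart$target, heart[,i]), margin = 1)
  # Explicit colours: light gray = low chance, dark gray = high chance
  barplot(counts, main=names(heart)[i], xlab=paste0('heart$',names(heart)[i]),
          beside=TRUE, col=c('gray80','gray20'))
}
par(charts)  # restore the previous plotting layout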

As shown in the plots, chest pain type (‘cp’), the slope of the peak exercise ST segment (‘slope’), number of major vessels (‘ca’) and thal (‘thal’) have markedly different distributions in high-risk and low-risk subjects, which suggests that they might be good predictors of the possibility of heart disease.
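The visual impression can be complemented with a simple statistical check. The sketch below runs a chi-squared test of independence between each discrete feature and the outcome. This is a swapped-in technique, not something the original analysis did, and the helper name `discrete_association` is hypothetical.

```r
# Hypothetical helper: chi-squared test of independence between each discrete
# feature and the outcome. Not part of the original analysis.
discrete_association <- function(df, features, target_col = "target") {
  p_values <- vapply(features, function(f) {
    tab <- table(df[[f]], df[[target_col]])
    # chisq.test() may warn when some cells have small expected counts
    chisq.test(tab)$p.value
  }, numeric(1))
  sort(p_values)  # smallest p-value (strongest association) first
}

# Usage, assuming `heart` was loaded as above:
# discrete_association(heart, c("sex", "cp", "fbs", "restecg",
#                               "exang", "slope", "ca", "thal"))
```

A small p-value only indicates that the feature and the outcome are not independent in this sample; it says nothing about how well the feature would work in a classifier.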

Results

The boxplots of features with continuous values:

fivechart <- par(mfrow = c(1,5))
boxplot(heart$age ~ heart$target, main = 'age')
boxplot(heart$trestbps ~ heart$target, main = 'resting blood pressure')
boxplot(heart$chol ~ heart$target, main = 'serum cholesterol')
boxplot(heart$thalach ~ heart$target, main ='maximum heart rate achieved')
boxplot(heart$oldpeak ~ heart$target, main = 'old peak')

As shown in the plots, maximum heart rate achieved and old peak have markedly different distributions in high-risk and low-risk subjects, which suggests that they might be good predictors of the possibility of heart disease.
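The differences seen in the boxplots can be put into numbers by comparing group medians. The sketch below also adds a Wilcoxon rank-sum test per feature; that test is an addition made here for illustration, and the helper name `continuous_comparison` is hypothetical. It assumes an outcome with exactly two levels.

```r
# Hypothetical helper: per-group medians plus a Wilcoxon rank-sum p-value for
# each continuous feature. Not part of the original analysis.
continuous_comparison <- function(df, features, target_col = "target") {
  groups <- factor(df[[target_col]])
  stopifnot(nlevels(groups) == 2)
  rows <- lapply(features, function(f) {
    x <- split(df[[f]], groups)  # values of the feature, one vector per group
    data.frame(
      feature  = f,
      median_1 = median(x[[1]]),  # first outcome level
      median_2 = median(x[[2]]),  # second outcome level
      # exact = FALSE uses the normal approximation, which tolerates ties
      p_value  = wilcox.test(x[[1]], x[[2]], exact = FALSE)$p.value
    )
  })
  do.call(rbind, rows)
}

# Usage, assuming `heart` was loaded as above:
# continuous_comparison(heart, c("age", "trestbps", "chol", "thalach", "oldpeak"))
```

The rank-based test was chosen for this sketch because it does not assume normally distributed values.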

Results

As a further exploration, I plotted the subjects against the two continuous features that looked most useful, maximum heart rate achieved and old peak, to see whether they alone could separate subjects with a low risk of heart disease from those with a high risk.

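library(ggplot2)
# qplot() is deprecated in recent ggplot2 releases; ggplot() + geom_point()
# draws the same scatter plot
ggplot(heart, aes(x = thalach, y = oldpeak, colour = target)) + geom_point()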

As shown in the scatter plot, it is hard to find a threshold or function that classifies the risk of heart disease using these two features alone. Adding more features may still lead to a good classification.
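One way to make the difficulty of setting a threshold concrete is to search for the best cut on a single feature and look at the accuracy it reaches. The sketch below is an illustration only: the helper name `best_threshold` is hypothetical, and the accuracy is measured on the same data that was used to choose the cut, so it is optimistic.

```r
# Hypothetical helper: best accuracy achievable by thresholding ONE numeric
# feature. The rule may point in either direction (above or below the cut).
best_threshold <- function(x, y, positive) {
  cuts <- sort(unique(x))
  acc <- vapply(cuts, function(cut) {
    a <- mean((x > cut) == (y == positive))
    max(a, 1 - a)  # accuracy of the better of the two directions
  }, numeric(1))
  list(threshold = cuts[which.max(acc)], accuracy = max(acc))
}

# Usage, assuming `heart$target` is the low/high factor created earlier:
# best_threshold(heart$thalach, heart$target, positive = "high")
# best_threshold(heart$oldpeak, heart$target, positive = "high")
```

An accuracy well below 1 for every single feature would be consistent with the overlap visible in the scatter plot.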

Discussion/Future Directions

The last figure is only one example of locating the subjects by two features, which is easy to show in 2D. By adding more features, especially those that behave differently in low-risk and high-risk subjects, the subjects can be located in a higher-dimensional space, which is the setting machine learning methods are designed for.

The next step could be to try some simple ML methods, such as multiple linear regression, SVM or PCA. Since this dataset is small, simple ML models should be sufficient.
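As one possible starting point, the sketch below fits a logistic regression with base R's `glm()` and evaluates it on a held-out part of the data. Logistic regression is used here as the binary-outcome counterpart of the multiple linear regression mentioned above; this is a substitution made for illustration. The helper name `fit_logistic`, the 70/30 split and the 0.5 probability cut-off are all assumptions.

```r
# Hypothetical sketch: logistic regression with a simple hold-out split.
# Assumes `target_col` is a two-level factor; glm() models the probability of
# the SECOND level. Discrete features coded as numbers are treated as numeric.
fit_logistic <- function(df, target_col = "target", train_frac = 0.7, seed = 1) {
  stopifnot(is.factor(df[[target_col]]), nlevels(df[[target_col]]) == 2)
  set.seed(seed)  # reproducible train/test split
  train_idx <- sample(nrow(df), size = floor(train_frac * nrow(df)))
  train <- df[train_idx, ]
  test  <- df[-train_idx, ]

  model <- glm(as.formula(paste(target_col, "~ .")),
               data = train, family = binomial)

  # Predicted probability of the second level, turned into a class label
  prob <- predict(model, newdata = test, type = "response")
  lev  <- levels(df[[target_col]])
  pred <- ifelse(prob > 0.5, lev[2], lev[1])

  list(model = model,
       predictions = pred,
       accuracy = mean(pred == as.character(test[[target_col]])))
}

# Usage, assuming `heart$target` is the low/high factor created earlier:
# result <- fit_logistic(heart)
# result$accuracy
```

With only 303 subjects, the accuracy from a single split will vary noticeably with the seed, so repeated splits or cross-validation would give a more stable estimate.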

Midterm: Heart Attack Possibility
