About the Data

The Heart Attack Analysis and Prediction Dataset is available on Kaggle. The dataset consists of 303 observations and 13 predictors.

The dataset poses a classification problem. The response variable is binary with 1 representing a higher risk of heart attack and 0 representing a lower risk of heart attack.

The predictors are a mix of continuous and categorical variables. The descriptions listed below are taken from the dataset’s Kaggle page:

  • Age : Age of the patient

  • Sex : Sex of the patient

  • exang: exercise induced angina (1 = yes; 0 = no)

  • ca: number of major vessels (0-3)

  • cp : chest pain type. Value 1: typical angina; Value 2: atypical angina; Value 3: non-anginal pain; Value 4: asymptomatic

  • trtbps : resting blood pressure (in mm Hg)

  • chol : cholesterol in mg/dl fetched via BMI sensor

  • fbs : (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)

  • rest_ecg : resting electrocardiographic results. Value 0: normal; Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV); Value 2: showing probable or definite left ventricular hypertrophy by Estes’ criteria

  • thalach : maximum heart rate achieved

  • target : 0 = less chance of heart attack; 1 = more chance of heart attack

The problems at hand are:

  1. Conduct an exploratory data analysis to determine the characteristics of each group.

  2. Construct a predictive model that identifies observations at higher risk for heart attacks.

Exploratory Data Analysis

Output Variable

Slightly over half of the observations are at high risk for a heart attack. This means the data is well-balanced and suitable for fitting machine learning models.

Correlation Plot

  • There is a negative correlation (-0.577) between the variables oldpeak and slp

  • A weak positive correlation (0.387) exists between slp and thall

  • A weak negative correlation (-0.344) exists between oldpeak and thalachh, the maximum heart rate

  • There is a negative correlation (-0.399) between age and thalachh, representing a decrease in the maximum heart rate with age
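The coefficients listed above are pairwise correlations. As a minimal sketch, assuming they are Pearson coefficients, the calculation looks like the following. The function name `pearson` and the toy values are hypothetical and are not rows from the dataset.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products and sums of squares around the means.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy values, NOT taken from the heart attack dataset.
age = [35, 42, 50, 58, 63, 70]
max_heart_rate = [182, 175, 160, 162, 145, 130]

r = pearson(age, max_heart_rate)
print(round(r, 3))  # negative: heart rate falls as age rises in this toy sample
```

A value near -1 or 1 indicates a strong linear relationship, while values such as -0.399 indicate a moderate one.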

Density Plots

Green represents those observations with lower risk, while red represents observations with a higher risk of heart attack.

There is visible separation between the groups in the previous peak and the maximum heart rate. Those at higher risk generally have a high maximum heart rate and a low previous peak.

However, some of the information in the maximum heart rate could be due to the variable’s correlation with age.

In the cholesterol plot, one can see an outlier with a cholesterol level over 500 mg/dl. Otherwise, the plot aligns with the generally accepted science that the total cholesterol level is not as indicative as its component parts, which are not included in this data set.

The resting blood pressure density chart shows little separation between the groups.

Sex and Age

About two-thirds of the data is from male patients; however, a higher proportion of female patients included in the data set are at high risk for a heart attack.

The density plot above shows that those at low risk tend to be older, which goes against what one would expect.

Other Categorical Variables

Predictive Modeling

KNN

A K-nearest neighbors model was fit using a 70/30 training/test split and 5-fold cross validation to select the value of K.
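The selection procedure described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the code used for the analysis: the synthetic two-cluster data, the helper names `knn_predict`, `cv_accuracy`, and `select_k`, and the candidate grid of odd K values are all hypothetical.

```python
import math
import random
from collections import Counter


def knn_predict(train_X, train_y, x, k):
    """Majority vote among the k training points closest to x (Euclidean distance)."""
    nearest = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]


def cv_accuracy(X, y, k, n_folds=5, seed=0):
    """Mean k-NN accuracy across n_folds cross-validation folds."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[f::n_folds] for f in range(n_folds)]
    scores = []
    for held_out in folds:
        held = set(held_out)
        fit_X = [X[i] for i in idx if i not in held]
        fit_y = [y[i] for i in idx if i not in held]
        hits = sum(knn_predict(fit_X, fit_y, X[i], k) == y[i] for i in held_out)
        scores.append(hits / len(held_out))
    return sum(scores) / n_folds


def select_k(X, y, candidates, n_folds=5):
    """Return the candidate K with the highest cross-validated accuracy."""
    return max(candidates, key=lambda k: cv_accuracy(X, y, k, n_folds))


# Synthetic two-class data (hypothetical, NOT the heart attack dataset).
rng = random.Random(42)
X = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(50)]
X += [(rng.gauss(3, 1), rng.gauss(3, 1)) for _ in range(50)]
y = [0] * 50 + [1] * 50

# 70/30 training/test split.
order = list(range(len(X)))
rng.shuffle(order)
cut = round(0.7 * len(order))
train, test = order[:cut], order[cut:]
train_X, train_y = [X[i] for i in train], [y[i] for i in train]

# Choose K on the training portion only, then score once on the held-out test set.
best_k = select_k(train_X, train_y, candidates=range(1, 26, 2))
test_hits = sum(knn_predict(train_X, train_y, X[i], best_k) == y[i] for i in test)
test_accuracy = test_hits / len(test)
print(best_k, round(test_accuracy, 2))
```

Restricting the candidates to odd values avoids tied votes in a two-class problem. Because K-nearest neighbors relies on distances, real features measured on different scales would normally be standardized first.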

As shown in the plot above, K = 19 has the highest training accuracy rate. The results for the testing set are as follows:

Accuracy   Sensitivity   Specificity
0.87       0.94          0.76

                      Actual Low Risk   Actual High Risk
Predicted Low Risk    28                3
Predicted High Risk   9                 51

The KNN model performed well. Eighty-seven percent of the testing observations were correctly classified, including 51 out of 54 of the high-risk patients. The model performed worse at classifying low-risk patients, correctly identifying 28 out of 37.
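The three summary rates follow directly from the four cells of the confusion matrix. The short sketch below reproduces them from the KNN counts reported above, treating high risk as the positive class. The function name `classification_metrics` is a hypothetical helper introduced only for illustration.

```python
def classification_metrics(tp, fn, fp, tn):
    """Accuracy, sensitivity, and specificity from confusion-matrix counts."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,   # share of all observations classified correctly
        "sensitivity": tp / (tp + fn),   # share of actual high-risk patients caught
        "specificity": tn / (tn + fp),   # share of actual low-risk patients caught
    }


# KNN test-set counts from the table above (high risk = positive class):
# 51 high-risk predicted high, 3 high-risk predicted low,
# 9 low-risk predicted high, 28 low-risk predicted low.
knn = classification_metrics(tp=51, fn=3, fp=9, tn=28)
for name, value in knn.items():
    print(f"{name}: {value:.2f}")
# accuracy: 0.87
# sensitivity: 0.94
# specificity: 0.76
```

The same function can be applied to the counts reported for the other models.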

Logistic Regression

Logistic regression was performed with a 70/30 train/test split. The full model, which contains all variables, achieved higher test accuracy than a similar model chosen through stepwise AIC selection. The results from the full model are listed below:

Accuracy   Sensitivity   Specificity
0.88       0.94          0.78

                      Actual Low Risk   Actual High Risk
Predicted Low Risk    29                3
Predicted High Risk   8                 51

The relative importance of each variable to the model is measured by the absolute value of the t-score of its estimated coefficient, shown in the plot above. cp, the type of chest pain, was the most important predictor of higher risk for heart attack.

Support Vector Machine

A support vector machine was fitted to the data using a 70/30 training/test split. A sigmoid kernel was used with tuning parameters cost = 2, coef0 = 0.1, and gamma = 0.01, selected by a grid search with 10-fold cross validation. The SVM gave the same results as logistic regression; however, the logistic regression model is preferred because it is more interpretable.

Accuracy   Sensitivity   Specificity
0.88       0.94          0.78

                      Actual Low Risk   Actual High Risk
Predicted Low Risk    29                3
Predicted High Risk   8                 51
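To make the role of the kernel parameters concrete, here is a minimal sketch of a sigmoid kernel. It assumes the common libsvm-style definition tanh(gamma * <u, v> + coef0), which the write-up does not spell out. It uses the gamma and coef0 values reported above as defaults. The feature vectors are hypothetical. The cost parameter does not appear because it penalizes misclassification during fitting rather than entering the kernel itself.

```python
import math


def sigmoid_kernel(u, v, gamma=0.01, coef0=0.1):
    """Sigmoid kernel, assuming the form tanh(gamma * <u, v> + coef0)."""
    dot = sum(a * b for a, b in zip(u, v))
    return math.tanh(gamma * dot + coef0)


# Hypothetical feature vectors, NOT observations from the dataset.
u = [1.0, 2.0, 3.0]
v = [0.5, -1.0, 2.0]

k_uv = sigmoid_kernel(u, v)
print(k_uv)
```

With a small gamma such as 0.01, the kernel value changes slowly with the inner product, so the similarity measure stays close to linear for inputs of moderate size.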

XG Boosting

The eXtreme Gradient Boosting (XGBoost) model was fit using cross validation, which set the tuning parameter nround to 456. XGBoost performed slightly worse than the other models.

Accuracy   Sensitivity   Specificity
0.82       0.85          0.78

                      Actual Low Risk   Actual High Risk
Predicted Low Risk    29                8
Predicted High Risk   8                 46

Takeaways

  • The logistic regression model classified roughly 88% of test observations accurately, with higher interpretability than other similarly performing models. Furthermore, 94% of patients at high risk for heart attack were successfully classified.

  • The presence of chest pain, specifically atypical angina or non-anginal pain, was the most important predictor of risk for heart attack in the chosen model.

  • Females present in this data set appear to be at a higher risk of heart attack than males.

  • Having a higher ‘caa’, the number of major blood vessels, appears to reduce the probability of being at higher risk for a heart attack.

  • Overall, the data supports a strong classification model.