Heart Disease Prediction

Objective

The objective of this project is to build a decision tree that assesses whether an individual is likely to have a heart attack, based on data collected from a hospital in Cleveland.

The dataset contains the following attributes:

  1. age
  2. sex
  3. chest pain type (4 values)
  4. resting blood pressure
  5. serum cholesterol in mg/dl
  6. fasting blood sugar > 120 mg/dl
  7. resting electrocardiographic results (values 0,1,2)
  8. maximum heart rate achieved
  9. exercise induced angina
  10. oldpeak = ST depression induced by exercise relative to rest
  11. the slope of the peak exercise ST segment
  12. number of major vessels (0-3) colored by fluoroscopy
  13. thal: 3 = normal; 6 = fixed defect; 7 = reversible defect

EDA

library(corrgram)
library(MASS)
library(glmnet)
library(rpart)
library(rpart.plot)
library(rattle)
library(ROCR)
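
If any of these packages are not yet installed, a one-time install covers them; a minimal sketch:

#Optional one-time install of the packages loaded above
install.packages(c("corrgram", "MASS", "glmnet", "rpart", "rpart.plot", "rattle", "ROCR"))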

After loading the above libraries, let's import the data.

#Importing the data
dataset=read.csv("heart.csv")

#Renaming the first column (read in as "ï..age" because of a byte-order mark in the file) to Age
names(dataset)[1] = "Age"
colnames(dataset)
##  [1] "Age"      "sex"      "cp"       "trestbps" "chol"     "fbs"     
##  [7] "restecg"  "thalach"  "exang"    "oldpeak"  "slope"    "ca"      
## [13] "thal"     "target"
After importing the data, check the dimensions of the dataset.
dim(dataset)
## [1] 303  14
Inference:

The dataset has 303 rows and 14 columns, meaning there are 303 patients with 14 attributes each. Next, let's check for missing values and then look at the correlations between the variables.

sum(is.na(dataset))
## [1] 0
Inference:

The dataset is free from missing values.

#Correlogram: the lower panel shows the pairwise correlation coefficients
corrgram(dataset, upper.panel = NULL, lower.panel = panel.cor)

Inference:

We observe that a few variables (age, sex, exang, oldpeak, thalach, slope, and ca) show considerable correlation with the target. Hence, we will include these when building the decision tree.
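
As a quick complement to the correlogram, the correlation of each column with the target can also be ranked directly. A minimal sketch, assuming (as in this dataset) that all columns are numeric:

#Rank variables by the absolute value of their correlation with the target
#(the target itself will appear first with correlation 1)
cor_with_target = cor(dataset)[, "target"]
sort(abs(cor_with_target), decreasing = TRUE)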

Decision Tree

A decision tree is an effective machine learning technique for regression and classification problems. It provides a sequential, hierarchical decision about the response variable based on the predictor data: the model is defined by a series of questions that lead to a predicted value when applied to an observation. Once the model is ready, it acts like a protocol, a series of “if-then” conditions that produce a particular result from the provided data.

CART stands for classification and regression tree.

  • Types
    • Regression tree: response variable Y is numerical
    • Classification tree: response variable Y is categorical

Decision tree models are called classification trees when the target variable takes a discrete set of values and regression trees when the response variable is numerical. In our case the response variable target has only two outcomes, that is, whether the patient is likely to have a heart attack or not, hence we build a classification tree. The goal of the decision tree is to select the optimal split at each node. The decision to split is made according to a metric known as purity: a node is 100% impure when its observations are split evenly 50/50 between the classes and 100% pure when they all belong to a single class. To achieve the best model, impurity should be minimized at each split, as illustrated in the sketch below.
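
To make the purity idea concrete, here is a small illustrative sketch of Gini impurity, the measure rpart uses by default for classification splits (the function name gini is ours, not part of the original analysis):

#Gini impurity of a node, given the vector of class proportions p
gini = function(p) 1 - sum(p^2)

gini(c(0.5, 0.5))  # 0.5 -> maximally impure 50/50 node
gini(c(1, 0))      # 0.0 -> pure node, all observations in one class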

Splitting the data
#Reproducible 70/30 split into training and test sets
set.seed(13262376)
split = sample(nrow(dataset), nrow(dataset)*0.70)
training_set = dataset[split,]
test_set = dataset[-split,]
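
As an optional sanity check (not part of the original analysis), we can confirm that the random 70/30 split leaves both classes represented in similar proportions in the two sets:

#Class proportions in the training and test sets
prop.table(table(training_set$target))
prop.table(table(test_set$target))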
Building a Decision Tree Model
#Classification tree using the predictors identified in the EDA
tree_mod = rpart(target ~ Age + sex + exang + oldpeak + thalach + slope + ca, data=training_set, method="class")

prp(tree_mod)

In the rpart() function, the cp (complexity parameter) argument is one of the parameters used to control the complexity of the tree: any split that does not improve the overall fit by a factor of at least cp is not attempted (for anova/regression trees this means the overall R-squared must increase by cp at each step). The smaller the cp value, the larger (more complex) the tree rpart will attempt to fit.
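
To see how cp plays out in practice, the fitted tree's complexity table can be inspected and the tree pruned back to the cp value with the lowest cross-validated error. A hedged sketch (best_cp and pruned_mod are illustrative names, not part of the original analysis):

#Cross-validated error for each candidate cp value
printcp(tree_mod)
plotcp(tree_mod)

#Prune back to the cp with the lowest cross-validated error (xerror)
best_cp = tree_mod$cptable[which.min(tree_mod$cptable[, "xerror"]), "CP"]
pruned_mod = prune(tree_mod, cp = best_cp)
prp(pruned_mod)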

  • ROC: To obtain an overall measure of the goodness of classification, we use the Receiver Operating Characteristic (ROC) curve. Rather than a single overall misclassification rate, it employs two measures across all thresholds: the true positive fraction (TPF, i.e. sensitivity) and the false positive fraction (FPF, i.e. 1 - specificity).

  • AUC: Area under the ROC curve; as per industry standards, an AUC above 0.70 is generally considered good. A sketch of how it can be computed with ROCR follows this list.
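
Once a ROCR prediction object has been built (as in the training and testing steps below), the AUC values reported in the inferences can be extracted from it; a minimal sketch:

#Area under the ROC curve from a ROCR prediction object
auc_value = performance(ROCRpred, measure = "auc")@y.values[[1]]
auc_value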

Training: In-sample data
#Predicted probability of the positive class (target = 1) for the training data
pred_tree_train = predict(tree_mod, newdata = training_set, type = "prob")
pred_tree_train = round(pred_tree_train[,2], digits=5)

ROCRpred = prediction(pred_tree_train, training_set$target)
train_plot <- plot(performance(ROCRpred, "tpr", "fpr"), colorize = TRUE, main = "Training Plot")

#Confusion matrix at a 0.70 probability cutoff (rows: actual class, columns: predicted probability > 0.70)
table(training_set$target, pred_tree_train > .70)
##    
##     FALSE TRUE
##   0    80   11
##   1    31   90
Inference:

In-sample AUC: 0.8018
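
For reference, other summary rates can be read off the same confusion matrix at the 0.70 cutoff; an illustrative sketch (not part of the original analysis, approximate values noted in the comments):

#Accuracy, sensitivity and specificity at the 0.70 cutoff
conf = table(training_set$target, pred_tree_train > .70)
sum(diag(conf)) / sum(conf)              # accuracy    ~ (80 + 90) / 212 = 0.80
conf["1", "TRUE"]  / sum(conf["1", ])    # sensitivity ~ 90 / 121 = 0.74
conf["0", "FALSE"] / sum(conf["0", ])    # specificity ~ 80 / 91  = 0.88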

Testing: Out-of-sample data
#Predicted probability of the positive class (target = 1) for the test data
pred_tree_test = predict(tree_mod, newdata = test_set, type = "prob")
pred_tree_test = round(pred_tree_test[,2], digits=5)

ROCRpred = prediction(pred_tree_test, test_set$target)
test_plot<-plot(performance(ROCRpred, "tpr", "fpr"), colorize = TRUE, main = "Testing Plot")

table(test_set$target, pred_tree_test > .70)
##    
##     FALSE TRUE
##   0    37   10
##   1    16   28
Inference:

Out-of-sample AUC: 0.7142

Conclusion

The in-sample and out-of-sample areas under the ROC curve (AUC) were 0.8018 and 0.7142, respectively. The drop from training to testing is modest, and both values clear the 0.70 threshold noted above, so the classification tree generalizes reasonably well to unseen patients.