The objective of this project is to build a decision tree to assess whether an individual is likely to have a heart attack, based on data collected from a hospital in Cleveland.
The dataset contains the following attributes: Age, sex, cp, trestbps, chol, fbs, restecg, thalach, exang, oldpeak, slope, ca, thal, and target. We begin by loading the required libraries.
library(corrgram)     #correlation plots
library(MASS)         #statistical modeling utilities
library(glmnet)       #regularized regression
library(rpart)        #recursive partitioning (decision trees)
library(rpart.plot)   #plotting rpart trees
library(rattle)       #additional tree plotting utilities
library(ROCR)         #ROC curves and performance measures
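If any of these packages are missing, they can be installed from CRAN first; a minimal sketch:
#One-time installation of the packages used below
install.packages(c("corrgram", "MASS", "glmnet", "rpart", "rpart.plot", "rattle", "ROCR"))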
With the libraries loaded, let's import the data.
#Importing the data
dataset=read.csv("heart.csv")
#Renaming the column ï..age (a byte-order-mark artifact) to Age
names(dataset)[1] = "Age"
colnames(dataset)
## [1] "Age" "sex" "cp" "trestbps" "chol" "fbs"
## [7] "restecg" "thalach" "exang" "oldpeak" "slope" "ca"
## [13] "thal" "target"
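The odd ï..age name comes from a UTF-8 byte-order mark left in the file header. As an alternative to renaming, the BOM can be stripped at read time; a sketch, using a separate dataset_alt so the renamed dataset above is left untouched:
#Alternative: strip the byte-order mark when reading the file
dataset_alt = read.csv("heart.csv", fileEncoding = "UTF-8-BOM")
colnames(dataset_alt)[1]   #reads cleanly as "age", without the ï.. prefix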
dim(dataset)
## [1] 303 14
The dataset has 303 rows and 14 columns, meaning there are 303 patients with 14 attributes each. Next, let's check for missing values and then plot the correlations between the variables.
sum(is.na(dataset))
## [1] 0
The dataset is free from missing values.
corrgram(dataset, upper.panel = NULL, lower.panel = panel.cor)
We observe that several variables (Age, sex, exang, oldpeak, thalach, slope, and ca) show considerable correlation with the target. Hence, we will include these while building the decision tree.
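The correlations with the target can also be inspected numerically; a minimal sketch, which works here because every column in this dataset is numeric:
#Correlation of each attribute with the target
round(cor(dataset)[, "target"], 2)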
A decision tree is an effective machine learning technique for regression and classification problems. It provides a sequential, hierarchical decision about the response variable based on the predictor data: the model is defined by a series of questions that lead to a value when applied to an observation. Once the model is ready, it acts like a protocol, a series of "if then" conditions that produce a particular prediction from the provided data.
CART stands for Classification And Regression Tree. Decision tree models are called classification trees when the target variable takes a discrete set of values and regression trees when the response variable is continuous. In our case the response variable target has only two outcomes, whether or not the patient is likely to have a heart attack, hence we build a classification tree. The goal at each node is to select the optimal split, and the decision to split is made according to a metric known as purity. A node is maximally impure when its classes are split evenly 50/50 and 100% pure when all of its data belongs to a single class. To achieve the best model, impurity should be minimized.
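By default, rpart measures impurity for classification with the Gini index, G = 1 - sum(p_i^2), where p_i is the proportion of class i in a node. A minimal sketch of the computation (gini_impurity is a hypothetical helper written for illustration, not part of rpart):
#Gini impurity: 0 for a pure node, 0.5 for an even 50/50 split of two classes
gini_impurity = function(labels) {
  p = prop.table(table(labels))
  1 - sum(p^2)
}
gini_impurity(c(0, 0, 1, 1))   #0.5, maximally impure for two classes
gini_impurity(c(1, 1, 1, 1))   #0, pure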
set.seed(13262376)
#Splitting the data: 70% for training, 30% for testing
split = sample(nrow(dataset), nrow(dataset)*0.70)
training_set = dataset[split,]
test_set = dataset[-split,]
tree_mod = rpart(target ~ Age + sex + exang + oldpeak + thalach + slope + ca, data=training_set, method="class")
prp(tree_mod)
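The "if then" protocol described earlier can also be printed as plain rules; a sketch using rpart.plot's rpart.rules() (available in rpart.plot 3.0 and later):
#One row per leaf: the conditions along the path and the predicted probability
rpart.rules(tree_mod)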
In the rpart() function, the cp (complexity parameter) argument is one of the parameters used to control the complexity of the tree. The overall fit must improve by at least cp at each step, so the smaller the cp value, the larger (more complex) the tree rpart will attempt to fit.
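The fitted cp table can be inspected and the tree pruned back if it looks overgrown; a minimal sketch (the 0.02 threshold is an assumed example value, not taken from this model's output):
printcp(tree_mod)                         #cross-validated error at each cp value
pruned_mod = prune(tree_mod, cp = 0.02)   #cut the tree back at an example cp
prp(pruned_mod)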
ROC: To determine the overall goodness of the classification, we use the Receiver Operating Characteristic (ROC) curve. Rather than a single overall misclassification rate, it employs two measures: the true positive fraction (TPF, sensitivity) and the false positive fraction (FPF, 1 - specificity).
AUC: the area under the ROC curve; by common convention, an AUC above 0.70 is considered a good measure.
#Predicted probability of target = 1 on the training set
pred_tree_train = predict(tree_mod, newdata = training_set, type = "prob")
pred_tree_train = round(pred_tree_train[,2], digits=5)
ROCRpred = prediction(pred_tree_train, training_set$target)
plot(performance(ROCRpred, "tpr", "fpr"), colorize = TRUE, main = "Training Plot")
#Confusion matrix on the training set at a 0.70 probability cutoff
table(training_set$target, pred_tree_train > .70)
##
## FALSE TRUE
## 0 80 11
## 1 31 90
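From this table, the training accuracy at the 0.70 cutoff is (80 + 90) / 212 ≈ 0.802.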
In-sample AUC: 0.8018
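The AUC can be extracted from the same ROCR prediction object; a minimal sketch (the same call on the test-set prediction object below yields the out-of-sample value):
#Area under the training ROC curve
auc_train = performance(ROCRpred, "auc")@y.values[[1]]
auc_train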
#Predicted probability of target = 1 on the test set
pred_tree_test = predict(tree_mod, newdata = test_set, type = "prob")
pred_tree_test = round(pred_tree_test[,2], digits=5)
ROCRpred = prediction(pred_tree_test, test_set$target)
plot(performance(ROCRpred, "tpr", "fpr"), colorize = TRUE, main = "Testing Plot")
#Confusion matrix on the test set at a 0.70 probability cutoff
table(test_set$target, pred_tree_test > .70)
##
## FALSE TRUE
## 0 37 10
## 1 16 28
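From this table, the test accuracy at the 0.70 cutoff is (37 + 28) / 91 ≈ 0.714.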
Out-of-sample AUC: 0.7142
The in-sample and out-of-sample area under the ROC curve were 0.8018 and 0.7142, respectively; both exceed the 0.70 benchmark noted above, indicating a reasonably good classifier.