The objective of this project is to build a decision tree to assess whether an individual is likely to have a heart attack, based on data collected from a hospital in Cleveland.
The dataset contains the following attributes: Age, sex, cp, trestbps, chol, fbs, restecg, thalach, exang, oldpeak, slope, ca, thal, and target. We begin by loading the required libraries.
library(corrgram)     #correlation plots
library(MASS)         #statistical modeling utilities
library(glmnet)       #regularized regression
library(rpart)        #recursive partitioning (decision trees)
library(rpart.plot)   #plotting rpart trees
library(rattle)       #additional tree plotting utilities
library(ROCR)         #ROC curves and performance measures
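If any of these packages are missing, they can be installed from CRAN first; a minimal sketch:
#One-time installation of the packages used below
install.packages(c("corrgram", "MASS", "glmnet", "rpart", "rpart.plot", "rattle", "ROCR"))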
With the libraries loaded, let's import the data.
#Importing the data
dataset=read.csv("heart.csv")
#Renaming the column ï..age (a byte-order-mark artifact) to Age
names(dataset)[1] = "Age"
colnames(dataset)
## [1] "Age" "sex" "cp" "trestbps" "chol" "fbs"
## [7] "restecg" "thalach" "exang" "oldpeak" "slope" "ca"
## [13] "thal" "target"
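The odd ï..age name comes from a UTF-8 byte-order mark left in the file header. As an alternative to renaming, the BOM can be stripped at read time; a sketch, using a separate dataset_alt so the renamed dataset above is left untouched:
#Alternative: strip the byte-order mark when reading the file
dataset_alt = read.csv("heart.csv", fileEncoding = "UTF-8-BOM")
colnames(dataset_alt)[1]   #reads cleanly as "age", without the ï.. prefix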
dim(dataset)
## [1] 303 14
The dataset has 303 rows and 14 columns, meaning there are 303 patients with 14 attributes each. Next, let's check for missing values and then plot the correlations between the variables.
sum(is.na(dataset))
## [1] 0
The dataset is free from missing values.
corrgram(dataset, upper.panel = NULL, lower.panel = panel.cor)
We observe that several variables (Age, sex, exang, oldpeak, thalach, slope, and ca) show considerable correlation with the target. Hence, we will include these while building the decision tree.
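The correlations with the target can also be inspected numerically; a minimal sketch, which works here because every column in this dataset is numeric:
#Correlation of each attribute with the target
round(cor(dataset)[, "target"], 2)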
A decision tree is an effective machine learning technique for regression and classification problems. It provides a sequential, hierarchical decision about the response variable based on the predictor data: the model is defined by a series of questions that lead to a value when applied to an observation. Once the model is ready, it acts like a protocol, a series of "if then" conditions that produce a particular prediction from the provided data.
CART stands for Classification And Regression Tree. Decision tree models are called classification trees when the target variable takes a discrete set of values and regression trees when the response variable is continuous. In our case the response variable target has only two outcomes, whether or not the patient is likely to have a heart attack, hence we build a classification tree. The goal at each node is to select the optimal split, and the decision to split is made according to a metric known as purity. A node is maximally impure when its classes are split evenly 50/50 and 100% pure when all of its data belongs to a single class. To achieve the best model, impurity should be minimized.
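By default, rpart measures impurity for classification with the Gini index, G = 1 - sum(p_i^2), where p_i is the proportion of class i in a node. A minimal sketch of the computation (gini_impurity is a hypothetical helper written for illustration, not part of rpart):
#Gini impurity: 0 for a pure node, 0.5 for an even 50/50 split of two classes
gini_impurity = function(labels) {
  p = prop.table(table(labels))
  1 - sum(p^2)
}
gini_impurity(c(0, 0, 1, 1))   #0.5, maximally impure for two classes
gini_impurity(c(1, 1, 1, 1))   #0, pure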
set.seed(13262376)
#Splitting the data: 70% for training, 30% for testing
split = sample(nrow(dataset), nrow(dataset)*0.70)
training_set = dataset[split,]
test_set = dataset[-split,]
tree_mod = rpart(target ~ Age + sex + exang + oldpeak + thalach + slope + ca, data=training_set, method="class")
prp(tree_mod)
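The "if then" protocol described earlier can also be printed as plain rules; a sketch using rpart.plot's rpart.rules() (available in rpart.plot 3.0 and later):
#One row per leaf: the conditions along the path and the predicted probability
rpart.rules(tree_mod)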
In the rpart() function, the cp (complexity parameter) argument is one of the parameters used to control the complexity of the tree. The overall fit must improve by at least cp at each step, so the smaller the cp value, the larger (more complex) the tree rpart will attempt to fit.
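The fitted cp table can be inspected and the tree pruned back if it looks overgrown; a minimal sketch (the 0.02 threshold is an assumed example value, not taken from this model's output):
printcp(tree_mod)                         #cross-validated error at each cp value
pruned_mod = prune(tree_mod, cp = 0.02)   #cut the tree back at an example cp
prp(pruned_mod)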
ROC: To determine the overall goodness of the classification, we use the Receiver Operating Characteristic (ROC) curve. Rather than a single overall misclassification rate, it employs two measures: the true positive fraction (TPF, sensitivity) and the false positive fraction (FPF, 1 - specificity).
AUC: the area under the ROC curve; by common convention, an AUC above 0.70 is considered a good measure.
#Predicted probability of target = 1 on the training set
pred_tree_train = predict(tree_mod, newdata = training_set, type = "prob")
pred_tree_train = round(pred_tree_train[,2], digits=5)
ROCRpred = prediction(pred_tree_train, training_set$target)
plot(performance(ROCRpred, "tpr", "fpr"), colorize = TRUE, main = "Training Plot")
#Confusion matrix on the training set at a 0.70 probability cutoff
table(training_set$target, pred_tree_train > .70)
##
## FALSE TRUE
## 0 80 11
## 1 31 90
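From this table, the training accuracy at the 0.70 cutoff is (80 + 90) / 212 ≈ 0.802.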
In-sample AUC: 0.8018
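The AUC can be extracted from the same ROCR prediction object; a minimal sketch (the same call on the test-set prediction object below yields the out-of-sample value):
#Area under the training ROC curve
auc_train = performance(ROCRpred, "auc")@y.values[[1]]
auc_train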
#Predicted probability of target = 1 on the test set
pred_tree_test = predict(tree_mod, newdata = test_set, type = "prob")
pred_tree_test = round(pred_tree_test[,2], digits=5)
ROCRpred = prediction(pred_tree_test, test_set$target)
plot(performance(ROCRpred, "tpr", "fpr"), colorize = TRUE, main = "Testing Plot")
#Confusion matrix on the test set at a 0.70 probability cutoff
table(test_set$target, pred_tree_test > .70)
##
## FALSE TRUE
## 0 37 10
## 1 16 28
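From this table, the test accuracy at the 0.70 cutoff is (37 + 28) / 91 ≈ 0.714.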
Out-of-sample AUC: 0.7142
The in-sample and out-of-sample area under the ROC curve were 0.8018 and 0.7142, respectively; both exceed the 0.70 benchmark noted above, indicating a reasonably good classifier.