Background & Introduction


Cardiovascular diseases remain the leading cause of mortality worldwide, accounting for an estimated 17.9 million deaths each year. In clinical practice, early diagnosis and effective risk stratification are crucial for preventing adverse outcomes associated with heart disease. With the increasing digitization of healthcare records, large datasets have become more accessible for analysis using machine learning techniques. This has opened the door for data-driven approaches to assist clinicians in making timely and accurate decisions.

One such dataset is the Heart Disease dataset from the UCI Machine Learning Repository, which includes a combination of demographic, clinical, and laboratory features collected from patients. These features—such as age, chest pain type, resting blood pressure, cholesterol levels, and ECG results—can provide meaningful insights into the likelihood of heart disease in individuals.

In this study, we explore the application of Decision Tree modeling to the UCI Heart Disease dataset. Decision Trees are widely used in healthcare data mining due to their interpretability, ease of implementation, and ability to handle both categorical and continuous data. By analyzing the dataset using this method, we aim to construct an interpretable model that can help identify key predictors of heart disease and assist clinicians in diagnostic decision-making.

Business Understanding


Problem Statement

The primary objective of this study is to develop a Decision Tree model capable of predicting the presence of heart disease in patients using the UCI Heart Disease dataset. Specifically, we seek to:

  1. Identify the most significant clinical and demographic factors contributing to heart disease prediction.
  2. Construct a Decision Tree classifier that can effectively distinguish between patients with and without heart disease.
  3. Evaluate the model’s accuracy, interpretability, and potential clinical utility.

The motivation for this work stems from the need for transparent, rule-based models in clinical environments, where model decisions must be understandable to healthcare providers. Through this analysis, we aim to contribute toward the integration of machine learning techniques in cardiovascular risk assessment and diagnostic support.

Data Understanding


The dataset contains various features related to patients’ health and demographic information. We will explore the dataset to understand its structure and relationships between variables.

Data Dictionary

The dataset contains 14 key attributes that are either numerical or categorical. These attributes are:

  1. age: Age of the patient (numeric)
  2. sex: Gender of the patient (1 = male, 0 = female)
  3. cp: Chest pain type (categorical: 1-4)
  4. trestbps: Resting blood pressure (numeric)
  5. chol: Serum cholesterol (numeric)
  6. fbs: Fasting blood sugar (1 = true, 0 = false)
  7. restecg: Resting electrocardiographic results (categorical)
  8. thalach: Maximum heart rate achieved (numeric)
  9. exang: Exercise-induced angina (1 = yes, 0 = no)
  10. oldpeak: ST depression induced by exercise (numeric)
  11. slope: The slope of the peak exercise ST segment (categorical)
  12. ca: Number of major vessels (0-3, numeric)
  13. thal: Thalassemia (categorical: 3 = normal, 6 = fixed defect, 7 = reversible defect)
  14. target: Heart disease (1 = disease, 0 = no disease)


Attribute | Type | Description | Constraints/Rules
age | Numerical | The age of the patient in years | Range: 29-77 (based on the dataset statistics)
sex | Categorical | The gender of the patient | Values: 1 = Male, 0 = Female
cp | Categorical | Type of chest pain experienced by the patient | Values: 1 = Typical angina, 2 = Atypical angina, 3 = Non-anginal pain, 4 = Asymptomatic
trestbps | Numerical | Resting blood pressure of the patient, measured in mmHg | Range: typically 94-200 mmHg
chol | Numerical | Serum cholesterol level in mg/dl | Range: typically 126-564 mg/dl
fbs | Categorical | Fasting blood sugar level > 120 mg/dl | Values: 1 = True, 0 = False
restecg | Categorical | Results of the patient's resting electrocardiogram | Values: 0 = Normal, 1 = ST-T wave abnormality, 2 = Probable or definite left ventricular hypertrophy
thalach | Numerical | Maximum heart rate achieved during a stress test | Range: typically 71-202 bpm
exang | Categorical | Whether the patient experiences exercise-induced angina | Values: 1 = Yes, 0 = No
oldpeak | Numerical | ST depression induced by exercise relative to rest (an ECG measure) | Range: 0.0-6.2 (higher values indicate more severe abnormalities)
slope | Categorical | Slope of the peak exercise ST segment | Values: 1 = Upsloping, 2 = Flat, 3 = Downsloping
ca | Numerical | Number of major vessels colored by fluoroscopy | Range: 0-3
thal | Categorical | Blood disorder variable related to thalassemia | Values: 3 = Normal, 6 = Fixed defect, 7 = Reversible defect
target | Categorical | Diagnosis of heart disease | Values: 0 = No heart disease, 1 = Presence of heart disease


Data Preparation


Data Loading

Install the following packages if they are not already installed, then load them.

  • RCurl: for retrieving data from a URL
  • dplyr: for data manipulation tasks
  • caret: for model evaluation (confusion matrix and related metrics)
  • ggplot2: for creating visualizations
  • patchwork: for arranging multiple ggplot2 plots on a single canvas
  • corrplot: for displaying correlation matrices
  • rpart: for building decision trees
  • rpart.plot: for plotting decision trees
  • GGally: for pair plots
  • rattle: for enhanced decision tree graphics
  • RColorBrewer: for color palettes
# Install the required packages

if (!requireNamespace("RCurl", quietly = TRUE)) {
  install.packages("RCurl")
}
if (!requireNamespace("dplyr", quietly = TRUE)) {
  install.packages("dplyr")
}
if (!requireNamespace("caret", quietly = TRUE)) {
  install.packages("caret")
}
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2")
}
if (!requireNamespace("patchwork", quietly = TRUE)) {
  install.packages("patchwork")
}
if (!requireNamespace("corrplot", quietly = TRUE)) {
  install.packages("corrplot")
}
if (!requireNamespace("rpart", quietly = TRUE)) {
  install.packages("rpart")
}
if (!requireNamespace("rpart.plot", quietly = TRUE)) {
  install.packages("rpart.plot")
}
if (!requireNamespace("rattle", quietly = TRUE)) {
  install.packages("rattle")
}

# Load the required libraries
library(RCurl) # For data retrieval from url
library(ggplot2) # For creating visualizations
library(patchwork) # For arranging multiple ggplot2 plots in a single canvas.
library(corrplot) # For displaying correlation matrices
## corrplot 0.95 loaded
library(dplyr) # For tasks involving data manipulation
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(GGally) # For pair plots
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(caret)
## Loading required package: lattice
library(rpart)
library(rpart.plot)
library(rattle)
## Loading required package: tibble
## Loading required package: bitops
## Rattle: A free graphical interface for data science with R.
## Version 5.5.1 Copyright (c) 2006-2021 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
library(RColorBrewer) # For color palettes

Load the dataset directly from the UCI Machine Learning Repository

# URL of the processed Cleveland heart disease dataset on the UCI repository
data_url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data"

# Read the dataset into a dataframe; "?" marks missing values in the raw file
Heart.df <- read.csv(data_url, header = FALSE, na.strings = "?")

Display dimensions of the dataframe

dim(Heart.df)
## [1] 303  14

View the first six rows of the dataset

head(Heart.df)
##   V1 V2 V3  V4  V5 V6 V7  V8 V9 V10 V11 V12 V13 V14
## 1 63  1  1 145 233  1  2 150  0 2.3   3   0   6   0
## 2 67  1  4 160 286  0  2 108  1 1.5   2   3   3   2
## 3 67  1  4 120 229  0  2 129  1 2.6   2   2   7   1
## 4 37  1  3 130 250  0  0 187  0 3.5   3   0   3   0
## 5 41  0  2 130 204  0  2 172  0 1.4   1   0   3   0
## 6 56  1  2 120 236  0  0 178  0 0.8   1   0   3   0


Data Preprocessing

Rename the columns for clarity, following the data dictionary

colnames(Heart.df) <- c("age", "sex", "cp", "trestbps", "chol", "fbs", "restecg", "thalach", "exang", "oldpeak", "slope", "ca", "thal", "target")

Display the structure of the dataframe

str(Heart.df)
## 'data.frame':    303 obs. of  14 variables:
##  $ age     : num  63 67 67 37 41 56 62 57 63 53 ...
##  $ sex     : num  1 1 1 1 0 1 0 0 1 1 ...
##  $ cp      : num  1 4 4 3 2 2 4 4 4 4 ...
##  $ trestbps: num  145 160 120 130 130 120 140 120 130 140 ...
##  $ chol    : num  233 286 229 250 204 236 268 354 254 203 ...
##  $ fbs     : num  1 0 0 0 0 0 0 0 0 1 ...
##  $ restecg : num  2 2 2 0 2 0 2 0 2 2 ...
##  $ thalach : num  150 108 129 187 172 178 160 163 147 155 ...
##  $ exang   : num  0 1 1 0 0 0 0 1 0 1 ...
##  $ oldpeak : num  2.3 1.5 2.6 3.5 1.4 0.8 3.6 0.6 1.4 3.1 ...
##  $ slope   : num  3 2 2 3 1 1 3 1 2 3 ...
##  $ ca      : num  0 3 2 0 0 0 2 0 1 0 ...
##  $ thal    : num  6 3 7 3 3 3 3 3 7 7 ...
##  $ target  : int  0 2 1 0 0 0 3 0 2 1 ...

Display the statistical summary of the dataframe

summary(Heart.df)
##       age             sex               cp           trestbps    
##  Min.   :29.00   Min.   :0.0000   Min.   :1.000   Min.   : 94.0  
##  1st Qu.:48.00   1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:120.0  
##  Median :56.00   Median :1.0000   Median :3.000   Median :130.0  
##  Mean   :54.44   Mean   :0.6799   Mean   :3.158   Mean   :131.7  
##  3rd Qu.:61.00   3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:140.0  
##  Max.   :77.00   Max.   :1.0000   Max.   :4.000   Max.   :200.0  
##                                                                  
##       chol            fbs            restecg          thalach     
##  Min.   :126.0   Min.   :0.0000   Min.   :0.0000   Min.   : 71.0  
##  1st Qu.:211.0   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:133.5  
##  Median :241.0   Median :0.0000   Median :1.0000   Median :153.0  
##  Mean   :246.7   Mean   :0.1485   Mean   :0.9901   Mean   :149.6  
##  3rd Qu.:275.0   3rd Qu.:0.0000   3rd Qu.:2.0000   3rd Qu.:166.0  
##  Max.   :564.0   Max.   :1.0000   Max.   :2.0000   Max.   :202.0  
##                                                                   
##      exang           oldpeak         slope             ca        
##  Min.   :0.0000   Min.   :0.00   Min.   :1.000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.00   1st Qu.:1.000   1st Qu.:0.0000  
##  Median :0.0000   Median :0.80   Median :2.000   Median :0.0000  
##  Mean   :0.3267   Mean   :1.04   Mean   :1.601   Mean   :0.6722  
##  3rd Qu.:1.0000   3rd Qu.:1.60   3rd Qu.:2.000   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :6.20   Max.   :3.000   Max.   :3.0000  
##                                                  NA's   :4       
##       thal           target      
##  Min.   :3.000   Min.   :0.0000  
##  1st Qu.:3.000   1st Qu.:0.0000  
##  Median :3.000   Median :0.0000  
##  Mean   :4.734   Mean   :0.9373  
##  3rd Qu.:7.000   3rd Qu.:2.0000  
##  Max.   :7.000   Max.   :4.0000  
##  NA's   :2

According to the data dictionary, the attributes sex, fbs, exang, and target should be binary. However, the summary shows values other than 0 and 1: in the raw Cleveland data, target ranges from 0 to 4, where the values 1-4 all indicate the presence of heart disease.
Let's convert these variables to a strict (0, 1) coding

Heart.df$sex <- ifelse(Heart.df$sex > 0, 1, 0)
Heart.df$fbs <- ifelse(Heart.df$fbs > 0, 1, 0)
Heart.df$exang <- ifelse(Heart.df$exang > 0, 1, 0)
Heart.df$target <- ifelse(Heart.df$target > 0, 1, 0)

Check to see if there are missing values in the dataframe

sapply(Heart.df, function(x) sum(is.na(x)))
##      age      sex       cp trestbps     chol      fbs  restecg  thalach 
##        0        0        0        0        0        0        0        0 
##    exang  oldpeak    slope       ca     thal   target 
##        0        0        0        4        2        0

From the summary and the table above, there are missing values in the ca and thal columns.
Let's handle them using median imputation

# Impute missing values in 'ca' and 'thal' with each column's median
# ("?" entries were already converted to NA by na.strings = "?" at load time)
Heart.df$ca[is.na(Heart.df$ca)] <- median(Heart.df$ca, na.rm = TRUE)
Heart.df$thal[is.na(Heart.df$thal)] <- median(Heart.df$thal, na.rm = TRUE)
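Since thal is a categorical code (3/6/7), mode imputation is arguably more natural than the median. A minimal sketch of a hypothetical mode_impute helper (not used in the rest of this analysis):

# Hypothetical helper: replace NAs with the most frequent value (the mode)
mode_impute <- function(x) {
  tab <- table(x)                                    # counts per distinct value (NAs excluded)
  mode_val <- as.numeric(names(tab)[which.max(tab)]) # most frequent value
  x[is.na(x)] <- mode_val
  x
}
# Example usage: Heart.df$thal <- mode_impute(Heart.df$thal)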

Check for duplicate entries in the dataframe and print them out

dupes <- Heart.df[duplicated(Heart.df) | duplicated(Heart.df, fromLast = TRUE), ]
# Print or inspect the duplicate entries
print(dupes)
##  [1] age      sex      cp       trestbps chol     fbs      restecg  thalach 
##  [9] exang    oldpeak  slope    ca       thal     target  
## <0 rows> (or 0-length row.names)

Convert categorical attributes to factors

# Define a list of categorical columns with their levels and labels
categorical_columns <- list(
  sex = list(levels = c(0, 1), labels = c("Female", "Male")),
  cp = list(levels = c(1, 2, 3, 4), labels = c("Typical Angina", "Atypical Angina", "Non-Angina", "Asymptomatic")),
  fbs = list(levels = c(0, 1), labels = c("False", "True")),
  restecg = list(levels = c(0, 1, 2), labels = c("Normal", "Wave-abnormality", "Probable")),
  exang = list(levels = c(0, 1), labels = c("No", "Yes")),
  slope = list(levels = c(1, 2, 3), labels = c("Upsloping", "Flat", "Downsloping")),
  thal = list(levels = c(3, 6, 7), labels = c("Normal", "Fixed Defect", "Reversible")),
  target = list(levels = c(1, 0), labels = c("Yes", "No"))
)

# Apply the factor transformation using a loop
for (col in names(categorical_columns)) {
  Heart.df[[col]] <- factor(Heart.df[[col]], 
                            levels = categorical_columns[[col]]$levels, 
                            labels = categorical_columns[[col]]$labels)
}

EDA through Visualization

a. Barplots

Plot the distribution of heart disease by each categorical variable, using the HeartDiseaseBar helper sketched below

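The HeartDiseaseBar helper called below is not defined in this section; a minimal sketch, modeled on the HeartDiseaseBoxplot function defined later (the exact aesthetics of the original are assumptions):

# Sketch of the undefined helper: heart disease counts per level of a categorical variable
HeartDiseaseBar <- function(var) {
  ggplot(Heart.df, aes(x = .data[[var]], fill = target)) +
    geom_bar(position = "dodge") + theme_test() +
    labs(title = paste("Heart Disease by", var),
         x = var, fill = "Heart Disease")
}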
# Create the plots
g1 <- ggplot(Heart.df, aes(x=target, fill=target))+
  geom_bar() + theme_test() +
  ggtitle("Distribution of Heart Disease") +
  labs(x = "Heart Disease", fill = "Heart Disease")
g2 <- HeartDiseaseBar("sex")
g3 <- HeartDiseaseBar("cp")
g4 <- HeartDiseaseBar("fbs")
g5 <- HeartDiseaseBar("restecg")
g6 <- HeartDiseaseBar("exang")
g7 <- HeartDiseaseBar("slope")
g8 <- HeartDiseaseBar("thal")

# Combine plot using patchwork
(g1 | g2) /
(g3 | g4) /
(g5 | g6) /
(g7 | g8)


b. Histogram Distributions

Create histogram distributions of continuous variables.

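Likewise, HeartDiseaseHist is not defined above; a plausible sketch (the bin count is an assumption):

# Sketch of the undefined helper: histogram of a continuous variable, filled by disease status
HeartDiseaseHist <- function(var) {
  ggplot(Heart.df, aes(x = .data[[var]], fill = target)) +
    geom_histogram(bins = 30, color = "black") + theme_test() +
    labs(title = paste("Distribution of", var),
         x = var, fill = "Heart Disease")
}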
# Create the plots
p1 <- HeartDiseaseHist("age")
p2 <- HeartDiseaseHist("trestbps")
p3 <- HeartDiseaseHist("chol")
p4 <- HeartDiseaseHist("thalach")
p5 <- HeartDiseaseHist("oldpeak")

# Combine plot using patchwork
(p1) /
(p2 | p3) /
(p4 | p5)


c. Boxplots

HeartDiseaseBoxplot <- function(var1, var2) {
  ggplot(Heart.df, aes(x = .data[[var1]],
                       y = .data[[var2]],
                       fill = .data[[var1]])) +
    geom_boxplot() + theme_test() +
    labs(title = paste("Boxplot of", var2, "by", var1),
         x = var1, y = var2, fill = "Heart Disease")
}

Create boxplots of continuous variables.

# Create the plots
p1 <- HeartDiseaseBoxplot("target", "age")
p2 <- HeartDiseaseBoxplot("target", "trestbps")
p3 <- HeartDiseaseBoxplot("target", "chol")
p4 <- HeartDiseaseBoxplot("target", "thalach")
p5 <- HeartDiseaseBoxplot("target", "oldpeak")

# Combine plot using patchwork
(p1 | p2) /
(p3 | p4) /
(p5)


d. Scatterplots

HeartDiseaseScatter <- function(point1, point2){
  ggplot(Heart.df, aes(x = .data[[point1]],
                       y = .data[[point2]],
                       color = target)) +
    geom_point(size = 2) + theme_test() +
    geom_smooth(method = "lm", se = FALSE, color = "blue", formula = y ~ x) +
    labs(title = paste("Scatterplot of", point1, "by", point2),
       x = point1, y = point2, color = "Heart Disease")
}

Create scatterplots of continuous variables.

# Create the plots
p1 <- HeartDiseaseScatter("age", "oldpeak")
p2 <- HeartDiseaseScatter("age", "chol")
p3 <- HeartDiseaseScatter("age", "trestbps")
p4 <- HeartDiseaseScatter("age", "thalach")
p5 <- HeartDiseaseScatter("chol", "thalach")
p6 <- HeartDiseaseScatter("trestbps", "chol")
p7 <- HeartDiseaseScatter("thalach", "oldpeak")

# Combine plot using patchwork
(p1 | p2) /
(p3 | p4) /
(p5 | p6) /
(p7)


e. Pair Plots

Pairwise relationships between multiple continuous variables

# Create a colored pair plot for selected variables
ggpairs(Heart.df[, c("age", "trestbps", "chol", 
                     "thalach", "oldpeak", "target")], 
        aes(color = target, fill = target))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.


f. Correlation Matrix

Correlation matrix for continuous variables

# Selecting only continuous variables
continuous_vars <- c("age", "trestbps", "chol", "thalach", "oldpeak")
continuous_data <- Heart.df %>% select(all_of(continuous_vars))

# Calculating correlation matrix
correlation_matrix <- cor(continuous_data)

# Plotting the correlation matrix
corrplot(correlation_matrix, method = "circle",
         type = "lower", tl.col = "black")


Modeling


Partition the data into 60% training and 40% validation.

set.seed(123)
training.rows <- sample(1:nrow(Heart.df), nrow(Heart.df) * 0.6)
training.data <- Heart.df[training.rows, ]

head(training.data)
##     age    sex              cp trestbps chol   fbs  restecg thalach exang
## 179  43   Male      Non-Angina      130  315 False   Normal     162    No
## 14   44   Male Atypical Angina      120  263 False   Normal     173    No
## 195  68 Female      Non-Angina      120  211 False Probable     115    No
## 118  35 Female    Asymptomatic      138  183 False   Normal     182    No
## 299  45   Male  Typical Angina      110  264 False   Normal     132    No
## 229  54   Male    Asymptomatic      110  206 False Probable     108   Yes
##     oldpeak     slope ca       thal target
## 179     1.9 Upsloping  1     Normal     No
## 14      0.0 Upsloping  0 Reversible     No
## 195     1.5      Flat  0     Normal     No
## 118     1.4 Upsloping  0     Normal     No
## 299     1.2      Flat  0 Reversible    Yes
## 229     0.0      Flat  1     Normal    Yes

Assign the rows that are not in the training set to the validation set

validation.rows <- setdiff(1:nrow(Heart.df), training.rows)
validation.data <- Heart.df[validation.rows, ]

head(validation.data)
##    age    sex              cp trestbps chol   fbs  restecg thalach exang
## 2   67   Male    Asymptomatic      160  286 False Probable     108   Yes
## 3   67   Male    Asymptomatic      120  229 False Probable     129   Yes
## 6   56   Male Atypical Angina      120  236 False   Normal     178    No
## 8   57 Female    Asymptomatic      120  354 False   Normal     163   Yes
## 12  56 Female Atypical Angina      140  294 False Probable     153    No
## 15  52   Male      Non-Angina      172  199  True   Normal     162    No
##    oldpeak     slope ca       thal target
## 2      1.5      Flat  3     Normal    Yes
## 3      2.6      Flat  2 Reversible    Yes
## 6      0.8 Upsloping  0     Normal     No
## 8      0.6 Upsloping  0     Normal     No
## 12     1.3      Flat  0     Normal     No
## 15     0.5 Upsloping  0 Reversible     No
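Note that sample() draws rows without regard to class balance. Since caret is already loaded, a class-stratified 60/40 split is available as an alternative; a sketch of what that would look like (hypothetical names, not used in this analysis):

# Hypothetical class-stratified alternative to the random split above
# set.seed(123)
# strat.rows <- createDataPartition(Heart.df$target, p = 0.6, list = FALSE)
# training.data   <- Heart.df[strat.rows, ]
# validation.data <- Heart.df[-strat.rows, ]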

Create a decision tree model using the training data.

dt_model <- rpart(target ~ ., data = training.data, method = "class",
                       control = rpart.control(minsplit = 20, minbucket = 7,
                                               maxdepth = 10, usesurrogate = 2,
                                               xval = 10)
                       )

dt_model
## n= 181 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 181 85 No (0.46961326 0.53038674)  
##    2) cp=Asymptomatic 85 23 Yes (0.72941176 0.27058824)  
##      4) ca>=0.5 47  4 Yes (0.91489362 0.08510638) *
##      5) ca< 0.5 38 19 Yes (0.50000000 0.50000000)  
##       10) thalach< 147 19  4 Yes (0.78947368 0.21052632) *
##       11) thalach>=147 19  4 No (0.21052632 0.78947368) *
##    3) cp=Typical Angina,Atypical Angina,Non-Angina 96 23 No (0.23958333 0.76041667)  
##      6) age>=55.5 42 19 No (0.45238095 0.54761905)  
##       12) slope=Flat 22  8 Yes (0.63636364 0.36363636)  
##         24) chol>=248.5 9  1 Yes (0.88888889 0.11111111) *
##         25) chol< 248.5 13  6 No (0.46153846 0.53846154) *
##       13) slope=Upsloping,Downsloping 20  5 No (0.25000000 0.75000000) *
##      7) age< 55.5 54  4 No (0.07407407 0.92592593) *

Display statistical summary of the model

summary(dt_model)
## Call:
## rpart(formula = target ~ ., data = training.data, method = "class", 
##     control = rpart.control(minsplit = 20, minbucket = 7, maxdepth = 10, 
##         usesurrogate = 2, xval = 10))
##   n= 181 
## 
##           CP nsplit rel error    xerror       xstd
## 1 0.45882353      0 1.0000000 1.0000000 0.07899268
## 2 0.06470588      1 0.5411765 0.5411765 0.06891085
## 3 0.03529412      3 0.4117647 0.4941176 0.06681498
## 4 0.01176471      5 0.3411765 0.5411765 0.06891085
## 5 0.01000000      6 0.3294118 0.5529412 0.06939739
## 
## Variable importance
##       cp  thalach       ca     thal  oldpeak      age    exang trestbps 
##       19       15       13       12        9        9        8        5 
##     chol    slope      sex 
##        4        4        1 
## 
## Node number 1: 181 observations,    complexity param=0.4588235
##   predicted class=No   expected loss=0.4696133  P(node) =1
##     class counts:    85    96
##    probabilities: 0.470 0.530 
##   left son=2 (85 obs) right son=3 (96 obs)
##   Primary splits:
##       cp      splits as  RRRL,      improve=21.63364, (0 missing)
##       thal    splits as  RLL,       improve=16.94180, (0 missing)
##       ca      < 0.5   to the right, improve=16.91387, (0 missing)
##       oldpeak < 0.7   to the right, improve=14.24148, (0 missing)
##       exang   splits as  RL,        improve=12.60729, (0 missing)
##   Surrogate splits:
##       exang   splits as  RL,        agree=0.718, adj=0.400, (0 split)
##       thalach < 148.5 to the left,  agree=0.685, adj=0.329, (0 split)
##       thal    splits as  RLL,       agree=0.674, adj=0.306, (0 split)
##       ca      < 1.5   to the right, agree=0.641, adj=0.235, (0 split)
##       oldpeak < 0.7   to the right, agree=0.619, adj=0.188, (0 split)
## 
## Node number 2: 85 observations,    complexity param=0.06470588
##   predicted class=Yes  expected loss=0.2705882  P(node) =0.4696133
##     class counts:    62    23
##    probabilities: 0.729 0.271 
##   left son=4 (47 obs) right son=5 (38 obs)
##   Primary splits:
##       ca      < 0.5   to the right, improve=7.233792, (0 missing)
##       oldpeak < 0.55  to the right, improve=5.869984, (0 missing)
##       thal    splits as  RLL,       improve=4.439441, (0 missing)
##       exang   splits as  RL,        improve=3.938671, (0 missing)
##       thalach < 146.5 to the left,  improve=3.111732, (0 missing)
##   Surrogate splits:
##       age     < 51.5  to the right, agree=0.706, adj=0.342, (0 split)
##       sex     splits as  RL,        agree=0.635, adj=0.184, (0 split)
##       thal    splits as  RLL,       agree=0.635, adj=0.184, (0 split)
##       chol    < 301   to the left,  agree=0.612, adj=0.132, (0 split)
##       oldpeak < 0.55  to the right, agree=0.600, adj=0.105, (0 split)
## 
## Node number 3: 96 observations,    complexity param=0.03529412
##   predicted class=No   expected loss=0.2395833  P(node) =0.5303867
##     class counts:    23    73
##    probabilities: 0.240 0.760 
##   left son=6 (42 obs) right son=7 (54 obs)
##   Primary splits:
##       age     < 55.5  to the right, improve=6.762235, (0 missing)
##       oldpeak < 1.95  to the right, improve=3.911787, (0 missing)
##       sex     splits as  RL,        improve=3.901389, (0 missing)
##       slope   splits as  RLR,       improve=3.826670, (0 missing)
##       thal    splits as  RLL,       improve=2.823704, (0 missing)
##   Surrogate splits:
##       thalach  < 151.5 to the left,  agree=0.708, adj=0.333, (0 split)
##       trestbps < 136.5 to the right, agree=0.698, adj=0.310, (0 split)
##       ca       < 0.5   to the right, agree=0.677, adj=0.262, (0 split)
##       thal     splits as  RLL,       agree=0.667, adj=0.238, (0 split)
##       oldpeak  < 0.7   to the right, agree=0.646, adj=0.190, (0 split)
## 
## Node number 4: 47 observations
##   predicted class=Yes  expected loss=0.08510638  P(node) =0.2596685
##     class counts:    43     4
##    probabilities: 0.915 0.085 
## 
## Node number 5: 38 observations,    complexity param=0.06470588
##   predicted class=Yes  expected loss=0.5  P(node) =0.2099448
##     class counts:    19    19
##    probabilities: 0.500 0.500 
##   left son=10 (19 obs) right son=11 (19 obs)
##   Primary splits:
##       thalach  < 147   to the left,  improve=6.368421, (0 missing)
##       exang    splits as  RL,        improve=5.277778, (0 missing)
##       oldpeak  < 0.65  to the right, improve=3.454545, (0 missing)
##       thal     splits as  RRL,       improve=3.454545, (0 missing)
##       trestbps < 142   to the right, improve=3.134680, (0 missing)
##   Surrogate splits:
##       thal     splits as  RLL,       agree=0.737, adj=0.474, (0 split)
##       oldpeak  < 0.65  to the right, agree=0.711, adj=0.421, (0 split)
##       chol     < 201.5 to the left,  agree=0.658, adj=0.316, (0 split)
##       trestbps < 126   to the left,  agree=0.632, adj=0.263, (0 split)
##       slope    splits as  RLL,       agree=0.632, adj=0.263, (0 split)
## 
## Node number 6: 42 observations,    complexity param=0.03529412
##   predicted class=No   expected loss=0.452381  P(node) =0.2320442
##     class counts:    19    23
##    probabilities: 0.452 0.548 
##   left son=12 (22 obs) right son=13 (20 obs)
##   Primary splits:
##       slope   splits as  RLR,       improve=3.127706, (0 missing)
##       oldpeak < 1.95  to the right, improve=2.752381, (0 missing)
##       sex     splits as  RL,        improve=2.742857, (0 missing)
##       age     < 66    to the left,  improve=2.181958, (0 missing)
##       ca      < 0.5   to the right, improve=1.664069, (0 missing)
##   Surrogate splits:
##       oldpeak  < 1.1   to the right, agree=0.762, adj=0.50, (0 split)
##       trestbps < 131   to the left,  agree=0.714, adj=0.40, (0 split)
##       thalach  < 161   to the left,  agree=0.690, adj=0.35, (0 split)
##       thal     splits as  RLL,       agree=0.643, adj=0.25, (0 split)
##       age      < 69.5  to the left,  agree=0.619, adj=0.20, (0 split)
## 
## Node number 7: 54 observations
##   predicted class=No   expected loss=0.07407407  P(node) =0.2983425
##     class counts:     4    50
##    probabilities: 0.074 0.926 
## 
## Node number 10: 19 observations
##   predicted class=Yes  expected loss=0.2105263  P(node) =0.1049724
##     class counts:    15     4
##    probabilities: 0.789 0.211 
## 
## Node number 11: 19 observations
##   predicted class=No   expected loss=0.2105263  P(node) =0.1049724
##     class counts:     4    15
##    probabilities: 0.211 0.789 
## 
## Node number 12: 22 observations,    complexity param=0.01176471
##   predicted class=Yes  expected loss=0.3636364  P(node) =0.121547
##     class counts:    14     8
##    probabilities: 0.636 0.364 
##   left son=24 (9 obs) right son=25 (13 obs)
##   Primary splits:
##       chol     < 248.5 to the right, improve=1.9425020, (0 missing)
##       trestbps < 125.5 to the right, improve=1.7175320, (0 missing)
##       ca       < 0.5   to the right, improve=1.4545450, (0 missing)
##       thalach  < 144.5 to the right, improve=1.1219890, (0 missing)
##       age      < 60.5  to the left,  improve=0.9818182, (0 missing)
##   Surrogate splits:
##       age      < 61.5  to the right, agree=0.682, adj=0.222, (0 split)
##       trestbps < 165   to the right, agree=0.682, adj=0.222, (0 split)
##       thalach  < 157   to the right, agree=0.636, adj=0.111, (0 split)
##       exang    splits as  RL,        agree=0.636, adj=0.111, (0 split)
##       thal     splits as  RRL,       agree=0.636, adj=0.111, (0 split)
## 
## Node number 13: 20 observations
##   predicted class=No   expected loss=0.25  P(node) =0.1104972
##     class counts:     5    15
##    probabilities: 0.250 0.750 
## 
## Node number 24: 9 observations
##   predicted class=Yes  expected loss=0.1111111  P(node) =0.04972376
##     class counts:     8     1
##    probabilities: 0.889 0.111 
## 
## Node number 25: 13 observations
##   predicted class=No   expected loss=0.4615385  P(node) =0.0718232
##     class counts:     6     7
##    probabilities: 0.462 0.538
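The CP table in the summary shows the lowest cross-validated error (xerror = 0.494) at three splits, hinting that the full six-split tree may overfit slightly. A sketch of cost-complexity pruning at that point (the unpruned model is kept for the plots and evaluation below):

# Prune at the CP value with the lowest cross-validated error (xerror)
best_cp <- dt_model$cptable[which.min(dt_model$cptable[, "xerror"]), "CP"]
pruned_model <- prune(dt_model, cp = best_cp)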

Plot the decision tree

plot(dt_model)
text(dt_model)

Beautify the decision tree

tot_count <- function(x, labs, digits, varlen) {
  paste(labs, "\n\nn", x$frame$n)
}
prp(dt_model, faclen = 0, cex = 0.8, node.fun = tot_count)
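The rattle package loaded earlier also offers a polished rendering, for example:

# Alternative rendering using rattle
fancyRpartPlot(dt_model)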


Evaluation


Predict on the validation set

predictions <- predict(dt_model, validation.data, type = "class")
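For threshold tuning or a later ROC analysis, class-membership probabilities can be obtained as well; a sketch:

# Class probabilities (columns follow the target factor levels: Yes, No)
pred_probs <- predict(dt_model, validation.data, type = "prob")
head(pred_probs)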

Confusion matrix

confMatrix <- confusionMatrix(predictions, validation.data$target)
print(confMatrix)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Yes No
##        Yes  39  9
##        No   15 59
##                                           
##                Accuracy : 0.8033          
##                  95% CI : (0.7216, 0.8697)
##     No Information Rate : 0.5574          
##     P-Value [Acc > NIR] : 1.021e-08       
##                                           
##                   Kappa : 0.5967          
##                                           
##  Mcnemar's Test P-Value : 0.3074          
##                                           
##             Sensitivity : 0.7222          
##             Specificity : 0.8676          
##          Pos Pred Value : 0.8125          
##          Neg Pred Value : 0.7973          
##              Prevalence : 0.4426          
##          Detection Rate : 0.3197          
##    Detection Prevalence : 0.3934          
##       Balanced Accuracy : 0.7949          
##                                           
##        'Positive' Class : Yes             
## 

Metrics Observations

  • Accuracy: The proportion of all correct predictions (both true positives and true negatives) out of all predictions.
# Extract accuracy
accuracy <- confMatrix$overall['Accuracy']
cat("Accuracy:", accuracy, "\n")
## Accuracy: 0.8032787
  • Sensitivity (Recall): The proportion of actual positive cases correctly identified as positive (True Positives / (True Positives + False Negatives)).
# Extract sensitivity (Recall for the positive class)
sensitivity <- confMatrix$byClass['Sensitivity']
cat("Sensitivity (Recall):", sensitivity, "\n")
## Sensitivity (Recall): 0.7222222
  • Specificity: The proportion of actual negative cases correctly identified as negative (True Negatives / (True Negatives + False Positives)).
# Extract specificity (Recall for the negative class)
specificity <- confMatrix$byClass['Specificity']
cat("Specificity:", specificity, "\n")
## Specificity: 0.8676471
  • Precision (Positive Predictive Value): The proportion of positive predictions that are actually correct (True Positives / (True Positives + False Positives)).
# Extract Precision (Positive Predictive Value)
precision <- confMatrix$byClass['Precision']
cat("Precision:", precision, "\n")
## Precision: 0.8125
  • F1 Score: The harmonic mean of Precision and Recall, providing a balanced measure that considers both false positives and false negatives.
# Extract F1 Score
f1_score <- confMatrix$byClass['F1']
cat("F1 Score:", f1_score, "\n")
## F1 Score: 0.7647059
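As a sanity check, the same metrics can be derived directly from the confusion-matrix counts, with "Yes" as the positive class as reported above; a sketch:

# Rows of confMatrix$table are predictions, columns are the reference labels
tab <- confMatrix$table
TP <- tab["Yes", "Yes"]; FP <- tab["Yes", "No"]
FN <- tab["No", "Yes"];  TN <- tab["No", "No"]
cat("Accuracy:   ", (TP + TN) / sum(tab), "\n")
cat("Sensitivity:", TP / (TP + FN), "\n")
cat("Specificity:", TN / (TN + FP), "\n")
cat("Precision:  ", TP / (TP + FP), "\n")
cat("F1 Score:   ", 2 * TP / (2 * TP + FP + FN), "\n")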


Deployment



Conclusion