Cardiovascular diseases remain the leading cause of mortality worldwide, accounting for an estimated 17.9 million deaths each year. In clinical practice, early diagnosis and effective risk stratification are crucial for preventing adverse outcomes associated with heart disease. With the increasing digitization of healthcare records, large datasets have become more accessible for analysis using machine learning techniques. This has opened the door for data-driven approaches to assist clinicians in making timely and accurate decisions.
One such dataset is the Heart Disease dataset from the UCI Machine Learning Repository, which includes a combination of demographic, clinical, and laboratory features collected from patients. These features—such as age, chest pain type, resting blood pressure, cholesterol levels, and ECG results—can provide meaningful insights into the likelihood of heart disease in individuals.
In this study, we explore the application of Decision Tree
modeling to the UCI Heart Disease dataset. Decision Trees are
widely used in healthcare data mining due to their interpretability,
ease of implementation, and ability to handle both categorical and
continuous data. By analyzing the dataset using this method, we aim to
construct an interpretable model that can help identify key predictors
of heart disease and assist clinicians in diagnostic
decision-making.
The primary objective of this study is to develop a Decision Tree model that predicts the presence of heart disease in patients using the UCI Heart Disease dataset. Specifically, we seek to identify the key predictors of heart disease, construct an interpretable classification model, and evaluate its predictive performance on held-out data.
The motivation for this work stems from the need for transparent,
rule-based models in clinical environments, where model decisions must
be understandable to healthcare providers. Through this analysis, we aim
to contribute toward the integration of machine learning techniques in
cardiovascular risk assessment and diagnostic support.
The dataset contains various features related to patients’ health and demographic information. We will explore the dataset to understand its structure and relationships between variables.
The dataset contains 14 key attributes that are either numerical or categorical. These attributes are:
| Attribute | Type | Description | Constraints/Rules |
|---|---|---|---|
| age | Numerical | The age of the patient in years | Range: 29 to 77 (based on the dataset statistics) |
| sex | Categorical | The gender of the patient | Values: 1 = Male, 0 = Female |
| cp | Categorical | Type of chest pain experienced by the patient | Values: 1 = Typical angina, 2 = Atypical angina, 3 = Non-anginal pain, 4 = Asymptomatic |
| trestbps | Numerical | Resting blood pressure of the patient, measured in mmHg | Range: typically between 94 and 200 mmHg |
| chol | Numerical | Serum cholesterol level in mg/dl | Range: typically between 126 and 564 mg/dl |
| fbs | Categorical | Fasting blood sugar level > 120 mg/dl | Values: 1 = True, 0 = False |
| restecg | Categorical | Results of the patient’s resting electrocardiogram | Values: 0 = Normal, 1 = ST-T wave abnormality, 2 = Probable or definite left ventricular hypertrophy |
| thalach | Numerical | Maximum heart rate achieved during a stress test | Range: typically between 71 and 202 bpm |
| exang | Categorical | Whether the patient experiences exercise-induced angina | Values: 1 = Yes, 0 = No |
| oldpeak | Numerical | ST depression induced by exercise relative to rest (an ECG measure) | Range: 0.0 to 6.2 (higher values indicate more severe abnormalities) |
| slope | Categorical | Slope of the peak exercise ST segment | Values: 1 = Upsloping, 2 = Flat, 3 = Downsloping |
| ca | Numerical | Number of major vessels colored by fluoroscopy | Range: 0-3 |
| thal | Categorical | Blood disorder variable related to thalassemia | Values: 3 = Normal, 6 = Fixed defect, 7 = Reversible defect |
| target | Categorical | Diagnosis of heart disease | Values: 0 = No heart disease, 1 = Presence of heart disease |
Install and load the following packages if they are not already installed.
# Install the required packages
if (!requireNamespace("RCurl", quietly = TRUE)) {
install.packages("RCurl")
}
if (!requireNamespace("dplyr", quietly = TRUE)) {
install.packages("dplyr")
}
if (!requireNamespace("caret", quietly = TRUE)) {
install.packages("caret")
}
if (!requireNamespace("ggplot2", quietly = TRUE)) {
install.packages("ggplot2")
}
if (!requireNamespace("patchwork", quietly = TRUE)) {
install.packages("patchwork")
}
if (!requireNamespace("corrplot", quietly = TRUE)) {
install.packages("corrplot")
}
if (!requireNamespace("rpart", quietly = TRUE)) {
install.packages("rpart")
}
if (!requireNamespace("rpart.plot", quietly = TRUE)) {
install.packages("rpart.plot")
}
if (!requireNamespace("rattle", quietly = TRUE)) {
install.packages("rattle")
}
# Load the required libraries
library(RCurl) # For data retrieval from url
library(ggplot2) # For creating visualizations
library(patchwork) # For arranging multiple ggplot2 plots in a single canvas.
library(corrplot) # For displaying correlation matrices
## corrplot 0.95 loaded
library(dplyr) # For tasks involving data manipulation
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(GGally) # For pairwise plot matrices
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(caret) # For data partitioning and model evaluation
## Loading required package: lattice
library(rpart) # For building classification trees
library(rpart.plot) # For plotting rpart trees
library(rattle) # For enhanced tree visualizations
## Loading required package: tibble
## Loading required package: bitops
## Rattle: A free graphical interface for data science with R.
## Version 5.5.1 Copyright (c) 2006-2021 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
library(RColorBrewer) # For color palettes
Load the dataset from the UCI Machine Learning Repository
# URL of the processed Cleveland heart disease dataset
data.url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data"
# Read the dataset into a dataframe, treating "?" entries as missing values
Heart.df <- read.csv(data.url, header = FALSE, na.strings = "?")
Display dimensions of the dataframe
dim(Heart.df)
## [1] 303 14
View the first six rows of the dataset
head(Heart.df)
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14
## 1 63 1 1 145 233 1 2 150 0 2.3 3 0 6 0
## 2 67 1 4 160 286 0 2 108 1 1.5 2 3 3 2
## 3 67 1 4 120 229 0 2 129 1 2.6 2 2 7 1
## 4 37 1 3 130 250 0 0 187 0 3.5 3 0 3 0
## 5 41 0 2 130 204 0 2 172 0 1.4 1 0 3 0
## 6 56 1 2 120 236 0 0 178 0 0.8 1 0 3 0
Rename the columns for clarity
colnames(Heart.df) <- c("age", "sex", "cp", "trestbps", "chol", "fbs", "restecg", "thalach", "exang", "oldpeak", "slope", "ca", "thal", "target")
Display the structure of the dataframe
str(Heart.df)
## 'data.frame': 303 obs. of 14 variables:
## $ age : num 63 67 67 37 41 56 62 57 63 53 ...
## $ sex : num 1 1 1 1 0 1 0 0 1 1 ...
## $ cp : num 1 4 4 3 2 2 4 4 4 4 ...
## $ trestbps: num 145 160 120 130 130 120 140 120 130 140 ...
## $ chol : num 233 286 229 250 204 236 268 354 254 203 ...
## $ fbs : num 1 0 0 0 0 0 0 0 0 1 ...
## $ restecg : num 2 2 2 0 2 0 2 0 2 2 ...
## $ thalach : num 150 108 129 187 172 178 160 163 147 155 ...
## $ exang : num 0 1 1 0 0 0 0 1 0 1 ...
## $ oldpeak : num 2.3 1.5 2.6 3.5 1.4 0.8 3.6 0.6 1.4 3.1 ...
## $ slope : num 3 2 2 3 1 1 3 1 2 3 ...
## $ ca : num 0 3 2 0 0 0 2 0 1 0 ...
## $ thal : num 6 3 7 3 3 3 3 3 7 7 ...
## $ target : int 0 2 1 0 0 0 3 0 2 1 ...
Display the statistical summary of the dataframe
summary(Heart.df)
## age sex cp trestbps
## Min. :29.00 Min. :0.0000 Min. :1.000 Min. : 94.0
## 1st Qu.:48.00 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:120.0
## Median :56.00 Median :1.0000 Median :3.000 Median :130.0
## Mean :54.44 Mean :0.6799 Mean :3.158 Mean :131.7
## 3rd Qu.:61.00 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:140.0
## Max. :77.00 Max. :1.0000 Max. :4.000 Max. :200.0
##
## chol fbs restecg thalach
## Min. :126.0 Min. :0.0000 Min. :0.0000 Min. : 71.0
## 1st Qu.:211.0 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:133.5
## Median :241.0 Median :0.0000 Median :1.0000 Median :153.0
## Mean :246.7 Mean :0.1485 Mean :0.9901 Mean :149.6
## 3rd Qu.:275.0 3rd Qu.:0.0000 3rd Qu.:2.0000 3rd Qu.:166.0
## Max. :564.0 Max. :1.0000 Max. :2.0000 Max. :202.0
##
## exang oldpeak slope ca
## Min. :0.0000 Min. :0.00 Min. :1.000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.00 1st Qu.:1.000 1st Qu.:0.0000
## Median :0.0000 Median :0.80 Median :2.000 Median :0.0000
## Mean :0.3267 Mean :1.04 Mean :1.601 Mean :0.6722
## 3rd Qu.:1.0000 3rd Qu.:1.60 3rd Qu.:2.000 3rd Qu.:1.0000
## Max. :1.0000 Max. :6.20 Max. :3.000 Max. :3.0000
## NA's :4
## thal target
## Min. :3.000 Min. :0.0000
## 1st Qu.:3.000 1st Qu.:0.0000
## Median :3.000 Median :0.0000
## Mean :4.734 Mean :0.9373
## 3rd Qu.:7.000 3rd Qu.:2.0000
## Max. :7.000 Max. :4.0000
## NA's :2
According to the data dictionary, the attributes sex, fbs, exang, and target should be binary. The summary above, however, shows that target ranges from 0 to 4 (the raw data encodes disease severity), so we recode every value greater than 0 as 1, applying the same rule to all four variables for safety.
Let’s convert the binary variables to (0, 1)
Heart.df$sex <- ifelse(Heart.df$sex > 0, 1, 0)
Heart.df$fbs <- ifelse(Heart.df$fbs > 0, 1, 0)
Heart.df$exang <- ifelse(Heart.df$exang > 0, 1, 0)
Heart.df$target <- ifelse(Heart.df$target > 0, 1, 0)
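A quick check (output omitted) that the recoded variables now contain only 0 and 1:
# Tabulate the recoded variables; each should show only the values 0 and 1
sapply(Heart.df[c("sex", "fbs", "exang", "target")], table)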
Check to see if there are missing values in the dataframe
sapply(Heart.df, function(x) sum(is.na(x)))
## age sex cp trestbps chol fbs restecg thalach
## 0 0 0 0 0 0 0 0
## exang oldpeak slope ca thal target
## 0 0 0 4 2 0
From the summary and the table above, there are missing values in the ca and thal columns.
Let’s handle the missing values using median imputation
# Impute missing values in 'ca' and 'thal' with the column medians
# (the "?" entries were already converted to NA by na.strings when reading)
Heart.df$ca[is.na(Heart.df$ca)] <- median(Heart.df$ca, na.rm = TRUE)
Heart.df$thal[is.na(Heart.df$thal)] <- median(Heart.df$thal, na.rm = TRUE)
Check for duplicate entries in the dataframe and print them out
dupes <- Heart.df[duplicated(Heart.df) | duplicated(Heart.df, fromLast = TRUE), ]
# Print or inspect the duplicate entries
print(dupes)
## [1] age sex cp trestbps chol fbs restecg thalach
## [9] exang oldpeak slope ca thal target
## <0 rows> (or 0-length row.names)
Convert categorical attributes to factors
# Define a list of categorical columns with their levels and labels
categorical_columns <- list(
sex = list(levels = c(0, 1), labels = c("Female", "Male")),
cp = list(levels = c(1, 2, 3, 4), labels = c("Typical Angina", "Atypical Angina", "Non-Angina", "Asymptomatic")),
fbs = list(levels = c(0, 1), labels = c("False", "True")),
restecg = list(levels = c(0, 1, 2), labels = c("Normal", "Wave-abnormality", "Probable")),
exang = list(levels = c(0, 1), labels = c("No", "Yes")),
slope = list(levels = c(1, 2, 3), labels = c("Upsloping", "Flat", "Downsloping")),
thal = list(levels = c(3, 6, 7), labels = c("Normal", "Fixed Defect", "Reversible")),
target = list(levels = c(1, 0), labels = c("Yes", "No"))
)
# Apply the factor transformation using a loop
for (col in names(categorical_columns)) {
Heart.df[[col]] <- factor(Heart.df[[col]],
levels = categorical_columns[[col]]$levels,
labels = categorical_columns[[col]]$labels)
}
Plot the distribution of heart disease by the categorical variables.
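The bar charts below call a HeartDiseaseBar() helper whose definition does not appear above; the following is a minimal sketch consistent with the HeartDiseaseBoxplot() and HeartDiseaseScatter() helpers defined later (the dodged layout and title format are assumptions):
# Helper: bar chart of a categorical variable, colored by heart-disease status
HeartDiseaseBar <- function(var) {
  ggplot(Heart.df, aes(x = .data[[var]], fill = target)) +
    geom_bar(position = "dodge") + theme_test() +
    labs(title = paste("Heart Disease by", var),
         x = var, fill = "Heart Disease")
}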
# Create the plots
g1 <- ggplot(Heart.df, aes(x=target, fill=target))+
geom_bar() + theme_test() +
ggtitle("Distribution of Heart Disease") +
labs(x = "Heart Disease", fill = "Heart Disease")
g2 <- HeartDiseaseBar("sex")
g3 <- HeartDiseaseBar("cp")
g4 <- HeartDiseaseBar("fbs")
g5 <- HeartDiseaseBar("restecg")
g6 <- HeartDiseaseBar("exang")
g7 <- HeartDiseaseBar("slope")
g8 <- HeartDiseaseBar("thal")
# Combine plot using patchwork
(g1 | g2) /
(g3 | g4) /
(g5 | g6) /
(g7 | g8)
Create histograms of the continuous variables.
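These histograms use a HeartDiseaseHist() helper that is likewise not defined above; a minimal sketch under the same assumptions (the bin count and transparency are guesses):
# Helper: histogram of a continuous variable, colored by heart-disease status
HeartDiseaseHist <- function(var) {
  ggplot(Heart.df, aes(x = .data[[var]], fill = target)) +
    geom_histogram(bins = 30, alpha = 0.7, position = "identity") + theme_test() +
    labs(title = paste("Distribution of", var),
         x = var, fill = "Heart Disease")
}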
# Create the plots
p1 <- HeartDiseaseHist("age")
p2 <- HeartDiseaseHist("trestbps")
p3 <- HeartDiseaseHist("chol")
p4 <- HeartDiseaseHist("thalach")
p5 <- HeartDiseaseHist("oldpeak")
# Combine plot using patchwork
(p1) /
(p2 | p3) /
(p4 | p5)
HeartDiseaseBoxplot <- function(var1, var2) {
ggplot(Heart.df, aes(x = .data[[var1]],
y = .data[[var2]],
fill = .data[[var1]])) +
geom_boxplot() + theme_test() +
labs(title = paste("Boxplot of", var2, "by", var1),
x = var1, y = var2, fill = "Heart Disease")
}
Create boxplots of continuous variables.
# Create the plots
p1 <- HeartDiseaseBoxplot("target", "age")
p2 <- HeartDiseaseBoxplot("target", "trestbps")
p3 <- HeartDiseaseBoxplot("target", "chol")
p4 <- HeartDiseaseBoxplot("target", "thalach")
p5 <- HeartDiseaseBoxplot("target", "oldpeak")
# Combine plot using patchwork
(p1 | p2) /
(p3 | p4) /
(p5)
HeartDiseaseScatter <- function(point1, point2){
ggplot(Heart.df, aes(x = .data[[point1]],
y = .data[[point2]],
color = target)) +
geom_point(size = 2) + theme_test() +
geom_smooth(method = "lm", se = FALSE, color = "blue", formula = y ~ x) +
labs(title = paste("Scatterplot of", point1, "by", point2),
x = point1, y = point2, color = "Heart Disease")
}
Create scatterplots of continuous variables.
# Create the plots
p1 <- HeartDiseaseScatter("age", "oldpeak")
p2 <- HeartDiseaseScatter("age", "chol")
p3 <- HeartDiseaseScatter("age", "trestbps")
p4 <- HeartDiseaseScatter("age", "thalach")
p5 <- HeartDiseaseScatter("chol", "thalach")
p6 <- HeartDiseaseScatter("trestbps", "chol")
p7 <- HeartDiseaseScatter("thalach", "oldpeak")
# Combine plot using patchwork
(p1 | p2) /
(p3 | p4) /
(p5 | p6) /
(p7)
Pairwise relationships between the continuous variables
# Create a colored pair plot for selected variables
ggpairs(Heart.df[, c("age", "trestbps", "chol",
"thalach", "oldpeak", "target")],
aes(color = target, fill = target))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Correlation matrix for continuous variables
# Selecting only continuous variables
continuous_vars <- c("age", "trestbps", "chol", "thalach", "oldpeak")
continuous_data <- Heart.df %>% select(all_of(continuous_vars))
# Calculating correlation matrix
correlation_matrix <- cor(continuous_data)
# Plotting the correlation matrix
corrplot(correlation_matrix, method = "circle",
type = "lower", tl.col = "black")
Partition the data into 60% training and 40% validation.
set.seed(123)
training.rows <- sample(1:nrow(Heart.df), nrow(Heart.df) * 0.6)
training.data <- Heart.df[training.rows, ]
head(training.data)
## age sex cp trestbps chol fbs restecg thalach exang
## 179 43 Male Non-Angina 130 315 False Normal 162 No
## 14 44 Male Atypical Angina 120 263 False Normal 173 No
## 195 68 Female Non-Angina 120 211 False Probable 115 No
## 118 35 Female Asymptomatic 138 183 False Normal 182 No
## 299 45 Male Typical Angina 110 264 False Normal 132 No
## 229 54 Male Asymptomatic 110 206 False Probable 108 Yes
## oldpeak slope ca thal target
## 179 1.9 Upsloping 1 Normal No
## 14 0.0 Upsloping 0 Reversible No
## 195 1.5 Flat 0 Normal No
## 118 1.4 Upsloping 0 Normal No
## 299 1.2 Flat 0 Reversible Yes
## 229 0.0 Flat 1 Normal Yes
Assign the rows that are not in the training set to the validation set
validation.rows <- setdiff(1:nrow(Heart.df), training.rows)
validation.data <- Heart.df[validation.rows, ]
head(validation.data)
## age sex cp trestbps chol fbs restecg thalach exang
## 2 67 Male Asymptomatic 160 286 False Probable 108 Yes
## 3 67 Male Asymptomatic 120 229 False Probable 129 Yes
## 6 56 Male Atypical Angina 120 236 False Normal 178 No
## 8 57 Female Asymptomatic 120 354 False Normal 163 Yes
## 12 56 Female Atypical Angina 140 294 False Probable 153 No
## 15 52 Male Non-Angina 172 199 True Normal 162 No
## oldpeak slope ca thal target
## 2 1.5 Flat 3 Normal Yes
## 3 2.6 Flat 2 Reversible Yes
## 6 0.8 Upsloping 0 Normal No
## 8 0.6 Upsloping 0 Normal No
## 12 1.3 Flat 0 Normal No
## 15 0.5 Upsloping 0 Reversible No
Create the decision tree model using the training data.
dt_model <- rpart(target ~ ., data = training.data, method = "class",
                  control = rpart.control(minsplit = 20,    # min. observations required to attempt a split
                                          minbucket = 7,    # min. observations in any terminal node
                                          maxdepth = 10,    # maximum depth of the tree
                                          usesurrogate = 2, # use surrogate splits for missing values
                                          xval = 10)        # 10-fold cross-validation
)
dt_model
## n= 181
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 181 85 No (0.46961326 0.53038674)
## 2) cp=Asymptomatic 85 23 Yes (0.72941176 0.27058824)
## 4) ca>=0.5 47 4 Yes (0.91489362 0.08510638) *
## 5) ca< 0.5 38 19 Yes (0.50000000 0.50000000)
## 10) thalach< 147 19 4 Yes (0.78947368 0.21052632) *
## 11) thalach>=147 19 4 No (0.21052632 0.78947368) *
## 3) cp=Typical Angina,Atypical Angina,Non-Angina 96 23 No (0.23958333 0.76041667)
## 6) age>=55.5 42 19 No (0.45238095 0.54761905)
## 12) slope=Flat 22 8 Yes (0.63636364 0.36363636)
## 24) chol>=248.5 9 1 Yes (0.88888889 0.11111111) *
## 25) chol< 248.5 13 6 No (0.46153846 0.53846154) *
## 13) slope=Upsloping,Downsloping 20 5 No (0.25000000 0.75000000) *
## 7) age< 55.5 54 4 No (0.07407407 0.92592593) *
Display a statistical summary of the model
summary(dt_model)
## Call:
## rpart(formula = target ~ ., data = training.data, method = "class",
## control = rpart.control(minsplit = 20, minbucket = 7, maxdepth = 10,
## usesurrogate = 2, xval = 10))
## n= 181
##
## CP nsplit rel error xerror xstd
## 1 0.45882353 0 1.0000000 1.0000000 0.07899268
## 2 0.06470588 1 0.5411765 0.5411765 0.06891085
## 3 0.03529412 3 0.4117647 0.4941176 0.06681498
## 4 0.01176471 5 0.3411765 0.5411765 0.06891085
## 5 0.01000000 6 0.3294118 0.5529412 0.06939739
##
## Variable importance
## cp thalach ca thal oldpeak age exang trestbps
## 19 15 13 12 9 9 8 5
## chol slope sex
## 4 4 1
##
## Node number 1: 181 observations, complexity param=0.4588235
## predicted class=No expected loss=0.4696133 P(node) =1
## class counts: 85 96
## probabilities: 0.470 0.530
## left son=2 (85 obs) right son=3 (96 obs)
## Primary splits:
## cp splits as RRRL, improve=21.63364, (0 missing)
## thal splits as RLL, improve=16.94180, (0 missing)
## ca < 0.5 to the right, improve=16.91387, (0 missing)
## oldpeak < 0.7 to the right, improve=14.24148, (0 missing)
## exang splits as RL, improve=12.60729, (0 missing)
## Surrogate splits:
## exang splits as RL, agree=0.718, adj=0.400, (0 split)
## thalach < 148.5 to the left, agree=0.685, adj=0.329, (0 split)
## thal splits as RLL, agree=0.674, adj=0.306, (0 split)
## ca < 1.5 to the right, agree=0.641, adj=0.235, (0 split)
## oldpeak < 0.7 to the right, agree=0.619, adj=0.188, (0 split)
##
## Node number 2: 85 observations, complexity param=0.06470588
## predicted class=Yes expected loss=0.2705882 P(node) =0.4696133
## class counts: 62 23
## probabilities: 0.729 0.271
## left son=4 (47 obs) right son=5 (38 obs)
## Primary splits:
## ca < 0.5 to the right, improve=7.233792, (0 missing)
## oldpeak < 0.55 to the right, improve=5.869984, (0 missing)
## thal splits as RLL, improve=4.439441, (0 missing)
## exang splits as RL, improve=3.938671, (0 missing)
## thalach < 146.5 to the left, improve=3.111732, (0 missing)
## Surrogate splits:
## age < 51.5 to the right, agree=0.706, adj=0.342, (0 split)
## sex splits as RL, agree=0.635, adj=0.184, (0 split)
## thal splits as RLL, agree=0.635, adj=0.184, (0 split)
## chol < 301 to the left, agree=0.612, adj=0.132, (0 split)
## oldpeak < 0.55 to the right, agree=0.600, adj=0.105, (0 split)
##
## Node number 3: 96 observations, complexity param=0.03529412
## predicted class=No expected loss=0.2395833 P(node) =0.5303867
## class counts: 23 73
## probabilities: 0.240 0.760
## left son=6 (42 obs) right son=7 (54 obs)
## Primary splits:
## age < 55.5 to the right, improve=6.762235, (0 missing)
## oldpeak < 1.95 to the right, improve=3.911787, (0 missing)
## sex splits as RL, improve=3.901389, (0 missing)
## slope splits as RLR, improve=3.826670, (0 missing)
## thal splits as RLL, improve=2.823704, (0 missing)
## Surrogate splits:
## thalach < 151.5 to the left, agree=0.708, adj=0.333, (0 split)
## trestbps < 136.5 to the right, agree=0.698, adj=0.310, (0 split)
## ca < 0.5 to the right, agree=0.677, adj=0.262, (0 split)
## thal splits as RLL, agree=0.667, adj=0.238, (0 split)
## oldpeak < 0.7 to the right, agree=0.646, adj=0.190, (0 split)
##
## Node number 4: 47 observations
## predicted class=Yes expected loss=0.08510638 P(node) =0.2596685
## class counts: 43 4
## probabilities: 0.915 0.085
##
## Node number 5: 38 observations, complexity param=0.06470588
## predicted class=Yes expected loss=0.5 P(node) =0.2099448
## class counts: 19 19
## probabilities: 0.500 0.500
## left son=10 (19 obs) right son=11 (19 obs)
## Primary splits:
## thalach < 147 to the left, improve=6.368421, (0 missing)
## exang splits as RL, improve=5.277778, (0 missing)
## oldpeak < 0.65 to the right, improve=3.454545, (0 missing)
## thal splits as RRL, improve=3.454545, (0 missing)
## trestbps < 142 to the right, improve=3.134680, (0 missing)
## Surrogate splits:
## thal splits as RLL, agree=0.737, adj=0.474, (0 split)
## oldpeak < 0.65 to the right, agree=0.711, adj=0.421, (0 split)
## chol < 201.5 to the left, agree=0.658, adj=0.316, (0 split)
## trestbps < 126 to the left, agree=0.632, adj=0.263, (0 split)
## slope splits as RLL, agree=0.632, adj=0.263, (0 split)
##
## Node number 6: 42 observations, complexity param=0.03529412
## predicted class=No expected loss=0.452381 P(node) =0.2320442
## class counts: 19 23
## probabilities: 0.452 0.548
## left son=12 (22 obs) right son=13 (20 obs)
## Primary splits:
## slope splits as RLR, improve=3.127706, (0 missing)
## oldpeak < 1.95 to the right, improve=2.752381, (0 missing)
## sex splits as RL, improve=2.742857, (0 missing)
## age < 66 to the left, improve=2.181958, (0 missing)
## ca < 0.5 to the right, improve=1.664069, (0 missing)
## Surrogate splits:
## oldpeak < 1.1 to the right, agree=0.762, adj=0.50, (0 split)
## trestbps < 131 to the left, agree=0.714, adj=0.40, (0 split)
## thalach < 161 to the left, agree=0.690, adj=0.35, (0 split)
## thal splits as RLL, agree=0.643, adj=0.25, (0 split)
## age < 69.5 to the left, agree=0.619, adj=0.20, (0 split)
##
## Node number 7: 54 observations
## predicted class=No expected loss=0.07407407 P(node) =0.2983425
## class counts: 4 50
## probabilities: 0.074 0.926
##
## Node number 10: 19 observations
## predicted class=Yes expected loss=0.2105263 P(node) =0.1049724
## class counts: 15 4
## probabilities: 0.789 0.211
##
## Node number 11: 19 observations
## predicted class=No expected loss=0.2105263 P(node) =0.1049724
## class counts: 4 15
## probabilities: 0.211 0.789
##
## Node number 12: 22 observations, complexity param=0.01176471
## predicted class=Yes expected loss=0.3636364 P(node) =0.121547
## class counts: 14 8
## probabilities: 0.636 0.364
## left son=24 (9 obs) right son=25 (13 obs)
## Primary splits:
## chol < 248.5 to the right, improve=1.9425020, (0 missing)
## trestbps < 125.5 to the right, improve=1.7175320, (0 missing)
## ca < 0.5 to the right, improve=1.4545450, (0 missing)
## thalach < 144.5 to the right, improve=1.1219890, (0 missing)
## age < 60.5 to the left, improve=0.9818182, (0 missing)
## Surrogate splits:
## age < 61.5 to the right, agree=0.682, adj=0.222, (0 split)
## trestbps < 165 to the right, agree=0.682, adj=0.222, (0 split)
## thalach < 157 to the right, agree=0.636, adj=0.111, (0 split)
## exang splits as RL, agree=0.636, adj=0.111, (0 split)
## thal splits as RRL, agree=0.636, adj=0.111, (0 split)
##
## Node number 13: 20 observations
## predicted class=No expected loss=0.25 P(node) =0.1104972
## class counts: 5 15
## probabilities: 0.250 0.750
##
## Node number 24: 9 observations
## predicted class=Yes expected loss=0.1111111 P(node) =0.04972376
## class counts: 8 1
## probabilities: 0.889 0.111
##
## Node number 25: 13 observations
## predicted class=No expected loss=0.4615385 P(node) =0.0718232
## class counts: 6 7
## probabilities: 0.462 0.538
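The CP table near the top of this summary reports the cross-validated error (xerror) for each candidate subtree. As a possible refinement, not part of the original analysis, the tree could be pruned at the cp value with the lowest xerror (about 0.035 here, which would keep the first three splits):
# Prune the tree at the cp value that minimizes cross-validated error (xerror)
best.cp <- dt_model$cptable[which.min(dt_model$cptable[, "xerror"]), "CP"]
pruned_model <- prune(dt_model, cp = best.cp)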
Plot the decision tree
plot(dt_model)
text(dt_model)
Beautify the decision tree
# Custom node-labelling function: append the observation count (n) to each node label
tot_count <- function(x, labs, digits, varlen) {
  paste(labs, "\n\nn", x$frame$n)
}
# Plot the tree with full factor labels and the custom node labels
prp(dt_model, faclen = 0, cex = 0.8, node.fun = tot_count)
Predict on the validation set
predictions <- predict(dt_model, validation.data, type = "class")
Compute the confusion matrix and associated statistics
confMatrix <- confusionMatrix(predictions, validation.data$target)
print(confMatrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Yes No
## Yes 39 9
## No 15 59
##
## Accuracy : 0.8033
## 95% CI : (0.7216, 0.8697)
## No Information Rate : 0.5574
## P-Value [Acc > NIR] : 1.021e-08
##
## Kappa : 0.5967
##
## Mcnemar's Test P-Value : 0.3074
##
## Sensitivity : 0.7222
## Specificity : 0.8676
## Pos Pred Value : 0.8125
## Neg Pred Value : 0.7973
## Prevalence : 0.4426
## Detection Rate : 0.3197
## Detection Prevalence : 0.3934
## Balanced Accuracy : 0.7949
##
## 'Positive' Class : Yes
##
# Extract accuracy
accuracy <- confMatrix$overall['Accuracy']
cat("Accuracy:", accuracy, "\n")
## Accuracy: 0.8032787
# Extract sensitivity (Recall for the positive class)
sensitivity <- confMatrix$byClass['Sensitivity']
cat("Sensitivity (Recall):", sensitivity, "\n")
## Sensitivity (Recall): 0.7222222
# Extract specificity (Recall for the negative class)
specificity <- confMatrix$byClass['Specificity']
cat("Specificity:", specificity, "\n")
## Specificity: 0.8676471
# Extract Precision (Positive Predictive Value)
precision <- confMatrix$byClass['Precision']
cat("Precision:", precision, "\n")
## Precision: 0.8125
# Extract F1 Score
f1_score <- confMatrix$byClass['F1']
cat("F1 Score:", f1_score, "\n")
## F1 Score: 0.7647059
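As a sanity check, the F1 score is the harmonic mean of precision and sensitivity, which reproduces the value above:
# F1 = 2 * precision * recall / (precision + recall), which equals 0.7647059 here
cat("F1 (recomputed):", 2 * precision * sensitivity / (precision + sensitivity), "\n")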