Heart disease continues to be the leading cause of death for individuals in the middle adult age range despite decades of research addressing treatment for this condition. The Centers for Disease Control and Prevention (CDC, 2024) describes heart disease as a term encompassing several conditions related to heart function, including coronary artery disease, which affects blood flow to the heart and may increase the risk of a heart attack. According to the CDC (2024), 22% of deaths in the United States are due to heart disease. Due to the high need for advanced healthcare practices to treat this condition, data science has emerged as an option, using machine learning models to improve understanding of the disease and support advancements in treatment.
Completion of an exploratory data analysis (EDA) includes several components to gain an understanding of the data used. This understanding is essential in identifying appropriate features for classification models to ensure optimal performance. Data visualization is required for understanding trends in the data, including the presence of outliers, unbalanced groups, or other aspects of the data that may require management prior to model development. The continued prominence of heart disease as a major health risk for many adults (CDC, 2024) has led to increased attention to data analysis in the healthcare sector. An increased understanding of the relevant data is essential to ensure medical practices continue to evolve to effectively treat or prevent heart disease. The purpose of this EDA is to investigate methods for data visualization using the Cleveland Heart dataset. Part 3 of the EDA provides an inspection of four features of the dataset (cholesterol, resting blood pressure, fasting blood sugar, and exercise-induced angina). The understanding gained from the data visualization techniques will guide feature selection in subsequent steps of the EDA process for classifying the presence or absence of heart disease.
The Amelia II and Rmisc packages were installed, which required specifying the CRAN mirror.
install.packages("Amelia", repos = "https://cloud.r-project.org")
## Installing package into 'C:/Users/benke/AppData/Local/R/win-library/4.5'
## (as 'lib' is unspecified)
## package 'Amelia' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\benke\AppData\Local\Temp\RtmpCYHTtf\downloaded_packages
install.packages("Rmisc", repos = "https://cloud.r-project.org")
## Installing package into 'C:/Users/benke/AppData/Local/R/win-library/4.5'
## (as 'lib' is unspecified)
## package 'Rmisc' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\benke\AppData\Local\Temp\RtmpCYHTtf\downloaded_packages
All previous packages from Part 1 and Part 2 were included.
library(ggplot2) # Load ggplot2 library
library(scales) # Load scales library
library(moments)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(readr)
## Warning: package 'readr' was built under R version 4.5.1
##
## Attaching package: 'readr'
## The following object is masked from 'package:scales':
##
## col_factor
library(kableExtra)
## Warning: package 'kableExtra' was built under R version 4.5.1
##
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
##
## group_rows
library(tidyr)
The Cleveland Heart dataset was uploaded for analysis.
getwd()
## [1] "C:/Users/benke/OneDrive/NU/DDS 8501"
setwd("C:/Users/benke/OneDrive/NU/DDS 8501")
myheartdata <- read_csv("heart_cleveland_upload.csv")
## Rows: 297 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (14): age, sex, cp, trestbps, chol, fbs, restecg, thalach, exang, oldpea...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Information regarding the variables was included in this EDA for reference and clarity.
The Characteristics of The Dataset
This dataset consists of the data from the heart_cleveland_upload dataset, which has 14 variables and 297 observations.
The variables consist of 13 attributes and a target variable indicating the presence or absence of heart disease. Descriptions of the variables are as follows:
age: patient’s age (Quantitative, Numeric, Continuous, Ratio)
sex: patient’s gender (Qualitative, Categorical, Nominal)
cp: chest pain type, with 4 categories (typical angina = 0, atypical angina = 1, non-anginal pain = 2, asymptomatic = 3; Qualitative, Categorical, Nominal)
trestbps: patient’s blood pressure at rest (mm Hg; Quantitative, Numeric, Continuous, Ratio)
chol: serum cholesterol (mg/dl, Quantitative, Numeric, Continuous, Ratio)
fbs: fasting blood sugar > 120 mg/dl (Qualitative, Categorical, Nominal)
restecg: electrocardiogram results at rest, categorized in 3 values (normal = 0; ST-T wave abnormality, meaning T wave inversions and/or ST elevation or depression of > 0.05 mV = 1; probable or definite left ventricular hypertrophy = 2; Qualitative, Categorical, Nominal)
thalach: patient’s maximum heart rate (Quantitative, Numeric, Continuous, Interval)
exang: presence/absence of exercise induced angina (Qualitative, Categorical, Nominal)
oldpeak: exercise induced ST-depression compared to rest state (Quantitative, Numeric, Continuous, Interval)
slope: shape of the slope of the ST segment during peak exercise (up, flat, or down; Qualitative, Categorical, Nominal)
ca: patient’s number of major blood vessels (Qualitative, Categorical, Nominal)
thal: patient’s thalassemia result indicating type of defect (normal = 0, fixed defect = 1, reversible defect = 2, matching the factor levels 0 to 2 in the data; Qualitative, Categorical, Nominal)
condition: target: presence or absence of heart disease (Binary, Numerical, Discrete)
Preprocessing code was included in this notebook for reference. In Part 1 of this EDA, categorical variables with numeric values were reclassified to factors.
factor_columns <- c("sex", "cp", "fbs", "restecg",
"exang","slope", "ca", "thal")
myheartdata[factor_columns] <- lapply(myheartdata[factor_columns], function(col) as.factor(as.character(col)))
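As a hedged illustration of the reclassification step, the numeric codes can also be given readable labels while converting to a factor. This is a minimal sketch, not the code used in Part 1: the vector cp_codes and the name cp_labelled are hypothetical stand-ins for myheartdata$cp, and the labels follow the cp coding listed in the variable descriptions above.

```r
# Hypothetical demo codes standing in for myheartdata$cp (coded 0-3)
cp_codes <- c(0, 3, 2, 1, 3, 3)

# Convert to a factor and attach a descriptive label to each code
cp_labelled <- factor(
  cp_codes,
  levels = 0:3,
  labels = c("typical angina", "atypical angina",
             "non-anginal pain", "asymptomatic")
)

# Counts per labelled category
table(cp_labelled)
```

Labelled levels carry through to plot axes and tables, which avoids having to look up the coding when reading a figure.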
Another important preprocessing step is identifying missing values and the percentage of columns containing “NA” values. This dataset contains no “NA” values, so no missing-value preprocessing is required.
cols_with_nas <- sum(colSums(is.na(myheartdata)) > 0)
Percent_col_NA <- percent(cols_with_nas / length(myheartdata))
cols_with_nas
## [1] 0
Percent_col_NA
## [1] "0%"
The summary obtained from Part 1 was maintained. This includes counts for categorical data. Only numeric variables have mean and median scores.
summary(myheartdata)
## age sex cp trestbps chol fbs
## Min. :29.00 0: 96 0: 23 Min. : 94.0 Min. :126.0 0:254
## 1st Qu.:48.00 1:201 1: 49 1st Qu.:120.0 1st Qu.:211.0 1: 43
## Median :56.00 2: 83 Median :130.0 Median :243.0
## Mean :54.54 3:142 Mean :131.7 Mean :247.4
## 3rd Qu.:61.00 3rd Qu.:140.0 3rd Qu.:276.0
## Max. :77.00 Max. :200.0 Max. :564.0
## restecg thalach exang oldpeak slope ca thal
## 0:147 Min. : 71.0 0:200 Min. :0.000 0:139 0:174 0:164
## 1: 4 1st Qu.:133.0 1: 97 1st Qu.:0.000 1:137 1: 65 1: 18
## 2:146 Median :153.0 Median :0.800 2: 21 2: 38 2:115
## Mean :149.6 Mean :1.056 3: 20
## 3rd Qu.:166.0 3rd Qu.:1.600
## Max. :202.0 Max. :6.200
## condition
## Min. :0.0000
## 1st Qu.:0.0000
## Median :0.0000
## Mean :0.4613
## 3rd Qu.:1.0000
## Max. :1.0000
Additionally, use of the str() function reveals the categories have been appropriately reclassified as factors.
str(myheartdata)
## spc_tbl_ [297 × 14] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ age : num [1:297] 69 69 66 65 64 64 63 61 60 59 ...
## $ sex : Factor w/ 2 levels "0","1": 2 1 1 2 2 2 2 2 1 2 ...
## $ cp : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
## $ trestbps : num [1:297] 160 140 150 138 110 170 145 134 150 178 ...
## $ chol : num [1:297] 234 239 226 282 211 227 233 234 240 270 ...
## $ fbs : Factor w/ 2 levels "0","1": 2 1 1 2 1 1 2 1 1 1 ...
## $ restecg : Factor w/ 3 levels "0","1","2": 3 1 1 3 3 3 3 1 1 3 ...
## $ thalach : num [1:297] 131 151 114 174 144 155 150 145 171 145 ...
## $ exang : Factor w/ 2 levels "0","1": 1 1 1 1 2 1 1 1 1 1 ...
## $ oldpeak : num [1:297] 0.1 1.8 2.6 1.4 1.8 0.6 2.3 2.6 0.9 4.2 ...
## $ slope : Factor w/ 3 levels "0","1","2": 2 1 3 2 2 2 3 2 1 3 ...
## $ ca : Factor w/ 4 levels "0","1","2","3": 2 3 1 2 1 1 1 3 1 1 ...
## $ thal : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 3 2 1 1 3 ...
## $ condition: num [1:297] 0 0 0 1 0 0 0 1 0 0 ...
## - attr(*, "spec")=
## .. cols(
## .. age = col_double(),
## .. sex = col_double(),
## .. cp = col_double(),
## .. trestbps = col_double(),
## .. chol = col_double(),
## .. fbs = col_double(),
## .. restecg = col_double(),
## .. thalach = col_double(),
## .. exang = col_double(),
## .. oldpeak = col_double(),
## .. slope = col_double(),
## .. ca = col_double(),
## .. thal = col_double(),
## .. condition = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
As in Part 2 of this EDA, missingness was investigated using the missingness map (missmap) from the Amelia II package in R. Inspection of the missingness map reveals no missing values for this dataset.
library(Amelia) # provides missmap()
# missmap() draws the map as a side effect and returns NULL, so the result is not stored.
# A plain data frame is passed because a tibble warns when an unknown column is accessed with $.
missmap(as.data.frame(myheartdata))
Additionally, columns were inspected using the colSums(is.na()) function in R. Consistent with the previous analyses, the function returned a count of 0 for every column, indicating the absence of missing values in this dataset.
colSums(is.na(myheartdata))
## age sex cp trestbps chol fbs restecg thalach
## 0 0 0 0 0 0 0 0
## exang oldpeak slope ca thal condition
## 0 0 0 0 0 0
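Counting NA values does not detect placeholder codes such as 0 entered where a measurement was unavailable. A minimal sketch of one way to check for this alongside the NA counts follows. The data frame demo is a hypothetical stand-in built from the first six rows shown in the str() output above, and whether a zero is implausible has to be judged per variable (a zero oldpeak, for example, is a legitimate value).

```r
# Hypothetical stand-in: first six rows of three numeric columns from str()
demo <- data.frame(
  trestbps = c(160, 140, 150, 138, 110, 170),
  chol     = c(234, 239, 226, 282, 211, 227),
  oldpeak  = c(0.1, 1.8, 2.6, 1.4, 1.8, 0.6)
)

# Explicit missing values per column
na_counts <- colSums(is.na(demo))

# Zero values per column, which may or may not be placeholders
zero_counts <- colSums(demo == 0)

rbind(na_counts, zero_counts)
```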
Data Visualization methods from Part 3 were maintained.
The target feature of this project is the presence or absence of heart disease. The dataset provides 14 variables for analysis: the target, 5 numeric features, and 8 categorical features. The initial inspection of the data requires a comparison of the target classification groups to ensure balance between the groups.
# geom_bar() has no bin argument, and a constant fill overrides a mapped fill, so both are dropped
ggplot(myheartdata, aes(x = condition)) +
geom_bar(fill = "turquoise") +
labs(x = "Group Comparison of Presence or Absence of Heart Disease", y = "Count")
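The balance shown in the bar chart can also be expressed as counts and proportions. The sketch below is illustrative only: the vector condition is a hypothetical reconstruction, with the split of 160 and 137 inferred from the summary output above (a mean of 0.4613 over 297 rows), not read from the data file.

```r
# Hypothetical reconstruction of the target: 160 without and 137 with heart disease
# (inferred from mean(condition) = 0.4613 over 297 rows in the summary output)
condition <- rep(c(0, 1), times = c(160, 137))

# Group counts and proportions
counts <- table(condition)
props  <- round(prop.table(counts), 3)

counts
props
```

Under this assumption the groups are close to balanced, at roughly 54% without and 46% with heart disease.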
Rmisc was loaded, and qqplotr was installed and loaded, to create the 4-plot data visualization.
library(Rmisc)
## Warning: package 'Rmisc' was built under R version 4.5.1
## Loading required package: lattice
## Loading required package: plyr
## Warning: package 'plyr' was built under R version 4.5.1
## ------------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## ------------------------------------------------------------------------------
##
## Attaching package: 'plyr'
## The following objects are masked from 'package:dplyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
install.packages("qqplotr", repos = "https://cloud.r-project.org")
## Installing package into 'C:/Users/benke/AppData/Local/R/win-library/4.5'
## (as 'lib' is unspecified)
## package 'qqplotr' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\benke\AppData\Local\Temp\RtmpCYHTtf\downloaded_packages
library(qqplotr)
## Warning: package 'qqplotr' was built under R version 4.5.1
##
## Attaching package: 'qqplotr'
## The following objects are masked from 'package:ggplot2':
##
## stat_qq_line, StatQqLine
The determination of features for comparison is another important part of the EDA. Normawati and Winarti (2018) compared feature selection methods using the Cleveland Heart dataset: motivated feature selection (MTF), which relies on medical expert judgment; the Variable Precision Rough Set (VPRS) method; and a combination of both. Results indicated that model accuracy improved with VPRS and with the combination of MTF and VPRS. The authors comment that using MTF alone may introduce unanticipated bias, so a computer-based feature selection method is likely to improve model performance. The features selected for this EDA used medical expert-based selection to identify features historically associated with heart disease, based on risk factors reported by the Centers for Disease Control and Prevention (CDC, 2024). The CDC (2024) reports high blood pressure and high cholesterol as key risk factors for heart disease. Computer-based feature selection methods may reduce the risk of bias and will be considered in future analyses.
For this EDA, a 4-plot was developed for the cholesterol and resting blood pressure variables, as these were reported by the CDC as high-risk indicators for heart disease (CDC, 2024). Figure 2 contains a run sequence plot, a lag plot, a histogram, and a normal probability plot. Interpretation requires care because the first line of the plotting code replaces the cholesterol values in chol with the row index (1 to 297). The run and lag plots therefore show resting blood pressure against observation order, grouped by condition (presence or absence of heart disease), not against serum cholesterol. Both plots show a higher concentration of heart disease cases at index values above 150. This reflects where those observations sit in the file rather than a cholesterol threshold; the str() output above shows the first rows all sharing chest pain type 0, which suggests the file is ordered by chest pain type. The lag plot also shows two resting blood pressure values above 180, at approximately 190 and 200, both in the heart disease group. Because these values were considered to demonstrate expected variation in electronic health records rather than error, they were retained in the analysis although they are likely outliers. A resting blood pressure above 180 may be associated with heart disease, although two observations are too few to support a conclusion. Any association between cholesterol and heart disease must be assessed from the original cholesterol values.
The histogram appears uniform, which is the expected shape for a row index in which each value from 1 to 297 occurs exactly once; it does not describe the distribution of cholesterol or indicate bias in the dataset. The summary from Part 1 reports cholesterol with a median of 243 mg/dl, a mean of 247.4 mg/dl, and a maximum of 564 mg/dl, which points to a right-skewed distribution instead. Finally, the normal probability plot shows a similar shape for both target groups, with deviations from normality at the left and right ends of the plot. This result is expected for uniformly distributed values such as an index.
# Note: this line replaces the cholesterol values in chol with the row index 1..297,
# so every later use of chol (plots, head(), pairs()) shows observation order, not mg/dl.
myheartdata$chol <- seq_len(nrow(myheartdata))
# p1: resting blood pressure against the row index, coloured by condition
p1 <- ggplot(myheartdata, aes(x = chol, y = trestbps, group = condition)) + geom_line(aes(col = as.factor(condition))) + ggtitle('Run: Cholesterol')
# p3: resting blood pressure against the index shifted by one row (first row becomes NA)
p3 <- ggplot(myheartdata, aes(x = dplyr::lag(chol), y = trestbps, group = condition)) + geom_point(aes(col = as.factor(condition))) + ggtitle('Lag:Cholesterol')
# p2: histogram of the index; the argument is bins, not bin
p2 <- ggplot(myheartdata, aes(x = chol)) + geom_histogram(fill = 'turquoise', col = 'black', bins = 20) + ggtitle('Histogram:Cholesterol')
# p4: normal probability plot of the standardized index, by condition
p4 <- ggplot(myheartdata, aes(sample = scale(chol), group = condition)) + stat_qq_point(aes(col = as.factor(condition))) +
ggtitle('Normal Probability Plot: Cholesterol')
multiplot(p1, p2, p3, p4, cols = 2)
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).
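The conventional 4-plot examines a single variable: a run sequence plot of the values against observation order, a lag plot of each value against the previous one, a histogram, and a normal probability plot. The following is a minimal sketch using base R graphics, assuming the observation order is stored in a separate index column so that chol keeps its measured values. The helper four_plot and the data frame demo are hypothetical, and demo holds only the first ten chol values shown in the str() output above, so the resulting panels are illustrative rather than representative of the full dataset.

```r
# Hypothetical demo data: the first ten chol values shown in the str() output
demo <- data.frame(chol = c(234, 239, 226, 282, 211, 227, 233, 234, 240, 270))

# Keep the observation order in its own column so chol is never overwritten
demo$index <- seq_len(nrow(demo))

# Hypothetical helper: draws the four panels for one numeric vector
four_plot <- function(y, index, label) {
  old_par <- par(mfrow = c(2, 2))  # 2 x 2 panel layout
  on.exit(par(old_par))            # restore the previous layout on exit
  n <- length(y)
  # Run sequence plot: values against observation order
  plot(index, y, type = "l", xlab = "Observation index", ylab = label,
       main = paste("Run:", label))
  # Lag plot: each value against the value one row earlier
  plot(y[-n], y[-1], xlab = paste(label, "at i - 1"),
       ylab = paste(label, "at i"), main = paste("Lag:", label))
  # Histogram of the values themselves
  hist(y, xlab = label, main = paste("Histogram:", label))
  # Normal probability plot with a reference line
  qqnorm(y, main = paste("Normal Probability Plot:", label))
  qqline(y)
  invisible(NULL)
}

four_plot(demo$chol, demo$index, "Cholesterol")
```

The same helper could be called with trestbps to produce a 4-plot for resting blood pressure.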
Outlier management is another important aspect of data preprocessing. Riyaz et al. (2022) addressed outlier management using the Framingham Heart Study dataset when comparing classification model performance based on outlier detection. Riyaz et al. (2022) found outliers in several variables (total cholesterol, systolic blood pressure, diastolic blood pressure, body mass index, heart rate, and glucose level) using visualization methods including, but not limited to, box plots. The authors found that removal of outliers improved the performance of all classification models used except multinomial naive Bayes (Riyaz et al., 2022).
In this EDA, box plots for the variables of interest (cholesterol and resting blood pressure) were used to further identify distribution characteristics. The box plot labelled cholesterol shows a median of approximately 150, a 1st quartile of approximately 75, and a 3rd quartile of approximately 225, with whiskers extending from near 0 to about 300 and no outliers. These values describe the row index stored in chol by the 4-plot code, not serum cholesterol, so the low end of the whisker is not evidence of hidden missing values coded as 0. The Part 1 summary gives the cholesterol distribution directly: minimum 126, 1st quartile 211, median 243, 3rd quartile 276, and maximum 564 mg/dl. The box plot for resting blood pressure indicates a 1st quartile of 120, a median of approximately 130, and a 3rd quartile of 140, consistent with the summary. Whisker lines indicate a lowest value slightly above 90 with no low outliers and an upper whisker ending at about 170 to 175 with 6 outliers above it. The outliers were considered representative of expected variation and included in the analysis.
ggplot(myheartdata, aes(x = condition, y = chol)) +
geom_boxplot(fill = "salmon", color = "darkblue", width = .25, outlier.color = "turquoise", outlier.size = 2) +
labs(
title = "Box Plot for Cholesterol",
x = "Condition",
y = "Cholesterol"
) +
theme_minimal() +
theme(
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
axis.title.y = element_text(size = 14),
axis.text.y = element_text(size = 12)
)
## Warning: Continuous x aesthetic
## ℹ did you forget `aes(group = ...)`?
ggplot(myheartdata, aes(x = condition, y = trestbps)) +
geom_boxplot(fill = "salmon", color = "darkblue", width = .25, outlier.color = "turquoise", outlier.size = 2) +
labs(
title = "Box Plot for Resting Blood Pressure",
x = "Condition",
y = "Resting Blood Pressure"
) +
theme_minimal() +
theme(
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
axis.title.y = element_text(size = 14),
axis.text.y = element_text(size = 12)
)
## Warning: Continuous x aesthetic
## ℹ did you forget `aes(group = ...)`?
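The whiskers in geom_boxplot() extend to the most extreme observation within 1.5 times the interquartile range of the box, and points beyond that are drawn as outliers. As a hedged sketch, the corresponding limits can be computed from the quartiles in the Part 1 summary output; the helper tukey_fences is a hypothetical name introduced for illustration.

```r
# Hypothetical helper: limits beyond which a box plot flags points as outliers
tukey_fences <- function(q1, q3, k = 1.5) {
  iqr <- q3 - q1
  c(lower = q1 - k * iqr, upper = q3 + k * iqr)
}

# Quartiles taken from the summary(myheartdata) output
chol_fences     <- tukey_fences(q1 = 211, q3 = 276)  # lower 113.5, upper 373.5
trestbps_fences <- tukey_fences(q1 = 120, q3 = 140)  # lower 90, upper 170

chol_fences
trestbps_fences
```

Compared with the reported ranges, the cholesterol maximum of 564 and the resting blood pressure maximum of 200 both lie above their upper limits, while the minimums of 126 and 94 lie inside the lower limits.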
head(myheartdata)
## # A tibble: 6 × 14
## age sex cp trestbps chol fbs restecg thalach exang oldpeak slope
## <dbl> <fct> <fct> <dbl> <int> <fct> <fct> <dbl> <fct> <dbl> <fct>
## 1 69 1 0 160 1 1 2 131 0 0.1 1
## 2 69 0 0 140 2 0 0 151 0 1.8 0
## 3 66 0 0 150 3 0 0 114 0 2.6 2
## 4 65 1 0 138 4 1 2 174 0 1.4 1
## 5 64 1 0 110 5 0 2 144 1 1.8 1
## 6 64 1 0 170 6 0 2 155 0 0.6 1
## # ℹ 3 more variables: ca <fct>, thal <fct>, condition <dbl>
View(myheartdata["trestbps"])
Univariate data visualization is also an important part of the EDA, used to review the distribution of counts in features of interest. Inspection of group characteristics supports identification of imbalance that could affect the performance of classification models. Kandula (2025) used a stacked bar chart to show that asymptomatic chest pain had the highest count of observations with heart disease compared to the typical angina, atypical angina, and non-anginal pain groups. Kandula (2025) points out that heart disease can therefore present without symptomatic pain. In this EDA, group counts were visualized in bar charts for the categorical features of fasting blood sugar and exercise-induced angina. Exercise-induced angina was chosen due to the inclusion and discussion of this feature in the literature (Kandula, 2025). Fasting blood sugar was chosen because this feature was included in a list of important features by motivated feature selection (Normawati and Winarti, 2018). Visualization of the groups identifies imbalance in both features. About 14% of observations (43 of 297) fall in the category of fasting blood sugar above 120 mg/dl, showing unequal distribution and a high likelihood of bias if this variable is used in future analysis (Figure 5). To a lesser extent, imbalance is also identified in the counts of exercise-induced angina (Figure 6): the group with exercise-induced angina (97 observations) is approximately half the size of the group without it (200 observations). Use of these features could introduce bias in future steps of the EDA due to the unbalanced group counts.
library(dplyr)
ggplot(myheartdata, aes(x=fbs, fill=fbs)) +
geom_bar(aes(fill=fbs), position = "dodge") +
labs(x="Fasting Blood Sugar >120 mg/dl", y="Count")
ggplot(myheartdata, aes(x=exang, fill=exang)) +
geom_bar(aes(fill=exang), position = "dodge") +
labs(x="Exercise Induced Angina", y="Count")
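The degree of imbalance visible in the bar charts can be quantified from the group counts. This is a minimal sketch using the counts printed in the summary(myheartdata) output; the object names are hypothetical.

```r
# Group counts from the summary(myheartdata) output
n            <- 297
fbs_counts   <- c(`0` = 254, `1` = 43)
exang_counts <- c(`0` = 200, `1` = 97)

# Share of observations in each group
fbs_share   <- round(fbs_counts / n, 3)    # 0.855 and 0.145
exang_share <- round(exang_counts / n, 3)  # 0.673 and 0.327

# Size of the smaller group relative to the larger group
fbs_ratio   <- round(unname(fbs_counts["1"] / fbs_counts["0"]), 3)      # 0.169
exang_ratio <- round(unname(exang_counts["1"] / exang_counts["0"]), 3)  # 0.485

fbs_share
exang_share
```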
Data visualization techniques for univariate analysis were used to investigate the characteristics of cholesterol, resting blood pressure, fasting blood sugar, and exercise-induced angina in the Cleveland Heart dataset. Variables were chosen for investigation in this report due to the attention they received in previous literature using open-source heart disease datasets to develop heart disease classification models. Both categorical variables were found to have unbalanced groups, which could affect accuracy in model development due to the lack of sufficient data values for the smaller groups. Numerical variables were chosen based on the CDC list of important risk factors for the development of heart disease (CDC, 2024). The 4-plot did not provide a usable view of cholesterol because the chol column was replaced with the row index before plotting, so the distribution of cholesterol and any association between high cholesterol and heart disease still require investigation with the original values. The plots did identify two resting blood pressure values above 180, both in the heart disease group.
Data visualization provides insights into the distribution of the values of investigated features prior to model development. Data management requirements, such as outlier management, are identified with univariate data visualization to ensure features selected for future analysis do not introduce bias and negatively affect future model performance.
Multivariate data visualization is another essential component of the exploratory data analysis (EDA). While univariate data visualization provides information regarding the characteristics of a feature, multivariate data visualizations provide meaningful information regarding relationships and associations between features (Alghamdi et al., 2024). Combinations of variables in data visualization are required to identify which variable combinations have correlations that either require further investigation or provide essential information for future model training.
Alghamdi et al. (2024) used a variety of multivariate data visualization methods to gain perspective and understanding of the Cleveland heart disease dataset. A K-means clustering plot was compared with an agglomerative clustering plot for the variables of age and cholesterol, with presence or absence of heart disease indicated by color. By comparing two types of clustering plots, trends in the clustering of data points for age and cholesterol are identified for future analysis. Additionally, the researchers used a stacked bar chart comparing type of chest pain across the target groups of presence or absence of heart disease. Alghamdi et al. (2024) found that patients with heart disease more often had an asymptomatic presentation than typical, atypical, or non-anginal pain. Finally, Alghamdi et al. (2024) used pair plots to compare age, cholesterol, and the patient’s maximum heart rate (thalach), with color indicating presence or absence of heart disease. Results indicated clusters of data points for each variable. Inspection of color differences within the plot clusters supports identification of valuable features with trends and relationships to be included in prediction models (Alghamdi et al., 2024).
The GGally library was installed to support multivariate data visualization techniques.
install.packages("GGally", repos = "https://cloud.r-project.org")
## Installing package into 'C:/Users/benke/AppData/Local/R/win-library/4.5'
## (as 'lib' is unspecified)
## package 'GGally' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\benke\AppData\Local\Temp\RtmpCYHTtf\downloaded_packages
library(GGally)
## Warning: package 'GGally' was built under R version 4.5.1
A pairs plot developed with base R was used to identify relationships between numerical variables in the Cleveland Heart dataset. The pairs plot provides an opportunity to inspect associations between numerical variables that may indicate features important for classification during model development. Positive, neutral, and negative associations are visible by inspecting the shape of the data points. Additionally, clustered data points provide information regarding the distribution of the data for each pairing of numerical features.
nummyheartdata <- myheartdata[c(1,4:5,8,10,14)]
head(nummyheartdata)
## # A tibble: 6 × 6
## age trestbps chol thalach oldpeak condition
## <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 69 160 1 131 0.1 0
## 2 69 140 2 151 1.8 0
## 3 66 150 3 114 2.6 0
## 4 65 138 4 174 1.4 1
## 5 64 110 5 144 1.8 0
## 6 64 170 6 155 0.6 0
Several insights are apparent in the pairs plot of numerical features, with one caveat: the chol column in nummyheartdata holds the row index assigned in the 4-plot code (the head() output above shows chol running 1, 2, 3, ...), so every panel involving chol shows observation order rather than cholesterol. The age and chol panel shows distinct bands, each with a strong negative slope. These bands are consistent with a file ordered by chest pain type and, within each type, by descending age, as the first rows of the str() output suggest (cp = 0 with ages 69, 69, 66, 65, ...). They do not indicate identical cholesterol values across patients, hidden missing values recorded as 0, or cholesterol reported in discrete categories.
The age and resting blood pressure (trestbps) variables appear to have a weak, positive correlation, with all data points with a resting blood pressure above 175 belonging to patients over 50 years of age. The age values paired with the patient’s maximum heart rate (thalach) have a weak negative correlation, with patients from 30-40 years of age able to achieve a maximum heart rate of 140 or higher while patients over 60 years of age have a maximum heart rate of 140 or lower. The scatter plot for this pair also shows an outlier of a 30-year-old patient able to achieve a maximum heart rate of 200 and an outlier of a 70-year-old patient unable to achieve a heart rate higher than approximately 75 beats per minute.
The pairing of resting blood pressure (trestbps) and the patient’s maximum heart rate (thalach) shows an area of higher plot density, without an apparent correlation, at a maximum heart rate between 140 and 180 and a resting blood pressure of 120 to 160. In the panel pairing chol with thalach, maximum heart rate falls mostly between 120 and 160 for chol values of 0 to 100 and between 90 and 150 for chol values of 150 to 250, with a wider spread of 100 to 200 above 250. Because chol holds the row index, these shifts describe how maximum heart rate varies with position in the file, not with cholesterol level. Across all pairings except age, chol is spread uniformly along its axis, as expected for an index in which each value occurs once. The oldpeak feature (exercise induced ST depression) presented with a high density of data points in the 0-2 range, with scattered values of lower density above an oldpeak of 2 in all pair plots.
pairs(nummyheartdata)
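Visual impressions of weak positive or negative association in a scatterplot matrix can be checked against a correlation matrix. The sketch below is an illustration under stated assumptions: the data frame demo is hypothetical and contains only the first ten rows shown in the str() output above, with cholesterol as originally recorded, so its coefficients are not estimates for the full dataset. Spearman rank correlation is used here as one reasonable choice because several of the features contain outliers.

```r
# Hypothetical demo data: first ten rows of the numeric features from str()
demo <- data.frame(
  age      = c(69, 69, 66, 65, 64, 64, 63, 61, 60, 59),
  trestbps = c(160, 140, 150, 138, 110, 170, 145, 134, 150, 178),
  chol     = c(234, 239, 226, 282, 211, 227, 233, 234, 240, 270),
  thalach  = c(131, 151, 114, 174, 144, 155, 150, 145, 171, 145),
  oldpeak  = c(0.1, 1.8, 2.6, 1.4, 1.8, 0.6, 2.3, 2.6, 0.9, 4.2)
)

# Rank-based correlation for every pair of numeric features
cor_mat <- cor(demo, method = "spearman")

round(cor_mat, 2)
```

Applied to the full data frame, the same call would give one coefficient for each panel of the pairs plot.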
Although the correlations identified in the scatterplot matrix were weak, the age variable appeared to have identifiable weak correlation trends. To further investigate the relationship between the age variable and categorical variables, box plots were developed to compare age values across two categorical variables.
A box plot comparing types of chest pain with age values provides insight regarding the variability of type of chest pain for patients. The first box plot shows patients with typical anginal chest pain with a 1st quartile of 51 years of age, a median of 59 years of age, and a 3rd quartile of 64 years of age. Whiskers extend to a maximum value of 69 without outliers for the upper values and extend below to approximately 34 years of age. A longer whisker is apparent from the lowest age value to the 1st quartile, indicating higher variability for ages 34-51 than for ages 51-69. Inspection of the second box plot indicates the highest variability across age groups for atypical anginal chest pain, with whiskers from 29 to 44 years of age for the lower values and from 56 to 74 years of age for the higher values. The interquartile range is from 44 to 56 with a median of approximately 52. The box plot for non-anginal chest pain, on the other hand, has a shorter whisker for the lower age values of 35 to 46 and a longer whisker for values of approximately 60 to 76 years of age, indicating higher variability in the older age ranges. The median is approximately 54, with the 1st quartile at 46 years of age and the 3rd quartile at 60. Finally, the last box plot, for asymptomatic chest pain, has similar whisker ranges for lower and higher age ranges, with the 1st quartile at approximately 51 years of age, the median at 56 years of age, and the 3rd quartile at approximately 62 years of age. The variability also appears slightly smaller than in the previous box plots.
ggplot(myheartdata, aes(x = cp, y = age )) +
geom_boxplot(fill = "salmon", color = "darkblue", width = .25, outlier.color = "turquoise", outlier.size = 2) +
labs(
title = "Box Plot for Chest Pain by Age",
x = "Chest Pain: Typical, Atypical, Non-anginal, Asymptomatic",
y = "Age"
) +
theme_minimal() +
theme(
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
axis.title.y = element_text(size = 14),
axis.text.y = element_text(size = 12)
)
A box plot was also used to compare the absence or presence of exercise induced angina with age values. Absence of exercise induced angina shows higher variability than presence of exercise induced angina, with longer whisker lines for both younger and older ages. Absence of exercise induced angina shows the 1st quartile at approximately 46 years of age with whisker lines extending to 29 years of age. The median for absence of exercise induced angina is 54 years of age. The 3rd quartile is at approximately 61 years of age with whiskers extending to 76 years of age. For patients with exercise induced angina, the lower whisker extends to approximately 38 years of age, with an outlier at 35 years of age. The 1st quartile is at 51 years of age, the median is 56 years of age, and the 3rd quartile is 61 years of age. The upper whisker extends to 74, with one outlier at 76 years of age. The variability is higher for ages younger than the median than for ages older than the median age of 56.
ggplot(myheartdata, aes(x = exang, y = age)) +
geom_boxplot(fill = "salmon", color = "darkblue", width = .25, outlier.color = "turquoise", outlier.size = 2) +
labs(
title = "Absence or Presence of Exercise Induced Angina by Age",
x = "Absence & Presence of Exercise Induced Angina",
y = "Age"
) +
theme_minimal() +
theme(
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
axis.title.y = element_text(size = 14),
axis.text.y = element_text(size = 12)
)
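The outliers mentioned for the exercise induced angina group follow from the default box plot rule: points farther than 1.5 times the IQR from the box are drawn individually, and the whiskers stop at the most extreme points inside that limit. The sketch below applies that rule to a made-up vector of ages (`ages` is hypothetical and is not taken from the Cleveland data).

```r
# Hypothetical ages for one group (made-up values, not the Cleveland data)
ages <- c(35, 48, 51, 53, 55, 56, 57, 59, 61, 62, 64, 76)

# Quartiles and the 1.5 * IQR fences used by the default box plot rule
q <- unname(quantile(ages, probs = c(0.25, 0.75)))
iqr <- q[2] - q[1]
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[2] + 1.5 * iqr

# Points outside the fences are outliers; whiskers end at the most
# extreme points that remain inside the fences
outliers <- ages[ages < lower_fence | ages > upper_fence]
whisker_range <- range(ages[ages >= lower_fence & ages <= upper_fence])

outliers
whisker_range
```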
cp_exang_table <- table(myheartdata$cp, myheartdata$exang)
cp_exang_table_2 <- addmargins(cp_exang_table)
cp_exang_table
##
## 0 1
## 0 19 4
## 1 45 4
## 2 72 11
## 3 64 78
cp_exang_table_2
##
## 0 1 Sum
## 0 19 4 23
## 1 45 4 49
## 2 72 11 83
## 3 64 78 142
## Sum 200 97 297
Data visualization also included comparisons of categorical variables to determine relationships. The kableExtra package was used to create a classic-style table of the counts of types of chest pain by exercise induced angina. From the inspection, asymptomatic presentation had the highest count overall (142 of 297 patients). It also had a clearly higher count in the exercise induced angina group (78 patients) than the typical (4), atypical (4), and non-anginal (11) chest pain presentations.
library(kableExtra)
cp_exang_table_2 %>%
kbl(caption = "Table 1. Chest Pain Type and Exercise Induced Angina") %>%
kable_classic()
cp | 0 | 1 | Sum |
---|---|---|---|
0 | 19 | 4 | 23 |
1 | 45 | 4 | 49 |
2 | 72 | 11 | 83 |
3 | 64 | 78 | 142 |
Sum | 200 | 97 | 297 |
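Because the chest pain groups differ considerably in size, row percentages can make the comparison easier to read than raw counts. The sketch below rebuilds the table from the counts printed above and uses `prop.table()`; the object name `cp_exang_counts` is introduced here for illustration, and the row labels assume the coding 0 = typical through 3 = asymptomatic used elsewhere in this report.

```r
# Counts copied from the cross-tabulation above
# (rows: chest pain type 0-3, columns: exercise induced angina 0/1)
cp_exang_counts <- matrix(
  c(19, 45, 72, 64,    # exang = 0 (absent)
     4,  4, 11, 78),   # exang = 1 (present)
  nrow = 4,
  dimnames = list(
    cp    = c("Typical", "Atypical", "Non-anginal", "Asymptomatic"),
    exang = c("Absent", "Present")
  )
)

# Percentage of each chest pain group with and without exercise induced angina
row_pct <- round(100 * prop.table(cp_exang_counts, margin = 1), 1)
row_pct
```

With `margin = 1` each row sums to 100, so the asymptomatic row can be compared directly with the other three presentations.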
A bar graph was used to further demonstrate the relationship between type of chest pain and the presence or absence of exercise induced angina. The bar graph makes apparent the large number of patients who have exercise induced angina but an asymptomatic presentation without exercise. In the group of patients without exercise induced angina, counts rise from typical to non-anginal chest pain and then fall slightly for asymptomatic presentation, giving a distribution slightly skewed to the left.
color <- c("goldenrod", "turquoise", "darkblue", "salmon")
names <- c("Typical", "Atypical", "Non-anginal", "Asymptomatic")
barplot(cp_exang_table, col=color, beside = TRUE, main = "Absence/Presence of Exercise Induced Chest Pain & Type of Chest Pain")
legend("topleft", names, fill=color)
Additionally, a table was created to compare categorical groups of exercise induced chest pain and type of defect. From the table, a higher count is apparent for patients without a defect and without exercise induced angina compared to patients without a defect but with exercise induced angina. Groups of patients with either a fixed or reversible defect appear fairly equally distributed across presence or absence of exercise induced angina.
thal_exang_group<- table(myheartdata$thal, myheartdata$exang)
thal_exang_group_2 <- addmargins(thal_exang_group)
thal_exang_group
##
## 0 1
## 0 133 31
## 1 10 8
## 2 57 58
thal_exang_group_2
##
## 0 1 Sum
## 0 133 31 164
## 1 10 8 18
## 2 57 58 115
## Sum 200 97 297
thal_exang_group_2 %>%
kbl(caption = "Table 2. Exercise Induced Chest Pain and Type of Heart Defect") %>%
kable_classic()
thal | 0 | 1 | Sum |
---|---|---|---|
0 | 133 | 31 | 164 |
1 | 10 | 8 | 18 |
2 | 57 | 58 | 115 |
Sum | 200 | 97 | 297 |
color <- c("turquoise", "darkblue", "salmon")
names <- c("Normal", "Fixed Defect", "Reversible Defect")
barplot(thal_exang_group, col=color, beside = TRUE, main = "Presence or Absence of Exercise Induced Chest Pain & Type of Defect")
legend("topright", names, fill=color)
View(myheartdata)
Multivariate data visualization is a key component of the exploratory data analysis (EDA) used to identify important trends, relationships, and correlations between variables. Alghamdi et al. (2024) used a variety of data visualization techniques with the Cleveland Heart dataset, including but not limited to agglomerative cluster plots to compare groups of numerical variables, bar plots to compare categorical variables, and pair plots with hue for identification of key features.
Pair plots were used in this EDA to identify relationships and correlations between numerical variables. A weak negative correlation was found when comparing cholesterol with maximum heart rate achieved, indicating that patients with higher cholesterol levels tended to achieve a lower maximum heart rate. This association implies patients with a higher cholesterol level show exercise excursion to higher heart rate levels less often, as would be expected. Additionally, the shift toward increased maximum heart rate for patients with very high cholesterol (over 250) suggests that, for these patients, exercise leads to rapid increases in heart rate. This may be an indicator for the presence of heart disease and requires further investigation.
In this EDA, the age variable most consistently demonstrated correlations with other numeric variables. Age and resting blood pressure presented with a weak positive correlation, indicating resting blood pressure shows a weak trend of increasing as age increases. Age also presented with a weak, negative correlation with maximum heart rate achieved, showing younger patients were more often achieving a maximum heart rate over 140 while older patients more often had a heart rate of 140 or lower. This also indicates younger patients may have higher exercise excursion.
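Trends read from a pair plot can be checked numerically with `cor.test()`, which reports the correlation coefficient together with a confidence interval and p-value. The following is a minimal sketch on simulated data; the vectors `age` and `thalach` below are generated for illustration only and are not the Cleveland values.

```r
set.seed(42)

# Hypothetical simulated sample (not the Cleveland data):
# maximum heart rate is generated to decline with age, plus noise
age     <- round(rnorm(100, mean = 55, sd = 9))
thalach <- round(200 - 0.9 * age + rnorm(100, sd = 20))

# Pearson correlation with a 95% confidence interval and p-value
result <- cor.test(age, thalach, method = "pearson")
result$estimate
result$conf.int
result$p.value
```

On the real data the same call would be `cor.test(nummyheartdata$age, nummyheartdata$thalach)`, assuming both columns are numeric.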
The relationship between age and heart disease continues to receive attention due to the association of increased risk of heart disease with aging (CDC, 2024). Gracious and Shirley (2024) propose use of the ensemble learning models Random Forest and XGBoost in a web interface to increase access to heart disease prediction for the general population. Gracious and Shirley (2024) specifically highlight age as an important factor in heart disease prediction models, and the value of machine learning models in leveraging age-related physiological changes with traditional clinical data. Because of the attention to age in heart disease classification model literature, age was closely inspected in this EDA.
Age and cholesterol demonstrated stacked values in four distinct lines, each with a negative correlation. Inspection of the pair plots for age and cholesterol identifies problems in the data that may require further pre-processing. The lines of stacked values indicate many identical cholesterol values and may indicate discrete groupings of cholesterol levels rather than continuous data as previously thought. Additionally, a line of stacked values at a cholesterol level of 0 across all ages may provide more evidence that hidden missing values are present in the cholesterol values, requiring imputation. The trends identified with the pairs plot in age and cholesterol may also be related to the small number of observations for the Cleveland Heart dataset, as has been discussed in the literature.
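If zeros in the cholesterol column do turn out to be placeholders for missing readings, one simple way to handle them is to recode them as `NA` and then impute. The sketch below uses a made-up vector (`chol` here is hypothetical) and median imputation, which is only one of several possible choices and is not necessarily the method this EDA will adopt.

```r
# Hypothetical cholesterol readings; zeros stand in for unrecorded values
chol <- c(233, 0, 250, 204, 0, 236, 354, 192)

# Recode zeros as missing so they no longer distort summaries
chol_clean <- replace(chol, chol == 0, NA)
n_missing  <- sum(is.na(chol_clean))

# Median imputation: fill missing entries with the median of observed values
chol_imputed <- ifelse(is.na(chol_clean),
                       median(chol_clean, na.rm = TRUE),
                       chol_clean)
chol_imputed
```

Counting `n_missing` before imputing shows how much of the column is affected, which helps judge whether imputation is reasonable at all.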
Sameen et al. (2025) address the dataset size by using a merged dataset combining the Switzerland, Hungarian, VA Long Beach, Statlog, and Cleveland datasets. The merged dataset is larger and more diverse, which reduces the risk of biased data. Additionally, Least Absolute Shrinkage and Selection Operator (LASSO) and the Relief method were compared for feature selection, alongside data visualization techniques, to improve feature selection. Results indicated that adding the Relief method to data visualization increased model accuracy when used with the Random Forest Bagging Method for classification of heart disease.
Additionally, presence or absence of exercise induced angina was compared with type of chest pain and type of defect. For patients with exercise induced angina, a large number of patients had an asymptomatic presentation at baseline indicating no perceivable symptoms at rest. Patients without exercise induced angina had a higher number of patients in the normal category with no defect compared to patients with exercise induced angina.
In conclusion, the multivariate data visualization in this EDA highlights the relationships between variables to identify important features for future analysis. In addition to cholesterol, important variables include age and variables related to exercise (maximum heart rate achieved and exercise induced angina). Age, cholesterol, and maximum heart rate achieved will be considered in future analysis due to the weak correlations identified. Age also varied across types of chest pain and across presence or absence of exercise induced angina. Finally, type of chest pain and type of defect varied with the presence or absence of exercise induced angina, requiring further analysis. Overall, the inclusion of multivariate data visualization was essential in identifying important variables for further analysis.
Descriptive statistics are an essential component of exploratory data analysis in identifying information and understanding the characteristics of the data. In this report, descriptive statistics will be reviewed for a deeper understanding of the data in the Cleveland Heart dataset. The remaining sections of this report will provide a review of descriptive statistics for both numerical and categorical features and a discussion of the insights gained.
summary(nummyheartdata)
## age trestbps chol thalach oldpeak
## Min. :29.00 Min. : 94.0 Min. : 1 Min. : 71.0 Min. :0.000
## 1st Qu.:48.00 1st Qu.:120.0 1st Qu.: 75 1st Qu.:133.0 1st Qu.:0.000
## Median :56.00 Median :130.0 Median :149 Median :153.0 Median :0.800
## Mean :54.54 Mean :131.7 Mean :149 Mean :149.6 Mean :1.056
## 3rd Qu.:61.00 3rd Qu.:140.0 3rd Qu.:223 3rd Qu.:166.0 3rd Qu.:1.600
## Max. :77.00 Max. :200.0 Max. :297 Max. :202.0 Max. :6.200
## condition
## Min. :0.0000
## 1st Qu.:0.0000
## Median :0.0000
## Mean :0.4613
## 3rd Qu.:1.0000
## Max. :1.0000
IQR(nummyheartdata$age)
## [1] 13
IQR(nummyheartdata$trestbps)
## [1] 20
IQR(nummyheartdata$chol)
## [1] 148
IQR(nummyheartdata$thalach)
## [1] 33
IQR(nummyheartdata$oldpeak)
## [1] 1.6
IQR(nummyheartdata$condition)
## [1] 1
var(nummyheartdata, na.rm=FALSE)
## age trestbps chol thalach oldpeak condition
## age 81.897716 46.693682 -150.00338 -81.917201 2.0802553 1.0261284
## trestbps 46.693682 315.517290 -123.36486 -20.011694 3.9613363 1.3614069
## chol -150.003378 -123.364865 7375.50000 -497.918919 17.0317568 16.2263514
## thalach -81.917201 -20.011694 -497.91892 526.315270 -9.3003003 -4.8550937
## oldpeak 2.080255 3.961336 17.03176 -9.300300 1.3598423 0.2469219
## condition 1.026128 1.361407 16.22635 -4.855094 0.2469219 0.2493402
var(nummyheartdata$age)
## [1] 81.89772
sd(nummyheartdata$age)
## [1] 9.049736
sd(nummyheartdata$trestbps)
## [1] 17.76281
sd(nummyheartdata$chol)
## [1] 85.88073
sd(nummyheartdata$thalach)
## [1] 22.94156
sd(nummyheartdata$oldpeak)
## [1] 1.166123
sd(nummyheartdata$condition)
## [1] 0.4993398
library(moments)
skewness(nummyheartdata)
## age trestbps chol thalach oldpeak condition
## -0.2186630 0.6965290 0.0000000 -0.5338265 1.2408238 0.1553487
kurtosis(nummyheartdata)
## age trestbps chol thalach oldpeak condition
## 2.466854 3.781192 1.799973 2.928885 4.465520 1.024133
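When reading these values it helps to know how they are defined. The sketch below writes out the moment-ratio definitions of skewness and kurtosis in base R; the helper names `central_moment`, `skew`, and `kurt` are introduced here for illustration, and the sketch assumes these are the definitions the `moments` package uses. Under this definition kurtosis is not "excess" kurtosis, so a normal distribution has a value of about 3 rather than 0.

```r
# Moment-based definitions (central moments computed by dividing by n)
central_moment <- function(x, k) mean((x - mean(x))^k)
skew <- function(x) central_moment(x, 3) / central_moment(x, 2)^1.5
kurt <- function(x) central_moment(x, 4) / central_moment(x, 2)^2

x_sym   <- c(1, 2, 3, 4, 5)         # symmetric toy sample
x_right <- c(1, 1, 2, 2, 3, 10)     # toy sample with a long right tail

skew(x_sym)     # symmetric data give a skewness of zero
skew(x_right)   # a long right tail gives positive skewness
kurt(x_sym)     # flat, short-tailed data give kurtosis below 3
```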
compute_stats <- function(column, name) {
if (is.numeric(column) || is.integer(column)) {
data.frame(
Variable = name,
Mean = round(mean(column, na.rm = TRUE), 2),
Median = round(median(column, na.rm = TRUE), 2),
St.Deviation = round(sd(column, na.rm = TRUE), 2),
Variance = round(var(column, na.rm = TRUE), 2),
Range = round(diff(range(column, na.rm = TRUE)), 2),
IQR = round(IQR(column, na.rm = TRUE), 2),
Skewness = round(skewness(column, na.rm = TRUE), 2),
Kurtosis = round(kurtosis(column, na.rm = TRUE), 2),
stringsAsFactors = FALSE
)
} else {
NULL
}
}
descriptive_stats <- do.call(
rbind,
lapply(names(nummyheartdata), function(col) compute_stats(nummyheartdata[[col]], col))
)
descriptive_stats
## Variable Mean Median St.Deviation Variance Range IQR Skewness Kurtosis
## 1 age 54.54 56.0 9.05 81.90 48.0 13.0 -0.22 2.47
## 2 trestbps 131.69 130.0 17.76 315.52 106.0 20.0 0.70 3.78
## 3 chol 149.00 149.0 85.88 7375.50 296.0 148.0 0.00 1.80
## 4 thalach 149.60 153.0 22.94 526.32 131.0 33.0 -0.53 2.93
## 5 oldpeak 1.06 0.8 1.17 1.36 6.2 1.6 1.24 4.47
## 6 condition 0.46 0.0 0.50 0.25 1.0 1.0 0.16 1.02
descriptive_stats %>%
kbl(caption = "Table 1. Descriptive Statistics") %>%
kable_classic()
Variable | Mean | Median | St.Deviation | Variance | Range | IQR | Skewness | Kurtosis |
---|---|---|---|---|---|---|---|---|
age | 54.54 | 56.0 | 9.05 | 81.90 | 48.0 | 13.0 | -0.22 | 2.47 |
trestbps | 131.69 | 130.0 | 17.76 | 315.52 | 106.0 | 20.0 | 0.70 | 3.78 |
chol | 149.00 | 149.0 | 85.88 | 7375.50 | 296.0 | 148.0 | 0.00 | 1.80 |
thalach | 149.60 | 153.0 | 22.94 | 526.32 | 131.0 | 33.0 | -0.53 | 2.93 |
oldpeak | 1.06 | 0.8 | 1.17 | 1.36 | 6.2 | 1.6 | 1.24 | 4.47 |
condition | 0.46 | 0.0 | 0.50 | 0.25 | 1.0 | 1.0 | 0.16 | 1.02 |
In this EDA, descriptive statistics were derived for numeric variables (Table 1). Resting blood pressure (trestbps) and maximum heart rate achieved (thalach) were roughly symmetrical, with moderate skewness. The resting blood pressure (trestbps) data indicate the presence of high variability in the tails, with an IQR of 20 but a range of 106. Skewness is moderately positive at 0.70, and kurtosis is 3.78, above the value of 3 expected for a normal distribution. Maximum heart rate achieved (thalach) data also indicate variability, with negative skewness of -0.53, which is also reflected in the mean (149.60) being lower than the median (153.0), and a kurtosis of 2.93, close to that of a normal distribution. The variance of 526.32 indicates that values are widely spread around the center despite the near symmetry. The cholesterol variable has the highest variability, with a standard deviation of 85.88, which is over 50% of the mean and median values (both 149).
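Comparing a standard deviation to its mean, as done for cholesterol above, is the coefficient of variation (CV). The sketch below computes it for the continuous variables using the means and standard deviations copied from Table 1; the data frame name `stats_tbl` is introduced here for illustration.

```r
# Means and standard deviations copied from Table 1
stats_tbl <- data.frame(
  variable = c("age", "trestbps", "chol", "thalach", "oldpeak"),
  mean     = c(54.54, 131.69, 149.00, 149.60, 1.06),
  sd       = c(9.05, 17.76, 85.88, 22.94, 1.17)
)

# Coefficient of variation: standard deviation relative to the mean
stats_tbl$cv <- round(stats_tbl$sd / stats_tbl$mean, 2)

# Sort from most to least variable on this relative scale
stats_tbl[order(-stats_tbl$cv), ]
```

On this relative scale cholesterol remains well above age, resting blood pressure, and maximum heart rate, while ST depression (oldpeak) ranks highest because its mean is close to zero. Which variable has the "highest variability" therefore depends on whether absolute or relative spread is meant.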
Variance was also calculated for numeric variables, with the highest variance for cholesterol (chol), high variance for resting blood pressure (trestbps) and maximum heart rate achieved (thalach), moderate variance for age, and the lowest variance for exercise induced ST depression compared to rest state (oldpeak).
Across all the numeric variable groups for this dataset, the small number of observations is apparent in the asymmetrical distribution and high variability. Except for the target variable (presence or absence of heart disease), the mean and median values are similar across the variables. Age and maximum heart rate achieved (thalach) both have negative skewness, indicating a longer tail toward lower values that affects the shape of the data. The ST depression with exercise relative to rest (oldpeak), a specific measurement of the electrocardiogram (ECG), has positive skewness (1.24) as well as the highest kurtosis (4.47), indicating that most values are clustered at the low end with a long tail of high values. Condition is misleading as a numeric variable, as it would be more accurately described as a categorical variable of presence or absence of heart disease. It was maintained as a numeric value only so the mean could be inspected to check the balance of the two categories; the mean of 0.46 indicates that roughly 46% of patients have heart disease.
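One check may be worth running before interpreting the cholesterol statistics. The values reported for `chol` in Table 1 (minimum 1, maximum 297, mean and median 149, variance 7375.50, IQR 148, skewness 0) coincide with those of the integers 1 through 297, one per observation. The sketch below reproduces those figures from `1:297`. Whether `chol` in `nummyheartdata` was actually replaced by ranks, row numbers, or similar codes during preprocessing is an assumption that would need to be confirmed against the raw data.

```r
n   <- 297      # number of observations reported in the tables
idx <- 1:n      # stand-in sequence, NOT the real chol column

# Summary values for the plain sequence 1, 2, ..., 297
seq_summary <- c(min = min(idx), max = max(idx),
                 mean = mean(idx), median = median(idx),
                 variance = var(idx), iqr = IQR(idx),
                 sd = round(sd(idx), 2))
seq_summary

# A comparison one could run on the real data (assumes nummyheartdata exists):
# all(sort(nummyheartdata$chol) == 1:n)
```

If the commented comparison returns `TRUE` on the real data, the cholesterol column would need to be rebuilt from the source file before its variability or correlations are interpreted.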
catmyheartdata <- myheartdata[c(2,3,6:7,9,11:13)]
head(catmyheartdata)
## # A tibble: 6 × 8
## sex cp fbs restecg exang slope ca thal
## <fct> <fct> <fct> <fct> <fct> <fct> <fct> <fct>
## 1 1 0 1 2 0 1 1 0
## 2 0 0 0 0 0 0 2 0
## 3 0 0 0 0 0 2 0 0
## 4 1 0 1 2 0 1 1 0
## 5 1 0 0 2 1 1 0 0
## 6 1 0 0 2 0 1 0 2
library(plyr)
Correlations of categorical variables were also investigated with the use of contingency tables. The categorical variables of gender (sex), type of chest pain (cp), type of defect (thal) and presence or absence of exercise induced angina (exang) were compared. Goel et al. (2023) investigated the performance of classification models using the Cleveland Heart dataset, using a heat map to identify features most highly correlated with presence or absence of heart disease. Type of chest pain (cp), type of defect (thal) and presence or absence of exercise induced angina (exang) were listed among the six most highly correlated features and were also chosen for investigation in this EDA (Goel et al., 2023).
Inspection of the contingency tables reveals information about the distribution across groups and relationships between categorical variables. A higher percentage of male patients experienced exercise induced angina (37.3%) compared to female patients (22.9%). When comparing exercise induced angina with angina type, only the patients asymptomatic at baseline had a higher count in presence as compared to absence of exercise induced angina. Typical, atypical, and non-anginal pain at baseline had comparatively lower counts in the group with exercise induced angina than in the group without. When comparing presence or absence of exercise induced angina to defect type, the groups with defects (reversible and fixed) had similar counts across presence and absence of exercise induced angina, whereas in the Normal group (no defect) only 18.9% of patients had exercise induced angina.
Pearson’s Chi-squared tests were calculated for each of these contingency tables as well. A p-value of 0.019 was found for the comparison of gender with exercise induced angina, a p-value of 2.153e-13 was found for the comparison of chest pain type with exercise induced angina, and a p-value of 1.262e-07 was found for the comparison between exercise induced angina and type of defect. All three comparisons showed statistically significant associations, with chest pain type and type of defect having the smallest p-values in relation to exercise induced angina.
summary(catmyheartdata)
## sex cp fbs restecg exang slope ca thal
## 0: 96 0: 23 0:254 0:147 0:200 0:139 0:174 0:164
## 1:201 1: 49 1: 43 1: 4 1: 97 1:137 1: 65 1: 18
## 2: 83 2:146 2: 21 2: 38 2:115
## 3:142 3: 20
sex_cp <- table(catmyheartdata$exang, catmyheartdata$sex)
sex_cp_margins <- addmargins(sex_cp)
colnames(sex_cp_margins) <- c("Female", "Male", "Total")
rownames(sex_cp_margins) <- c("Absent", "Present", "Total")
sex_cp_margins %>%
kbl(caption = "Table 3. Crosstab of Presence or Absence of Exercise Induced Angina by Gender") %>%
kable_classic()
exang | Female | Male | Total |
---|---|---|---|
Absent | 74 | 126 | 200 |
Present | 22 | 75 | 97 |
Total | 96 | 201 | 297 |
sex_cp_chi <- chisq.test(sex_cp)
sex_cp_chi
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: sex_cp
## X-squared = 5.4857, df = 1, p-value = 0.01917
cp_exang <- table(catmyheartdata$exang, catmyheartdata$cp)
cp_exang_margins <- addmargins(cp_exang)
colnames(cp_exang_margins) <- c("Typical", "Atypical", "Non-anginal", "Asymptomatic", "Total")
rownames(cp_exang_margins) <- c("Absent", "Present", "Total")
cp_exang_margins %>%
kbl(caption = "Table 4. Crosstab of Presence or Absence of Exercise Induced Angina by Angina Type") %>%
kable_classic()
exang | Typical | Atypical | Non-anginal | Asymptomatic | Total |
---|---|---|---|---|---|
Absent | 19 | 45 | 72 | 64 | 200 |
Present | 4 | 4 | 11 | 78 | 97 |
Total | 23 | 49 | 83 | 142 | 297 |
cp_exang_chi <- chisq.test(cp_exang)
cp_exang_chi
##
## Pearson's Chi-squared test
##
## data: cp_exang
## X-squared = 62.042, df = 3, p-value = 2.153e-13
thal_exang <- table(catmyheartdata$exang, catmyheartdata$thal)
thal_exang_margins <- addmargins(thal_exang)
colnames(thal_exang_margins) <- c("Normal", "Fixed Defect", "Reversible Defect", "Total")
rownames(thal_exang_margins) <- c("Absent", "Present", "Total")
thal_exang_margins %>%
kbl(caption = "Table 5. Crosstab of Presence or Absence of Exercise Induced Angina by Defect Type") %>%
kable_classic()
exang | Normal | Fixed Defect | Reversible Defect | Total |
---|---|---|---|---|
Absent | 133 | 10 | 57 | 200 |
Present | 31 | 8 | 58 | 97 |
Total | 164 | 18 | 115 | 297 |
thal_exang_chi <- chisq.test(thal_exang)
thal_exang_chi
##
## Pearson's Chi-squared test
##
## data: thal_exang
## X-squared = 31.77, df = 2, p-value = 1.262e-07
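A p-value indicates how strong the evidence against independence is, but it also depends on sample size, so it does not by itself measure how strong an association is. One common companion measure is Cramér's V, which is not part of the original analysis and is added here only as an illustration. The sketch below rebuilds Table 4 from its printed counts, reruns the chi-squared test, and computes V from the test statistic; the object names are introduced for this example.

```r
# Counts copied from Table 4
# (rows: exercise induced angina absent/present; columns: chest pain type)
cp_exang_counts <- matrix(
  c(19, 4, 45, 4, 72, 11, 64, 78),
  nrow = 2,
  dimnames = list(
    exang = c("Absent", "Present"),
    cp    = c("Typical", "Atypical", "Non-anginal", "Asymptomatic")
  )
)

chi_res <- chisq.test(cp_exang_counts)

# Cramer's V: sqrt(X^2 / (n * (min(rows, cols) - 1)))
n <- sum(cp_exang_counts)
k <- min(dim(cp_exang_counts)) - 1
cramers_v <- sqrt(unname(chi_res$statistic) / (n * k))

round(chi_res$expected, 1)   # expected counts under independence
round(cramers_v, 2)
```

Values of V range from 0 (no association) to 1 (perfect association), which allows the three comparisons to be ranked on a common scale regardless of table size.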
The correlation between numeric variables was also calculated and is represented in table form (Table 2) and in a correlation matrix heat map (Figure 1). No correlations with an absolute value above 0.5 were found among the numeric variables. The strongest negative correlations were found between age and maximum heart rate achieved (-0.39), maximum heart rate achieved and the target condition (-0.42), and maximum heart rate achieved and exercise induced ST depression (-0.35). The strongest positive correlations were found between cholesterol and the target condition (0.38) and exercise induced ST depression and the target condition (0.42).
corr_myheartdata <- cor(nummyheartdata)
corr_myheartdata
## age trestbps chol thalach oldpeak condition
## age 1.0000000 0.29047626 -0.19300538 -0.39456288 0.1971226 0.2270752
## trestbps 0.2904763 1.00000000 -0.08086939 -0.04910766 0.1912431 0.1534900
## chol -0.1930054 -0.08086939 1.00000000 -0.25272017 0.1700668 0.3783807
## thalach -0.3945629 -0.04910766 -0.25272017 1.00000000 -0.3476400 -0.4238171
## oldpeak 0.1971226 0.19124314 0.17006676 -0.34763997 1.0000000 0.4240521
## condition 0.2270752 0.15349003 0.37838067 -0.42381706 0.4240521 1.0000000
corr_myheartdata %>%
kbl(caption = "Table 2. Correlations of Numerical Variables") %>%
kable_classic()
variable | age | trestbps | chol | thalach | oldpeak | condition |
---|---|---|---|---|---|---|
age | 1.0000000 | 0.2904763 | -0.1930054 | -0.3945629 | 0.1971226 | 0.2270752 |
trestbps | 0.2904763 | 1.0000000 | -0.0808694 | -0.0491077 | 0.1912431 | 0.1534900 |
chol | -0.1930054 | -0.0808694 | 1.0000000 | -0.2527202 | 0.1700668 | 0.3783807 |
thalach | -0.3945629 | -0.0491077 | -0.2527202 | 1.0000000 | -0.3476400 | -0.4238171 |
oldpeak | 0.1971226 | 0.1912431 | 0.1700668 | -0.3476400 | 1.0000000 | 0.4240521 |
condition | 0.2270752 | 0.1534900 | 0.3783807 | -0.4238171 | 0.4240521 | 1.0000000 |
image(1:nrow(corr_myheartdata), 1:ncol(corr_myheartdata),
t(corr_myheartdata[nrow(corr_myheartdata):1, ]),
xlab = "", ylab = "",
main = "Figure 1. Correlation Matrix Heat Map", col = heat.colors(10))
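A heat map gives a quick visual impression, but ranking the pairs by absolute correlation makes the strongest relationships explicit. The sketch below re-enters the upper triangle of the correlation matrix printed above (rounded to four decimals) and sorts the variable pairs; the names `corr` and `pairs_df` are introduced for this example. With the real data, `corr_myheartdata` could be used in place of the hand-entered matrix.

```r
vars <- c("age", "trestbps", "chol", "thalach", "oldpeak", "condition")
corr <- diag(6)
dimnames(corr) <- list(vars, vars)

# Upper-triangle values copied from the printed matrix, filled column by column
corr[upper.tri(corr)] <- c(
   0.2905,
  -0.1930, -0.0809,
  -0.3946, -0.0491, -0.2527,
   0.1971,  0.1912,  0.1701, -0.3476,
   0.2271,  0.1535,  0.3784, -0.4238,  0.4241
)

# One row per variable pair, sorted by absolute correlation
idx <- which(upper.tri(corr), arr.ind = TRUE)
pairs_df <- data.frame(
  var1 = vars[idx[, "row"]],
  var2 = vars[idx[, "col"]],
  r    = corr[idx]
)
pairs_df <- pairs_df[order(-abs(pairs_df$r)), ]
head(pairs_df, 5)
```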
In an EDA, data visualizations provide an opportunity to rapidly view the general data characteristics, compare variables, compare groups, and identify outliers. Visualizations can then identify areas which require further management, such as adjustment or inspection of outliers, to reduce the risk of bias in the data and increase generalizability of findings. Once understanding has been gained through data visualization, further investigation of the relationships between variables is required through statistical analysis. Measures of central tendency, variability, and shape of the distribution in variables provide essential information about the accuracy of the dataset in representing the population sampled.
Cai et al. (2024) completed a systematic review of literature pertaining to prediction models for heart disease using electronic medical records to assess replicability of published articles by using an independent validation score (IVS). The researchers used the prediction risk of bias assessment tool (PROBAST) to assess risk of bias across 79 studies using AI in prediction models for heart disease. The researchers found all models to have a risk of bias as determined by PROBAST. The bias risk related to four general trends, including: 1) data concerns, including inappropriate data sources, and a lack of global populations in datasets, 2) lack of reported detail regarding the predictors, 3) reduced clarity in the definition of the outcome (target variable), such as a different definition of “heart disease” across studies, and 4) inappropriate evaluation of model performance and overfitting (Cai et al., 2024).
With the risk of bias high in machine learning models for heart disease prediction, a thorough understanding of the data is essential. Inconsistencies between the target population and the sample characteristics, unequal group distribution, skewness, and kurtosis provide an understanding of the generalizability of the data to the target population. When small sample sizes are used, such as the dataset used for this EDA, results require replication, with repetition on larger or more representative datasets to confirm generalizability of results.
Despite the small number of observations in the dataset used for this EDA, some important findings can be reviewed from the statistical analysis of this dataset. Age, resting blood pressure, and maximum heart rate achieved have low to moderate skewness (absolute values of 0.70 or less) and kurtosis values between 2.47 and 3.78, reasonably close to the value of 3 for a normal distribution, suggesting these variables approximately follow a normal distribution pattern, as would be expected in the general population. Cholesterol had the highest variability in both variance and standard deviation scores, indicating a wide range of values across all observations. Age and maximum heart rate achieved both had a longer tail in the lower values, indicating a subset of younger patients and a subset of patients achieving lower maximum heart rates.
Lower ages and absence of heart disease correlated to higher maximum heart rate achieved as indicated by the negative correlation. The variables of cholesterol and the ST depression with exercise were positively correlated indicating higher cholesterol numbers may also have a relationship with a higher ST depression. Patients with exercise induced angina more frequently presented as asymptomatic (no chest pain) at baseline. Patients without exercise induced angina had a normal heart structure (no defect) more often than patients with exercise induced angina. Therefore, from the descriptive statistics, there appears to be a relationship between cholesterol, type of defect, and ST depression during exercise with the condition of heart disease. Additionally, the absence of exercise induced angina is more often associated with normal heart structure.
In conclusion, an explanation of descriptive statistics is essential for a thorough understanding of the normality and shape of variables in a dataset used for EDA. Measures of central tendency provide a single value to represent the variable, while measures of variability provide information regarding the comparability of the data to the target population. Measures of the shape of the data provide information regarding the risk of bias. Calculations regarding the correlations and statistical significance also provide valuable information regarding important variables to be chosen in the development of machine learning models. Taken together, these descriptive statistics provide opportunities to identify important features to be used in machine learning algorithms. Future work on this EDA will include division of the dataset into training and testing sets and feature engineering.
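As an illustration of the planned division into training and testing sets, the following is a minimal base R sketch of a stratified 80/20 split, which keeps the proportion of each class similar in both sets. The data frame `toy` is made up, and the 80/20 ratio and the seed are assumptions for this example rather than choices stated in this EDA.

```r
set.seed(123)

# Hypothetical data frame standing in for the heart dataset (made-up values)
toy <- data.frame(
  id        = 1:100,
  condition = factor(rep(c(0, 1), times = c(54, 46)))
)

# Sample 80% of the row indices within each class of the target
train_idx <- unlist(lapply(
  split(seq_len(nrow(toy)), toy$condition),
  function(rows) sample(rows, size = round(0.8 * length(rows)))
))

train_set <- toy[train_idx, ]
test_set  <- toy[-train_idx, ]

table(train_set$condition)
table(test_set$condition)
```

Splitting within each class guards against a test set that, by chance, contains too few patients with or without heart disease, which matters for a dataset of this size.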
Alghamdi, N. S., Zakariah, M., Shankar, A., & Viriyasitavat, W. (2024). Heart disease prediction using autoencoder and DenseNet architecture. Egyptian Informatics Journal, 28, 100559. https://doi.org/10.1016/j.eij.2024.100559
Cai, Y., Cai, Y.-Q., Tang, L.-Y., Wang, Y.-H., Gong, M., Jing, T.-C., Li, H.-J., Li-Ling, J., Hu, W., Yin, Z., Gong, D.-X., & Zhang, G.-W. (2024). Artificial intelligence in the risk prediction models of cardiovascular disease and development of an independent validation screening tool: a systematic review. BMC Medicine, 22(1). https://doi.org/10.1186/s12916-024-03273-7
Centers for Disease Control and Prevention. (2024, May 15). About heart disease. https://www.cdc.gov/heart-disease/about/index.html
Goel, N., Yadav, N. P., Prakarti Prakarti, & Pandey, A. (2023). Comparative Analysis of Single Classifier Models against Aggregated Fusion Models for Heart Disease Prediction. 2023 International Conference on Disruptive Technologies (ICDT), 576–580. https://doi.org/10.1109/icdt57929.2023.10150611
Gracious, S., & Shirley, C. P. (2024). ML based Age Related Heart Disease Prediction. Proceedings of 2024 International Conference on Cognitive Robotics and Intelligent Systems (ICC - ROBINS 2024), 312–316.
Heart Disease Cleveland. (n.d.). Kaggle. https://www.kaggle.com/datasets/ritwikb3/heart-disease-cleveland
Kandula, S. T. R. (2025). Comparison and Performance Assessment of Intelligent ML Models for Forecasting Cardiovascular Disease Risks in Healthcare. 2025 4th International Conference on Sensors and Related Networks (SENNET) - Special Focus on Digital Healthcare, 1–6. https://doi.org/10.1109/sennet64220.2025.11136005
Normawati, D., & Winarti, S. (2018). Feature Selection with Combination Classifier use Rules-Based Data Mining for Diagnosis of Coronary Heart Disease. 2018 12th International Conference on Telecommunication Systems, Services, and Applications (TSSA). https://doi.org/10.1109/tssa.2018.8708849
Riyaz, L., Butt, M. A., & Zaman, M. (2022). Improving coronary heart disease prediction by outlier elimination. Applied Computer Science, 18(1), 70–88. https://doi.org/10.35784/acs-2022-6
Sameen, S., Gaikwad, M., & Wazalwar, S. (2025). Efficient Prediction of Cardiovascular Disease Using Machine Learning. 2025 12th International Conference on Emerging Trends in Engineering & Technology - Signal and Information Processing (ICETET - SIP), 1–6. https://doi.org/10.1109/icetetsip64213.2025.11156734