NorthamJDDS8501-6

A Demonstration of Principles of Exploratory Data Analysis for Data Science Using the Cleveland Heart Dataset: Part 6

Heart disease continues to be the leading cause of death for individuals in the middle adult age range despite decades of research addressing treatment for this condition. The Centers for Disease Control and Prevention (CDC, 2024) describes heart disease as a term encompassing several conditions related to heart function, including coronary artery disease, which affects blood flow to the heart and may increase the risk of a heart attack. According to the CDC, 22% of deaths in the United States are due to heart disease (CDC, 2024). Due to the high need for advanced healthcare practices to treat this condition, data science has surfaced as an option, using machine learning models to aid understanding of the disease and support advancements in treatment.

Completion of an exploratory data analysis (EDA) includes several components to gain an understanding of the data used. This understanding is essential in identifying appropriate features for classification models to ensure optimal performance. Data visualization is required for understanding trends in the data, including the presence of outliers, unbalanced groups, or other aspects of the data that may require management prior to model development. The continued prominence of heart disease as a major health risk for many adults (CDC, 2024) has led to increased attention to data analysis in the healthcare sector. An increased understanding of the relevant data is essential to ensure medical practices continue to evolve to effectively treat or prevent heart disease. The purpose of this EDA is to investigate methods for data visualization using the Cleveland Heart dataset. Part 3 of the EDA provided an inspection of four of the features of the dataset (cholesterol, resting blood pressure, fasting blood sugar, and exercise-induced angina). The understanding gained from the data visualization techniques used will guide feature selection for subsequent steps of the EDA process for classification of the presence or absence of heart disease.

The Amelia II and Rmisc packages were installed, which required specification of the CRAN mirror.

install.packages("Amelia", repos = "https://cloud.r-project.org")

## Installing package into 'C:/Users/benke/AppData/Local/R/win-library/4.5'
## (as 'lib' is unspecified)

## package 'Amelia' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\benke\AppData\Local\Temp\RtmpodVpo4\downloaded_packages

install.packages("Rmisc", repos = "https://cloud.r-project.org")

## Installing package into 'C:/Users/benke/AppData/Local/R/win-library/4.5'
## (as 'lib' is unspecified)

## package 'Rmisc' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\benke\AppData\Local\Temp\RtmpodVpo4\downloaded_packages

All previous packages from Part 1 and Part 2 were included.

library(ggplot2)            # Load ggplot2 library
library(scales)             # Load scales library
library(moments)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(readr)

## Warning: package 'readr' was built under R version 4.5.1

## 
## Attaching package: 'readr'

## The following object is masked from 'package:scales':
## 
##     col_factor

library(kableExtra)

## Warning: package 'kableExtra' was built under R version 4.5.1

## 
## Attaching package: 'kableExtra'

## The following object is masked from 'package:dplyr':
## 
##     group_rows

library(tidyr)

The Cleveland Heart dataset was uploaded for analysis.

getwd()

## [1] "C:/Users/benke/OneDrive/NU/DDS 8501"

setwd("C:/Users/benke/OneDrive/NU/DDS 8501")

myheartdata <- read_csv("heart_cleveland_upload.csv")

## Rows: 297 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (14): age, sex, cp, trestbps, chol, fbs, restecg, thalach, exang, oldpea...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Information regarding the variables was included in this EDA for reference and clarity.

The Characteristics of The Dataset

This dataset consists of the data from the heart_cleveland_upload dataset, which has 14 variables and 297 observations.

The variables consist of 13 attributes and a target variable indicating the presence or absence of heart disease. The variables are described as follows:

age: patient’s age (Quantitative, Numeric, Continuous, Ratio)

sex: patient’s gender (Qualitative, Categorical, Nominal)

cp: chest pain: This variable includes 4 categories of chest pain (typical angina = 0, atypical angina = 1, non-anginal pain = 2, asymptomatic = 3; Qualitative, Categorical, Nominal)

trestbps: patient’s blood pressure at rest (mm Hg, Quantitative, Numeric, Continuous, Ratio)

chol: serum cholesterol (mg/dl, Quantitative, Numeric, Continuous, Ratio)

fbs: fasting blood sugar > 120 mg/dl (Qualitative, Categorical, Nominal)

restecg: electrocardiogram results at rest categorized in 3 values (Normal = 0, ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV) = 1, probable/definite left ventricular hypertrophy = 2; Qualitative, Categorical, Nominal)

thalach: patient’s maximum heart rate (Quantitative, Numeric, Continuous, Interval)

exang: presence/absence of exercise induced angina (Qualitative, Categorical, Nominal)

oldpeak: exercise induced ST-depression compared to rest state (Quantitative, Numeric, Continuous, Interval)

slope: shape of slope of ST segment during peak exercise (up, flat, or down; Qualitative, Categorical, Nominal)

ca: patient’s number of major blood vessels (Qualitative, Categorical, Nominal)

thal: patient’s thalassemia indicating type of defect (Normal = 0, Fixed defect = 1, Reversible defect = 2, matching the factor levels 0 to 2 in the loaded data; Qualitative, Categorical, Nominal)

condition: target: presence or absence of heart disease (Binary, Numerical, Discrete)

Preprocessing

Preprocessing code was included in this notebook for reference. In Part 1 of this EDA, categorical variables with numeric values were reclassified to factors.

factor_columns <- c("sex", "cp", "fbs", "restecg", 
                    "exang","slope", "ca", "thal")
myheartdata[factor_columns] <- lapply(myheartdata[factor_columns], function(col) as.factor(as.character(col)))
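As a possible extension of the reclassification above, the numeric codes can be given readable labels when a factor is created, so that plot axes and summary tables show category names instead of 0 to 3. The following is a minimal sketch using the chest pain coding listed in the variable descriptions. The vector cp_codes and its values are hypothetical, made up for illustration and not taken from the dataset.

```r
# Hypothetical codes for illustration only (not rows from the dataset).
cp_codes <- c(0, 3, 2, 3, 1, 0)

# Map each numeric code to the label given in the variable description.
cp_labelled <- factor(
  cp_codes,
  levels = 0:3,
  labels = c("typical angina", "atypical angina",
             "non-anginal pain", "asymptomatic")
)

levels(cp_labelled)   # label order follows the 0-3 coding
table(cp_labelled)    # counts per named category
```

With labelled levels, the x-axis text of a grouped box plot or bar chart identifies each group directly.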

Another important preprocessing step is identification of missing values and the percent of “NA” values in this dataset. This dataset does not include “NA” values which require preprocessing.

cols_with_nas <- sum(colSums(is.na(myheartdata)) > 0)
Percent_col_NA <- percent(cols_with_nas / length(myheartdata))
cols_with_nas

## [1] 0

Percent_col_NA

## [1] "0%"

The summary obtained from Part 1 was maintained. This includes counts for categorical data. Only numeric variables have mean and median scores.

summary(myheartdata)

##       age        sex     cp         trestbps          chol       fbs    
##  Min.   :29.00   0: 96   0: 23   Min.   : 94.0   Min.   :126.0   0:254  
##  1st Qu.:48.00   1:201   1: 49   1st Qu.:120.0   1st Qu.:211.0   1: 43  
##  Median :56.00           2: 83   Median :130.0   Median :243.0          
##  Mean   :54.54           3:142   Mean   :131.7   Mean   :247.4          
##  3rd Qu.:61.00                   3rd Qu.:140.0   3rd Qu.:276.0          
##  Max.   :77.00                   Max.   :200.0   Max.   :564.0          
##  restecg    thalach      exang      oldpeak      slope   ca      thal   
##  0:147   Min.   : 71.0   0:200   Min.   :0.000   0:139   0:174   0:164  
##  1:  4   1st Qu.:133.0   1: 97   1st Qu.:0.000   1:137   1: 65   1: 18  
##  2:146   Median :153.0           Median :0.800   2: 21   2: 38   2:115  
##          Mean   :149.6           Mean   :1.056           3: 20          
##          3rd Qu.:166.0           3rd Qu.:1.600                          
##          Max.   :202.0           Max.   :6.200                          
##    condition     
##  Min.   :0.0000  
##  1st Qu.:0.0000  
##  Median :0.0000  
##  Mean   :0.4613  
##  3rd Qu.:1.0000  
##  Max.   :1.0000

Additionally, use of the str() function reveals the categories have been appropriately reclassified as factors.

str(myheartdata)

## spc_tbl_ [297 × 14] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ age      : num [1:297] 69 69 66 65 64 64 63 61 60 59 ...
##  $ sex      : Factor w/ 2 levels "0","1": 2 1 1 2 2 2 2 2 1 2 ...
##  $ cp       : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
##  $ trestbps : num [1:297] 160 140 150 138 110 170 145 134 150 178 ...
##  $ chol     : num [1:297] 234 239 226 282 211 227 233 234 240 270 ...
##  $ fbs      : Factor w/ 2 levels "0","1": 2 1 1 2 1 1 2 1 1 1 ...
##  $ restecg  : Factor w/ 3 levels "0","1","2": 3 1 1 3 3 3 3 1 1 3 ...
##  $ thalach  : num [1:297] 131 151 114 174 144 155 150 145 171 145 ...
##  $ exang    : Factor w/ 2 levels "0","1": 1 1 1 1 2 1 1 1 1 1 ...
##  $ oldpeak  : num [1:297] 0.1 1.8 2.6 1.4 1.8 0.6 2.3 2.6 0.9 4.2 ...
##  $ slope    : Factor w/ 3 levels "0","1","2": 2 1 3 2 2 2 3 2 1 3 ...
##  $ ca       : Factor w/ 4 levels "0","1","2","3": 2 3 1 2 1 1 1 3 1 1 ...
##  $ thal     : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 3 2 1 1 3 ...
##  $ condition: num [1:297] 0 0 0 1 0 0 0 1 0 0 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   age = col_double(),
##   ..   sex = col_double(),
##   ..   cp = col_double(),
##   ..   trestbps = col_double(),
##   ..   chol = col_double(),
##   ..   fbs = col_double(),
##   ..   restecg = col_double(),
##   ..   thalach = col_double(),
##   ..   exang = col_double(),
##   ..   oldpeak = col_double(),
##   ..   slope = col_double(),
##   ..   ca = col_double(),
##   ..   thal = col_double(),
##   ..   condition = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

Amelia-II package was installed with the CRAN mirror specified and loaded.

Missingness Analysis

As in Part 2 of this EDA, missingness was investigated using the missingness map method in the Amelia II package in R to identify any missing values. Inspection of the missingness map reveals no missing values for this dataset.

library(Amelia)             # provides missmap()
myheartdata.missmap <- missmap(myheartdata)

## Warning: Unknown or uninitialised column: `arguments`.
## Unknown or uninitialised column: `arguments`.

## Warning: Unknown or uninitialised column: `imputations`.

myheartdata.missmap

## NULL

Additionally, columns were inspected using the colSums(is.na()) function in R. Consistent with the previous analyses, colSums(is.na()) returned a count of 0 for every column, indicating the absence of missing values in this dataset.

colSums(is.na(myheartdata))

##       age       sex        cp  trestbps      chol       fbs   restecg   thalach 
##         0         0         0         0         0         0         0         0 
##     exang   oldpeak     slope        ca      thal condition 
##         0         0         0         0         0         0

Univariate Data Visualization

Data Visualization methods from Part 3 were maintained.

The target feature of this project is the presence or absence of heart disease. Fourteen variables are available for analysis: the target, 5 numeric features, and 8 categorical features. The initial inspection of the data requires a comparison of the target classification groups to ensure balance between the groups.

ggplot(myheartdata, aes(x = factor(condition))) +   # treat the 0/1 target as discrete groups
  geom_bar(fill = "turquoise") +                    # geom_bar() has no `bin` parameter
  labs(x="Group Comparison of Presence or Absence of Heart Disease", y="Count")

The Rmisc and qqplotr packages were installed and loaded to create the 4-plot data visualization.

library(Rmisc)

## Warning: package 'Rmisc' was built under R version 4.5.1

## Loading required package: lattice

## Loading required package: plyr

## Warning: package 'plyr' was built under R version 4.5.1

## ------------------------------------------------------------------------------

## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)

## ------------------------------------------------------------------------------

## 
## Attaching package: 'plyr'

## The following objects are masked from 'package:dplyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize

install.packages("qqplotr", repos = "https://cloud.r-project.org")

## Installing package into 'C:/Users/benke/AppData/Local/R/win-library/4.5'
## (as 'lib' is unspecified)

## package 'qqplotr' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\benke\AppData\Local\Temp\RtmpodVpo4\downloaded_packages

library(qqplotr)

## Warning: package 'qqplotr' was built under R version 4.5.1

## 
## Attaching package: 'qqplotr'

## The following objects are masked from 'package:ggplot2':
## 
##     stat_qq_line, StatQqLine

The determination of features for comparison is another important part of the EDA. Normawati and Winarti (2018) compared feature selection methods using the Cleveland Heart dataset: motivated feature selection (MTF; medical expert-based selection), the Variable Precision Rough Set (VPRS) method, and a combination of both. Results indicated model accuracy improved with use of VPRS and with the merger of MTF and VPRS. The authors comment that using MTF alone may introduce unanticipated bias; therefore, a computer-based feature selection method is likely to improve model performance. The features selected for this EDA used medical expert-based selection to identify features historically associated with heart disease based on risk factors reported by the Centers for Disease Control and Prevention (CDC, 2024). The CDC (2024) reports high blood pressure and high cholesterol as key risk factors for heart disease. Computer-based feature selection methods may reduce the risk of bias and will be considered in future analyses.

For this EDA, a 4-plot was developed using the cholesterol and resting blood pressure variables, as these were reported by the CDC as high-risk indicators for heart disease (CDC, 2024). Figure 2 provides a run sequence plot, a lag plot, a histogram, and a normal probability plot. In the code for this figure, the chol column is assigned the row index (1 to 297), so the run and lag plots show resting blood pressure against the observation index, grouped by condition (presence or absence of heart disease). Inspection of the run and lag plots shows notable trends. Both plots show a higher concentration of observations from the heart disease group at index values above 150. Additionally, the lag plot shows two resting blood pressure values above 180, at approximately 190 and 200, and both belong to the heart disease group. Because these values were considered to demonstrate expected variation in electronic health records rather than error, they were included in the analysis although they are likely outliers. These plots indicate that a resting blood pressure above 180 may have an association with heart disease, although the two values appear to be outliers. Because the x-axis is an index rather than measured cholesterol, the concentration above 150 reflects row order in the file and does not by itself indicate an association between cholesterol and heart disease.

The histogram shows a uniform distribution for chol. This shape follows from the code for this figure, which assigns the row index (1 to 297) to chol before plotting, so the histogram displays evenly spaced index values rather than measured cholesterol and should not be read as evidence of bias in the dataset. The summary from Part 1 reports measured cholesterol ranging from 126 to 564 mg/dl with a median of 243, and the histogram should be regenerated from those values before its shape is interpreted. Finally, the probability plot shows a similar shape for both target groups (with and without heart disease), with deviations from normality at the left and right ends of the plot. This result is expected for the uniformly spaced values observed in the histogram.

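# NOTE: this assignment replaces the measured chol values with the row index 1..n,
# so every later use of chol (histogram, box plot, pairs plot) shows the index.
myheartdata$chol=seq(1:nrow(myheartdata))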
p1=ggplot(myheartdata, aes(x=chol,y=trestbps, group=condition))+geom_line(aes(col=as.factor(condition)))+ggtitle('Run: Cholesterol')
p3=ggplot(myheartdata, aes(x=lag(chol), y=trestbps, group=condition))+geom_point(aes(col=as.factor(condition)))+ggtitle('Lag:Cholesterol')
p2=ggplot(myheartdata, aes(x=chol))+geom_histogram(fill='turquoise',col='black', bin=20)+ggtitle('Histogram:Cholesterol')

## Warning in geom_histogram(fill = "turquoise", col = "black", bin = 20):
## Ignoring unknown parameters: `bin`

p4=ggplot(myheartdata, aes(sample=scale(chol),group=condition))+stat_qq_point(aes(col=as.factor(condition)))+
        ggtitle('Normal Probability Plot: Cholesterol')
multiplot(p1, p2, p3, p4, cols=2)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).

Outlier management is another important aspect of data preprocessing. Riyaz et al. (2022) addressed outlier management using the Framingham Heart Study dataset when comparing classification model performance based on outlier detection. Riyaz et al. (2022) found outliers in several variables (total cholesterol, systolic blood pressure, diastolic blood pressure, body mass index, heart rate, and glucose level) using visualization methods including but not limited to box plots. The authors found that removal of outliers improved the performance of all classification models used except multinomial naive Bayes (Riyaz et al., 2022).

In this EDA, box plots for the variables of interest (cholesterol and resting blood pressure) were used to further identify distribution characteristics. The box plot for cholesterol revealed a median of 150, a 1st quartile of approximately 75, and a 3rd quartile of approximately 225. Whiskers extend as low as 0 and as high as 300, and no outliers are evident. These values match the row index (1 to 297) assigned to chol in the 4-plot code rather than measured cholesterol, which the Part 1 summary reports as ranging from 126 to 564 mg/dl. The apparent cholesterol of 0 is therefore not evidence of hidden missing values, and the box plot should be regenerated from the measured values in future analyses. The box plot for resting blood pressure indicates a 1st quartile of 120, a median of 130 (per the Part 1 summary), and a 3rd quartile of approximately 140. Whisker lines indicate a lowest value slightly above 90 with no outliers and a highest value of 175 with 6 outliers above 175. The outliers were considered representative of expected variation and included in the analysis.

ggplot(myheartdata, aes(x = condition, y = chol)) +
  geom_boxplot(fill = "salmon", color = "darkblue", width = .25, outlier.color = "turquoise", outlier.size = 2) +
  labs(
    title = "Box Plot for Cholesterol",
    x = "Condition",
    y = "Cholesterol"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_blank(),
    axis.ticks.x = element_blank(),
    plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
    axis.title.y = element_text(size = 14),
    axis.text.y = element_text(size = 12)
  )

## Warning: Continuous x aesthetic
## ℹ did you forget `aes(group = ...)`?

ggplot(myheartdata, aes(x = condition, y = trestbps)) +
  geom_boxplot(fill = "salmon", color = "darkblue", width = .25, outlier.color = "turquoise", outlier.size = 2) +
  labs(
    title = "Box Plot for Resting Blood Pressure",
    x = "Condition",
    y = "Resting Blood Pressure"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_blank(),
    axis.ticks.x = element_blank(),
    plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
    axis.title.y = element_text(size = 14),
    axis.text.y = element_text(size = 12)
  )

## Warning: Continuous x aesthetic
## ℹ did you forget `aes(group = ...)`?
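The points drawn beyond the whiskers of a box plot follow the 1.5 × IQR rule, which geom_boxplot() applies by default. The following is a minimal sketch of that rule computed by hand, so that the flagged values can be listed rather than read off the figure. The vector bp is hypothetical, made up for illustration and not taken from the dataset.

```r
# Hypothetical resting blood pressure values for illustration only.
bp <- c(94, 110, 120, 120, 128, 130, 130, 135, 140, 140, 150, 180, 200)

# Quartiles and interquartile range.
q1  <- quantile(bp, 0.25, names = FALSE)
q3  <- quantile(bp, 0.75, names = FALSE)
iqr <- q3 - q1

# Fences: values beyond these are drawn as individual outlier points.
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Values flagged by the rule.
flagged <- bp[bp < lower_fence | bp > upper_fence]
flagged
```

Listing the flagged values makes it easier to decide case by case whether a point is a plausible measurement or a recording error.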

head(myheartdata)

## # A tibble: 6 × 14
##     age sex   cp    trestbps  chol fbs   restecg thalach exang oldpeak slope
##   <dbl> <fct> <fct>    <dbl> <int> <fct> <fct>     <dbl> <fct>   <dbl> <fct>
## 1    69 1     0          160     1 1     2           131 0         0.1 1    
## 2    69 0     0          140     2 0     0           151 0         1.8 0    
## 3    66 0     0          150     3 0     0           114 0         2.6 2    
## 4    65 1     0          138     4 1     2           174 0         1.4 1    
## 5    64 1     0          110     5 0     2           144 1         1.8 1    
## 6    64 1     0          170     6 0     2           155 0         0.6 1    
## # ℹ 3 more variables: ca <fct>, thal <fct>, condition <dbl>

View(myheartdata["trestbps"])

Univariate data visualization is also an important part of the EDA for reviewing the distribution of counts in features of interest. Inspection of group characteristics supports identification of imbalance that could impact the performance of classification models. Kandula (2025) used a stacked bar chart to show that asymptomatic chest pain had the highest count of observations with heart disease compared to the typical angina, atypical angina, and non-anginal pain groups. Kandula (2025) points out that heart disease can present without symptomatic pain. In this EDA, group counts were visualized in bar charts for the categorical features of fasting blood sugar and exercise-induced angina. Exercise-induced angina was chosen due to the inclusion and discussion of this feature in the literature (Kandula, 2025). Fasting blood sugar was chosen because this feature was included in a list of important features by motivated feature selection (Normawati and Winarti, 2018). Visualization of the groups identifies imbalance in both features. Only 43 of the 297 observations (about 14%) fall in the category of fasting blood sugar above 120 mg/dl, showing unequal distribution and a high likelihood of bias from using this variable in future analysis (Figure 5). To a lesser extent, imbalance is also identified in the counts of exercise-induced angina (Figure 6): the group with exercise-induced angina (97 observations) is approximately half the size of the group without it (200 observations). Use of these features could introduce bias in future steps of the EDA due to the unbalanced group counts.

library(dplyr)
ggplot(myheartdata, aes(x=fbs, fill=fbs)) +
  geom_bar(aes(fill=fbs), position = "dodge") +
  labs(x="Fasting Blood Sugar >120 mg/dl", y="Count")

ggplot(myheartdata, aes(x=exang, fill=exang)) +
  geom_bar(aes(fill=exang), position = "dodge") +
  labs(x="Exercise Induced Angina", y="Count")
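The degree of imbalance in the two bar charts can be expressed as percentages. The following is a minimal sketch of that arithmetic using the group counts reported by summary() earlier in this notebook. The object names are chosen here for illustration.

```r
# Counts copied from the summary() output earlier in this notebook.
n_total      <- 297
fbs_counts   <- c(`0` = 254, `1` = 43)   # fasting blood sugar > 120 mg/dl
exang_counts <- c(`0` = 200, `1` = 97)   # exercise-induced angina

# Share of each group as a percentage of all observations.
fbs_pct   <- round(100 * fbs_counts / n_total, 1)
exang_pct <- round(100 * exang_counts / n_total, 1)
fbs_pct
exang_pct

# Size of the angina group relative to the group without angina.
exang_ratio <- exang_counts[["1"]] / exang_counts[["0"]]
exang_ratio
```

On the full data frame, prop.table(table(myheartdata$fbs)) would give the same shares as proportions without copying counts by hand.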

Data visualization techniques for univariate analysis were used to investigate the characteristics of cholesterol, resting blood pressure, fasting blood sugar, and exercise-induced angina in the Cleveland Heart dataset. These variables were chosen because of the attention they received in previous literature that used open-source heart disease datasets to develop classification models. Both categorical variables were found to have unbalanced groups, which could affect model accuracy because the smaller groups contain few observations. Numerical variables were chosen based on the CDC list of important risk factors for the development of heart disease (CDC, 2024). The 4-plot data visualization technique demonstrates reduced normality in distribution for both of these variables. The data visualization also suggests a possible association between high cholesterol and the presence of heart disease; because the chol column held the row index rather than measured values when these plots were drawn, this association requires further investigation with the measured cholesterol values.

Data visualization provides insights into the distribution of the values of investigated features prior to model development. Data management requirements, such as outlier management, are identified with univariate data visualization to ensure features selected for future analysis do not introduce bias and negatively affect future model performance.

Multivariate Data Visualization

Multivariate data visualization is another essential component of the exploratory data analysis (EDA). While univariate data visualization provides information regarding the characteristics of a feature, multivariate data visualizations provide meaningful information regarding relationships and associations between features (Alghamdi et al., 2024). Combinations of variables in data visualization are required to identify which variable combinations have correlations that either require further investigation or provide essential information for future model training.

Alghamdi et al. (2024) used a variety of multivariate data visualization methods to gain perspective and understanding of the Cleveland heart disease dataset. A K-means clustering plot was compared with an agglomerative clustering plot for the variables of age and cholesterol, with presence or absence of heart disease indicated by color. By comparing two types of clustering plots, trends in the clustering of data points for age and cholesterol are identified for future analysis. Additionally, the researchers used a stacked bar chart comparing type of chest pain across the target groups of presence or absence of heart disease. Alghamdi et al. (2024) found that patients with heart disease more often had an asymptomatic presentation than typical, atypical, or non-anginal pain. Finally, Alghamdi et al. (2024) used pair plots to compare age, cholesterol, and the patient’s maximum heart rate (thalach), with color indicating presence or absence of heart disease. Results indicated clusters of data points for each variable. Inspection of color differences within the plot clusters supports identification of valuable features with trends and relationships to be included in prediction models (Alghamdi et al., 2024).

The GGally library was installed to support multivariate data visualization techniques.

install.packages("GGally", repos = "https://cloud.r-project.org")

## Installing package into 'C:/Users/benke/AppData/Local/R/win-library/4.5'
## (as 'lib' is unspecified)

## package 'GGally' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\benke\AppData\Local\Temp\RtmpodVpo4\downloaded_packages

library(GGally)

## Warning: package 'GGally' was built under R version 4.5.1

A pairs plot visualization developed with base R was used to identify relationships between numerical variables in the Cleveland Heart dataset. The pairs plot provides an opportunity to inspect associations between numerical variables that may indicate features important for classification during model development. Positive, neutral, and negative associations are visible by inspecting the shape of the point clouds. Additionally, clustered data points provide information regarding the distribution of the data for each pairing of numerical features.

nummyheartdata <- myheartdata[c(1,4:5,8,10,14)]
head(nummyheartdata)

## # A tibble: 6 × 6
##     age trestbps  chol thalach oldpeak condition
##   <dbl>    <dbl> <int>   <dbl>   <dbl>     <dbl>
## 1    69      160     1     131     0.1         0
## 2    69      140     2     151     1.8         0
## 3    66      150     3     114     2.6         0
## 4    65      138     4     174     1.4         1
## 5    64      110     5     144     1.8         0
## 6    64      170     6     155     0.6         0

Several insights are apparent in the pairs plot of numerical features, including possible associations with age. The age feature paired with cholesterol (chol) shows distinct groupings, each with a strong negative association. Because chol holds the row index at this point in the notebook rather than the measured values, these groupings describe how age varies with row order in the file, not with cholesterol. For the same reason, the stacked values near 0 across the age range of 30-60, and the stacked values in the bands of 50-100, 100-150, and 200-300, should not be read as hidden missing values or as discrete reporting categories until the plot is repeated with the measured cholesterol values.

The age and resting blood pressure (trestbps) variables appear to have a weak positive correlation, with all data points above a resting blood pressure of 175 belonging to patients over 50 years of age. Age paired with the patient’s maximum heart rate (thalach) shows a weak negative correlation, with patients 30-40 years of age able to achieve a maximum heart rate of 140 or higher while patients over 60 years of age have a maximum heart rate of 140 or lower. The scatter plot for this pair also shows an outlier of a 30-year-old patient able to achieve a maximum heart rate of 200 and an outlier of a 70-year-old patient unable to achieve a heart rate higher than approximately 75 beats per minute.

The pairing of resting blood pressure (trestbps) and the patient’s maximum heart rate (thalach) shows an area of higher plot density, without a correlation, at a maximum heart rate between 140 and 180 and a resting blood pressure of 120-160. A weak negative correlation is also visible in the pairing of cholesterol (chol) with the patient’s maximum heart rate (thalach), with a maximum heart rate of 120-160 at chol values of 0-100 and a maximum heart rate of 90-150 at chol values of 150-250. A shift to a positive correlation is seen at chol values above 250 with a maximum heart rate (thalach) of 100-200. Across all of the variables except age, chol has a uniform distribution along its axis. This is expected because chol holds the evenly spaced row index at this point in the notebook, so these patterns describe row order rather than measured cholesterol. The oldpeak feature (exercise-induced ST depression) presented with a high density of data points in the 0-2 range, with scattered values of a lower density above an oldpeak of 2 for all pair plots.

pairs(nummyheartdata)
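The strength of the associations read from a scatterplot matrix can be quantified with a correlation matrix, and the points can be colored by the target to show group differences. The following is a minimal base R sketch of both steps. The data frame demo is simulated for illustration only; its column names mirror the dataset, but its values and the relationships built into it are assumptions, not results from the Cleveland Heart data.

```r
# Simulated data for illustration only (not the Cleveland Heart values).
set.seed(42)
n <- 100
demo <- data.frame(age = round(runif(n, 29, 77)))
demo$thalach   <- round(200 - 0.9 * demo$age + rnorm(n, sd = 15))
demo$trestbps  <- round(110 + 0.4 * demo$age + rnorm(n, sd = 15))
demo$condition <- rbinom(n, 1, 0.46)

num_cols <- c("age", "trestbps", "thalach")

# Pearson correlation for every pair of numeric columns.
cor_mat <- cor(demo[num_cols], method = "pearson")
round(cor_mat, 2)

# Scatterplot matrix with points colored by the target group.
pairs(demo[num_cols],
      col = ifelse(demo$condition == 1, "salmon", "turquoise"),
      pch = 19)
```

Reporting the coefficient next to each visual impression gives a consistent basis for calling an association weak or strong.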

Although the correlations identified in the scatterplot matrix were weak, the age variable appeared to have identifiable weak correlation trends. To further investigate the relationship between the age variable and categorical variables, box plots were developed to compare age values across two categorical variables.

A box plot comparing types of chest pain with age values provides insight regarding the variability of age within each type of chest pain. The first box plot shows patients with typical anginal chest pain with a 1st quartile of 51 years of age, a median of 59 years of age, and a 3rd quartile of 64 years of age. Whiskers extend to a maximum value of 69 without outliers for the upper values and extend below to approximately 34 years of age. The whisker from the lowest age value to the 1st quartile is longer, indicating higher variability for ages 34-51 than for ages 64-69. Inspection of the second box plot indicates the highest variability across age groups for atypical anginal chest pain, with whiskers from 29 to 44 years of age for the lower values and from 56 to 74 years of age for the higher values. The interquartile range is from 44 to 56 with a median of approximately 52. The box plot for non-anginal chest pain, on the other hand, has a shorter whisker for the lower age values of 35 to 46 and a longer whisker for values of approximately 60 to 76 years of age, indicating higher variability in the older age ranges. The median is approximately 54, with the 1st quartile at 46 years of age and the 3rd quartile at 60. Finally, the last box plot, for asymptomatic chest pain, includes similar whisker ranges for lower and higher age ranges, with the 1st quartile at approximately 51 years of age, the median at 56 years of age, and the 3rd quartile at approximately 62 years of age. The variability also appears slightly smaller than in the previous box plots.

ggplot(myheartdata, aes(x = cp, y = age )) +
  geom_boxplot(fill = "salmon", color = "darkblue", width = .25, outlier.color = "turquoise", outlier.size = 2) +
  labs(
    title = "Box Plot for Chest Pain by Age",
    x = "Chest Pain: Typical, Atypical, Non-anginal, Asymptomatic",
    y = "Age"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_blank(),
    axis.ticks.x = element_blank(),
    plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
    axis.title.y = element_text(size = 14),
    axis.text.y = element_text(size = 12)
  )
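The quartile and whisker values read from the box plots can also be obtained numerically. The following is a minimal sketch using base R's boxplot.stats(); the toy data frame and its age values are hypothetical stand-ins for myheartdata and are not taken from the Cleveland Heart dataset. Note that boxplot.stats() uses Tukey's hinges, so results may differ slightly from the quartiles drawn by geom_boxplot().

```r
# Hypothetical ages for two chest pain groups (illustration only)
toy <- data.frame(
  cp  = factor(rep(c("typical", "atypical"), each = 10)),
  age = c(34, 45, 51, 55, 58, 59, 61, 64, 66, 69,
          29, 38, 44, 48, 51, 52, 54, 56, 63, 74)
)

# For each group, return the five box plot numbers
# (lower whisker, lower hinge, median, upper hinge, upper whisker)
# and any points flagged as outliers by the 1.5 * IQR rule
box_summary <- lapply(split(toy$age, toy$cp), function(a) {
  bs <- boxplot.stats(a)
  list(five = bs$stats, outliers = bs$out)
})

box_summary$typical$five
box_summary$atypical$outliers
```

With the report's data, the same call could be applied to split(myheartdata$age, myheartdata$cp) to confirm the values estimated visually.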

A box plot was also used to compare age across the absence or presence of exercise induced angina. Absence of exercise induced angina shows higher variability than presence, with longer whisker lines for both younger and older ages. For patients without exercise induced angina, the 1st quartile is at approximately 46 years of age with the lower whisker extending to 29 years of age, the median is 54 years of age, and the 3rd quartile is at approximately 61 years of age with the upper whisker extending to 76 years of age. For patients with exercise induced angina, the 1st quartile is at 51 years of age, with the lower whisker extending to approximately 38 years of age and one outlier at 35 years of age. The median is 56 years of age and the 3rd quartile is 61 years of age. The upper whisker extends to 74 with one outlier at 76 years of age. In this group, variability is higher for ages younger than the median of 56 than for ages older than the median.

ggplot(myheartdata, aes(x = exang, y = age)) +
  geom_boxplot(fill = "salmon", color = "darkblue", width = .25, outlier.color = "turquoise", outlier.size = 2) +
  labs(
    title = "Absence or Presence of Exercise Induced Angina by Age",
    x = "Absence & Presence of Exercise Induced Angina",
    y = "Age"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_blank(),
    axis.ticks.x = element_blank(),
    plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
    axis.title.y = element_text(size = 14),
    axis.text.y = element_text(size = 12)
  )

cp_exang_table <- table(myheartdata$cp, myheartdata$exang)
cp_exang_table_2 <- addmargins(cp_exang_table)
cp_exang_table

##    
##      0  1
##   0 19  4
##   1 45  4
##   2 72 11
##   3 64 78

cp_exang_table_2

##      
##         0   1 Sum
##   0    19   4  23
##   1    45   4  49
##   2    72  11  83
##   3    64  78 142
##   Sum 200  97 297

Data visualization also included comparisons of categorical variables to determine relationships. The kableExtra package was used to create a classic table of the counts of types of chest pain and exercise induced angina. From the inspection, asymptomatic presentation had the highest count of the total group of patients (142 of 297). Asymptomatic presentation also had a clearly higher count in the exercise induced angina group (78 patients) compared to the typical, atypical, and non-anginal chest pain presentations.

library(kableExtra)

cp_exang_table_2 %>%
  kbl(caption = "Table 1. Chest Pain Type and Exercise Induced Angina") %>%
  kable_classic()

Table 1. Chest Pain Type and Exercise Induced Angina
	0	1	Sum
0	19	4	23
1	45	4	49
2	72	11	83
3	64	78	142
Sum	200	97	297

A bar graph was used to further demonstrate the relationship between type of chest pain and the presence or absence of exercise induced angina. The bar graph makes apparent the large number of patients who have exercise induced angina but an asymptomatic presentation without exercise. A distribution slightly skewed to the left is seen in the group of patients without exercise induced angina.

color <- c("goldenrod", "turquoise", "darkblue", "salmon")
names <- c("Typical", "Atypical", "Non-anginal", "Asymptomatic")
barplot(cp_exang_table, col=color, beside = TRUE, main = "Absence/Presence of Exercise Induced Chest Pain & Type of Chest Pain")
legend("topleft", names, fill=color)

Additionally, a table was created to compare categorical groups of exercise induced chest pain and type of defect. From the table, a higher count is apparent for patients without a defect and without exercise induced angina compared to patients without a defect but with exercise induced angina. Groups of patients with either a fixed or reversible defect appear fairly equally distributed across presence or absence of exercise induced angina.

thal_exang_group<- table(myheartdata$thal, myheartdata$exang)
thal_exang_group_2 <- addmargins(thal_exang_group)
thal_exang_group

##    
##       0   1
##   0 133  31
##   1  10   8
##   2  57  58

thal_exang_group_2

##      
##         0   1 Sum
##   0   133  31 164
##   1    10   8  18
##   2    57  58 115
##   Sum 200  97 297

thal_exang_group_2 %>%
  kbl(caption = "Table 2. Exercise Induced Chest Pain and Type of Heart Defect") %>%
  kable_classic()

Table 2. Exercise Induced Chest Pain and Type of Heart Defect
	0	1	Sum
0	133	31	164
1	10	8	18
2	57	58	115
Sum	200	97	297

color <- c("turquoise", "darkblue", "salmon")
names <- c("Normal", "Fixed Defect", "Reversible Defect")
barplot(thal_exang_group, col=color, beside = TRUE, main = "Presence or Absence of Exercise Induced Chest Pain & Type of Defect")
legend("topright", names, fill=color)

View(myheartdata)

Discussion

Multivariate data visualization is a key component of the exploratory data analysis (EDA) used to identify important trends, relationships, and correlations between variables. Alghamdi et al. (2024) used a variety of data visualization techniques with the Cleveland Heart dataset, including but not limited to agglomerative cluster plots to compare groups of numerical variables, bar plots to compare categorical variables, and pair plots with hue for identification of key features.

Pair plots were used in this EDA to identify relationships and correlations between numerical variables. A weak negative correlation was found when comparing cholesterol with maximum heart rate achieved. This indicates that patients with higher cholesterol levels tended to achieve a lower maximum heart rate. This association implies patients with a higher cholesterol level show exercise excursion to higher heart rate levels less often, as would be expected. Additionally, the shift toward a higher maximum heart rate for patients with very high cholesterol (over 250) suggests that, for these patients, exercise may lead to rapid increases in heart rate. This may be an indicator for the presence of heart disease and requires further investigation.

In this EDA, the age variable most consistently demonstrated correlations with other numeric variables. Age and resting blood pressure presented with a weak positive correlation, indicating resting blood pressure shows a weak trend of increasing as age increases. Age also presented with a weak, negative correlation with maximum heart rate achieved, showing younger patients were more often achieving a maximum heart rate over 140 while older patients more often had a heart rate of 140 or lower. This also indicates younger patients may have higher exercise excursion.

The relationship between age and heart disease continues to receive attention due to the association of increased risk of heart disease with aging (CDC, 2024). Gracious and Shirley (2024) propose use of the ensemble learning models Random Forest and XGBoost in a web interface to increase access to heart disease prediction for the general population. Gracious and Shirley (2024) specifically highlight age as an important factor in heart disease prediction models, and the value of machine learning models in leveraging age-related physiological changes with traditional clinical data. Because of the attention to age in heart disease classification model literature, age was closely inspected in this EDA.

Age and cholesterol demonstrated stacked values forming four distinct negative correlations. Inspection of the pair plot for age and cholesterol identifies problems in the data that may require further pre-processing. The lines of stacked values indicate many identical cholesterol values and may indicate discrete groupings of cholesterol levels rather than continuous data as previously thought. Additionally, a line of stacked values at a cholesterol level of 0 across all ages may provide more evidence that hidden missing values requiring imputation are present in the cholesterol variable. The trends identified with the pair plot of age and cholesterol may also be related to the small number of observations in the Cleveland Heart dataset, as has been discussed in the literature.

Sameen et al. (2025) address the dataset size by using a merged dataset that combines the Switzerland, Hungarian, VA Long Beach, Statlog, and Cleveland datasets. The more robust and diverse merged dataset reduced the risk of biased data. Additionally, Least Absolute Shrinkage and Selection Operator (LASSO) and the Relief method were compared for feature selection, alongside data visualization techniques, to improve feature selection. Results indicated the addition of the Relief method to data visualization increased model accuracy when used with the Random Forest Bagging Method for classification of heart disease.

Additionally, presence or absence of exercise induced angina was compared with type of chest pain and type of defect. For patients with exercise induced angina, a large number of patients had an asymptomatic presentation at baseline indicating no perceivable symptoms at rest. Patients without exercise induced angina had a higher number of patients in the normal category with no defect compared to patients with exercise induced angina.

In conclusion, the multivariate data visualization in this EDA highlights the relationships between variables to identify important features for future analysis. In addition to cholesterol, important variables include age and variables related to exercise (maximum heart rate achieved and exercise induced angina). Age, cholesterol, and maximum heart rate achieved will be considered in future analysis due to the weak correlations identified. Age also varied across types of chest pain and across presence or absence of exercise induced angina. Finally, type of chest pain and type of defect varied with the presence or absence of exercise induced angina, requiring further analysis. Overall, the inclusion of multivariate data visualization was essential in identifying important variables for further analysis.

Descriptive Statistics

Descriptive statistics are an essential component of exploratory data analysis in identifying information and understanding the characteristics of the data. In this report, descriptive statistics will be reviewed for a deeper understanding of the data in the Cleveland Heart dataset. The remaining sections of this report will provide a review of descriptive statistics for both numerical and categorical features and a discussion of the insights gained.

summary(nummyheartdata)

##       age           trestbps          chol        thalach         oldpeak     
##  Min.   :29.00   Min.   : 94.0   Min.   :  1   Min.   : 71.0   Min.   :0.000  
##  1st Qu.:48.00   1st Qu.:120.0   1st Qu.: 75   1st Qu.:133.0   1st Qu.:0.000  
##  Median :56.00   Median :130.0   Median :149   Median :153.0   Median :0.800  
##  Mean   :54.54   Mean   :131.7   Mean   :149   Mean   :149.6   Mean   :1.056  
##  3rd Qu.:61.00   3rd Qu.:140.0   3rd Qu.:223   3rd Qu.:166.0   3rd Qu.:1.600  
##  Max.   :77.00   Max.   :200.0   Max.   :297   Max.   :202.0   Max.   :6.200  
##    condition     
##  Min.   :0.0000  
##  1st Qu.:0.0000  
##  Median :0.0000  
##  Mean   :0.4613  
##  3rd Qu.:1.0000  
##  Max.   :1.0000

IQR(nummyheartdata$age)

## [1] 13

IQR(nummyheartdata$trestbps)

## [1] 20

IQR(nummyheartdata$chol)

## [1] 148

IQR(nummyheartdata$thalach)

## [1] 33

IQR(nummyheartdata$oldpeak)

## [1] 1.6

IQR(nummyheartdata$condition)

## [1] 1

var(nummyheartdata, na.rm=FALSE)

##                   age    trestbps       chol     thalach    oldpeak  condition
## age         81.897716   46.693682 -150.00338  -81.917201  2.0802553  1.0261284
## trestbps    46.693682  315.517290 -123.36486  -20.011694  3.9613363  1.3614069
## chol      -150.003378 -123.364865 7375.50000 -497.918919 17.0317568 16.2263514
## thalach    -81.917201  -20.011694 -497.91892  526.315270 -9.3003003 -4.8550937
## oldpeak      2.080255    3.961336   17.03176   -9.300300  1.3598423  0.2469219
## condition    1.026128    1.361407   16.22635   -4.855094  0.2469219  0.2493402

var(nummyheartdata$age)

## [1] 81.89772

sd(nummyheartdata$age)

## [1] 9.049736

sd(nummyheartdata$trestbps)

## [1] 17.76281

sd(nummyheartdata$chol)

## [1] 85.88073

sd(nummyheartdata$thalach)

## [1] 22.94156

sd(nummyheartdata$oldpeak)

## [1] 1.166123

sd(nummyheartdata$condition)

## [1] 0.4993398

library(moments)

skewness(nummyheartdata)

##        age   trestbps       chol    thalach    oldpeak  condition 
## -0.2186630  0.6965290  0.0000000 -0.5338265  1.2408238  0.1553487

kurtosis(nummyheartdata)

##       age  trestbps      chol   thalach   oldpeak condition 
##  2.466854  3.781192  1.799973  2.928885  4.465520  1.024133

compute_stats <- function(column, name) {
  if (is.numeric(column) || is.integer(column)) {
    data.frame(
      Variable = name,
      Mean = round(mean(column, na.rm = TRUE), 2),
      Median = round(median(column, na.rm = TRUE), 2),
      St.Deviation = round(sd(column, na.rm = TRUE), 2),
      Variance = round(var(column, na.rm = TRUE), 2),
      Range = round(diff(range(column, na.rm = TRUE)), 2),
      IQR = round(IQR(column, na.rm = TRUE), 2),
      Skewness = round(skewness(column, na.rm = TRUE), 2),
      Kurtosis = round(kurtosis(column, na.rm = TRUE), 2),
      stringsAsFactors = FALSE
    )
  } else {
    NULL
  }
}

descriptive_stats <- do.call(
  rbind,
  lapply(names(nummyheartdata), function(col) compute_stats(nummyheartdata[[col]], col))
)

descriptive_stats

##    Variable   Mean Median St.Deviation Variance Range   IQR Skewness Kurtosis
## 1       age  54.54   56.0         9.05    81.90  48.0  13.0    -0.22     2.47
## 2  trestbps 131.69  130.0        17.76   315.52 106.0  20.0     0.70     3.78
## 3      chol 149.00  149.0        85.88  7375.50 296.0 148.0     0.00     1.80
## 4   thalach 149.60  153.0        22.94   526.32 131.0  33.0    -0.53     2.93
## 5   oldpeak   1.06    0.8         1.17     1.36   6.2   1.6     1.24     4.47
## 6 condition   0.46    0.0         0.50     0.25   1.0   1.0     0.16     1.02

descriptive_stats %>%
  kbl(caption = " Table 1. Descriptive Statistics") %>%
  kable_classic()

Table 1. Descriptive Statistics
Variable	Mean	Median	St.Deviation	Variance	Range	IQR	Skewness	Kurtosis
age	54.54	56.0	9.05	81.90	48.0	13.0	-0.22	2.47
trestbps	131.69	130.0	17.76	315.52	106.0	20.0	0.70	3.78
chol	149.00	149.0	85.88	7375.50	296.0	148.0	0.00	1.80
thalach	149.60	153.0	22.94	526.32	131.0	33.0	-0.53	2.93
oldpeak	1.06	0.8	1.17	1.36	6.2	1.6	1.24	4.47
condition	0.46	0.0	0.50	0.25	1.0	1.0	0.16	1.02

Descriptive Statistics of Numerical Variables

In this EDA, descriptive statistics were derived for numeric variables (Table 1). Approximately symmetrical variables included resting blood pressure (trestbps) and maximum heart rate (thalach). The resting blood pressure (trestbps) data indicate the presence of high variability, with an IQR of 20 but a range of 106. Skewness is slightly positive at 0.70 with kurtosis at 3.78. Maximum heart rate achieved (thalach) data also indicate variability, with negative skewness of -0.53, which is also reflected in the lower mean as compared to the median, and a kurtosis of 2.93. This also indicates high variance within the symmetry. The cholesterol variable has the highest variability, with a standard deviation of 85.88, which is over 50% of the mean and median values.
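The summary values reported for chol (minimum 1, maximum 297, mean and median both 149, variance 7375.5, skewness 0) are the same values produced by the integers 1 through 297, one per observation. This is consistent with, though it does not prove, the column holding ranks or factor codes rather than measured cholesterol values. The following is a minimal sketch of a check; looks_like_index() is a hypothetical helper and is not part of the original analysis.

```r
# Hypothetical helper: TRUE when the non-missing values of x are
# exactly the integers 1..n in some order
looks_like_index <- function(x) {
  x <- x[!is.na(x)]
  n <- length(x)
  n > 0 && all(sort(x) == seq_len(n))
}

# The integers 1..297 reproduce the variance reported for chol
var(1:297)

# A shuffled index is flagged; a few plausible cholesterol readings
# (illustrative values only) are not
set.seed(123)
looks_like_index(sample(1:297))
looks_like_index(c(233, 250, 204, 236))
```

Running looks_like_index(nummyheartdata$chol) on the report's data would indicate whether the cholesterol column should be re-imported before further analysis.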

Variance was also calculated for numeric variables, with the highest variance for cholesterol (chol), high variance for resting blood pressure (trestbps) and maximum heart rate achieved (thalach), moderate variance for age, and the lowest variance for exercise induced ST depression compared to rest state (oldpeak).

Across all the numeric variable groups for this dataset, the small number of observations is apparent in the asymmetrical distribution and high variability. Except for the target variable (presence or absence of heart disease), the mean and median values are similar across the variables. Age and maximum heart rate achieved (thalach) both have negative skewness, indicating a tail of values lower than the first quartile affecting the shape of the data. The ST depression with exercise compared to ST depression at rest (oldpeak), a specific measurement of the electrocardiogram (ECG), has a positive skewness as well as high kurtosis, indicating most values are concentrated at the low end with a long tail of high values. The condition variable is misleading in this table, as condition would be more accurately described as a categorical variable of presence or absence of heart disease; it was maintained as a numeric value only for an inspection of the mean to ensure approximately equal counts across both categories.
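The skewness and kurtosis values discussed above can be reproduced without the moments package. The following is a minimal sketch assuming the moment-based definitions (third and fourth central moments divided by powers of the second), under which a normal distribution has a kurtosis of 3 rather than 0. The function names and the oldpeak-like values are hypothetical.

```r
# Moment-based skewness: m3 / m2^(3/2)
moment_skewness <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Moment-based (non-excess) kurtosis: m4 / m2^2
moment_kurtosis <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2
}

# Illustrative values with many small readings and one large one:
# positive skewness reflects the long tail toward high values
oldpeak_like <- c(0, 0, 0, 0.2, 0.8, 1.0, 1.6, 6.2)
moment_skewness(oldpeak_like)
moment_kurtosis(oldpeak_like)
```

Under this definition, values well above 3 point to heavier tails than a normal distribution and values well below 3 point to a flatter shape.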

catmyheartdata <- myheartdata[c(2,3,6:7,9,11:13)]
head(catmyheartdata)

## # A tibble: 6 × 8
##   sex   cp    fbs   restecg exang slope ca    thal 
##   <fct> <fct> <fct> <fct>   <fct> <fct> <fct> <fct>
## 1 1     0     1     2       0     1     1     0    
## 2 0     0     0     0       0     0     2     0    
## 3 0     0     0     0       0     2     0     0    
## 4 1     0     1     2       0     1     1     0    
## 5 1     0     0     2       1     1     0     0    
## 6 1     0     0     2       0     1     0     2

library(plyr)

Descriptive Statistics of Categorical Variables

Correlations of categorical variables were also investigated with the use of contingency tables. The categorical variables of gender (sex), type of chest pain (cp), type of defect (thal) and presence or absence of exercise induced angina (exang) were compared. Goel et al. (2023) investigated the performance of classification models using the Cleveland Heart dataset, using a heat map to identify features most highly correlated with presence or absence of heart disease. Type of chest pain (cp), type of defect (thal) and presence or absence of exercise induced angina (exang) were listed among the six most highly correlated features and were also chosen for investigation in this EDA (Goel et al., 2023).

Inspection of the contingency tables reveals information about the distribution across groups and relationships between categorical variables. A higher percentage of male patients experienced exercise induced angina (37.3%) compared to female patients (22.9%). When comparing exercise induced angina with angina type, only the patients asymptomatic at baseline had a higher count in presence as compared to absence of exercise induced angina. Typical, Atypical, and Non-anginal pain at baseline had comparatively lower counts in the group with exercise induced angina than without. When comparing presence or absence of exercise induced angina to defect type, the groups with defects (reversible and fixed) had similar counts across presence and absence of exercise induced angina, whereas in the Normal group (no defect) only 18.9% of patients had exercise induced angina.
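The percentages quoted above follow directly from the column totals of the contingency tables. As a minimal sketch, the counts from Tables 3 and 5 can be entered by hand and converted with prop.table(); the object names here are illustrative and differ from those used in the report's code.

```r
# Counts copied from Table 3 (exercise induced angina by gender)
exang_by_sex <- matrix(
  c(74, 22, 126, 75), nrow = 2,
  dimnames = list(exang = c("Absent", "Present"), sex = c("Female", "Male"))
)

# Counts copied from Table 5 (exercise induced angina by defect type)
exang_by_thal <- matrix(
  c(133, 31, 10, 8, 57, 58), nrow = 2,
  dimnames = list(exang = c("Absent", "Present"),
                  thal = c("Normal", "Fixed Defect", "Reversible Defect"))
)

# margin = 2 divides each cell by its column total
pct_sex  <- round(100 * prop.table(exang_by_sex, margin = 2), 1)
pct_thal <- round(100 * prop.table(exang_by_thal, margin = 2), 1)

pct_sex["Present", ]
pct_thal["Present", ]
```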

Pearson’s Chi-squared tests were calculated for each of these contingency tables as well. A p-value of 0.019 was found for the comparison of gender with exercise induced angina, a p-value of 2.153e-13 was found for the comparison of chest pain with exercise induced angina, and a p-value of 1.262e-07 was found for the comparison between exercise induced angina and type of defect. All three comparisons were found to have statistically significant associations, with type of defect and chest pain having the highest significance with exercise induced angina.
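A p-value shows whether an association is likely to be present but not how strong it is. The following is a minimal sketch that re-runs the chest pain test from the counts in Table 4, checks the expected cell counts, and adds Cramér's V as an effect size. Cramér's V is an addition for illustration and was not part of the original analysis.

```r
# Counts copied from Table 4 (exercise induced angina by angina type)
exang_by_cp <- matrix(
  c(19, 4, 45, 4, 72, 11, 64, 78), nrow = 2,
  dimnames = list(exang = c("Absent", "Present"),
                  cp = c("Typical", "Atypical", "Non-anginal", "Asymptomatic"))
)

chi <- chisq.test(exang_by_cp)

# The chi-squared approximation is usually considered adequate
# when expected counts are about 5 or more
min_expected <- min(chi$expected)

# Cramer's V = sqrt(X^2 / (n * (min(rows, cols) - 1)))
n <- sum(exang_by_cp)
cramers_v <- sqrt(unname(chi$statistic) / (n * (min(dim(exang_by_cp)) - 1)))

chi$statistic
min_expected
cramers_v
```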

summary(catmyheartdata)

##  sex     cp      fbs     restecg exang   slope   ca      thal   
##  0: 96   0: 23   0:254   0:147   0:200   0:139   0:174   0:164  
##  1:201   1: 49   1: 43   1:  4   1: 97   1:137   1: 65   1: 18  
##          2: 83           2:146           2: 21   2: 38   2:115  
##          3:142                                   3: 20

sex_cp <- table(catmyheartdata$exang, catmyheartdata$sex)
sex_cp_margins <- addmargins(sex_cp)
colnames(sex_cp_margins) <- c("Female", "Male", "Total")
rownames(sex_cp_margins) <- c("Absent", "Present", "Total")
sex_cp_margins %>%
  kbl(caption = "Table 3. Crosstab of Presence or Absence of Exercise Induced Angina by Gender") %>%
  kable_classic()

Table 3. Crosstab of Presence or Absence of Exercise Induced Angina by Gender
	Female	Male	Total
Absent	74	126	200
Present	22	75	97
Total	96	201	297

sex_cp_chi <- chisq.test(sex_cp)
sex_cp_chi

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  sex_cp
## X-squared = 5.4857, df = 1, p-value = 0.01917

cp_exang <- table(catmyheartdata$exang, catmyheartdata$cp)
cp_exang_margins <- addmargins(cp_exang)
colnames(cp_exang_margins) <- c("Typical", "Atypical", "Non-anginal", "Asymptomatic", "Total")
rownames(cp_exang_margins) <- c("Absent", "Present", "Total")
cp_exang_margins %>%
  kbl(caption = "Table 4. Crosstab of Presence or Absence of Exercise Induced Angina by Angina Type") %>%
  kable_classic()

Table 4. Crosstab of Presence or Absence of Exercise Induced Angina by Angina Type
	Typical	Atypical	Non-anginal	Asymptomatic	Total
Absent	19	45	72	64	200
Present	4	4	11	78	97
Total	23	49	83	142	297

cp_exang_chi <- chisq.test(cp_exang)
cp_exang_chi

## 
##  Pearson's Chi-squared test
## 
## data:  cp_exang
## X-squared = 62.042, df = 3, p-value = 2.153e-13

thal_exang <- table(catmyheartdata$exang, catmyheartdata$thal)
thal_exang_margins <- addmargins(thal_exang)
colnames(thal_exang_margins) <- c("Normal", "Fixed Defect", "Reversible Defect", "Total")
rownames(thal_exang_margins) <- c("Absent", "Present", "Total")
thal_exang_margins %>%
  kbl(caption = "Table 5. Crosstab of Presence or Absence of Exercise Induced Angina by Defect Type") %>%
  kable_classic()

Table 5. Crosstab of Presence or Absence of Exercise Induced Angina by Defect Type
	Normal	Fixed Defect	Reversible Defect	Total
Absent	133	10	57	200
Present	31	8	58	97
Total	164	18	115	297

thal_exang_chi <- chisq.test(thal_exang)
thal_exang_chi

## 
##  Pearson's Chi-squared test
## 
## data:  thal_exang
## X-squared = 31.77, df = 2, p-value = 1.262e-07

The correlation between numeric variables was also calculated and is represented in table form (Table 2) and in a correlation matrix heat map (Figure 1). No correlations above 0.5 were found between numeric variables. The strongest negative correlations were found between age and maximum heart rate achieved (-0.39), maximum heart rate achieved and the target condition (-0.42), and maximum heart rate achieved and exercise induced ST depression (-0.35). The strongest positive correlations were found between cholesterol and the target condition (0.38) and between exercise induced ST depression and the target condition (0.42).
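The strongest pairs can be pulled out of a correlation matrix programmatically rather than read off by eye. The following is a minimal sketch in which the matrix is typed in from the printed correlations, rounded to four decimals, so that it runs on its own; with the report's objects, corr_myheartdata could be used in place of corr.

```r
vars <- c("age", "trestbps", "chol", "thalach", "oldpeak", "condition")

# Values copied from the printed correlation matrix (rounded)
corr <- matrix(c(
   1.0000,  0.2905, -0.1930, -0.3946,  0.1971,  0.2271,
   0.2905,  1.0000, -0.0809, -0.0491,  0.1912,  0.1535,
  -0.1930, -0.0809,  1.0000, -0.2527,  0.1701,  0.3784,
  -0.3946, -0.0491, -0.2527,  1.0000, -0.3476, -0.4238,
   0.1971,  0.1912,  0.1701, -0.3476,  1.0000,  0.4241,
   0.2271,  0.1535,  0.3784, -0.4238,  0.4241,  1.0000
), nrow = 6, byrow = TRUE, dimnames = list(vars, vars))

# Keep each pair once by taking the upper triangle only
idx <- which(upper.tri(corr), arr.ind = TRUE)
ranked <- data.frame(
  var1 = vars[idx[, 1]],
  var2 = vars[idx[, 2]],
  r    = corr[idx]
)

# Sort by absolute strength so negative and positive pairs compete equally
ranked <- ranked[order(-abs(ranked$r)), ]
head(ranked, 5)
```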

corr_myheartdata <- cor(nummyheartdata)
corr_myheartdata

##                  age    trestbps        chol     thalach    oldpeak  condition
## age        1.0000000  0.29047626 -0.19300538 -0.39456288  0.1971226  0.2270752
## trestbps   0.2904763  1.00000000 -0.08086939 -0.04910766  0.1912431  0.1534900
## chol      -0.1930054 -0.08086939  1.00000000 -0.25272017  0.1700668  0.3783807
## thalach   -0.3945629 -0.04910766 -0.25272017  1.00000000 -0.3476400 -0.4238171
## oldpeak    0.1971226  0.19124314  0.17006676 -0.34763997  1.0000000  0.4240521
## condition  0.2270752  0.15349003  0.37838067 -0.42381706  0.4240521  1.0000000

corr_myheartdata %>%
  kbl(caption = "Table 2. Correlations of Numerical Variables") %>%
  kable_classic()

Table 2. Correlations of Numerical Variables
	age	trestbps	chol	thalach	oldpeak	condition
age	1.0000000	0.2904763	-0.1930054	-0.3945629	0.1971226	0.2270752
trestbps	0.2904763	1.0000000	-0.0808694	-0.0491077	0.1912431	0.1534900
chol	-0.1930054	-0.0808694	1.0000000	-0.2527202	0.1700668	0.3783807
thalach	-0.3945629	-0.0491077	-0.2527202	1.0000000	-0.3476400	-0.4238171
oldpeak	0.1971226	0.1912431	0.1700668	-0.3476400	1.0000000	0.4240521
condition	0.2270752	0.1534900	0.3783807	-0.4238171	0.4240521	1.0000000

image(1:nrow(corr_myheartdata), 1:ncol(corr_myheartdata), 
      t(corr_myheartdata[nrow(corr_myheartdata):1, ]),
      xlab = "", ylab = "",
      main = "Figure 1. Correlation Matrix Heat Map", col = heat.colors(10))

Importance of Statistics and Visualizations

In an EDA, data visualizations provide an opportunity to rapidly view the general data characteristics, compare variables, compare groups, and identify outliers. Visualizations can then identify areas which require further management, such as adjustment or inspection of outliers, to reduce the risk of bias in the data and increase generalizability of findings. Once understanding has been gained through data visualization, further investigation of the relationships between variables is required through statistical analysis. Measures of central tendency, variability, and shape of the distribution in variables provide essential information about the accuracy of the dataset in representing the population sampled.

Cai et al. (2024) completed a systematic review of literature pertaining to prediction models for heart disease using electronic medical records to assess replicability of published articles by using an independent validation score (IVS). The researchers used the prediction risk of bias assessment tool (PROBAST) to assess risk of bias across 79 studies using AI in prediction models for heart disease. The researchers found all models to have a risk of bias as determined by PROBAST. The bias risk related to four general trends, including: 1) data concerns, including inappropriate data sources, and a lack of global populations in datasets, 2) lack of reported detail regarding the predictors, 3) reduced clarity in the definition of the outcome (target variable), such as a different definition of “heart disease” across studies, and 4) inappropriate evaluation of model performance and overfitting (Cai et al., 2024).

With the risk of bias high in machine learning models for heart disease prediction, a thorough understanding of the data is essential. Inconsistencies between the target population and the sample characteristics, unequal group distribution, skewness, and kurtosis provide an understanding of the generalizability of the data to the target population. When small sample sizes are used, such as the dataset used for this EDA, results require replication, with repetition on larger or more representative datasets to confirm generalizability of results.

Summary of EDA Findings of Descriptive Statistics

Despite the small number of observations in the dataset used for this EDA, some important findings can be reviewed from the statistical analysis of this dataset. Age, resting blood pressure, and maximum heart rate achieved have relatively low skewness and the distributions closest to normal, suggesting these variables approximately follow a normal distribution and may be more representative of the general population. Cholesterol had the highest variability in both variance and standard deviation, indicating a wide range of values across all observations. Age and maximum heart rate achieved were negatively skewed, with a tail of lower values reflecting a smaller group of younger patients and of patients achieving lower maximum heart rates.

Lower ages and absence of heart disease correlated to higher maximum heart rate achieved as indicated by the negative correlation. The variables of cholesterol and the ST depression with exercise were positively correlated indicating higher cholesterol numbers may also have a relationship with a higher ST depression. Patients with exercise induced angina more frequently presented as asymptomatic (no chest pain) at baseline. Patients without exercise induced angina had a normal heart structure (no defect) more often than patients with exercise induced angina. Therefore, from the descriptive statistics, there appears to be a relationship between cholesterol, type of defect, and ST depression during exercise with the condition of heart disease. Additionally, the absence of exercise induced angina is more often associated with normal heart structure.

Conclusions of Descriptive Statistics

In conclusion, an explanation of descriptive statistics is essential for a thorough understanding of the normality and shape of variables in a dataset used for EDA. Measures of central tendency provide a single value to represent the variable, while measures of variability provide information regarding the comparability of the data to the target population. Measures of the shape of the data provide information regarding the risk of bias. Calculations regarding the correlations and statistical significance also provide valuable information regarding important variables to be chosen in the development of machine learning models. Taken together, these descriptive statistics provide opportunities to identify important features to be used in machine learning algorithms. Future work on this EDA will include division of the dataset into training and testing sets and feature engineering.

Part 6: Transformations and Scaling

In this section of the EDA, transformation of numeric data for normalization and min-max scaling was used to increase the ease of comparing values across the different types of numerical variables.

Preprocessing with scaling has received distinct attention in the literature surrounding classification models of heart disease due to the value differences across numeric variables. Tseng et al. (2025) specifically addressed preprocessing methods by comparing the sequences of scaling and balancing. Both data scaling and balancing were identified as particularly important in the analysis of electronic health record data. Balancing was described as an important consideration in health data due to the frequent unbalanced counts in disease versus non-disease classes. The authors compared scaling prior to balancing with balancing prior to scaling for two heart disease datasets: the Cleveland Heart dataset, with a balanced count for diseased and non-diseased classes, and the Heart Failure dataset, with unbalanced counts for diseased and non-diseased classes. Tseng et al. (2025) used multiple scaling processes, including Standardisation, MinMaxScaler, MaxAbsScaler, and RobustScaler. Data balancing was completed using both up-sampling techniques (ROS, SMOTE, Borderline-SMOTE, ADASYN) and down-sampling techniques (CCS, RUS, CUS, NearMiss). The researchers found an advantage in model performance for balancing prior to standardization for the dataset with unbalanced class counts and recommended balancing prior to standardization in these scenarios (Tseng et al., 2025).
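The balancing-then-scaling order can be illustrated with a small example. The following is a minimal sketch of random over-sampling (ROS) followed by standardization in base R; the toy data frame is hypothetical and deliberately unbalanced, and the code does not attempt to reproduce the exact procedure of Tseng et al. (2025). In practice, balancing would typically be applied to the training set only.

```r
set.seed(123)

# Hypothetical unbalanced data: 7 non-disease rows, 3 disease rows
toy <- data.frame(
  thalach   = c(150, 162, 125, 143, 130, 171, 118, 160, 155, 140),
  condition = c(0, 0, 0, 0, 0, 0, 0, 1, 1, 1)
)

# Step 1: random over-sampling. Resample minority-class rows with
# replacement until both classes have the same count.
counts        <- table(toy$condition)
minority      <- names(counts)[which.min(counts)]
n_extra       <- max(counts) - min(counts)
minority_rows <- which(as.character(toy$condition) == minority)
extra_rows    <- minority_rows[sample.int(length(minority_rows), n_extra,
                                          replace = TRUE)]
balanced      <- toy[c(seq_len(nrow(toy)), extra_rows), ]

# Step 2: standardize (mean 0, standard deviation 1) after balancing
balanced$thalach_z <- as.numeric(scale(balanced$thalach))

table(balanced$condition)
```

Because the Cleveland Heart dataset used in this EDA is already close to balanced, the balancing step is shown only to illustrate the ordering discussed above.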

Manish and Perumal (2025) also address data pre-processing. The researchers highlight the use of the Min Max Normalized Data Scaling (MMNDS) algorithm for preprocessing the data in the Cleveland Heart dataset to identify noise and missing values. The authors used the MMNDS algorithm for mean and median imputation of missing values. They also used the MMNDS algorithm to normalize values to a range between 0 and 1 to increase consistency in values. In addition to normalization, the authors used a Genetic-Based Crow Search algorithm for feature selection. Following preprocessing and feature selection, the authors used support vector machine (SVM), naive Bayes (NB), k-nearest neighbors (KNN), and decision tree (DT) models, concluding that optimal preprocessing and feature selection increase model performance across a variety of models (Manish and Perumal, 2025).
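Normalizing values to a range between 0 and 1 uses the formula (x - min) / (max - min). The following is a minimal sketch of plain min-max scaling in base R, with the minimum and maximum learned from training values and reused for test values; it is not an implementation of the MMNDS algorithm, and the function names and blood pressure values are hypothetical.

```r
# Learn the scaling parameters from the training values only
min_max_fit <- function(x) {
  list(min = min(x, na.rm = TRUE), max = max(x, na.rm = TRUE))
}

# Apply previously learned parameters to any vector
min_max_apply <- function(x, params) {
  (x - params$min) / (params$max - params$min)
}

# Illustrative resting blood pressure values
train_bp <- c(94, 120, 130, 140, 200)
test_bp  <- c(110, 150, 210)

params       <- min_max_fit(train_bp)
train_scaled <- min_max_apply(train_bp, params)
test_scaled  <- min_max_apply(test_bp, params)

range(train_scaled)
test_scaled
```

Reusing the training parameters keeps information from the test set out of preprocessing, at the cost that test values outside the training range can fall below 0 or above 1.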

First, rsample was loaded for use.

library(rsample)

## 
## Attaching package: 'rsample'

## The following object is masked from 'package:Rcpp':
## 
##     populate

For this EDA using the Cleveland Heart dataset, rsample was used with set.seed(123) for a reproducible split into training and testing data sets in preparation for classification models. Following division of the data, summary statistics were inspected to determine the need for transformation or normalization prior to use in machine learning models.

set.seed(123)
myheartdata_split <- initial_split(myheartdata, prop = .80)

train_myheartdata <- training(myheartdata_split)
test_myheartdata <- testing(myheartdata_split)

head(train_myheartdata)

## # A tibble: 6 × 14
##     age sex   cp    trestbps  chol fbs   restecg thalach exang oldpeak slope
##   <dbl> <fct> <fct>    <dbl> <int> <fct> <fct>     <dbl> <fct>   <dbl> <fct>
## 1    64 0     3          180   179 0     0           154 1         0   0    
## 2    58 0     0          150    14 1     2           162 0         1   0    
## 3    61 1     3          138   195 0     2           125 1         3.6 1    
## 4    52 1     2          172   118 1     0           162 0         0.5 0    
## 5    57 1     3          110   229 0     0           143 1         3   1    
## 6    55 0     3          128   244 0     1           130 1         2   1    
## # ℹ 3 more variables: ca <fct>, thal <fct>, condition <dbl>
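
The call to initial_split() above samples rows at random without stratifying on the target. As a minimal sketch of a stratified alternative, the base R code below draws 80% of the rows within each class. The toy data frame and the helper stratified_split() are hypothetical and are not part of this EDA; rsample's initial_split() also accepts a strata argument that serves a similar purpose.

```r
# Hypothetical toy data standing in for myheartdata: 60 rows with a binary target.
set.seed(123)
toy <- data.frame(
  age       = sample(29:77, 60, replace = TRUE),
  condition = rep(c(0, 1), times = c(36, 24))
)

# Draw a share of the row indices within each class so that both partitions
# keep roughly the same class proportions as the full data.
stratified_split <- function(df, target, prop = 0.80) {
  idx_by_class <- split(seq_len(nrow(df)), df[[target]])
  train_idx <- unlist(lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), size = floor(prop * length(idx)))]
  }), use.names = FALSE)
  list(train = df[train_idx, , drop = FALSE],
       test  = df[-train_idx, , drop = FALSE])
}

parts <- stratified_split(toy, "condition")

prop.table(table(toy$condition))          # class shares in the full data
prop.table(table(parts$train$condition))  # class shares in the training set
prop.table(table(parts$test$condition))   # class shares in the test set
```

Stratifying matters most when classes are unbalanced; with classes as close to balanced as in this dataset, a simple random split is usually adequate.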

summary(train_myheartdata)

##       age        sex     cp         trestbps          chol       fbs    
##  Min.   :29.00   0: 74   0: 18   Min.   : 94.0   Min.   :  1.0   0:201  
##  1st Qu.:47.00   1:163   1: 41   1st Qu.:120.0   1st Qu.: 73.0   1: 36  
##  Median :56.00           2: 65   Median :130.0   Median :148.0          
##  Mean   :54.63           3:113   Mean   :131.9   Mean   :148.1          
##  3rd Qu.:61.00                   3rd Qu.:140.0   3rd Qu.:222.0          
##  Max.   :77.00                   Max.   :200.0   Max.   :297.0          
##  restecg    thalach      exang      oldpeak      slope   ca      thal   
##  0:118   Min.   : 71.0   0:160   Min.   :0.000   0:116   0:133   0:130  
##  1:  3   1st Qu.:137.0   1: 77   1st Qu.:0.000   1:105   1: 55   1: 14  
##  2:116   Median :152.0           Median :0.800   2: 16   2: 32   2: 93  
##          Mean   :149.8           Mean   :1.051           3: 17          
##          3rd Qu.:165.0           3rd Qu.:1.600                          
##          Max.   :202.0           Max.   :6.200                          
##    condition     
##  Min.   :0.0000  
##  1st Qu.:0.0000  
##  Median :0.0000  
##  Mean   :0.4726  
##  3rd Qu.:1.0000  
##  Max.   :1.0000

# Keep the numeric columns by position: age, trestbps, chol, thalach, oldpeak, condition
num_train_myheartdata <- train_myheartdata[c(1,4:5,8,10,14)]
head(num_train_myheartdata)

## # A tibble: 6 × 6
##     age trestbps  chol thalach oldpeak condition
##   <dbl>    <dbl> <int>   <dbl>   <dbl>     <dbl>
## 1    64      180   179     154     0           0
## 2    58      150    14     162     1           0
## 3    61      138   195     125     3.6         1
## 4    52      172   118     162     0.5         0
## 5    57      110   229     143     3           1
## 6    55      128   244     130     2           1

head(num_train_myheartdata) %>%
  kbl(caption = "Table 1. Head of Numeric Variables of the Training Set") %>%
  kable_classic()

Table 1. Head of Numeric Variables of the Training Set
age	trestbps	chol	thalach	oldpeak	condition
64	180	179	154	0.0	0
58	150	14	162	1.0	0
61	138	195	125	3.6	1
52	172	118	162	0.5	0
57	110	229	143	3.0	1
55	128	244	130	2.0	1

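# Returns a one-row data frame of descriptive statistics for a numeric column
# and NULL for any other column type. skewness() and kurtosis() are not base R
# functions and must come from a package loaded earlier.
compute_stats <- function(column, name) {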
  if (is.numeric(column) || is.integer(column)) {
    data.frame(
      Variable = name,
      Mean = round(mean(column, na.rm = TRUE), 2),
      Median = round(median(column, na.rm = TRUE), 2),
      St.Deviation = round(sd(column, na.rm = TRUE), 2),
      Variance = round(var(column, na.rm = TRUE), 2),
      Range = round(diff(range(column, na.rm = TRUE)), 2),
      IQR = round(IQR(column, na.rm = TRUE), 2),
      Skewness = round(skewness(column, na.rm = TRUE), 2),
      Kurtosis = round(kurtosis(column, na.rm = TRUE), 2),
      stringsAsFactors = FALSE
    )
  } else {
    NULL
  }
}

train_descriptive_stats <- do.call(
  rbind,
  lapply(names(num_train_myheartdata), function(col) compute_stats(num_train_myheartdata[[col]], col))
)

train_descriptive_stats

##    Variable   Mean Median St.Deviation Variance Range   IQR Skewness Kurtosis
## 1       age  54.63   56.0         9.28    86.11  48.0  14.0    -0.22     2.46
## 2  trestbps 131.92  130.0        17.71   313.77 106.0  20.0     0.80     4.01
## 3      chol 148.09  148.0        86.55  7491.18 296.0 149.0     0.03     1.79
## 4   thalach 149.77  152.0        22.65   513.07 131.0  28.0    -0.53     3.07
## 5   oldpeak   1.05    0.8         1.18     1.39   6.2   1.6     1.33     4.80
## 6 condition   0.47    0.0         0.50     0.25   1.0   1.0     0.11     1.01

train_descriptive_stats %>%
  kbl(caption = "Table 2. Descriptive Statistics of Training Set") %>%
  kable_classic()

Table 2. Descriptive Statistics of Training Set
Variable	Mean	Median	St.Deviation	Variance	Range	IQR	Skewness	Kurtosis
age	54.63	56.0	9.28	86.11	48.0	14.0	-0.22	2.46
trestbps	131.92	130.0	17.71	313.77	106.0	20.0	0.80	4.01
chol	148.09	148.0	86.55	7491.18	296.0	149.0	0.03	1.79
thalach	149.77	152.0	22.65	513.07	131.0	28.0	-0.53	3.07
oldpeak	1.05	0.8	1.18	1.39	6.2	1.6	1.33	4.80
condition	0.47	0.0	0.50	0.25	1.0	1.0	0.11	1.01
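
The compute_stats() function relies on skewness() and kurtosis(), whose source package is not shown in this section. The values in Table 2 (a kurtosis near 1 for the nearly balanced binary condition variable and near 3 for thalach) are consistent with moment-based, non-excess definitions such as those provided by the moments package; that attribution is an assumption. The sketch below writes those definitions out in base R, with the hypothetical helper names moment_skewness() and moment_kurtosis().

```r
# Moment-based skewness: third central moment divided by the
# second central moment raised to the power 3/2.
moment_skewness <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Moment-based (non-excess) kurtosis: fourth central moment divided by the
# squared second central moment. A normal distribution gives a value near 3.
moment_kurtosis <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2
}

# Hypothetical check values (not taken from the dataset).
symmetric <- c(1, 2, 3, 4, 5)
binary    <- c(0, 0, 1, 1)

moment_skewness(symmetric)  # 0 for a symmetric sample
moment_kurtosis(binary)     # 1 for a balanced two-valued sample
```

Under this definition, kurtosis is compared against 3 rather than 0 when judging tail weight.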

Inspection of the numeric scales, summary statistics, and descriptive statistics (Table 2) of the training set indicated that data management was required for several reasons. The first rows of the numeric training variables, displayed with the head() function in R (Table 1), showed oldpeak on a scale of small values while the remaining variables, other than the target variable, were on scales of much larger values. The ranges also differed across variables. For example, the range of oldpeak was 6.2, while the other health measures had wider ranges: 48 for age, 106 for resting blood pressure, 296 for cholesterol, and 131 for maximum heart rate achieved. Cholesterol (chol) had the highest interquartile range (IQR) at 149.0, with a minimum value of 1 and a maximum value of 297. Resting blood pressure and maximum heart rate achieved had IQRs of 20 and 28, respectively. Finally, oldpeak had a low IQR of 1.6.

Variance also differed across numeric variables. The highest variance was observed in cholesterol (chol), at 7491.18, far above the variance of the other quantitative variables. Maximum heart rate achieved (thalach) had a variance of 513.07 and resting blood pressure (trestbps) had a variance of 313.77. Age had a variance of 86.11. Finally, oldpeak had a variance of only 1.39, likely due to the small values for this variable.

Because variability differed so widely across the quantitative variables, the numeric data were transformed (Table 3) to bring values closer together for comparison. A logarithmic transformation was chosen because of the difference in variance between cholesterol and the remaining numeric variables. Additionally, age and maximum heart rate achieved (thalach) were negatively skewed, while resting blood pressure (trestbps) and oldpeak were positively skewed. To complete the logarithmic transformation, the log() function was applied first to the numeric variables of the training set and subsequently to the numeric variables of the testing set. The logarithmic transformation brought most values closer together across the numeric variables, although zero values became –Inf (Table 3).

transformed_num_train_myheartdata <- log(num_train_myheartdata)
head(transformed_num_train_myheartdata)

## # A tibble: 6 × 6
##     age trestbps  chol thalach  oldpeak condition
##   <dbl>    <dbl> <dbl>   <dbl>    <dbl>     <dbl>
## 1  4.16     5.19  5.19    5.04 -Inf          -Inf
## 2  4.06     5.01  2.64    5.09    0          -Inf
## 3  4.11     4.93  5.27    4.83    1.28          0
## 4  3.95     5.15  4.77    5.09   -0.693      -Inf
## 5  4.04     4.70  5.43    4.96    1.10          0
## 6  4.01     4.85  5.50    4.87    0.693         0

head(transformed_num_train_myheartdata) %>%
  kbl(caption = "Table 3. Head of Transformed Numeric Variables of Training Set") %>%
  kable_classic()

Table 3. Head of Transformed Numeric Variables of Training Set
age	trestbps	chol	thalach	oldpeak	condition
4.158883	5.192957	5.187386	5.036953	-Inf	-Inf
4.060443	5.010635	2.639057	5.087596	0.0000000	-Inf
4.110874	4.927254	5.273000	4.828314	1.2809338	0
3.951244	5.147494	4.770685	5.087596	-0.6931472	-Inf
4.043051	4.700480	5.433722	4.962845	1.0986123	0
4.007333	4.852030	5.497168	4.867534	0.6931472	0
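
The –Inf entries in Table 3 arise because log(0) is –Inf. One common alternative, which was not used in this EDA, is log1p(), which computes log(1 + x) and keeps zeros finite. The sketch below uses the six oldpeak and condition values shown in Table 1; the toy data frame and the choice to leave the target untransformed are assumptions made for illustration.

```r
# The six oldpeak values shown in Table 1.
oldpeak <- c(0, 1, 3.6, 0.5, 3, 2)

log(oldpeak)    # the zero becomes -Inf
log1p(oldpeak)  # log(1 + x): the zero stays at 0 and all values are finite

# Transform only predictors; leave a 0/1 target such as condition untouched.
toy <- data.frame(oldpeak = oldpeak, condition = c(0, 0, 1, 0, 1, 1))
predictors <- setdiff(names(toy), "condition")
toy[predictors] <- lapply(toy[predictors], log1p)
toy
```

Keeping the target out of the transformation preserves its 0/1 coding for the classification models.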

num_test_myheartdata <- test_myheartdata[c(1,4:5,8,10,14)]
head(num_test_myheartdata)

## # A tibble: 6 × 6
##     age trestbps  chol thalach oldpeak condition
##   <dbl>    <dbl> <int>   <dbl>   <dbl>     <dbl>
## 1    69      140     2     151     1.8         0
## 2    66      150     3     114     2.6         0
## 3    59      160    12     125     0           1
## 4    56      120    15     162     1.9         0
## 5    45      110    19     132     1.2         1
## 6    56      140    38     153     1.3         0

transformed_num_test_myheartdata <- log(num_test_myheartdata)
head(transformed_num_test_myheartdata)

## # A tibble: 6 × 6
##     age trestbps  chol thalach  oldpeak condition
##   <dbl>    <dbl> <dbl>   <dbl>    <dbl>     <dbl>
## 1  4.23     4.94 0.693    5.02    0.588      -Inf
## 2  4.19     5.01 1.10     4.74    0.956      -Inf
## 3  4.08     5.08 2.48     4.83 -Inf             0
## 4  4.03     4.79 2.71     5.09    0.642      -Inf
## 5  3.81     4.70 2.94     4.88    0.182         0
## 6  4.03     4.94 3.64     5.03    0.262      -Inf

library(dplyr)

Once transformed, the numeric values required further normalization for future use in machine learning models because oldpeak still differed from the remaining variables. Logarithmic transformation brought values closer together, but variability across features remained an issue. In the first rows of the transformed training set (Table 3), age, resting blood pressure (trestbps), cholesterol (chol), and maximum heart rate achieved (thalach) had comparable values between 2.0 and 6.0. Oldpeak could not be compared with the other variables, with values from –Inf to about 1.3. The target variable, condition, also lost comparability: because log(0) is –Inf and log(1) is 0, every instance with a condition value of 0 became –Inf. MinMax Normalization was chosen as the normalization process because of its frequent use in the literature for heart disease datasets. A function for MinMax scaling was created and applied first to the quantitative variables of the training set and subsequently to the quantitative variables of the test set. As called, the function received the whole data frame of untransformed numeric variables (num_train_myheartdata), so a single minimum and maximum were taken across all columns rather than one pair per variable. All resulting values fall between 0 and 1 (Table 4), but only cholesterol (chol) spans the full range. After MinMax scaling, variance rounded to 0.00 for most variables; the exceptions were cholesterol (chol) at 0.08 and maximum heart rate achieved (thalach) at 0.01 (Table 5).

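# Min-max scaling helper: rescales x using the minimum and maximum of x.
min_max_norm <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

# Called on the whole data frame of untransformed numeric variables, so min(x)
# and max(x) are the overall minimum and maximum across all columns.
scl_trsf_train_myheartdata <- min_max_norm(num_train_myheartdata)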
head(scl_trsf_train_myheartdata)

##         age  trestbps       chol   thalach     oldpeak   condition
## 1 0.2154882 0.6060606 0.60269360 0.5185185 0.000000000 0.000000000
## 2 0.1952862 0.5050505 0.04713805 0.5454545 0.003367003 0.000000000
## 3 0.2053872 0.4646465 0.65656566 0.4208754 0.012121212 0.003367003
## 4 0.1750842 0.5791246 0.39730640 0.5454545 0.001683502 0.000000000
## 5 0.1919192 0.3703704 0.77104377 0.4814815 0.010101010 0.003367003
## 6 0.1851852 0.4309764 0.82154882 0.4377104 0.006734007 0.003367003

scl_trsf_test_myheartdata <- min_max_norm(num_test_myheartdata)
head(scl_trsf_test_myheartdata)

##         age  trestbps        chol   thalach     oldpeak   condition
## 1 0.2338983 0.4745763 0.006779661 0.5118644 0.006101695 0.000000000
## 2 0.2237288 0.5084746 0.010169492 0.3864407 0.008813559 0.000000000
## 3 0.2000000 0.5423729 0.040677966 0.4237288 0.000000000 0.003389831
## 4 0.1898305 0.4067797 0.050847458 0.5491525 0.006440678 0.000000000
## 5 0.1525424 0.3728814 0.064406780 0.4474576 0.004067797 0.003389831
## 6 0.1898305 0.4745763 0.128813559 0.5186441 0.004406780 0.000000000
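
The scaled values above are consistent with a single minimum and maximum taken over the whole data frame (for example, 64/297 ≈ 0.2155 for the first age value), and the test set was scaled with its own bounds. As a hedged sketch of a column-wise variant, the code below learns one minimum and maximum per variable from the training set and reuses those bounds for the test set. The toy data frames and the helper min_max_with() are hypothetical and are not the code used in this EDA.

```r
# Min-max scaling of one numeric vector with explicit bounds, so that
# bounds learned on the training set can be reused on the test set.
min_max_with <- function(x, lo, hi) (x - lo) / (hi - lo)

# Hypothetical toy data standing in for the numeric training and test sets.
toy_train <- data.frame(age = c(29, 52, 64, 77), chol = c(1, 118, 179, 297))
toy_test  <- data.frame(age = c(45, 69),         chol = c(19, 2))

# Learn per-column minimum and maximum from the training set only.
lo <- sapply(toy_train, min)
hi <- sapply(toy_train, max)

# Apply column by column; Map() pairs each column with its own bounds.
scaled_train <- as.data.frame(Map(min_max_with, toy_train, lo, hi))
scaled_test  <- as.data.frame(Map(min_max_with, toy_test,  lo, hi))

sapply(scaled_train, range)  # every training column now spans 0 to 1
scaled_test                  # test values use the training bounds
```

Reusing the training bounds keeps information from the test set out of preprocessing; test values outside the training range would then fall slightly below 0 or above 1.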

nrow(scl_trsf_train_myheartdata)

## [1] 237

nrow(scl_trsf_test_myheartdata)

## [1] 60

head(scl_trsf_train_myheartdata)%>%
  kbl(caption = "Table 4. Head of Scaled and Transformed Numeric Variables (Training Set)") %>%
  kable_classic()

Table 4. Head of Scaled and Transformed Numeric Variables (Training Set)
age	trestbps	chol	thalach	oldpeak	condition
0.2154882	0.6060606	0.6026936	0.5185185	0.0000000	0.000000
0.1952862	0.5050505	0.0471380	0.5454545	0.0033670	0.000000
0.2053872	0.4646465	0.6565657	0.4208754	0.0121212	0.003367
0.1750842	0.5791246	0.3973064	0.5454545	0.0016835	0.000000
0.1919192	0.3703704	0.7710438	0.4814815	0.0101010	0.003367
0.1851852	0.4309764	0.8215488	0.4377104	0.0067340	0.003367

scl_trsf_train_descriptive_stats <- do.call(
  rbind,
  lapply(names(scl_trsf_train_myheartdata), function(col) compute_stats(scl_trsf_train_myheartdata[[col]], col))
)

scl_trsf_train_descriptive_stats

##    Variable Mean Median St.Deviation Variance Range  IQR Skewness Kurtosis
## 1       age 0.18   0.19         0.03     0.00  0.16 0.05    -0.22     2.46
## 2  trestbps 0.44   0.44         0.06     0.00  0.36 0.07     0.80     4.01
## 3      chol 0.50   0.50         0.29     0.08  1.00 0.50     0.03     1.79
## 4   thalach 0.50   0.51         0.08     0.01  0.44 0.09    -0.53     3.07
## 5   oldpeak 0.00   0.00         0.00     0.00  0.02 0.01     1.33     4.80
## 6 condition 0.00   0.00         0.00     0.00  0.00 0.00     0.11     1.01

scl_trsf_train_descriptive_stats %>%
  kbl(caption = "Table 5. Transformed and Normalized Descriptive Statistics (Training Set)") %>%
  kable_classic()

Table 5. Transformed and Normalized Descriptive Statistics (Training Set)
Variable	Mean	Median	St.Deviation	Variance	Range	IQR	Skewness	Kurtosis
age	0.18	0.19	0.03	0.00	0.16	0.05	-0.22	2.46
trestbps	0.44	0.44	0.06	0.00	0.36	0.07	0.80	4.01
chol	0.50	0.50	0.29	0.08	1.00	0.50	0.03	1.79
thalach	0.50	0.51	0.08	0.01	0.44	0.09	-0.53	3.07
oldpeak	0.00	0.00	0.00	0.00	0.02	0.01	1.33	4.80
condition	0.00	0.00	0.00	0.00	0.00	0.00	0.11	1.01

Summary of Data Preprocessing

After division of the dataset into training and testing sets, data management was completed in two stages, each applied first to the training set and then to the testing set, for a four-step process. First, quantitative variables were transformed using a logarithmic transformation function. Second, quantitative variables were scaled using MinMax Normalization with a min-max scaling function. Incorporating these two processes into data management reduced the variance of the numeric variables. Following transformation and normalization, values are more consistent and comparable across variables, which increases the likelihood of accurate predictions when the data are used with machine learning models.

Prediction using classification models requires various combinations of influential variables to identify relationships that lead to highly accurate predictions. When variables are measured on different numeric scales, there is a risk of inaccurate predictions, because variables with larger values can overpower variables with smaller values and reduce overall model accuracy. To manage this risk, variables on different scales are transformed and normalized for uniformity. This reduces the negative impact of high variance and outlying values and allows comparisons across different scales.
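
The effect of scale on a distance-based model can be shown with a small worked example. The two patients below are hypothetical, and the scaling bounds are the chol and oldpeak minimum and maximum values from the training summary above, used here only for illustration.

```r
# Two hypothetical patients described by chol (large scale) and oldpeak (small scale).
a <- c(chol = 200, oldpeak = 0.5)
b <- c(chol = 240, oldpeak = 4.0)

# Share of the squared Euclidean distance contributed by each feature.
share <- function(p, q) (p - q)^2 / sum((p - q)^2)

share(a, b)  # almost all of the distance comes from chol

# Rescale each feature with per-variable bounds
# (chol 1 to 297, oldpeak 0 to 6.2).
lo <- c(chol = 1, oldpeak = 0)
hi <- c(chol = 297, oldpeak = 6.2)
a_scaled <- (a - lo) / (hi - lo)
b_scaled <- (b - lo) / (hi - lo)

share(a_scaled, b_scaled)  # oldpeak now contributes most of the distance
```

Before scaling, a model such as k-nearest neighbors would treat these two patients as differing almost entirely in cholesterol, even though their oldpeak values sit near opposite ends of that variable's range.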

Al-Ssulami et al. (2023) approached data management by using both scaling and data augmentation to optimize the dataset prior to use in classification models. The researchers used a comprehensive dataset that combined the Cleveland, Hungarian, Long Beach-VA, Switzerland, and Statlog heart disease databases for a base dataset of 1190 instances (918 unique and 272 duplicate instances). The researchers removed the duplicate instances and then duplicated misclassified instances identified with leave-one-out cross-validation. Four numeric variables (age, resting blood pressure, cholesterol, and maximum heart rate achieved) were normalized using zero-mean normalization. Results indicated that duplicating misclassified instances successfully adjusted the decision boundary of the model, using the tendency to overfit as a method to increase accuracy on unseen data. The researchers found overall improvement in accuracy when combining Bagged-DT with the augmented datasets (Al-Ssulami et al., 2023).

In this EDA, a logarithmic transformation was completed to reduce variability and increase similarity in numeric values. Additionally, MinMax Normalization was used to convert values to a scale between 0 and 1, allowing for further uniformity and increasing the ease of comparison across variables. Use of these strategies successfully prepared data for use in prediction models. Data augmentation was not completed for this EDA but may be investigated in future work if accuracy requires further optimization.

Works Cited

Alghamdi, N. S., Zakariah, M., Shankar, A., & Viriyasitavat, W. (2024). Heart disease prediction using autoencoder and DenseNet architecture. Egyptian Informatics Journal, 28, 100559. https://doi.org/10.1016/j.eij.2024.100559

Al-Ssulami, A. M., Alsorori, R. S., Azmi, A. M., & Aboalsamh, H. (2023). Improving Coronary Heart Disease Prediction Through Machine Learning and an Innovative Data Augmentation Technique. Cognitive Computation, 15(5), 1687–1702. https://doi.org/10.1007/s12559-023-10151-6

Cai, Y., Cai, Y.-Q., Tang, L.-Y., Wang, Y.-H., Gong, M., Jing, T.-C., Li, H.-J., Li-Ling, J., Hu, W., Yin, Z., Gong, D.-X., & Zhang, G.-W. (2024). Artificial intelligence in the risk prediction models of cardiovascular disease and development of an independent validation screening tool: a systematic review. BMC Medicine, 22(1). https://doi.org/10.1186/s12916-024-03273-7

Centers for Disease Control and Prevention. (2024, May 15). About Heart Disease. https://www.cdc.gov/heart-disease/about/index.html

Normawati, D., & Winarti, S. (2018). Feature Selection with Combination Classifier use Rules-Based Data Mining for Diagnosis of Coronary Heart Disease. 2018 12th International Conference on Telecommunication Systems, Services, and Applications (TSSA). https://doi.org/10.1109/tssa.2018.8708849

Goel, N., Yadav, N. P., Prakarti Prakarti, & Pandey, A. (2023). Comparative Analysis of Single Classifier Models against Aggregated Fusion Models for Heart Disease Prediction. 2023 International Conference on Disruptive Technologies (ICDT), 576–580. https://doi.org/10.1109/icdt57929.2023.10150611

Gracious, S., & Shirley, C. P. (2024). ML based Age Related Heart Disease Prediction. Proceedings of 2024 International Conference on Cognitive Robotics and Intelligent Systems (ICC - ROBINS 2024), 312–316.

Heart Disease Cleveland. (n.d.). Kaggle. https://www.kaggle.com/datasets/ritwikb3/heart-disease-cleveland

Kandula, S. T. R. (2025). Comparison and Performance Assessment of Intelligent ML Models for Forecasting Cardiovascular Disease Risks in Healthcare. 2025 4th International Conference on Sensors and Related Networks (SENNET) - Special Focus on Digital Healthcare, 1–6. https://doi.org/10.1109/sennet64220.2025.11136005

Manish, G. S., & Perumal, S. (2025). A Comprehensive Analysis of Machine Learning Methods for Predicting Heart Disease. 2025 International Conference on Networks and Cryptology (NETCRYPT).

Riyaz, L., Butt, M. A., & Zaman, M. (2022). Improving Coronary Heart Disease Prediction by Outlier Elimination. Applied Computer Science, 18(1), 70–88. https://doi.org/10.35784/acs-2022-6

Sameen, S., Gaikwad, M., & Wazalwar, S. (2025). Efficient Prediction of Cardiovascular Disease Using Machine Learning. 2025 12th International Conference on Emerging Trends in Engineering & Technology - Signal and Information Processing (ICETET - SIP), 1–6. https://doi.org/10.1109/icetetsip64213.2025.11156734

Tseng, C.-W., Sun, L.-C., Lin, K.-F., & Chen, P.-N. (2025). Enhancing coronary heart disease diagnosis: Comparative analysis of data pre-processing techniques and machine learning models using clinical medical records. Health Informatics Journal, 31(3). https://doi.org/10.1177/14604582251366160