NorthamJDDS8501-3

A Demonstration of Principles of Exploratory Data Analysis for Data Science Using the Cleveland Heart Dataset: Part 3

Heart disease continues to be the leading cause of death for individuals in the middle adult age range despite decades of research on treatment for this condition. The Centers for Disease Control and Prevention (CDC, 2024) describes heart disease as a term encompassing several conditions related to heart function, including coronary artery disease, which affects blood flow to the heart and may increase the risk of a heart attack. According to the CDC (2024), 22% of deaths in the United States are due to heart disease. Given the need for advanced healthcare practices to treat this condition, data science has emerged as an option, using machine learning models to improve understanding of the disease and support advances in treatment.


An exploratory data analysis (EDA) includes several components that build an understanding of the data used. This understanding is essential for identifying appropriate features for classification models and ensuring optimal performance. Data visualization is required for understanding trends in the data, including the presence of outliers, unbalanced groups, or other aspects of the data that may require management prior to model development. The continued presence of heart disease as a major health risk for many adults (CDC, 2024) has led to increased attention to data analysis in the healthcare sector related to heart disease. An increased understanding of the relevant data is essential to ensure medical practices continue to evolve to effectively treat or prevent heart disease. The purpose of this EDA is to investigate methods for data visualization using the Cleveland Heart dataset. Part 3 of the EDA inspects four features of the dataset (cholesterol, resting blood pressure, fasting blood sugar, and exercise-induced angina). The understanding gained from the data visualization techniques will guide feature selection in subsequent steps of the EDA process for classifying the presence or absence of heart disease.

The Amelia-II package was installed and required specification of the CRAN mirror.

install.packages("Amelia", repos = "https://cloud.r-project.org")

## Installing package into 'C:/Users/benke/AppData/Local/R/win-library/4.5'
## (as 'lib' is unspecified)

## package 'Amelia' successfully unpacked and MD5 sums checked

## Warning: cannot remove prior installation of package 'Amelia'

## Warning in file.copy(savedcopy, lib, recursive = TRUE): problem copying
## C:\Users\benke\AppData\Local\R\win-library\4.5\00LOCK\Amelia\libs\x64\Amelia.dll
## to C:\Users\benke\AppData\Local\R\win-library\4.5\Amelia\libs\x64\Amelia.dll:
## Permission denied

## Warning: restored 'Amelia'

## 
## The downloaded binary packages are in
##  C:\Users\benke\AppData\Local\Temp\Rtmpq2Msj4\downloaded_packages

install.packages("Rmisc", repos = "https://cloud.r-project.org")

## Installing package into 'C:/Users/benke/AppData/Local/R/win-library/4.5'
## (as 'lib' is unspecified)

## package 'Rmisc' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\benke\AppData\Local\Temp\Rtmpq2Msj4\downloaded_packages

All packages used in Part 1 and Part 2 were loaded.

library(ggplot2)            # Load ggplot2 library
library(scales)             # Load scales library
library(moments)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(readr)

## Warning: package 'readr' was built under R version 4.5.1

## 
## Attaching package: 'readr'

## The following object is masked from 'package:scales':
## 
##     col_factor

library(kableExtra)

## Warning: package 'kableExtra' was built under R version 4.5.1

## 
## Attaching package: 'kableExtra'

## The following object is masked from 'package:dplyr':
## 
##     group_rows

The Cleveland Heart dataset was loaded for analysis.

getwd()

## [1] "C:/Users/benke/OneDrive/NU/DDS 8501"

setwd("C:/Users/benke/OneDrive/NU/DDS 8501")

myheartdata <- read_csv("heart_cleveland_upload.csv")

## Rows: 297 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (14): age, sex, cp, trestbps, chol, fbs, restecg, thalach, exang, oldpea...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Information regarding the variables was included in this EDA for reference and clarity.

The Characteristics of The Dataset

This dataset comprises the data from the heart_cleveland_upload dataset, which has 14 variables and 297 observations.

The variables consist of 13 attributes and a target variable indicating the presence or absence of heart disease. The variables are described as follows:

age: patient’s age (Quantitative, Numeric, Continuous, Ratio)

sex: patient’s gender (Qualitative, Categorical, Nominal)

cp: chest pain: This variable includes 4 categories of chest pain (typical angina = 0, atypical angina = 1, non-anginal pain = 2, asymptomatic = 3; Qualitative, Categorical, Nominal)

trestbps: patient’s blood pressure at rest (mm Hg, Quantitative, Numeric, Continuous, Ratio)

chol: serum cholesterol (mg/dl, Quantitative, Numeric, Continuous, Ratio)

fbs: fasting blood sugar > 120 mg/dl (Qualitative, Categorical, Nominal)

restecg: electrocardiogram results at rest categorized in 3 values (Normal = 0; ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV) = 1; probable/definite left ventricular hypertrophy = 2; Qualitative, Categorical, Nominal)

thalach: patient’s maximum heart rate (Quantitative, Numeric, Continuous, Interval)

exang: presence/absence of exercise induced angina (Qualitative, Categorical, Nominal)

oldpeak: exercise induced ST-depression compared to rest state (Quantitative, Numeric, Continuous, Interval)

slope: shape of slope of ST segment during peak exercise (up, flat, or down; Qualitative, Categorical, Nominal)

ca: patient’s number of major blood vessels (Qualitative, Categorical, Nominal)

thal: patient’s thalassemia indicating type of defect (Normal, Fixed defect, or Reversible defect, stored as codes 0 to 2 in this file; Qualitative, Categorical, Nominal)

condition: target: presence or absence of heart disease (Binary, Numerical, Discrete)


Preprocessing

Preprocessing code was included in this notebook for reference. In Part 1 of this EDA, categorical variables with numeric values were reclassified to factors.

factor_columns <- c("sex", "cp", "fbs", "restecg", 
                    "exang","slope", "ca", "thal")
myheartdata[factor_columns] <- lapply(myheartdata[factor_columns], function(col) as.factor(as.character(col)))
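
As a small illustration of the reclassification above, the sketch below attaches descriptive labels to a numerically coded variable. The toy vector `cp_codes` is hypothetical, and the labels follow the variable descriptions listed earlier in this report; it is one possible approach and not the preprocessing actually run here.

```r
# Hypothetical toy vector of chest pain codes (not taken from the dataset)
cp_codes <- c(0, 3, 2, 3, 1, 3)

# Convert the numeric codes to a labelled factor;
# the labels follow the variable list earlier in this report
cp_factor <- factor(
  cp_codes,
  levels = 0:3,
  labels = c("typical angina", "atypical angina",
             "non-anginal pain", "asymptomatic")
)

# Counts per category
cp_counts <- table(cp_factor)
print(cp_counts)
```

Labelled levels make later plots and tables easier to read than the raw codes, at the cost of having to keep the label order aligned with the coding of the source file.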

Another important preprocessing step is identification of missing values and the percent of “NA” values in this dataset. This dataset does not include “NA” values which require preprocessing.

cols_with_nas <- sum(colSums(is.na(myheartdata)) > 0)
Percent_col_NA <- percent(cols_with_nas / length(myheartdata))
cols_with_nas

## [1] 0

Percent_col_NA

## [1] "0%"
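
The check above reports the share of columns that contain missing values. A per-column view can also be useful when a dataset does have gaps. The following is a minimal sketch on a hypothetical toy data frame (`toy` is not the heart dataset), using only base R:

```r
# Hypothetical toy data frame with a few missing values (not the heart dataset)
toy <- data.frame(
  a = c(1, NA, 3, 4),
  b = c("x", "y", NA, NA),
  c = c(TRUE, FALSE, TRUE, TRUE)
)

# Percentage of missing values in each column
pct_na_by_col <- colMeans(is.na(toy)) * 100

# Number and share of columns that contain at least one NA
cols_with_nas <- sum(colSums(is.na(toy)) > 0)
share_cols_na <- cols_with_nas / ncol(toy)

print(pct_na_by_col)
print(cols_with_nas)
```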

The summary obtained from Part 1 was maintained. This includes counts for categorical data. Only numeric variables have mean and median scores.

summary(myheartdata)

##       age        sex     cp         trestbps          chol       fbs    
##  Min.   :29.00   0: 96   0: 23   Min.   : 94.0   Min.   :126.0   0:254  
##  1st Qu.:48.00   1:201   1: 49   1st Qu.:120.0   1st Qu.:211.0   1: 43  
##  Median :56.00           2: 83   Median :130.0   Median :243.0          
##  Mean   :54.54           3:142   Mean   :131.7   Mean   :247.4          
##  3rd Qu.:61.00                   3rd Qu.:140.0   3rd Qu.:276.0          
##  Max.   :77.00                   Max.   :200.0   Max.   :564.0          
##  restecg    thalach      exang      oldpeak      slope   ca      thal   
##  0:147   Min.   : 71.0   0:200   Min.   :0.000   0:139   0:174   0:164  
##  1:  4   1st Qu.:133.0   1: 97   1st Qu.:0.000   1:137   1: 65   1: 18  
##  2:146   Median :153.0           Median :0.800   2: 21   2: 38   2:115  
##          Mean   :149.6           Mean   :1.056           3: 20          
##          3rd Qu.:166.0           3rd Qu.:1.600                          
##          Max.   :202.0           Max.   :6.200                          
##    condition     
##  Min.   :0.0000  
##  1st Qu.:0.0000  
##  Median :0.0000  
##  Mean   :0.4613  
##  3rd Qu.:1.0000  
##  Max.   :1.0000

Additionally, use of the str() function reveals the categories have been appropriately reclassified as factors.

str(myheartdata)

## spc_tbl_ [297 × 14] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ age      : num [1:297] 69 69 66 65 64 64 63 61 60 59 ...
##  $ sex      : Factor w/ 2 levels "0","1": 2 1 1 2 2 2 2 2 1 2 ...
##  $ cp       : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
##  $ trestbps : num [1:297] 160 140 150 138 110 170 145 134 150 178 ...
##  $ chol     : num [1:297] 234 239 226 282 211 227 233 234 240 270 ...
##  $ fbs      : Factor w/ 2 levels "0","1": 2 1 1 2 1 1 2 1 1 1 ...
##  $ restecg  : Factor w/ 3 levels "0","1","2": 3 1 1 3 3 3 3 1 1 3 ...
##  $ thalach  : num [1:297] 131 151 114 174 144 155 150 145 171 145 ...
##  $ exang    : Factor w/ 2 levels "0","1": 1 1 1 1 2 1 1 1 1 1 ...
##  $ oldpeak  : num [1:297] 0.1 1.8 2.6 1.4 1.8 0.6 2.3 2.6 0.9 4.2 ...
##  $ slope    : Factor w/ 3 levels "0","1","2": 2 1 3 2 2 2 3 2 1 3 ...
##  $ ca       : Factor w/ 4 levels "0","1","2","3": 2 3 1 2 1 1 1 3 1 1 ...
##  $ thal     : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 3 2 1 1 3 ...
##  $ condition: num [1:297] 0 0 0 1 0 0 0 1 0 0 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   age = col_double(),
##   ..   sex = col_double(),
##   ..   cp = col_double(),
##   ..   trestbps = col_double(),
##   ..   chol = col_double(),
##   ..   fbs = col_double(),
##   ..   restecg = col_double(),
##   ..   thalach = col_double(),
##   ..   exang = col_double(),
##   ..   oldpeak = col_double(),
##   ..   slope = col_double(),
##   ..   ca = col_double(),
##   ..   thal = col_double(),
##   ..   condition = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

Amelia-II package was installed with the CRAN mirror specified and loaded.

Missingness Analysis

Missingness was first investigated in Part 2 of this EDA using the missingness map method in the Amelia II package in R, and that analysis is repeated here. Inspection of the missingness map reveals no missing values for this dataset.

library(Amelia)             # Load Amelia, which provides missmap()
myheartdata.missmap <- missmap(myheartdata)

## Warning: Unknown or uninitialised column: `arguments`.
## Unknown or uninitialised column: `arguments`.

## Warning: Unknown or uninitialised column: `imputations`.

myheartdata.missmap

## NULL

Additionally, columns were inspected using the colSums(is.na()) function in R. Consistent with the previous analyses, every column returned a count of 0, indicating the absence of missing values in this dataset.

colSums(is.na(myheartdata))

##       age       sex        cp  trestbps      chol       fbs   restecg   thalach 
##         0         0         0         0         0         0         0         0 
##     exang   oldpeak     slope        ca      thal condition 
##         0         0         0         0         0         0

Data Visualization

The target feature of this project is the presence or absence of heart disease. Fourteen variables are available for analysis: the target, 5 numeric features, and 8 categorical features. The initial inspection of the data requires a comparison of the target classification groups to ensure balance between the groups. Kandula (2025) used a pie chart to share information regarding the group balance in the dataset. Although pie charts are typically avoided because they communicate less clearly, the pie chart in that study was helpful, showing slightly over half (51%) of the dataset classified as having heart disease and 49% classified as having no heart disease. In this EDA, a bar chart was used to portray the distribution between the two target groups. As can be seen from the bar chart, the two group counts are similar and reasonably balanced. In this dataset, the presence of heart disease has a count of 137 and the absence of heart disease has a count of 160, or 46% and 54% of the 297 observations, consistent with the mean of 0.4613 for condition in the summary output. Further investigation is required to determine the source of the difference in the group counts and percentages between the Kandula (2025) publication and this EDA.

# condition is numeric (0/1), so it is converted to a factor for a discrete x-axis
ggplot(myheartdata, aes(x = factor(condition))) +
  geom_bar(fill = "turquoise") +
  labs(x="Group Comparison of Presence or Absence of Heart Disease", y="Count")
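
Counts read from a bar chart are approximate, so the exact group sizes can be recovered from the summary output shown earlier. Because `condition` is coded 0/1, its mean is the proportion of observations with heart disease. A minimal sketch, using only the 297 rows reported when the file was read and the mean of 0.4613 reported by `summary()`; the variable names are illustrative:

```r
n_total   <- 297      # rows reported when the CSV was read
mean_cond <- 0.4613   # mean of condition from summary()

# The mean of a 0/1 variable times n gives the number of 1s
n_present <- round(mean_cond * n_total)
n_absent  <- n_total - n_present

# Group shares in percent
pct <- round(100 * c(present = n_present, absent = n_absent) / n_total, 1)

print(c(present = n_present, absent = n_absent))
print(pct)
```

With the data loaded, `table(myheartdata$condition)` followed by `prop.table()` should give the same figures directly.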

Rmisc and qqplotr were installed and loaded to create the 4-plot data visualization.

library(Rmisc)

## Warning: package 'Rmisc' was built under R version 4.5.1

## Loading required package: lattice

## Loading required package: plyr

## Warning: package 'plyr' was built under R version 4.5.1

## ------------------------------------------------------------------------------

## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)

## ------------------------------------------------------------------------------

## 
## Attaching package: 'plyr'

## The following objects are masked from 'package:dplyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize

install.packages("qqplotr", repos = "https://cloud.r-project.org")

## Installing package into 'C:/Users/benke/AppData/Local/R/win-library/4.5'
## (as 'lib' is unspecified)

## package 'qqplotr' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\benke\AppData\Local\Temp\Rtmpq2Msj4\downloaded_packages

library(qqplotr)

## Warning: package 'qqplotr' was built under R version 4.5.1

## 
## Attaching package: 'qqplotr'

## The following objects are masked from 'package:ggplot2':
## 
##     stat_qq_line, StatQqLine

The determination of features for comparison is another important part of the EDA. Normawati and Winarti (2018) compared feature selection methods using the Cleveland Heart dataset: motivated feature selection (MTF), which relies on medical expert-based selection; the Variable Precision Rough Set (VPRS) method; and a combination of both. Results indicated that model accuracy improved with VPRS and with the combination of MTF and VPRS. The authors note that using MTF alone may introduce unanticipated bias, so a computer-based feature selection method is likely to improve model performance. The features selected for this EDA used medical expert-based selection to identify features historically associated with heart disease, based on risk factors reported by the Centers for Disease Control and Prevention (CDC, 2024). The CDC (2024) reports high blood pressure and high cholesterol as key risk factors for heart disease. Computer-based feature selection methods may reduce the risk of bias and will be considered in future analyses.

For this EDA, a 4-plot was developed using the cholesterol and resting blood pressure variables, as these were reported by the CDC as high-risk indicators for heart disease (CDC, 2024). Figure 2 shows a run sequence plot, a lag plot, a histogram, and a normal probability plot. These plots require careful reading because the code that builds them first replaces the chol column with the observation index (the integers 1 to 297), so every axis labeled cholesterol shows the row position and not serum cholesterol in mg/dl. The run and lag plots therefore show resting blood pressure against the observation index, grouped by condition (presence or absence of heart disease). Both plots show a higher concentration of observations with heart disease at index values above 150. This pattern reflects the order of the rows in the file and cannot be read as an association between cholesterol and heart disease. The lag plot also shows two resting blood pressure values above 180, at approximately 190 and 200, both in the heart disease group. Because these values were considered to reflect expected variation in electronic health records rather than error, they were retained in the analysis, although they are likely outliers. A resting blood pressure above 180 may be associated with heart disease, but two observations are too few to support that conclusion.

The histogram is uniform for the same reason: each index value from 1 to 297 occurs exactly once, so the flat shape is produced by the replacement and does not indicate bias or low variability in the cholesterol measurements. The summary output before the replacement shows serum cholesterol ranging from 126 to 564 mg/dl with a median of 243 and a mean of 247.4. Finally, the probability plot shows a similar shape for both target groups, with deviations from normality at the left and right ends of the plot, as expected for roughly evenly spaced values. The 4-plot needs to be repeated with the original cholesterol values before conclusions are drawn about the distribution of this feature.

# NOTE: this line overwrites the serum cholesterol values with the row index (1 to 297)
myheartdata$chol=seq(1:nrow(myheartdata))
p1=ggplot(myheartdata, aes(x=chol,y=trestbps, group=condition))+geom_line(aes(col=as.factor(condition)))+ggtitle('Run: Cholesterol')
p3=ggplot(myheartdata, aes(x=lag(chol), y=trestbps, group=condition))+geom_point(aes(col=as.factor(condition)))+ggtitle('Lag:Cholesterol')
p2=ggplot(myheartdata, aes(x=chol))+geom_histogram(fill='turquoise',col='black', bin=20)+ggtitle('Histogram:Cholesterol')

## Warning in geom_histogram(fill = "turquoise", col = "black", bin = 20):
## Ignoring unknown parameters: `bin`

p4=ggplot(myheartdata, aes(sample=scale(chol),group=condition))+stat_qq_point(aes(col=as.factor(condition)))+
        ggtitle('Normal Probability Plot: Cholesterol')
multiplot(p1, p2, p3, p4, cols=2)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).
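
For comparison, a 4-plot can be built so that all four panels describe the measured variable, with the observation order kept in a separate column. The sketch below is one way to do this under stated assumptions: the data frame `toy` is simulated and hypothetical, `obs_index` and `chol_lag1` are illustrative names, and base R graphics are used in place of ggplot2 and Rmisc so the example has no package dependencies. It is not the code used to produce Figure 2.

```r
set.seed(1)

# Hypothetical toy data standing in for the heart dataset (values are simulated)
toy <- data.frame(
  chol      = round(rnorm(100, mean = 247, sd = 50)),
  condition = rbinom(100, size = 1, prob = 0.46)
)
chol_before <- toy$chol

# Keep the measurement intact and store the observation order separately
toy$obs_index <- seq_len(nrow(toy))

# Lag-1 values: each observation paired with the one before it
chol_lag1 <- c(NA, head(toy$chol, -1))

pdf(NULL)  # draw to a null device so the sketch also runs without a display
par(mfrow = c(2, 2))
plot(toy$obs_index, toy$chol, type = "l",
     main = "Run sequence", xlab = "Observation index", ylab = "Cholesterol")
plot(chol_lag1, toy$chol,
     main = "Lag plot", xlab = "Cholesterol (previous obs.)", ylab = "Cholesterol")
hist(toy$chol, breaks = 20,
     main = "Histogram", xlab = "Cholesterol")
qqnorm(toy$chol, main = "Normal probability plot")
qqline(toy$chol)
invisible(dev.off())
```

Keeping the index in its own column means the histogram and probability plot still describe the measurement, while the run sequence plot can use the index on the x-axis.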

Outlier management is another important aspect of data preprocessing. Riyaz et al. (2022) addressed outlier management using the Framingham Heart Study dataset when comparing classification model performance based on outlier detection. Using visualization methods including but not limited to box plots, Riyaz et al. (2022) found outliers in total cholesterol, systolic blood pressure, diastolic blood pressure, body mass index, heart rate, and glucose level. The authors found that removal of outliers improved the performance of all classification models used except multinomial naive Bayes (Riyaz et al., 2022).

In this EDA, box plots for the variables of interest (cholesterol and resting blood pressure) were used to further identify distribution characteristics. The box plot labeled cholesterol shows a median of 150, a 1st quartile of approximately 75, and a 3rd quartile of approximately 225, with whiskers extending from near 0 to approximately 300 and no outliers. These values describe the observation index (1 to 297) that replaced the chol column in the 4-plot code, so the lower whisker near 0 does not indicate a recorded cholesterol of 0 or hidden missing values. The summary output before the replacement shows serum cholesterol with a 1st quartile of 211, a median of 243, a 3rd quartile of 276, a minimum of 126, and a maximum of 564 mg/dl; the maximum lies well above the 3rd quartile and would be expected to appear as a high outlier. The box plot for resting blood pressure indicates a 1st quartile of 120, a median of approximately 130, and a 3rd quartile of 140, in line with the summary output. The lower whisker ends slightly above 90 with no low outliers, and the upper whisker ends at approximately 170 with 6 outlying points above it. The outliers were considered representative of expected variation and included in the analysis.

ggplot(myheartdata, aes(x = condition, y = chol)) +
  geom_boxplot(fill = "salmon", color = "darkblue", width = .25, outlier.color = "turquoise", outlier.size = 2) +
  labs(
    title = "Box Plot for Cholesterol",
    x = "Condition",
    y = "Cholesterol"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_blank(),
    axis.ticks.x = element_blank(),
    plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
    axis.title.y = element_text(size = 14),
    axis.text.y = element_text(size = 12)
  )

## Warning: Continuous x aesthetic
## ℹ did you forget `aes(group = ...)`?

ggplot(myheartdata, aes(x = condition, y = trestbps)) +
  geom_boxplot(fill = "salmon", color = "darkblue", width = .25, outlier.color = "turquoise", outlier.size = 2) +
  labs(
    title = "Box Plot for Resting Blood Pressure",
    x = "Condition",
    y = "Resting Blood Pressure"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_blank(),
    axis.ticks.x = element_blank(),
    plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
    axis.title.y = element_text(size = 14),
    axis.text.y = element_text(size = 12)
  )

## Warning: Continuous x aesthetic
## ℹ did you forget `aes(group = ...)`?
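
The whisker limits of a box plot can be approximated from the quartiles alone. The sketch below applies the usual 1.5 × IQR rule to the quartiles printed by `summary()` earlier in this report (before the chol column was replaced). The data frame `q` and its column names are illustrative, and the fences are approximate because box plot hinges can differ slightly from the quartiles that `summary()` reports.

```r
# Quartiles and maxima copied from the summary() output earlier in this report
q <- data.frame(
  variable = c("chol", "trestbps"),
  q1  = c(211, 120),
  q3  = c(276, 140),
  max = c(564, 200)
)

# 1.5 * IQR rule: points beyond the fences are drawn as outliers
q$iqr            <- q$q3 - q$q1
q$lower_fence    <- q$q1 - 1.5 * q$iqr
q$upper_fence    <- q$q3 + 1.5 * q$iqr
q$max_is_outlier <- q$max > q$upper_fence

print(q)
```

Under this rule the upper fence is 373.5 for serum cholesterol and 170 for resting blood pressure, so the maximum of each variable falls outside the whiskers.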

head(myheartdata)

## # A tibble: 6 × 14
##     age sex   cp    trestbps  chol fbs   restecg thalach exang oldpeak slope
##   <dbl> <fct> <fct>    <dbl> <int> <fct> <fct>     <dbl> <fct>   <dbl> <fct>
## 1    69 1     0          160     1 1     2           131 0         0.1 1    
## 2    69 0     0          140     2 0     0           151 0         1.8 0    
## 3    66 0     0          150     3 0     0           114 0         2.6 2    
## 4    65 1     0          138     4 1     2           174 0         1.4 1    
## 5    64 1     0          110     5 0     2           144 1         1.8 1    
## 6    64 1     0          170     6 0     2           155 0         0.6 1    
## # ℹ 3 more variables: ca <fct>, thal <fct>, condition <dbl>

View(myheartdata["trestbps"])

Univariate data visualization is also an important part of the EDA, used to review the distribution of counts in features of interest. Inspection of group characteristics supports identification of imbalance that could affect the performance of classification models. Kandula (2025) used a stacked bar chart to show that asymptomatic chest pain had the highest count of observations with heart disease compared to the typical angina, atypical angina, and non-anginal pain groups. Kandula (2025) points out that heart disease can present without symptomatic pain. In this EDA, group counts were visualized in bar charts for the categorical features of fasting blood sugar and exercise-induced angina. Exercise-induced angina was chosen due to the inclusion and discussion of this feature in the literature (Kandula, 2025). Fasting blood sugar was chosen because this feature was included in a list of important features by motivated feature selection (Normawati and Winarti, 2018). Visualization of the groups identifies imbalance in both features. Only about 14% of observations (43 of 297) fall in the category of fasting blood sugar above 120 mg/dl, showing unequal distribution and a high likelihood of bias if this variable is used in future analysis (Figure 5). To a lesser extent, imbalance is also seen in exercise-induced angina (Figure 6): the group with exercise-induced angina (97 observations) is approximately half the size of the group without it (200 observations). Use of these features could introduce bias in future steps of the EDA due to the unbalanced group counts.

library(dplyr)
ggplot(myheartdata, aes(x=fbs, fill=fbs)) +
  geom_bar(aes(fill=fbs), position = "dodge") +
  labs(x="Fasting Blood Sugar >120 mg/dl", y="Count")

ggplot(myheartdata, aes(x=exang, fill=exang)) +
  geom_bar(aes(fill=exang), position = "dodge") +
  labs(x="Exercise Induced Angina", y="Count")
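
The degree of imbalance in the two bar charts can be quantified from the category counts in the `summary()` output earlier in this report. A minimal sketch follows; the vector names and category labels are illustrative, and it is assumed that code 1 marks fasting blood sugar above 120 mg/dl and the presence of exercise-induced angina.

```r
# Counts copied from the summary() output earlier in this report
# (assumed coding: 1 = fasting blood sugar > 120 mg/dl, 1 = angina present)
fbs_counts   <- c(`<=120 mg/dl` = 254, `>120 mg/dl` = 43)
exang_counts <- c(absent = 200, present = 97)

# Share of observations in each category, in percent
fbs_pct   <- round(100 * prop.table(fbs_counts), 1)
exang_pct <- round(100 * prop.table(exang_counts), 1)

# Ratio of the larger group to the smaller group
imbalance_ratio <- c(
  fbs   = max(fbs_counts) / min(fbs_counts),
  exang = max(exang_counts) / min(exang_counts)
)

print(fbs_pct)
print(exang_pct)
print(round(imbalance_ratio, 2))
```

A ratio near 1 indicates balanced groups; here the ratio is close to 6 for fasting blood sugar and close to 2 for exercise-induced angina.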

In conclusion, this report focuses on the data visualization techniques used to investigate cholesterol, resting blood pressure, fasting blood sugar, and exercise-induced angina in the Cleveland Heart dataset. These variables were chosen because of the attention they received in previous literature using open-source heart disease datasets to develop heart disease classification models. Both categorical variables were found to have unbalanced groups, which could affect model accuracy because the smaller groups contain few observations. Numerical variables were chosen based on the CDC list of important risk factors for the development of heart disease (CDC, 2024). The 4-plot and the cholesterol box plot were drawn after the chol column had been replaced by the observation index, so they do not support conclusions about the normality of cholesterol or about an association between high cholesterol and heart disease; both questions require further investigation with the original values. For resting blood pressure, the plots show several high values above 170, including two above 180 in the heart disease group.

Data visualization provides insights into the distribution of the values of investigated features prior to model development. Data management requirements, such as outlier management, are identified with univariate data visualization to ensure features selected for future analysis do not introduce bias and negatively affect future model performance.


Heather Northam

2025-09-021
