library(ggplot2)            # Load ggplot2 library
library(scales)             # Load scales library
library(moments)            # Load moments library
library(dplyr)              # Load dplyr library
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(readr)              # Load readr library
## Warning: package 'readr' was built under R version 4.5.1
## 
## Attaching package: 'readr'
## The following object is masked from 'package:scales':
## 
##     col_factor
library(kableExtra)         # Load kableExtra library
## Warning: package 'kableExtra' was built under R version 4.5.1
## 
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
## 
##     group_rows

A Demonstration of Principles of Exploratory Data Analysis for Data Science Using the Cleveland Heart Dataset: Part 1

Heart disease continues to be the leading cause of death for individuals in the middle adult age range despite decades of research addressing treatment for this condition. The Centers for Disease Control and Prevention (CDC, 2024) describe heart disease as a term encompassing several conditions related to heart function, including coronary artery disease, which affects blood flow to the heart and may increase the risk of a heart attack. According to the CDC, 22% of deaths in the United States are due to heart disease (CDC, 2024). Given the high need for advanced healthcare practices to treat this condition, data science has surfaced as an option, using machine learning models to aid understanding of the disease and to advance treatment.

Source

Data that lacks quality and integrity due to omitted values, missing values, inaccurate column labeling, bias, or outliers risks producing inaccurate conclusions and inappropriate or potentially harmful action steps based on those findings. In healthcare, research findings hold value only when they result in beneficial action steps. Maintaining the quality of the data prior to statistical analysis is therefore imperative, and completing an exploratory data analysis (EDA) is a required part of that work.

In this notebook, an EDA of the Cleveland Heart dataset is performed to identify patterns in the data and to advance understanding of heart disease prediction algorithms.

The Cleveland Heart dataset was loaded for analysis.

getwd()
## [1] "C:/Users/benke/OneDrive/NU/DDS 8501"
setwd("C:/Users/benke/OneDrive/NU/DDS 8501")
myheartdata <- read_csv("heart_cleveland_upload.csv")
## Rows: 297 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (14): age, sex, cp, trestbps, chol, fbs, restecg, thalach, exang, oldpea...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
View(myheartdata)

Initial review of the dataset produced summary statistics for all variables, with categorical data represented as numeric data and therefore summarized with measures of central tendency. Further information regarding the number of counts per category for the categorical variables is required to understand the dataset and determine preprocessing steps.

summary(myheartdata)
str(myheartdata)
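
For example, counts per category can be obtained with base R’s table() before any reclassification. The line below is a minimal sketch using the chest pain variable (cp) as an illustration; the same call applies to any of the other categorical columns.

table(myheartdata$cp)       # Counts of observations per chest pain category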

The Characteristics of the Dataset

This dataset comprises the data from the heart_cleveland_upload dataset, which has 14 variables and 297 observations.

The variables consist of 13 attributes and a target variable indicating the presence or absence of heart disease. A description of the variables is as follows:

age: patient’s age (Quantitative, Numeric, Continuous, Ratio)

sex: patient’s gender (Qualitative, Categorical, Nominal)

cp: chest pain: This variable includes 4 categories of chest pain (typical angina = 0, atypical angina = 1, non-anginal pain = 2, asymptomatic = 3; Qualitative, Categorical, Nominal)

trestbps: patient’s blood pressure at rest (mm Hg; Quantitative, Numeric, Continuous, Ratio)

chol: serum cholesterol (mg/dl; Quantitative, Numeric, Continuous, Ratio)

fbs: fasting blood sugar > 120 mg/dl (Qualitative, Categorical, Nominal)

restecg: electrocardiogram results at rest categorized in 3 values (Normal = 0, ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV) = 1, probable/definite left ventricular hypertrophy = 2; Qualitative, Categorical, Nominal)

thalach: patient’s maximum heart rate (Quantitative, Numeric, Continuous, Interval)

exang: presence/absence of exercise-induced angina (Qualitative, Categorical, Nominal)

oldpeak: exercise-induced ST-depression compared to rest state (Quantitative, Numeric, Continuous, Interval)

slope: shape of the slope of the ST segment during peak exercise (up, flat, or down; Qualitative, Categorical, Nominal)

ca: patient’s number of major blood vessels (Qualitative, Categorical, Nominal)

thal: patient’s thalassemia indicating type of defect (Normal = 0, Fixed defect = 1, Reversible defect = 2; Qualitative, Categorical, Nominal)

condition: target variable indicating the presence or absence of heart disease (Binary, Numeric, Discrete)

Source

Preprocessing

One step of preprocessing required reclassifying the categorical variables as factors. An investigation of the sex variable in this EDA highlights the importance of appropriately labeled variable types. In the Cleveland Heart dataset prior to preprocessing, the categorical values were stored as numeric, binary values, so the initial summary reported a mean of 0.6768 and a median of 1 for the sex variable when comparing the male and female categories. In consideration of meaning, male and female are not numeric values, nor are they ordered, so these calculations are irrelevant. Following the preprocessing below, a more informative result of counts per category was obtained: 96 females and 201 males (sex is coded 0 = female, 1 = male). By reclassifying categorical data into factors and inspecting variable types, it becomes apparent that only the mode and contingency correlations would provide valuable information when planning future model development for the categorical variables. These variables were reclassified using the lapply() and as.factor() functions to create factor variable types for future processing.

# Convert the numerically coded categorical columns to factors
factor_columns <- c("sex", "cp", "fbs", "restecg", 
                    "exang", "slope", "ca", "thal")
myheartdata[factor_columns] <- lapply(myheartdata[factor_columns], function(col) as.factor(as.character(col)))
head(myheartdata)
## # A tibble: 6 × 14
##     age sex   cp    trestbps  chol fbs   restecg thalach exang oldpeak slope
##   <dbl> <fct> <fct>    <dbl> <dbl> <fct> <fct>     <dbl> <fct>   <dbl> <fct>
## 1    69 1     0          160   234 1     2           131 0         0.1 1    
## 2    69 0     0          140   239 0     0           151 0         1.8 0    
## 3    66 0     0          150   226 0     0           114 0         2.6 2    
## 4    65 1     0          138   282 1     2           174 0         1.4 1    
## 5    64 1     0          110   211 0     2           144 1         1.8 1    
## 6    64 1     0          170   227 0     2           155 0         0.6 1    
## # ℹ 3 more variables: ca <fct>, thal <fct>, condition <dbl>
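
As noted above, only the mode and contingency correlations provide valuable information for the categorical variables. The lines below are a minimal sketch of how the mode and a contingency table with the target could be obtained using base R’s table() and which.max(); the object name sex_counts is illustrative only.

sex_counts <- table(myheartdata$sex)                              # Counts per level of sex
names(which.max(sex_counts))                                      # Modal level of sex
table(sex = myheartdata$sex, condition = myheartdata$condition)   # Contingency table with the target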

Another important preprocessing step is identification of missing values and the percentage of columns containing “NA” values in this dataset. This dataset does not include “NA” values that require preprocessing.

cols_with_nas <- sum(colSums(is.na(myheartdata)) > 0)            # Number of columns containing any NA
Percent_col_NA <- percent(cols_with_nas / length(myheartdata))   # Proportion of columns with NAs, formatted
cols_with_nas
## [1] 0
Percent_col_NA
## [1] "0%"
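
As a complementary check, the proportion of missing cells across the entire data frame, rather than the proportion of affected columns, can be computed in the same way; the line below is an illustrative addition using the already loaded scales::percent().

percent(sum(is.na(myheartdata)) / (nrow(myheartdata) * ncol(myheartdata)))   # Share of all cells that are NA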

The updated summary includes counts for the categorical variables. Only numeric variables have means and medians.

summary(myheartdata)
##       age        sex     cp         trestbps          chol       fbs    
##  Min.   :29.00   0: 96   0: 23   Min.   : 94.0   Min.   :126.0   0:254  
##  1st Qu.:48.00   1:201   1: 49   1st Qu.:120.0   1st Qu.:211.0   1: 43  
##  Median :56.00           2: 83   Median :130.0   Median :243.0          
##  Mean   :54.54           3:142   Mean   :131.7   Mean   :247.4          
##  3rd Qu.:61.00                   3rd Qu.:140.0   3rd Qu.:276.0          
##  Max.   :77.00                   Max.   :200.0   Max.   :564.0          
##  restecg    thalach      exang      oldpeak      slope   ca      thal   
##  0:147   Min.   : 71.0   0:200   Min.   :0.000   0:139   0:174   0:164  
##  1:  4   1st Qu.:133.0   1: 97   1st Qu.:0.000   1:137   1: 65   1: 18  
##  2:146   Median :153.0           Median :0.800   2: 21   2: 38   2:115  
##          Mean   :149.6           Mean   :1.056           3: 20          
##          3rd Qu.:166.0           3rd Qu.:1.600                          
##          Max.   :202.0           Max.   :6.200                          
##    condition     
##  Min.   :0.0000  
##  1st Qu.:0.0000  
##  Median :0.0000  
##  Mean   :0.4613  
##  3rd Qu.:1.0000  
##  Max.   :1.0000

Additionally, use of the str() function reveals that the categorical variables have been appropriately reclassified as factors.

str(myheartdata)
## spc_tbl_ [297 × 14] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ age      : num [1:297] 69 69 66 65 64 64 63 61 60 59 ...
##  $ sex      : Factor w/ 2 levels "0","1": 2 1 1 2 2 2 2 2 1 2 ...
##  $ cp       : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
##  $ trestbps : num [1:297] 160 140 150 138 110 170 145 134 150 178 ...
##  $ chol     : num [1:297] 234 239 226 282 211 227 233 234 240 270 ...
##  $ fbs      : Factor w/ 2 levels "0","1": 2 1 1 2 1 1 2 1 1 1 ...
##  $ restecg  : Factor w/ 3 levels "0","1","2": 3 1 1 3 3 3 3 1 1 3 ...
##  $ thalach  : num [1:297] 131 151 114 174 144 155 150 145 171 145 ...
##  $ exang    : Factor w/ 2 levels "0","1": 1 1 1 1 2 1 1 1 1 1 ...
##  $ oldpeak  : num [1:297] 0.1 1.8 2.6 1.4 1.8 0.6 2.3 2.6 0.9 4.2 ...
##  $ slope    : Factor w/ 3 levels "0","1","2": 2 1 3 2 2 2 3 2 1 3 ...
##  $ ca       : Factor w/ 4 levels "0","1","2","3": 2 3 1 2 1 1 1 3 1 1 ...
##  $ thal     : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 3 2 1 1 3 ...
##  $ condition: num [1:297] 0 0 0 1 0 0 0 1 0 0 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   age = col_double(),
##   ..   sex = col_double(),
##   ..   cp = col_double(),
##   ..   trestbps = col_double(),
##   ..   chol = col_double(),
##   ..   fbs = col_double(),
##   ..   restecg = col_double(),
##   ..   thalach = col_double(),
##   ..   exang = col_double(),
##   ..   oldpeak = col_double(),
##   ..   slope = col_double(),
##   ..   ca = col_double(),
##   ..   thal = col_double(),
##   ..   condition = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

Recommendations

Further preprocessing, including continued screening for missing values and for the outliers noted above as threats to data quality, is recommended prior to data visualization and model development.
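
The lines below are a minimal sketch of what such checks could look like. The per-column NA counts use base R, and the outlier screen applies the common 1.5 × IQR rule, which is an assumption for illustration rather than a requirement of this dataset.

colSums(is.na(myheartdata))                            # Per-column missing value counts
numeric_cols <- c("age", "trestbps", "chol", "thalach", "oldpeak")
sapply(myheartdata[numeric_cols], function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)        # First and third quartiles
  iqr <- q[2] - q[1]                                   # Interquartile range
  sum(x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr)     # Count of values outside the 1.5 * IQR fences
})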

Conclusions

In conclusion, the EDA completed for the Cleveland Heart dataset included valuable preprocessing, variable reclassification, and data inspection to ensure that data integrity is maintained. Data integrity is required for appropriate data visualization, model development, and the drawing of valid conclusions from statistical results.

Works Cited

Centers for Disease Control and Prevention. “About Heart Disease.” Heart Disease, CDC, 15 May 2024, www.cdc.gov/heart-disease/about/index.html.

“Heart Disease Cleveland.” Kaggle, www.kaggle.com/datasets/ritwikb3/heart-disease-cleveland.

Ismail, Aishah. “Exploratory Data Analysis on Heart Disease UCI Data Set.” Towards Data Science, 13 Sept. 2020, towardsdatascience.com/exploratory-data-analysis-on-heart-disease-uci-data-set-ae129e47b323/. Accessed 6 Sept. 2025.