Heart disease is a leading global health concern, accounting for a significant number of deaths each year. It encompasses various conditions affecting the heart, including coronary artery disease, heart attacks, and arrhythmias. Early detection and accurate diagnosis are critical in preventing complications, reducing healthcare costs, and improving patient outcomes.
Hospitals, healthcare providers, and insurance companies are increasingly interested in using data-driven insights to support early intervention strategies. Identifying individuals at risk of developing heart disease early also plays a vital role in preventive cardiology initiatives.
The Heart Disease dataset from the UCI Machine Learning Repository provides a valuable collection of patient attributes, including demographic information, clinical measurements, and diagnostic test results. Key indicators such as age, cholesterol levels, blood pressure, and chest pain type are included, offering a rich basis for analysis. Analyzing this data can help uncover patterns and correlations that contribute to heart disease and support better decision-making in medical settings.
Data visualization plays a crucial role in understanding complex medical datasets. By employing exploratory data analysis (EDA) and visualization techniques, we can uncover hidden trends, compare different risk factors, and generate insights that support clinical and operational decisions. This project focuses on utilizing various data visualization methods to analyze the UCI Heart Disease dataset and provide a clearer understanding of the factors influencing heart disease risk.
At this stage, no predictive model is being developed. The goal is purely exploratory: to generate hypotheses, guide future analytical strategies, and understand the structure and relationships in the data.
The dataset used is the Heart Disease dataset from the UCI Machine Learning Repository. It is a compilation of four databases (Cleveland, Hungary, Switzerland, and Long Beach VA), but most studies (and this project) focus primarily on the Cleveland dataset, which has the most complete records.
The dataset contains various features related to patients’ health and demographic information. We will explore the dataset to understand its structure and relationships between variables.
The dataset contains 14 key attributes that are either numerical or categorical. These attributes are:
- age: Age of the patient in years (numeric)
- sex: Gender of the patient (1 = male, 0 = female)
- cp: Chest pain type (categorical: 1-4)
- trestbps: Resting blood pressure (numeric)
- chol: Serum cholesterol (numeric)
- fbs: Fasting blood sugar > 120 mg/dl (1 = true, 0 = false)
- restecg: Resting electrocardiographic results (categorical)
- thalach: Maximum heart rate achieved (numeric)
- exang: Exercise-induced angina (1 = yes, 0 = no)
- oldpeak: ST depression induced by exercise (numeric)
- slope: The slope of the peak exercise ST segment (categorical)
- ca: Number of major vessels (0-3, numeric)
- thal: Thalassemia (categorical: 3 = normal, 6 = fixed defect, 7 = reversible defect)
- target: Diagnosis of heart disease (0 = absent; values 1-4 in the raw data indicate presence)

Note: Some fields (e.g., “ca” and “thal”) may contain missing or invalid values coded as ‘?’, which need attention during preprocessing.
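As a minimal sketch of how the ‘?’ placeholders can be turned into proper missing values at load time, the example below reads a small inline sample instead of the real file. The three rows are hypothetical and only mimic the comma-separated layout of the raw data; `raw_text`, `sample_df`, and `missing_per_column` are illustrative names.

```r
# Hypothetical three-row sample in the same comma-separated layout as the raw file.
# Two fields are deliberately set to "?" to stand in for missing entries.
raw_text <- "63,1,1,145,233,1,2,150,0,2.3,3,0,6,0
67,1,4,160,286,0,2,108,1,1.5,2,?,3,2
53,1,3,130,197,1,2,152,0,1.2,3,0,?,0"

# na.strings = "?" converts every "?" field to NA while parsing,
# so the affected columns stay numeric instead of becoming character
sample_df <- read.csv(text = raw_text, header = FALSE, na.strings = "?")

# Count the missing values in each column
missing_per_column <- colSums(is.na(sample_df))
print(missing_per_column)
```

Handling the placeholder while parsing means later steps only need to check for `NA`, not for the literal string.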
The dataset contains 303 instances (patients) and 14 attributes: 13 predictors plus the target variable (presence or absence of heart disease).
| Attribute | Data Type | Description | Constraints/Rules |
|---|---|---|---|
| age | Numerical | The age of the patient in years | Range: 29 to 77 (based on the dataset statistics) |
| sex | Categorical | The gender of the patient | Values: 1 = Male, 0 = Female |
| cp | Categorical | Type of chest pain experienced by the patient | Values: 1 = Typical angina, 2 = Atypical angina, 3 = Non-anginal pain, 4 = Asymptomatic |
| trestbps | Numerical | Resting blood pressure of the patient, measured in mmHg | Range: 94 to 200 mmHg |
| chol | Numerical | Serum cholesterol level in mg/dl | Range: 126 to 564 mg/dl |
| fbs | Categorical | Fasting blood sugar level > 120 mg/dl | Values: 1 = True, 0 = False |
| restecg | Categorical | Results of the patient’s resting electrocardiogram | Values: 0 = Normal, 1 = ST-T wave abnormality, 2 = Probable or definite left ventricular hypertrophy |
| thalach | Numerical | Maximum heart rate achieved during a stress test | Range: 71 to 202 bpm |
| exang | Categorical | Whether the patient experiences exercise-induced angina | Values: 1 = Yes, 0 = No |
| oldpeak | Numerical | ST depression induced by exercise relative to rest (an ECG measure) | Range: 0.0 to 6.2 (higher values indicate more severe abnormalities) |
| slope | Categorical | Slope of the peak exercise ST segment | Values: 1 = Upsloping, 2 = Flat, 3 = Downsloping |
| ca | Numerical | Number of major vessels colored by fluoroscopy | Range: 0 to 3 |
| thal | Categorical | Blood disorder variable related to thalassemia | Values: 3 = Normal, 6 = Fixed defect, 7 = Reversible defect |
| target | Categorical | Diagnosis of heart disease | Values: 0 = No heart disease, 1 = Presence of heart disease (raw values 1-4 are recoded to 1) |
- ca and thal may have missing values or unknown entries (‘?’).
- chol (cholesterol) and trestbps (blood pressure) may have outliers that need to be detected and considered in analysis.

# Store the URL of the dataset in the UCI Machine Learning Repository
data_url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data"
# Read the dataset into a data frame, treating "?" as a missing value
Heart.df <- read.csv(data_url, header = FALSE, na.strings = "?")
# Check the dimensions (rows, columns)
dim(Heart.df)
[1] 303 14
View the first six rows of the dataset
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14
1 63 1 1 145 233 1 2 150 0 2.3 3 0 6 0
2 67 1 4 160 286 0 2 108 1 1.5 2 3 3 2
3 67 1 4 120 229 0 2 129 1 2.6 2 2 7 1
4 37 1 3 130 250 0 0 187 0 3.5 3 0 3 0
5 41 0 2 130 204 0 2 172 0 1.4 1 0 3 0
6 56 1 2 120 236 0 0 178 0 0.8 1 0 3 0
colnames(Heart.df) <- c("age", "sex", "cp", "trestbps", "chol", "fbs", "restecg", "thalach", "exang", "oldpeak", "slope", "ca", "thal", "target")
# Inspect the structure of the data frame
str(Heart.df)
'data.frame': 303 obs. of 14 variables:
$ age : num 63 67 67 37 41 56 62 57 63 53 ...
$ sex : num 1 1 1 1 0 1 0 0 1 1 ...
$ cp : num 1 4 4 3 2 2 4 4 4 4 ...
$ trestbps: num 145 160 120 130 130 120 140 120 130 140 ...
$ chol : num 233 286 229 250 204 236 268 354 254 203 ...
$ fbs : num 1 0 0 0 0 0 0 0 0 1 ...
$ restecg : num 2 2 2 0 2 0 2 0 2 2 ...
$ thalach : num 150 108 129 187 172 178 160 163 147 155 ...
$ exang : num 0 1 1 0 0 0 0 1 0 1 ...
$ oldpeak : num 2.3 1.5 2.6 3.5 1.4 0.8 3.6 0.6 1.4 3.1 ...
$ slope : num 3 2 2 3 1 1 3 1 2 3 ...
$ ca : num 0 3 2 0 0 0 2 0 1 0 ...
$ thal : num 6 3 7 3 3 3 3 3 7 7 ...
$ target : int 0 2 1 0 0 0 3 0 2 1 ...
Display the statistical summary of the dataframe
age sex cp trestbps
Min. :29.00 Min. :0.0000 Min. :1.000 Min. : 94.0
1st Qu.:48.00 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:120.0
Median :56.00 Median :1.0000 Median :3.000 Median :130.0
Mean :54.44 Mean :0.6799 Mean :3.158 Mean :131.7
3rd Qu.:61.00 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:140.0
Max. :77.00 Max. :1.0000 Max. :4.000 Max. :200.0
chol fbs restecg thalach
Min. :126.0 Min. :0.0000 Min. :0.0000 Min. : 71.0
1st Qu.:211.0 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:133.5
Median :241.0 Median :0.0000 Median :1.0000 Median :153.0
Mean :246.7 Mean :0.1485 Mean :0.9901 Mean :149.6
3rd Qu.:275.0 3rd Qu.:0.0000 3rd Qu.:2.0000 3rd Qu.:166.0
Max. :564.0 Max. :1.0000 Max. :2.0000 Max. :202.0
exang oldpeak slope ca
Min. :0.0000 Min. :0.00 Min. :1.000 Min. :0.0000
1st Qu.:0.0000 1st Qu.:0.00 1st Qu.:1.000 1st Qu.:0.0000
Median :0.0000 Median :0.80 Median :2.000 Median :0.0000
Mean :0.3267 Mean :1.04 Mean :1.601 Mean :0.6722
3rd Qu.:1.0000 3rd Qu.:1.60 3rd Qu.:2.000 3rd Qu.:1.0000
Max. :1.0000 Max. :6.20 Max. :3.000 Max. :3.0000
NA's :4
thal target
Min. :3.000 Min. :0.0000
1st Qu.:3.000 1st Qu.:0.0000
Median :3.000 Median :0.0000
Mean :4.734 Mean :0.9373
3rd Qu.:7.000 3rd Qu.:2.0000
Max. :7.000 Max. :4.0000
NA's :2
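The data dictionary flags chol and trestbps as candidates for outliers. One common way to detect them, shown here as a sketch and not as the method used in this report, is Tukey's 1.5 × IQR rule. The helper name `flag_outliers` and the vector `chol_sample` are assumptions made for illustration; the sample combines the first ten chol values shown in the structure output with the reported maximum of 564.

```r
# Flag values that fall outside the 1.5 * IQR fences (Tukey's rule)
flag_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)  # first and third quartiles
  iqr <- q[[2]] - q[[1]]                         # interquartile range
  x < q[[1]] - 1.5 * iqr | x > q[[2]] + 1.5 * iqr
}

# Illustrative sample: first ten chol values from the str() output plus the maximum
chol_sample <- c(233, 286, 229, 250, 204, 236, 268, 354, 254, 203, 564)

# Values flagged as potential outliers in this small sample
chol_sample[flag_outliers(chol_sample)]
```

Applied to the full column, the same rule with the quartiles from the summary (211 and 275 for chol) gives fences at 115 and 371, so the maximum of 564 would fall outside them. Flagged values are candidates for inspection, not necessarily errors.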
According to the data dictionary, sex, fbs, exang, and target should be binary variables. The summary shows, however, that target takes values from 0 to 4, so every value greater than 0 is recoded to 1.

Heart.df$sex <- ifelse(Heart.df$sex > 0, 1, 0)
Heart.df$fbs <- ifelse(Heart.df$fbs > 0, 1, 0)
Heart.df$exang <- ifelse(Heart.df$exang > 0, 1, 0)
Heart.df$target <- ifelse(Heart.df$target > 0, 1, 0)
# Count the missing values in each column
colSums(is.na(Heart.df))
age sex cp trestbps chol fbs restecg thalach
0 0 0 0 0 0 0 0
exang oldpeak slope ca thal target
0 0 0 4 2 0
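To make the effect of the binary recoding concrete, the sketch below applies the same `ifelse()` rule to the first ten target values shown in the structure output. The vector names `raw_target` and `binary_target` are illustrative only.

```r
# First ten raw target values, as shown in the str() output
raw_target <- c(0, 2, 1, 0, 0, 0, 3, 0, 2, 1)

# Same rule as in the report: any value above 0 becomes 1
binary_target <- ifelse(raw_target > 0, 1, 0)

# Distinct values before and after recoding
sort(unique(raw_target))
sort(unique(binary_target))
```

The recoding keeps 0 as "no disease" and collapses the severity grades 1 to 4 into a single "disease present" class, so information about severity is intentionally discarded.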
The summary and the missing-value counts above show that the ca and thal columns contain missing values (4 and 2, respectively).

# Impute missing values in 'ca' and 'thal' with the column median.
# The "?" entries were already converted to NA by na.strings = "?" when the file was read.
Heart.df$ca[is.na(Heart.df$ca)] <- median(Heart.df$ca, na.rm = TRUE)
Heart.df$thal[is.na(Heart.df$thal)] <- median(Heart.df$thal, na.rm = TRUE)

# Collect every row that has an exact duplicate elsewhere in the data frame
dupes <- Heart.df[duplicated(Heart.df) | duplicated(Heart.df, fromLast = TRUE), ]
# Print or inspect the duplicate entries
print(dupes)
[1] age sex cp trestbps chol fbs restecg thalach
[9] exang oldpeak slope ca thal target
<0 rows> (or 0-length row.names)
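Because thal is a coded category (3, 6, 7), imputing with the most frequent value is an alternative to the median that always returns a valid code. The following is a sketch of that alternative, not the approach used above; `stat_mode` and `thal_sample` are hypothetical names, and the sample values are made up for illustration.

```r
# Return the most frequent non-missing value of a numeric-coded vector
stat_mode <- function(x) {
  counts <- table(x)  # table() ignores NA by default
  as.numeric(names(counts)[which.max(counts)])
}

# Hypothetical thal codes with one missing entry
thal_sample <- c(6, 3, 7, 3, 3, NA, 7, 3)

# Replace the missing entry with the mode
thal_sample[is.na(thal_sample)] <- stat_mode(thal_sample)
print(thal_sample)
```

For thal in this dataset the summary reports a median of 3, which is also a valid code, so both approaches fill the gaps with a legitimate category here. The mode is the safer default when a median could land between two codes.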
Convert categorical attributes to factors
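Before the conversion is applied to the whole data frame, a small sketch shows how `factor()` maps numeric codes to labels and what happens to a code that is not listed in `levels`. The vector `thal_codes` is hypothetical, and the value 2 is included on purpose as an invalid code.

```r
# Hypothetical thal codes; 2 is deliberately not one of the valid levels
thal_codes <- c(6, 3, 7, 3, 2)

# Map each numeric code to its label; codes outside `levels` become NA
thal_factor <- factor(thal_codes,
                      levels = c(3, 6, 7),
                      labels = c("Normal", "Fixed Defect", "Reversible"))

# Tabulate the result, including any NA produced by unmatched codes
table(thal_factor, useNA = "ifany")
```

Checking for new `NA` values after the conversion is a cheap way to confirm that the level codes match the data dictionary.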
# Define a list of categorical columns with their levels and labels
categorical_columns <- list(
  sex = list(levels = c(0, 1), labels = c("Female", "Male")),
  cp = list(levels = c(1, 2, 3, 4), labels = c("Typical Angina", "Atypical Angina", "Non-Angina", "Asymptomatic")),
  fbs = list(levels = c(0, 1), labels = c("False", "True")),
  restecg = list(levels = c(0, 1, 2), labels = c("Normal", "Wave-abnormality", "Probable")),
  exang = list(levels = c(0, 1), labels = c("No", "Yes")),
  slope = list(levels = c(1, 2, 3), labels = c("Upsloping", "Flat", "Downsloping")),
  thal = list(levels = c(3, 6, 7), labels = c("Normal", "Fixed Defect", "Reversible")),
  target = list(levels = c(1, 0), labels = c("Yes", "No"))
)
# Apply the factor transformation using a loop
for (col in names(categorical_columns)) {
  Heart.df[[col]] <- factor(Heart.df[[col]],
                            levels = categorical_columns[[col]]$levels,
                            labels = categorical_columns[[col]]$labels)
}

# Load the plotting packages: ggplot2 for the plots, patchwork to combine them
library(ggplot2)
library(patchwork)

# Boxplot of a numeric variable (var2) grouped by a factor (var1)
HeartDiseaseBoxplot <- function(var1, var2) {
  ggplot(Heart.df, aes(x = .data[[var1]],
                       y = .data[[var2]],
                       fill = .data[[var1]])) +
    geom_boxplot() + theme_test() +
    labs(title = paste("Boxplot of", var2, "by", var1),
         x = var1, y = var2, fill = "Heart Disease")
}

# Dodged bar chart of a categorical variable, split by heart disease status
HeartDiseaseBar <- function(var) {
  ggplot(Heart.df, aes(x = .data[[var]], fill = target)) +
    geom_bar(position = "dodge") + theme_test() +
    labs(title = paste("Distribution of Heart Disease by", var),
         x = var, fill = "Heart Disease")
}

# Histogram of a numeric variable, filled by heart disease status
HeartDiseaseHist <- function(var1) {
  ggplot(Heart.df, aes(x = .data[[var1]], fill = target)) +
    geom_histogram(bins = 15) + theme_test() +
    labs(title = paste("Distribution of", var1),
         x = var1, fill = "Heart Disease")
}

# Scatterplot of two numeric variables with linear trend lines per group
HeartDiseaseScatter <- function(point1, point2) {
  ggplot(Heart.df, aes(x = .data[[point1]],
                       y = .data[[point2]],
                       color = target)) +
    geom_point(size = 2) + theme_test() +
    geom_smooth(method = "lm", se = FALSE, color = "blue", formula = y ~ x) +
    labs(title = paste("Scatterplot of", point1, "by", point2),
         x = point1, y = point2, color = "Heart Disease")
}

# Create the plots
p1 <- HeartDiseaseBoxplot("target", "age")
p2 <- HeartDiseaseBoxplot("target", "trestbps")
p3 <- HeartDiseaseBoxplot("target", "chol")
p4 <- HeartDiseaseBoxplot("target", "thalach")
p5 <- HeartDiseaseBoxplot("target", "oldpeak")
# Combine plot using patchwork
(p1 | p2) /
(p3 | p4) /
(p5)

# Create the plots
g1 <- ggplot(Heart.df, aes(x=target, fill=target))+
geom_bar() + theme_test() +
ggtitle("Distribution of Heart Disease") +
labs(x = "Heart Disease", fill = "Heart Disease")
g2 <- HeartDiseaseBar("sex")
g3 <- HeartDiseaseBar("cp")
g4 <- HeartDiseaseBar("fbs")
g5 <- HeartDiseaseBar("restecg")
g6 <- HeartDiseaseBar("exang")
g7 <- HeartDiseaseBar("slope")
g8 <- HeartDiseaseBar("thal")
# Combine plot using patchwork
(g1 | g2) /
(g3 | g4) /
(g5 | g6) /
(g7 | g8)

# Create the plots
p1 <- HeartDiseaseHist("age")
p2 <- HeartDiseaseHist("trestbps")
p3 <- HeartDiseaseHist("chol")
p4 <- HeartDiseaseHist("thalach")
p5 <- HeartDiseaseHist("oldpeak")
# Combine plot using patchwork
(p1) /
(p2 | p3) /
(p4 | p5)

# Create the plots
p1 <- HeartDiseaseScatter("age", "oldpeak")
p2 <- HeartDiseaseScatter("age", "chol")
p3 <- HeartDiseaseScatter("age", "trestbps")
p4 <- HeartDiseaseScatter("age", "thalach")
p5 <- HeartDiseaseScatter("chol", "thalach")
p6 <- HeartDiseaseScatter("trestbps", "chol")
p7 <- HeartDiseaseScatter("thalach", "oldpeak")
# Combine plot using patchwork
(p1 | p2) /
(p3 | p4) /
(p5 | p6) /
(p7)

# Load GGally, which provides ggpairs()
library(GGally)
# Create a colored pair plot for selected variables
ggpairs(Heart.df[, c("age", "trestbps", "chol",
"thalach", "oldpeak", "target")],
aes(color = target, fill = target))

The distribution of Age between patients with and without heart disease overlaps considerably. However, individuals without heart disease (“No”) tend to have a slightly younger age distribution. The boxplot comparison highlights this difference, though it is not particularly dramatic.
Looking at Trestbps (resting blood pressure), there is a modest positive correlation with age overall (0.273). Interestingly, patients without heart disease show a slightly stronger correlation (0.297) compared to those diagnosed with heart disease (0.206).
When examining Chol (cholesterol levels), correlations with other variables are very weak in both the “Yes” and “No” groups. Cholesterol does not appear to be a strong factor for separating patients with heart disease from those without, based on the current plots.
In contrast, Thalach (maximum heart rate achieved) shows a moderate negative correlation with age, most pronounced among patients without heart disease (No: -0.444***): maximum heart rate declines as age increases. Separately, the boxplots show that patients without heart disease tend to reach higher maximum heart rates than those with heart disease.
Oldpeak (measuring ST depression induced by exercise) exhibits a weak positive correlation with age, including among patients without heart disease (No: 0.191). Higher Oldpeak values are more commonly associated with the presence of heart disease.
Focusing on the Target variable (indicating heart disease “Yes” or “No”), Oldpeak and Thalach demonstrate the clearest separation between groups. Patients with heart disease typically exhibit higher Oldpeak values and lower Thalach values.
Among all variables, the strongest correlations observed are the negative relationships between thalach and age, and between oldpeak and thalach. Despite these findings, most variables individually do not display extremely strong correlations.
In summary, older patients tend to have lower maximum heart rates, and higher Oldpeak values are associated with the presence of heart disease. Meanwhile, cholesterol and resting blood pressure do not visually differentiate patients with heart disease from those without in this analysis.
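The overall and per-group correlations quoted in this discussion come from the pair plot, but they can also be computed directly. The sketch below does so on a small hypothetical data frame (`toy`); its values are invented for illustration and are not taken from the dataset, so the resulting coefficients will not match the ones reported here.

```r
# Hypothetical toy data; values are illustrative, not taken from the dataset
toy <- data.frame(
  age     = c(40, 45, 50, 55, 60, 65, 42, 48, 54, 58, 63, 68),
  thalach = c(180, 172, 170, 160, 155, 150, 160, 150, 152, 140, 135, 130),
  target  = factor(rep(c("No", "Yes"), each = 6))
)

# Pearson correlation across all rows
overall <- cor(toy$age, toy$thalach)

# Pearson correlation within each target group
by_group <- sapply(split(toy, toy$target),
                   function(g) cor(g$age, g$thalach))

# p-value of the overall correlation, the quantity usually summarized by significance stars
p_value <- cor.test(toy$age, toy$thalach)$p.value

print(overall)
print(by_group)
print(p_value)
```

Comparing the overall coefficient with the per-group ones helps show whether an association holds within each group or mainly reflects differences between the groups.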
# Load dplyr for select() and the pipe, and corrplot for the correlation plot
library(dplyr)
library(corrplot)
# Selecting only continuous variables
continuous_vars <- c("age", "trestbps", "chol", "thalach", "oldpeak")
continuous_data <- Heart.df %>% select(all_of(continuous_vars))
# Calculating correlation matrix
correlation_matrix <- cor(continuous_data)
# Plotting the correlation matrix
corrplot(correlation_matrix, method = "circle",
         type = "lower", tl.col = "black")

For exploratory purposes, the project aims to investigate and uncover patterns in patient demographics, clinical features, and diagnostic results that correlate with the presence or absence of heart disease. Specifically, the project will: