library(dplyr)
library(tidyr)
library(ggplot2)
library(reshape2)
library(caret)
heart <- read.csv("heart.csv")
str(heart)
## 'data.frame': 1025 obs. of 14 variables:
## $ age : int 52 53 70 61 62 58 58 55 46 54 ...
## $ sex : int 1 1 1 1 0 0 1 1 1 1 ...
## $ cp : int 0 0 0 0 0 0 0 0 0 0 ...
## $ trestbps: int 125 140 145 148 138 100 114 160 120 122 ...
## $ chol : int 212 203 174 203 294 248 318 289 249 286 ...
## $ fbs : int 0 1 0 0 1 0 0 0 0 0 ...
## $ restecg : int 1 0 1 1 1 0 2 0 0 0 ...
## $ thalach : int 168 155 125 161 106 122 140 145 144 116 ...
## $ exang : int 0 1 1 0 0 0 0 1 0 1 ...
## $ oldpeak : num 1 3.1 2.6 0 1.9 1 4.4 0.8 0.8 3.2 ...
## $ slope : int 2 0 0 2 1 1 0 1 2 1 ...
## $ ca : int 2 0 0 1 3 0 3 1 0 2 ...
## $ thal : int 3 3 3 3 2 2 1 3 3 2 ...
## $ target : int 0 0 0 0 0 1 0 0 0 0 ...
The str() function reveals the data types and dimensions
of the dataset. It contains 1025 observations and 14 variables. Thirteen
are stored as integers (e.g., age, sex, cp), although several of these
(such as sex, cp, and target) are categorical codes rather than
measurements. Only oldpeak is a continuous numeric value, representing
ST depression.
names(heart)
## [1] "age" "sex" "cp" "trestbps" "chol" "fbs"
## [7] "restecg" "thalach" "exang" "oldpeak" "slope" "ca"
## [13] "thal" "target"
The names() function returns all 14 column names in the
dataset, representing clinical attributes collected during patient
cardiac assessments.
head(heart, 15)
## age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal
## 1 52 1 0 125 212 0 1 168 0 1.0 2 2 3
## 2 53 1 0 140 203 1 0 155 1 3.1 0 0 3
## 3 70 1 0 145 174 0 1 125 1 2.6 0 0 3
## 4 61 1 0 148 203 0 1 161 0 0.0 2 1 3
## 5 62 0 0 138 294 1 1 106 0 1.9 1 3 2
## 6 58 0 0 100 248 0 0 122 0 1.0 1 0 2
## 7 58 1 0 114 318 0 2 140 0 4.4 0 3 1
## 8 55 1 0 160 289 0 0 145 1 0.8 1 1 3
## 9 46 1 0 120 249 0 0 144 0 0.8 2 0 3
## 10 54 1 0 122 286 0 0 116 1 3.2 1 2 2
## 11 71 0 0 112 149 0 1 125 0 1.6 1 0 2
## 12 43 0 0 132 341 1 0 136 1 3.0 1 0 3
## 13 34 0 1 118 210 0 1 192 0 0.7 2 0 2
## 14 51 1 0 140 298 0 1 122 1 4.2 1 3 3
## 15 52 1 0 128 204 1 1 156 1 1.0 1 0 0
## target
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 1
## 7 0
## 8 0
## 9 0
## 10 0
## 11 1
## 12 0
## 13 1
## 14 0
## 15 0
The head() function displays the first 15 rows of the
dataset, providing a quick preview of the data values across all
variables.
# Classify cholesterol (mg/dL) into three categories; vectorised via ifelse()
classify_cholesterol <- function(chol_value) {
  category <- ifelse(chol_value < 200, "Desirable",
                     ifelse(chol_value < 240, "Borderline High", "High"))
  return(category)
}
heart$chol_category <- classify_cholesterol(heart$chol)
table(heart$chol_category)
##
## Borderline High Desirable High
## 339 169 517
This custom function classifies each patient’s cholesterol into
clinical risk categories based on medical guidelines: Desirable (below
200 mg/dL), Borderline High (200–239 mg/dL), or High (240+ mg/dL). The
table() output shows the distribution across these
categories.
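As a sketch of an alternative, the same thresholds can be expressed with base R's cut(), which returns an ordered factor so that table() lists the categories in clinical order rather than alphabetically. The function name classify_cholesterol_cut and the sample values below are illustrative assumptions, not part of the original analysis.

```r
# Hypothetical alternative to classify_cholesterol(), using cut().
# right = FALSE makes each interval closed on the left,
# so [200, 240) is "Borderline High" and 240 itself is "High".
classify_cholesterol_cut <- function(chol_value) {
  cut(chol_value,
      breaks = c(-Inf, 200, 240, Inf),
      labels = c("Desirable", "Borderline High", "High"),
      right = FALSE,
      ordered_result = TRUE)
}

# Made-up cholesterol values (mg/dL) chosen to hit each boundary
example_chol <- c(150, 200, 239, 240, 564)
example_cat <- classify_cholesterol_cut(example_chol)
print(table(example_cat))
```

Because the result is an ordered factor, comparisons such as example_cat >= "Borderline High" also work, which a plain character column does not support.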
high_risk_males <- heart %>%
  filter(sex == 1, trestbps > 140, age > 50)
cat("Number of high-risk male patients:", nrow(high_risk_males), "\n")
## Number of high-risk male patients: 120
head(high_risk_males, 10)
## age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal
## 1 70 1 0 145 174 0 1 125 1 2.6 0 0 3
## 2 61 1 0 148 203 0 1 161 0 0.0 2 1 3
## 3 55 1 0 160 289 0 0 145 1 0.8 1 1 3
## 4 70 1 2 160 269 0 1 112 1 2.9 1 1 3
## 5 67 1 2 152 212 0 0 150 0 0.8 1 0 3
## 6 57 1 1 154 232 0 0 164 0 0.0 2 1 2
## 7 59 1 2 150 212 1 1 157 0 1.6 2 0 2
## 8 59 1 3 170 288 0 0 159 0 0.2 1 0 3
## 9 59 1 0 170 326 0 0 140 1 3.4 0 0 3
## 10 68 1 0 144 193 1 1 141 0 3.4 1 2 3
## target chol_category
## 1 0 Desirable
## 2 0 Borderline High
## 3 0 High
## 4 0 High
## 5 0 Borderline High
## 6 0 Borderline High
## 7 1 Borderline High
## 8 0 High
## 9 0 High
## 10 0 Desirable
This step keeps male patients (sex == 1) over age 50 with
resting blood pressure above 140 mmHg, a clinically relevant set of
criteria for identifying hypertensive males at elevated cardiovascular
risk. Of the 1025 rows, 120 match. Note that this count is taken before
duplicate rows are removed later in the report. The head() call limits
the output to 10 rows to keep the report concise.
selected_vars <- heart %>%
  select(age, chol, trestbps, thalach, target)
heart_long <- selected_vars %>%
  pivot_longer(cols = c(age, chol, trestbps, thalach),
               names_to = "variable",
               values_to = "value")
head(heart_long, 12)
## # A tibble: 12 × 3
## target variable value
## <int> <chr> <int>
## 1 0 age 52
## 2 0 chol 212
## 3 0 trestbps 125
## 4 0 thalach 168
## 5 0 age 53
## 6 0 chol 203
## 7 0 trestbps 140
## 8 0 thalach 155
## 9 0 age 70
## 10 0 chol 174
## 11 0 trestbps 145
## 12 0 thalach 125
The dependent variable is target (heart disease
diagnosis: 1 = present, 0 = absent). The independent variables are the
clinical predictors. Here, four key numeric predictors
(age, chol, trestbps,
thalach) are selected and reshaped from wide to long format
using pivot_longer(). This creates a new data frame where
each row holds one measurement (one variable and its value) for one
patient, so every patient contributes four rows. Because no patient
identifier is carried along, rows can only be traced back to a patient
by position. The long format is useful for grouped visualizations and
comparisons.
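A minimal sketch of the same reshape on toy data, with one addition: a patient_id column (an assumption introduced here, not present in heart.csv) is created before pivoting so that each long-format row can still be traced back to its patient.

```r
library(tidyr)

# Toy wide data: two made-up patients, two predictors
toy <- data.frame(age = c(52, 53), chol = c(212, 203), target = c(0, 0))

# Hypothetical identifier so rows stay traceable after reshaping
toy$patient_id <- seq_len(nrow(toy))

# Columns not listed in `cols` (target, patient_id) are repeated on every row
toy_long <- pivot_longer(toy,
                         cols = c(age, chol),
                         names_to = "variable",
                         values_to = "value")
print(toy_long)
```

Each patient now appears on two rows (one per pivoted column), and patient_id distinguishes them.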
cat("Missing values per column:\n")
## Missing values per column:
colSums(is.na(heart))
## age sex cp trestbps chol
## 0 0 0 0 0
## fbs restecg thalach exang oldpeak
## 0 0 0 0 0
## slope ca thal target chol_category
## 0 0 0 0 0
heart_clean <- na.omit(heart)
cat("\nRows before:", nrow(heart), "| Rows after:", nrow(heart_clean))
##
## Rows before: 1025 | Rows after: 1025
The colSums(is.na()) function counts missing values in
each column. The na.omit() function removes any rows
containing missing data. In this dataset there are no missing values, so
the row count remains unchanged. This is still an important validation
step in any data analysis workflow.
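Since the heart data has no missing values, the effect of these two functions is easier to see on a small example. The sketch below uses a made-up data frame with deliberately missing entries.

```r
# Toy data frame with deliberately missing values (not taken from heart.csv)
toy <- data.frame(age = c(52, NA, 70), chol = c(212, 203, NA))

na_counts <- colSums(is.na(toy))   # number of missing values in each column
toy_complete <- na.omit(toy)       # keeps only rows with no missing values

print(na_counts)
print(nrow(toy_complete))
```

Here each column has one missing value, but they fall on different rows, so na.omit() drops two of the three rows. This is why the per-column counts alone do not tell you how many rows will be lost.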
cat("Number of duplicated rows:", sum(duplicated(heart_clean)), "\n")
## Number of duplicated rows: 723
heart_clean <- heart_clean[!duplicated(heart_clean), ]
cat("Rows after removing duplicates:", nrow(heart_clean))
## Rows after removing duplicates: 302
The duplicated() function flags rows that are exact copies of an
earlier row, so the first occurrence of each record is kept. Here 723 of
the 1025 rows are flagged, leaving 302 unique records. Duplicate records
are a known issue with the Kaggle version of the UCI Heart Disease
dataset. Removing them prevents the same patient record from being
counted more than once, which would otherwise skew statistical results
and bias any predictive models built from this data.
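The behaviour of duplicated() can be checked on a small made-up data frame; the values below are illustrative only.

```r
# Toy data: rows 1, 2 and 4 are identical
toy <- data.frame(age = c(52, 52, 61, 52), chol = c(212, 212, 203, 212))

dup_flags <- duplicated(toy)      # FALSE for a first occurrence, TRUE for later copies
toy_unique <- toy[!dup_flags, ]   # same indexing pattern as used for heart_clean

print(dup_flags)
print(toy_unique)
```

Only the second and fourth rows are flagged, so the first copy of the repeated record survives along with the distinct row.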
heart_sorted <- heart_clean %>%
  arrange(desc(chol), desc(age))
head(heart_sorted, 10)
## age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal
## 1 67 0 2 115 564 0 0 160 0 1.6 1 0 3
## 2 65 0 2 140 417 1 0 157 0 0.8 2 1 2
## 3 56 0 0 134 409 0 0 150 1 1.9 1 2 3
## 4 63 0 0 150 407 0 0 154 0 4.0 1 3 3
## 5 62 0 0 140 394 0 0 157 0 1.2 1 0 2
## 6 65 0 2 160 360 0 0 151 0 0.8 2 0 2
## 7 57 0 0 120 354 0 1 163 1 0.6 2 0 2
## 8 55 1 0 132 353 0 1 132 1 1.2 1 1 3
## 9 55 0 1 132 342 0 1 166 0 1.2 2 0 2
## 10 43 0 0 132 341 1 0 136 1 3.0 1 0 3
## target chol_category
## 1 1 High
## 2 1 High
## 3 0 High
## 4 0 High
## 5 1 High
## 6 1 High
## 7 1 High
## 8 0 High
## 9 1 High
## 10 0 High
The dataset is sorted by cholesterol in descending order, with ties broken by age in descending order. This places the patients with the highest cholesterol at the top; age only decides the order among patients with identical cholesterol values. The result is useful for quickly spotting extreme cases, such as the 564 mg/dL record in the first row.
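For readers more familiar with base R, a rough equivalent of this two-key sort can be sketched with order(). The toy values are made up so that two rows tie on cholesterol.

```r
# Toy data: the first two rows tie on chol, so age decides their order
toy <- data.frame(age = c(50, 60, 70), chol = c(300, 300, 250))

# Negating a numeric column sorts it in descending order;
# the second key is only consulted when the first key ties
toy_sorted <- toy[order(-toy$chol, -toy$age), ]
print(toy_sorted)
```

The 70-year-old ends up last despite being the oldest, because cholesterol is the primary key.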
heart_renamed <- heart_clean %>%
  rename(
    Age = age,
    Sex = sex,
    ChestPainType = cp,
    RestingBP = trestbps,
    Cholesterol = chol,
    FastingBS = fbs,
    RestingECG = restecg,
    MaxHeartRate = thalach,
    ExerciseAngina = exang,
    STDepression = oldpeak,
    HeartDisease = target
  )
names(heart_renamed)
## [1] "Age" "Sex" "ChestPainType" "RestingBP"
## [5] "Cholesterol" "FastingBS" "RestingECG" "MaxHeartRate"
## [9] "ExerciseAngina" "STDepression" "slope" "ca"
## [13] "thal" "HeartDisease" "chol_category"
The rename() function replaces shorthand column codes
with descriptive clinical labels. This improves readability for anyone
reviewing the analysis who may not be familiar with the original UCI
dataset abbreviations. Note that slope, ca, thal, and chol_category are
not included in the call and keep their original names.
heart_clean$chol_double <- heart_clean$chol * 2
heart_clean$bp_hr_ratio <- round(heart_clean$trestbps / heart_clean$thalach, 3)
head(heart_clean[, c("chol", "chol_double", "trestbps", "thalach", "bp_hr_ratio")], 10)
## chol chol_double trestbps thalach bp_hr_ratio
## 1 212 424 125 168 0.744
## 2 203 406 140 155 0.903
## 3 174 348 145 125 1.160
## 4 203 406 148 161 0.919
## 5 294 588 138 106 1.302
## 6 248 496 100 122 0.820
## 7 318 636 114 140 0.814
## 8 289 578 160 145 1.103
## 9 249 498 120 144 0.833
## 10 286 572 122 116 1.052
Two new variables are created: chol_double multiplies
cholesterol by 2 (demonstrating a basic mathematical transformation),
and bp_hr_ratio divides resting blood pressure by maximum
heart rate, rounded to three decimals. The BP-to-HR ratio is offered
here as an exploratory derived metric rather than an established
clinical index: a higher ratio means resting pressure is high relative
to the peak heart rate achieved, which may indicate cardiovascular
inefficiency.
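The same ratio can also be written in the pipeline style used elsewhere in this report. The sketch below applies dplyr's mutate() to a toy data frame holding the first three trestbps and thalach values shown in the output above.

```r
library(dplyr)

# First three trestbps / thalach pairs from the printed output
toy <- data.frame(trestbps = c(125, 140, 145), thalach = c(168, 155, 125))

# mutate() adds the derived column without repeating the data frame name
toy <- toy %>%
  mutate(bp_hr_ratio = round(trestbps / thalach, 3))
print(toy)
```

The resulting ratios match the first three rows of the bp_hr_ratio column printed above.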
set.seed(123)
train_index <- sample(1:nrow(heart_clean), size = 0.7 * nrow(heart_clean))
train_set <- heart_clean[train_index, ]
test_set <- heart_clean[-train_index, ]
cat("Training set rows:", nrow(train_set), "\n")
## Training set rows: 211
cat("Test set rows:", nrow(test_set), "\n")
## Test set rows: 91
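Calling set.seed(123) fixes the state of the random number generator so
that the same split is produced every time the code runs. The dataset is
then split roughly 70/30 into training and test sets using sample():
0.7 × 302 = 211.4, which is truncated to 211 training rows, leaving 91
test rows. This is a simple random split, so the proportion of
target = 1 cases is not guaranteed to match between the two sets.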
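If matching class proportions matters, one option is a stratified split, where the 70% sample is drawn separately within each target class. The sketch below does this in base R on made-up data; it is an illustration of the idea, not the split used in this report. The caret package loaded at the top offers createDataPartition() for the same purpose.

```r
set.seed(123)

# Toy data: 12 rows with target = 0 and 8 rows with target = 1
toy <- data.frame(x = rnorm(20), target = rep(c(0, 1), times = c(12, 8)))

# Row indices grouped by class
idx_by_class <- split(seq_len(nrow(toy)), toy$target)

# Draw 70% (rounded down) of the indices within each class.
# Indexing via sample.int() avoids sample()'s special case for length-1 input.
train_index <- unlist(lapply(idx_by_class, function(idx) {
  idx[sample.int(length(idx), size = floor(0.7 * length(idx)))]
}))

train_set <- toy[train_index, ]
test_set <- toy[-train_index, ]

print(table(train_set$target))
print(table(test_set$target))
```

With 12 and 8 rows per class, this draws 8 and 5 training rows respectively, so both sets keep roughly the original class balance.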
summary(heart_clean)
## age sex cp trestbps
## Min. :29.00 Min. :0.0000 Min. :0.0000 Min. : 94.0
## 1st Qu.:48.00 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:120.0
## Median :55.50 Median :1.0000 Median :1.0000 Median :130.0
## Mean :54.42 Mean :0.6821 Mean :0.9636 Mean :131.6
## 3rd Qu.:61.00 3rd Qu.:1.0000 3rd Qu.:2.0000 3rd Qu.:140.0
## Max. :77.00 Max. :1.0000 Max. :3.0000 Max. :200.0
## chol fbs restecg thalach
## Min. :126.0 Min. :0.000 Min. :0.0000 Min. : 71.0
## 1st Qu.:211.0 1st Qu.:0.000 1st Qu.:0.0000 1st Qu.:133.2
## Median :240.5 Median :0.000 Median :1.0000 Median :152.5
## Mean :246.5 Mean :0.149 Mean :0.5265 Mean :149.6
## 3rd Qu.:274.8 3rd Qu.:0.000 3rd Qu.:1.0000 3rd Qu.:166.0
## Max. :564.0 Max. :1.000 Max. :2.0000 Max. :202.0
## exang oldpeak slope ca
## Min. :0.0000 Min. :0.000 Min. :0.000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.000 1st Qu.:1.000 1st Qu.:0.0000
## Median :0.0000 Median :0.800 Median :1.000 Median :0.0000
## Mean :0.3278 Mean :1.043 Mean :1.397 Mean :0.7185
## 3rd Qu.:1.0000 3rd Qu.:1.600 3rd Qu.:2.000 3rd Qu.:1.0000
## Max. :1.0000 Max. :6.200 Max. :2.000 Max. :4.0000
## thal target chol_category chol_double
## Min. :0.000 Min. :0.000 Length:302 Min. : 252.0
## 1st Qu.:2.000 1st Qu.:0.000 Class :character 1st Qu.: 422.0
## Median :2.000 Median :1.000 Mode :character Median : 481.0
## Mean :2.315 Mean :0.543 Mean : 493.0
## 3rd Qu.:3.000 3rd Qu.:1.000 3rd Qu.: 549.5
## Max. :3.000 Max. :1.000 Max. :1128.0
## bp_hr_ratio
## Min. :0.5250
## 1st Qu.:0.7580
## Median :0.8655
## Mean :0.9050
## 3rd Qu.:0.9928
## Max. :1.8220
The summary() function provides the five-number summary
(minimum, 1st quartile, median, 3rd quartile, maximum) plus the mean for
each numeric variable. Key observations: the median age is 55.5 years
(mean 54.4), median cholesterol is 240.5 mg/dL, and median maximum heart
rate is 152.5 bpm. For 0/1 variables the mean is the proportion coded 1,
so the target mean of 0.543 indicates that about 54% of patients in this
cleaned dataset are diagnosed with heart disease.
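Quartiles are not very informative for integer-coded categories such as sex or cp. As a sketch, converting such a column to a factor makes summary() report counts instead. The values below are made up, and the label "Female" for code 0 is an assumption inferred from the report's use of sex == 1 for males.

```r
# Made-up sex codes, using the dataset's 0/1 coding
sex_codes <- c(1, 1, 0, 1, 0)

# Assumed labels: 1 = Male (as used in the filter above), 0 = Female
sex_factor <- factor(sex_codes, levels = c(0, 1), labels = c("Female", "Male"))

print(summary(sex_codes))    # five-number summary plus mean
print(summary(sex_factor))   # counts per category

# For a 0/1 column, the mean equals the proportion coded 1
prop_male <- mean(sex_codes)
print(prop_male)
```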
# Return the most frequent value of x (the first one seen if there is a tie)
get_mode <- function(x) {
  uniq_vals <- unique(x)
  uniq_vals[which.max(tabulate(match(x, uniq_vals)))]
}
cat("=== Cholesterol (chol) ===\n")
## === Cholesterol (chol) ===
cat("Mean: ", mean(heart_clean$chol), "\n")
## Mean: 246.5
cat("Median:", median(heart_clean$chol), "\n")
## Median: 240.5
cat("Mode: ", get_mode(heart_clean$chol), "\n")
## Mode: 204
cat("Range: ", range(heart_clean$chol), "\n\n")
## Range: 126 564
cat("=== Maximum Heart Rate (thalach) ===\n")
## === Maximum Heart Rate (thalach) ===
cat("Mean: ", mean(heart_clean$thalach), "\n")
## Mean: 149.5695
cat("Median:", median(heart_clean$thalach), "\n")
## Median: 152.5
cat("Mode: ", get_mode(heart_clean$thalach), "\n")
## Mode: 162
cat("Range: ", range(heart_clean$thalach), "\n")
## Range: 71 202
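A custom get_mode() function is defined because base R has no
built-in function for the statistical mode (the built-in mode() returns
an object's storage mode instead). When several values tie for most
frequent, get_mode() returns the one that appears first in the data. The
statistics are computed on cholesterol and maximum heart rate.
Cholesterol has a mean (246.5) higher than its median (240.5),
suggesting a right-skewed distribution with some patients having
extremely high values (up to 564 mg/dL). For maximum heart rate, the
values cluster around 150 bpm (mean 149.6, median 152.5), consistent
with a middle-aged patient population undergoing cardiac stress testing.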
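The tie behaviour of get_mode() is easy to overlook, so here is a small sketch on made-up values. The helper all_modes() is a hypothetical addition that returns every value sharing the highest count.

```r
# Same definition as in the report
get_mode <- function(x) {
  uniq_vals <- unique(x)
  uniq_vals[which.max(tabulate(match(x, uniq_vals)))]
}

# Hypothetical helper: return all values tied for the highest frequency
all_modes <- function(x) {
  counts <- table(x)
  as.numeric(names(counts)[counts == max(counts)])
}

x <- c(2, 2, 1, 1, 3)     # 2 and 1 both occur twice
print(get_mode(x))        # first-seen mode only
print(all_modes(x))       # every tied mode, in sorted order
```

If all_modes() returns more than one value for a column, the single number reported by get_mode() should be read as one of several equally common values.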
ggplot(heart_clean, aes(x = age, y = thalach, color = as.factor(target))) +
  geom_point(alpha = 0.6, size = 2) +
  labs(title = "Scatter Plot: Age vs Maximum Heart Rate",
       x = "Age",
       y = "Maximum Heart Rate (thalach)",
       color = "Heart Disease") +
  scale_color_manual(values = c("0" = "#E74C3C", "1" = "#2ECC71"),
                     labels = c("No Disease", "Disease")) +
  theme_minimal()
The scatter plot shows a negative relationship between age and maximum heart rate: as patients get older, their maximum achievable heart rate decreases, which is physiologically expected. With the color coding used here, patients coded target = 1 (green, labeled Disease) tend to cluster at higher heart rates for their age than patients coded target = 0 (red, labeled No Disease). That pattern is the opposite of what one would expect if a higher achievable heart rate during stress testing reflects better cardiac function, so the coding of target should be confirmed against the dataset documentation before drawing clinical conclusions from it.
ggplot(heart_clean, aes(x = as.factor(cp), fill = as.factor(target))) +
  geom_bar(position = "dodge") +
  labs(title = "Bar Plot: Chest Pain Type by Heart Disease Status",
       x = "Chest Pain Type",
       y = "Count",
       fill = "Heart Disease") +
  scale_fill_manual(values = c("0" = "#3498DB", "1" = "#E67E22"),
                    labels = c("No Disease", "Disease")) +
  scale_x_discrete(labels = c("0" = "Typical Angina", "1" = "Atypical",
                              "2" = "Non-Anginal", "3" = "Asymptomatic")) +
  theme_minimal()
The bar plot shows the distribution of patients across four chest pain types, grouped by heart disease status. Notably, patients with chest pain type 0 (labeled Typical Angina in the plot) are predominantly in the no-disease group, while patients with atypical angina and non-anginal pain show higher proportions of heart disease. This suggests that chest pain type separates the two groups well, although the label assigned to each cp code should be checked against the dataset documentation.
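Because the bar plot shows raw counts, the proportions mentioned above have to be judged by eye. A minimal sketch of how they could be computed directly uses table() and prop.table(); the cp and target values below are made up for illustration.

```r
# Made-up codes: three patients with cp = 0, two with cp = 1, one with cp = 2
cp <- c(0, 0, 0, 1, 1, 2)
target <- c(0, 0, 1, 1, 1, 1)

counts <- table(cp = cp, target = target)   # cross-tabulated counts
row_props <- prop.table(counts, margin = 1) # share of each target value within a cp type

print(counts)
print(round(row_props, 3))
```

Using margin = 1 makes each row sum to 1, so the target = 1 column can be read as the disease share within each chest pain type.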
cor_value <- cor(heart_clean$age, heart_clean$thalach, method = "pearson")
cat("Pearson Correlation (Age vs Max Heart Rate):", cor_value, "\n")
## Pearson Correlation (Age vs Max Heart Rate): -0.3952352
cor.test(heart_clean$age, heart_clean$thalach, method = "pearson")
##
## Pearson's product-moment correlation
##
## data: heart_clean$age and heart_clean$thalach
## t = -7.4525, df = 300, p-value = 9.858e-13
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.4864024 -0.2955546
## sample estimates:
## cor
## -0.3952352
The Pearson correlation coefficient between age and maximum heart
rate is r = -0.395 (95% confidence interval -0.486 to -0.296), a
moderate negative association that matches the pattern visible in the
scatter plot. The cor.test() function reports the coefficient along with
a t statistic, p-value, and confidence interval. The p-value (9.858e-13)
is far below 0.05, so a correlation this strong would be very unlikely
if the true correlation were zero. This finding is clinically expected:
maximum achievable heart rate naturally declines with age and is
commonly estimated by the formula 220 minus age.
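To make the link between the coefficient and the test output explicit, the sketch below recomputes the t statistic and two-sided p-value from the reported r and sample size, using the standard relation t = r * sqrt(df) / sqrt(1 - r^2) with df = n - 2. It also evaluates the 220-minus-age rule for a few illustrative ages, which are made up.

```r
# Values taken from the cor.test() output above
r <- -0.3952352
n <- 302
df <- n - 2

# t statistic and two-sided p-value for H0: true correlation = 0
t_stat <- r * sqrt(df) / sqrt(1 - r^2)
p_value <- 2 * pt(-abs(t_stat), df)

print(t_stat)
print(p_value)

# Rule-of-thumb maximum heart rate (220 minus age) for some example ages
example_age <- c(40, 55, 70)
predicted_max_hr <- 220 - example_age
print(predicted_max_hr)
```

The recomputed t statistic agrees with the reported -7.4525 up to rounding of r.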