For this assignment, we used the heart disease dataset to explore how different patient characteristics may be related to heart disease. The dataset includes several clinical and demographic variables, such as age, sex, resting blood pressure, cholesterol, maximum heart rate, exercise-induced angina, and oldpeak. These variables make it possible to look for patterns and to compare patients with and without heart disease. Using R, we cleaned, organized, and analyzed the dataset with summary statistics, graphs, and Pearson correlation. The purpose of the analysis was to identify which variables seem to provide the most meaningful insight into heart disease in this dataset.
Before presenting the analysis, a few key terms from the dataset are defined below for clarity. Some of the variable names are abbreviated clinical terms, so these definitions help make the results easier to understand.
- target - heart disease status (1 = disease, 0 = no disease)
- sex - patient sex (1 = male, 0 = female)
- fbs - fasting blood sugar flag (1 = true, 0 = false)
- exang - exercise-induced angina (1 = yes, 0 = no)

Not in the dataset but important to understand:
- ST segment - a part of the ECG wave that helps show how the heart is responding electrically after a heartbeat. Abnormal changes in this part of the tracing can sometimes indicate heart-related problems.
- ST depression - when the ST segment on an ECG drops below its usual level. This may indicate an abnormal heart response during exercise or stress.
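# Used by Mae, Ola, and Mencha
# tidyverse supplies %>%, filter(), pivot_longer(), and ggplot() used below (assumed setup)
library(tidyverse)
heart <- read.csv("heart.csv")
str(heart)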
## 'data.frame': 1025 obs. of 14 variables:
## $ age : int 52 53 70 61 62 58 58 55 46 54 ...
## $ sex : int 1 1 1 1 0 0 1 1 1 1 ...
## $ cp : int 0 0 0 0 0 0 0 0 0 0 ...
## $ trestbps: int 125 140 145 148 138 100 114 160 120 122 ...
## $ chol : int 212 203 174 203 294 248 318 289 249 286 ...
## $ fbs : int 0 1 0 0 1 0 0 0 0 0 ...
## $ restecg : int 1 0 1 1 1 0 2 0 0 0 ...
## $ thalach : int 168 155 125 161 106 122 140 145 144 116 ...
## $ exang : int 0 1 1 0 0 0 0 1 0 1 ...
## $ oldpeak : num 1 3.1 2.6 0 1.9 1 4.4 0.8 0.8 3.2 ...
## $ slope : int 2 0 0 2 1 1 0 1 2 1 ...
## $ ca : int 2 0 0 1 3 0 3 1 0 2 ...
## $ thal : int 3 3 3 3 2 2 1 3 3 2 ...
## $ target : int 0 0 0 0 0 1 0 0 0 0 ...
# Used by Mae, Ola, and Mencha
names(heart)
## [1] "age" "sex" "cp" "trestbps" "chol" "fbs"
## [7] "restecg" "thalach" "exang" "oldpeak" "slope" "ca"
## [13] "thal" "target"
# Used by Mae, Ola, and Mencha
head(heart, 15)
## age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal
## 1 52 1 0 125 212 0 1 168 0 1.0 2 2 3
## 2 53 1 0 140 203 1 0 155 1 3.1 0 0 3
## 3 70 1 0 145 174 0 1 125 1 2.6 0 0 3
## 4 61 1 0 148 203 0 1 161 0 0.0 2 1 3
## 5 62 0 0 138 294 1 1 106 0 1.9 1 3 2
## 6 58 0 0 100 248 0 0 122 0 1.0 1 0 2
## 7 58 1 0 114 318 0 2 140 0 4.4 0 3 1
## 8 55 1 0 160 289 0 0 145 1 0.8 1 1 3
## 9 46 1 0 120 249 0 0 144 0 0.8 2 0 3
## 10 54 1 0 122 286 0 0 116 1 3.2 1 2 2
## 11 71 0 0 112 149 0 1 125 0 1.6 1 0 2
## 12 43 0 0 132 341 1 0 136 1 3.0 1 0 3
## 13 34 0 1 118 210 0 1 192 0 0.7 2 0 2
## 14 51 1 0 140 298 0 1 122 1 4.2 1 3 3
## 15 52 1 0 128 204 1 1 156 1 1.0 1 0 0
## target
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 1
## 7 0
## 8 0
## 9 0
## 10 0
## 11 1
## 12 0
## 13 1
## 14 0
## 15 0
# Based on the combined group direction
# Simple function to classify resting blood pressure
bp_category <- function(bp) {
  if (bp < 120) {
    return("Normal")
  } else if (bp < 140) {
    return("Elevated")
  } else {
    return("High")
  }
}
bp_category(130)
## [1] "Elevated"
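The function above relies on `if`, which accepts only a single value, so it cannot be applied directly to the whole `trestbps` column. A minimal sketch of a vectorized alternative using base R's `cut()` is shown below. The name `bp_category_vec` is hypothetical, and the cut points simply mirror the thresholds used above.

```r
# Hypothetical vectorized version of bp_category(); same thresholds as above.
bp_category_vec <- function(bp) {
  # right = FALSE gives the intervals [-Inf, 120), [120, 140), [140, Inf)
  as.character(cut(
    bp,
    breaks = c(-Inf, 120, 140, Inf),
    labels = c("Normal", "Elevated", "High"),
    right  = FALSE
  ))
}

# Resting blood pressure values taken from rows shown in head(heart, 15)
bp_category_vec(c(125, 140, 100))
## [1] "Elevated" "High"     "Normal"
```

With a function like this, a category column could be created in one step, for example `heart$bp_cat <- bp_category_vec(heart$trestbps)`, where `bp_cat` is also a made-up name.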
# Ola's code idea:
# filter high-risk male patients using sex, cholesterol, and age
# (runs on the raw data, so the count includes rows later removed as duplicates)
high_risk_males <- heart %>%
  filter(sex == 1, chol > 240, age > 55)
cat("Number of high-risk male patients:", nrow(high_risk_males), "\n")
## Number of high-risk male patients: 164
head(high_risk_males, 10)
## age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal
## 1 58 1 0 114 318 0 2 140 0 4.4 0 3 1
## 2 56 1 2 130 256 1 0 142 1 0.6 1 1 1
## 3 70 1 2 160 269 0 1 112 1 2.9 1 1 3
## 4 59 1 0 138 271 0 0 182 0 0.0 2 0 2
## 5 64 1 0 128 263 0 1 105 1 0.2 1 1 3
## 6 67 1 0 100 299 0 0 125 1 0.9 1 2 2
## 7 59 1 3 170 288 0 0 159 0 0.2 1 0 3
## 8 59 1 0 170 326 0 0 140 1 3.4 0 0 3
## 9 56 1 0 125 249 1 0 144 1 1.2 1 1 2
## 10 65 1 0 110 248 0 0 158 0 0.6 2 2 1
## target
## 1 0
## 2 0
## 3 0
## 4 1
## 5 1
## 6 0
## 7 0
## 8 0
## 9 0
## 10 0
# Mae's code idea:
# target as dependent variable
dependentvariable <- data.frame(Target = heart$target)
head(dependentvariable, n = 10)
## Target
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 1
## 7 0
## 8 0
## 9 0
## 10 0
# Ola's code idea:
# reshape selected variables into long format
vars <- heart %>%
  select(age, thalach, oldpeak, exang, target)
heart_long <- vars %>%
  pivot_longer(
    cols = c(thalach, oldpeak, exang),
    names_to = "variable",
    values_to = "value"
  )
head(heart_long, 15)
## # A tibble: 15 × 4
## age target variable value
## <int> <int> <chr> <dbl>
## 1 52 0 thalach 168
## 2 52 0 oldpeak 1
## 3 52 0 exang 0
## 4 53 0 thalach 155
## 5 53 0 oldpeak 3.1
## 6 53 0 exang 1
## 7 70 0 thalach 125
## 8 70 0 oldpeak 2.6
## 9 70 0 exang 1
## 10 61 0 thalach 161
## 11 61 0 oldpeak 0
## 12 61 0 exang 0
## 13 62 0 thalach 106
## 14 62 0 oldpeak 1.9
## 15 62 0 exang 0
# Ola and Mencha both checked missing values
colSums(is.na(heart))
## age sex cp trestbps chol fbs restecg thalach
## 0 0 0 0 0 0 0 0
## exang oldpeak slope ca thal target
## 0 0 0 0 0 0
heart_clean <- na.omit(heart)
cat("Rows before:", nrow(heart), "| Rows after:", nrow(heart_clean))
## Rows before: 1025 | Rows after: 1025
# Ola and Mencha both removed duplicates
cat("Number of duplicated rows:", sum(duplicated(heart_clean)), "\n")
## Number of duplicated rows: 723
heart_clean <- heart_clean[!duplicated(heart_clean), ]
cat("Rows after removing duplicates:", nrow(heart_clean))
## Rows after removing duplicates: 302
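Removing 723 duplicated rows changes any count that was computed on the raw data, including the high-risk male count reported earlier. The sketch below uses a small made-up data frame (`toy` and `is_high_risk` are hypothetical names, and the rows are not taken from `heart.csv`) to show how repeated rows can inflate a filtered count.

```r
# Made-up rows for illustration only; not taken from heart.csv.
toy <- data.frame(
  age  = c(60, 60, 60, 58, 45),
  sex  = c(1, 1, 1, 1, 0),
  chol = c(250, 250, 250, 300, 260)
)

# Same rule as the high-risk filter: male, cholesterol > 240, age > 55
is_high_risk <- function(d) d$sex == 1 & d$chol > 240 & d$age > 55

raw_count   <- sum(is_high_risk(toy))          # counts the repeated row three times
toy_unique  <- toy[!duplicated(toy), ]         # same de-duplication step as above
clean_count <- sum(is_high_risk(toy_unique))   # counts each distinct row once

cat("Raw:", raw_count, "| After removing duplicates:", clean_count, "\n")
## Raw: 4 | After removing duplicates: 2
```

Applying the original filter to `heart_clean` instead of `heart` would give the count for distinct patient records; that number is not reported here because it was not part of the original run.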
# Mencha's code idea:
# sort by cholesterol and age
heart_sorted <- heart_clean %>%
  arrange(desc(chol), desc(age))
head(heart_sorted, 10)
## age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal
## 1 67 0 2 115 564 0 0 160 0 1.6 1 0 3
## 2 65 0 2 140 417 1 0 157 0 0.8 2 1 2
## 3 56 0 0 134 409 0 0 150 1 1.9 1 2 3
## 4 63 0 0 150 407 0 0 154 0 4.0 1 3 3
## 5 62 0 0 140 394 0 0 157 0 1.2 1 0 2
## 6 65 0 2 160 360 0 0 151 0 0.8 2 0 2
## 7 57 0 0 120 354 0 1 163 1 0.6 2 0 2
## 8 55 1 0 132 353 0 1 132 1 1.2 1 1 3
## 9 55 0 1 132 342 0 1 166 0 1.2 2 0 2
## 10 43 0 0 132 341 1 0 136 1 3.0 1 0 3
## target
## 1 1
## 2 1
## 3 0
## 4 0
## 5 1
## 6 1
## 7 1
## 8 0
## 9 1
## 10 0
# Mencha and Ola both renamed columns for readability
heart_renamed <- heart_clean %>%
  rename(
    Age = age,
    Sex = sex,
    ChestPainType = cp,
    RestingBP = trestbps,
    Cholesterol = chol,
    FastingBS = fbs,
    RestECG = restecg,
    MaxHeartRate = thalach,
    ExerciseAngina = exang,
    STDepression = oldpeak,
    HeartDisease = target
  )
names(heart_renamed)
## [1] "Age" "Sex" "ChestPainType" "RestingBP"
## [5] "Cholesterol" "FastingBS" "RestECG" "MaxHeartRate"
## [9] "ExerciseAngina" "STDepression" "slope" "ca"
## [13] "thal" "HeartDisease"
# Ola's code idea:
# create Age_Group
heart_clean$Age_Group <- case_when(
  heart_clean$age < 40 ~ "Young",
  heart_clean$age >= 40 & heart_clean$age < 60 ~ "Middle-aged",
  TRUE ~ "Senior"
)
# Mencha's code idea:
# create cholesterol category and mathematical variables
heart_clean$chol_category <- case_when(
  heart_clean$chol < 200 ~ "Desirable",
  heart_clean$chol >= 200 & heart_clean$chol < 240 ~ "Borderline High",
  TRUE ~ "High"
)
heart_clean$chol_double <- heart_clean$chol * 2
heart_clean$bp_hr_ratio <- round(heart_clean$trestbps / heart_clean$thalach, 3)
# Ola's code idea:
# create a risk score
heart_clean$risk_score <- 0.3 * heart_clean$age +
  0.4 * heart_clean$chol +
  0.3 * heart_clean$trestbps
head(heart_clean[, c("age", "chol", "chol_double", "trestbps", "thalach", "bp_hr_ratio", "risk_score")], 10)
## age chol chol_double trestbps thalach bp_hr_ratio risk_score
## 1 52 212 424 125 168 0.744 137.9
## 2 53 203 406 140 155 0.903 139.1
## 3 70 174 348 145 125 1.160 134.1
## 4 61 203 406 148 161 0.919 143.9
## 5 62 294 588 138 106 1.302 177.6
## 6 58 248 496 100 122 0.820 146.6
## 7 58 318 636 114 140 0.814 178.8
## 8 55 289 578 160 145 1.103 180.1
## 9 46 249 498 120 144 0.833 149.4
## 10 54 286 572 122 116 1.052 167.2
# Ola's code idea
set.seed(1234)
train_index <- sample(seq_len(nrow(heart_clean)), size = floor(0.70 * nrow(heart_clean)))
TrainingSet <- heart_clean[train_index, ]
TestingSet <- heart_clean[-train_index, ]
dim(TrainingSet)
## [1] 211 19
dim(TestingSet)
## [1] 91 19
# Mencha and Mae both included summary/statistical description
summary(heart_clean)
## age sex cp trestbps
## Min. :29.00 Min. :0.0000 Min. :0.0000 Min. : 94.0
## 1st Qu.:48.00 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:120.0
## Median :55.50 Median :1.0000 Median :1.0000 Median :130.0
## Mean :54.42 Mean :0.6821 Mean :0.9636 Mean :131.6
## 3rd Qu.:61.00 3rd Qu.:1.0000 3rd Qu.:2.0000 3rd Qu.:140.0
## Max. :77.00 Max. :1.0000 Max. :3.0000 Max. :200.0
## chol fbs restecg thalach
## Min. :126.0 Min. :0.000 Min. :0.0000 Min. : 71.0
## 1st Qu.:211.0 1st Qu.:0.000 1st Qu.:0.0000 1st Qu.:133.2
## Median :240.5 Median :0.000 Median :1.0000 Median :152.5
## Mean :246.5 Mean :0.149 Mean :0.5265 Mean :149.6
## 3rd Qu.:274.8 3rd Qu.:0.000 3rd Qu.:1.0000 3rd Qu.:166.0
## Max. :564.0 Max. :1.000 Max. :2.0000 Max. :202.0
## exang oldpeak slope ca
## Min. :0.0000 Min. :0.000 Min. :0.000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.000 1st Qu.:1.000 1st Qu.:0.0000
## Median :0.0000 Median :0.800 Median :1.000 Median :0.0000
## Mean :0.3278 Mean :1.043 Mean :1.397 Mean :0.7185
## 3rd Qu.:1.0000 3rd Qu.:1.600 3rd Qu.:2.000 3rd Qu.:1.0000
## Max. :1.0000 Max. :6.200 Max. :2.000 Max. :4.0000
## thal target Age_Group chol_category
## Min. :0.000 Min. :0.000 Length:302 Length:302
## 1st Qu.:2.000 1st Qu.:0.000 Class :character Class :character
## Median :2.000 Median :1.000 Mode :character Mode :character
## Mean :2.315 Mean :0.543
## 3rd Qu.:3.000 3rd Qu.:1.000
## Max. :3.000 Max. :1.000
## chol_double bp_hr_ratio risk_score
## Min. : 252.0 Min. :0.5250 Min. :102.0
## 1st Qu.: 422.0 1st Qu.:0.7580 1st Qu.:139.0
## Median : 481.0 Median :0.8655 Median :152.1
## Mean : 493.0 Mean :0.9050 Mean :154.4
## 3rd Qu.: 549.5 3rd Qu.:0.9928 3rd Qu.:167.4
## Max. :1128.0 Max. :1.8220 Max. :280.2
# Mencha and Mae both used a custom mode function
get_mode <- function(x) {
  uniq_vals <- unique(x)
  uniq_vals[which.max(tabulate(match(x, uniq_vals)))]
}
cat("=== Cholesterol ===\n")
## === Cholesterol ===
cat("Mean: ", mean(heart_clean$chol), "\n")
## Mean: 246.5
cat("Median: ", median(heart_clean$chol), "\n")
## Median: 240.5
cat("Mode: ", get_mode(heart_clean$chol), "\n")
## Mode: 204
cat("Range: ", range(heart_clean$chol), "\n\n")
## Range: 126 564
cat("=== Maximum Heart Rate ===\n")
## === Maximum Heart Rate ===
cat("Mean: ", mean(heart_clean$thalach), "\n")
## Mean: 149.5695
cat("Median: ", median(heart_clean$thalach), "\n")
## Median: 152.5
cat("Mode: ", get_mode(heart_clean$thalach), "\n")
## Mode: 162
cat("Range: ", range(heart_clean$thalach), "\n")
## Range: 71 202
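One detail about `get_mode()` is worth keeping in mind: `which.max()` returns only the first maximum, so when several values tie for the highest count, the function reports whichever of them appears first in the data. A minimal sketch of a helper that returns every tied value is shown below. The name `get_all_modes` is hypothetical, and the sketch assumes numeric input.

```r
# Hypothetical helper: return all values tied for the highest count.
# Assumes x is numeric; results come back in ascending order.
get_all_modes <- function(x) {
  counts <- table(x)
  tied   <- names(counts)[as.vector(counts) == max(counts)]
  as.numeric(tied)
}

# Made-up values where 204 and 162 each appear twice
get_all_modes(c(204, 204, 162, 162, 130))
## [1] 162 204
```

If this returns more than one value for a column, the single mode printed above should be read as one of several equally common values.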
# Combined idea
# instead of the weaker age vs cholesterol version
ggplot(heart_clean, aes(x = age, y = thalach, color = as.factor(target))) +
  geom_point(alpha = 0.6, size = 2) +
  labs(
    title = "Scatter Plot: Age vs Maximum Heart Rate",
    x = "Age",
    y = "Maximum Heart Rate",
    color = "Heart Disease"
  ) +
  scale_color_discrete(labels = c("No Disease", "Disease")) +
  theme_minimal()
Analysis: Age vs. Maximum Heart Rate
The scatter plot shows the relationship between age and maximum heart rate, with the points colored by heart disease status. Overall, the points follow a downward trend: as age increases, maximum heart rate tends to decrease, so the two variables are inversely related. The coloring lets the reader compare where patients with and without heart disease fall along this trend. Based on this pattern, maximum heart rate appears to be an important variable to consider when looking at heart disease in this dataset.
# Combined idea, with input from Mary
bar_data <- heart_clean %>%
  group_by(target) %>%
  summarise(avg_oldpeak = mean(oldpeak))
ggplot(bar_data, aes(x = factor(target), y = avg_oldpeak, fill = factor(target))) +
  geom_col() +
  labs(
    title = "Average Oldpeak by Heart Disease Status",
    x = "Heart Disease Status",
    y = "Average Oldpeak",
    fill = "Heart Disease Status"
  ) +
  scale_x_discrete(labels = c("No Disease", "Disease")) +
  scale_fill_discrete(labels = c("No Disease", "Disease")) +
  theme_minimal()
Analysis: Average Oldpeak by Heart Disease Status
The bar plot compares the average oldpeak (ST depression) between patients with and without heart disease. In this dataset, the no-disease group has the higher average oldpeak, which is the opposite of what we expected, since ST depression is generally treated as a warning sign. A difference in averages also says nothing about how much the two groups overlap, so oldpeak by itself may not be enough to explain heart disease status here. Heart disease in this dataset is probably better understood by looking at several variables together rather than relying on a single measure.
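A bar of group means hides how many patients are in each group and how spread out their values are. As a rough sketch, the block below computes the count, mean, and standard deviation of oldpeak per group with base R's `tapply()`. It uses only the 15 rows printed by `head(heart, 15)` earlier, so the numbers illustrate the method and do not describe the full dataset; `group_summary` is a hypothetical name.

```r
# oldpeak and target copied from the 15 rows shown in head(heart, 15)
oldpeak <- c(1.0, 3.1, 2.6, 0.0, 1.9, 1.0, 4.4, 0.8, 0.8, 3.2,
             1.6, 3.0, 0.7, 4.2, 1.0)
target  <- c(0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
             1, 0, 1, 0, 0)

# tapply() splits oldpeak by target and applies the function to each group;
# groups come back in the order 0, 1
group_summary <- data.frame(
  target = c(0, 1),
  n      = as.vector(tapply(oldpeak, target, length)),
  mean   = as.vector(tapply(oldpeak, target, mean)),
  sd     = as.vector(tapply(oldpeak, target, sd))
)
group_summary
```

Running the same summary on `heart_clean$oldpeak` and `heart_clean$target` would show whether the gap between the two bars is large relative to the spread within each group.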
# combined idea
# age vs maximum heart rate
cor_value <- cor(heart_clean$age, heart_clean$thalach, method = "pearson")
cor_value
## [1] -0.3952352
Analysis: Correlation Between Age and Maximum Heart Rate
The Pearson correlation coefficient between age and maximum heart rate is about -0.40 (r = -0.3952352), which indicates a moderate negative linear relationship: older patients in this dataset generally had lower maximum heart rates than younger patients. Squaring the coefficient gives roughly 0.16, so age accounts for about 16% of the variation in maximum heart rate. The relationship is therefore not very strong, but it is strong enough to show a noticeable pattern, and it supports what we see in the scatter plot.
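To make the coefficient less of a black box, the sketch below computes Pearson's r directly from its definition (the sum of the products of deviations divided by the square root of the product of the summed squared deviations) and checks it against `cor()`. It uses only the first 10 rows printed by `head(heart, 15)`, so it will not reproduce the full-data value of -0.3952352; the variable names are hypothetical.

```r
# age and thalach copied from the first 10 rows shown in head(heart, 15)
age     <- c(52, 53, 70, 61, 62, 58, 58, 55, 46, 54)
thalach <- c(168, 155, 125, 161, 106, 122, 140, 145, 144, 116)

# Deviations of each value from its own mean
dev_age <- age - mean(age)
dev_hr  <- thalach - mean(thalach)

# Pearson's r from the definition
r_manual  <- sum(dev_age * dev_hr) / sqrt(sum(dev_age^2) * sum(dev_hr^2))
r_builtin <- cor(age, thalach, method = "pearson")

# Share of variance in one variable linearly accounted for by the other
r_squared <- r_manual^2

all.equal(r_manual, r_builtin)
## [1] TRUE
```

For a significance test and a confidence interval on the full data, base R's `cor.test(heart_clean$age, heart_clean$thalach)` is one option.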
Based on our analysis, heart disease in this dataset does not seem to be explained by a single factor such as cholesterol. A stronger story came from looking at multiple variables together, especially age, resting blood pressure, maximum heart rate, exercise-induced angina, and oldpeak. The results suggest that exercise-related and cardiovascular response variables may be especially helpful in showing differences between patients with and without heart disease. For example, the moderate negative correlation between age and maximum heart rate shows that older patients generally reached lower maximum heart rates in this dataset. Overall, the findings suggest that heart disease is associated with a combination of factors rather than one measure, and this assignment demonstrated how R can be used to clean, organize, and analyze healthcare data in a meaningful way.