Cleveland Clinic Heart Disease Data Analysis

Introduction

Data source: heart.csv (derived from the UCI Heart Disease data — Cleveland Clinic angiography patients). Original study: patients undergoing angiography at the Cleveland Clinic (1980s). For further data background, see: https://archive.ics.uci.edu/dataset/45/heart+disease

Observation unit: one patient (one row per patient).

Goal: Make use of descriptive statistics, hypothesis tests, and extensions to compare heart disease prevalence by sex and explore predictors.

Generalizations: The results of these tests can be generalized to the entire patient population at the Cleveland Clinic identifying as male or female.

Limitations: It is difficult to generalize the results of our study to a larger population because our sample is very niche. The sample that we worked with was composed of patients of the Cleveland Clinic specifically, so we are cautious in generalizing to the the entire Ohio or US population as well.

Variables

HeartDisease: binary response (Presence/Absence)

Sex: Patients sex (e.g., Male, Female)

Extensions: Age, Cholesterol, RestBP, MaxHR

Data Summary

heart_data <- read.csv("heart.csv")

heart_data <- heart_data |> 
  mutate(HeartDisease = factor(HeartDisease, labels = c("Absence", "Presence")), 
    Sex = factor(Sex, labels = c("Female", "Male")))    
counts <- heart_data |> count(Sex, HeartDisease)
counts

##      Sex HeartDisease   n
## 1 Female      Absence  67
## 2 Female     Presence  20
## 3   Male      Absence  83
## 4   Male     Presence 100

ggplot(counts, aes(x = Sex, y = n, fill = HeartDisease)) +
geom_col(position = "dodge", width = 0.6, show.legend = TRUE) +
scale_fill_manual(values = c("Absence" = "#C9E4DE",
                               "Presence" = "#F2C6DE"))+
labs(x = "Sex", y = "Count", title = "Heart Disease Count by Sex") +
theme_minimal()

p_hat_female <- sum(heart_data$Sex == "Female" & heart_data$HeartDisease == "Presence") / sum(heart_data$Sex == "Female")
p_hat_male <- sum(heart_data$Sex == "Male" & heart_data$HeartDisease == "Presence") / sum(heart_data$Sex == "Male")
tibble(Group = c("Female","Male"), Proportion = c(p_hat_female, p_hat_male))

## # A tibble: 2 × 2
##   Group  Proportion
##   <chr>       <dbl>
## 1 Female      0.230
## 2 Male        0.546

In this sample, the estimated probability of heart disease is lower for females than for males per the proportions in the table above.

Hypothesis Test 1: Relative Risk

We defined success as Presence and failure as Absence.

Motivation: Compare the risk (probability) of heart disease between females and males

table_sex_hd <- table(heart_data$Sex, heart_data$HeartDisease)
table_sex_hd

##         
##          Absence Presence
##   Female      67       20
##   Male        83      100

female_HD <- table_sex_hd["Female","Presence"]
female_total <- sum(table_sex_hd["Female",])
male_HD <- table_sex_hd["Male","Presence"]
male_total <- sum(table_sex_hd["Male",])
p_hat_f <- female_HD / female_total
p_hat_m <- male_HD / male_total
RR <- p_hat_f / p_hat_m
list(p_hat_female = p_hat_f, p_hat_male = p_hat_m, Relative_Risk = RR)

## $p_hat_female
## [1] 0.2298851
## 
## $p_hat_male
## [1] 0.5464481
## 
## $Relative_Risk
## [1] 0.4206897

sd<-sqrt((1/67)-(1/(67+20))+(1/83)-(1/(83+100)))
upper_bound<- log(RR)+1.96*sd
lower_bound<- log(RR)- 1.96*sd
exp((upper_bound))

## [1] 0.5118547

exp(lower_bound)

## [1] 0.3457618

ggplot(heart_data, aes(x = Sex, fill = HeartDisease)) +
geom_bar(position = "fill") +
scale_y_continuous(labels = scales::percent_format()) +
scale_fill_manual(values = c("Absence" = "#C9E4DE",
"Presence" = "#F2C6DE")) +
labs(
x = "Sex",
y = "Proportion",
title = "Proportion of Heart Disease by Sex"
) +
theme_minimal()

The relative risk (female/male) is less than 1, which means that females have lower observed risk in this sample. The Relative Risk (RR) was about 0.42, which indicates that females are 0.42 times less at risk than males in this data set to have heart disease. Moreover, we obtained a 95% confidence interval of 0.35 to 0.51 which means that we are 95% sure that the true RR of getting heart disease falls within this interval. Since the confidence interval does not capture 1 that means that the probability of getting heart disease is likely not the same between males and females. Given the reasoning above, our data suggests that males exhibit a greater risk of getting heart disease.

In the bar plot of heart disease proportions of heart disease by sex, the height of the red region (“Presence”) represents the risk in each group (female vs. male). Visually our relative risk comparison is directly supported.

Hypothesis Test 2: Binomial Test for Females

We defined success as Presence and failure as Absence.

Motivation: Assess whether the probability of having heart disease among females equals 0.5. Null and alternative Hypothesis Null hypothesis: The probability of having heart disease for women is \(p = 0.5\). Alternative hypothesis: The probability of having heart disease for women is \(p \neq 0.5\).

x <- female_HD

n <- female_total

binom_res <- binom.test(x, n, p = 0.5, alternative = "two.sided", conf.level = 0.95)

binom_res

## 
##  Exact binomial test
## 
## data:  x and n
## number of successes = 20, number of trials = 87, p-value = 4.305e-07
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
##  0.1464306 0.3324679
## sample estimates:
## probability of success 
##              0.2298851

female_data <- heart_data %>% filter(Sex == "Female") %>% count(HeartDisease)
ggplot(female_data, aes(x = HeartDisease, y = n, fill = HeartDisease)) +
geom_col(width = 0.6) +
scale_fill_manual(values = c("Absence" = "#C9E4DE",
"Presence" = "#F2C6DE")) +
labs(
x = "Heart Disease Status",
y = "Count",
title = "Heart Disease Among Women (Counts)"
) +
theme_minimal() +
guides(fill = "none")

The exact binomial test gives a p-value of 4.305e-07, which is far below the alpha significance of 0.05, which means we reject the null hypothesis that the true probability of heart disease among women is \(p = 0.5\). Furthermore, the estimated sample probability is p_hat = 0.2298851. The 95% confidence interval for the true probability is \(0.146 < p < 0.332\) and since the entire interval lies well below 0.5, we have strong evidence that the true probability of heart disease among women in this sample is much lower than 0.5.

The bar plot showcasing number of females with vs. without heart disease visually demonstrates the rareness of observing heart disease among women in our sample, which matches the binomial test result where p_hat = 0.2298851.

Extension 1: Logistic regression (simple multivariable model) Motivation: Model the probability of heart disease using sex and age, cholesterol as predictors. Logistic regression allows adjustment for covariaties and provides odds ratios.

glm1 <- glm(HeartDisease == "Presence" ~ Sex, data = heart_data, family = binomial)
summary(glm1)

## 
## Call:
## glm(formula = HeartDisease == "Presence" ~ Sex, family = binomial, 
##     data = heart_data)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -1.2090     0.2548  -4.745 2.09e-06 ***
## SexMale       1.3953     0.2949   4.731 2.23e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 370.96  on 269  degrees of freedom
## Residual deviance: 345.92  on 268  degrees of freedom
## AIC: 349.92
## 
## Number of Fisher Scoring iterations: 4

glm2 <- glm(HeartDisease == "Presence" ~ Sex + Age + Cholesterol, data = heart_data, family = binomial)
summary(glm2)

## 
## Call:
## glm(formula = HeartDisease == "Presence" ~ Sex + Age + Cholesterol, 
##     family = binomial, data = heart_data)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -6.515014   1.211952  -5.376 7.63e-08 ***
## SexMale      1.836293   0.337012   5.449 5.07e-08 ***
## Age          0.059183   0.016204   3.652  0.00026 ***
## Cholesterol  0.006989   0.002807   2.490  0.01279 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 370.96  on 269  degrees of freedom
## Residual deviance: 321.25  on 266  degrees of freedom
## AIC: 329.25
## 
## Number of Fisher Scoring iterations: 4

exp_coef <- exp(coef(glm1))
exp_confint <- exp(confint(glm1))

## Waiting for profiling to be done...

tibble(term = names(exp_coef), OR = as.numeric(exp_coef), CI_low = exp_confint[,1], CI_high = exp_confint[,2])

## # A tibble: 2 × 4
##   term           OR CI_low CI_high
##   <chr>       <dbl>  <dbl>   <dbl>
## 1 (Intercept) 0.299  0.177   0.482
## 2 SexMale     4.04   2.30    7.34

age_grid_female <- tibble(
  Age = seq(min(heart_data$Age), max(heart_data$Age), length.out = 100),
  Cholesterol = mean(heart_data$Cholesterol),
  Sex = factor("Female", levels = c("Female", "Male"))
)

age_grid_male <- tibble(
  Age = seq(min(heart_data$Age), max(heart_data$Age), length.out = 100),
  Cholesterol = mean(heart_data$Cholesterol),
  Sex = factor("Male", levels = c("Female", "Male"))
)

age_grid <- bind_rows(age_grid_female, age_grid_male)

age_grid$pred <- predict(glm2, newdata = age_grid, type = "response")

ggplot(age_grid, aes(x = Age, y = pred, color = Sex)) +
  geom_line(linewidth = 1.2) +
  labs(
    x = "Age",
    y = "Predicted Probability of Heart Disease",
    title = "Predicted Probability vs. Age by Sex"
  ) +
  scale_color_manual(values = c("Female" = "#C9E4DE", "Male" = "#F2C6DE")) +
  theme_minimal()

chol_grid_female <- tibble(
  Cholesterol = seq(min(heart_data$Cholesterol), max(heart_data$Cholesterol), length.out = 100),
  Age = mean(heart_data$Age),
  Sex = factor("Female", levels = c("Female", "Male"))
)

chol_grid_male <- tibble(
  Cholesterol = seq(min(heart_data$Cholesterol), max(heart_data$Cholesterol), length.out = 100),
  Age = mean(heart_data$Age),
  Sex = factor("Male", levels = c("Female", "Male"))
)

chol_grid <- bind_rows(chol_grid_female, chol_grid_male)

chol_grid$pred <- predict(glm2, newdata = chol_grid, type = "response")

ggplot(chol_grid, aes(x = Cholesterol, y = pred, color = Sex)) +
  geom_line(size = 1.2) +
  labs(
    x = "Cholesterol",
    y = "Predicted Probability",
    title = "Predicted Probability vs. Cholesterol by Sex"
  ) +
  scale_color_manual(values = c("Female" = "#C9E4DE", "Male" = "#F2C6DE")) +
  theme_minimal()

## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Model 1: Heart Disease ~ Sex

\(logit(p) = -1.209 + 1.395(SexMale)\)

SexMale Estimate = 1.3953 Odds ratio: OR = \(e^{1.3953} \approx 4.03\)

Males have approximately 4 times higher odds of having heart disease compared to females in this data set. This model significantly improves fit compared to the null model granted that the residual deviance is 345.9 for this model compared to 371.0 for the null model.

Model 2: Heart Disease ~ Sex + Age + Cholesterol

\(logit(p)=−6.515+1.836(SexMale)+0.059(Age)+0.007(Cholesterol)\)

All predictors are statistically significant.

Sex Estimate = 1.836 Odds ratio: OR = \(e^{1.836293} \approx 6.27\)

After adjusting for age and cholesterol, males have over 6 times higher odds of heart disease compared to females.

Age Estimate = 0.059183 Odds ratio: OR = \(e^{0.059183} \approx 1.06\)

Each additional year of age is associated with about 6% increase in the odds of heart disease. Visually, in Predicted Probability vs. Age by Sex, we see how males have higher predicted probability at every age.

Cholesterol Estimate = 0.006989 Odds ration: OR = \(e^{0.006989} \approx 1.007\)

For each 1-unit increase in cholesterol, the odds of heart disease increase by \(\approx 0.7%\). Visually, in Predicted Probability vs. Cholesterol by Sex we that predicted heart disease risk increases slightly as cholesterol rises, but males consistently have a higher predicted probability than females across all values.

Model Comparison

heart_data$predicted <- predict(glm2, type = "response")

ggplot(heart_data, aes(x = predicted, y = HeartDisease, color = Sex)) +
  geom_jitter(height = 0.1, width = 0, alpha = 0.7) +
  labs(
    x = "Predicted Probability",
    y = "Observed Outcome",
    title = "Predicted Probability vs Observed Heart Disease"
  ) +
  scale_color_manual(values = c("Female" = "#C9E4DE", "Male" = "#F2C6DE")) +
  theme_minimal()

Model 1 Akaike Information Criterion (AIC): 349.92 Model 2 Akaike Information Criterion (AIC): 329.25

Lower AIC is indicative of better fit, therefore Model 2 is a better fit. Additionally, residual deviance also drops from 345.92 to 321.25, suggesting that age and cholesterol meaningfully improves prediction.Predicted Probability vs Observed Heart Disease visually shows this improvement: Model 2’s predicted probabilities align more closely with the observed outcomes, with tighter clustering and less scatter. This indicates that including age and cholesterol helps the model better separate patients with and without heart disease.

Extension 2: Interactive model of the relationship between sex and reported pain type for patients with and without heart disease Motivation: We wanted to model the difference in reported pain level type between males and females considering if they actually have the heart disease or not. A common stereotype is that women have a higher pain tolerance compared to men and that men often exaggerate their pain level. While we never want to downplay an individual’s pain we think it would be interesting to observe if men report a higher pain level on average compared to women, especially when they do not have the disease.

#install.packages("plotly")
library(ggplot2)
library(plotly)

## 
## Attaching package: 'plotly'

## The following object is masked from 'package:ggplot2':
## 
##     last_plot

## The following object is masked from 'package:stats':
## 
##     filter

## The following object is masked from 'package:graphics':
## 
##     layout

p <- ggplot(heart_data, aes(x = Sex, y = Chest_pain_type, color=HeartDisease)) +
  geom_point(position = position_jitter(width = 0.15), 
             alpha = 0.1, size = 3) +
  labs(x = "Sex",
       y = "Pain Type",
       title = "Interactive Scatterplot of Sex vs Pain Type in the Presence and Absence of Heart Disease")

ggplotly(p)

The plot provides interesting information on the difference in reported pain level types between males and females. An interesting observation is that on average there were more males with heart disease that reported lower pain levels compared to women also with heart disease. A limitation to this is that there were more men that had heart disease so this observation could be due to the fact that there is a greater pool of men in the heart disease condition. Additionally, the data shows that men and women without the disease reported roughly the same pain level types. The difference in reported pain types is indistinguishable given the data shown in the visualization. This is contrary to what we might have expected if men exaggerated their pain more compared to women. If this were true we would have expected there to be a significant decrease in reported pain levels for “healthy” women compared to “healthy” men. It is important to note that there are many factors that could affect an individual’s reported pain level like whether they are on medication or at what stage they are in the disease. Because of this we are not making any causal or definitive conclusion about whether men or women have higher pain tolerances.

Cleveland Clinic Heart Disease Data Analysis

12-12-2025